Understanding the Parquet file format

Author: Colin Gillespie

Published: September 27, 2021

tags: r, big-data, parquet, feather, storage

This is part of a series of related posts on Apache Arrow. The posts in the series are:

  • Understanding the Parquet file format
  • Reading and Writing Data with {arrow}
  • Parquet vs the RDS Format

Apache Parquet is a popular column storage file format used by Hadoop systems, such as Pig, Spark, and Hive. The file format is language independent and binary. Parquet is used to efficiently store large data sets and uses the extension .parquet. This blog post aims to explain how parquet works and the tricks it uses to store data efficiently.

Key features of parquet are:

  • it’s cross platform
  • it’s a recognised file format used by many systems
  • it stores data in a column layout
  • it stores metadata

The latter two points allow for efficient storage and querying of data.

Column Storage

Suppose we have a simple data frame:

tibble::tibble(id = 1:3, 
              name = c("n1", "n2", "n3"), 
              age = c(20, 35, 62))
#> # A tibble: 3 × 3
#>      id name    age
#>   <int> <chr> <dbl>
#> 1     1 n1       20
#> 2     2 n2       35
#> 3     3 n3       62

If we store this data set as a CSV file, what we see in the R terminal is mirrored in the file storage format: the data are stored row by row. This row storage is efficient for queries such as

SELECT * FROM table_name WHERE id = 2

We simply go to the 2nd row and retrieve that data. It’s also very easy to append rows to the data set - we just add a row to the bottom of the file. However, if we want to sum the data in the age column, then this is potentially inefficient. We would need to determine which value on each row is related to age, and extract that value.

Parquet uses column storage. In column layouts, column data are stored sequentially.

1 2 3
n1 n2 n3
20 35 62

With this layout, queries such as

SELECT * FROM table_name WHERE id = 2

are now inconvenient. But if we want to sum up all the ages, we simply go to the third line (the age values) and add up the numbers.
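
In R terms, a column layout is a bit like storing each column as its own contiguous vector; an aggregation then only touches the vector it needs. A minimal sketch:

# Each column stored as its own contiguous vector
id = c(1L, 2L, 3L)
name = c("n1", "n2", "n3")
age = c(20, 35, 62)
# Summing the ages only reads the age vector; id and name are never touched
sum(age)
#> [1] 117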

Reading and writing parquet files

In R, we read and write parquet files using the {arrow} package.

# install.packages("arrow")
library("arrow")
packageVersion("arrow")
#> [1] '14.0.0.2'

To create a parquet file, we use write_parquet()

# Use the penguins data set
data(penguins, package = "palmerpenguins")
# Create a temporary file for the output
parquet = tempfile(fileext = ".parquet")
write_parquet(penguins, sink = parquet)

To read the file back, we use read_parquet() (a short example follows the list below). One of the benefits of using parquet is small file sizes. This is important when dealing with large data sets, especially once you start incorporating the cost of cloud storage. Reduced file size is achieved via two methods:

  • File compression. This is specified via the compression argument in write_parquet(). The default is snappy.
  • Clever encoding of values (covered in the next section).
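
As a short sketch, we can read the penguins data back in (optionally selecting just a couple of columns, which column storage makes cheap) or write with a different compression codec:

# Read the whole file back as a data frame
penguins_copy = read_parquet(parquet)
# Read only a subset of the columns
read_parquet(parquet, col_select = c("species", "bill_length_mm"))
# Write with gzip compression instead of the default snappy
write_parquet(penguins, sink = parquet, compression = "gzip")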


Parquet Encoding

Since parquet uses column storage, values of the same type are stored together. This opens up a whole world of optimisation tricks that aren’t available when we save data as rows, e.g. CSV files.

Run length encoding

Suppose a column contains just a single value repeated on every row. Instead of storing the same value over and over (as a CSV file would), we can simply record “value X repeated N times”. This means that even when N gets very large, the storage cost remains small. If we have more than one value in a column, we can combine this with a simple look-up table (see dictionary encoding below). In parquet, this is known as run length encoding. If we have the following column

c(4, 4, 4, 4, 4, 1, 2, 2, 2, 2)
#>  [1] 4 4 4 4 4 1 2 2 2 2

This would be stored as

  • value 4, repeated 5 times
  • value 1, repeated once
  • value 2, repeated 4 times
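
Base R’s rle() function works on the same principle, which gives a feel for the idea (parquet’s actual run length encoding combines this with bit-packing, so the details differ):

rle(c(4, 4, 4, 4, 4, 1, 2, 2, 2, 2))
#> Run Length Encoding
#>   lengths: int [1:3] 5 1 4
#>   values : num [1:3] 4 1 2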

To see this in action, let’s create a simple example where the character “A” is repeated multiple times in a data frame column:

x = data.frame(x = rep("A", 1e6))

We can then create a couple of temporary files for our experiment

parquet = tempfile(fileext = ".parquet")
csv = tempfile(fileext = ".csv")

and write the data to the files

arrow::write_parquet(x, sink = parquet, compression = "uncompressed")
readr::write_csv(x, file = csv)

Using the {fs} package, we extract the file sizes

# Could also use file.info()
fs::file_info(c(parquet, csv))[, "size"]
#> # A tibble: 2 × 1
#>          size
#>   <fs::bytes>
#> 1         808
#> 2       1.91M

We see that the parquet file is tiny (808 bytes), whereas the CSV file is almost 2 MB - a reduction in file size of over 2,000 fold.

Dictionary encoding

Suppose we had the following character vector

c("Jumping Rivers", "Jumping Rivers", "Jumping Rivers")
#> [1] "Jumping Rivers" "Jumping Rivers" "Jumping Rivers"

If we want to save storage, then we could replace Jumping Rivers with the number 0 and keep a table mapping 0 to Jumping Rivers. This would significantly reduce storage, especially for long vectors with few distinct values. In parquet, this is known as dictionary encoding.
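
R’s factor type is a close analogue of this idea: it stores integer codes plus a look-up table of levels (R’s codes happen to be 1-based, whereas parquet dictionaries are 0-based):

f = factor(c("Jumping Rivers", "Jumping Rivers", "Jumping Rivers"))
unclass(f)
#> [1] 1 1 1
#> attr(,"levels")
#> [1] "Jumping Rivers"

Writing a million copies of this string shows the effect on file size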

x = data.frame(x = rep("Jumping Rivers", 1e6))
arrow::write_parquet(x, sink = parquet)
readr::write_csv(x, file = csv)
fs::file_info(c(parquet, csv))[, "size"]
#> # A tibble: 2 × 1
#>          size
#>   <fs::bytes>
#> 1         909
#> 2       14.3M

Delta encoding

This encoding is typically used with timestamps. Times are often stored as Unix time, the number of seconds that have elapsed since 1st January 1970. This storage format isn’t particularly helpful for humans, so it is usually pretty-printed to make it more palatable. For example,

(time = Sys.time())
#> [1] "2024-02-01 13:02:03 CET"
unclass(time)
#> [1] 1706788923

If we have a large number of time stamps in a column, one method for reducing file size is to simply subtract the minimum time stamp from all values. For example, instead of storing

c(1628426074, 1628426078, 1628426080)
#> [1] 1628426074 1628426078 1628426080

we would store

c(0, 4, 6)
#> [1] 0 4 6

with the corresponding offset 1628426074.
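
A minimal sketch of the idea in R (parquet’s delta encoding actually stores the differences between consecutive values in blocks, but the principle is the same):

timestamps = c(1628426074, 1628426078, 1628426080)
offset = min(timestamps)
# Store the offset once, plus small numbers
timestamps - offset
#> [1] 0 4 6
# The original values are easily reconstructed
offset + c(0, 4, 6)
#> [1] 1628426074 1628426078 1628426080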

Other encodings

There are a few other tricks that parquet uses. The Apache Parquet GitHub page gives a complete overview of the available encodings.

If you have a parquet file, you can use parquet-mr to investigate the encoding used within a file. However, installing the tool isn’t trivial and does take some time.

Feather vs Parquet

The obvious question when discussing parquet is how it compares with the feather format. Broadly, feather is optimised for speed, whereas parquet is optimised for storage. It’s also worth noting that feather is the file format used by Apache Arrow.
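
As a rough, illustrative comparison on the penguins data (a sketch only; the relative sizes depend heavily on your data and compression settings):

parquet_file = tempfile(fileext = ".parquet")
feather_file = tempfile(fileext = ".feather")
arrow::write_parquet(penguins, sink = parquet_file)
arrow::write_feather(penguins, sink = feather_file)
fs::file_info(c(parquet_file, feather_file))[, "size"]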

Parquet vs RDS Formats

The RDS file format is used by readRDS()/saveRDS(); the closely related RData format is used by load()/save(). Both are native to R and can only be read by R. The main benefit of using RDS is that it can store any R object: environments, lists, functions, and so on.

If we are solely interested in rectangular data structures, e.g. data frames, then reasons for using RDS files are

  • the file format has been around for a long time and isn’t likely to change. This means it is backwards compatible
  • it doesn’t depend on any external packages; just base R.

The advantages of using parquet are

  • parquet files are slightly smaller. If you want to compare file sizes, make sure you set compression = "gzip" in write_parquet() for a fair comparison (a quick sketch follows this list)
  • parquet files are cross platform
  • in my experiments, the additional saving was around 5%. For some use cases that saving may be worth it, but, as always, it depends on your particular use case.
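
A quick way to check on your own data (a sketch, here using the penguins data and gzip compression on both sides for a like-for-like comparison):

rds_file = tempfile(fileext = ".rds")
parquet_file = tempfile(fileext = ".parquet")
saveRDS(penguins, rds_file, compress = "gzip")
arrow::write_parquet(penguins, sink = parquet_file, compression = "gzip")
fs::file_info(c(rds_file, parquet_file))[, "size"]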

References

  • The default compression algorithm used by Parquet.
  • A nice talk by Raoul-Gabriel Urma.
  • Parquet-tools for interrogating Parquet files.
  • The official list of file optimisations.
  • Stack Overflow questions on Parquet: Feather & Parquet and Arrow and Parquet.
