Reading and Writing Data with {arrow}

Author: Colin Gillespie

Published: January 18, 2024

tags: r, arrow, parquet

This is part of a series of related posts on Apache Arrow. Other posts in the series are:

  • Understanding the Parquet file format
  • Reading and Writing Data with {arrow} (This post)
  • Parquet vs the RDS Format

What is (Apache) Arrow?

Apache Arrow is a cross-language development platform for in-memory data. Because the data is held in memory (as opposed to stored on disk), it provides additional speed boosts. It's designed for efficient analytic operations, and uses a standardised, language-independent columnar memory format for flat and hierarchical data. The {arrow} R package provides an interface to the Arrow C++ library - an efficient library for analytic operations on modern hardware.

There are many great tutorials on using {arrow} (see the links at the bottom of the post for example). The purpose of this blog post isn’t to simply reproduce a few examples, but to understand some of what’s happening behind the scenes. In this particular post, we’re interested in understanding the reading/writing aspects of {arrow}.

Getting started

The R package is installed from CRAN in the usual way

install.packages("arrow")

Then loaded using

library("arrow")

This blog post uses the NYC Taxi data. It's pretty big - around 40GB in total. To download it locally,

data_nyc = "data/nyc-taxi"
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
    dplyr::filter(year %in% 2012:2021) |> 
    write_dataset(data_nyc, partitioning = c("year", "month"))

Once this has completed, you can check everything has downloaded correctly by running

nrow(open_dataset(data_nyc))
## [1] 1150352666

Loading in data

Unsurprisingly, the first command we come across is open_dataset(). This opens the data and (sort of) reads it in.

library("arrow")
open_dataset(data_nyc)
## FileSystemDataset with 120 Parquet files
## vendor_name: string
## pickup_datetime: timestamp[ms]
## dropoff_datetime: timestamp[ms]
## passenger_count: int64
## trip_distance: double
## ...

Reading is a lazy action. This allows us to manipulate much larger data sets than R could typically deal with. The default print method lists the columns in the data set, with their associated type. These data types come directly from the C++ API so don’t always have a corresponding R type. For example, the year column is an int32 (a 32 bit integer), whereas passenger_count is int64 (a 64 bit integer). In R, these are both integers.
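
To see the laziness in action, here's a minimal sketch using columns from the schema above: the pipeline below only builds a description of the computation, and nothing is read into memory until we call collect() (or compute()).

library("dplyr")
# Build a query - at this point no data has been read
query = open_dataset(data_nyc) |>
  filter(year == 2019, passenger_count > 0) |>
  group_by(month) |>
  summarise(mean_distance = mean(trip_distance, na.rm = TRUE))

# Only now does {arrow} scan the relevant parquet files
collect(query)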

As you might guess, there’s a corresponding function write_dataset(). Looking at the (rather good) documentation, we come across a few concepts that are worth exploring further.


File formats

The main file formats associated with {arrow} are

  • parquet: a format designed to minimise storage - see our recent blog post that delves into some of the details surrounding the format;
  • arrow/feather: in-memory format created to optimise vectorised computations;
  • csv: the world runs on csv files (and Excel).

The common workflow is to store your data as parquet files. The Arrow library then loads the data and processes it in the arrow format.
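
For single files there are matching helpers for each format - write_parquet(), write_feather() and write_csv_arrow() - which give a quick way to compare the formats on a small data set. A minimal sketch, assuming the {palmerpenguins} package (used again below) is installed:

tmp = tempdir()
write_parquet(palmerpenguins::penguins, file.path(tmp, "penguins.parquet"))
write_feather(palmerpenguins::penguins, file.path(tmp, "penguins.arrow"))
write_csv_arrow(palmerpenguins::penguins, file.path(tmp, "penguins.csv"))

# Compare the on-disk sizes (in bytes)
file.size(file.path(tmp, c("penguins.parquet", "penguins.arrow", "penguins.csv")))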

Storing data in the Arrow format

The obvious thought (to me at least) was, why not store the data as arrow? Ignoring for the moment that Arrow doesn’t promise long-term archival storage using the arrow format, we can do a few tests.

Using the NYC-taxi data, we can create a quick subset

# Replace format = "arrow" with format = "parquet"
# to create the corresponding parquet equivalent
open_dataset(file.path(data_nyc, "year=2019")) |>
  write_dataset("data/nyc-taxi-arrow", partitioning = "month",
                format = "arrow")

A very quick, but not particularly thorough test suggests that

  • the arrow format requires around ten times more storage space. So for the entire nyc-taxi data set, parquet takes around 38GB, but arrow would take around 380GB.
  • storing as arrow makes some operations quicker. For the few examples I tried, there was around a 10% increase in speed.

The large storage penalty was enough to convince me of the merits of storing data as parquet, but there may be some niche situations where you might switch.
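
If you want to check the storage claim on your own machine, a rough sketch is to total the file sizes under each directory. Note that dir_size() below is just a hypothetical helper, and this assumes both partitioned copies exist locally.

# Hypothetical helper: total size of all files under a directory, in GB
dir_size = function(path) {
  files = list.files(path, recursive = TRUE, full.names = TRUE)
  sum(file.size(files)) / 1024^3
}

# Arrow copy vs the original parquet partition for 2019
dir_size("data/nyc-taxi-arrow")
dir_size(file.path(data_nyc, "year=2019"))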

Hive partitioning

Both the open_dataset() and write_dataset() functions mention "Hive partitioning" - in fact, we sneakily included a partitioning argument in the code above. The open_dataset() function guesses whether Hive partitioning is being used, whereas for write_dataset() we specify the partition explicitly. But what actually is it?

Hive partitioning is a method used to split a table into multiple files based on partition keys. A partition key is a variable of interest in your data, for example, year or month. The files are then organised in folders. Within each folder, the value of the key is determined by the name of the folder. By partitioning the data in this way, we can make queries on slices of the data faster.

Suppose we wanted to partition the data by year, then the file structure would be

taxi-data
   year=2018
     file1.parquet
     file2.parquet
   year=2019
     file4.parquet
     file5.parquet

Of course, we can partition by more than one variable, such as both year and month

taxi-data
   year=2018
     month=01
       file01.parquet
     month=02
       file02.parquet
       file03.parquet
     ...
   year=2019
     month=01
       ...

See the excellent vignette on datasets in the {arrow} package.
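
As a rough sketch of the pay-off, filtering on the partition keys means {arrow} only has to touch the matching year=/month= folders, rather than scanning all 120 files:

open_dataset(data_nyc) |>
  # Only the files under year=2019/month=1 are scanned
  dplyr::filter(year == 2019, month == 1) |>
  dplyr::summarise(mean_distance = mean(trip_distance, na.rm = TRUE)) |>
  dplyr::collect()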

Example: Partitioning

Parquet files aren’t the only files we can partition. We can also use the same concept with CSV files. For example,

tmp_dir = tempdir()
write_dataset(palmerpenguins::penguins, 
              path = tmp_dir, 
              partitioning = "species", 
              format = "csv")

This looks like

list.files(tmp_dir, recursive = TRUE, pattern = "\\.csv$")
## [1] "species=Adelie/part-0.csv"    "species=Chinstrap/part-0.csv"
## [3] "species=Gentoo/part-0.csv"

You can also partition using the group_by() function from {dplyr}

palmerpenguins::penguins |>
  dplyr::group_by(species) |>
  write_dataset(path = tmp_dir, format = "csv")

In my opinion, while it makes conceptual sense to partition CSV files, in practice it's probably not worthwhile. If you're partitioning CSV files to get speed benefits, you might as well use parquet instead.

Single files vs dataset APIs

When reading in data using Arrow, we can either use the single file functions (these start with read_) or the dataset API (these start with open_).

For example, read_csv_arrow() reads a CSV file directly into memory. If the file is particularly large, we'll run out of memory. One thing to note is the as_data_frame argument. By default this is set to TRUE, meaning that read_csv_arrow() returns a tibble. The upside is that we get a familiar object; the downside is that a tibble takes up more room than Arrow's internal data representation (an Arrow Table).
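
A small sketch of the difference, writing a temporary CSV first:

csv_file = file.path(tempdir(), "penguins.csv")
write_csv_arrow(palmerpenguins::penguins, csv_file)

# Default: as_data_frame = TRUE returns a familiar tibble
read_csv_arrow(csv_file) |> class()

# as_data_frame = FALSE keeps the data as an Arrow Table
read_csv_arrow(csv_file, as_data_frame = FALSE) |> class()

If the data is going straight into further {arrow}/{dplyr} pipelines, keeping it as an Arrow Table avoids the extra in-memory copy.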

This blog post by François Michonneau goes into far more detail, and discusses the R and Python implementations of the different APIs.

Acknowledgements

This blog was motivated by the excellent Arrow tutorial at Posit Conf 2023, run by Steph Hazlitt and Nic Crane. The NYC dataset came from that tutorial, and a number of the ideas that I explored were discussed with the tutorial leaders. I also used a number of resources found on various corners of the web. I’ve tried to provide links, but if I’ve missed any, let me know.

