Reproducible Analytical Pipelines

> “Here’s the new data. Could you summarise it like Alice did last year, and send me a report?”
The civil service and public bodies in the UK publish lots of datasets. These datasets can be really helpful when experimenting with data-visualisation and presentation tools. As data consumers, though, we rarely see the amount of work that goes into preparing those datasets, or how they are used to make decisions about, or understand trends within, the country. That work has to be coordinated across multiple people, each with different skills.
Much like teams, software and data evolve over time. The raw data that feeds into the above datasets, and any products built upon them (reports, applications and so on), may only be collected and processed every few years - and a lot can change in a few years. So, teams within those departments need a way to reliably regenerate those datasets and data products from newly collected raw data - a way that is robust (or at least flexible) enough to accommodate changes in:
- data quality,
- the structure/schema of the raw data,
- personnel within the team and departmental restructuring,
- software tooling,
- output data format or usage.
It is becoming more common for this kind of data processing to be handled by a Reproducible Analytical Pipeline (RAP). A RAP is a largely automated process written in code. One aim of a RAP is to reduce the amount of manual and ad-hoc input into the data processing: given the same input data, it should generate the same downstream products, and it should work successfully and predictably when given new data. By placing the processing decisions in code, RAPs make data processing more transparent and more easily auditable.
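A minimal sketch of that idea, in Python: a single function turns a raw file into a summary file, and every processing decision lives in the code rather than in someone's head. The file and column names here are made up for illustration.

```python
import csv

def run_pipeline(raw_path, out_path):
    """Summarise a raw CSV file in one automated step.

    Every processing decision (here: skip records with a missing
    'value') is recorded in code, so the same input always yields
    the same output. File and column names are hypothetical.
    """
    with open(raw_path, newline="") as f:
        rows = [r for r in csv.DictReader(f) if r["value"].strip()]
    total = sum(float(r["value"]) for r in rows)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["n_records", "total"])
        writer.writerow([len(rows), total])
```

Anyone can re-run this on a new raw file and get the matching summary, with no manual steps in between.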
The UK Civil Service and the NHS have guidelines on their aims for RAPs and how to create these pipelines.
Now, you might not be working for one of those institutions, and the data processing and analysis that you perform might not be public facing or subject to a national audit. But, if you’re doing data science or data processing as part of your job, the ideas surrounding RAPs may help you work more efficiently.
Let’s start with the basics:
- where does your data come from?
- where does it go to?
- what is your main tool when working with it?
- and who else either depends upon, or is also responsible for, your work?
The RAP guidelines for the UK Civil Service promote the use of open-source tools, version control, and automation. Which tools should you choose, what should you automate, and who needs to know about or approve what you are doing?
If you’ve inherited an Excel workbook with last year’s data embedded inside it and you need to process this year’s data, you may not know enough about the processes that occurred before last year’s data was copied into the spreadsheet, or about any manual tweaks that happened after it was imported (how were missing values handled, for example). You could automate the early, data-ingestion stages.
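Scripted ingestion makes those decisions explicit. Suppose the raw sheet is exported to CSV; in this hedged sketch, the missing-value markers and the "drop rows with no region" rule are hypothetical - the point is that such decisions are recorded in code rather than applied by hand in a spreadsheet.

```python
import csv

# Hypothetical markers that should be treated as "missing".
MISSING = {"", "N/A", "n/a", "-"}

def ingest(raw_csv_path):
    """Load a raw CSV export, applying the same rules every year."""
    records = []
    with open(raw_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            clean = {k: (None if v.strip() in MISSING else v.strip())
                     for k, v in row.items()}
            if clean["region"] is None:
                continue  # a documented decision, not a manual tweak
            records.append(clean)
    return records
```

Next year's analyst can read this function and know exactly how missing values were handled.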
If you’ve inherited some SQL scripts that make database queries, and you have to copy-paste the resulting values into a report, you could automate the report-generation step.
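One way to sketch that automation, using Python's built-in sqlite3 module: run the queries and write their results straight into the report, replacing the copy-paste step. The `sales` table, column names, and report wording are all hypothetical.

```python
import sqlite3

def build_report(db_path, report_path):
    """Query the database and write the figures into a markdown
    report, so the numbers can never be mistyped in transit."""
    con = sqlite3.connect(db_path)
    try:
        (n,) = con.execute("SELECT COUNT(*) FROM sales").fetchone()
        (total,) = con.execute("SELECT SUM(amount) FROM sales").fetchone()
    finally:
        con.close()
    with open(report_path, "w") as f:
        f.write("# Annual summary\n\n")
        f.write(f"- Records processed: {n}\n")
        f.write(f"- Total sales: {total}\n")
```

Tools like R Markdown, Quarto, or Jupyter take the same idea further, embedding the queries in the report source itself.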
If you have a collection of analysis steps or scripts that have to be called in a particular order, or where you have to manually edit the scripts (fixing file paths, for example) for them to work with a new raw-data release, you could think about how to orchestrate those scripts, or how to configure the project so that it requires less manual intervention next time. Editing code and calling commands in a programming environment are manual processes, too.
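A minimal sketch of that kind of orchestration: the paths live in one configuration object and the step order is written down once, so a new data release means changing the config, not editing every script. The directory layout and step names here are hypothetical (the steps just log their names, standing in for real work).

```python
from pathlib import Path

# One place to change when a new raw-data release arrives.
CONFIG = {
    "raw_dir": Path("data/raw/2024"),        # hypothetical layout
    "out_dir": Path("data/processed/2024"),
    "log": [],
}

def clean(config):
    config["log"].append("clean")    # stand-in for the real step

def analyse(config):
    config["log"].append("analyse")

def report(config):
    config["log"].append("report")

STEPS = [clean, analyse, report]  # the required order, written down once

def run_all(config):
    for step in STEPS:
        step(config)
```

Dedicated tools (GNU Make, the R package targets, Snakemake and friends) add caching and dependency tracking on top of this basic pattern.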
You may not be able to automate everything at once, so aim for strategic wins in the areas of your data workflow that are the least clear, or that involve the most manual input.
The push towards automation requires programming skills, and a choice of programming language. In data science this typically means SQL plus either R or Python. Which you choose for a project depends on the skills across your team and the infrastructure available to you. Don’t use your favourite language, or a language you want to experiment with, if no-one else on the team can review your code or take over the project from you.
One of the best resources that I found while researching this blog post was the book “Building reproducible analytical pipelines with R” by Bruno Rodrigues. That book covers many of the topics mentioned above: how to set up a project with version control, how to generate automated reports, how to orchestrate multiple analytical processes together. It is a very R-focussed book, but the ideas hold whether you work in Python or another language.
Reproducibility in data science has a long-standing counterpart in science more generally. If you write a scientific paper, the data upon which it is based, and the data-processing steps involved, should be made available - and in a form that can be reused. If someone downloads your data and code to regenerate your results, the code should be written so that this is guaranteed. Just releasing a script on GitHub isn’t enough: the precise versions of any scripts and project-specific data should be tagged; the programming environment should be matched as closely as possible (for example, the same version of R or Python, and the same versions of any installed packages); any supporting data sources should be pinned to specific versions; and so on.
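In a Python project, one lightweight way to do this kind of pinning is to commit a frozen list of package versions alongside the code and tag the matching release. The package names, versions, and tag name below are illustrative:

```
# requirements.txt - generated with `pip freeze` and committed next to
# the code; tag the matching commit, e.g. `git tag data-2024-release`
pandas==2.2.1
numpy==1.26.4
openpyxl==3.1.2
```

Tools such as renv (for R), pip-tools, or conda lockfiles take this further, but the principle is the same: record the exact environment used to produce each release.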
For us though, RAPs are more about ensuring that data processing is predictable and transparent, and that processes can be reused at a later date with updated data. Your team may need to level up their programming skills, or their knowledge of your programming environment, to take advantage of improved automation. But doing so will reduce the number of repetitive manual tasks, simplify on-boarding new team members, and make maintenance easier.
Also, automating stuff is really fun.
