The Jumping Rivers Blog

Using R to Teach R: Lessons for Software Development

Thu, 09 Apr 2026 23:59:00 +0000

As we approach the 10-year anniversary of Jumping Rivers, which was founded in 2016, it’s a good time to reflect on what we have achieved and share some lessons learned.

If you have read our blogs previously then you will be aware that Jumping Rivers is a consultancy and training provider in all things data science. But did you know that we offer over 50 different courses spanning R, Python, Git, SQL and more?

In this blog we will provide a glimpse into our internal process and share how we have streamlined the task of maintaining so many courses. Along the way we will share some good practices applicable to any big coding project, including packaging of source code and automated CI/CD.

The challenge

Let’s start by laying out the key challenges which face us.

1. Multilingual support

Our course catalogue consists of over 50 courses. The majority of these are either based on R or Python or both:

50% R
30% Python
5% R and Python
15% other (Git, SQL, Tableau, Posit and more)

At the very least, any solution that we come up with for standardising our courses must be compatible with both R and Python. Ideally it should also support some less taught languages including SQL and Git.

2. Maintenance

The world of R and Python is constantly changing. The languages themselves receive frequent updates, as do publicly available R packages on CRAN and Python packages on PyPI.

This has the consequence that code which worked one year ago (or even one day ago) may no longer work with the latest package versions. We need some way to track this and ensure that the code examples covered in our courses remain relevant and error-free.

3. Demand

We deliver over 100 courses per year. For a relatively small team of data scientists, this can be a lot to juggle!

In an ideal world, the process of building the course materials, setting up the cloud environment for training, and managing all of the administration that goes along with this should be automated. That way, the trainer can focus on providing the highest quality experience for the attendees without having to worry about things going wrong on the day.

The solution

Our team is used to setting up data science workflows for clients, including automated reporting and migration of source code into packages. We have therefore applied these techniques in our internal processes, including training.

Automated reporting

You write a document which has to be updated on a regular basis; this might include a monthly presentation showing the latest company revenues. Does this scenario sound familiar?

We could regenerate the plots and data tables and manually copy and paste these into the report document. Even better, we can take advantage of free-to-use automated reporting frameworks including R Markdown and Quarto.

R Markdown and Quarto both work as follows:

We provide a “YAML header” at the top of the report document with configuration and formatting options:
```
---
title: "Introduction to Python"
authors:
- "Myles Mitchell"
date: "2026-04-02"
output: pdf
---
```

The report body is formatted as Markdown and supports a mixture of plain text and code:

## Introduction
At its most basic, Python is essentially a calculator.
We can run basic calculations as follows:
```{python}
2 + 1
```
We can also assign the output of a calculation to a
variable so that it can be reused later:
```{python}
x = 2 + 1
print(x)
```

Notice that we have included chunks of Python code. By making use of chunk options we can configure code chunks to be executed when rendering the report. Any outputs from the code (plots, tables, summary statistics) can then be displayed.

By migrating the code logic into the report itself, we can update our report assets at the click of a button whenever the data changes.
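
To make the idea concrete, here is a minimal sketch in plain Python of a report whose contents are computed from the data rather than pasted in. It is not how R Markdown or Quarto work internally; the `render_report` function and the revenue figures are hypothetical and exist only for illustration.

```python
from statistics import mean


def render_report(title, revenues):
    """Return a Markdown report built from a mapping of month -> revenue."""
    lines = [f"# {title}", "", "| Month | Revenue |", "| --- | ---: |"]
    for month, value in revenues.items():
        lines.append(f"| {month} | {value:,.0f} |")
    # The summary line is recomputed every time the report is rendered.
    lines += ["", f"Average monthly revenue: {mean(revenues.values()):,.0f}"]
    return "\n".join(lines)


# Hypothetical figures, purely for illustration.
revenues = {"Jan": 12000, "Feb": 13500, "Mar": 12750}
report = render_report("Monthly revenue", revenues)
print(report)
```

When next month's figures arrive, only the `revenues` data changes; the table and summary are regenerated on the next render.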

We have taken inspiration from this approach with our course notes and presentation slides. This forces us to be rigorous with the code examples. Any runtime errors that are produced by faulty or outdated code would be visible in the course notes and by extension to the attendees of our courses.

Crucially for us, R Markdown and Quarto are both compatible with R and Python. They also support syntax highlighting for languages like Git and SQL, as well as a variety of output formats including HTML and PDF.

Internal R packages

So we have settled on a solution for building our course notes. But we have 50 different courses, and setting these up from scratch each time is going to get tedious!

A good practice in any coding project is to avoid duplication as much as possible. Instead of copying and pasting code, we should really be migrating code into functions which are self-contained, reusable and easy to test. This will mean fewer places to debug when things inevitably go wrong.
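
As a small illustration of this principle, the sketch below moves a calculation into a single function and pairs it with a test. The function name `percentage_share` and the example counts are hypothetical; they are not taken from our internal packages.

```python
def percentage_share(counts):
    """Convert raw counts into percentage shares of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one non-zero value")
    return {name: 100 * value / total for name, value in counts.items()}


def test_percentage_share():
    # One place to test means one place to debug.
    shares = percentage_share({"R": 2, "Python": 1, "Other": 1})
    assert shares == {"R": 50.0, "Python": 25.0, "Other": 25.0}


test_percentage_share()
```

Every report or script that needs the calculation can now call the same function, and a fix made once applies everywhere.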

Following a similar philosophy for our training infrastructure, we have migrated any reusable assets for our courses—including logos, template files and styling—into a collection of internal R packages.

When building a new course, the developer can now focus on the aspects that are unique to that course:

Code examples
Notes
Exercises
Presentation slides

Everything else is taken care of automatically:

The appearance of the course notes and presentation slides.
Build routines including converting the R Markdown / Quarto text files into HTML.

In addition to course templates, we also have internal packages for managing the administrative side of training, including:

Calculating pricing quotes for clients.
Generating post-course certificates.
Spinning up a bespoke Posit Workbench environment for the course.
Summarising attendee feedback.

And the list goes on!

GitLab CI/CD

With automated reporting and packaging of source code, we have created standardised routines that can be applied to any of our courses.

This does not change the fact that we have over 50 courses to maintain. We still need a way of testing our courses and tracking issues. This is where CI/CD (Continuous Integration / Continuous Delivery and Deployment) comes in.

CI/CD defines a framework for software development, including:

Automated unit testing.
Branching of source code and code review.
Versioning and deployment of software.

If you maintain software then you have likely come across version control with Git. Cloud platforms like GitLab and GitHub provide tools for collaborative code development. Not only do they provide a cloud backup of your source code, they also provide the following features:

CI/CD tools for automated testing, build and deployment.
Branch rules for enforcing good practices like code review and unit testing.
Versioning and tagging of source code.

Each of our courses is maintained via its own GitLab repository. The CI/CD pipelines for our courses are defined in a separate repository along with the internal R packages mentioned above.

When setting up a new course, the course repository will be automatically populated with the template CI/CD rules. All courses are therefore subject to the same stringent checks, including:

Ensuring that the course notes build without errors.
Enforcing code review of any course updates before these are merged into the main branch.
Building and storing the artifacts (the rendered HTML notes and coding scripts) for the latest version of the course.

These checks are triggered by any updates to a course. We also schedule monthly CI/CD pipelines for all courses, with any issues immediately flagged to our trainers.
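
The first check in the list above, that the notes build without errors, can be approximated with a few lines of Python. This is only a sketch of the idea: a real pipeline would render the document with R Markdown or Quarto, whereas the hypothetical `find_broken_chunks` function below simply extracts the Python chunks from a document and runs them in order.

```python
import re

# Build the fence marker programmatically so this example can itself
# live inside a fenced code block.
FENCE = "`" * 3
CHUNK_PATTERN = re.compile(FENCE + r"\{python\}\n(.*?)" + FENCE, re.DOTALL)


def find_broken_chunks(document):
    """Run each Python chunk in order; return (chunk_number, error) pairs."""
    namespace = {}  # shared, so later chunks can use earlier variables
    failures = []
    for number, code in enumerate(CHUNK_PATTERN.findall(document), start=1):
        try:
            exec(code, namespace)
        except Exception as error:
            failures.append((number, repr(error)))
    return failures


# Hypothetical course notes with one working and one faulty chunk.
notes = "\n".join([
    "## Introduction",
    FENCE + "{python}",
    "x = 2 + 1",
    FENCE,
    FENCE + "{python}",
    "y = x / 0",
    FENCE,
])

failures = find_broken_chunks(notes)
print(failures)
```

A CI job could fail whenever the returned list is non-empty, so that a faulty chunk is caught before attendees ever see it.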

We have also taken advantage of GitLab’s folder-like structure for organising code repositories. Within the Jumping Rivers project on GitLab, we have a subproject called “training”. All of our course-related repositories are located “downstream” from this project. This means that any settings or environment variables defined at the “training” level are automatically applied to all of our courses.
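
The inheritance of settings can be pictured as a layered lookup. The sketch below uses Python's `collections.ChainMap` to model it; the setting names and values are hypothetical, and this illustrates the concept rather than how GitLab stores its variables.

```python
from collections import ChainMap

# Hypothetical settings defined once at the "training" level.
training_settings = {"BUILD_FORMAT": "html", "NOTIFY": "trainers"}

# Hypothetical settings defined on a single course repository.
course_settings = {"COURSE_NAME": "intro-python", "BUILD_FORMAT": "pdf"}

# Lookups check the course first, then fall back to the training level.
effective = ChainMap(course_settings, training_settings)

print(effective["NOTIFY"])        # inherited from the training level
print(effective["BUILD_FORMAT"])  # the course-level value wins in this sketch
```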

In summary

The take-home lessons from this blog are applicable to any big coding project:

Avoid duplication: migrate any reusable logic or assets into standalone packages.
Utilise CI/CD workflows using GitLab, GitHub or similar.
Focus on what matters by automating as much of the process as possible.

Our training infrastructure has taken 10 years to build and is still constantly evolving; we have not even covered the full process in this blog! For a deeper dive, check out this talk by Myles at SatRdays London 2024.

For more on automated reporting, check out:

For more on packaging of source code, check out:

Writing a personal R package.
Three-part series: Creating a Python package.
Four-part series: R package quality.

For updates and revisions to this article, see the original post

Why Learning R is a Good Career Move in 2026

Thu, 26 Mar 2026 23:59:00 +0000

Over the course of my career as a Data Scientist, I’ve worked on projects ranging from simple code reviews to large application builds. For the most part, I have used R to do this.

If you’re getting into coding or data science, one question you’re probably asking yourself is “Which language should I learn?”

This blog aims to show you why R might be a good decision.

R was built for data (not just programming)

Unlike general purpose languages (such as Python), R was designed specifically for statistics and data analysis.

That means:

Built-in statistical tools
Powerful visualisation capabilities
Research-level methods available immediately

With packages like the tidyverse, you can clean, analyse, and visualise data with surprisingly little code.

High demand in analytics, research, and healthcare

R is especially popular in many sectors such as:

Healthcare & biostats
Academic research
Government departments
Finance & risk modeling
Pharmaceutical companies

Here are some examples of R in production use:

The {bbplot} R package. Yes, the BBC use R to create graphics for their website!
Health and wellbeing profiling app for the NHS
During the Covid-19 pandemic, the Financial Times had a stats tracker in which the graphs were built with R.

Knowing some R will give you a competitive edge if you’re looking at working within these sectors.

Open source with the backing of Posit

R is open source. This means that:

It’s free, and always will be!
Anyone can view the source code that makes up R.
The source code of every package on CRAN is publicly available, and many R packages (folders containing code) are developed in the open on sites like GitHub.
It has a large community of contributors. There are great places to get help, such as Stack Overflow, Posit Community, the R Weekly newsletter and many more.
Through thousands of packages, there is far more functionality available than in paid software such as SPSS, SAS or Excel.

Posit, who maintain the free-to-use RStudio and Positron IDEs (integrated development environments), have many full-time staff working solely on maintaining and creating new functionality within R. This means we get:

Defined accountability
Predictable release cycles
Faster bug fixes

Incredible data visualisation possibilities

Being able to communicate your findings with stakeholders is very important in data science, and one of R’s biggest strengths is visualisation and reporting.

With the {ggplot2} package, you can create publication ready charts with very little code. The R Graph Gallery has some amazing examples of what is possible with {ggplot2}.

With the {quarto} and {shiny} packages, you are able to build reproducible reports and interactive dashboards. All this without needing to know any HTML, CSS or JavaScript.

Beginner friendly learning curve

This is very much my own opinion. Compared to other languages, I think R is fairly intuitive and feels rewarding much earlier on in the journey. It also has (in my opinion) the most beginner-friendly program to code in, called RStudio.

Most people attend only two days’ worth of training with Jumping Rivers, and say they feel ready to start tackling their own data problems.

So… is R worth learning in 2026?

I think so. If you want pure software engineering or large-scale production systems, you may need Python. But for becoming a strong data thinker, and giving you an edge in your analysis, R is one of the best starting points.


Reproducible Analytical Pipelines

Thu, 19 Mar 2026 23:59:00 +0000

Here’s the new data. Could you summarise it like Alice did last year, and send me a report?

The civil service and public bodies in the UK publish lots of datasets. These datasets can be really helpful when experimenting with data visualisation and presentation tools. As data consumers, what we rarely see is the amount of work that goes into preparing those datasets, or how they are used to make decisions about, or understand trends within the country. That work has to be coordinated across multiple people, each with different skills.

Much like teams do, software and data evolve over time. The raw data that feeds into the above datasets, and any products that are built upon them (reports, applications and so on), may only be collected and processed every few years - and a lot can change in a few years. So, teams within those departments need a way to reliably generate those datasets and data products from newly-collected raw data that is robust (or at least flexible) enough to accommodate changes in:

data quality,
the structure/schema of the raw data,
personnel within the team and departmental restructuring,
software tooling,
output data format or usage.

It is becoming more common for this kind of data processing to be handled by a Reproducible Analytical Pipeline (RAP). A RAP is a largely automated process written in code. One aim of using RAPs here is to reduce the amount of manual and ad-hoc input into the data processing, so that the same input data always generates the same downstream products and the process works successfully and predictably when given new data. By placing the processing decisions in code, RAPs make data processing more easily auditable and more transparent.
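
One way to check the "same input, same output" property is to fingerprint the result of a processing step and compare fingerprints across runs. The sketch below is a hypothetical example using only the Python standard library; the `summarise` step and the records are invented for illustration.

```python
import hashlib
import json


def summarise(records):
    """Deterministic step: total value per region, ordered by region name."""
    totals = {}
    for record in records:
        region = record["region"]
        totals[region] = totals.get(region, 0) + record["value"]
    return dict(sorted(totals.items()))


def fingerprint(result):
    """Stable SHA-256 digest of a JSON-serialisable result."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical raw data.
raw = [
    {"region": "North", "value": 10},
    {"region": "South", "value": 5},
    {"region": "North", "value": 7},
]

first_run = fingerprint(summarise(raw))
# The same records in a different order should give the same product.
second_run = fingerprint(summarise(list(reversed(raw))))
print(first_run == second_run)
```

If the two fingerprints ever differ for identical input, something in the process depends on manual or hidden state.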

The UK Civil Service and the NHS have guidelines on their aims for RAPs and how to create these pipelines.

Now, you might not be working for one of those institutions, and the data processing and analysis that you perform might not be public facing or subject to a national audit. But, if you’re doing data science or data processing as part of your job, the ideas surrounding RAPs may help you work more efficiently.

Let’s start with the basics:

where does your data come from?
where does it go to?
what is your main tool when working with it?
and who else either depends upon, or is also responsible for, your work?

The RAP guidelines for the UK Civil Service promote the use of open-source tools, version control, and automation. Which tools should you choose, what should you automate, and who needs to know about or approve what you are doing?

If you’ve inherited an Excel workbook with last year’s data embedded inside it and you need to process this year’s data, you may not know enough about the processes that occurred before last year’s data was copied into the spreadsheet or any manual tweaks that happened after it was imported (how were missing values handled etc). You could automate the early, data ingestion, stages.

If you’ve inherited some SQL scripts that make database queries and you have to copy-paste the resulting values into a report, you could automate the report-generation step.

If you have a collection of analysis steps or scripts, that have to be called in a particular order, or where you have to manually edit the scripts (fixing the filepaths, for example) for them to work with a new raw-data release, you could think about how to orchestrate running those scripts or how to configure the project so that it requires less manual intervention to run next time. Editing code and calling commands in a programming environment are manual processes, too.
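
A minimal sketch of that kind of orchestration is shown below. The steps are ordinary functions run in a fixed order, and anything that changes between releases lives in one configuration dictionary instead of being edited inside the scripts. All names and values here (`CONFIG`, `ingest`, `clean`, `summarise`) are hypothetical, and `ingest` returns hard-coded values where a real pipeline would read a file.

```python
# Hypothetical settings: the only thing to edit for a new data release.
CONFIG = {"release": "2026", "missing_value": 0}


def ingest(config, data):
    """Stand-in for reading the raw data for config['release']."""
    return [4, None, 6]


def clean(config, data):
    """Replace missing values using the documented rule from the config."""
    return [config["missing_value"] if value is None else value for value in data]


def summarise(config, data):
    """Produce the downstream product."""
    return {"release": config["release"], "total": sum(data)}


# The order of the steps is written down once, in code.
PIPELINE = [ingest, clean, summarise]


def run_pipeline(config, steps=PIPELINE):
    """Run each step in order, passing the output of one to the next."""
    data = None
    for step in steps:
        data = step(config, data)
    return data


result = run_pipeline(CONFIG)
print(result)
```

Dedicated orchestration tools offer much more (caching, dependency tracking, parallelism), but even this small structure removes the need to remember the order of the scripts.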

You may not be able to automate everything at once. So try to make strategic wins on those areas of your data workflow that are the least clear, or that involve the most manual input.

The push towards automation requires programming skills, and a choice over a programming language. In data science this typically means SQL plus either R or Python. Which you choose for a project depends on the skills across your team and the infrastructure that is available to you. Don’t use your favourite language, or a language you want to experiment with, if no-one else on the team can review your code or take over the project from you.

One of the best resources that I found while researching this blog post was the book “Building reproducible analytical pipelines with R” by Bruno Rodrigues. That book covers many of the topics mentioned above: how to set up a project with version control, how to generate automated reports, how to orchestrate multiple analytical processes together. It is a very R-focussed book, but the ideas hold whether you work in Python or another language.

Reproducibility in data science has a long-standing counterpart in science more generally. If you write a scientific paper, the data upon which it is based, and the data-processing steps involved should be made available. But they should be created in such a way that they can be reused. If someone wants to regenerate your results, and they can download your data and code, the code should be written in such a way that this is guaranteed. Just releasing a script on GitHub isn’t enough - the precise version of any used scripts and project-specific data should be tagged; the programming environment should be matched as closely as possible (for example, matching the version of R or Python used, using the same versions of any installed packages); any supporting data sources should be pinned to specific versions and so on.
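
As a first step towards matching the programming environment, you can record the interpreter and package versions alongside your results. The sketch below does this for Python with the standard library's `importlib.metadata`; the function name `environment_snapshot` and the package names passed to it are just examples.

```python
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Record the Python version and the installed version of each package."""
    snapshot = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence rather than failing the whole snapshot.
            snapshot[name] = None
    return snapshot


snapshot = environment_snapshot(["pip", "definitely-not-a-real-package-xyz"])
print(snapshot)
```

Storing this snapshot with each tagged release gives anyone rerunning the analysis a record of what to install. Tools that create lock files go further by also recreating the environment.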

For us though, RAPs are more about ensuring that data-processing is predictable and transparent, and that processes can be reused at a subsequent date and with updated data. Your team may need to level-up their programming skills, or their knowledge of your programming environment, to take advantage of improved automation. But doing so will reduce the amount of repetitive manual tasks, simplify on-boarding new team members, and make maintenance easier.

Also, automating stuff is really fun.


Three Posit Platform Features Worth Knowing About

Fri, 13 Mar 2026 23:59:00 +0000

We recently ran a session on Posit platform updates, the kind of features that don’t always make it onto your radar but can make a real difference once you know they’re there.

This post covers the three highlights: speeding up R package installation with Posit Package Manager, a new way to explore example apps on Connect, and Workbench Jobs for long-running tasks.

R package installs don’t have to take 26 minutes

If you’ve ever kicked off a Tidyverse install and gone to make a coffee (and come back to find it still running), this one’s for you. When installing from source (which is what happens if you point R at a plain CRAN mirror on Linux), R downloads the source tarball and compiles everything from scratch. That takes time. A lot of it. In our test, a clean Tidyverse install on R 4.4 took 26 minutes.

The fix is to point R at a binary-supporting mirror, which is exactly what Posit Package Manager provides. With binaries, that same install dropped to under two minutes: no compilation, no hunting down system dependencies.

If you’re on R 4.5, it gets better. R 4.5 introduced parallel package downloads, which cuts that two-minute install down to around 40 seconds. Throw in parallel CPU usage for installation as well via the Ncpus argument, and you’re looking at 15 seconds for a full Tidyverse install in a clean environment.

There’s also a preview feature to keep an eye on: ManyLinux support in Package Manager. The idea is to bundle more of the system-level dependencies into the package itself, which means less dependency management for sysadmins. Downloads are a bit larger, but the maintenance overhead is lower. If you want a deeper dive into PPM itself, we have a Managing Packages with Posit Package Manager training course that covers this in detail.

The short version: use binaries + R 4.5 + parallel installs. You can go from half an hour to about 15 seconds.
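
To put those timings in perspective, here is the arithmetic as a short Python sketch. The figures are the ones quoted above; "under two minutes" is treated as exactly 120 seconds, so the first ratio is a lower bound.

```python
# Baseline: clean Tidyverse install from source on R 4.4.
baseline_seconds = 26 * 60

# Timings quoted in this post, in seconds.
timings = {
    "binaries only": 2 * 60,
    "binaries + parallel downloads": 40,
    "binaries + parallel downloads + Ncpus": 15,
}

speedups = {label: baseline_seconds / seconds for label, seconds in timings.items()}

for label, speedup in speedups.items():
    print(f"{label}: {speedup:.0f}x faster")
# prints:
# binaries only: 13x faster
# binaries + parallel downloads: 39x faster
# binaries + parallel downloads + Ncpus: 104x faster
```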

Connect Gallery: example apps without the setup friction

If you’ve used Posit Connect for a while, you might remember the quick-start popup that appeared on first login — a set of example apps you could try out. That’s been replaced by Connect Gallery, which lives in the interface rather than popping up in front of you.

What’s changed isn’t just where it lives. Installing an example app is now one click. Previously you’d follow a set of instructions to get it running; now it just deploys.

Two examples worth highlighting from the gallery:

Usage Metrics — shows you which content on your Connect server is actually being used, filtered by time period and user. It uses a visitor key, so the app shows each viewer only the content they have permission to see. Useful for admins wondering what’s getting traction and what isn’t.

Command Center for Publishers — a dashboard built with Python that reimplements much of the Connect admin interface inside an app. You can rename deployed content, lock it, and manage it through the Connect API. Worth looking at both as a tool and as an example of how to build admin functionality on top of Connect.

If you’re new to Connect or want to get more from it, our Introduction to Posit Workbench training course covers the full Posit environment including how Workbench and Connect work together.

Workbench Jobs: run something long and close your session

This one comes up as a question fairly often: if I start a background job in Posit Workbench and close my session, will it keep running?

The old answer was no. Background jobs were child processes of your session: close the session and the job goes with it.

Workbench Jobs are different. They run independently of your session. You can start a job, close RStudio Pro or VS Code entirely, and the job keeps going. When you open a new session, you can still see it running, check its live output, and monitor resource usage.

This is handy for anything that takes longer than you want to babysit: data processing pipelines, model training runs, file exports. The job has access to your data sources and connections, and you can pick up wherever you left off.

There’s also an auditing option for Workbench Jobs. When enabled, the output gets a cryptographic signature, useful if you need to demonstrate not just that the job ran, but exactly what it produced.
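
To show what a signature over job output buys you, here is a generic sketch using an HMAC from the Python standard library. This is an assumption-laden illustration of the general idea only: we are not describing how Workbench implements its auditing, and the key, output and function names are all hypothetical.

```python
import hashlib
import hmac


def sign_output(output, key):
    """Return a hex HMAC-SHA256 signature for the given output bytes."""
    return hmac.new(key, output, hashlib.sha256).hexdigest()


def verify_output(output, key, signature):
    """Check that the output still matches its signature."""
    return hmac.compare_digest(sign_output(output, key), signature)


# Hypothetical key and job output, for illustration only.
key = b"example-secret-key"
job_output = b"rows processed: 1000\n"

signature = sign_output(job_output, key)
print(verify_output(job_output, key, signature))               # unchanged output
print(verify_output(b"rows processed: 9999\n", key, signature))  # altered output
```

Any change to the output, however small, produces a different signature, which is what lets you demonstrate exactly what a job produced.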

Workbench Jobs vs scheduled content on Connect

A quick note on when to use which. If you need to run something once from inside your current workflow and you want access to local files, data connections, and everything in your working environment, a Workbench Job makes sense. It’s more hands-on.

If you need to schedule something to run repeatedly, share the results with other people, or get an email when it’s done, that’s what Connect is for. The two tools complement each other rather than compete.

If any of this is relevant to your setup, whether you’re looking at speeding up your package environment, making better use of Connect, or running longer jobs in Workbench, get in touch. As a certified Posit Partner, we help teams get the most from their Posit investment, from infrastructure setup to long-term managed support.

AI in Production — 4–5 June 2026, Newcastle

If you’re thinking about how AI fits into production data science environments, this is the conference for it. Two days of real-world talks and hands-on workshops from practitioners across engineering and ML, covering deployment, monitoring, scaling, and what actually works when AI leaves the prototype stage.

Register now at ai-in-production.jumpingrivers.com


Is Your Dashboard User Friendly?

Thu, 12 Mar 2026 23:59:00 +0000

For a while we, at Jumping Rivers, have offered a Dashboard Health Check (DHC) largely focused around backend features and other facets the end-user doesn’t see: things like version control, documentation and deployment. However, the DHC also included a few checks related to user experience and accessibility. While we’ve always believed these are useful additions, we would like to offer more in-depth guidance to our clients on how they can make their applications more user-friendly. To facilitate this, we are now introducing the Frontend Dashboard Health Check (FDHC).

What could an FDHC help me with?

So what kind of advice can you get from a Frontend Dashboard Health Check, you might wonder. Here are just a few of the possibilities:

Tools like Shiny and Dash make it relatively quick and easy to build data dashboards. These can often start out as a fixed single page of data and, over time, morph into something much more complex and interactive with multiple views. Such applications can be incredibly powerful, but with great power comes great ~~responsibility~~ complexity. For a dashboard to be successful, users need to understand how to use it effectively to answer their questions. This can mean discovering and/or learning many features from basic navigation between views to how to interrogate the data contained within using techniques like search, filter, sort, partition, drill-down and summarise. We can suggest places where users may get stuck or confused, and suggest means of amelioration.
A successful, production-ready dashboard also needs to be robust. At minimum that means resilient to unexpected user input and to its own (perhaps temporary) inability to provide the output it’s supposed to (if a server is down, for example). An app that just hangs when something goes wrong is going to confuse and frustrate users and can lead to wasted time and even loss of work. We can show you where your app may fall over so that you can take action to prevent it.
These days we consume pages from the world wide web using all manner of devices. Does your app work on 4k and 5k monitors? More importantly, at the other end of the scale, there is now usually the expectation that things should work on mobile and other touchscreen devices. We can show you at which dimensions your app layout may become difficult or impossible to use and where users using specific input methods - e.g. mouse, touch, keyboard - may have difficulties.
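
As an illustration of the robustness point above, the sketch below validates a user-supplied value and returns a helpful message instead of letting the app crash or hang. It is a generic Python example, not code from Shiny or Dash, and the function name and year range are hypothetical.

```python
def parse_year_filter(raw_value, minimum=2016, maximum=2026):
    """Return (year, message); year is None when the input cannot be used."""
    try:
        year = int(str(raw_value).strip())
    except ValueError:
        return None, f"'{raw_value}' is not a whole number."
    if not minimum <= year <= maximum:
        return None, f"Please choose a year between {minimum} and {maximum}."
    return year, ""


# Unexpected input produces a message the user can act on.
print(parse_year_filter(" 2020 "))
print(parse_year_filter("twenty"))
print(parse_year_filter("1999"))
```

The dashboard can then display the message next to the input and keep showing its last valid output.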

What deliverables would I get from an FDHC?

The principal deliverable from an FDHC is a detailed spreadsheet indicating what issues we’ve found and where they can be found (or how to reproduce them). Wherever practical we will also include annotated screenshots (or occasionally recordings) giving a visual outline of a problem (see below). We will also strive to suggest possible remedies.

An example of annotated screenshots highlighting an issue with the page layout for certain width-ranges for an old version of our own Litmus Dashboard application.

An example of an annotated screenshot highlighting an issue with input labelling for an old version of our own Litmus Dashboard application.

What about the old DHC?

We will continue to offer a separate, report-based, health check for data dashboards. This “Backend Dashboard Health Check” (BDHC) will cover things like version control, documentation, deployment as before. We are, of course, more than happy to run a BDHC and an FDHC on the same application.

How do I find out more?

Please get in touch via this contact form or drop us an email at hello@jumpingrivers.com.


AI in Production 2026 Workshops: What’s Coming in June

Wed, 11 Mar 2026 23:59:00 +0000

We are excited to share more details about the workshops taking place at AI in Production 2026, which will be held in Newcastle upon Tyne on 4–5 June 2026.

AI in Production is a two-day conference. Day 1 (Thursday 4 June) is dedicated to hands-on workshops, followed by a full day of conference talks on Friday 5 June.

The workshop sessions are designed to give attendees practical exposure to the tools, patterns, and decisions involved in running AI systems in production. From large language models and data platforms to modern development tools, the focus is on how AI systems are actually built, deployed, and maintained.

How Day 1 Works

Thursday 4 June is divided into morning and afternoon workshop sessions, allowing attendees to take part in two half-day workshops.

Attend two workshops across the day
Lunch is included
All workshop tickets include access to the Thursday evening dinner reception

Workshops are delivered by Jumping Rivers consultants and invited speakers, including practitioners working with platforms such as Databricks and modern AI tooling in production environments.

Morning Workshops

09:30 – 12:45

Prompt Craft and AI Integration: Building LLM Driven Workflows in R and Python

This workshop focuses on integrating large language models into R and Python workflows.

Participants will explore prompt design, calling LLMs from production code, and handling common challenges such as inconsistent outputs and failure cases. The emphasis is on understanding where LLMs add value and how they fit into reliable systems.

From Nothing to Gold: Productionising with Databricks Using the Medallion Architecture

This workshop walks through the Medallion Architecture as it is used in production environments.

Participants will explore how data moves from raw ingestion to analytics and machine learning ready layers, with a focus on structure, quality, and scalability. The session draws on hands-on experience using Databricks to support data engineering and AI workloads.

Improving Your Workflow with Positron and Claude

This session explores how modern development tools support day-to-day data science and engineering work.

The workshop focuses on using Positron alongside AI assistance to write, explore, and refine code more efficiently while maintaining clarity and control. It is particularly relevant for teams working in R and Python.

Afternoon Workshops

13:45 – 17:00

Shiny Meets LLMs: Smarter App Experiences

This workshop explores how large language models can be integrated into Shiny applications to create more interactive user experiences.

Topics include designing effective user interactions, managing latency and cost, and thinking through reliability when deploying AI-enabled apps to users.

The Power of Databricks Genie Rooms: Data Discovery and Questions with Minimal Effort

This workshop focuses on Databricks Genie Rooms and their role in natural language driven data discovery.

Participants will explore how this approach works in practice, when it is effective, and where its limitations lie, particularly in production settings that support analysts and business users.

Self Hosted LLMs: Running Your Own Inference Infrastructure

This workshop focuses on running large language models on your own infrastructure.

The session covers infrastructure choices, performance and cost trade-offs, and operational considerations, factoring in constraints such as privacy, regulation, and reliability.

Drinks Reception

Day 1 concludes with a dinner and drinks reception from 17:00 to 19:30, hosted in the atrium of The Catalyst building.

This reception is included with all workshop tickets and offers an opportunity to meet speakers, connect with other attendees, and continue conversations before the conference talks on Friday.

Who the Workshops Are For

The Day 1 workshops are designed for:

Data scientists and machine learning practitioners
Engineers working on AI and data platforms
Analysts moving closer to production work
Technical leads responsible for AI systems

Each workshop stands on its own, allowing you to choose sessions that best match your interests.

Join Us in Newcastle

AI in Production 2026 takes place on 4–5 June 2026 at The Catalyst in Newcastle upon Tyne.

Workshop places are limited to keep sessions interactive. If you are interested in AI in production, Databricks, LLMs, R, Python, or modern data platforms, you can learn more and register on the AI in Production conference website.

For updates and revisions to this article, see the original post

Data Processing in Pandas and Polars: Free Jumping Rivers Webinar

Thu, 05 Mar 2026 23:59:00 +0000

Python offers a wide range of tools for data manipulation, but choosing the right one often depends on performance needs, workflow preferences, and dataset size.

On 19 March, Jumping Rivers is hosting a webinar focused on a practical question: how does data processing compare between pandas and polars?

The session will be led by Russ Hyde, Senior Data Scientist at Jumping Rivers, who regularly works with Python-based data workflows across analysis, training, and production environments.

You can register for the webinar using this form.

What You’ll Learn

This session will implement the same data processing pipeline in both pandas and polars to highlight how each library approaches common tasks. Russ will walk through:

Key syntax differences between pandas and polars
How each library handles core data manipulation steps
Performance and usability considerations
New functionality introduced in pandas 3.0

The goal is to give Python users a clearer understanding of when each tool makes sense and how recent updates affect day-to-day workflows.

Why This Matters

Pandas remains the standard library for in-memory tabular data analysis, and familiarity with its syntax is essential for many data science roles. At the same time, polars is gaining attention due to its performance-focused design and efficient execution model.

Understanding the strengths and trade-offs of both libraries helps teams:

Improve data processing performance
Choose tools aligned with project requirements
Write clearer and more maintainable code
Stay current with developments in the Python ecosystem

As expectations around performance and scalability increase, informed tooling decisions become more important.

Continued Learning Benefits

Jumping Rivers encourages teams to stay engaged with the wider webinar series:

Attend two webinars and receive 20% off tickets to the AI in Production 2026 conference
Attend more than two webinars and receive 20% off any Jumping Rivers public training course

Event Details

Date: 19 March 2026
Time: 1:15 PM (UK time)
Venue: Online

Register for the webinar using this form to secure your spot.


Jumping Rivers Now Approved to Sell Services Through DOS7: Crown Commercial Services

Tue, 17 Feb 2026 23:59:00 +0000

Jumping Rivers has been approved to sell our services through the Crown Commercial Service (CCS).

For UK public sector organisations, this is an important milestone. It means there is now a simpler, compliant way to work with us without going through lengthy procurement processes from scratch.

What the Crown Commercial Service does

The Crown Commercial Service supports public sector organisations by creating and managing procurement frameworks. These frameworks are designed to help teams buy services in a way that is compliant, transparent, and efficient.

Instead of running a full tender every time a service is needed, public sector buyers can use CCS frameworks to access pre-approved suppliers who have already met specific standards around capability, pricing, and compliance.

What our approval means

Being approved on a CCS framework means our services have been reviewed and assessed against the requirements expected for public sector procurement.

This includes areas such as:

Technical capability and experience
Value for money
Compliance with public sector procurement standards
Clear and transparent service offerings

For public sector teams, this reduces risk and removes a lot of the administrative burden that often comes with procurement.

Why this matters for public sector teams

If you work in the public sector, you are often balancing delivery pressure with strict procurement rules. Even when a team knows who they want to work with, the process of getting approval can slow things down.

By being available through CCS, we can now be procured through an established framework that many organisations already use.

This means:

Faster access to our services
Less time spent on procurement paperwork
Confidence that the supplier has already been vetted
A clearer route to starting work

Who this is relevant for

This approval is particularly useful for teams across:

Local and central government
Healthcare and NHS organisations
Education and research institutions
Other public sector bodies using CCS frameworks

If your organisation already buys services through CCS, you can now engage us directly using that route.

What this means for working with us

Nothing changes about how we work day to day. We still focus on understanding your context, your constraints, and what you actually need support with.

What has changed is how easy it is to get started.

If you are required to buy through CCS, this approval removes friction and shortens the path from first conversation to delivery.

Not sure where to start

If you are unsure which CCS framework applies to your organisation or whether this route is right for you, we are happy to talk it through.

We can help you understand:

Which framework to use
Whether your organisation is eligible
How to move from initial discussion to procurement

Our goal is to make working together as straightforward as possible. If you would like to get in touch, you can contact us.


Keeping Posit Environments Reliable in Production: Free Jumping Rivers Webinar

Wed, 11 Feb 2026 23:59:00 +0000

Teams often treat software updates as something to postpone until absolutely necessary. But with analytics platforms, falling behind can introduce avoidable risk, compatibility issues, and missed improvements.

On 19 February, Jumping Rivers is hosting a webinar focused on a simple question: why should teams keep their Posit software up to date?

The session will be led by Sebastian Mellor, Head of Engineering at Jumping Rivers, who supports organisations running Posit tools in production environments.

You can register for the webinar using this form.

What You’ll Learn

This session will highlight concrete examples pulled directly from recent Posit release notes.

Sebastian will walk through:

What changed and why it matters
How updates affect reliability, security, and performance
The risks of delaying upgrades

The goal is to give teams a clearer picture of what they gain by staying current and what can happen when updates are ignored.

Why This Matters

Keeping Posit software updated is not just about new features. It supports:

Stability across environments
Compatibility with evolving tooling
Security improvements
Better performance for data science teams

Even small version gaps can compound over time, making upgrades harder and increasing operational friction.

Continued Learning Benefits

Jumping Rivers encourages teams to stay engaged with the wider webinar series:

Attend two webinars and receive 20% off tickets to the AI in Production 2026 conference
Attend more than two webinars and receive 20% off any Jumping Rivers public training course

Event Details

Date: 19 February 2026
Time: 1:15 PM (UK time)
Venue: Online

Register for the webinar using this form to secure your spot.


Building a Robust .gitconfig

Thu, 05 Feb 2026 23:59:00 +0000

Getting started with Git is easy (ha!), but once you’ve mastered the basics, it’s natural for developers to start thinking about customising their git process. Most Git settings live in the .gitconfig file. In this blog post, I’ll discuss what you should consider setting in your config file to make a more efficient development environment.

Adding and Removing Variables

You can edit your global .gitconfig using any standard editor. It should live in your home directory. If you have difficulty finding it, try:

git config --edit --global

Standard Settings

These settings are probably(?) suitable for everyone. First, the name and email address you use when committing:

[user]
 name = Colin Gillespie
 email = colin@jumpingrivers.com

At Jumping Rivers, we enforce that the email address matches a particular pattern (@jumpingrivers.com) when committing to the corporate repo. This standardises our internal commit history. However, most people require a couple of identities - see the end of this post for details.

[core]
 excludesfile = ~/.gitignore
 editor = emacs

The excludesfile setting points to a global .gitignore file and allows you to exclude files regardless of the project. My global ignore file is fairly light. It contains file names, such as .Rhistory, ^tmp\\.*, and \.vscode, that I never want to commit.

The editor setting determines which text editor Git opens for commit messages and interactive operations. I still cling to Emacs, but most people probably prefer something else. Other options are vim, nano, or code --wait for Visual Studio Code.

Enabling colour output makes Git’s terminal output more readable. Different elements (additions, deletions, branch names, etc.) are highlighted in different colours, making it easier to scan and understand what’s happening at a glance.

[color]
 ui = 1

Almost all of the repositories I deal with use main for their default branch. This can be set via

[init]
 defaultBranch = main

Setting pull.rebase = true makes git pull rebase your local commits on top of the upstream changes rather than creating merge commits, resulting in a cleaner history - but can be very annoying!

[pull]
 rebase = true

The autoSetupRemote = true setting automatically sets up remote tracking when you push a new branch, eliminating the need for git push -u origin branch-name drama.

[push]
 default = simple # This is the default in modern Git versions
 autoSetupRemote = true

Branch Management

This setting changes how Git sorts branches when you run commands like git branch. By default, Git sorts branches alphabetically, which is rarely useful for me. Sorting by -committerdate instead lists the branches with the most recent commits first.

[branch]
 sort = -committerdate

Useful Aliases

You can also set Git aliases:

[alias]
 root = rev-parse --show-toplevel

The root alias provides quick access to the repository’s root directory - useful when you’re deep in a nested folder structure and need to reference files relative to the project root. You can use it simply with git root, which will output the absolute path to your repository’s top-level directory.

I’ve also created a zsh alias, gcd, that does this, as I found it really handy.

Security Settings

This section tackles the following problems:

We use ssh for checking out repositories
The ssh key is stored securely and is password protected, i.e. encrypted
Git commits are signed

Over the years, I’ve tried a few different methods, but as we (Jumping Rivers) use 1Password to manage credentials, I want to use the same system.

The ssh key is stored in 1Password.

[commit]
 gpgsign = true
[user]
 signingkey = ssh-ed25519 ABCD
[gpg]
 format = ssh
[gpg "ssh"]
 program = "/opt/1Password/op-ssh-sign"

Commit signing verifies that commits genuinely come from you, which is increasingly important in professional environments. This configuration uses SSH keys rather than traditional GPG keys (note the format = ssh setting). I’ve used GPG keys in the past, and they can be tricky.

The configuration integrates with 1Password’s SSH agent, allowing seamless signing without managing separate GPG keys. When you make a commit, 1Password handles the signing process automatically. It also means that you don’t have to constantly enter your password to decrypt your ssh key.

Another feature of this set-up is that it’s much easier to have multiple ssh keys, rather than “one key to rule them all”.

Pack Optimisation

This setting controls how aggressively Git compresses repository data. A depth of 20 balances storage efficiency against the computational cost of packing and unpacking objects. Lower values mean faster operations but larger repositories; higher values save space but slow down operations slightly.

[pack]
 depth = 20

To be honest, this has been “stolen” from a long forgotten blog post/tweet/StackOverflow question.

Conditional Includes for Work/Personal Separation

This, in my humble opinion, is an elegant solution for managing different Git identities. When working in repositories under /home/colin/jumpingrivers/, Git loads additional configuration from .gitconfig-work.

[includeIf "gitdir:/home/colin/jumpingrivers/"]
 path = .gitconfig-work

The .gitconfig-work file contains any new or updated variables, i.e.

[user]
 email = colin@jumpingrivers.com

This is perfect for using different email addresses on different projects.
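As a rough sketch of how the two files fit together, assume the global file carries a personal address as the default and the work file overrides it. The address colin@example.com is a placeholder invented for illustration; it does not come from my real config. The includeIf block sits after the [user] section because later values win, so the included email replaces the default inside the work directory.

```ini
# ~/.gitconfig
[user]
  name = Colin Gillespie
  email = colin@example.com # hypothetical personal default

# Must come after [user] so the included values take precedence
[includeIf "gitdir:/home/colin/jumpingrivers/"]
  path = .gitconfig-work

# ~/.gitconfig-work
[user]
  email = colin@jumpingrivers.com
```

Running git config user.email inside and outside the work directory is a quick way to check which identity is active.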

Putting It All Together

A well-configured .gitconfig file transforms Git from a tool you fight with into one that works seamlessly with your workflow. Avoid copying and pasting configuration options you don’t understand. Instead, consider each change in turn, and add it to your .gitconfig file. Remember, you can view your current configuration at any time with git config --list and edit it with git config --global --edit.


Using {ellmer} for Dynamic Alt Text Generation in {shiny} Apps

Thu, 22 Jan 2026 23:59:00 +0000

Alt Text

First things first, if you haven’t heard of or used alt text before, it is a brief written description of an image that explains context and purpose. It is used to improve accessibility by allowing screen readers to describe images, or provide context if an image fails to load. For writing good alt text see this article by Harvard, but some good rules of thumb are:

Keep it concise and relevant to the context of why the image is being used.
Screen readers will already say “Image of …” so we don’t need to include this unless the style is important (drawing, cartoon etc).

Alt Text within Apps and Dashboards

I don’t need to list the positives of interactive apps and dashboards, but one of the main ones is interactivity: allowing users to explore data in their own way. This is a great thing most of the time, but one pitfall that is often overlooked is that interactivity can overshadow accessibility, whether it’s a fancy widget that’s hard (or impossible) to use via keyboard or an interactive visualisation without meaningful alternative text.

In this post, we’ll look at a new approach to generating dynamic alt text for ggplot2 charts using {ellmer}, Posit’s new R package for querying large language models (LLMs) from R. If you are using Shiny for Python then chatlas will be of interest to you.

Why Dynamic Alt Text Needs Care

Automatically generating alt text is appealing, but production Shiny apps have constraints:

Plots may re-render frequently
API calls can fail or be rate-limited
Accessibility should degrade gracefully, not break the app

A good implementation should be consistent, fault-tolerant, and cheap to run.

Using {ellmer} in a Shiny App

The first step is setting up a connection to your chosen LLM. I am using Google Gemini 2.5 Flash as there is a generous free tier, but other models and providers are available. In a Shiny app, this can be done outside the reactive context:

library(ellmer)
gemini <- chat_google_gemini()

## Using model = "gemini-2.5-flash".

Note: You should have a Google Gemini key saved in your .Renviron file as GEMINI_API_KEY, this way the {ellmer} function will be able to find it. More information on generating a Gemini API key can be found in the Gemini docs.

Then we have the function for generating the alt text:

library(ggplot2)

generate_alt_text = function(ggplot_obj, model) {
  temp <- tempfile(fileext = ".png")
  on.exit(unlink(temp))

  ggsave(
    temp,
    ggplot_obj,
    width = 6,
    height = 4,
    dpi = 150
  )

  tryCatch(
    model$chat(
      "
Generate concise alt text for this plot image.
Describe the chart type, variables shown,
key patterns or trends, and value ranges where visible.
      ",
      content_image_file(temp)
    ),
    error = function(e) {
      "Data visualisation showing trends and comparisons."
    }
  )
}

The function has a few features that will keep the output more reliable:

Consistent image size and resolution - helps model reliability when reading axes and labels.
Explicit cleanup of temporary files - we don’t need to save the images once text is generated.
Error handling - if the model call fails, the app still returns usable alt text. We kept our fallback text simple for demonstration purposes, but you can attempt to add more detail.
External model initialisation - only created once and passed in, rather than re-created on every reactive update.

Examples

In this section we will create a few example plots and then see what the LLM generates.

simple_plot = ggplot(iris) +
 aes(Sepal.Width, Sepal.Length) +
 geom_point()
simple_plot

simple_plot_alt = generate_alt_text(simple_plot, gemini)
paste("Alt text generated by AI: ", simple_plot_alt)

Alt text generated by AI:

Scatter plot showing Sepal.Length on the y-axis (ranging from approximately 4.5 to 8.0) versus Sepal.Width on the x-axis (ranging from approximately 2.0 to 4.5). The data points appear to form two distinct clusters: one with Sepal.Width between 2.0 and 3.0 and Sepal.Length between 5.0 and 8.0, and another with Sepal.Width between 3.0 and 4.5 and Sepal.Length between 4.5 and 6.5.

plot = ggplot(iris) +
 aes(Sepal.Width, Sepal.Length, colour = Species) +
 geom_point()
plot

plot_alt =
 generate_alt_text(plot, gemini)
paste("Alt text generated by AI: ", plot_alt)

Alt text generated by AI:

Scatter plot showing Sepal.Length on the y-axis (range 4.5-8.0) versus Sepal.Width on the x-axis (range 2.0-4.5), with points colored by Species. Red points, labeled “setosa”, form a distinct cluster with higher Sepal.Width (3.0-4.5) and lower Sepal.Length (4.5-5.8). Blue points, “virginica”, tend to have higher Sepal.Length (5.5-8.0) and moderate Sepal.Width (2.5-3.8). Green points, “versicolor”, are in between, with moderate Sepal.Length (5.0-7.0) and Sepal.Width (2.0-3.5), overlapping with virginica.

complicated_plot = ggplot(iris) +
 aes(Sepal.Width, Sepal.Length, colour = Species) +
 geom_point() +
 geom_smooth(method = "lm")
complicated_plot

complicated_plot_alt =
 generate_alt_text(complicated_plot, gemini)
paste("Alt text generated by AI: ", complicated_plot_alt)

Alt text generated by AI:

Scatter plot showing Sepal.Length on the y-axis (range 4.0-8.0) versus Sepal.Width on the x-axis (range 2.0-4.5). Points and linear regression lines are colored by Iris species. Red points, “setosa”, cluster with lower Sepal.Length (4.0-5.8) and higher Sepal.Width (2.8-4.4). Green points, “versicolor”, and blue points, “virginica”, largely overlap, showing higher Sepal.Length (5.0-8.0) and moderate Sepal.Width (2.0-3.8), with “virginica” generally having the longest sepals. All three species exhibit a positive linear correlation, indicated by their respective regression lines and shaded confidence intervals, where increasing sepal width corresponds to increasing sepal length.

As we can see, the alt text can be very good and informative when using LLMs. One alternative that I want to point out is including a summary of the data behind the plot. This way screen reader users can still gain insight from the plot.

Using Dynamic Alt Text in Shiny

Once generated, the alt text can be supplied directly to the UI:

Via the alt argument of renderPlot()
Or injected into custom HTML for more complex layouts

Because the text is generated from the rendered plot, it stays in sync with user inputs and filters.
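As a minimal sketch of the first option, the app below passes a reactive to the alt argument, which in current Shiny releases is accepted by renderPlot(). To keep the example runnable without an API key, describe_plot() is a hypothetical stand-in; in a real app you would call generate_alt_text(current_plot(), gemini) in its place. The input and output names are also made up for illustration.

```r
library(shiny)
library(ggplot2)

# Hypothetical stand-in so the sketch runs without an API key.
# In a real app this would be generate_alt_text(plot, gemini).
describe_plot = function(plot, species) {
  paste("Scatter plot of sepal length against sepal width for", species)
}

ui = fluidPage(
  selectInput("species", "Species", choices = levels(iris$Species)),
  plotOutput("scatter")
)

server = function(input, output, session) {
  # Build the plot once per input change and reuse it for image and alt text
  current_plot = reactive({
    ggplot(iris[iris$Species == input$species, ]) +
      aes(Sepal.Width, Sepal.Length) +
      geom_point()
  })

  output$scatter = renderPlot(
    current_plot(),
    # A reactive here means the alt text updates along with the plot
    alt = reactive(describe_plot(current_plot(), input$species))
  )
}

# shinyApp(ui, server)  # uncomment to launch the app
```

Sharing a single reactive between the image and the alt text is one way to make sure both are built from the same plot object.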

Other Considerations

Some apps may be more complicated and/or have a high number of users. These types of apps will need a bit more consideration to include features like this:

Caching alt text for unchanged plots to reduce API usage
Prompt augmentation with known variable names or units
Manual overrides for critical visuals
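The caching idea can be sketched in base R. This is only an illustration: cached_alt_text() and the key format are hypothetical, and fake_generate() stands in for a call to generate_alt_text() so that we can count how often the "model" is actually queried.

```r
# Hypothetical cache, keyed on the inputs that determine the plot
alt_cache = new.env(parent = emptyenv())

cached_alt_text = function(key, generate) {
  # Only call the (expensive) generator when the key has not been seen
  if (!exists(key, envir = alt_cache, inherits = FALSE)) {
    assign(key, generate(), envir = alt_cache)
  }
  get(key, envir = alt_cache, inherits = FALSE)
}

# Stand-in for the LLM call that records how many times it runs
calls = 0
fake_generate = function() {
  calls <<- calls + 1
  "Scatter plot of sepal width against sepal length."
}

key = paste("species", "setosa", sep = "=")
first = cached_alt_text(key, fake_generate)
second = cached_alt_text(key, fake_generate)
calls
```

After two requests with the same key, calls is 1, so the second request is served from the cache. In a multi-user app you would probably want a shared cache with an expiry policy rather than a plain environment.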

Conclusion

AI-generated alt text works best as a supporting tool, not a replacement for accessibility review. I have also found it helpful to let users know that the alt text is AI generated so they know to take it with a pinch of salt.

Dynamic alt text is a small feature with a big impact on inclusion. By combining Shiny’s reactivity with consistent rendering, error handling, and modern LLMs, we can make interactive data apps more accessible by default whilst not increasing developer burden.


Why Submit to AI in Production: Speaking as a Tool for Better Work

Tue, 20 Jan 2026 23:59:00 +0000

We’re accepting abstracts for AI in Production until 23rd January. The conference takes place on 4th–5th June 2026 in Newcastle, with talks on Friday 5th across two streams: one focused on engineering and production systems, the other on machine learning and model development.

We often hear: “My work isn’t ready to talk about yet” or “I’m not sure anyone would be interested.” We want to address that hesitation directly.

Speaking at a conference isn’t primarily about promoting yourself or your organisation.

It’s a practical tool that helps you do better work. Preparing and delivering a talk forces useful reflection, invites feedback from people facing similar challenges, and turns knowledge that lives only in your head into something your team can reuse.

If you’re wondering whether your work qualifies: internal systems count, work in progress counts, partial success counts.

Submit your abstract by 23rd January on the AI in Production website.

Preparing a Talk Clarifies Your Decisions

When you sit down to explain a technical choice to an audience, you have to answer questions you might have glossed over at the time: Why did we build it this way? What constraints shaped our approach? What would we do differently now?

This isn’t about justifying your decisions to others. It’s about understanding them yourself. The process of turning a production system into a coherent narrative forces you to see patterns you were too close to notice while building it. You identify what worked, what didn’t, and why. That clarity is valuable whether or not you ever give the talk.

Many practitioners find that writing an abstract or outline reveals gaps in their thinking. A deployment strategy that seemed obvious in context becomes harder to explain without it. A monitoring approach that felt pragmatic reveals underlying assumptions. This friction is useful. It means you’re learning something about your own work.

Speaking Invites Useful Feedback

The audience at AI in Production will broadly fall across two streams: engineering (building, shipping, maintaining, and scaling systems) and machine learning (model development, evaluation, and applied ML).

Whether you’re working on infrastructure and deployment or on training pipelines and model behaviour, you’ll be in a room with people facing similar constraints: limited resources, shifting requirements, imperfect data, and operational pressures.

When you share what you’ve tried, you get feedback from people who understand the context. Someone has solved a similar problem differently. Someone has run into the same failure mode. Someone asks a question that makes you reconsider an assumption.

This kind of peer feedback is hard to get otherwise. Your team is too close to the work. Online discussions lack context. A conference talk puts your approach in front of people who can offer informed perspectives without having to understand your entire stack or organisational structure first.

In many teams, knowledge about production systems sits with one or two people. They know why certain decisions were made, where the edge cases are, and how to interpret the monitoring dashboards. That concentration of knowledge creates risk.

Preparing a talk is a forcing function for documentation. To explain your system to strangers, you have to articulate what’s currently tacit. That articulation becomes something your team can use: onboarding material, decision records, runbooks.

Speaking also distributes responsibility. When you present work publicly, it stops being just yours. Your team shares ownership of the ideas. Others can critique, extend, or maintain them. This is particularly valuable for platform teams or infrastructure work, where the people who built something may not be the ones operating it six months later.

Turning Tacit Knowledge into Reusable Material

Much of what you know about your production systems isn’t written down. You understand the failure modes, the workarounds, and the operational quirks. You know which metrics matter and which are noise. You remember why you made certain tradeoffs.

A conference talk is an excuse to capture that knowledge. The slides become a reference. The abstract becomes a design document. The Q&A reveals what wasn’t clear and needs better documentation.

Even if the talk itself is ephemeral, the process of preparing it leaves artefacts. You’ve already done the hard work of running the system. Speaking about it turns that experience into something others can learn from, and you can build on.

If you’re maintaining AI systems in production, you’re solving problems worth talking about: making models reliable under load, keeping training pipelines maintainable, monitoring behaviour when ground truth is delayed or absent, and managing technical debt while shipping features.

These are the problems practitioners face every day. Your approach won’t be perfect, and that’s the point. Talks about work in progress, about things that didn’t work, about compromises made under constraint are often more useful than polished success stories.

We’re looking for honest accounts of how people are actually building and operating AI systems. That might fit the engineering stream (deployment, infrastructure, monitoring, scaling) or the machine learning stream (training, evaluation, model behaviour, responsible data use). If you’re doing work in either area, you have something to contribute.

Submit an Abstract

The deadline is 23rd January. You’ll need a title and an abstract of up to 250 words. You don’t need a perfect story or a finished project. You need a problem you’ve worked on, some approaches you’ve tried, and some lessons you’ve learned.

Think about what would be useful for someone six months behind you on a similar path. Think about what you wish someone had told you before you started. Think about the conversation you’d want to have with peers who understand the constraints you’re working under.

If you’re not sure where to start, consider writing about one decision that shaped your system, one assumption that turned out to be wrong, or one constraint that changed your design. Good abstracts often start with a specific moment or choice rather than a broad overview.

Ready to submit? The deadline is 23rd January. Share one decision, one lesson, or one constraint from your production work:
https://jumpingrivers.com/ai-production/

If you have questions about whether your work fits the conference, reach out at events@jumpingrivers.com. We’re here to help make this easier.


Retrieval-Augmented Generation: Setting up a Knowledge Store in R

Thu, 08 Jan 2026 23:59:00 +0000

Happy New Year from the team at Jumping Rivers!

Now that we’re well into the second half of the 2020s, it’s a good time to reflect on the changes that we have seen so far in this decade. In the world of data science nothing has dominated headlines quite like the rapid growth and uptake of generative artificial intelligence (GenAI).

Large language models (LLMs) such as ChatGPT, Claude and Gemini have incredible potential to streamline day-to-day tasks, whether that’s processing vast amounts of information, providing a human-like chat interface for customers or generating code. But they also come with notable risks if not harnessed responsibly.

Anyone who has interacted with these models is likely to have come across hallucination, where the model confidently presents false information as though it is factually correct. This can happen for a variety of reasons:

LLMs often have no access to real-time information: how would a model that was trained last year know today’s date?
The training data may be missing domain-specific information: can we really trust an off-the-shelf model to have a good understanding of pharmaceuticals and medicinal drugs?
The model may be over-eager to come across as intelligent, so it decides to provide a confident output rather than a more nuanced, honest answer.

Often we need to give the model access to additional contextual information before we can make it “production-ready”. We can achieve this using a retrieval-augmented generation (RAG) workflow. In this blog post we will explore the steps involved and set up an example RAG workflow using free and open source packages in R.

What is RAG?

In a typical interaction with an LLM we have:

A user prompt: the text that is submitted by the user.
A response: the text that is returned by the LLM.
(optional) A system prompt: additional instructions for how the LLM should respond (for example, "You respond in approximately 10 words or less").

In a RAG workflow we provide access to an external knowledge store which can include text-based documents and webpages. Additional contextual info is then retrieved from the knowledge store (hence “retrieval”) and added to the user prompt before it is sent. In doing so we can expect to receive a higher quality output.

How does it work?

Before going further, we must first introduce the concept of vectorisation.

Contrary to what you might believe, LLMs do not understand non-numerical text! They are mathematical models, meaning they can only ingest and output numerical vectors.

So how can a user interact with a model using plain English? The trick is that mappings exist which are able to convert between numerical vectors and text. These mappings are called “vector embeddings” and are used to convert the user prompt into a vector representation before it is passed to the LLM.

So, when setting up our RAG knowledge store, we have to store the information using a compatible vector representation. With this in mind, let’s introduce a typical RAG workflow:

Content: we decide which documents to include in the knowledge store.
Extraction: we extract the text from these documents in Markdown format.
Chunking: the Markdown content is split into contextual “chunks” (for example, each section or subsection of a document might become a chunk).
Vectorisation: the chunks are “vectorised” (i.e. we convert them into a numerical vector representation).
Index: we create an index for our knowledge store which will be used to retrieve relevant chunks of information.
Retrieval: we register the knowledge store with our model interface. Now, when a user submits a prompt, it will be combined with relevant chunks of information before it is ingested by the model.

At the retrieval step, a matching algorithm is typically used so that only highly relevant chunks are retrieved from the knowledge store. In this way, we are able to keep the size of the user prompts (and any incurred costs) to a minimum.
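To make the vectorisation and retrieval steps more concrete, here is a toy illustration in base R. It is not how {ragnar} works internally: simple word counts stand in for real vector embeddings, cosine similarity plays the role of the matching algorithm, and the three chunks are made-up sentences.

```r
# Made-up chunks standing in for a knowledge store
chunks = c(
  "Use vectorised functions instead of loops where possible.",
  "Profile your code before optimising it.",
  "Store large datasets in an efficient binary format."
)

# Split text into lower-case words
tokenise = function(text) {
  words = strsplit(tolower(text), "[^a-z]+")[[1]]
  words[nzchar(words)]
}

# The vocabulary defines the dimensions of our toy vectors
vocab = sort(unique(unlist(lapply(chunks, tokenise))))

# "Embed" text as a vector of word counts over the vocabulary
vectorise = function(text) {
  as.numeric(table(factor(tokenise(text), levels = vocab)))
}

cosine = function(a, b) {
  denom = sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) 0 else sum(a * b) / denom
}

# Return the top_k chunks most similar to the prompt
retrieve = function(prompt, top_k = 1) {
  query = vectorise(prompt)
  scores = vapply(
    chunks,
    function(chunk) cosine(query, vectorise(chunk)),
    numeric(1)
  )
  chunks[order(scores, decreasing = TRUE)][seq_len(top_k)]
}

user_prompt = "Should I profile my code?"
context = retrieve(user_prompt)

# The retrieved chunk is added to the prompt before it is sent
augmented_prompt = paste0("Context:\n", context, "\n\nQuestion: ", user_prompt)
```

Only the chunk about profiling shares any words with the prompt, so it is the one retrieved. Real embeddings capture meaning rather than exact word overlap, but the retrieve-then-augment pattern is the same.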

Setting up a RAG workflow in R

We will make use of two packages which are available to install via the Comprehensive R Archive Network (CRAN). Both are actively maintained by Posit (formerly RStudio) and are free to install and use.

{ragnar}

The {ragnar} package provides functions for extracting information from both text-based documents and webpages, and provides vector embeddings that are compatible with popular LLM providers including OpenAI and Google.

We will use {ragnar} to build our knowledge store.

{ellmer}

The {ellmer} package allows us to interact with a variety of LLM APIs from R. A complete list of supported model providers can be found in the package documentation.

Note that, while {ellmer} is free to install and use, you will still need to set up an API token with your preferred model provider before you can interact with any models. We will use the free Google Gemini tier for our example workflow. See the Gemini API documentation for instructions on creating an API key, and the {ellmer} documentation for authenticating with your API key from R.

Example RAG workflow

We begin by loading the {ragnar} package.

library("ragnar")

The URL provided below links to the title page of the “Efficient R Programming” textbook, written by Robin Lovelace and our very own Colin Gillespie. We’re going to use a couple of chapters from the book to construct a RAG knowledge store.

url = "https://csgillespie.github.io/efficientR/"

Let’s use {ragnar} to read the contents of this page into a Markdown format.

md = read_as_markdown(url)

We could vectorise this information as it is, but first we should split it up into contextual chunks.

chunks = markdown_chunk(md)
chunks
#> # @document@origin: https://csgillespie.github.io/efficientR/
#> # A tibble: 2 × 4
#> start end context text 
#> * <int> <int> <chr> <chr> 
#> 1 1 1572 "" "# Efficient R programmin…
#> 2 597 2223 "# Welcome to Efficient R Programming" "## Authors\n\n[Colin Gil…

The chunks are stored in a tibble format, with one row per chunk. The text column stores the chunk text (in the interests of saving space we have only included the start of each chunk in the printed output above).

The title page has been split into two chunks and we can see that there is significant overlap (chunk 1 spans characters 1 to 1572 and chunk 2 spans characters 597 to 2223). Overlapping chunks are perfectly normal and provide added context as to where each chunk sits relative to the other chunks.
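To illustrate the idea of overlapping chunks, here is a minimal Python sketch that splits a string into fixed-width windows sharing some characters with their neighbours. The function name `chunk_with_overlap` and the fixed-width strategy are assumptions for this example; it does not attempt to reproduce how {ragnar} chooses its chunk boundaries.

```python
def chunk_with_overlap(text, size, overlap):
    """Split text into windows of `size` characters, each sharing
    `overlap` characters with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + size, len(text))
        # Report 1-based, inclusive positions, as in the tibble above
        chunks.append({"start": start + 1, "end": end, "text": text[start:end]})
        if end == len(text):
            break
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_with_overlap(text, size=10, overlap=4):
    print(chunk)
```

Each window starts before the previous one ends, so text near a boundary appears in both neighbouring chunks.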

Note that you can visually inspect the chunks by running ragnar_chunks_view(chunks).

It’s time to build our knowledge store with a vector embedding that is appropriate for Google Gemini models.

# Initialise a knowledge store with the Google Gemini embedding
store = ragnar_store_create(
 embed = embed_google_gemini()
)

# Insert the Markdown chunks
ragnar_store_insert(store, chunks)

The Markdown chunks are automatically converted into a vector representation at the insertion step. It is important to use the appropriate vector embedding when we create the store. A knowledge store created using an OpenAI embedding will not be compatible with Google Gemini models!

Before we can retrieve information from our store, we must create a store index.

ragnar_store_build_index(store)

We can now test the retrieval capabilities of our knowledge store using the ragnar_retrieve() function. For example, to retrieve any chunks relevant to the text Who are the authors of “Efficient R Programming”? we can run:

relevant_knowledge = ragnar_retrieve(
 store,
 text = "Who are the authors of \"Efficient R Programming\"?"
)
relevant_knowledge
#> # A tibble: 1 × 9
#> origin doc_id chunk_id start end cosine_distance bm25 context text 
#> <chr> <int> <list> <int> <int> <list> <lis> <chr> <chr>
#> 1 https://csgi… 1 <int> 1 2223 <dbl [2]> <dbl> "" "# E…

Note that the backslashes in \"Efficient R Programming\" escape the double quotes, so that they can appear inside a double-quoted character string.

Without going into too much detail, the cosine_distance and bm25 columns in the returned tibble provide information relating to the matching algorithm used to identify the chunks. The other columns relate to the location and content of the chunks.
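For readers curious about the two column names: cosine distance compares embedding vectors (smaller means closer), while BM25 is a keyword-based relevance score (larger means a better match). The sketch below implements the standard Okapi BM25 formula in plain Python. The function names, the sample documents and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions for illustration, and this is not a description of how {ragnar} computes its scores.

```python
import math
import re
from collections import Counter

def tokenise(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    docs = [tokenise(doc) for doc in documents]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        # Each distinct query term contributes once
        for term in set(tokenise(query)):
            if term not in counts:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = counts[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

documents = [
    "Efficient R Programming was written by Colin Gillespie and Robin Lovelace",
    "Parallel computing can speed up slow R code",
    "Memory profiling helps to find bottlenecks",
]
scores = bm25_scores("parallel computing in R", documents)
print(scores)
```

The second document shares the most query terms and so receives the highest score, while the third shares none and scores zero.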

From the output tibble we see that the full content of the title page (characters 1 to 2223) has been returned. This is because the original two chunks both contained information about the authors.

Let’s add a more technical chapter from the textbook to the knowledge store. The URL provided below links to Chapter 7 (“Efficient Optimisation”). Let’s add this to the knowledge store and rebuild the index.

url = "https://csgillespie.github.io/efficientR/performance.html"

# Extract Markdown content and split into chunks
chunks = url |>
 read_as_markdown() |>
 markdown_chunk()

# Add the chunks to the knowledge store
ragnar_store_insert(store, chunks)

# Rebuild the store index
ragnar_store_build_index(store)

Now that our knowledge store includes content from both the title page and Chapter 7, let’s ask something more technical, like What are some good practices for parallel computing in R?.

relevant_knowledge = ragnar_retrieve(
 store,
 text = "What are some good practices for parallel computing in R?"
)
relevant_knowledge
#> # A tibble: 4 × 9
#> origin doc_id chunk_id start end cosine_distance bm25 context text 
#> <chr> <int> <list> <int> <int> <list> <lis> <chr> <chr>
#> 1 https://csgi… 1 <int> 1 2223 <dbl [2]> <dbl> "" "# E…
#> 2 https://csgi… 2 <int> 1 1536 <dbl [1]> <dbl> "" "# 7…
#> 3 https://csgi… 2 <int> 22541 23995 <dbl [1]> <dbl> "# 7 E… "## …
#> 4 https://csgi… 2 <int> 23996 26449 <dbl [2]> <dbl> "# 7 E… "The…

Four chunks have been returned:

One chunk from the title page of the textbook.
One chunk from the start of Chapter 7.
Two chunks from Section 7.5 (“Parallel Computing”).

It makes sense that we have chunks from Section 7.5, which appears to be highly relevant to the question. By including the title page and the start of Chapter 7, the LLM will also have access to useful metadata in case the user wants to find out where the model is getting its information from.

Now that we have built and tested our retrieval tool, it’s time to connect it up to a Gemini interface using {ellmer}. The code below will create a chat object allowing us to send user prompts to Gemini.

chat = ellmer::chat_google_gemini(
 system_prompt = "You answer in approximately 10 words or less."
)

A system prompt has been included here to ensure a succinct response from the model API.

We can register this chat interface with our retrieval tool.

ragnar_register_tool_retrieve(chat, store)

To check if our RAG workflow has been set up correctly, let’s chat with the model.

chat$chat("What are some good practices for parallel computing in R?")
#> Use the `parallel` package, ensure you stop clusters with `stopCluster()` (or 
#> `on.exit()`), and utilize `parLapply()`, `parApply()`, or `parSapply()`.

The output looks plausible. Just to make sure, let’s check where the model found out this information.

chat$chat("Where did you get that answer from?")
#> I retrieved the information from "Efficient R programming" by Colin Gillespie 
#> and Robin Lovelace.

Success! The LLM has identified the name of the textbook and if we wanted to we could even ask about the specific chapter. A user interacting with our model interface could now search online for this textbook to fact-check the responses.

In the example workflow above, we manually selected a couple of chapters from the textbook to include in our knowledge store. It’s worth noting that you can also use the ragnar_find_links(url) function to retrieve a list of links from a given webpage.

Doing so for the title page will provide the links to all chapters.

ragnar_find_links("https://csgillespie.github.io/efficientR/")
#> [1] "https://csgillespie.github.io/efficientR/" 
#> [2] "https://csgillespie.github.io/efficientR/building-the-book-from-source.html"
#> [3] "https://csgillespie.github.io/efficientR/collaboration.html" 
#> [4] "https://csgillespie.github.io/efficientR/data-carpentry.html" 
#> [5] "https://csgillespie.github.io/efficientR/hardware.html" 
#> [6] "https://csgillespie.github.io/efficientR/index.html" 
#> [7] "https://csgillespie.github.io/efficientR/input-output.html" 
#> [8] "https://csgillespie.github.io/efficientR/introduction.html" 
#> [9] "https://csgillespie.github.io/efficientR/learning.html" 
#> [10] "https://csgillespie.github.io/efficientR/performance.html" 
#> [11] "https://csgillespie.github.io/efficientR/preface.html" 
#> [12] "https://csgillespie.github.io/efficientR/programming.html" 
#> [13] "https://csgillespie.github.io/efficientR/references.html" 
#> [14] "https://csgillespie.github.io/efficientR/set-up.html" 
#> [15] "https://csgillespie.github.io/efficientR/workflow.html"

You could then iterate through these links, extracting the contents from each webpage and inserting these into your RAG knowledge store. Just note, however, that including additional information in your store will likely increase the amount of text being sent to the model, which could raise costs. You should therefore think about what information is actually relevant for your LLM application.

Summary

In summary, we have introduced the concept of retrieval-augmented generation for LLM-powered workflows and built an example workflow in R using open source packages.

Before finishing, we are excited to announce that our new course “LLM-Driven Applications with R & Python” has just been added to our training portfolio. You can search for it here.

If you’re interested in practical AI-driven workflows, we would love to see you at our upcoming AI In Production 2026 conference which is running from 4-5 June in Newcastle-Upon-Tyne. If you would like to present a talk or workshop, please submit your abstracts before the deadline on 23 January.

For updates and revisions to this article, see the original post

Machine Learning Powered Naughty List: A Festive Jumping Rivers Story

Thu, 18 Dec 2025 23:59:00 +0000

Introduction

Ho ho ho! 🎅 The holiday season is here, and at Jumping Rivers, we’re decking the halls with data, not just tinsel. While elves are busy checking their lists twice, we thought: why not bring a little machine learning magic to Christmas? After all, what’s more festive than combining predictive modeling with candy canes, cookies, and a sprinkle of office mischief?

This blog is your all-access pass to a code-powered journey where we find out who’s been naughty, who’s nice, and who’s just mischievously hovering in between.

We’ll walk you through the process step by step: gathering the team data, inventing the most festive features, training our ML model, and revealing the results with a cheeky, holiday twist. So grab a mug of cocoa, put on your favorite Christmas socks, and let’s dive into the Jumping Rivers ML-Powered Naughty List adventure!

Note: All data, labels, and results in this post are entirely fictional and randomly generated for festive fun.

Step 1: Data Collection and Team Introduction

Our first step was gathering our dataset. We used the Jumping Rivers team as the participants, assigning playful, holiday-themed features to reflect their potential ‘naughty’ traits. Here’s a concise, festive overview in a side-by-side table format:

Each participant is assigned four playful features that represent holiday mischief:

Ate too many cookies 🍪
Forgot to send Christmas cards 💌
Sang off-key during carols 🎶
Gift wrapping disasters 🎁

Every name on this list is now in the running for the ultimate festive title: Naughty, Nice, or Mildly Mischievous. Rumor has it that Santa’s Intern Elf already claimed the top spot for cookie mischief, while Rudolph keeps dashboards squeaky clean, and Frosty the Snow Analyst is maintaining a perfectly balanced winter score.

Step 2: Feature Engineering

For ML purposes, names were encoded numerically. This is not meaningful in a real-world ML context but serves as a demonstration of preprocessing. The features for modeling include:

Name (encoded)
Ate too many cookies
Forgot to send Christmas cards
Sang off-key
Gift wrapping disasters

Step 3: Model Training

We chose a Random Forest classifier in R for its simplicity and interpretability. The model was trained on the dataset to predict the ‘naughty’ label based on the four behavioral features and the encoded name. Although the dataset is small and playful, this demonstrates a proper ML workflow: data collection, preprocessing, model training, prediction.

library(tidyverse)
library(randomForest)
library(ggplot2)

The first thing we need to do is set up a vector containing the team members, along with some Christmas temp workers: Santa’s Intern Elf, Rudolph the Data Reindeer and Frosty the Snow Analyst.

# Team members
team = c(
 "Esther Gillespie",
 "Colin Gillespie",
 "Sebastian Mellor",
 "Martin Smith",
 "Richard Brown",
 "Shane Halloran",
 "Mitchell Oliver",
 "Keith Newman",
 "Russ Hyde",
 "Gigi Kenneth",
 "Pedro Silva",
 "Carolyn Wilson",
 "Myles Mitchell",
 "Theo Roe",
 "Tim Brock",
 "Osheen MacOscar",
 "Emily Wales",
 "Amieroh Abrahams",
 "Deborah Washington",
 "Susan Smith",
 "Santa's Intern Elf",
 "Rudolph the Data Reindeer",
 "Frosty the Snow Analyst"
)

Now that we have the team members, we will randomly generate some values for the model features.

# Randomly generate playful 'naughty traits'
set.seed(51)
df = tibble(
 name = team,
 ate_too_many_cookies = sample(0:1, length(team), replace = TRUE),
 forgot_to_send_cards = sample(0:1, length(team), replace = TRUE),
 sang_off_key = sample(0:1, length(team), replace = TRUE),
 wrapping_disaster = sample(0:1, length(team), replace = TRUE),
 naughty = sample(0:1, length(team), replace = TRUE)
)


# Encode names as numeric
df$name_encoded = as.numeric(factor(df$name))
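
To see what this encoding does, here is a rough Python equivalent of as.numeric(factor(x)), using a hypothetical subset of the team vector. It is only a sketch: R sorts factor levels according to the locale, so the exact ordering may differ from Python's sorted().

```python
# Hypothetical subset of the team vector (with one repeated name)
team = ["Theo Roe", "Colin Gillespie", "Russ Hyde", "Colin Gillespie"]

# Levels: the sorted unique values
levels = sorted(set(team))

# Codes: 1-based position of each name within the levels
name_encoded = [levels.index(name) + 1 for name in team]

print(levels)        # ['Colin Gillespie', 'Russ Hyde', 'Theo Roe']
print(name_encoded)  # [3, 1, 2, 1]
```

The codes simply reflect alphabetical position, which is why the encoded name carries no real-world meaning as a model feature.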

Next on the list is to set up a vector of features we want to use, and then train the model. We can then use the model to predict our fictitious naughtiness score for each team member! We can see Theo is at the top of the list, closely followed by Osheen.

features = c(
 "name_encoded",
 "ate_too_many_cookies",
 "forgot_to_send_cards",
 "sang_off_key",
 "wrapping_disaster"
)


# Train Random Forest
rf_model = randomForest(x = df[, features],
 y = as.factor(df$naughty),
 ntree = 100)


# Predict naughtiness
df$predicted_naughty = predict(rf_model, df[, features])
df$naughtiness_score = predict(rf_model, df[, features],
 type = "prob")[, 2]


# Create the Naughty List
naughty_list = df %>%
 arrange(desc(naughtiness_score)) %>%
 select(name, naughtiness_score, predicted_naughty)

print(naughty_list)

## # A tibble: 23 × 3
## name naughtiness_score predicted_naughty
## <chr> <dbl> <fct>
## 1 Theo Roe 0.76 1
## 2 Osheen MacOscar 0.74 1
## 3 Myles Mitchell 0.72 1
## 4 Esther Gillespie 0.68 1
## 5 Deborah Washington 0.66 1
## 6 Tim Brock 0.59 1
## 7 Amieroh Abrahams 0.55 1
## 8 Santa's Intern Elf 0.48 0
## 9 Carolyn Wilson 0.38 0
## 10 Susan Smith 0.2 0
## # ℹ 13 more rows

The last thing to do is visualise our results with {ggplot2}:

# Fun bar plot
ggplot(naughty_list,
 aes(x = reorder(name, naughtiness_score),
 y = naughtiness_score,
 fill = as.factor(predicted_naughty))) +
 geom_col() +
 coord_flip() +
 scale_fill_manual(values = c("0" = "forestgreen",
 "1" = "darkred"),
 labels = c("Nice", "Naughty")) +
 labs(title = "🎅 Jumping Rivers ML-powered Naughty List 🎄",
 x = "Team Member",
 y = "Naughtiness Score",
 fill = "Status",
 alt = "Jumping Rivers Naughty List") +
 theme_minimal(base_family = "outfit")

Step 4: Analysis and Notes

After generating predictions, we can interpret the Naughty List. The highest naughtiness scores indicate which participants are most mischievous according to our playful model.

Observations from this analysis include:

Cookie Enthusiasts: Participants with multiple cookie infractions scored higher.
Gift Wrapping Chaos: Those whose presents looked like abstract art contributed to higher scores.
Musical Mishaps: Off-key carolers were highlighted as naughty.
Forgotten Cards: Small lapses in festive correspondence nudged some up the naughty rankings.

Special mentions:

Theo unsurprisingly tops the naughty list.
Santa’s Intern Elf performed well, staying mostly nice.
Shane had the best score and I’m sure Santa will be very nice to him this year!

This analysis provides both a technical demonstration of ML workflow and a fun story that engages readers during the festive season.

Step 5: Conclusion

This project demonstrates how machine learning can be used in creative ways outside of traditional business use cases. By combining features with a proper ML workflow, we created a light-hearted, festive story suitable for a blog, while also reinforcing good practices in data collection, preprocessing, modeling, and visualization.

Ultimately, the Jumping Rivers ML-Powered Naughty List is a celebration of data science, team culture, and holiday fun. Whether you’re naughty or nice, we hope this inspires creative applications of ML in festive contexts.


Make Your Shiny Apps Accessible to Everyone – Free Jumping Rivers Webinar!

Mon, 08 Dec 2025 23:59:00 +0000

Date & Time (BST): 11 December 2025, 13:00
Topic: Accessible Shiny: Designing for All Users

Are you ready to make your Shiny applications more inclusive, user-friendly, and professional? Join Jumping Rivers for our free monthly webinar series, designed for data professionals at all levels. In just 55 minutes, you’ll learn how to create Shiny apps that are accessible to all users, meet modern accessibility standards, and provide a seamless experience for everyone – all from the comfort of your own desk.

Why Attend?

Learn practical accessibility techniques to make your Shiny apps usable for everyone.
Enhance your professional skills and make your dashboards more inclusive and impactful.
Connect with a network of data scientists, analysts, and developers.
Learn flexibly online with no cost.

Unlock Exclusive Benefits

Attend 2 webinars → 20% off tickets to the AI in Production conference.
Attend more than 2 webinars → 20% off any of our high-quality public training courses.

Whether you’re looking to improve your Shiny skills or make your data applications accessible to all users, this webinar is your chance to level up your expertise and stand out in 2026.

Ready to Join?


Creating a Python Package with Poetry for Beginners Part 3

Thu, 04 Dec 2025 23:59:00 +0000

Intro

This is the third part of a blog series. In the previous posts we addressed: creating a package with Poetry, managing our development environment and adding a function in part one; and package documentation, testing and how to publish to PyPI in part two.

In those previous posts, I developed a function for summarising the successes (and failures) of the teams in a fantasy football league. That function makes various API calls which in theory could all be made in parallel to speed up the runtime.

In this blog I aim to parallelise the function get_season_league which I wrote in the first blog.

Starting Function

Here is the function written in part one:


import requests
import pandas as pd
import json

def get_season_league(league_id = "485842"):
    api_url = "https://fantasy.premierleague.com/api/"
    url = api_url + "leagues-classic/" + league_id + "/standings/"
    response = requests.get(url)
    data = json.loads(response.text)
    league = pd.DataFrame(data['standings']['results'])

    df = pd.DataFrame([])
    for index, row in league.iterrows():
        player_query = api_url + "entry/" + str(row['entry']) + "/history"
        player_response = requests.get(player_query)
        player_data = json.loads(player_response.text)
        player_df = pd.DataFrame({
            'name': row['player_name'],
            'team_name': row['entry_name'],
            'event': pd.json_normalize(
                player_data['current']
            )['event'],
            'points': pd.json_normalize(
                player_data['current']
            )['total_points']
        })
        df = pd.concat([df, player_df])
    return df

The logic is as follows:

Query API to get current league data
Loop over each member of the league
- Query API for individual player
- Return relevant data

As currently written, this behaves like any normal for loop: the current iteration must finish before the next one starts. Here, though, no API call depends on the result of the previous one, so there is no need to wait. In theory we could run all of the individual player queries at once and the function would be a lot faster.

Measuring function calls in Python

We can measure how long it takes to run a piece of Python code using the time module. For example, measuring my get_season_league function:

import time
from get_league import get_season_league
start_time = time.time()
x = get_season_league()
print("--- %s seconds ---" % (time.time() - start_time))

My function was taking ~3.5 seconds for the default league, which has 13 players and has had 11 game weeks so far. That is an average of 0.27 seconds per player (including the single original API call).

I also tested it on a larger league of 50 people, where it takes ~13 seconds but with more variance. This is a similar 0.26 seconds per player.
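
The measurements above used time.time(). As a variation, the sketch below wraps the same idea in a small context manager built on time.perf_counter(), a clock intended for measuring short durations. The names `timer` and `slow_function` are hypothetical, with `slow_function` standing in for get_season_league so that the example runs without network access.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Print how long the enclosed block took to run."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"--- {label}: {elapsed:.3f} seconds ---")

def slow_function():
    # Hypothetical stand-in for get_season_league()
    time.sleep(0.1)
    return "done"

with timer("slow_function"):
    result = slow_function()
```

The context manager keeps the timing code out of the way, which is handy when comparing several versions of a function.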

This is why I want to parallelise the function: if the non-dependent API calls could be made all at once, or at least several at a time, the function could be sped up massively. For example, taking the league of 50 at 0.26 seconds per player, running two calls at a time could bring the runtime down to ~6.5 seconds, and four at a time to ~3.25 seconds. These values are approximate, but hopefully you can see the value of splitting up the parallelisable parts of the workload.
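
The back-of-the-envelope estimate above can be written as a one-line formula. This is an idealised sketch: the helper name `estimated_runtime` is hypothetical, and it assumes the calls split evenly across workers with no overhead, which real network requests will not quite achieve.

```python
def estimated_runtime(n_players, seconds_per_player, workers):
    """Idealised runtime if the calls are spread evenly over the workers."""
    return n_players * seconds_per_player / workers

for workers in (1, 2, 4):
    estimate = estimated_runtime(
        n_players=50, seconds_per_player=0.26, workers=workers
    )
    print(f"{workers} worker(s): ~{estimate:.2f} seconds")
```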

Optimising the Function

Before starting on the asynchronous side there are a few things we can address first.

`iterrows()` Alternative

The iterrows() function is pretty inefficient for this use case (and in general). This blog explains why, and covers better alternatives like itertuples. However, I am just going to loop over a zip of the values I need.

# Old:
for index, row in league.iterrows():
    player_id = row['entry']
    player_name = row['player_name']
    team_name = row['entry_name']

# New:
for player_id, player_name, team_name in zip(
    league['entry'],
    league['player_name'],
    league['entry_name']
):
    ...  # the loop body uses player_id, player_name and team_name directly
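
For comparison, the itertuples alternative mentioned above might look like the sketch below. The small league table is made up for illustration; only the column names match the API result used in this post.

```python
import pandas as pd

# Hypothetical league table with the same column names as the API result
league = pd.DataFrame({
    "entry": [101, 102],
    "player_name": ["Ada", "Grace"],
    "entry_name": ["Team A", "Team G"],
})

rows = []
for row in league.itertuples(index=False):
    # Each row is a namedtuple, so columns are available as attributes
    rows.append((row.entry, row.player_name, row.entry_name))

print(rows)
```

Both zip and itertuples avoid building a pandas Series for every row, which is the main cost of iterrows.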

Concatenating DataFrames

Another improvement is to stop concatenating DataFrames inside the for loop, and instead either concatenate once at the end or build a list of dictionaries and convert it to a DataFrame at the end.

The reason for this is the way Pandas handles DataFrame memory allocation: each pd.concat call copies the accumulated data into a new DataFrame. There is more detail in this Saturn Cloud blog.

# Old:
    df = pd.DataFrame([])
    for index, row in league.iterrows():
        player_query = api_url + "entry/" + str(row['entry']) + "/history"
        player_response = requests.get(player_query)
        player_data = json.loads(player_response.text)
        player_df = pd.DataFrame({
            'name': row['player_name'],
            'team_name': row['entry_name'],
            'event': pd.json_normalize(
                player_data['current']
            )['event'],
            'points': pd.json_normalize(
                player_data['current']
            )['total_points']
        })
        df = pd.concat([df, player_df])
    return df

# New:
    list_to_df = []

    for player_id, player_name, team_name in zip(
        league["entry"], league["player_name"], league["entry_name"]
    ):
        player_query = api_url + "entry/" + str(player_id) + "/history"
        player_response = requests.get(player_query)
        player_data = json.loads(player_response.text)
        player_df = pd.DataFrame({
            'name': player_name,
            'team_name': team_name,
            'event': pd.json_normalize(
                player_data['current']
            )['event'],
            'points': pd.json_normalize(
                player_data['current']
            )['total_points']
        })
        list_to_df.append(player_df)

    df = pd.concat(list_to_df, ignore_index=True)
    return df

These changes do seem to have sped up the function by a few seconds (for the league of 50), but the bulk of the time is taken by the API queries. These best practices therefore aren’t going to speed it up dramatically, but they are worth implementing nevertheless.
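
The other option mentioned above, building a list of dictionaries and converting it once, could be sketched as follows. The `payloads` data is invented for this example and only mimics the shape of player_data['current']; no API calls are made.

```python
import pandas as pd

# Hypothetical per-player payloads, shaped like player_data['current']
payloads = {
    "Ada": [{"event": 1, "total_points": 60}, {"event": 2, "total_points": 115}],
    "Grace": [{"event": 1, "total_points": 72}, {"event": 2, "total_points": 130}],
}

records = []
for name, weeks in payloads.items():
    for week in weeks:
        # One plain dictionary per row; no DataFrame is built inside the loop
        records.append({
            "name": name,
            "event": week["event"],
            "points": week["total_points"],
        })

# Build the DataFrame once, after the loop has finished
df = pd.DataFrame(records)
print(df)
```

Appending to a Python list is cheap, so the only DataFrame construction cost is paid once at the end.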

Asynchronising the Code

Before I start on this section I will give a brief background on asynchronous programming but if you want more detail please read this blog.

There are two main routes we can go down here:

concurrent.futures.ThreadPoolExecutor will use multiple threads, so each API call is still technically synchronous; the calls just run at the same time in different threads. This will be easier to implement with the current code, however the time gains wouldn’t scale as well as the alternative. Each thread also carries some memory and scheduling overhead, so this approach uses more resources than a single-threaded one.
asyncio will use single-threaded multitasking: truly asynchronous code. The syntax is more complex and doesn’t integrate very well with my current function; for example, I would need to replace requests with aiohttp. This would definitely be the better option if I was making lots of API calls, but on a smaller scale the gains wouldn’t be as significant.
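
To give a flavour of the asyncio route, here is a minimal sketch that simulates the API calls with asyncio.sleep rather than real aiohttp requests. The names `fetch_player` and `fetch_all` and the returned data are hypothetical; the point is only to show the shape of the code.

```python
import asyncio
import time

async def fetch_player(player_id):
    """Hypothetical stand-in for an aiohttp request: waits, then returns data."""
    await asyncio.sleep(0.1)  # simulates network latency without blocking
    return {"entry": player_id, "points": player_id * 10}

async def fetch_all(player_ids):
    # Schedule every request at once; gather returns results in input order
    return await asyncio.gather(*(fetch_player(pid) for pid in player_ids))

start = time.perf_counter()
results = asyncio.run(fetch_all(range(1, 6)))
elapsed = time.perf_counter() - start

print(results)
print(f"5 simulated calls took {elapsed:.2f} seconds")
```

All five simulated calls wait concurrently on a single thread, so the total time is close to that of one call rather than five.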

concurrent.futures.ThreadPoolExecutor

For this blog I will be going with concurrent.futures.ThreadPoolExecutor as it integrates nicely with my existing code and the bigger gains from asyncio won’t really suit my use case.

The first thing I need to do (which could’ve been done earlier) is extract the per-player logic to a separate function. This function will take a player’s details, then use the player ID to query the API and grab the player’s season data. It will then return it as a DataFrame.

def get_player_data(player_info, api_url):
    """Fetch data for a single player and return as DataFrame"""
    player_id = player_info['entry']
    player_name = player_info['player_name']
    team_name = player_info['entry_name']

    player_query = api_url + "entry/" + str(player_id) + "/history"
    player_response = requests.get(player_query)
    player_data = json.loads(player_response.text)

    # Create DataFrame for this player
    player_df = pd.DataFrame({
        'name': player_name,
        'team_name': team_name,
        'event': pd.json_normalize(player_data['current'])['event'],
        'points': pd.json_normalize(player_data['current'])['total_points']
    })

    return player_df

I will also need to adapt how I iterate over the player data. I know I’ve already switched from iterrows to a for loop over a zip of the relevant data, but the new function will use a different method of iteration. So I am creating a list of ‘records’ dictionaries (one per player) of the relevant data, which I can then pass directly to my new get_player_data function.

players = league[['entry', 'player_name', 'entry_name']].to_dict('records')

Next comes the ThreadPoolExecutor, which is what allows us to run multiple API calls at once. It lets us create other Python threads (workers) and send code to them. I will first initialise an empty list to write my player DataFrames to. Then I’ll use ThreadPoolExecutor(max_workers=10) to create up to 10 workers that we can send code to (I am using 10 as an example; this will be an argument the user can change in the final function). executor is the object used to send code to the new workers: I can use executor.map to map get_player_data over the players list and save the output to our initialised list.

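from concurrent.futures import ThreadPoolExecutor

def get_season_league(league_id = "485842"):
    # ...
    player_dfs = []

    with ThreadPoolExecutor(max_workers=10) as executor:
        # executor.map passes one argument per call, so get_player_data
        # must be able to see api_url itself (see the final version below)
        results = executor.map(get_player_data, players)
        player_dfs = list(results)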

Finally, we apply the change mentioned above and use a single pd.concat, so it runs once rather than n times.

df = pd.concat(player_dfs, ignore_index=True)
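
To see the effect of executor.map without hitting the real API, the sketch below swaps the request for a hypothetical `fake_api_call` that simply sleeps for 0.1 seconds. The function name, the sleep duration and the worker count are all assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_api_call(player_id):
    """Hypothetical stand-in for requests.get(): sleeps, then returns data."""
    time.sleep(0.1)
    return {"entry": player_id, "points": player_id * 10}

player_ids = list(range(1, 11))

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    # executor.map yields results in the same order as the inputs
    results = list(executor.map(fake_api_call, player_ids))
elapsed = time.perf_counter() - start

print([r["entry"] for r in results])
print(f"10 simulated calls with 5 workers: {elapsed:.2f} seconds")
```

Because executor.map preserves input order, the resulting DataFrames can be concatenated without any extra bookkeeping.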

So our final functions will look like this, with get_player_data defined inside get_season_league so the api_url is available:

def get_season_league(league_id="485842", max_workers=10):
    api_url = "https://fantasy.premierleague.com/api/"

    url = api_url + "leagues-classic/" + league_id + "/standings/"
    response = requests.get(url)
    data = json.loads(response.text)
    league = pd.DataFrame(data['standings']['results'])

    def get_player_data(player_info):
        """Fetch data for a single player and return as DataFrame"""
        player_id = player_info['entry']
        player_name = player_info['player_name']
        team_name = player_info['entry_name']

        player_query = api_url + "entry/" + str(player_id) + "/history"
        player_response = requests.get(player_query)
        player_data = json.loads(player_response.text)

        # Create DataFrame for this player
        player_df = pd.DataFrame({
            'name': player_name,
            'team_name': team_name,
            'event': pd.json_normalize(player_data['current'])['event'],
            'points': pd.json_normalize(player_data['current'])['total_points']
        })

        return player_df

    players = league[['entry', 'player_name', 'entry_name']].to_dict('records')

    player_dfs = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = executor.map(get_player_data, players)
        player_dfs = list(results)

    df = pd.concat(player_dfs, ignore_index=True)

    return df

When I run the function on the league of 50, it now takes ~1.5 seconds rather than the original ~13 seconds.

Summary

So we’ve optimised the function to a good degree using a few adjustments to the original function, then using multiple threads to run API calls at the same time. There are still some things left on the table, like using asyncio instead or even executor.submit() to have more control over the individual player queries (handling errors etc). So perhaps in a future blog we will look at speeding the function up a little bit more.
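
As a taste of the executor.submit() idea, here is a minimal sketch of per-call error handling. The `fake_api_call` function and its deliberate failure for player 3 are made up for this example; in the real function the exception type would depend on what requests raises.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_api_call(player_id):
    """Hypothetical stand-in for a player query that sometimes fails."""
    if player_id == 3:
        raise ConnectionError(f"request for player {player_id} failed")
    return {"entry": player_id, "points": player_id * 10}

player_ids = [1, 2, 3, 4]
succeeded, failed = [], []

with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns one Future per call, so each can be checked separately
    futures = {executor.submit(fake_api_call, pid): pid for pid in player_ids}
    for future in as_completed(futures):
        pid = futures[future]
        try:
            succeeded.append(future.result())
        except ConnectionError as error:
            failed.append((pid, str(error)))

print(sorted(r["entry"] for r in succeeded))
print(failed)
```

Unlike executor.map, results arrive in completion order here, so one failed query does not stop the rest from being collected.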


Beginner’s Guide to Submitting Conference Abstracts

Tue, 02 Dec 2025 23:59:00 +0000

Submitting a conference abstract can feel intimidating, especially if it is your first time. Most people worry about whether their topic is good enough, whether their experience is “senior enough”, or if they are even writing the abstract the “right” way.

The truth is that most conferences want a wide range of voices. Organisers want speakers who can explain something clearly, not speakers with the fanciest job titles. This guide will walk you through:

what an abstract is
how to write one
what reviewers look for
where you can submit your first talk

If you are looking for a place to start, we are accepting submissions for AI in Production 2026 until 23 January. More details are at the end of this post.

What is a conference abstract?

An abstract is a short summary of what you want to talk about. It tells reviewers:

what the topic is
why it matters
what the audience will learn
how you plan to deliver it

It does not need to be perfect prose. It just needs to be clear.

You are qualified to speak (yes, you)

You do not have to be the world’s leading expert on something to speak about it. Some of the best talks come from people explaining what they learned while building, fixing, or reviewing a system.

Choose something you understand well enough to explain without jargon. For example:

a project you worked on
a problem your team solved
a lesson you learned along the way
a method, tool, or approach you wish you had known sooner

If you can explain “why we did it” and “what we discovered”, you have a potential talk.

Conferences welcome new speakers. You only need:

something useful to explain
a clear abstract
willingness to share your experience

If you have never spoken before, say so. Reviewers appreciate honesty and fresh perspectives.

How to write your abstract

Most conferences ask for around 200 to 250 words. Some ask for even less. Here is a simple structure that works.

Set the context
One sentence that explains the setting or problem.
Explain what you did
Was it a system you built, a model you deployed, or an analysis you improved?
Highlight what the audience will learn
Reviewers want to know what people will take away.
Keep the language clear
Avoid buzzwords and complicated claims. Good abstracts are straightforward.

A short example

Our team needed a way to monitor model drift across multiple deployments. I will share the steps we took, the checks we added, and the mistakes we made on the way. Attendees will leave with practical checks they can add to their own model monitoring process.

You can adapt this pattern for your own work:

one sentence for the problem
one or two sentences for what you did
one sentence for what people will learn

Practical tips

Choose your format

Most conferences offer at least two formats:

Lightning talks (around 5 to 6 minutes)
Good for one focused idea, a small tool, or a single lesson.
Standard talks (around 20 to 25 minutes)
Better for a full story that includes context, process, and outcomes.

If you are unsure which to pick, choose the standard slot. Reviewers often adjust formats based on the strength of the topic.

Show who benefits

At the end of your abstract, add a simple sentence such as:

“This talk is suited for engineers working with deployment and monitoring.”
“This talk will help data scientists who want a clearer approach to evaluation.”

This makes it easier for reviewers to place your talk in the programme and helps attendees decide whether it is relevant for them.

What reviewers look for

Reviewers often focus on three questions.

Is the topic clear?

Can they understand what you are talking about without insider knowledge of your company or project?

Avoid internal code names or acronyms only your team uses.

Will the audience learn something useful?

Strong abstracts make it obvious what attendees will take away. They often include:

concrete examples
specific techniques or tools
clear lessons learned

Does it fit the conference?

Show how your talk connects to the audience and themes. One or two sentences are enough:

This talk will be useful for people who deploy models into production and need simple ways to spot drift before it causes problems.

Good abstracts are not about impressive credentials or perfect writing. They are about clarity and usefulness.

Submit to AI in Production 2026

Whether it is your first talk or your tenth, we would be happy to read your abstract for AI in Production 2026.

AI in Production focuses on practical work in two areas.

Engineering
Building, shipping, maintaining, and scaling AI systems and data pipelines.

Machine Learning
Model development, evaluation, responsible use of data, and lessons from real projects.

The conference takes place at The Catalyst in Newcastle city centre, with:

Workshops: 4 June 2026
Talks: 5 June 2026

Key dates

9 January: Super early bird registration deadline
23 January: Abstract submission deadline
6 March: Early bird registration deadline

Ready to share your work? Submit your abstract or register for tickets.
We welcome speakers at all levels.

Start 2026 Ahead of the Curve: Boost Your Career with Jumping Rivers Training

Thu, 27 Nov 2025 23:59:00 +0000

Ready to make 2026 the year you take your skills to the next level? Our 2026 online training courses are now live, designed to help you stay ahead of the curve, become more hirable, and gain practical skills that make a real impact.

January 2026 Courses

Date	Course	Format	Duration
12th January 2026	Introduction to R	Online	6 hours (3.5 hours Day 1, 3.5 hours Day 2)
19th January 2026	Introduction to Bayesian Inference using RStan	Online	12 hours (6 hours Day 1, 6 hours Day 2)
26th January 2026	Data Wrangling in the Tidyverse	Online	6 hours (3.5 hours Day 1, 3.5 hours Day 2)

Why Attend Jumping Rivers Training?

Hands-on, practical training: Learn with real-world datasets you can use immediately.
Expert instructors: Our trainers make complex concepts simple and actionable.
Comprehensive resources: Course materials, exercises, and ongoing support included.
Certification: Receive a Jumping Rivers certificate on completion, demonstrating your achievement to employers.
Flexible online format: Courses run over two days, 3.5 hours each day, to fit around your schedule.

Additional Perks

We also run free webinars at Jumping Rivers. By attending, you can:

Get early exposure to new topics in data science and analytics
Receive up to 20% discount on training courses
Enjoy up to 20% off Jumping Rivers conferences

Don’t wait—start 2026 by investing in yourself and your career. Book your course today: Jumping Rivers Training

Should I Use Figma Design for Dashboard Prototyping?

Thu, 20 Nov 2025 23:59:00 +0000

Heard of Figma but not sure what it is? Seen Figma but not sure if it’s worth learning? Never seen or heard of Figma? If the answer to any of these questions is “Yes” then this blog post is for you.

What are Figma and Figma Design?

This is a simple question with a somewhat complex answer, not least because there are multiple products falling under the Figma umbrella, made by developers at the company Figma, Inc (often shortened to Figma). At the time of writing, these products are listed on Wikipedia as:

Figma
FigJam
Figma Slides
Figma Sites
Figma Make
Figma Buzz
Figma Draw

You’ll see the first of these products is listed simply as “Figma”. This is the original Figma product that’s been around since the mid-2010s. (By contrast, the last four products listed were all launched in 2025.) However, because of the existence of these other Figma products, Figma, Inc has now started to refer to the original product as “Figma Design” (or in some places just “Design”). I think this naming is slowly being adopted in general, but you will still find plenty of references to Figma that mean what Figma, Inc now calls Figma Design.

Screenshot of new-file options for a logged-in user at figma.com

So What is Figma Design?

According to Figma, Inc:

Figma Design is for people to create, share, and test designs for websites, mobile apps, and other digital products and experiences. It is a popular tool for designers, product managers, writers and developers and helps anyone involved in the design process contribute, give feedback, and make better decisions, faster.

I’d simplify that to:

Figma Design is cloud-based collaborative software that allows users to create wireframes, high-fidelity mock-ups and working prototypes of websites and mobile applications.

This doesn’t cover everything Figma Design can do or be used for, as I’ll come on to, but I think it covers the main reasons you’d choose to learn Figma Design over other design software.

What Can I Use Figma Design For?

As implied in the previous section, the core offering of Figma Design (in my view, at least) is the ability to quickly make wireframes, high-fidelity designs and interactive prototypes. These can be really helpful when building a complex dashboard.

Example of a wireframe of the top of the Jumping Rivers home page, built with Figma Design.

Screen recording of an interactive prototype built with Figma Design. (The first click at the start of the video is just to move focus to the prototype window. Subsequent clicks are interactions within the prototype.)

I’ve used Figma Design for a number of other things, including:

simple vector art
flow diagrams
annotating screenshots
promotional literature intended for print
very basic image editing

Figma Design is not the best tool available for any of these tasks. But if it is available to you, you know how to use it and it does the job to a satisfactory level, then it could be the most convenient tool you have at your disposal.

Example of a (joke) flow chart, built with Figma Design.

Is Figma Design Free?

Like a lot of (most?) cloud-based software tools, Figma Design (and the rest of the Figma products) is freemium software. What is and isn’t available on the free tier is liable to change so everything that follows in this section should be assumed to be caveated with “at the time of writing”.

While you can certainly use Figma Design for free - and, I think, learn how to use most of its tools - the answer to whether you can use it as desired without paying is, unsurprisingly, “it depends”. If you’re part of a team, the free tier strictly limits the number of collaborative files that can be created and your ability to create shared libraries. If you’re working independently these things may not be much of an issue, but you won’t have access to some other features available in paid tiers like Dev Mode and video imports.

What Tools Does Figma Design Give Me?

The things you’ll use most in Figma Design, alongside the ubiquitous Move tool, are almost certainly the Frame tool and the Text tool. These may not sound very exciting but you can get a long way using only these. Much of their power comes from the ability to finely customise the look of frames (essentially containers for stuff) and text and to build complex layouts by combining and nesting items you’ve created. Frames can also be filled with images, so while there is a separate Image/video tool, you don’t actually need to use it to create your high-fidelity mockups. This is illustrated below, where the top of the Jumping Rivers home page has been recreated using only the Frame and Text tools.

High-fidelity mockup of the top of the Jumping Rivers home page. The only tools from the Figma Design toolbar used to create this were the Frame and Text tools (plus the Move tool).

There are various other tools available associated with vector drawing - Line, Rectangle, Ellipse, Pen - as well as sectioning and commenting.

Screenshot of Figma Design's toolbar with the vector-drawing submenu open.

Conceptual Tools

Alongside the literal (in a digital sense) tools described above, Figma Design gives you the tools (in a broad, conceptual sense) to perform a number of useful tasks.

The most significant of these conceptual tools is the ability to create interactive prototypes. In brief, you can select an item in your design, connect it to another item and then define one or more interactions. This is simple in principle and fairly simple in practice to start with. For complex designs with many interactions I find it quickly becomes quite messy: Figma, Inc calls the visual depictions of connections you create between elements “noodles” and I find this apt as it’s quite easy to end up with a sort of noodle soup that’s hard to decipher. Nevertheless, the tools are there and, for simple designs, it’s quick to set up and then run a working prototype.

Screenshot of a simple four-tab dashboard design. The curved arrows ("noodles") show interactions the user can do: e.g. click on one of the "Flight Delay" buttons to go to the second (top-right) view. Even for this fairly simple prototype, the interlocking noodle pattern can be quite hard to decipher.

Paying users can use the tools made available in Dev Mode. Because it’s not part of the free tier I won’t go into details here, but in brief it’s a suite of tools that should make it easier to convert design files into code.

It’s also easy to export arbitrary parts of a design as JPEG, PNG, SVG or PDF. There’s no native app support for WebP or AVIF export yet, but there are community plugins that offer these.

So, Should I Design My Dashboard with Figma Design?

That is, of course, up to you. If your dashboard is fairly simple and you’re working on your own, it may be easier to just go straight out and build version 1 of your dashboard with your favourite dashboard-building tool. If you’re proficient with a library like Shiny or Dash this can be pretty quick. However, if you’re part of a team building a complex app, Figma Design may make the initial stages of development easier. And, if you want to user-test with simple interactive prototypes then it’s definitely an option worth considering.

Announcing AI in Production 2026: A New Conference for AI and ML Practitioners

Wed, 19 Nov 2025 23:59:00 +0000

Registration is now open for our first AI in Production conference, taking place on 4 and 5 June 2026 in Newcastle Upon Tyne.

AI in Production is for people who want to see how AI works in day-to-day environments. The event brings together data scientists, engineers, analysts, researchers, and anyone who wants to learn from real projects rather than theory.

What to expect

The programme is split into two streams so you can follow what is most relevant to your work.

Engineering Stream
Covers deployment, monitoring, scaling, infrastructure, and what it takes to keep AI systems running.

Machine Learning Stream
Covers model development, evaluation, responsible use of data, and lessons from applied ML work across different industries.

Across both days you will hear open discussions about what teams tried, what worked, what failed, and what they learned along the way.

Workshops on Thursday 4 June

The conference opens with a day of hands-on workshops delivered by the Jumping Rivers team. These sessions guide you through practical tasks and give you time to ask questions as you go.

All tickets include entry to a relaxed drinks reception from 17:00 to 19:30.

Conference day on Friday 5 June

Talks begin at 09:30 and continue until around 16:15. You can move between the two streams or stay with one focus for the day.

Call for speakers

If you would like to speak at AI in Production 2026, we would love to hear from you!

We welcome both new and experienced speakers. You’ll need to submit:

A talk title
A short abstract (maximum 250 characters)
Your preferred talk format
- Lightning talk (around 6 minutes)
- Standard talk (around 25 minutes)
Whether you are happy for your talk to be recorded
A link to a page that represents you
(personal site, LinkedIn, GitHub or GitLab, Twitter, Mastodon etc.)

The submission deadline is 23 January 2026. Submit your abstract.

Key dates

9 January: Super early bird deadline
23 January: Abstract submission deadline
6 March: Early bird deadline
28 May: General registration deadline
4 June: Conference begins

Speakers

We are also excited to share our first confirmed speakers.

Mac Misiura, Red Hat
George Stagg, Posit Software

More speakers will be announced soon. If you’d like to be one of them, you can submit your abstract today.

Tickets

You can choose a ticket for the conference only or a combined ticket that includes one workshop. Learn more and register for the conference.

Planning your visit

The Catalyst is a short walk from Newcastle Central Station, with regular trains from Edinburgh and London. Newcastle International Airport is around thirty minutes away by Metro.

Sponsorship

If your organisation would like to support the conference, email events@jumpingrivers.com.

We look forward to welcoming you to Newcastle for two days of focused sessions, open conversations, and practical insight into running AI systems in real settings!

Elevate Your Skills and Boost Your Career – Free Jumping Rivers Webinar on 20th November!

Mon, 17 Nov 2025 23:59:00 +0000

Are you ready to stay ahead in the fast-evolving world of data? Join Jumping Rivers for our free monthly webinar series designed for data professionals at all levels. In just 55 minutes, you’ll gain practical insights, sharpen your skills, and tackle real-world challenges in R, Python, Shiny, and Posit – all from the comfort of your own desk.

Upcoming Webinar - Machine Learning with Python

Date & Time (GMT): 20 November, 13:05

Why Attend?

Gain hands-on experience with the latest tools and best practices.
Make yourself more hireable by boosting your data science skills ahead of 2026.
Connect with a network of fellow data scientists, engineers, and experts.
Learn flexibly online with no cost or commitment.
Unlock exclusive discounts:
- Attend 2 sessions → 20% off AI in Production conference tickets.
- Attend more than 2 sessions → 20% off any of our high-quality public training courses.

Whether you want to improve your coding or explore machine learning, this webinar is your chance to stay ahead of the curve and grow your career.

Ready to Join?

Get Involved in the Data Science Community at our Free Meetups

Thu, 13 Nov 2025 23:59:00 +0000

As a data science consultancy, Jumping Rivers are already known for offering help and training to clients in all things data. But did you know that we also organise free, in-person data science meetups?

In this post we will talk through the typical format and topics at our meetups, along with some details for how you can get involved!

Where to find us?

We organise meetups in Newcastle-Upon-Tyne and Leeds, both of which are advertised on meetup.com:

The North East Data Science meetups (NEDS for short) run every three months in Newcastle-Upon-Tyne.
The Leeds Data Science meetups (LeeDS for short) run every two months in (you guessed it) Leeds.

Check out the webpages linked above to find out more about these meetups including upcoming and past events.

Later this month we will be hosting:

20 November NEDS meetup, featuring a one hour workshop on programming with large language models (LLMs) in R & Python (delivered by our very own Myles).
25 November LeeDS meetup, featuring talks on LLM coding tools and explainable LLMs.

Meetup format

All of our meetups are run between 6pm and 8pm. The first half hour typically involves casual networking while enjoying some pizza and soft drinks.

We then have one or two talks from local data science experts. Our previous speakers have come from a wide range of industries including consultancy, government, banking and utilities. Typical talk topics include LLMs, communication in data science, forecasting demand in public health, code review best practices, and setting up machine learning pipelines on platforms such as Databricks and AWS.

The meetup host will also provide announcements about internships, job opportunities and data science events taking place locally. Between the announcements, networking and talks, our meetups are a great place to make friends and connections within the data science community, whether you’re a student looking to get into data science or a seasoned professional.

At some NEDS meetups we also run a “pre-event workshop”, where attendees get a hands-on introduction to a data science topic. Previous workshops have delved into machine learning with Python, machine learning operations (MLOps) and statistical modelling with R. The pre-event workshops run from 5pm to 6pm, but do check if there is a pre-event workshop in the schedule so that you don’t accidentally arrive an hour early!

How to get involved

To sign up to our mailing lists, please join the North East Data Scientists and Leeds Data Science Meetup communities. A meetup.com account is free to set up, and you will have access to lots of great local meetups (not just data science). You will then be notified about upcoming meetups that are taking place from communities that you are a member of.

Although our events are free to attend, we still require you to register in advance via meetup.com so that we have an idea of numbers when planning the room setup and catering.

We are always on the lookout for speakers and workshop organisers! If you would like to volunteer yourself for a talk or workshop, or have any announcements to share with the community about job opportunities and events, please reach out to the following addresses:

neds@jumpingrivers.com for the NEDS organising team.
lds@jumpingrivers.com for the LeeDS organising team.

We can’t wait to hear from you!

Contributing to the data science community

Hosting meetups is just one of our ways of contributing to the data science community.

Over the past few years we have also been organising an annual Shiny In Production conference. This typically involves a half day of workshops followed by a full day of talks from prominent speakers on all things Shiny and web dashboards. Check out our recent Shiny In Production 2025 highlights blog to find out about our latest conference that ran in October.

Next year will be particularly exciting, as we organise our first ever AI In Production conference (4-5 June 2026). This will take a similar format with a day of workshops followed by a day of talks. Expect topics including LLMs and MLOps. For more details about this and how to sign up, check out the Eventbrite listing here.

We also organise a free monthly webinar series. Check out this blog with details of what to expect and how to sign up.

Finally, we also develop software that is freely available to the data science community. Have you heard of diffify.com? This is a free-to-use website that we have developed internally, which allows you to compare any two versions of your favourite R or Python packages. We are proud of how diffify has grown over the years, and are excited to bring you more updates very soon, so stay tuned!

That’s all for this post. We look forward to seeing some new faces at our data science meetups in the near future!

Polars and Pandas - Working with the Data-Frame

Thu, 06 Nov 2025 23:59:00 +0000

Biodiversity. We’d like more of it. More of each thing, and more different types of thing. And more of the things that help make more of the different types of thing.

But can you have too many things?

In Data Science we are often working with rectangular data structures - databases, spreadsheets, data-frames. Within Python alone, there are multiple ways to work with this type of data, and your choice is constrained by data volume, storage, fluency and so on. For datasets that could readily be held in memory on a single computer, the standard Python tool for rectangling is Pandas, which became an open-source project in 2009. Many other tools now exist though. In particular, the Polars library has become extremely popular in Python over recent years. But when Pandas works, is well-supported, and is the standard tool in your team or your domain, and if you are primarily working with in-memory datasets, is there a value in learning a new data-wrangling tool? Of course there is.

But this is a blog post, not a course, so what we’ll do here is compare the Pandas and Polars syntax for some standard data-manipulation code. We will also introduce a new bit of syntax that Pandas 3.0 will be introducing soon.

Let’s talk about pollinators.

There's a nice dataset about pollinators and plants found in areas of the UK available on the UK Centre for Ecology and Hydrology (UKCEH) website. See the full citation below. Briefly, the dataset contains counts of different types of pollinators in a range of 1 km² grids across the UK. With it, we can see trends over time in pollinator numbers.

Installation

We will use separate ‘uv’-based projects to analyse the UKCEH dataset, by installing Polars, Pandas 2 and Pandas 3 into different virtual environments. See our recent summary of 2025-trends in Python to get more information about ‘uv’.

Let’s install some bears inside a snake and analyse some bees:

# Make separate environments for pandas2, pandas3, polars:
uv init pandas2

cd pandas2
uv add "pandas==2.3.3"
uv run python -c "import pandas; print(pandas.__version__)"
# 2.3.3
cd ..

For Pandas 3, we are going to install a development version of the package. One way to do this in uv is to use uv pip install:

uv init pandas3
cd pandas3
uv venv # explicitly initialise the virtual env
uv pip install --pre \
 --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple \
 pandas

# Resolved 5 packages in 2.35s
# Installed 5 packages in 31ms
# + numpy==2.4.0.dev0
# + pandas==3.0.0.dev0+2562.ga329dc353a
# + python-dateutil==2.9.0.post0
# + six==1.17.0
# + tzdata==2025.2
# (Note this venv isn't managed by uv...)

uv run python -c "import pandas; print(pandas.__version__)"
# 3.0.0.dev0+2562.ga329dc353a

cd ..

Finally, we’ll install polars into a separate project. I’ve called this project polars-proj. If the project had been called polars, we couldn’t have installed the polars package within it.

# We can't call this project 'polars',
# as we'll be installing the 'polars' package inside it
uv init polars-proj
cd polars-proj
uv add "polars==1.34.0"
cd ../

So we now have three different projects (‘pandas2’, ‘pandas3’, and ‘polars-proj’).

Download the data

Data was downloaded from ceh.ac.uk and stored in ./data/ukpoms_1kmpantrapdata_2017-2022_insects.csv. See the citation below if you wish to work with this data.

As of the start of November 2025, this dataset has been downloaded 29 times.

Data processing

Pandas 2

Pandas 2 provides a well-known syntax for data-frame work in Python. From the pandas2 project, we can open a Jupyter notebook based on the pandas2 virtual environment:

# [bash]
uv run --with jupyter jupyter lab

We will read in the data and then make some summaries, to produce an output table:

import pandas as pd

pollinators = pd.read_csv(
 "../data/ukpoms_1kmpantrapdata_2017-2022_insects.csv",
 encoding="ISO-8859-1"
)

Bees, and related species, are from the order “Hymenoptera”:

bees = pollinators[pollinators["order"] == "Hymenoptera"]
# 9245 rows, 16 columns

Within bees we find a range of interestingly-named insects: nomad bees, small shaggy bees, the impunctate mini-miner, a few Buffish mining bees and a clutch of heather girdled Colletes, amongst others. So I’m wondering how many bees and how many different species are observed in a given sector.

bees["english_name"].unique()
# array(['Common Yellow-face Bee', 'Red-tailed Bumblebee',
# 'Common Carder Bee', 'Bloomed Furrow Bee', ...

We have a sample_id and an occurrence_id column. There may be multiple rows with the same sample_id, but each row has a unique occurrence_id. The sample_id defines the 1 km² sector in which a given pollinator count was performed - there are multiple rows, because there are typically multiple pollinators present in a sector. Any given sample_id is present for only one year (not shown).
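The structural claims above (one row per occurrence_id, one year per sample_id) are easy to check. The following is a minimal sketch on a made-up toy data-frame; the values are hypothetical and only the column names are taken from the dataset. The same two checks could be run against the pollinators data-frame.

```python
import pandas as pd

# Hypothetical toy rows that mimic the structure described above
toy = pd.DataFrame({
    "sample_id": [101, 101, 102, 103, 103],
    "occurrence_id": [1, 2, 3, 4, 5],
    "year": [2019, 2019, 2020, 2021, 2021],
})

# Every row should have its own occurrence_id
occurrence_ids_unique = toy["occurrence_id"].is_unique

# Each sample_id should be associated with exactly one year
years_per_sample = toy.groupby("sample_id")["year"].nunique()
one_year_per_sample = bool((years_per_sample == 1).all())

print(occurrence_ids_unique, one_year_per_sample)
```

If either check failed on the real data, grouping by both sample_id and year (as we do next) would split a sector across several rows.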

So what we want to do is group the dataset by sample_id and count up the bees within that sector. We will store the year along with the sample_id.

We can count up the observations in each sector as follows. Here we are summing the number of observed insects (aggregating the ‘count’ column using the ‘sum’ function) and counting the number of distinct taxa in the sector (the length of the unique entries in the taxon_standardised column).

bee_counts = (
 bees
 .groupby(["sample_id", "year"])
 .agg({
 "count": "sum",
 "taxon_standardised": lambda x: len(x.unique())
 })
 .rename(columns={
 "count": "n_insects",
 "taxon_standardised": "n_species"
 })
)

With that, we can view the sectors that had the most bees overall:

bee_counts.sort_values("n_insects", ascending=False).head()
# n_insects n_species
# sample_id year 
# 14940524 2021 28 3
# 15465304 2021 28 5
# 6810184 2019 28 1

And those that had the most bee diversity:

bee_counts.sort_values("n_species", ascending=False).head()
# n_insects n_species
# sample_id year 
# 11873611 2020 25 11
# 4440178 2018 20 11
# 11745253 2020 24 11

You could do considerably more advanced analysis if you had time.
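As an aside, the separate .rename() step in the aggregation above is not the only option in Pandas. The sketch below uses Pandas' "named aggregation" form of .agg() on a small made-up data-frame (the values are hypothetical; only the column names come from the dataset). It is an alternative spelling, not the code used for the results shown in this post.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the bees data-frame
toy_bees = pd.DataFrame({
    "sample_id": [101, 101, 101, 102, 102],
    "year": [2019, 2019, 2019, 2020, 2020],
    "count": [2, 1, 4, 3, 3],
    "taxon_standardised": [
        "Bombus terrestris", "Apis mellifera", "Bombus terrestris",
        "Apis mellifera", "Apis mellifera",
    ],
})

toy_counts = (
    toy_bees
    .groupby(["sample_id", "year"])
    .agg(
        # output_name=(input_column, aggregation)
        n_insects=("count", "sum"),
        n_species=("taxon_standardised", "nunique"),
    )
)

print(toy_counts)
```

One difference to be aware of: "nunique" ignores missing values by default, whereas len(x.unique()) counts a missing value as one more distinct entry, so the two can disagree if taxon_standardised contains missing values.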

Polars

We will repeat the above, but using syntax typical for the Polars package.

The syntax for subsetting the rows of a data-frame is different in Polars. Passing a Boolean data-mask, pollinators[pollinators["order"] == "Hymenoptera"], doesn’t work in Polars and the printed error will recommend you use the .filter() method instead:

import polars as pl

# Assumes `pollinators` has already been read into a Polars data-frame
bees = (
 pollinators
 .filter(pl.col("order") == "Hymenoptera")
)

Inside a data-frame method (like .filter()) we can refer to a column using pl.col("column_name"). This means we don’t have to precompute a data-mask on a concrete data-frame, and can implicitly refer to a column in the current state of the data-frame (in Pandas, pollinators["order"] == "Hymenoptera" returns a Series of Boolean values that can be used to index into the rows of a data-frame; this logical series is a “data-mask”). So we can chain filtering steps together.

The syntax for grouping and summarising data is similar to the Pandas syntax but, again, we can refer to columns using pl.col(). By providing named arguments to .agg() the names of the output columns can be defined in a single step.

bee_counts = (
 bees
 .group_by(["sample_id", "year"])
 .agg(
 n_insects = pl.col("count").sum(),
 n_species = pl.col("taxon_standardised").unique().len()
 )
)

Pandas 3

Pandas 3.0 is introducing a new syntax that can be used for filtering rows, or adding new columns. It is closely related to the Polars pl.col() syntax. For example, filtering to keep only the “Hymenoptera” in the pollinators dataset can be performed using the following code:

# Pandas 3.0
import pandas as pd
pollinators = pd.read_csv(....)

bees = pollinators.loc[pd.col("order") == "Hymenoptera"]

The new part of this syntax is pd.col(). The .loc[] indexer is already available in Pandas 2, where we can pass it an anonymous function to select the required rows:

# Pandas 2 or 3.0
bees = pollinators.loc[lambda x: x["order"] == "Hymenoptera"]
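The benefit of passing a function to .loc[] is that it receives the data-frame in its current state, so it can be used in the middle of a method chain. The sketch below shows this on a made-up data-frame; the is_busy column and the threshold of 10 are hypothetical, chosen only for illustration. It should run on both Pandas 2 and Pandas 3.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset
toy = pd.DataFrame({
    "order": ["Hymenoptera", "Diptera", "Hymenoptera"],
    "count": [2, 5, 12],
})

busy_bees = (
    toy
    # Create a column mid-chain (name and threshold are illustrative)
    .assign(is_busy=lambda x: x["count"] > 10)
    # The lambda sees the data-frame *including* the new column
    .loc[lambda x: (x["order"] == "Hymenoptera") & x["is_busy"]]
)

print(busy_bees)
```

A precomputed mask such as toy["is_busy"] would fail here, because the is_busy column does not exist on the original toy data-frame.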

Summary

In this blog post we have shown the similarities and differences between Pandas and Polars syntax for typical data-manipulation tasks. There are some fundamental differences between Pandas and Polars that go deeper than the syntactic things covered here (and we’ve really only scratched the surface of those differences). Polars is implemented in Rust, whereas Pandas is written in Python on top of NumPy, which is largely implemented in C. The speed of Polars and Pandas can differ on the same tasks as a result of the different implementations. Processing speed is occasionally a good reason to choose one package over another. But if you are considering migrating from Pandas to Polars, you have to accept that your whole team will need onboarding to the Polars syntax. From what we’ve seen here, the differences between Pandas and Polars syntax aren’t that great; the methods have similar names, for example. In fact, from discussing the two packages with data scientists, we have found that it is the Polars syntax, rather than its speed, that has led some to migrate away from Pandas.

Data Citation

UK Pollinator Monitoring Scheme (2025). Pan trap survey data from the UK Pollinator Monitoring Scheme, 2017-2022. NERC EDS Environmental Information Data Centre. https://doi.org/10.5285/4a565007-d3a1-468d-9f84-70ec7594fafe

The UK Pollinator Monitoring Scheme (UK PoMS) is a partnership funded jointly by the UK Centre for Ecology & Hydrology (UKCEH) and Joint Nature Conservation Committee (JNCC) (through funding from the Department for Environment, Food & Rural Affairs, Scottish Government, Welsh Government and Department of Agriculture, Environment and Rural Affairs for Northern Ireland). UKCEH’s contribution is part-funded by the Natural Environment Research Council formerly as part of the UK-SCAPE programme (award NE/R016429/1) and now as part of the NC-UK programme (award NE/Y006208/1) delivering National Capability. Between 2017 and 2021, PoMS was funded by UKCEH and Defra (England), Welsh Government, Scottish Government, DAERA (Northern Ireland), and JNCC. PoMS is indebted to the many volunteers who carry out surveys and contribute data to the scheme.

Highlights from Shiny in Production (2025)

Mon, 03 Nov 2025 23:59:00 +0000

This October, Jumping Rivers hosted the fourth installment of our conference “Shiny In Production”. This year, speakers from around the world joined us in Newcastle to share how Shiny, in both Python and R, has solved real data problems for them.

Workshops

An important part of “Shiny In Production” is the afternoon of hands-on workshops. Whether you want your app to look nice, behave correctly, or treat all your users fairly, there was something for you. If only we could attend all of the workshops in parallel, too…

Colin Fay - “Production-Proof Shiny - End-to-end testing with Playwright and {golem}”

Colin Fay from ThinkR presented a workshop on the now industry-standard tool “Playwright” for end-to-end testing. Playwright can be used against any browser-based application, so by learning this tool our attendees could go away and test apps whether they are written with Shiny in R or Python, or using any other dashboard framework.

Russ Hyde - “Asynchronous Shiny”

Poor Shiny. 13 years old is practically middle-aged for a web framework, and like many a tech wunderkind, it’s decided that it’s time for a change. Here at Jumping Rivers, we wish Shiny great success as it embarks on a new life as a chef. But as it juggles the multiple orders that come in, and the many different recipes on the menu, Chef Shiny has realised that its years training on the futures, promises, and ExtendedTasks of the data world were the perfect apprenticeship for this brave leap.

Russ, one of our Data Scientists at Jumping Rivers, explained to our workshop attendees how to work with asynchronous programming in R, and how Shiny can make use of this approach to build apps that can serve multiple users without blocking.

Pedro Silva - “Figma and User-Interface Design”

Yes Shiny, you are beautiful, but are you really pairing that typeface with those buttons?

Figma, like Playwright above, is another widely-used tool from the wider world of web development. It can be used to create user interface designs in collaboration with clients and colleagues.

Our Data Scientist Pedro Silva, ably assisted by Tim Brock and Keith Newman, treated the workshop attendees to an introductory session on Figma. They worked through hands-on exercises to design components of an app and saw how thinking about user-interface design outside of your normal IDE can help when you are building applications in Shiny.

Talks

On Day 2 we enjoyed talks from some fabulous speakers across a range of industries!

Colin Fay - “After {shiny} — Bringing R to Mobile with webR”

R evolved from a command-line tool to include GUIs and IDEs. In 2012, Shiny emerged, enabling web app development purely in R, connecting statisticians with users. However, mobile usage wasn’t prioritised initially.

As mobile devices became ubiquitous, new requirements arose. Shiny’s mobile approaches proved limited, leading to the development of {shinymobile}. Despite improvements, it still required internet connectivity, couldn’t be distributed through app stores, and lacked access to native phone APIs.

To address these limitations, the team developed {R-linguo}, a proof-of-concept native app using webR. It offers offline functionality, native performance, mobile-friendly UX, native API access, and app store distribution.

The way the app works is by loading a JavaScript runtime, which loads webR, which then loads R functions as R object proxies that JavaScript can call.

This innovation serves scientists in remote areas, students, and educators who need offline R capabilities beyond what Shiny apps can provide.

Charlie Gao - “Advances in the Shiny Ecosystem”

This talk explores two key advances in the Shiny ecosystem: async programming and OpenTelemetry tracing.

The async section focuses on Promises and the mirai package. Mirai offers a modern foundation with NNG, IPC/TCP/secure TLS support, and cross-language data formats like Arrow. It delivers extreme performance, scaling to millions of tasks, with a production-first approach featuring 100% reliable evaluation and minimal complexity. It deploys everywhere, from local to remote systems and clusters including Slurm, SGE, LSF, and PBS.

OpenTelemetry provides observability at scale through distributed tracing across services, databases, and API gateways. It enables performance optimisation by reducing span length and nesting, real-time error detection, centralised monitoring across processes and machines, and production monitoring. Implementation requires installing {otel} and {otelsdk} packages and configuring environment variables.

The recommended performance workflow involves enabling OpenTelemetry to identify slow spans, using profvis for detailed analysis, then optimising through moving work outside Shiny servers, improving code efficiency, implementing caching, and sometimes utilising non-blocking reactivity.

Colin Gillespie - “Validating Shiny Apps in Regulated Environments”

Colin explored how to validate Shiny apps and what makes them trustworthy. Using audience input, he concluded that professional Shiny apps need tests, documentation, and a good user experience.

He covered the Jumping Rivers Litmusverse suite, which validates R packages, and explained that Shiny apps are harder to validate due to user interaction and variable outputs. Combining Litmus tools with Shiny-specific assessments can produce a validation score.

Colin also covered challenges such as logging user actions, validating downloads, and restricting inputs. He stressed using {renv} or Docker to manage environments, performing end-to-end testing, and separating logic from the app for easier testing. The talk ended with best practices around documentation, validation, workflows, and automation.

Jack Anderson - “Transforming the reporting of national patient outcomes with Shiny” (https://digital.nhs.uk/ndrs/data/data-outputs/cancer-data-hub/30-day-mortality-after-sact)

The Shiny era may be in full swing, but our battle against inadequate Excel spreadsheets rages on.

The National Disease Registration Service (NDRS) reports 30-day mortality post-Systemic Anti-Cancer Therapy (SACT) Case-Mix Adjusted Rates (CMAR) to NHS trusts in England each year. For the first three years, these were shared as Excel files with an accompanying pair of instructions in PDF files. But this leaves you with a number of problems, such as time-consuming admin to prepare and email the results to over 350 trusts around the country, and limited capacity for QA checks.

Replacing this system with a Shiny alternative provided a more intuitive user interface, plots that now make sense when copied into external reports, and the ability for trusts to easily compare against neighbouring regions. Jack guides us through the many benefits, not just for the end user but also for the NDRS, which now has drastically reduced admin requirements when publishing results each year.

Spurred on from the success of this application, the NDRS now has a custom Shiny starter template to support the creation of future Shiny applications among the team.

Gabriela De Lima Marin - “A collaborative initiative for mapping and georeferencing public schools in Brazil” (https://gabrielamarin.quarto.pub/shiny-in-production/)

With a goal of identifying digital inequality around Brazil, Gabriela lays out the problem of ensuring schools across Brazil can get a good connection to fibre internet networks. But that can be difficult when 3 in every 10 schools on the list have no geolocation data. It’s even worse when some schools with location data have coordinates that lie in the ocean.

Pulling in data from other sources—such as Google and OpenStreetMap—can help, but even these sources can be missing data or have incorrect entries. So why not allow locals to provide the missing pieces of the puzzle? Gabriela takes us through the creation of a Shiny application containing a {leaflet} map, where users can submit location data for schools.

There are of course challenges which Gabriela had to face: How do you decide which data source to trust? How do you decide who submitted the most accurate location marker? Challenges aside, this remains a cost-effective way to collect the vital information from those with local knowledge. And with this information, you can make impactful improvements to educational services across Brazil.

Cam Race - “shinyGovstyle: A ‘Shiny’ Secret Weapon for Production-Ready Government Public Services” (https://drive.google.com/file/d/1hCbMEZjxq_hoSBWXomj9bw1p_WSQ9Nxx/view?usp=sharing)

The vast majority of UK government service websites use the GOV.UK Design System—an award-winning framework for consistent styling and components on websites.

Cam showcased {shinyGovstyle}, an R package that applies the GOV.UK design system on your Shiny app, and explained the benefits of having a consistent theming package for your applications. While the biggest effect from the package is to apply the standardised CSS to the application, additional functions are added to provide more accessible formats to hyperlinks and widgets. This means some of the styling requirements needed to meet WCAG 2.2 AA are handled automatically, which is a legal requirement for UK government web services to meet. But as Cam points out, while a surprisingly large percentage of users will have some form of accessibility needs, improved accessibility benefits everyone.

Thanks to this common design system in a package, it’s much easier for developers to have their Shiny applications accepted for publication on government sites.

Laura Mawer - “Duck, Duck, …, Dashboard”

I write this while seeking cover. As Laura pelts another questioner with rubber ducks, I’ll try to summarise her talk…

The second Shiny app that Laura (from Datacove) ever built shows off some super cool stuff. Interactive graphics of duck-related data are one thing. But the artificial intelligence embedded in this app was awesome.

If you haven’t seen the AI tutorials for “Shiny for Python” that Posit have written, they are strongly recommended. But you need a good idea before including AI in an app. Here, Laura included a text-query box that allows users to ask questions about the dataset that is presented in the app. You can ask questions about ducks in general, or about the dataset itself. The idea worked really well.

That’s right. Her second ever app…

If you are learning Data Science in Python, Laura hosts a YouTube series “Pretty Powerful Pandas”. Hopefully she won’t be throwing pandas at us next time…

Nic Crane & Charlotte Hadley - “htmlwidgets are a secret sauce in R - can LLMs make them the perfect condiment?”

The final talk of the day was a duet. If you blend one htmlwidgets enthusiast, and one LLM enthusiast, this interactive session is the result.

{htmlwidgets} powers much of the connection between R and JavaScript widgets. Think DT, leaflet, plotly and profvis: their use in R and Shiny is held together by htmlwidgets. Many of the audience have made use of these tools at one time or another. It’s much less common to meet someone who has created an htmlwidget-based package that wraps up a JavaScript library for use in R.

This talk by Nic and Charlotte showed us how simple it is to make an htmlwidget.

But it didn’t just do that. There are already tutorials and books that can explain that process. Here, they showed us how simple it is to get GitHub Copilot to make an htmlwidget. Using prompts in VS Code, they built an initial widget and then refined it again and again with further prompting, until the resulting R code could create a timeline from an input data-frame.

The package created by Copilot was usable, although a bit of manual tweaking was required to polish off the code. So it still takes an expert like Nic or Charlotte to temper some of the decisions made by code-generating tools.

Lightning Talks

Like last year, the lightning talks had the added challenge of the slides auto-rolling with 10 seconds for each one! We also had a vote for the best talk, with the prize being a £100 book voucher donated by CRC Press. David Carayon from INRAE claimed the prize, avenging his second-place finish last year!

David Carayon - “Rescuelog: a Shiny-Based Monitoring System for Lifeguards: Insights from Southwest France”

David started by showing the beautiful beaches of south-west France, with the caveat that they are some of the most dangerous in the world, with thousands of rescues per year. He then outlined his project equipping lifeguards with tools and knowledge for reporting these rescues, allowing the collected data to be used in a predictive analytics Shiny app. The project has been a great success, with over 15,000 submissions each year and over 80 beaches signed up.

Rhian Davies - “The Accidental Engineers: Managing Shiny Apps, Pipelines, and Tech Debt in the NHS”

Rhian started by outlining the challenge of hospital planning, in terms of variables impacting demand like population growth, patient expectations or waiting lists. She then presented a project using Shiny to explore outputs from a Python model.

She described the struggle of losing a core member of the team and of the app going into production, which resulted in a huge surge in demand for the developers’ time, with bugs and feature requests flying in. The team has developed a mechanism to treat the model like a software project, with sprints including development, quality assurance and launch. She finished by detailing some of the best practices they’ve implemented and important lessons learnt.

Andrie de Vries - “Working with Infosec to get to production”

Andrie spoke about the importance of the three pillars of Infosec: availability, integrity and confidentiality. He also covered the risks of data leakage, inadequate testing and exposure to the LLM provider. He moved on to some solutions to these problems, like authorisation, data security and scoped permissions.

Russ Hyde - “Discoverability and the Data Product”

Our own Russ Hyde spoke about the importance of making tools and apps discoverable, an often overlooked part of development. For Shiny apps specifically, there are aspects of metadata you can add, like descriptions, tags and documentation.

Russ spoke about helping users by making it easy to find and use products and contact developers. He pointed to the Jumping Rivers dashboard gallery, with some helpful examples of features you can include in Shiny dashboards.

Kia Mack & Euan McKenzie - “Building the Kent BNG Register: Shiny for UI-First Development in a Small Charity Tech Team”

Kia started off with a brief background on something called Biodiversity Net Gain, a government initiative where land developers have to increase biodiversity and a popular way of doing this is by buying biodiversity credits. The Kent Wildlife Trust wanted a way to add visibility to local habitat banks (where the credits can be purchased) to ensure that local developers are investing in local biodiversity. She introduced the Shiny app that they had designed for land developers to see what listings are available from habitat banks and enquire with them.

Euan then detailed the tools used within the app, like bs4Dash for the layout, leaflet for the open source maps and DT for interactive tables. He also spoke about the auth0 package for in-app authentication, with email verification. Having a secure database was a key part of the project, so they used the glue_sql and inputValidator functions to prevent SQL injection by sanitising queries. The Golem framework was also very helpful for structuring the app and allowed them to create a separate R package containing the business logic with tests.

Natalia Petersen - “Hackathon to Streamline the National Disease Registration Service Cancer Treatments Shiny App” (repository: NHSE-NDRS/shiny-app-cancer-treatments)

Natalia from the National Disease Registration Service spoke about how her team used a hackathon day to develop a Shiny app. She started with the background of the project, a publicly available dashboard which presents treatment data for various forms of cancer treatment. She displayed the previous iteration of the dashboards (one for demographics and one for alliance) with a “retro” UI.

She spoke about areas of improvement they targeted for the dashboards like combining the dashboards, removing repeated code and simplification of over-complicated logic. They had 4 tasks, 4 hours and 4 analysts to tackle the issues. She spoke about successes of the session and also what could have improved it, like dedicated “mop-up” time. To finish Natalia showed what the new app looked like and gave an overview of the improvements made.

Andreas Wolfsbauer - “Enhancing Epidemiological Surveillance with a Shiny Application for Standardized Data Analysis”

Andreas is a data scientist at the Austrian Agency for Health and Food Safety, in the Institute for Surveillance & Infectious Disease Epidemiology. He showed the Shiny application he has developed for standardised data analysis. The app has two components: the first is a dashboard where users can load the disease data and then filter on year, state and age group; the second is an analysis page where they can filter the data and download or visualise it.

Andreas spoke about issues with the app, like a reliance on Excel for the data, which he addressed by running a scheduled job each morning to preload the data. He spoke about constraints within the organisation, with policies limiting access to tools like Docker and shinyproxy. So he turned to Shiny Server, deployed 3 instances of the app and wrote a gateway app which handles load balancing across the instances. He closed his talk with a list of features he would like to experiment with or add to the app in the future.

What happens next?

Next year, we’re excited to host the very first AI In Production! Join us on June 4th and 5th in Newcastle Upon Tyne for an inspiring lineup of industry-leading speakers and hands-on workshops. Grab your tickets now on Eventbrite to take advantage of the Super Early Bird discount before it’s gone.



Elevate Your Data Skills with Jumping Rivers Training

Tue, 28 Oct 2025 23:59:00 +0000

In today’s data-driven world, strong analytical and programming skills are essential for success. Whether you’re just starting your data journey or looking to expand your expertise, Jumping Rivers offers training that combines real-world experience with interactive, practical learning.

Expert-Led, Hands-On Learning

At Jumping Rivers, our trainers are experienced data scientists and engineers who work daily on real client projects. This means the skills you learn are grounded in real-world applications, not just theory. Our courses blend live coding, demonstrations, and interactive exercises to ensure an engaging and effective learning experience. Every participant receives:

Comprehensive PDF notes and scripts for continued learning.
Live demonstrations and practical exercises.
Guidance from a trainer, whether online or onsite.

Training is available both online and in-person, with flexible options for individual learners or teams.

Course Topics

Our training portfolio spans a wide range of topics, including:

R for data analysis and reporting
Python for data science and automation
Git and version control
Artificial Intelligence and Machine Learning fundamentals

Each course is designed to help participants apply new skills immediately to their work, with clear examples and hands-on practice throughout.

Public and In-House Training

Our public training programme offers scheduled courses open to all participants. You can view our upcoming sessions here:

https://www.jumpingrivers.com/training/public/

For organisations looking to develop their teams, we also provide bespoke in-house training. These sessions are fully customised to your workflows and skill levels, ensuring that your team gains the most relevant and practical insights possible.

To discuss tailored courses for your organisation, contact training@jumpingrivers.com.

Why Train with Jumping Rivers

With over 1,000 courses delivered worldwide, Jumping Rivers has built a reputation for delivering training that is both impactful and accessible. Our clients include NHS Scotland, Shell, Wessex Water, and the Royal Statistical Society, organisations that trust us to develop their data capability.

We also offer additional benefits such as:

Discounts for group bookings and returning clients.
Reduced rates for attendees of our events and conferences.

Take the Next Step

Whether you’re advancing your own career or developing your team’s capabilities, Jumping Rivers training provides the tools, confidence, and practical knowledge you need to succeed.

Explore our public courses or reach out to discuss bespoke options:

View upcoming training: https://www.jumpingrivers.com/training/public/
Enquire about group training: training@jumpingrivers.com

Invest in your growth with hands-on, expert-led training from Jumping Rivers and take the next step in your data journey.


Creating a Python Package with Poetry for Beginners Part 2

Thu, 23 Oct 2025 23:59:00 +0000

Intro

In the previous blog we covered creating our package with Poetry, managing our development environment and adding a function. In this blog post we’ll cover the next steps in package development, including documentation, testing and how to publish to PyPI.

Note: I am using my package as an example but not actually publishing it to PyPI.

Documentation

When developing a package, documentation is one of the most important steps. It’s easy to get carried away with the fun of writing packages and functions and forget to document them. There are many reasons to write documentation, including:

Purpose: Explains what the code does and why. Thinking about this as a developer can often help with design.
Usability: It helps users (and your future self) understand the code.
Maintenance: It will make debugging and updates easier.
Standards: All good packages have good documentation. It is one of the key metrics of Litmus, our package validation service.

What Documentation Do We Need?

README

A README file is a short, essential guide that explains your Python package at a glance. It typically includes:

Project name and description: What the package does and why it’s useful.
Installation instructions: How to install it (usually with pip).
Usage examples: Simple code snippets showing how to get started.
Features or documentation links: What’s included and where to learn more.
License and contribution info: How others can use or contribute to the project.

In short, the README helps users understand, install, and use your package quickly.

For a good example of a README file, instead of writing one for my package I’m going to point to the pandas README.

Docstrings

Docstrings are short, embedded documentation inside your Python code that explain what functions, classes, or modules do. They typically include:

Purpose: A brief description of what the function, class, or module does.
Parameters: Names, types, and descriptions of inputs.
Returns: The output type and what it is.
Example usage (optional): A small code snippet showing how to use it.

In short, docstrings make your code understandable, help tools like help() or IDEs provide guidance, and serve as the basis for auto-generated API documentation.

For a docstring example I am going to use my function get_season_league. Here, we are using the Sphinx markup language to document the different input parameters and their datatypes, and any returned values. See the Sphinx documentation for further information.

def get_season_league(league_id="485842"):
    """
    This function will take your league ID, map over all the members
    of your league then return a DF with a week on week league table.

    :type league_id: str
    :param league_id: ID of the league you are targeting

    :returns: Data-frame of the league's week on week standings
    """
    api_url = "https://fantasy.premierleague.com/api/"

Testing

Testing is another very important part of package development that has many benefits. It can be integrated into CI pipelines, meaning you can run the tests every time you push some changes to a remote git repository. Some of the benefits of testing are:

Thinking about tests whilst writing functions will aid development
Well written tests will catch bugs early
Ensure consistency between releases

There are lots of resources out there on writing tests for Python packages. We have two previous blogs on pytest, an introductory blog and a more advanced one. There are many testing frameworks available for Python, like unittest, pytest, or doctest (which runs docstring-embedded examples as software tests). The type of testing you need will often determine the framework you use. The software literature makes distinctions between different types of tests: unit (which we will focus on), integration, end-to-end, and acceptance tests. The distinction is based on the scope (how much of the software project is run or touched during the tests), isolation (whether the tests rely on external services) and viewpoint (whether the tests check features from a user’s perspective, or how the software works internally from a developer’s perspective).
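
As a small illustration of the doctest approach mentioned above, here is a minimal sketch. The function total_points and its example values are hypothetical and are not part of the package discussed in this post; the doctest calls themselves come from the Python standard library.

```python
import doctest


def total_points(weekly_points):
    """
    Sum a manager's weekly points (hypothetical helper for illustration).

    :type weekly_points: list[int]
    :param weekly_points: Points scored in each game-week

    :returns: Total points across all game-weeks

    >>> total_points([69, 54, 71])
    194
    >>> total_points([])
    0
    """
    return sum(weekly_points)


# Collect the examples embedded in the docstring and run them as tests
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(total_points, name="total_points",
                        globs={"total_points": total_points}):
    runner.run(test)

# summarize() returns a (failed, attempted) named tuple
results = runner.summarize(verbose=False)
print(f"{results.attempted} examples run, {results.failed} failed")
```

In day-to-day use you would more likely let a test runner collect these examples for you, for instance with python -m doctest on a file or with pytest's --doctest-modules option.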

Testing My Package

Thankfully my package only has one function so it will be very easy to write a test.

To begin, I’ll create a test file, tests/test_get_league.py. This follows the convention of naming the test file test_module_name. You may also see test files named test_function_name; this will depend on how large your modules are. The goal is for the naming to be consistent, easy to understand and ideally split up based on size.

I have added some simple tests for the class of the output, the columns returned and the first event in my default data, as this will remain the same. I’m not going to go into detail on how the tests work, as we have already written blogs on this (as mentioned above), but this is my test:


import pandas as pd

# Also import get_season_league from your own package here

def test_get_season_league():
    output = get_season_league()
    # Test pandas DataFrame is produced
    assert isinstance(output, pd.DataFrame)
    # Test columns are correct
    assert list(output.columns) == ["name", "team_name", "event", "points"]
    # Test first event as it will remain the same as the data grows
    first_event = output.query("name == 'Osheen Macoscar' & event == 1")
    # Compare the single matching value rather than the whole Series
    assert first_event["points"].iloc[0] == 69

I have written a very surface level test here. My particular function is hard to test as I’m calling an external API, meaning the object will differ each game-week. The API may also go down or the output may change causing the test to fail, when my function hasn’t changed. When touching an external resource ideally I could set up a static response to test (which I could do for certain endpoints) but I can’t with my function as the output is supposed to change throughout the season.
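
For endpoints where a static response does make sense, one common option is to replace the network call with a canned payload during the test. The sketch below uses only the standard library. The helper fetch_json, the endpoint name and the payload shape are all assumptions made for illustration; they are not taken from my package or from the API's documentation.

```python
import io
import json
import urllib.request
from unittest import mock

API_URL = "https://fantasy.premierleague.com/api/"  # base URL used in this post


def fetch_json(endpoint):
    """Hypothetical helper: download and decode one JSON endpoint."""
    with urllib.request.urlopen(API_URL + endpoint) as response:
        return json.load(response)


def test_fetch_json_with_static_response():
    # A canned payload stands in for the live API (shape is illustrative)
    payload = {"events": [{"id": 1, "name": "Gameweek 1"}]}
    fake_response = io.BytesIO(json.dumps(payload).encode("utf-8"))

    # Swap urlopen for a mock that returns the canned bytes, so no
    # network request is made while the test runs
    with mock.patch("urllib.request.urlopen",
                    return_value=fake_response) as fake_urlopen:
        result = fetch_json("some-endpoint/")

    fake_urlopen.assert_called_once_with(API_URL + "some-endpoint/")
    assert result == payload
```

Because the response is fixed, a test like this keeps passing when the API is down, and it only fails when the code under test changes.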

Once we have written our tests we can run pytest whilst in the top level of our package to run the test(s), and it will tell you if they have passed or failed.

Publishing to PyPI

As I mentioned at the start of the blog, I am not publishing this package to PyPI. However, I will show the helpful Poetry command that allows us to do it. Note there is also a TestPyPI that you can publish to first to ensure everything runs smoothly.

The main command for this is poetry publish, but there are a few steps we need to take first. You need to authenticate before you can publish; this can be set up by adding your user-specific PyPI token to your config:

poetry config pypi-token.pypi <token>

After you have done this you are clear to publish and can do so with:

poetry publish --build

The --build flag at the end builds the package, creating the distributable files (a .tar.gz and a .whl) inside the dist/ directory. This is required before publishing the package.

Next Up

This is where I am going to leave the series for now. We have looked at all the basics you need when developing a Python package, from writing and documenting functions all the way to testing and publishing the package. In the next iteration I may look at building out this package or parallelising the function I’ve written, but it is not scheduled to be written anytime soon.


What's new for Python in 2025?

Thu, 16 Oct 2025 23:59:00 +0000

Python 3.14 was released on 7th October 2025. Here we summarise some of the more interesting changes and some trends in Python development and data-science over the past year. We will highlight the following:

the colourful Python command-line interface;
project-management tool uv;
free-threading;
and a brief summary of other developments.

The Python 3.14 release notes also describe the changes to base Python.

Colourful REPL

At Jumping Rivers we have taught a lot of people to program in Python. Throughout a programming career you get used to making, and learning from, mistakes. The most common mistakes made in introductory programming lessons may still trip you up in 10 years’ time: unmatched parentheses, typos, missing quote symbols, unimported dependencies.

Our Python training courses are presented using Jupyter. Jupyter notebooks have syntax highlighting that makes it easy to identify an unfinished string, or a mis-spelled keyword.

But, most Python learners don’t use Jupyter (or other high-level programming tools) on day one - they experiment with Python at the command line. You can type “python” into your shell/terminal window and start programming into the “REPL” (read-evaluate-print loop).

Any effort to make the REPL easier to work with will be beneficial to beginning programmers, so the introduction of syntax highlighting in the Python 3.14 REPL is a welcome change.

`uv` and package development

One of the big trends in Python development in 2025 is the rise of the project management tool uv. This is a Rust-based command-line tool that can be used to initialise a package or project structure, to specify the development and runtime environment of a project, and to publish a package to PyPI.

At Jumping Rivers, we have used poetry for many of the jobs that uv excels at. Python is used for the data preparation tasks for diffify.com, and we use poetry to ensure that our developers each use precisely the same package versions when working on that project (See our current blog series on Poetry). But, poetry doesn’t prevent developers using different versions of Python. For that, we need a second tool like pyenv (which allows switching between different Python versions) or for each developer to have the same Python version installed on their machine.

uv goes a step further than poetry and allows us to pin Python versions for a project. Let’s use uv to install Python 3.14, so that we can test out features in the new release.

First follow the instructions for installing uv.

Then at the command line, we will use uv to create a new project where we’ll use Python 3.14.

# [bash]
cd ~/temp
mkdir blog-py3.14
cd blog-py3.14

# Which versions of Python 3.14 are available via uv?
uv python list | grep 3.14
# cpython-3.14.0rc2-linux-x86_64-gnu <download available>
# cpython-3.14.0rc2+freethreaded-linux-x86_64-gnu <download available>

You’ll see something similar regardless of the operating system that you use. That lists two versions of Python 3.14 - one with an optional system called “Free Threading” (see later). We’ll install both versions of Python:

uv python install cpython-3.14.0rc2-linux-x86_64-gnu
uv python install cpython-3.14.0rc2+freethreaded-linux-x86_64-gnu

Users of pyenv will be able to install Python 3.14 in a similar manner.

We can select between the two different Python versions at the command line. First using the version that does not have free threading:

uv run --python=3.14 python
# Python 3.14.0rc2 (main, Aug 18 2025, 19:19:22) [Clang 20.1.4 ] on linux
# ...
>>> import sys
>>> sys._is_gil_enabled()
# True

Then using the version with free threading (note the t suffix):

uv run --python=3.14t python
# ...
# Python 3.14.0rc2 free-threading build (main, Aug 18 2025, 19:19:12) [Clang 20.1.4 ] on linux
# ...
>>> import sys
>>> sys._is_gil_enabled()
# False
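
The same check can be made from a script rather than the REPL. This is a minimal sketch: describe_interpreter is a hypothetical helper name, and the getattr guard is there because sys._is_gil_enabled() only exists in Python 3.13 and later.

```python
import sys
import sysconfig


def describe_interpreter():
    """Report whether this interpreter is a free-threaded build."""
    # Py_GIL_DISABLED is set to 1 for free-threaded builds; on other
    # builds it is 0 or missing
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # Older interpreters lack the check and always run with the GIL
    gil_check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if gil_check is None else bool(gil_check())

    return {
        "version": sys.version.split()[0],
        "free_threaded_build": free_threaded_build,
        "gil_enabled": gil_enabled,
    }


print(describe_interpreter())
```

Saving this as a script and running it with uv run --python=3.14 and then uv run --python=3.14t should report different values for the two interpreters.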

Project creation and management with `uv`

uv is capable of much more than allowing us to switch between different versions of Python. The following commands initialise a Python project with uv:

# From ~/temp/blog-py3.14

# Indicate the default python version for the project
uv python pin 3.14

# Initialise a project in the current directory
uv init .

# Check the Python version
uv run python --version
# Python 3.14.0rc2

This adds some files for project metadata (pyproject.toml, README.md) and version control:

tree -a -L 1
# .
# ├── .git
# ├── .gitignore
# ├── main.py
# ├── pyproject.toml
# ├── .python-version
# ├── README.md
# ├── uv.lock
# └── .venv
#
# 2 directories, 6 files

Now we can add package dependencies using uv add <packageName> and other standard project-management tasks. But one thing I wanted to highlight is that uv allows us to start a Jupyter notebook, using the project’s Python interpreter, without either adding jupyter as a dependency or explicitly defining a kernel for jupyter:

uv run --with jupyter jupyter lab

Creating a new notebook with the default Python 3 kernel in the JupyterLab session that starts should ensure that you are using the currently active Python 3.14 environment.

Threading

Python 3.13 introduced an experimental feature, ‘Free-threading’, that is now officially supported as of 3.14.

First though, what is a ‘thread’? When a program runs on your computer, there are lots of different tasks going on. Some of those tasks could run independently of each other. You, as the programmer, may need to explain to the computer which tasks can run independently. A thread is a way of cordoning off one of those tasks: it tells the computer that your software is running on that this task here can run separately from those tasks there, and it carries the logic for running the task too. (Basically.)

Python has allowed developers to define threads for a while. If you have a few tasks that are largely independent of each other, each of these tasks can run in a separate thread. Threads can access the same memory space, meaning that they can access and modify shared variables in a Python session. In general, this also means that a computation in one thread could update a value that is used by another thread, or that two different threads could make conflicting updates to the same variable. This freedom can lead to bugs. The CPython interpreter was originally written with a locking mechanism (the Global Interpreter Lock, GIL) that prevented different threads from running at the same time (even when multiple processors were available) and limited the reach of these bugs.
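
To make the shared-memory point concrete, here is a small sketch (not from the original post) in which four threads update the same dictionary entry. The names shared, lock and add_many are invented for the example. Each update is wrapped in a threading.Lock so that two threads cannot make conflicting updates to the same value.

```python
import threading

# State shared by every thread in this Python session
shared = {"count": 0}
lock = threading.Lock()


def add_many(n):
    """Increment the shared counter n times, holding the lock for each update."""
    for _ in range(n):
        with lock:
            shared["count"] += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["count"])  # 40000
```

With the lock in place the total is always 4 x 10,000. Without it, the read-modify-write in each increment is not guaranteed to be atomic, which is the kind of bug described above.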

Traditionally, you would have used threads for “non-CPU-bound tasks” in Python. These are the kinds of tasks that would be unaffected by having more, or faster, processors available to the Python instance: network traffic, file access, waiting for user input. For CPU-bound tasks, like calculations and data-processing, you could use Python’s ‘multiprocessing’ library (although some libraries like ‘numpy’ have their own low-level mechanisms for splitting work across cores). This starts multiple Python instances, each doing a portion of the processing, and allows a workload to be partitioned across multiple processors.

The other main differences between threading and multiprocessing in Python are in memory and data management. With threading, you have one Python instance, with each thread having access to the same memory space. With multiprocessing, you have multiple Python instances that work independently: the instances do not share memory, so to partition a workload using multiprocessing, Python has to send copies of (subsets of) your data to the new instances. This could mean that you need to store two or more copies of a large dataset in memory when using multiprocessing on it.

Simultaneous processing across threads that share memory-space is now possible using the free-threaded build of Python. Many third-party packages have been rewritten to accommodate this new build and you can learn more about free-threading and the progress of the changes in the “Python Free-Threading Guide”.

As a simple-ish example, let’s consider natural language processing. There is a wonderful blog post about parallel processing with the nltk package on the “WZB Data Science Blog”. We will extend that example to use free-threading.

nltk provides access to some of the Project Gutenberg books, and we can access this data as follows:

# main.py
import nltk

def setup():
 nltk.download("gutenberg")
 nltk.download("punkt_tab")
 nltk.download('averaged_perceptron_tagger_eng')
 corpus = { f_id: nltk.corpus.gutenberg.raw(f_id)
 for f_id in nltk.corpus.gutenberg.fileids()
 }
 return corpus

corpus = setup()

The key-value pairs in corpus are the abbreviated book-title and contents for 18 books. For example:

corpus["austen-emma.txt"]
# [Emma by Jane Austen 1816]
#
# VOLUME I
#
# CHAPTER I
#
#
# Emma Woodhouse, handsome, clever, and rich, with a comfortable home ...

A standard part of a text-processing workflow is to tokenise and tag the “parts-of-speech” (POS) in a document. We can do this using two nltk functions:

# main.py ... continued
def tokenise_and_pos_tag(doc):
 return nltk.pos_tag(nltk.word_tokenize(doc))

A function to sequentially tokenise and POS-tag the contents of a corpus of books can be written:

# main.py ... continued
def tokenise_seq(corpus):
 tokens = {
 f_id: tokenise_and_pos_tag(doc)
 for f_id, doc in corpus.items()
 }
 return tokens

You need to install or build Python in a particular way to make use of “Free-threaded” Python. In the above, we installed Python “3.14t” using uv, so we can compare the speed of free-threaded and sequential, single-core, processing.

We will use the timeit module (part of the Python standard library) to analyse processing speed, from the command line.

# Activate the threaded version of Python 3.14
uv python pin 3.14t

# Install the dependencies for our main.py script
# (timeit ships with Python, so only nltk needs adding)
uv add nltk

# Time the `tokenise_seq()` function
# -- but do not time any setup code...
PYTHON_GIL=0 \
 uv run python -m timeit \
 --setup "import main; corpus = main.setup()" \
 "main.tokenise_seq(corpus)"

# [lots of output messages]
# 1 loop, best of 5: 53.1 sec per loop

After some initial steps where the nltk datasets were downloaded and the corpus object was created (neither of which was timed, because these steps were part of the timeit --setup block), tokenise_seq(corpus) was run multiple times and the fastest run took around 53 seconds.

A small note: we have used the environment variable PYTHON_GIL=0 here. This makes it explicit that we are using free-threading (turning off the GIL). This wouldn’t normally be necessary to take advantage of free-threading (in Python “3.14t”), but was needed because one of the dependencies of nltk hasn’t been validated for the free-threaded build yet.

To write a threaded-version of the same, we introduce two functions. The first is a helper that takes (filename, document-content) pairs and returns (filename, processed-document) pairs:

def tupled_tokeniser(pair):
 file_id, doc = pair
 return file_id, tokenise_and_pos_tag(doc)

The second function creates a Thread-pool, taking advantage of as many CPUs as there are available on my machine (16, counted by multiprocessing.cpu_count()). Each document is processed as a separate thread and we wait for all of the documents to be processed before returning results to the caller:

import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor, wait
# ...
def tokenise_threaded(corpus):
    with ThreadPoolExecutor(max_workers=mp.cpu_count()) as tpe:
        try:
            futures = [
                tpe.submit(tupled_tokeniser, pair)
                for pair in corpus.items()
            ]
            wait(futures)
        finally:
            # output is a list of (file-id, data) pairs
            tokens = [f.result() for f in futures]
    return tokens

# Time the `tokenise_threaded()` function
# -- but do not time any setup code...
PYTHON_GIL=0 \
 uv run python -m timeit \
 --setup "import main; corpus = main.setup()" \
 "main.tokenise_threaded(corpus)"
# [lots of output messages]
# 1 loop, best of 5: 32.5 sec per loop

I could see that every core was used when processing the documents, using the htop tool on Ubuntu. At points during the run, each of the 16 CPUs was at near to 100% use (whereas only one or two CPUs were busy at any time during the sequential run).

But, despite using 16x as many CPUs, the multithreaded version of the processing script only took about 40% less time (32.5 seconds rather than 53.1, a speed-up of roughly 1.6x). There were only 18 books in the dataset, with some disparity between the book lengths (the Bible, by far the longest text, was processed much more slowly than the others). Maybe the speed-up would be greater with a larger or more balanced dataset.
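
The arithmetic behind that comparison, and the reason one long book limits the gain, can be sketched as follows. The two timings are the ones measured above. The per-book durations are purely hypothetical numbers chosen to illustrate an unbalanced workload, and the function names are made up for this example.

```python
def speed_up(sequential_s, parallel_s):
    """Ratio of sequential to parallel wall-clock time."""
    return sequential_s / parallel_s


def parallel_lower_bound(durations, workers):
    """Best possible wall-clock time when each task runs whole on a single worker."""
    return max(max(durations), sum(durations) / workers)


seq, par = 53.1, 32.5  # timings reported by timeit above
print(f"speed-up: {speed_up(seq, par):.2f}x")  # speed-up: 1.63x
print(f"time saved: {1 - par / seq:.0%}")      # time saved: 39%

# Hypothetical per-book times in seconds: one very long book and 17 short ones
durations = [30.0] + [1.0] * 17
print(parallel_lower_bound(durations, workers=16))  # 30.0
```

However many workers are available, a run cannot finish before its longest single task does, so a corpus dominated by one book gains little from extra threads.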

In the post on the WZB Data Science blog, there is a multiprocessing implementation of the above. Running their multiprocessing code with 16 CPUs gave a similar speed up to multithreading (minimum time 31.2 seconds). Indeed, if I was writing this code for a real project, multiprocessing would remain my choice, because the analysis for one book can proceed independently of that for any other book and data volumes aren’t that big.
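
One way to keep that choice open is to write the pool code against the shared concurrent.futures executor interface. The sketch below is not the WZB implementation and does not use nltk: word_count is a stand-in for the tokeniser, and process_corpus and its parameters are names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(pair):
    """Stand-in for the tokeniser: returns (file-id, number of whitespace-separated words)."""
    file_id, doc = pair
    return file_id, len(doc.split())


def process_corpus(corpus, executor_cls=ThreadPoolExecutor, max_workers=4):
    """Apply word_count to every (file-id, document) pair using the given executor class."""
    with executor_cls(max_workers=max_workers) as pool:
        # map() preserves input order and re-raises any exception from a worker
        return dict(pool.map(word_count, corpus.items()))


toy_corpus = {"a.txt": "one two three", "b.txt": "four five"}
result = process_corpus(toy_corpus)
print(result)  # {'a.txt': 3, 'b.txt': 2}
```

Passing executor_cls=ProcessPoolExecutor (also from concurrent.futures) would switch the same function to multiprocessing, provided the worker function is defined at module level so that it can be pickled.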

Other News

Python 3.14 has also introduced some improvements to exception-handling, a new approach to string templating and improvements to the use of concurrent interpreters. See the Python 3.14 release notes for further details.

In the wider Python Data Science ecosystem, a few other developments have occurred or are due before the end of 2025:

The first stable release of the Positron IDE was made in August;
Pandas 3.0 is due before the end of the year, and will introduce strings as a data-type, copy-on-write behaviour, and implicit access to columns in DataFrame-modification code;
Tools that ingest DataFrames are becoming agnostic to DataFrame library through the Narwhals project. See the Plotly write-up on this subject.

Python data science progresses at such a speed that we can only really scratch the surface here. Have we missed anything in the wider Python ecosystem (2025 edition) that will make a huge difference to your data work? Let us know on LinkedIn or Bluesky.

For updates and revisions to this article, see the original post

Upcoming Free Webinar: Understanding Posit - Ecosystem and Use Cases

Mon, 13 Oct 2025 23:59:00 +0000

Date: Thursday, 23rd October 2025

Time: 13:05 (UK Time)

Duration: 55 minutes

Cost: Absolutely free!

Reserve your spot now: https://jumpingrivers.typeform.com/to/UmdyNbAs

Ready to get more out of your Posit tools and understand how they can drive value across your organisation? Join us for this month’s free Jumping Rivers webinar, “Understanding Posit: Ecosystem and Use Cases.” In this live session, our experts will take you beyond the basics - exploring how Posit’s ecosystem (including Connect, Workbench, and Package Manager) supports scalable, secure, and collaborative data workflows. Whether you’re managing analytical environments, deploying Shiny apps, or looking to integrate R and Python workflows across teams, this session will show you how to make the most of your Posit investment.

What you’ll gain:

A clear understanding of how the Posit ecosystem fits into modern data infrastructure
Guidance for managing and scaling data science environments
A chance to ask questions directly to our experts

Exclusive attendee perks: Attend two or more webinars and receive a 30% discount for our AI in Production Conference (June 2026) — where data scientists, engineers, and innovators meet to share ideas, network, and explore the future of AI. 👉 Register for the conference here

For updates and revisions to this article, see the original post

Creating a Python Package with Poetry for Beginners

Thu, 09 Oct 2025 23:59:00 +0000

Intro

In this blog series (this and the next blog) I am going to demonstrate how to use Poetry to create a Python package, set up testing infrastructure and install it. I am going to be creating a wrapper around the Fantasy Premier League API and creating a function which can create a weekly league table.

Before we look at creating a package, why might we want one? There is a multitude of reasons for wrapping your code up but to me the main three are:

Code wrapped up in a package is reusable, meaning we just need to install the package to use the exported functions instead of copy-and-pasting or reimplementing the same code in each of our projects.
The code is very easy to share once wrapped up in a package. Just publish to a package index or share the repository privately and other people will be able to use it.
Maintenance of a package is also very easy with all the development tools available. Centralisation of bug fixes, updates, documentation, testing and more will make your life a whole lot easier.

We will come back to the value of distributing a package later in the blog series. When a publishable package is ready, it can be published in the Python Package Index (PyPI) and from here it can be installed by other users.

Set Up

The first thing you’ll need to do is come up with a name for your package (often the hardest bit) and then we will use Poetry to create the initial infrastructure. Note that other packaging and dependency management tools are available like Setuptools, Flit or Hatch. As I said though in this blog we are focusing on Poetry so once we have a name for the package (my package for this blog is called fpl-league) we can run:

poetry new fpl-league

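This will create a directory called fpl-league. Once we have added a dependency (which creates poetry.lock) and our function (get_league.py), both covered below, it will have the structure:

fpl-league
├── poetry.lock
├── pyproject.toml
├── README.md
├── src
│   └── fpl_league
│       ├── get_league.py
│       └── __init__.py
└── tests
    └── __init__.py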

The purpose of these files is as follows:

pyproject.toml - A kinda config file for your package, contains information like name, version, author, license and any dependencies or build tools used.
README.md - Not python specific, just a file containing an overview of how to use / install the package and any other relevant information.
src/ - This directory will contain any of the actual source (src) code of your package.
tests/ - Contains any code or data used for testing your package.
__init__.py - This file marks the presence of python code and is used to control what gets exported from your package. You’ll notice there is one of these in tests/ and src/, the use is similar in each but in tests/ it makes code importable for testing and in src/ it makes code importable for users of the package.

Note: Testing will be covered briefly in the next blog and we also have some other blogs on the subject like ‘First Steps in Python Testing’ and ‘Advanced Testing in Python’.

Okay we’ve now got the skeleton of our package! Here is where we start fleshing things out. I know that for my package I’m going to be querying an API with the requests package. That means requests should be a dependency of my package, and that anybody who wants to use my package will also need requests installed.

To add requests as a dependency of my package we are again going to turn to Poetry and run poetry add whilst at the root level of the package:

poetry add requests

This will update our pyproject.toml to include requests as a dependency and create a new file called poetry.lock which contains all dependencies and sub-dependencies of our package with exact versions. The poetry.lock file is helpful for ensuring the code will work on any machine whilst developing the package. Here is what the toml file will look like after adding requests:

[project]
name = "fpl-league"
version = "0.1.0"
description = ""
authors = [
    {name = "osheen1", email = "osheen@jumpingrivers.com"}
]
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "requests (>=2.32.5,<3.0.0)"
]

[tool.poetry]
packages = [{include = "fpl_league", from = "src"}]

[build-system]
requires = ["poetry-core>=2.0.0,<3.0.0"]
build-backend = "poetry.core.masonry.api"

The change made here is the dependencies field has been updated to include requests.

Python Environments

Python Environments could be a blog post by itself so I will only cover the background briefly and why it’s important for package development. If you want to learn more about what they are, this Jumping Rivers blog comparing Python Environments and Barbie is helpful, and if you want to know whether you should be using one, this StackOverflow question should tell you.

A Python Environment or virtual environment (venv) is similar to any kind of environment in the data science world: it is a box where you have exactly what you need installed for the specific project it’s associated with. During package development using a venv ensures the reproducibility of the development environment across a team, as all developers will be using the same package versions. Like with everything in Python there are multiple packages and ways to set up a venv like venv or pipenv, but for this blog we are sticking with Poetry.

To use a virtual environment while developing your package:

poetry install

This will ensure all package dependencies are installed.

poetry env activate

This will give you a command to activate the venv, a source call to the path of the activation file for the venv. Alternatively you can run this which will also evaluate the command returned:

eval $(poetry env activate)

Once you have activated the venv, your terminal will display the name of that environment in brackets at the start of the prompt.

We can test the venv by checking that the installed packages match the poetry.lock file: enter a Python session inside the venv and compare the package versions against the system versions. I have only installed requests at this point, so that is the only package to check.

Inside the venv I am using a version of requests that satisfies the dependency “requests (>=2.32.5,<3.0.0)” defined in the pyproject.toml and poetry.lock files, rather than my system version, which is “2.31.0”.
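
One way to run that check from a Python session is with the standard library’s importlib.metadata. This is a minimal sketch rather than the exact commands used for this post, and installed_version is a helper name made up for the example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package as a string, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Run once inside the venv and once outside it, then compare the two answers
print(installed_version("requests"))
```

Inside the venv the printed version should fall within the range given in pyproject.toml; outside it you will see whatever your system Python has installed, or None.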

Then to exit the venv you can use:

deactivate

Adding a Function (& Intro to FPL)

Now we’ve learnt a bit about developing a Python package, the next thing to do is add the one and only function I’ll be putting in this package. The function I’m adding will be a wrapper around the Fantasy Premier League API. If you don’t already know, Fantasy Premier League (FPL) is an online game where you and other players pick real-life footballers for a team and score points based on their actions in real-life games; more information can be found on the website. There are multiple endpoints available for accessing things like player data and fixture difficulty (great summary of the API here). In fact, there is an existing Python package which uses them; check that out here.

I am focusing on something that is not covered by the other packages (as far as I’m aware) and that’s the league data. There is an endpoint for accessing the league table if you know your unique league ID:

https://fantasy.premierleague.com/api/leagues-classic/league_id/standings/

However, I want a summary of the league across the season so I can see progression throughout. This data could then be used to create some season summaries. Conveniently, at the time of writing I am at the top of the league I’ve entered with my friends, so I will appear at the top of the dataset that my function will return.

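To actually add my function I’ll create a file in src/fpl_league called get_league.py and in here I’ll define my function along with any packages I’ll need to run it. The function also uses pandas, so that needs adding as a dependency in the same way as requests (poetry add pandas):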


import requests
import pandas as pd
import json

def get_season_league(league_id = "485842"):
    api_url = "https://fantasy.premierleague.com/api/"
    url = api_url + "leagues-classic/" + league_id + "/standings/"
    response = requests.get(url)
    data = json.loads(response.text)
    # Current standings: one row per player in the league
    league = pd.DataFrame(data['standings']['results'])

    df = pd.DataFrame([])
    for index, row in league.iterrows():
        # Gameweek-by-gameweek history for this player
        player_query = api_url + "entry/" + str(row['entry']) + "/history"
        player_response = requests.get(player_query)
        player_data = json.loads(player_response.text)
        history = pd.json_normalize(player_data['current'])
        player_df = pd.DataFrame({
            'name': row['player_name'],
            'team_name': row['entry_name'],
            'event': history['event'],
            'points': history['total_points']
        })
        df = pd.concat([df, player_df])
    return df

Without going into too much detail on the code, I am querying the API to get the current standing of the league, then looping over each player and grabbing their weekly scores. The final output should have 5 rows per player (as there have only been 5 gameweeks so far), and the first few rows look like this:

name	team_name	event	points
Osheen Macoscar	What’s the Mata?	1	69
Osheen Macoscar	What’s the Mata?	2	137
Osheen Macoscar	What’s the Mata?	3	202
Osheen Macoscar	What’s the Mata?	4	284
Osheen Macoscar	What’s the Mata?	5	337
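
Because the function reads total_points, the points column appears to be a running total rather than a per-gameweek score. As a small sketch of the kind of season summary this enables, the helper below (weekly_points, a name invented for this example) recovers the score for each gameweek from the totals shown in the table.

```python
def weekly_points(cumulative):
    """Convert cumulative totals into per-gameweek scores."""
    previous = [0] + cumulative[:-1]
    return [total - prev for total, prev in zip(cumulative, previous)]


# Running totals for gameweeks 1 to 5, taken from the table above
totals = [69, 137, 202, 284, 337]
print(weekly_points(totals))  # [69, 68, 65, 82, 53]
```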

Using the Function

Now we’ve defined our function in the package, to use it we must enter the virtual environment and import our function from the module:

from fpl_league.get_league import get_season_league

We can also edit the __init__.py so we don’t need to explicitly load the function from the get_league module. So if we add the above code to the __init__.py file then all we need to do to load the function is:

from fpl_league import get_season_league

This makes it easier for users as they won’t have to remember the module name and will have a little bit less to type.
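
The re-export can be tried out without touching the real package. The sketch below builds a throwaway package on disk and imports from it; demo_pkg and the one-line stand-in function are made up for illustration and are not the fpl_league code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk (demo_pkg is a hypothetical name)
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "get_league.py").write_text(
    "def get_season_league(league_id='0'):\n"
    "    return f'league {league_id}'\n"
)
# The re-export: pull the function into the package namespace
(pkg / "__init__.py").write_text(
    "from demo_pkg.get_league import get_season_league\n"
    "__all__ = ['get_season_league']\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

from demo_pkg import get_season_league  # no module name needed

print(get_season_league("42"))  # league 42
```

Listing the function in __all__ is optional, but it documents which names the package intends to export.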

Next Up

So far we’ve covered creating our package with Poetry, managing our development environment and adding a function. In the next blog post we’ll be covering the next steps with package development including documentation, testing and publishing to PyPI.

For updates and revisions to this article, see the original post

Testing with {testthat}

Thu, 25 Sep 2025 23:59:00 +0000

One of our main projects at Jumping Rivers in the last year has been building the litmus platform for validation of R packages. Among other metrics of interest, an important component when assessing the quality of code within a package is unit tests. In this blog we discuss the main features of the {testthat} package, as a convenient way for testing R code.

Testing in R

Testing is an important step when developing code in R or any other language. If you are a Python user, you can consider reading our previous blogs on pytest. Writing tests helps us make sure that the code is working as expected. In the R ecosystem, the testthat package is one of the most widely used frameworks. In this blog we will explore some of the main features of {testthat}, highlighting some of the most useful functions with some examples.

Before starting: although it is possible to use {testthat} outside of an R package, it works best within one, so the directory structure of the code and testing code should look like this:

./testthatExample/
├── R/
│   ├── function1.R
│   └── function2.R
├── tests/
│   ├── testthat.R
│   └── testthat/
│       ├── test-function1.R
│       └── test-function2.R
└── DESCRIPTION

where the main functions, in our case function1.R, function2.R are stored in R/ and the tests are stored under tests/. All tests should be contained in files that start with test. Then automatically, when we run testthat::test_local() from the root directory, or using devtools::test() the tests are recognised accordingly.

Installing and Loading testthat

First, let’s install and load the package:

# Install testthat 
install.packages("testthat")

# Load the package
library(testthat)

Basic testthat Structure

The testthat package is built around three main components:

Expectations: The building blocks that check if a result matches what you expect
Tests: Groups of expectations that test a specific function or behavior
Test files: Collections of tests, typically organised by the functions they’re testing

Let’s start with the most commonly used expectations:

Testing Equality

The expect_equal() function tests for equality within a small numerical tolerance, which makes it a good choice for floating point numbers, while expect_identical() tests for exact equality.

expect_equal(2 + 2, 4)

expect_identical(c(1L, 2L, 3L), 1:3)

Testing Errors and Warnings

expect_error() checks if the code throws an error, expect_warning() checks for warnings and expect_silent() checks that code runs without errors or warnings. Although it is better practice to test for specific error and warning messages, we don’t have to. See in the code below, the first example of expect_error and expect_warning we haven’t passed a specific message to check for. This means if the code returns an error / warning respectively then the test will pass.

expect_error(log("not a number"))
expect_error(stop("Something went wrong"), "Something went wrong")

expect_warning(log(-1))
expect_warning(as.numeric(c("1", "2", "not_a_number")))

expect_silent(2 + 2)

Testing Data Types

The expect_type() and expect_[s3|s4|s7]_class() functions check whether the code returns an object that has the expected base type or inherits from a specified S3, S4 or S7 class.

expect_type(c(1, 2, 3), "double")
expect_type(1:3, "integer")
expect_s3_class(data.frame(x = 1:3), "data.frame")

Testing a simple function

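Let us have a look at a function which is stored inside the function1.R file and has the following structure:

# Function to calculate the sum of a numeric vector
get_sum = function(x) {
  # Reject non-numeric input (including NULL) with an informative error
  if (!is.numeric(x)) {
    stop("The argument of the function must be a number")
  }
  total = sum(x)
  total
}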

The tests that we can write for the above function would be:

# Tests for get_sum function
test_that("get_sum calculates the sum correctly", {
 expect_equal(get_sum(x = c(1, 2, 3)), 6)
 expect_equal(get_sum(c(0, 0)), 0)
})

test_that("get_sum handles invalid inputs", {
 expect_error(get_sum(NULL), "The argument of the function must be a number")
})

Here we have created a test case with some description. We start with the test_that function call, providing both a description of the test followed by the testing block.

Testing Plots

Here we give an example of testing a ggplot2 output and a base R plot.

ggplot2 plots are easier to test because they return structured objects with accessible components:

# ggplot2
library(ggplot2)

p = ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point()

test_that("ggplot structure is correct", {
 expect_s3_class(p, "ggplot")
 expect_equal(rlang::as_name(p$mapping$x), "mpg")
})

Note: this example may change in the future, as {ggplot2} has been rewritten to use S7 classes internally so that would require expect_s7_class.

Base R plots are harder to test because they produce immediate visual output without returning testable objects. A useful package when testing base R plots is {vdiffr}, with its expect_doppelganger function (which also works for ggplot objects). This allows us to perform a form of snapshot testing for our plot: on the initial test run an image is saved, and future test runs are compared against it.

Assume the following code is used to make a plot:

library(vdiffr)

# Function that creates base R plot
create_base_scatter = function(data) {
 plot(data$mpg, data$hp,
 main = "MPG vs Horsepower",
 xlab = "Miles per Gallon",
 ylab = "Horsepower",
 col = "blue",
 pch = 16)
 abline(lm(hp ~ mpg, data = data), col = "red")
}

And the testing code for the above function would be:

test_that("base R scatter plot visual output is correct", {
 expect_doppelganger("base_scatter_plot", {
 create_base_scatter(mtcars)
 })
})

The way expect_doppelganger works is that an svg of the plot is saved in a sub-directory of the tests directory. Upon future runs of the tests, a new image is generated and compared against the original: if they match, the test passes, but if they differ, the test fails. There are a few issues which can cause doppelganger tests to fail, like randomness in the plot or time/date-based variables, so keep these in mind when writing your tests.

Conclusion

The testthat package provides a robust and intuitive framework for ensuring code quality in R packages. From basic equality checks to plot validation, these testing strategies help catch bugs early and maintain reliable code as your package evolves. Whether you’re testing simple mathematical functions or complex data visualisations, incorporating comprehensive unit tests into your development workflow is essential for building trustworthy R packages. As demonstrated through the examples in this blog, testthat makes it straightforward to implement testing practices that will benefit both you and your package users in the long run. If you would like some further reading on {testthat}, then check out the website.

For updates and revisions to this article, see the original post

Boost Your Career with Jumping Rivers Free Monthly Webinars – Next Session on 18th September

Mon, 15 Sep 2025 23:59:00 +0000

Our free monthly webinar series is back, and the first session on 21 August – “Reports that Write Themselves: Automated Reporting with Quarto” was a fantastic success! It was wonderful to see the Jumping Rivers community grow, with so many data professionals joining, engaging, and sharing ideas. Next Webinar:

18 September, 13:05 BST – Building Scalable Shiny Apps with Asynchronous Programming

Full Webinar Schedule:

Date & Time (BST)	Topic
18 September	Building Scalable Shiny Apps with Asynchronous Programming
23 October	Understanding Posit: Ecosystem and Enterprise Use Cases
20 November	Machine Learning with Python
11 December	Accessible Shiny: Designing for All Users

Note: All webinars take place on the second last Thursday of each month at 13:05 UK time.

Why Attend:

Gain practical, hands-on skills in R, Python, Shiny, and Posit.
Connect with fellow data professionals and expand your network.
Exclusive discounts: Gain 30% off the Shiny in Production Conference (8–9 October 2025) and 30% off any of our public online courses.

Don’t miss out - register now at this link and join us for the next session on 18 September!

For updates and revisions to this article, see the original post

Beyond the AKS Basics: Practical Tips for Your Kubernetes Journey

Thu, 11 Sep 2025 23:59:00 +0000

I recently completed Microsoft’s Kubernetes on Azure course (here is an archived version) and while it provided a solid foundation, I wanted to share some practical insights and debugging techniques that weren’t covered. This post dives into real-world scenarios with Azure Kubernetes Service (AKS), offering tips for debugging containers and nodes, tackling tricky issues like Posit Workbench session failures, and leveraging tools like Packer. Plus, we’ll show you an example of how we debugged an initially perplexing and frustrating development issue using useful commands.

Following this course, I set up a large Azure-hosted Posit Workbench deployment. This involved VMs coordinating Kubernetes jobs for user development environments (like RStudio), behind a reverse proxy.

Azure Free Trial: A Great Starting Point

First things first, if you’re new to Azure, don’t forget to take advantage of the various free trials offered. Sometimes one and twelve month free trials are offered. It’s an excellent way to get hands-on experience with AKS. See here.

Level Up Your Debugging Skills

Peeking Inside Containers

Ever had a Kubernetes session refuse to start and wondered what’s going on under the hood? A super useful command is:

kubectl run -it --image <your_image> <your_container_name> -- /bin/bash

This lets you spin up a temporary container based on your image and get a shell inside. For example, when troubleshooting why some Kubernetes jobs for Posit Workbench weren’t starting, this command came in handy:

kubectl run -it --image ${AZURE_CONTAINER_REPOSITORY_NAME}.azurecr.io/${IMAGE_NAME} testme -- /bin/bash

In the above example, we’re using custom-built container images that we pushed to Azure Container Registry. They are based on images published by Posit, with extra customizations built in using Packer. Keep reading for more on Packer!

Diving into Nodes

Surprisingly, even with a managed service like AKS, you can debug the underlying nodes!

This proved invaluable when I needed to check the software version in use for implementing an NFS share. It lets you use a shell on the node itself:

kubectl debug node/<your_node_name> -it --image=ubuntu

Learn more about this powerful technique in the Kubernetes documentation.

Logs

Kubernetes logs are essential, but don’t forget the logs of other components in your system outside of Kubernetes. Many “always-on” Linux-based applications run as systemd services, which are managed with systemctl and write their logs to the journal. journalctl allows you to view those logs filtered by service unit (your application), time range, and specific keywords.

sudo journalctl -u $SERVICE_UNIT_NAME --since "$TIME_RANGE" -g "$SEARCH_TERM"

For example, when a certain Posit Workbench session (corresponding to a Kubernetes job) was having issues earlier that day, I could quickly find relevant events on the application’s virtual machine using this Linux command:

sudo journalctl -u rstudio-launcher --since today -g "$SESSION_ID"

This can often provide valuable context that complements your Kubernetes logs.
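
If you find yourself running this query often, it can be wrapped in a small helper. The sketch below is an illustration rather than part of the workflow described here: the function name journal_command and the session ID are made up, and the function only builds the argument list (using the same -u, --since and -g flags as above) so that it can be inspected before being handed to something like subprocess.run.

```python
import shlex


def journal_command(unit, since="today", pattern=None):
    """Build a journalctl argument list (hypothetical helper)."""
    cmd = ["sudo", "journalctl", "-u", unit, "--since", since]
    if pattern:
        # -g / --grep keeps only entries whose message matches the pattern
        cmd += ["-g", pattern]
    return cmd


# "abc123" is a made-up session ID, for illustration only
cmd = journal_command("rstudio-launcher", pattern="abc123")
print(shlex.join(cmd))
```

Passing the arguments as a list, rather than a single string, avoids shell quoting problems if a search term contains spaces.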

The Unexpected Culprit: Looking Beyond Kubernetes

Here’s a crucial lesson I learned the hard way. Sometimes, issues aren’t within your Kubernetes cluster at all. We had a setup with a reverse proxy sitting in front of applications on virtual machines with a Kubernetes backend. We anticipated users might experience some initial delay when launching jobs due to system resources. However, we were caught off guard when users started reporting 504 Gateway Timeout errors after exactly two minutes.

Our initial instinct was to deep-dive into the Kubernetes configurations. But after some head-scratching, our client pointed out the consistent two-minute interval. This was the key! It forced us to broaden our investigation to all components in the request path, even those outside the Kubernetes cluster.

Our troubleshooting process involved meticulously listing every component from the Kubernetes node all the way to the user’s browser. We then started checking timeout settings on each. Guess what? The reverse proxy (more specifically an Azure Application Gateway), sitting innocently in front of our VMs and the rest of our system, had a default two-minute connection timeout. If allocating a job to a node took longer than that, the proxy would prematurely close the connection, resulting in the dreaded 504 error.

This experience underscored the importance of considering the entire system architecture when debugging. Don’t just focus on Kubernetes – think about load balancers, proxies, firewalls, and any other piece of infrastructure that might be interacting with your cluster. We were lucky the problematic component was one of the first we checked!
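
One way to spot this kind of pattern earlier is to look at how long the failing requests actually ran. The following is a minimal sketch, not something we used at the time: the function name, the tolerance and the example durations are all illustrative assumptions. If every failure happens after almost exactly the same elapsed time, a fixed timeout somewhere in the request path is a likely suspect.

```python
from statistics import median


def suspected_timeout(failure_durations, tolerance=2.0):
    """Return the common duration (seconds) if all failures cluster around it, else None."""
    if len(failure_durations) < 3:
        return None  # too few samples to say anything
    centre = median(failure_durations)
    if all(abs(d - centre) <= tolerance for d in failure_durations):
        return centre
    return None


# Made-up example: seconds elapsed before each 504 was returned
durations = [120.1, 119.9, 120.3, 120.0]
print(suspected_timeout(durations))
```

A result close to a round number such as 120 seconds is a strong hint to go looking for a component whose configured timeout matches it.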

Automating Image Creation with Packer

Packer is a fantastic tool for building identical machine images for multiple platforms. These can – for example – then be pushed to Azure Container Registry for use in Azure Kubernetes Service, or used on VMs in Azure.

The real power comes from the ability to then run Ansible playbooks on top of a base image. This allows us to automate the installation of software and configuration, leveraging existing Ansible roles we have developed in-house which weren’t necessarily developed for Kubernetes sessions.

Summary

Kubernetes success goes beyond cluster configs. From debugging containers and nodes to tracing issues through proxies, real-world AKS work demands a full-system view. With the right tools and mindset, you’ll turn tricky problems into valuable lessons.

For updates and revisions to this article, see the original post

Who We Are and What We Do: Inside Jumping Rivers

Tue, 09 Sep 2025 23:59:00 +0000

At Jumping Rivers, we combine engineering, automation, and analytics to streamline your data workflows and make them more efficient. We take care of the tasks you don’t have the time or capacity for, improve processes you might not even know could be optimised, and work alongside your team to make your data and engineering operations easier and more effective. From pioneering startups to established organisations, we help our clients harness the power of data to work smarter and faster.

Who We Are

We’re a team of passionate problem-solvers, technologists, and analytics experts. Jumping Rivers isn’t your typical consultancy; we turn technical expertise and hands-on experience into measurable impact for our clients.

Our Expertise

Our team spans multiple disciplines, making us uniquely positioned to tackle any data challenge:

Data Science: AI, predictive modelling, machine learning, automation, and advanced analytics that uncover actionable insights.
Engineering: Robust, scalable pipelines, infrastructure, and automation to ensure your data is accurate, accessible, and efficient.
Cloud & Infrastructure: Expertise spans leading cloud platforms including AWS, Azure, Kubernetes, and Databricks. We design and manage secure, scalable, and high-performance cloud solutions that support complex pipelines, automated workflows, collaborative analytics, and AI/ML deployment. This ensures your data is reliable, accessible, and ready to drive actionable insights, both now and as your organisation grows.
Dashboards & Shiny: Development, maintenance, and support of Shiny applications and interactive dashboards, delivering insights in an accessible and actionable format.
Training: Bespoke sessions in R, Python, SQL, and more, empowering your team with the skills to thrive in a data-driven world.

Our Partners

We collaborate with industry leaders like Posit and Databricks, giving us and our clients access to cutting-edge tools, platforms, and innovations. These partnerships help us stay ahead of the curve and ensure our work is as forward-thinking as it is practical.

How We Handle Enquiries

We do things differently. When a client reaches out, our process is designed to ensure their specific needs are met:

Prompt Email Response: We acknowledge every enquiry quickly and professionally.
Discovery Meeting: We discuss the client’s challenges, goals, and context to fully understand the problem.
Custom Proposal: We craft a tailored solution that fits the client’s objectives, budget, and timeline, rather than a one-size-fits-all approach.

This personalised approach is why clients keep coming back. We don’t just deliver projects, we deliver results that make a real difference for teams and organisations.

Shiny in Production Conference

We’re proud to bring the data community together with our annual Shiny in Production conference. Our next conference takes place 8th–9th October in Newcastle, featuring inspiring talks, hands-on workshops, and unrivalled networking opportunities with leaders from large organisations.

Why We’re Different

Jumping Rivers is more than a consultancy. We combine deep technical expertise with a personalised, client-focused approach. We handle the complex, time-consuming, and high-skill tasks that make your team’s work easier and more effective. When you work with us, you’re not just hiring a service, you’re gaining a partner committed to making your data and engineering operations smarter, faster, and more impactful.

Our People: Experience, Growth, and Teamwork

At Jumping Rivers, we believe that great work starts with a great environment. Our team thrives in a supportive, welcoming culture where curiosity is encouraged, collaboration is the norm, and everyone’s voice is valued. We place a strong emphasis on team building and shared learning, creating opportunities for colleagues to grow their skills, share ideas, and tackle challenges together. It’s a place where innovation, positivity, and mutual support drive both personal and professional growth.

Time Series Forecasting in Python

Thu, 28 Aug 2025 23:59:00 +0000

In this post we will be introducing the concept of time series forecasting, with a focus on the ARIMA framework and how this can be implemented in Python. We will be using a publicly available data set and the following open source packages: seaborn, matplotlib, statsmodels, pandas, scikit-learn and NumPy.

Time series

In time series analysis we are interested in sequential data made up of a series of observations taken at regular intervals. Examples include:

Weekly hospital occupancy
Monthly sales figures
Annual global temperature

In many cases we want to use the observations up to the present day to predict (or forecast) the next N time points. For example, a hospital could reduce running costs if an appropriate number of beds are provisioned.

This is where time series modelling fits in. The most basic time series model is a simple linear regression, where we assume that the time series evolves linearly over time. For non-linear time series we can consider piecewise linear regression.
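
As a quick illustration of the linear regression baseline, the sketch below fits a straight line to a synthetic series and extrapolates it six steps ahead. The data is generated purely for this example (it is not the data set used later in the post), and np.polyfit is just one of several ways to do a least-squares fit.

```python
import numpy as np

# Synthetic monthly series: a linear trend plus noise (illustrative only)
rng = np.random.default_rng(42)
t = np.arange(36)
y = 50 + 1.5 * t + rng.normal(0, 2, size=t.size)

# Fit y = slope * t + intercept by least squares
slope, intercept = np.polyfit(t, y, deg=1)

# Extrapolate the fitted line to the next 6 time points
t_future = np.arange(36, 42)
forecast = slope * t_future + intercept
print(forecast.round(1))
```

A model like this can only ever predict a straight line, which is why we need something richer for series with seasonal or other repeating structure.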

What about more complex cases where we want to accurately capture subtle variations in the data? We will now demonstrate the ARIMA framework in Python using a real world data set.

ARIMA

ARIMA stands for “Auto-Regressive Integrated Moving Average” and is made up of three key parts:

Auto-regression: captures the relationship between an observation and the last k points (often referred to as “lagged” observations).
Integration: accounts for “non-stationary” trends by taking the difference between consecutive observations (a non-stationary trend could include an overall upward trend where the mean observation is increasing over time).
Moving average: accounts for the relationship between an observation and the residual error that would result from using a moving average model applied to the lagged observations.

The three components (AR, I, MA) are controlled by the parameters (p, d, q). Setting one of these to zero will eliminate that component of the model. For example, if the time series already appears to be stationary we could set d = 0 so that we do not perform differencing.

To demonstrate ARIMA on a real-world example, let’s load in the flights data set from the seaborn library:

import seaborn as sns

flights = sns.load_dataset("flights")
flights.head()

## year month passengers
## 0 1949 Jan 112
## 1 1949 Feb 118
## 2 1949 Mar 132
## 3 1949 Apr 129
## 4 1949 May 121

Let’s visualise the data:

import matplotlib.pyplot as plt

plt.plot(flights["passengers"])
plt.xlabel("month")
plt.ylabel("passengers")

The data includes the number of passengers that flew each month over a period of 12 years. We will start by fitting a model on the full data set, then try holding out some test data for forecasting.

Data inspection

We should begin by exploring the time series. There are a number of questions that could be asked. For example:

Is the trend non-stationary?
Does the plot feature a seasonal variation?

Just from looking at the plot above the answer to both of these questions is a clear “yes”! But what if the data was more noisy and it was not clear from a quick visual inspection? In that case we could try decomposing the time series into the following components:

Trend
Seasonal
Residual (i.e. after we have subtracted the trend and seasonal components)

Fortunately the statsmodels library has a seasonal_decompose() function for this exact purpose:

from statsmodels.tsa.seasonal import seasonal_decompose

y = flights["passengers"] # convenience variable
decomposition = seasonal_decompose(y, period=12)

For convenience we have assigned the passengers column of the original DataFrame to a variable called y. Because we expect a seasonal variation in the data we have chosen a period of 12 months. Let’s inspect the decomposition:

decomposition.plot()

The top panel shows the original raw time series.
In the second panel we see that there is indeed an increasing trend. We will therefore need to include some differencing in the model (d > 0).
The third panel shows the repeating seasonal component.
The fourth panel shows that there is still a non-random residual after the trend and seasonal component have been subtracted. This can result from the fact that the seasonal “peaks” in the original plot appear to grow in amplitude over time (i.e. it is not really a fixed seasonal pattern).

It is also important to study the autocorrelation function (ACF) and partial autocorrelation function (PACF). Again, the statsmodels library has everything we need:

The ACF plot shows how correlated observations are with other observations that are k time points away (we call this “lag-k”):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(y)
plt.xlabel("$k$")
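
To make the definition concrete, here is a minimal sketch of the lag-k sample autocorrelation written directly in NumPy. The function name and the synthetic sine-wave series are assumptions made for this example; in practice plot_acf does this work for you (along with the confidence bands).

```python
import numpy as np


def autocorrelation(x, k):
    """Sample autocorrelation of x at lag k."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    if k == 0:
        return 1.0
    # Covariance between the series and its lag-k copy, scaled by the variance
    return float(np.sum(centred[k:] * centred[:-k]) / np.sum(centred ** 2))


# Synthetic series with a 12-step seasonal cycle (illustrative only)
t = np.arange(144)
series = np.sin(2 * np.pi * t / 12)

for k in (0, 6, 12):
    print(k, round(autocorrelation(series, k), 2))
```

For this toy series the correlation is strongly negative at half a cycle (lag 6) and strongly positive at a full cycle (lag 12), which is the kind of signature a seasonal pattern leaves in an ACF plot.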

The PACF plot shows the “direct” correlation between observations at lag-k after removing the linear dependence of intermediate lags:

plot_pacf(y)
plt.xlabel("$k$")

Both plots start with a lag of 0, where the correlation is always 1. The ACF and PACF then typically drop down to close to zero. The point at which this happens can help to inform the values for our p and q parameters:

The value of k at which the ACF reduces to statistically insignificant values is regarded as a good choice for the q parameter. From the plot we see the ACF drops close to the confidence region at approximately k = 10.
The value of k at which the PACF appears to drop close to 0 is a sensible choice for the p parameter. Here the value k = 2 appears reasonable.

These are really just educated guesses. In practice it would be worth experimenting with the ranges 1 ≤ p ≤ 3 and 5 ≤ q ≤ 15 (we’ll not worry about this here).

The d parameter controls the amount of differencing:

d = 1 means we take the difference between every observation and the previous observation.
d = 2 means we difference the differenced time series again.
… and so on.
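
A small worked example may help here. The sketch below uses np.diff on a made-up quadratic series (not the flights data): the first difference still has a trend, while the second difference is constant.

```python
import numpy as np

# A quadratic series has a trend in both its level and its slope (illustrative only)
series = np.array([1, 4, 9, 16, 25, 36])

first_diff = np.diff(series, n=1)   # d = 1: y[t] - y[t-1]
second_diff = np.diff(series, n=2)  # d = 2: difference of the differences

print(first_diff.tolist())   # [3, 5, 7, 9, 11]
print(second_diff.tolist())  # [2, 2, 2, 2]
```

Note that each round of differencing shortens the series by one observation.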

The process should continue until the non-stationary trend is regarded as statistically insignificant. This can be done by eye, but a better way is to use the Augmented Dickey-Fuller (ADF) test:

from statsmodels.tsa.stattools import adfuller

adfuller(y)

## (0.8153688792060528, 0.9918802434376411, 13, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 996.6929308390189)

The first two values returned give us the test statistic and p-value, respectively. We also get the critical value cutoffs at the 1%, 5% and 10% levels. The null hypothesis of the ADF test is that the series is non-stationary. Without going into too much detail, the general rule is that if the test statistic is greater than the 5% cutoff then we cannot reject the null hypothesis, and the series should be treated as non-stationary.

In our case we should consider differencing the data and trying again. Let’s use the diff() function from statsmodels:

Taking a single difference results in a test statistic that is comparable to the 5% cutoff:

from statsmodels.tsa.statespace.tools import diff

adfuller(diff(y))

## (-2.829266824169999, 0.05421329028382552, 12, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 988.5069317854084)

Taking a second difference results in a test statistic that is much lower than even the 1% cutoff:

from statsmodels.tsa.statespace.tools import diff

adfuller(diff(y, k_diff=2))

## (-16.38423154246854, 2.7328918500140445e-29, 11, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 988.6020417275605)

We see that a second order difference (d = 2) produces a stationary trend according to the ADF test. It is important to avoid excessively high choices of d since this can introduce artefacts in the final model. So in practice it would be worth experimenting with both d = 1 and d = 2.
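
The rule of thumb applied to the three ADF outputs above can be sketched as a small helper. The function name is hypothetical, and the numbers are copied directly from the adfuller() results printed earlier, so nothing is recomputed here.

```python
def is_stationary(adf_result, level="5%"):
    """Rule of thumb: stationary if the test statistic is below the cutoff."""
    statistic, _p_value, _lags, _nobs, critical_values, _icbest = adf_result
    return statistic < critical_values[level]


cutoffs = {"1%": -3.4816817173418295, "5%": -2.8840418343195267, "10%": -2.578770059171598}

# Results copied from the three adfuller() calls above, keyed by the order of differencing
results = {
    0: (0.8153688792060528, 0.9918802434376411, 13, 130, cutoffs, 996.6929308390189),
    1: (-2.829266824169999, 0.05421329028382552, 12, 130, cutoffs, 988.5069317854084),
    2: (-16.38423154246854, 2.7328918500140445e-29, 11, 130, cutoffs, 988.6020417275605),
}

for d, result in results.items():
    print(f"d = {d}: stationary at the 5% level? {is_stationary(result)}")
```

With d = 1 the statistic only narrowly misses the 5% cutoff, which is another reason to try both d = 1 and d = 2.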

What about the seasonal trend in the data? This suggests that we should really be using the SARIMA framework (where the S stands for “seasonal”). That would involve twice the number of parameters, so let’s proceed with our simplified model and see how we get on.

Model fitting

Having analysed the time series we have arrived at a reasonable choice of our parameters: (p, d, q) = (2, 2, 10). As stated above, in practice we should really test a range of values but for now we will not worry.

The statsmodels library provides an ARIMA object for model fitting and forecasting. Let’s call it with our parameter choices:

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(y, order=(2, 2, 10))
model_fit = model.fit()

We can inspect the model using:

model_fit.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## SARIMAX Results
## ==============================================================================
## Dep. Variable: passengers No. Observations: 144
## Model: ARIMA(2, 2, 10) Log Likelihood -673.434
## Date: Wed, 27 Aug 2025 AIC 1372.867
## Time: 14:21:20 BIC 1411.293
## Sample: 0 HQIC 1388.482
## - 144
## Covariance Type: opg
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## ar.L1 0.0376 0.027 1.386 0.166 -0.016 0.091
## ar.L2 -0.9770 0.026 -37.090 0.000 -1.029 -0.925
## ma.L1 -0.4439 153.608 -0.003 0.998 -301.510 300.622
## ma.L2 0.9971 142.058 0.007 0.994 -277.432 279.426
## ma.L3 -0.4440 90.869 -0.005 0.996 -178.543 177.655
## ma.L4 0.2001 163.089 0.001 0.999 -319.449 319.849
## ma.L5 -0.2122 163.925 -0.001 0.999 -321.499 321.075
## ma.L6 0.2396 165.114 0.001 0.999 -323.377 323.857
## ma.L7 -0.0025 79.453 -3.19e-05 1.000 -155.728 155.723
## ma.L8 -0.6393 156.926 -0.004 0.997 -308.208 306.930
## ma.L9 0.1965 136.829 0.001 0.999 -267.984 268.377
## ma.L10 -0.8908 0.125 -7.104 0.000 -1.137 -0.645
## sigma2 660.0238 1.030 640.771 0.000 658.005 662.043
## ===================================================================================
## Ljung-Box (L1) (Q): 0.42 Jarque-Bera (JB): 10.61
## Prob(Q): 0.52 Prob(JB): 0.00
## Heteroskedasticity (H): 6.53 Skew: 0.08
## Prob(H) (two-sided): 0.00 Kurtosis: 4.33
## ===================================================================================
##
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## [2] Covariance matrix is singular or near-singular, with condition number 1.78e+22. Standard errors may be unstable.
## """

The summary of the fit provides the log likelihood, AIC and BIC metrics. If you’re testing different choices of the (p, d, q) parameters it’s worth comparing the AIC and BIC metrics (lower values suggest a better fit).
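
As a sanity check on how these metrics are defined, the sketch below recomputes them from the log likelihood in the summary using the textbook formulas AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L). Two inputs are assumptions on our part: k = 13 counts every row of the coefficient table (2 AR terms, 10 MA terms and sigma2), and n = 142 is the 144 observations minus the two lost to differencing, chosen because it reproduces the BIC printed above.

```python
import math

log_likelihood = -673.434  # from the summary above
k = 13                     # 2 AR + 10 MA coefficients + sigma2
n = 144 - 2                # assumption: observations remaining after d = 2 differencing

aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n) - 2 * log_likelihood

print(f"AIC: {aic:.3f}")
print(f"BIC: {bic:.3f}")
```

Both criteria penalise the 13 parameters, BIC more heavily, which is why a simpler model with a slightly worse likelihood can still come out ahead.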

The model summary also includes a couple of warnings, in this case concerning the covariance matrix. We will not worry about these messages for now, and inspect model residuals and forecasting ability as a way of assessing the quality of the fit.

Using pandas we can inspect the residuals:

import pandas as pd

residuals = pd.DataFrame(model_fit.resid)
residuals.describe()

## 0
## count 144.000000
## mean 0.489740
## std 28.441220
## min -91.170547
## 25% -14.729881
## 50% -0.702994
## 75% 16.190037
## max 112.000000

The residuals appear to be distributed close to zero.

residuals.plot(kind="kde", legend=False)
plt.xlabel("residuals")
plt.ylabel("density")

We can also plot the residuals over time to inspect the outliers.

residuals.plot(legend=False)
plt.hlines(0, 0, 144, color="black")
plt.xlabel("month")
plt.ylabel("residuals")

For an initial model this appears reasonable.

Forecasting

Now that we have a model we can try forecasting future time points. There are a number of possible use cases:

We may only be interested in forecasting the next month. We can simulate this with our data set by using a “rolling forecast” where the model is retrained on all of the data up to the current time point before predicting the next time point.
The model could also be used for quarterly or yearly forecasting, where we predict multiple future time points at once.

Let’s go with approach 1 first. We will start by splitting the time series into an initial training set and a hold-out test set:

y_values = list(y.values)
train, test = y_values[:96], y_values[96:132]

Since time series models are typically used to forecast into the future, a common practice for testing is to remove the end of the time series from the training set and hold it out for testing. Here we have set aside 3 years of data for testing.

We will now simulate 3 years worth of monthly forecasting, where every month we retrain the model with the latest data and produce a forecast for the next month. Forecasts are produced using the .forecast() method, which predicts the next time point by default.

predictions = []
current_params = None
for i in range(len(test)):
    model = ARIMA(train, order=(2, 2, 10))
    model_fit = model.fit(start_params=current_params)
    current_params = model_fit.params  # update the parameters
    output = model_fit.forecast()
    predictions.append(output[0])  # store the prediction
    train.append(test[i])  # update the training set

Depending on the model complexity this can take a few minutes to run (the above code chunk took 1-2 minutes). To save some optimisation time, at every time step we have used the best-fit parameters produced for the previous model as the starting parameters for the next model (using the start_params argument).
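
The same walk-forward pattern can be written independently of the model, which makes it easy to compare ARIMA against a trivial baseline. This is a minimal sketch with hypothetical function names and toy numbers: the forecaster here simply repeats the last observed value.

```python
def rolling_forecast(train, test, forecast_next):
    """Walk forward through test, predicting one step ahead each time."""
    history = list(train)  # copy so the caller's list is not modified
    predictions = []
    for observed in test:
        predictions.append(forecast_next(history))
        history.append(observed)  # reveal the true value before the next step
    return predictions


def naive_last_value(history):
    """Baseline forecaster: the next value equals the most recent one."""
    return history[-1]


# Toy numbers, for illustration only
preds = rolling_forecast([10, 12, 13], [15, 14, 18], naive_last_value)
print(preds)  # [13, 15, 14]
```

Unlike the loop above, which grows train in place, this version works on a copy, so the original training set is left untouched.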

We now have a list of predictions to compare against our test observations. Let’s plot these together and compute the root-mean-squared error (RMSE):

from sklearn.metrics import mean_squared_error
import numpy as np

rmse = np.sqrt(mean_squared_error(test, predictions))
print(f"RMSE: {rmse}")

## RMSE: 30.875187739583858


plt.plot(test, color="blue", label="observed")
plt.plot(predictions, color="red", label="ARIMA")
plt.xlabel("month")
plt.ylabel("passengers")
plt.legend()

The agreement looks reasonable.

Alternatively, we may want to predict all 12 months in the next year. Let’s use the final 12 months of data (which were left out of the above analysis):

test = y_values[132:]

We now retrain the model on the first 11 years worth of data and this time use it to forecast the next 12 months:

model = ARIMA(train, order=(2, 2, 10))
model_fit = model.fit()

output = model_fit.forecast(steps=12) # predict 12 time points
predictions = output[:12] # store the predictions

Let’s compare the predictions with the test observations:

rmse = np.sqrt(mean_squared_error(test, predictions))
print(f"RMSE: {rmse}")

## RMSE: 73.86518520797583


plt.plot(test, color="blue", label="observed")
plt.plot(predictions, color="red", label="ARIMA")
plt.xlabel("month")
plt.ylabel("passengers")
plt.legend()

The agreement is poor here. As noted earlier, ARIMA does not account for seasonal variation and we can see here the model is not able to reproduce the peak at month 6. It would therefore be worth repeating this analysis with the SARIMA method, which is also implemented in statsmodels.

Summary

In summary, we have introduced the ARIMA framework for time series forecasting using a real world example in Python. Along the way, we have learned about popular data visualisations for time series data and explored the time series analysis functions provided by the statsmodels package. Check out the statsmodels documentation for more examples.

It’s worth mentioning that, while ARIMA is a powerful method for time series forecasting, there are a number of other popular frameworks for different use cases:

SARIMA: expands on ARIMA by including a seasonal variation.
Prophet: an alternative time series framework that can capture yearly, weekly and daily seasonality.
DeepAR: an efficient deep learning algorithm designed to fit multiple time series with a single global model. This can outperform ARIMA in scenarios where hundreds of time series have to be modelled.

We may revisit these models in a later post. In the meantime, check out our recent blog series on MLOps, including model versioning, deployment and monitoring using the Vetiver framework.

Stem Separation - How AI Has Found Its Way Into Music Production

Thu, 14 Aug 2025 23:59:00 +0000

For quite some time, AI had kept its grubby little hands out of the music production world. Now, a good percentage of the plugins I see are advertised as “using AI” (a plugin is a piece of software you can “plug in” to an audio track to add effects or generate audio). These range from reverb removers (yes, that’s right, you can now remove the reverb from an audio recording) to EQ analysers. Today we’ll focus on stem separation.

What is a stem?

I’m approaching this blog as more of an introduction to stem separation. There might be a follow-up with more technical details later on, but plenty of articles already cover the details in depth.

Before we can separate stems, we need to know what a stem is.

Every song that you or I listen to will likely contain multiple instruments or elements. A classic band line-up might have a drummer, bassist, guitarist and singer. Orchestras can have up to 60 musicians! Nowadays, the vast majority of songs are produced in a Digital Audio Workstation (DAW) in which the number of tracks you can have is really only limited by the power of your computer.

A stem is an audio file that combines a group of the audio tracks recorded for a song in any of the above set-ups. There could be a vocal stem, containing all lead and background vocals combined into one audio track, or a drums stem with the kick, snare, hi-hats, etc. mixed together.

What is stem separation?

Take a piece of cake. What if I wanted to return it to its constituent parts of egg, flour, sugar etc.? Well, I can’t. With stem separation, we can take an audio file containing several stems and separate it into several audio files, one for each stem. Phew, I can get my eggs back!

Why is this useful?

Stem separation is useful because it unlocks creative, educational, and professional possibilities from a mixed audio track - even when the original session files are unavailable.

There are some legitimate legal uses of stem separation. The best one that comes to mind is the last ever Beatles song, Now And Then. AI was used to extract John Lennon’s vocals from an old demo, and Paul McCartney and Ringo Starr then turned it into the last ever Beatles record.

On the other hand, stem separation gives almost anyone with an internet connection the ability to access the stems of virtually any song - offering music producers a treasure trove of isolated vocals (and lawsuits).

What’s behind the magic?

Machine learning models. Think of it this way: every instrument makes a sound that usually has a fairly identifiable pattern on a spectrogram. The main body of a hi-hat lies around 10-15 kHz, whilst the energy of a bass guitar lies anywhere between 50 and 200 Hz. Sure, two different hi-hats will have different waveforms and frequencies, but the general pattern is the same.

Frequency graph of a hi-hat.

Frequency graph of a bass guitar.

These models are trained to understand frequency data of songs where the stems are available. Once we know that, we can apply filters to pick and choose which frequencies we want to keep from the original song.
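
To give a feel for the “keep only some frequencies” idea, here is a deliberately crude sketch in NumPy. It is a fixed band filter on two pure tones, not anything a trained model would produce, and the tone frequencies, the 50-200 Hz band and the variable names are all assumptions chosen for the example.

```python
import numpy as np

sample_rate = 44_100  # samples per second
t = np.arange(sample_rate) / sample_rate  # one second of audio

# Toy "mix": a 100 Hz bass tone plus a quieter 12 kHz hi-hat-like tone
bass = np.sin(2 * np.pi * 100 * t)
hat = 0.5 * np.sin(2 * np.pi * 12_000 * t)
mix = bass + hat

# Move to the frequency domain and keep only the 50-200 Hz band
spectrum = np.fft.rfft(mix)
freqs = np.fft.rfftfreq(mix.size, d=1 / sample_rate)
mask = (freqs >= 50) & (freqs <= 200)
bass_estimate = np.fft.irfft(spectrum * mask, n=mix.size)

# How close is the recovered stem to the original bass tone?
max_error = float(np.max(np.abs(bass_estimate - bass)))
print(max_error)
```

This works almost perfectly here only because the two tones do not overlap in frequency. Real instruments share frequency ranges and change over time, which is where the machine learning comes in.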

Of course, it’s a bit more complicated than that. For more technical details you can head to this article which focuses on the model behind music.ai’s stem separation (music.ai claim to have the best model).

How accurate is it?

As with any model, to measure its accuracy you need data for which the original stems are available.

Once the stems are separated, accuracy evaluation is done using SDR (Signal-to-Distortion Ratio). This is basically a measure of how much distortion and how many artefacts have been introduced during the separation process compared to the original stem. SDR is usually quoted in decibels, and higher is better.

Anyway, I’ll leave the SDR calculation until the next blog. To test its accuracy, why don’t we actually split some stems?

An example

Let’s take this 8 bar loop consisting of drums, bass, guitar and vocal samples that I put together (all royalty free, of course).

I’m using the built-in stem splitter from Logic Pro, Apple’s DAW for macOS. It’s generally considered to be lacking compared to other tools such as music.ai or lalal.ai, but it’s good for an example!

It takes maybe 4 seconds to run an audio clip of this size through the stem splitter, and this is the result.

You can clearly hear the distortion and artefacts that have been introduced into each clip. We’re still at the stage where stem separation algorithms struggle with music that has lots of hard transients (i.e. drums) or lots of components that share the same frequency range. It’s easy to hear the audio ducking in the vocals, bass and guitar when the drums are hitting and it has struggled quite badly on the guitar.

How We Do Training at Jumping Rivers: Seamless, Expert-Led, and Tailored to You

Tue, 12 Aug 2025 23:59:00 +0000

When it comes to data science training, one size doesn’t fit all. At Jumping Rivers, we’ve built our reputation around delivering customised, expert-led training that actually fits your team’s goals, tools, and workflows - whether you’re in healthcare, government, finance, or beyond.

From your first enquiry to post-course follow-up, our training process is fully managed by our experienced admin team. They act as project managers, coordinating every detail to ensure your training runs smoothly.

Check out our training page for more information or to see our course catalogue and upcoming open courses.

Start with a Free Training Audit

Not sure what your team needs? That’s what our free training audit is for. We’ll assess your current skill levels, challenges, and goals, then design a course (or training pathway) that hits the mark. No guesswork, just clarity.

Moving beyond legacy tools?

We’ve helped multiple organisations transition from legacy tools like SPSS, SAS, or proprietary R setups into streamlined workflows using R, Python, and SQL. Whether you’re modernising your analytical toolchain or just starting the journey, we can support smooth and confident transitions.

We also assist with setting up consistent, reproducible documents and reports using Quarto across your team, ensuring your outputs look great and follow best practices in reproducible research.

What Makes Jumping Rivers Training Different?

Tailored Content: Every course is designed around your data, your workflows, and your team’s skill level.
Expert Trainers: Our trainers are experienced data scientists, not just instructors.
Full Admin Support: You get project-managed coordination from enquiry to delivery.
Flexible Delivery: Online, onsite, or hybrid.
Post-Course Follow-Up: We don’t disappear after the session ends. We offer optional office hours and ongoing support.

The Jumping Rivers Training Journey

Step	What Happens	Who’s Involved
1. Enquiry	You contact us via email, website, or framework	You
2. Intro Call	We assess your team’s needs	You & JR Trainer & Admin
3. Proposal	Customised schedule + pricing sent for approval	JR
4. Coordination	We handle all the logistics	JR Admin Team
5. Delivery	Hands-on, engaging session	JR Trainer
6. Follow-Up	Feedback, future planning, optional support	You & JR

What Clients Say

“The Shiny course was excellent. Our team now feels confident building dashboards that actually get used.” — Head of Data, Financial Services Firm

“They handled all the admin and scheduling—it was a completely hassle-free experience.” — L&D Manager, Higher Education

“We wanted training that wasn’t just theoretical. JR helped us apply best practices in real-world settings.” — Data Manager, NHS Trust

Let’s Talk Training

Whether you need a one-off session, a full training pathway, or you’re just not sure where to start, we’re here to help.

📩 Book your free training audit today: reach out to us at training@jumpingrivers.com. We’d love to hear from you.

Shiny in Production 2025: Sponsors

Tue, 05 Aug 2025 23:59:00 +0000

Shiny in Production Conference wouldn’t be possible without our sponsors, so we wanted to take the time to tell you a little bit about them.

Don’t miss out on this great chance to learn from R experts and network with fellow data science enthusiasts! Tickets are available at the Shiny in Production website!

Posit

Posit (formerly known as RStudio) is a software company that builds both open‑source tools and professional solutions enabling teams to create, manage, and share reproducible data work in R, Python, and beyond.

ThinkR

ThinkR is a consultancy company offering development and training on R, RStudio and Shiny.

Datacove

Datacove are a Brighton-based data and analytics consultancy, specialising in Customer Analytics (unearthing your most valuable customers), Marketing Analytics (getting more out of your marketing budget) and Process Automation.

Newcastle University Solve

Newcastle University Solve (NU Solve) has been helping businesses, public sector organisations and industries to find answers to complex challenges for more than three decades. They emerged out of the Industrial Statistics Research Unit, which had successfully engaged with enterprises since 1984.

CRC Press

CRC Press is a scientific publisher that specializes in science, technology, engineering, mathematics, and medicine. They publish books and digital resources for researchers, academics, professionals, and students. CRC Press is part of the Taylor & Francis Group.

R Consortium

The central mission of the R Consortium is to work with and provide support to the R Foundation and to the key organizations developing, maintaining, distributing and using R software through the identification, development and implementation of infrastructure projects.


Animated Maps with {ggplot2} and {gganimate}

Thu, 31 Jul 2025 23:59:00 +0000

In this blog post, we are going to use data from the {gapminder} R package, along with global spatial boundaries from ‘opendatasoft’. We are going to plot the life expectancy of each country in the Americas and animate it to see the changes from 1952 to 2007.

The {gapminder} package we are using is from the Gapminder foundation, an independent educational non-profit fighting global misconceptions. They cover issues like global warming, plastic in the oceans and life satisfaction.

First we will load the full dataset from the gapminder package, and see what is contained within it.

data("gapminder_unfiltered", package = "gapminder")
names(gapminder_unfiltered)

## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"

Then we will filter the dataset to keep life expectancy data for the years from 1952 to 2007 (in 5-year steps).

A shapefile (*.shp) containing the geographical boundaries of each country can be imported using the {sf} R package.

library(sf)
library(dplyr)
if (getwd() == "/home/osheen/corporate-website"){
 world = st_read("content/blog/2025-animated-map/data/world-administrative-boundaries.shp") |>
 select(-"continent")
} else {
 world = st_read("data/world-administrative-boundaries.shp") |>
 select(-"continent")

}

## Reading layer `world-administrative-boundaries' from data source
## `/home/osheen/corporate-website/content/blog/2025-animated-map/data/world-administrative-boundaries.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 256 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -58.49861 xmax: 180 ymax: 83.6236
## Geodetic CRS: WGS 84

head(world)

## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -58.43861 ymin: -34.94382 xmax: 148.8519 ymax: 51.09111
## Geodetic CRS: WGS 84
## iso3 status color_code name
## 1 MNP US Territory USA Northern Mariana Islands
## 2 <NA> Sovereignty unsettled RUS Kuril Islands
## 3 FRA Member State FRA France
## 4 SRB Member State SRB Serbia
## 5 URY Member State URY Uruguay
## 6 GUM US Non-Self-Governing Territory GUM Guam
## region iso_3166_1_ french_shor
## 1 Micronesia MP Northern Mariana Islands
## 2 Eastern Asia <NA> Kuril Islands
## 3 Western Europe FR France
## 4 Southern Europe RS Serbie
## 5 South America UY Uruguay
## 6 Micronesia GU Guam
## geometry
## 1 MULTIPOLYGON (((145.6333 14...
## 2 MULTIPOLYGON (((146.6827 43...
## 3 MULTIPOLYGON (((9.4475 42.6...
## 4 MULTIPOLYGON (((20.26102 46...
## 5 MULTIPOLYGON (((-53.3743 -3...
## 6 MULTIPOLYGON (((144.7094 13...

One of the nice things about the {sf} package is that it stores geographical data in a specialised data-frame structure which allows us to merge our boundary data with the gapminder statistics using the same functions that we would use to combine more typical data-frames. Here we join the two datasets, matching the entries by country name, using the dplyr left_join function.

joined = left_join(gapminder_unfiltered,
 world,
 by = c("country" = "name")) |>
 st_as_sf()
head(joined)

## Simple feature collection with 6 features and 12 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 60.50417 ymin: 29.40611 xmax: 74.91574 ymax: 38.47198
## Geodetic CRS: WGS 84
## # A tibble: 6 × 13
## country continent year lifeExp pop gdpPercap iso3 status color_code
## <chr> <fct> <int> <dbl> <int> <dbl> <chr> <chr> <chr>
## 1 Afghanistan Asia 1952 28.8 8425333 779. AFG Membe… AFG
## 2 Afghanistan Asia 1957 30.3 9240934 821. AFG Membe… AFG
## 3 Afghanistan Asia 1962 32.0 10267083 853. AFG Membe… AFG
## 4 Afghanistan Asia 1967 34.0 11537966 836. AFG Membe… AFG
## 5 Afghanistan Asia 1972 36.1 13079460 740. AFG Membe… AFG
## 6 Afghanistan Asia 1977 38.4 14880372 786. AFG Membe… AFG
## # ℹ 4 more variables: region <chr>, iso_3166_1_ <chr>, french_shor <chr>,
## # geometry <MULTIPOLYGON [°]>

I am going to select the country column and plot that using the base R plot function for a quick visualisation.

joined |>
 select("country") |>
 plot()

Hmmmmmmm that doesn’t look quite right does it?

The issue here is a common one when grabbing a spatial boundaries file from the internet: the datasets being joined have different names for some of the countries. For example, the world data lists the USA as ‘United States of America’, whereas in gapminder it’s ‘United States’. The dplyr::anti_join function can be helpful for finding countries that don’t match. I will use fct_recode from {forcats} to align the world country names with gapminder. In the example below I am just fixing the USA, but you can see from the plot above that several other countries need to be recoded (19 in total). I am doing this behind the scenes to avoid clogging up the page.

library(forcats)
world = world |>
 mutate(name = fct_recode(.data$name,
 "United States" =
 "United States of America"))

Okay, let’s see what this looks like now.

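# Re-run the join so that the recoded names are picked up
joined = left_join(gapminder_unfiltered, world, by = c("country" = "name")) |> st_as_sf()
joined |>
 select("country") |>
 plot()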

That’s better! Now I’ve got the data I want to plot, I can use ggplot2 to start creating the visualisation that I will be animating. Before that, I will filter the data to keep only the Americas, then use geom_sf to plot the geometry data.

library(ggplot2)

americas = joined |>
 filter(continent == "Americas")

americas_plot = ggplot(americas) +
 geom_sf()

This plot looks good but I’m going to change the coordinate reference system (CRS) to one (“EPSG:8858”) that is designed for the Americas. I found this CRS on epsg.io, a website I would recommend if you are looking for different CRSs. st_transform can be used to change the CRS to EPSG:8858. This is what it looks like now:

americas = st_transform(americas, "EPSG:8858")

new_crs_plot = ggplot(americas) +
 geom_sf()

Okay so now the plot looks right we will start preparing it to be animated.

library(ggplot2)

plot = americas %>%
 filter(year == 2007) %>%
 ggplot() +
 geom_sf(aes(fill = lifeExp)) +
 labs(title = "Year: 2007",
 fill = "Life Expectancy") +
 theme_void() +
 ggplot2::scale_fill_viridis_b() +
 theme(legend.position = c("inside"),
 legend.position.inside = c(0.23, 0.23),
 plot.title = element_text(size = 15,
 hjust = 0.5),
 panel.border = element_rect(color = "black",
 fill = NA))

This is the plot we are going to animate now so we’ll use {gganimate}. The transition_states function partitions the data using a states column (here our ‘year’ column), iteratively creating a frame of the animation for each year value in the input data. The next function is animate which will convert these frames into a GIF. Note, make sure you have the dependencies installed or you may end up with 100 PNG files in your working directory rather than a GIF!

library(gganimate)

animation = plot +
 ggtitle("Year: {closest_state}") +
 transition_states(states = year)

animate(animation,
 renderer = gifski_renderer("img/map.gif"),
 alt = "Animation with missing values.")

The keener eyed of you will notice some countries don’t have a value for every year.

americas |>
 st_drop_geometry() |>
 count(country) |>
 arrange(n)

## # A tibble: 36 × 2
## country n
## <chr> <int>
## 1 French Guiana 1
## 2 Guadeloupe 1
## 3 Martinique 1
## 4 Aruba 8
## 5 Grenada 8
## 6 Netherlands Antilles 8
## 7 Suriname 8
## 8 Bahamas 10
## 9 Barbados 10
## 10 Belize 10
## # ℹ 26 more rows

So 25 countries have 12 observations (the max), four have 10, four have 8, and three have just 1. To fill in these blanks, I’m going to use {tidyr} to compute some mock values using the dataset mean for each year. The countries with a single observation keep their one real value, from 2002.

library(tidyr)

completed = americas |>
 mutate(country = forcats::fct_drop(country)) |>
 complete(year, country) |>
 select(country, lifeExp, year) |>
 group_by(year) |>
 mutate(lifeExp =
 replace_na(lifeExp,
 replace = mean(lifeExp,
 na.rm = TRUE)))

geoms = americas |>
 select(country) |>
 distinct()

plot = left_join(completed,
 geoms,
 by = "country") |>
 st_as_sf() |>
 st_transform("EPSG:8858") |>
 ggplot() +
 geom_sf(aes(fill = lifeExp)) +
 labs(title = "Year: {closest_state}",
 fill = "Life Expectancy") +
 theme_void() +
 ggplot2::scale_fill_viridis_b() +
 theme(legend.position = c("inside"),
 legend.position.inside = c(0.23, 0.23),
 plot.title = element_text(size = 15,
 hjust = 0.5),
 panel.border = element_rect(color = "black",
 fill = NA))

animation = plot +
 transition_states(states = year)

animate(animation,
 renderer = gifski_renderer("img/map2.gif"))

So that is our final animated map. Of course we could add more styling or complexity, maybe in a future blog. If you want to learn more about the topic, check out our Spatial Data Analysis with R course or another Jumping Rivers blog, Thinking About Maps and Ice Cream by Nicola Rennie.


Shiny in Production 2025: R Dev Day

Thu, 24 Jul 2025 23:59:00 +0000

Do you use R? Would you like to play a part in sustaining it? Find out about the R Dev Day that is returning as a satellite event to Shiny in Production 2025. This post will answer questions you may have, such as: “Do I need to be an R guru to participate?”, “What will I be expected to do?”, and “Is there a cost to attend?”. Hopefully by the end, you’ll be motivated to sign up!

What is an R Dev Day?

An R Dev Day is a hands-on collaborative event, where people work in small groups on contributions to base R or to infrastructure that supports such contributions from the community.

What do you mean by base R?

Base R is the colloquial term for everything that comes in the source distribution of R. From a user’s point of view, the main components are the R manuals and 14 packages, including base, datasets, graphics, and stats. This codebase is maintained by the R Core Team with contributions from the wider community.

What do you mean by infrastructure?

In this context, we’re using infrastructure to refer to any documentation or tooling that facilitates or encourages contribution. Some examples are the R Development Guide Quarto book, the R Dev Container containerised development environment, and the Translations Dashboard.

Do I need to be an R guru to participate?

Come as you are! We aim to prepare a range of tasks suitable for people with different skills, so you can find something that matches your knowledge and experience. When you register, you can select the areas that you’re interested in contributing to and let us know if you have particular skills to offer.

Don’t forget, you’ll be working in a small group, so you can benefit from each other’s expertise, and there will be experienced developers on hand to help out!

What will I be expected to do?

Tasks will be prepared in advance on the r-dev-day GitHub repo. You can check out some of the closed issues from past R Dev Days. Typical tasks include:

Contributing to fixing bugs in base R
- Creating a reproducible example (reprex), e.g., Bug 17148: rasterImage shows incorrect image orientation.
- Debugging an issue to find the root cause, e.g., Bug 17616 - Anomaly with contrast functions.
- Proposing a patch to fix an issue, e.g., making a minor change that has already been suggested, making larger changes to an R function after some analysis, or updating the C code underlying an R function to fix a bug.
Adding a new section in the R Dev Guide, e.g., Document how to make a feature request.
Translating messages, errors and warnings via the Weblate online interface.

We aim to prepare tasks where you can make good progress and report back at the end of the event. If you can continue to contribute on an ad-hoc basis after the event, e.g., responding to a review of your contribution or taking on a new task, that is very much appreciated, but we understand if you can’t.

When and where is the R Dev Day?

R Dev Day @ SIP 2025 will take place on the Tuesday afternoon and Wednesday morning, before the Shiny in Production 2025 tutorials. It will be in the same building as the main conference.

Is there a cost to attend?

No! The event is free to attend and open to people who are not attending Shiny in Production 2025. However, R Dev Day @ SIP 2025 participants receive 20% off Shiny in Production 2025 registration and early bird registration for the conference is open till Saturday 9 August, so we encourage you to register for both events while there is space left!


Importing Data with Python

Thu, 17 Jul 2025 23:59:00 +0000

Importing data is a key step in the data science workflow. It also carries a huge responsibility. How you import (or connect to) a dataset has consequences for how you work with that data throughout a project, because a Pandas DataFrame (say) requires Pandas-specific code. Similarly, your data constrains your code - if it can fit in memory on a single computer, the constraints are different than if your data is so large that its storage must be distributed. Data-import is a key place where a data-project can go wrong. If you import a dataset without validating that it contains sensible values, a nasty surprise may await you…

Python has wonderful libraries for data manipulation and analysis. You can readily work with data sources of a variety of types. For example:

that are too big to hold in memory;
that are distributed across a network;
that update rapidly;
or that don’t easily conform to a tabular, relational form.

The Python data stack is sufficiently mature that there are multiple libraries for all of these settings. There are some moves to introduce a standardised syntax for some data-frame functionality across libraries. At Jumping Rivers, we have a number of Python training courses, and teach how to use Pandas, PySpark and NumPy, and how to work with databases via SQL from Python.

Importing into memory

Firstly we will compare two in-memory data-frame packages: Pandas and Polars. We will work in a virtual environment (so that the packages installed are handled independently of the system-wide Python; see our Barbie-themed blog post on virtual environments).

# [Linux terminal commands]
# Create & activate a new virtual environment
python -m venv .venv
source .venv/bin/activate
# Install pandas and polars into the environment
pip install pandas polars

We will generate a simple dataset to import with the two packages. PyPI download numbers for a specific Python package can be obtained using the pypistats package. After installing it, we will pull out the number of downloads for the package pytest - see our recent blog posts for an introduction to this testing library.

# [Linux terminal commands]
# Install pypistats
pip install pypistats
# Obtain download-statistics for `pytest` in tab-separated format
pypistats python_minor -f tsv pytest > data/pytest-downloads.tsv

The structure of that file is straightforward. It records both the number, and the percentage, of pytest downloads across each minor version of Python (“3.8”, “3.9” and so on) for the last 180 days (a default time-span).

# [Linux terminal commands]
head data/pytest-downloads.tsv

## "category" "percent" "downloads"
## "3.11" "23.40%" 282,343,944
## "3.9" "18.78%" 226,548,604
## "3.10" "18.74%" 226,155,405
## "3.12" "15.48%" 186,819,921
## "null" "8.41%" 101,489,156
## "3.8" "6.13%" 73,965,471
## "3.13" "4.91%" 59,253,846
## "3.7" "3.36%" 40,551,618
## "3.6" "0.54%" 6,546,017

So it should be trivial to import it into Python using either Pandas or Polars.

files = {
 "pytest_data": "data/pytest-downloads.tsv"
}

import pandas as pd
downloads_pd = pd.read_csv(files["pytest_data"], sep="\t")
downloads_pd.head()

## category percent downloads
## 0 3.11 23.40% 282,343,944
## 1 3.9 18.78% 226,548,604
## 2 3.10 18.74% 226,155,405
## 3 3.12 15.48% 186,819,921
## 4 NaN 8.41% 101,489,156

import polars as pl
downloads_pl = pl.read_csv(files["pytest_data"], separator="\t")
downloads_pl.head()

## shape: (5, 3)
## ┌──────────┬─────────┬─────────────┐
## │ category ┆ percent ┆ downloads │
## │ --- ┆ --- ┆ --- │
## │ str ┆ str ┆ str │
## ╞══════════╪═════════╪═════════════╡
## │ 3.11 ┆ 23.40% ┆ 282,343,944 │
## │ 3.9 ┆ 18.78% ┆ 226,548,604 │
## │ 3.10 ┆ 18.74% ┆ 226,155,405 │
## │ 3.12 ┆ 15.48% ┆ 186,819,921 │
## │ null ┆ 8.41% ┆ 101,489,156 │
## └──────────┴─────────┴─────────────┘

We can see that pytest was downloaded about 280 million times on Python 3.11, which accounted for about 23% of downloads.

The syntax for importing the dataset using the two libraries is almost identical. The only difference for the read_csv() function is that you use sep=... in Pandas and separator=... in Polars.

Polars is more memory efficient than Pandas in a lot of settings. One of the memory efficiencies that Polars allows is filtering by rows and columns during import. To take advantage of this, you can use the scan_csv() function in Polars. This creates a lazy data-frame that can be directly manipulated by Polars DataFrame methods. These method calls are applied at the point when the data is loaded, rather than on an in-memory data-frame. The Polars website has far more details about the lazy API, and the benefits it can bring to your work.

Here, we will be working with eagerly-loaded, in-memory, data-frames - a good choice during exploratory work.

Validating a dataset

Have we loaded what we wanted?

The Pandas/Polars .dtypes attribute gives you information about the data-types in each column of a data-frame:

print(downloads_pd.dtypes)

## category object
## percent object
## downloads object
## dtype: object

print(downloads_pl.dtypes)

## [String, String, String]

Again, the formatting of the output looks a little different between the two packages, but the results are broadly similar: all of our data (the version-strings in ‘category’, the percentage values and the download counts) have been read as character strings. Pandas tells us they are ‘object’s, which typically means they are strings:

type(
 downloads_pd["category"][0]
)

## <class 'str'>

So Python has imported our data incorrectly. In a real project we would want to know about this, so we might validate the data in our datasets. We would also want to prevent data-import mistakes as far as possible (see later) by being more explicit about how each column is imported and converted.

Python has a range of packages for validating both the schema for, and the values in, a dataset.

For a general class, if you want to check the data-types that are stored in the fields, you could use Pydantic:

# [Terminal]
pip install pydantic

from pydantic import BaseModel, NonNegativeInt

class Person(BaseModel):
 name: str
 age: NonNegativeInt

Correct data-types cause no fuss:

Person(name="Russ", age=47)

## Person(name='Russ', age=47)

But incorrect data (here a negative age) throw errors:

Person(name="Buddy", age=-1)

## pydantic_core._pydantic_core.ValidationError: 1 validation error for Person
## age
## Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-1, input_type=int]
## For further information visit https://errors.pydantic.dev/2.11/v/greater_than_equal

There are similar validation tools that work with data-frames. For example, Pandera can work with Pandas, Polars and several other data-frame libraries. You would need to install a different Pandera extension depending on the data-frame library you are working with (“pandera[pandas]” for Pandas, “pandera[polars]” for Polars etc).

# Terminal
pip install "pandera[pandas]"

import pandera.pandas as pa

schema = pa.DataFrameSchema({
 "category": pa.Column(str, nullable=True),
 "percent": pa.Column(float, checks=pa.Check.in_range(0, 100)),
 "downloads": pa.Column(int)
})

schema.validate(downloads_pd)

## pandera.errors.SchemaError: non-nullable series 'percent' contains null values:
## 17 NaN
## 18 NaN
## Name: percent, dtype: object

Validation of the data has identified an issue with the dataset. There are missing values in the “percent” column towards the end of the dataset. There are other issues - the two numeric columns are currently strings - but let’s check out the issue that Pandera has identified first.

This is the end of the dataset:

downloads_pd.tail()

## category percent downloads
## 14 3.3 0.00% 2,259
## 15 2.6 0.00% 87
## 16 3.2 0.00% 68
## 17 Total NaN 1,206,596,056
## 18 Date range: 2024-12-31 - 2025-07-07 NaN NaN

There is some metadata included in the final couple of lines of the dataset that should be ignored at import. Let’s just ignore the final couple of lines at import (this isn’t a very robust solution, but is fine for now).
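A slightly more robust alternative to counting rows can be sketched with only the standard library: keep the rows whose first field looks like a Python minor version (or the literal "null") and drop everything else before passing the data on. The sample text below is a hypothetical stand-in for the file, and the helper name keep_version_rows is an assumption made for illustration; neither comes from pypistats itself.

```python
import csv
import io
import re

# Hypothetical sample that mimics the shape of the pypistats TSV output,
# including the two trailing metadata lines.
SAMPLE = (
    '"category"\t"percent"\t"downloads"\n'
    '"3.11"\t"23.40%"\t282,343,944\n'
    '"null"\t"8.41%"\t101,489,156\n'
    '"3.2"\t"0.00%"\t68\n'
    '"Total"\t\t1,206,596,056\n'
    '"Date range: 2024-12-31 - 2025-07-07"\t\t\n'
)

# A minor version looks like "3.11": digits, a dot, digits.
VERSION_PATTERN = re.compile(r"^\d+\.\d+$")


def keep_version_rows(text):
    """Return (header, rows), keeping only rows for a version or 'null'."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    rows = [
        row for row in reader
        if row and (VERSION_PATTERN.match(row[0]) or row[0] == "null")
    ]
    return header, rows


header, rows = keep_version_rows(SAMPLE)
print(header)
print([row[0] for row in rows])
```

Filtering on content rather than on a fixed row count means the import keeps working if a new Python version adds a line to the file.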

downloads_pd = pd.read_csv(files["pytest_data"], sep="\t", nrows = 17)
downloads_pd.tail()

## category percent downloads
## 12 3.40 0.00% 12,777
## 13 3.15 0.00% 2,882
## 14 3.30 0.00% 2,259
## 15 2.60 0.00% 87
## 16 3.20 0.00% 68

Now if we validate the Pandas data-frame, another issue has been identified.

schema.validate(downloads_pd)

## pandera.errors.SchemaError: expected series 'category' to have type str:
## failure cases:
## index failure_case
## 0 0 3.11
## 1 1 3.90
## 2 2 3.10
## 3 3 3.12
## 4 5 3.80
## 5 6 3.13
## 6 7 3.70
## 7 8 3.60
## 8 9 2.70
## 9 10 3.14
## 10 11 3.50
## 11 12 3.40
## 12 13 3.15
## 13 14 3.30
## 14 15 2.60
## 15 16 3.20

This is a more substantial issue - the Python version-strings (3.2, 3.3 and so on) have been converted into floating-point numbers during import. So now Python version “3.2” is Python 3.20 in the data-frame.
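The effect is easy to reproduce in plain Python. The sketch below uses a few of the version strings from the dataset; the helper version_key is a hypothetical name introduced here to show one way of ordering versions correctly.

```python
# Version strings are distinct as text, but not once parsed as floats.
print(float("3.10") == float("3.1"))


def version_key(version):
    """Split '3.10' into the integer tuple (3, 10) for correct ordering."""
    major, minor = version.split(".")
    return int(major), int(minor)


versions = ["3.9", "3.10", "3.11"]

# Sorting as floats puts 3.10 (read as 3.1) before 3.9.
sorted_as_floats = sorted(versions, key=float)
# Sorting on (major, minor) tuples gives the release order.
sorted_as_versions = sorted(versions, key=version_key)

print(sorted_as_floats)
print(sorted_as_versions)
```

This is why the ‘category’ column needs to stay a string on import.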

Rich Iannone has written a useful blog post comparing various data-validation libraries for Polars at the “Posit-dev” blog. The tools mentioned in his post can check more substantial matters than the missing-data, data-type and data-range issues that we mentioned above. In particular, the tool “pointblank” can create data-validation summary reports that can be used to report back to data-collection teams or to analysts. Python already had data-reporting tools like “great expectations” and “test-driven data analysis”.

Importing with data-type constraints

I knew the structure of the tab-separated file before attempting to load it with Pandas and Polars. That is, I knew that the percents looked like “12.34%” and that the download counts looked like “123,456,789” (with commas separating the thousands and millions). Neither package can automatically convert these number formats into the format that they require without a bit of help. Even if we explained what the post-import data-type should be for each column, the two libraries wouldn’t be able to parse the input data directly.

Both Polars and Pandas allow you to provide a pre-defined schema when importing data. For Pandas, you provide a dtype dictionary which specifies what the output column data-type should be. For Polars, you provide a schema argument, where the data-types are specified in a Polars-specific format (because the data is stored in Rust, the Python int and str data-types don’t work for Polars).

If we import our data using a schema, Pandas and Polars will complain:

downloads_pd = pd.read_csv(
 files["pytest_data"],
 sep="\t",
 nrows = 17,
 dtype={"category": str, "percent": float, "downloads": int}
)

## ValueError: could not convert string to float: '23.40%'

downloads_pl = pl.read_csv(
 files["pytest_data"],
 separator="\t",
 n_rows=17,
 schema={"category": pl.Utf8, "percent": pl.Float64, "downloads": pl.Int64}
)

## polars.exceptions.ComputeError: could not parse `"23.40%"` as dtype `f64` at column 'percent' (column number 2)
##
## The current offset in the file is 7 bytes.
##
## You might want to try:
## - increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
## - specifying correct dtype with the `schema_overrides` argument
## - setting `ignore_errors` to `True`,
## - adding `"23.40%"` to the `null_values` list.
##
## Original error: ```remaining bytes non-empty```

The errors arise because we need to also specify how to convert our data into the expected data-types when it isn’t obvious. This is done using ‘converters’ in Pandas:

downloads_pd = pd.read_csv(
 files["pytest_data"],
 sep="\t",
 nrows = 17,
 dtype={"category": str},
 converters = {
 # 12.34% -> 12.34
 "percent": lambda x: float(x.strip("%")),
 # 123,456,789 -> 123456789
 "downloads": lambda x: int(x.replace(",", ""))
 }
)

downloads_pd.head()

## category percent downloads
## 0 3.11 23.40 282343944
## 1 3.9 18.78 226548604
## 2 3.10 18.74 226155405
## 3 3.12 15.48 186819921
## 4 NaN 8.41 101489156

In Polars, the with_columns() method allows conversion to the expected data-types.

downloads_pl = pl.read_csv(
 files["pytest_data"],
 separator="\t",
 n_rows=17
).with_columns(
 # 12.34% -> 12.34
 pl.col("percent").str.replace("%", "").cast(pl.Float64),
 # 123,456,789 -> 123456789
 pl.col("downloads").str.replace_all(",", "").cast(pl.Int64)
)

downloads_pl.head()

## shape: (5, 3)
## ┌──────────┬─────────┬───────────┐
## │ category ┆ percent ┆ downloads │
## │ --- ┆ --- ┆ --- │
## │ str ┆ f64 ┆ i64 │
## ╞══════════╪═════════╪═══════════╡
## │ 3.11 ┆ 23.4 ┆ 282343944 │
## │ 3.9 ┆ 18.78 ┆ 226548604 │
## │ 3.10 ┆ 18.74 ┆ 226155405 │
## │ 3.12 ┆ 15.48 ┆ 186819921 │
## │ null ┆ 8.41 ┆ 101489156 │
## └──────────┴─────────┴───────────┘

Now if we validate our datasets against the Pandera schema, we should have a little more success:

schema.validate(downloads_pd)

## category percent downloads
## 0 3.11 23.40 282343944
## 1 3.9 18.78 226548604
## 2 3.10 18.74 226155405
## 3 3.12 15.48 186819921
## 4 NaN 8.41 101489156
## 5 3.8 6.13 73965471
## 6 3.13 4.91 59253846
## 7 3.7 3.36 40551618
## 8 3.6 0.54 6546017
## 9 2.7 0.20 2371860
## 10 3.14 0.02 300816
## 11 3.5 0.02 231325
## 12 3.4 0.00 12777
## 13 3.15 0.00 2882
## 14 3.3 0.00 2259
## 15 2.6 0.00 87
## 16 3.2 0.00 68

Nice.

Summary

In this post we have looked at importing data with both Pandas and Polars. We have seen that the import functions are similar (read_csv(...)) but that there are some subtle differences with how you specify some things (the column separator argument, the data-schema and so on). Explicit conversion of the imported data is performed differently between the two packages as well.

Once your data has been imported, you should check that it contains what it is supposed to: are the columns you need all present, do they store the correct data-types, and do the values within those columns sit within the expected range? Data validation libraries like Pointblank and Pandera are really useful for checking data-schema and data-values before you do anything critical with your dataset.


R Package Quality: Maintainer Criteria

Tue, 15 Jul 2025 23:59:00 +0000

This is the final part of a five-part series of related posts on validating R packages. Other posts in the series are:

At last we come to the final post! Over the previous four posts, we considered all aspects of how we validate packages. As we’ve constantly repeated, most individual scores aren’t that important. Instead, it’s the cumulative effect that’s important; it gives us a hint of where to spend our energy.

This final post considers the package’s maintenance aspects, including update frequency and bug management. The general idea behind this component is to understand whether bugs are addressed in a clear, quick and transparent manner. Some of the scores are subjective, for example scoring the bug closure rate. However, as this is combined with multiple scores, tinkering with any particular score has limited effect.

Score 1: Bug Closure Rate

A score based on the median bug closure rate. If longer than 12 months, give a score of 0; between 6 and 12 months, a score of 0.2; between 4 and 6 months, a score of 0.5; between 2 and 4 months, a score of 0.8; and if shorter than two months, give 1.
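As a minimal sketch, the bands above can be written as a step function. The function name is hypothetical, and the handling of the exact boundaries (2, 4, 6 and 12 months) is an assumption, since the description does not say which band each boundary belongs to.

```python
def bug_closure_score(median_months):
    """Map the median time to close a bug (in months) to a score in [0, 1].

    Boundary handling is an assumption: each boundary value falls into
    the slower (lower-scoring) band, except exactly 12 months, which
    still scores 0.2 because only "longer than 12 months" scores 0.
    """
    if median_months < 2:
        return 1.0
    if median_months < 4:
        return 0.8
    if median_months < 6:
        return 0.5
    if median_months <= 12:
        return 0.2
    return 0.0


for months in (1, 3, 5, 9, 13):
    print(months, bug_closure_score(months))
```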

An analysis of CRAN suggests 70% of packages have a bug closure rate less than two months.

Score 2: Maintainer

Binary score of whether a package has at least one maintainer. All packages on CRAN must have a maintainer.

Score 3: Source Control

A binary score of whether the package has an associated version-controlled repository. This isn’t just GitHub! It also includes R-Forge, GitLab, and the various other flavours of source control out there.

Score 4: Bug Reports URL

A binary score of whether a package links to a location where it is possible to file bug reports. If possible, we try to infer this URL. For example, if the website is a GitHub repo, then it’s almost certain to have an issues page.

Score 5: Bugs Status

The proportion of bug reports that are closed. If no issues have ever been opened, a value of 1 is returned.
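A short sketch of this calculation, with a hypothetical function name and argument names chosen for illustration:

```python
def bug_status_score(n_closed, n_open):
    """Proportion of bug reports that are closed; 1 if none were ever opened."""
    total = n_closed + n_open
    if total == 0:
        # No issues have ever been filed, so there is nothing outstanding.
        return 1.0
    return n_closed / total


print(bug_status_score(0, 0))
print(bug_status_score(3, 1))
```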

Score 6: The Number of Contributors

A score based on the number of contributors to the package. Returns 0 if a single contributor, 0.5 if two contributors, 1 if 3 or more contributors are found. Around 60% of CRAN packages have at least two contributors. Only 30% of CRAN packages have more than two contributors.

Score 7: Maintainer’s Other Packages

Score based on how many packages its maintainers have created on CRAN. A score of 1 indicates 3 or more CRAN packages, 0.5 two packages, and 0 for 1 or fewer packages. Around 60% have two packages on CRAN, and 40% have three or more packages.
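Scores 6 and 7 apply the same step mapping from a count (contributors, or CRAN packages by the maintainer) to a score. A minimal sketch, using a hypothetical helper name:

```python
def count_score(count):
    """Step score shared by Scores 6 and 7.

    Returns 1 for three or more, 0.5 for exactly two, and 0 otherwise.
    """
    if count >= 3:
        return 1.0
    if count == 2:
        return 0.5
    return 0.0


print(count_score(1), count_score(2), count_score(5))
```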

Examples

Package	No. of Contributors	Bug Status	Closure Rate
`{drat}`	1.00	0.75	0.80
`{microbenchmark}`	1.00	0.78	0.00
`{shinyjs}`	0.00	0.78	0.00
`{tibble}`	1.00	0.68	0.00
`{tsibble}`	1.00	0.81	1.00

For clarity, scores where all packages are 1 have been omitted from the table.

All packages have GitHub pages and are authored by experienced R developers. {shinyjs} scores 0 for the number of contributors, as there is only a single contributor. In the context of Shiny Application validation, a sole author is something to be aware of.

The (surprising?) bug closure rate is 0 for {tibble}, {shinyjs}, and {microbenchmark}. Looking at the GitHub Issues for {tibble}, there do seem to be a lot of long-term issues/features. Interestingly, we’ve found that many of the popular packages have a low closure rate score. This is usually because issues are also used to track future features. Again, individual scores aren’t the important issue. It’s the overall story!


R Package Quality: Code Quality

Thu, 10 Jul 2025 23:59:00 +0000

This is part four of a five-part series of related posts on validating R packages. Other posts in the series are:

In this post, we’ll take a closer look at code quality and how we can use automated tools to quickly get a feel for a package. The obvious package check is R CMD check. Anyone who has created a package is familiar with constantly running R CMD check to ensure that their package is note, warning and error free. However, that’s not the only tool we can draw on. Codebase size, security vulnerabilities and the number of exported functions all give a hint to the package quality.

When validating R packages, code quality contributes around 50% to the total. Remember to check out our dashboard to get an overview.

Score 1: Passing R CMD check

The bedrock of all good R packages! Packages are downloaded, installed and the standard R CMD check is performed. The score is 1 minus the weighted sum of errors (weight 1) and warnings (weight 0.25), with a maximum score of 1 (no errors or warnings) and a minimum score of 0. Essentially, a single error or four warnings is enough to return the lowest score of 0.
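As a minimal sketch of that rule, assuming the score is 1 minus the weighted penalty and floored at 0, the arithmetic looks like this. `check_score()` is a hypothetical name used for illustration.

```r
# Hypothetical helper: turn R CMD check counts into a score between 0 and 1.
# Assumption: score = 1 - (1 * errors + 0.25 * warnings), floored at 0.
check_score <- function(errors, warnings) {
  penalty <- 1 * errors + 0.25 * warnings
  pmax(0, 1 - penalty)
}

# Clean check, two warnings, one error
check_score(errors = c(0, 0, 1), warnings = c(0, 2, 0))
#> [1] 1.0 0.5 0.0
```

If the check is run with the {rcmdcheck} package, the counts could be taken from the lengths of the `errors` and `warnings` elements of its result.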

We are working on being more discerning on notes and warnings, but just now, it’s a relatively simple metric that highlights packages with potential issues.

Score 2: Codebase Size

This score is based on the R codebase size, as determined by the number of lines of R code. The general idea is that larger codebases are harder to maintain. Of course, the obvious question is “what is a large R codebase”?

Instead of coming up with arbitrary numbers, we analysed all packages on CRAN (2025/03). If a package is in the lower quartile for codebase size, the package is scored 1. Otherwise, the empirical CDF is used.
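To make the quartile-plus-ECDF idea concrete, here is a small sketch. The exact formula is our assumption (a score of 1 in the lower quartile, otherwise 1 minus the empirical CDF), and the reference values are invented for illustration rather than taken from CRAN.

```r
# Hypothetical helper for "smaller is better" metrics such as lines of code.
# Assumption: score 1 in the lower quartile of the reference distribution,
# otherwise 1 - ECDF(x), so larger values score lower.
lower_is_better_score <- function(x, reference) {
  cdf <- stats::ecdf(reference)
  q1 <- stats::quantile(reference, probs = 0.25, names = FALSE)
  ifelse(x <= q1, 1, 1 - cdf(x))
}

# Made-up reference distribution of lines of R code (not real CRAN data)
reference_loc <- c(50, 120, 200, 350, 500, 800, 1500, 3000)

lower_is_better_score(c(100, 500, 5000), reference_loc)
#> [1] 1.000 0.375 0.000
```

Under this assumption a package at the median of the reference distribution scores 0.5, and anything larger than every reference package scores 0.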

For those who are interested, the largest R package on CRAN had 100,000+ lines of R code!

Score 3: Security Vulnerabilities

If a package has a known security vulnerability, it receives a score of 0. This uses the {oysteR} package to detect issues.

Score 4: Release

This is a binary score: if the package under assessment is the latest version, it’s scored 1. Otherwise, a 0 is returned. We did investigate using a more sophisticated scoring system based on minor and major releases. But within the R community, semantic versioning isn’t consistently followed, so we opted for a simpler rule.
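The binary rule is easy to sketch with base R’s `package_version()`, which compares version strings component by component. `release_score()` is a hypothetical name, and how the latest version is looked up is left out here.

```r
# Hypothetical helper: 1 if the assessed version is the latest release, else 0.
release_score <- function(assessed, latest) {
  as.numeric(package_version(assessed) == package_version(latest))
}

release_score("1.2.0", "1.2.0")
#> [1] 1
release_score("1.1.9", "1.2.0")
#> [1] 0
```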

Score 5: Exported Namespace Size

Score a package based on the number of exported objects. Fewer exported objects mean the risk surface is lower, and bugs are potentially less likely. As with codebase size, the question is “what is large?” Analysing all packages on CRAN gave us suitable cut-offs. If a package is in the lower quartile for the number of exports, the package is scored 1. Otherwise, the empirical CDF is used.

Our analysis of CRAN suggests that most packages export relatively few objects. A modest package exporting 11 objects scores 0.5. Exporting around 26 objects reduces this to around 0.25.

Score 6: Unit Test Coverage

Score based on the fraction of lines of code which are covered by a unit test. For validation of packages in the pharmaceutical sector we also provide additional unit tests (remediated code coverage) and investigate the test coverage of exported functions.

Score 7: Dependencies

Score based on the number of dependencies a package has, with more dependencies giving a lower score. ‘Suggests’, ‘Enhances’, base or recommended packages are not considered as dependencies when calculating this score.
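Counting dependencies under these rules can be sketched in base R. We assume here that the Depends, Imports and LinkingTo fields are the ones that count; the text above only says which fields and packages are excluded. `count_dependencies()` and the example DESCRIPTION are hypothetical.

```r
# Base and recommended packages shipped with R, excluded from the count.
base_and_recommended <- c(
  "base", "compiler", "datasets", "graphics", "grDevices", "grid", "methods",
  "parallel", "splines", "stats", "stats4", "tcltk", "tools", "utils",
  "boot", "class", "cluster", "codetools", "foreign", "KernSmooth", "lattice",
  "MASS", "Matrix", "mgcv", "nlme", "nnet", "rpart", "spatial", "survival"
)

# Hypothetical helper: count dependencies from the lines of a DESCRIPTION file.
# Assumption: only Depends, Imports and LinkingTo are counted.
count_dependencies <- function(description_lines) {
  con <- textConnection(description_lines)
  on.exit(close(con))
  desc <- read.dcf(con, fields = c("Depends", "Imports", "LinkingTo"))
  fields <- desc[1, ]
  fields <- fields[!is.na(fields)]
  pkgs <- unlist(strsplit(fields, ","))
  pkgs <- trimws(sub("\\(.*\\)", "", pkgs))  # drop version constraints
  length(setdiff(unique(pkgs), c("R", "", base_and_recommended)))
}

# Invented DESCRIPTION for illustration
example_description <- c(
  "Package: examplepkg",
  "Version: 0.1.0",
  "Depends: R (>= 4.1.0)",
  "Imports: cli, rlang (>= 1.0.0), utils, MASS",
  "Suggests: testthat, knitr"
)

count_dependencies(example_description)
#> [1] 2
```

Only {cli} and {rlang} are counted: R itself, {utils} (base), {MASS} (recommended) and everything under Suggests are ignored.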

This is a data driven score, based on all packages on CRAN (2025/03). If a package is in the lower quartile for the number of package dependencies, the package is scored 1. Otherwise, the empirical CDF is used. In practice, this means that packages with around 5 dependencies are scored 0.5, which decreases to 0 at around twenty dependencies.

Dependencies can be an emotive topic! As with all other scores, this metric isn’t the “be all and end all”. It’s just an indication of package fragility.

Examples

For simplicity, we’ve removed the columns on vulnerabilities, R CMD check and release, as for all packages, the score was 1.

Package	Dependencies	Exported Namespace	Test Coverage	Codebase Size
`{drat}`	1.00	0.56	0.75	0.73
`{microbenchmark}`	1.00	1.00	0.56	0.84
`{shinyjs}`	0.82	0.13	0.03	0.66
`{tibble}`	0.36	0.12	0.82	0.17
`{tsibble}`	0.20	0.04	0.87	0.11

The scores above indicate that {tibble} and {tsibble} are relatively large, complex packages. These packages export many functions, and have multiple dependencies. Reassuringly, they have a high test coverage.

The {shinyjs} package has a worryingly low test coverage. However, inspection of the code shows that there are many manual tests that aren’t captured. This highlights a key point: automated checks aren’t enough, especially in a validated setting. Part of the litmus process is having a qualified person assess the package.

Elevate Your Skills and Boost Your Career with Jumping Rivers Free Monthly Webinars

Tue, 08 Jul 2025 23:59:00 +0000

Are you ready to expand your knowledge in R, Python, Shiny, and Posit while becoming a more valuable asset to your team? Jumping Rivers is here to help you do just that with our free monthly webinar series designed for data professionals at all levels. These 55-minute sessions are easy to join online and packed with practical insights to help you sharpen your skills, tackle real-world challenges, and stay ahead in the fast-evolving data landscape. Whether you’re looking to improve your coding, learn best practices for deploying apps, or dive into machine learning, there’s something here for you.

Webinar Schedule

Date & Time (BST)	Topic
13:05, 21 August	Reports that Write Themselves: Automated Reporting with Quarto
13:05, 18 September	Building Scalable Shiny Apps with Asynchronous Programming
13:05, 23 October	Understanding Posit: Ecosystem and Enterprise Use Cases
13:05, 20 November	Machine Learning with Python
13:05, 11 December	Accessible Shiny: Designing for All Users

Note: All webinars take place on the second last Thursday of each month at 13:05 UK time.

Meet Our Speakers and Partners

We’re proud to host a diverse range of expert speakers and partners who bring unique perspectives and deep expertise to each session. Keep an eye out for monthly speaker breakdowns by following us on our social media platforms under the name Jumping Rivers - your go-to source for data insights.

Benefits of Attending

Gain hands-on exposure to the latest tools and best practices in R, Python, Posit, and Shiny.
Grow your professional network by connecting with fellow data scientists, engineers, and experts.
Boost your career prospects with practical skills and industry insights that make you stand out.
Flexible, free learning — join from anywhere with no cost or commitment.
Exclusive discounts for attendees:
- Attend 2 sessions and get a 30% discount for the upcoming Shiny in Production in-person conference on the 8th-9th of October 2025 - an event where you can meet and learn from leading experts in the data science and data engineering communities, network with larger companies, and enjoy great food and engaging talks.
- Attend more than 2 sessions and receive a 20% discount on any of our public online training courses — known for their high quality and practical focus.

Ready to Join?

We look forward to welcoming you to our webinars and supporting your data science journey!

R Package Quality: Documentation

Thu, 03 Jul 2025 23:59:00 +0000

This is part three of a five part series of related posts on validating R packages.

In this post, we’ll take a closer look at package documentation and how it helps assess the “risky-ness” of a package. The documentation score evaluates how complete and helpful a package’s documentation is. Package documentation comes in many guises. It could be function examples, vignettes or even a website. We don’t believe every package must have a website, vignettes, and examples, but the absence of all three usually points to weak documentation.

When validating R packages, documentation contributes around 15% to the total.

Score 1: Exported Objects Documentation

A score based on the proportion of exported objects that have documentation. For example, if we have ten functions, but only eight are documented, then the score would be 0.8. For all packages on CRAN, this is almost certainly 1, but for packages that are only available on GitHub, this may be less.

Score 2: Proportion of Help Pages with Examples

A score based on the proportion of help pages that have examples.

Score 3: NEWS file

A NEWS file is an indication of a development and release cycle. It helps users understand what has changed between versions. This detects the presence of a NEWS file. Of course, R packages make this interesting with NEWS, NEWS.md, inst/NEWS.md and/or Changelogs!
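A presence check along these lines can be sketched in a few lines of base R. The list of candidate file names below is our assumption, based on the variants mentioned above; `has_news()` is a hypothetical helper.

```r
# Candidate locations for a NEWS file or changelog (an assumed, non-exhaustive list)
news_candidates <- c("NEWS", "NEWS.md", file.path("inst", "NEWS.md"), "ChangeLog")

# Hypothetical helper: 1 if any candidate file exists in the package source, else 0.
has_news <- function(pkg_dir, candidates = news_candidates) {
  as.numeric(any(file.exists(file.path(pkg_dir, candidates))))
}

# Illustration with a throwaway directory standing in for a package source tree
pkg_dir <- tempfile("examplepkg")
dir.create(pkg_dir)
has_news(pkg_dir)   # no NEWS file yet, so this is 0

file.create(file.path(pkg_dir, "NEWS.md"))
has_news(pkg_dir)   # now 1
```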

Score 4: Vignettes

Around 40% of CRAN packages have a single vignette, with only 10% having more than one vignette - we checked! For simplicity, this score is a simple binary metric, based on whether a package has any vignettes.

Score 5: Package Website

Does a package have an associated website? Ten or fifteen years ago, package websites were rare. Today, GitHub and GitLab make it easy for a package to host a website.

Score 6: NEWS updated to the Current version

This score checks whether the package’s NEWS file mentions the current version. If the NEWS file is outdated or missing, it is challenging to track recent changes, bug fixes, or updates. This lack of transparency may pose risks, as users are unable to verify whether critical updates have been implemented.
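One simple way to approximate this check is to look for the current version string in the NEWS text. This is a sketch under that assumption, not a description of how litmus does it; `news_mentions_version()` and the NEWS lines are invented.

```r
# Hypothetical helper: 1 if the version string appears anywhere in the NEWS text.
# Note: fixed matching is deliberately simple, so "1.2.0" would also match "11.2.0".
news_mentions_version <- function(news_lines, version) {
  as.numeric(any(grepl(version, news_lines, fixed = TRUE)))
}

# Invented NEWS content for illustration
news <- c("# examplepkg 1.2.0", "", "* Fixed a bug.", "", "# examplepkg 1.1.0")

news_mentions_version(news, "1.2.0")
#> [1] 1
news_mentions_version(news, "1.3.0")
#> [1] 0
```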

Summary

We can all agree that a package doesn’t need all of the components described above. It’s perfectly reasonable to have few examples, but very detailed vignettes. The important point is to investigate packages that have little documentation.

Examples

Using the packages from the previous blog post, and omitting scores where all packages scored 1, we have the following results:

Package	News Current	Vignettes	Examples
drat	1.00	1.00	0.43
microbenchmark	0.00	0.00	0.20
shinyjs	1.00	1.00	0.90
tibble	1.00	1.00	0.61
tsibble	0.00	1.00	0.82

All packages use source control, have a package website and provide documentation. The {microbenchmark} package doesn’t have a NEWS file or changelog, and it’s also missing vignettes. But recall it still has a high overall package score. The idea behind litmus isn’t that a package must be perfect, but to take a pragmatic approach to scoring.

Oddly, the {tsibble} package does have a NEWS file, but it doesn’t mention the latest version. I think this was an oversight.

Building Trust with Code: Validating Shiny Apps in Regulated Environments

Mon, 30 Jun 2025 23:59:00 +0000

This blog post is a follow up to my 2025 R/Medicine talk on Validating Shiny Apps in Regulated Environments.

Over the last few years Shiny has become a cornerstone in data science applications, from dashboards and review tools to interactive decision making apps. But in regulated environments like pharma, healthcare, or finance, the stakes are higher. A clever visualization isn’t enough. We need to prove the app works reliably, reproducibly, and transparently.

So, what does it actually mean to validate a Shiny app?

Why Validation Matters

Validation isn’t about ticking a box. It’s about building trust.

In regulated settings, apps influence real world decisions. Regulators expect traceability, reproducibility, and documentation. Without these, you’re not just at risk of bugs, you risk noncompliance. And that means delays, rework, or worse.

Think of validation as a safety net. It ensures the app behaves as expected, be it under edge cases, months down the line, or even when someone else deploys it.

We once helped a client whose Shiny app was blocked from deployment by their compliance team because there was no documentation of who had last changed a calculation. Adding logging and a simple GitHub workflow solved it overnight.

Validation doesn’t have to be complex. It just has to be intentional.

What Makes a Shiny App Validatable?

Not every Shiny app is born equal. But some design choices from the start can make validation easier down the line:

Modular, testable code: Keep logic in functions, not tangled in server.R.
Clear separation: UI, logic, and data should live in separate spaces.
Version control: For both code and data.
Reproducible environments: Ensure the development environment can be replicated.
Minimal hidden state: Avoid global variables or side effects.

These practices aren’t just about validation, they also make your codebase more maintainable and collaborative.
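As a small illustration of keeping logic out of the server, the calculation below lives in a plain function that can be checked without starting an app. Everything here is hypothetical: `summarise_response()`, the column names and the data are invented for the sketch.

```r
# Pure function: no Shiny objects and no global state, so it can be tested directly.
summarise_response <- function(data, min_dose = 0) {
  stopifnot(is.data.frame(data), all(c("dose", "response") %in% names(data)))
  kept <- data[data$dose >= min_dose, , drop = FALSE]
  data.frame(
    n = nrow(kept),
    mean_response = if (nrow(kept) > 0) mean(kept$response) else NA_real_
  )
}

# In an app, the server would only wire inputs to the function, for example:
# output$summary <- renderTable(summarise_response(trial_data(), input$min_dose))
# (trial_data() and input$min_dose are placeholders, not real objects.)

# A direct check of the logic, independent of any UI
trial <- data.frame(dose = c(0, 10, 20, 40), response = c(1.0, 1.5, 2.5, 4.0))
result <- summarise_response(trial, min_dose = 10)
stopifnot(result$n == 3, isTRUE(all.equal(result$mean_response, 8 / 3)))
```

In a real project the final check would normally sit inside a {testthat} test rather than a bare `stopifnot()`.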

Common Pitfalls (and How to Avoid Them)

Having seen a lot of Shiny applications over the years, I find that some common patterns come up again and again, especially when validating legacy apps.

Hardcoded file paths that break in production
Ad hoc data wrangling inside server functions
Global variables causing unpredictable behavior
No formal record of package dependencies
No tests. No logs. No idea who changed what or why

Sound familiar? You’re not alone. These are solvable problems, often with small changes that pay off in the long run.

The Unique Challenge of Shiny

Shiny is interactive by nature, which makes it harder to validate than static scripts. Here’s what makes it tricky and what to do about it:

Reactive chains hide logic. Break them down and add logging.
User controlled outputs might produce unexpected results. Validate downloadable content and limit inputs.
Deployment differences matter. Validate the version that’s actually in production.
No audit trail by default. Packages like {logger}, {loggit}, or custom logging can give you a starting point.
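For the custom logging route, a base R sketch of an append-only audit log might look like the following. The tab-separated format, the field choice and the `log_event()` name are all assumptions made for illustration; a package such as {logger} would replace this in practice.

```r
# Hypothetical helper: append one timestamped, tab-separated entry to a log file.
log_event <- function(log_file, user, action, details = "") {
  entry <- sprintf(
    "%s\t%s\t%s\t%s",
    format(Sys.time(), "%Y-%m-%dT%H:%M:%S%z"), user, action, details
  )
  cat(entry, "\n", sep = "", file = log_file, append = TRUE)
  invisible(entry)
}

# Illustration with a temporary file and invented users and actions
log_file <- tempfile(fileext = ".log")
log_event(log_file, "alice", "update_calculation", "changed threshold from 5 to 10")
log_event(log_file, "bob", "download_report")

readLines(log_file)
```

Each entry records who did what and when, which is the minimum an audit trail needs to answer.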

In Shiny apps, testing isn’t just about code, it’s about behavior. Think about what the user sees, clicks, and downloads. All of that needs to be validated.

Software Engineering for Validation

Good engineering habits go a long way:

Use {testthat} for logic
Combine with {shinytest2} for UI workflows
Use {lintr} and CI/CD pipelines to catch issues early
Set up a code review process
Automate documentation and testing reports

With that in mind, an example of a minimal validation stack could look something like:

{testthat} for unit testing
{shinytest2} for end to end checks
{renv} or Docker for environments
{logger} for audit trails
GitHub Actions (or similar) for automation

This is easier to implement when you build it in from the start.

Documentation: The Backbone of Validation

Documentation doesn’t have to be bureaucratic. It just has to be clear.

A great way to get started would be:

Functional Requirements Spec (FRS): What the app should do
Test Plan & Summary (TP/TSR): How you know it does it
README/User Guide: For both users and reviewers
Audit trail: Who changed what, when, and why
Reproducibility artifacts: renv.lock, Dockerfiles, Git commits

Matching Effort to Risk

Not every app needs the same level of scrutiny. That’s where a risk based approach, guided by your risk appetite, comes in.

Low risk: sandbox tools, exploratory dashboards → lighter touch
High risk: decision support, outputs used in reports or submissions → full validation

Start by defining the app’s intended use, data sensitivity, and audience. It helps you make smart trade offs.

“But it’s just an internal tool!”

Internal tools often evolve into production tools. Validation future proofs them.

“It slows us down!”

Done right, validation saves time. It catches bugs early and reduces friction with compliance teams.

Tools for Risk & Security

Beyond testing and documentation, assessing package level risk and security is essential, especially when your app depends on external libraries.

There are some tools out there that can help with this, including:

riskmetric: Evaluate risk across R packages using metrics like maintenance, documentation, and testing.
oysteR: Scan R packages for known security vulnerabilities via CVEs.
diffify: Compare changes between versions of R packages to identify what’s changed and what might break.
Litmus.dashboard: Explore package-level risk scores interactively and track changes over time.

How we deal with Shiny Validation at Jumping Rivers

At Jumping Rivers, we’ve been validating R packages for quite some time now, and in the meantime have developed the Litmusverse, a toolkit designed to make R package validation easier, more transparent, and aligned with regulatory expectations.

But how is that related to Shiny Validation? While a Shiny app doesn’t have to be a package, treating it as one simplifies validation a lot. It lets us apply the same best practices used for standard R packages: version control, documentation, testing, and reproducible environments. From there, we just add application specific validation steps.

Validate the Shiny application package dependencies using the Litmusverse workflow, using a scoring strategy that suits the application risk appetite.
Validate the application code itself using a separate scoring strategy that focuses on code quality and documentation rather than on the popularity or CRAN metrics we would use for dependencies. (Litmus allows scoring strategies to be tweaked at will, and can even include custom metrics if needed.)
Generate a report with the validation results from both the dependencies validation and the application validation.

Final Thoughts: Start Validated, Stay Validated

The best time to think about validation is at the start of your project. The second best time is right now.

Build with validation in mind.
Document as you go.
Automate wherever possible.
Choose tools that support transparency and traceability.

Validation isn’t a one time hurdle. It’s a habit you build with each commit, each test, each documented decision.

Validation isn’t a blocker, it’s a confidence booster. For you, your team, and your reviewers.

Get in Touch

If you’re interested in learning more about R validation and how it can be used to unleash the power of open source in your organisation, contact us.

R Package Quality: Package Popularity

Thu, 26 Jun 2025 23:59:00 +0000

This is part two of a five part series of related posts on validating R packages.

In our previous post, we introduced the four components that make up a litmus package score: documentation, popularity, code quality, and maintenance. In this post, we’ll look at package popularity. Package popularity is an interesting, and sometimes controversial, measure. In our experience it often sparks strong (and usually negative) reactions. The idea is simple: if a package is widely used, bugs are more likely to be found and fixed, and if the maintainer steps away, there’s a higher chance someone else will take over. Of course, high usage doesn’t mean a package is risk-free. But popularity can provide helpful context. Consider this example:

{pkgA}: Extremely popular and a dependency for many other packages.
{pkgB}: Very few downloads and minimal usage.

In a situation like this, {pkgA} may offer more stability over time, simply because more people rely on it. It does not mean that {pkgA} is risk free, only that the risk is lower than {pkgB}.

All other things being equal, if you had sixty minutes to assess both packages, would you spend thirty minutes on each, or weight your time towards the less popular package?

It’s important to keep in mind that statistical packages tend to be less popular than “foundational” ones. Packages for tasks like data wrangling, date-times, and plotting are used by nearly everyone, regardless of the use case. In contrast, more specialised packages, for example, those designed to handle experimental designs with drop-outs, naturally have a smaller audience.

So a lower popularity doesn’t necessarily reflect lower quality or usefulness. It may just reflect a more niche purpose.

Score 1: Yearly Downloads

For packages on CRAN, we can obtain download statistics. Of course, the obvious question is, “what is a large number of downloads?” To answer this question, we obtained the download statistics of every package on CRAN, and used that data as the basis of our score.

More precisely, if a package is in the upper quartile for the number of package downloads (approximately 7,000 downloads per year), the package is scored 1. Otherwise, the empirical CDF is used to score.
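For a “bigger is better” metric like downloads, the rule can be sketched as follows. We assume the score below the upper quartile is simply the empirical CDF value; the reference numbers are invented and are not real CRAN download counts.

```r
# Hypothetical helper for "larger is better" metrics such as yearly downloads.
# Assumption: score 1 in the upper quartile of the reference distribution,
# otherwise the empirical CDF value ECDF(x).
higher_is_better_score <- function(x, reference) {
  cdf <- stats::ecdf(reference)
  q3 <- stats::quantile(reference, probs = 0.75, names = FALSE)
  ifelse(x >= q3, 1, cdf(x))
}

# Made-up yearly download counts standing in for "every package on CRAN"
reference_downloads <- c(300, 900, 1500, 2500, 4000, 6000, 9000, 50000)

higher_is_better_score(c(500, 4000, 20000), reference_downloads)
#> [1] 0.125 0.625 1.000
```

Real download counts could be obtained with the {cranlogs} package and passed in as `reference`.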

Of course, you could choose a different period of time, say a month, or a trend over time. But our investigations suggest that having a variety of scores based on downloads adds very little new information, while increasing complexity.

Score 2: Reverse Dependencies

We also examine the number of reverse dependencies, that is, how many other packages rely on it. The reasoning is simple: if many packages depend on it, there’s a greater chance that bugs will be spotted and fixed. It also suggests that other developers have reviewed and trusted the package enough to build on top of it.

Similar to package downloads, we used all packages on CRAN as a basis for scoring. Packages in the top quartile for reverse dependencies receive a score of 1. All others are scored using the empirical cumulative distribution function (CDF). In practice, this ends up behaving like a near-binary score, since only a small number of packages have significant reverse dependencies.

Examples

We’ve selected five packages to illustrate these scores - the total litmus score is given in brackets:

{drat} (0.94): A fantastic little package that simplifies creating local R repositories.
{microbenchmark} (0.87): A useful utility package, for (precisely) measuring function calls in R.
{shinyjs} (0.90): Perform common useful JavaScript operations in Shiny apps, created by Dean Attali.
{tibble} (0.81): The cornerstone(?) of the tidyverse.
{tsibble} (0.80): Tibbles for time series.

All five packages, as we would expect, have a high overall litmus score; we didn’t want to pick on more risky packages!

For package popularity, which makes up 15% of the total litmus score, all five selected packages score a maximum of 1 for downloads and reverse dependencies. Potentially, we could change the score to make it a more “continuous” measure. For example, the number of downloads for {tibble} is always more than {tsibble}, as the latter depends on the former. However, the purpose of assessing packages isn’t to provide a ranked list of packages, it’s to identify packages that are potentially risky. So having a more continuous measure isn’t that helpful.

Summary

We tend to think about package popularity as a way of crowd sourcing information about the package of interest. As we’ve mentioned, it’s only a signal, and as such it contributes only 15% of the overall litmus score.
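To show how component weights like this could combine into one number, here is a sketch that treats the overall score as a weighted sum. Both the weighted-sum form and the 20% maintenance weight are our assumptions (the latter is simply what remains after the roughly 50%, 15% and 15% quoted in this series), and the component scores are invented.

```r
# Assumed weights: code quality, documentation and popularity as quoted in the
# series; maintenance is taken to be the remainder (an assumption).
component_weights <- c(
  code_quality = 0.50, maintenance = 0.20, documentation = 0.15, popularity = 0.15
)

# Hypothetical helper: weighted sum of named component scores.
overall_score <- function(components, weights = component_weights) {
  stopifnot(
    setequal(names(components), names(weights)),
    isTRUE(all.equal(sum(weights), 1))
  )
  sum(components[names(weights)] * weights)
}

# Invented component scores for illustration
overall_score(c(documentation = 0.8, popularity = 1, code_quality = 0.7, maintenance = 0.9))
#> [1] 0.8
```

With a 15% weight, even a popularity score of 0 would lower the overall score by at most 0.15, which is why we describe it as only a signal.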

Shiny in Production 2025: Lightning Talk Lineup

Tue, 24 Jun 2025 23:59:00 +0000

We are pleased to announce the lightning talks for this year’s Shiny in Production conference! We’ve already announced the full length talks (25 minutes each) in this blog. This blog however is all about this year’s lightning talks session (5 minutes per talk).

Lightning talks

Andreas Wolfsbauer - AGES - Austrian Agency for Health and Food Safety

Enhancing Epidemiological Surveillance with a Shiny Application for Standardized Data Analysis

The Agency for Health and Food Safety (AGES) is responsible for monitoring notifiable infectious diseases in Austria. Within the Institute for Surveillance & Infectious Disease Epidemiology, we have developed a Shiny application designed to provide standardized analysis and visualization of all (n=76) notifiable disease categories, by processing data from the epidemiological notification system of Austria.

The application offers a dashboard that enables users to select specific diseases and visualize data through interactive plots. Features include filtering by year, federal state and age-group, facilitating descriptive epidemiological analysis of the notification data. An analysis tab allows users to apply custom filters and generate tailored plots, enhancing the depth of data exploration. Users can download all plots along with the underlying data and generate a PDF report. They also have the option to export filtered data as a CSV file for further use. Further development plans include a starting page highlighting long-term trends, to provide a compact overview for quick identification of diseases in need of action. Additionally, we will create an information page that shows disease-specific metadata and analyses of seasonal trends.

Furthermore, discussions are ongoing to develop a dashboard for broader accessibility, initially within the organization, with potential public access. Challenges encountered include optimizing application performance and availability, particularly given the constraints of utilizing the free version of Shiny Server. To address this, we are exploring parallel and asynchronous programming techniques to enhance efficiency and responsiveness. Additionally, we are evaluating deployment solutions such as ShinyProxy to improve multi-user access and scalability.

David Carayon - INRAE

Rescuelog: a Shiny-Based Monitoring System for Lifeguards: Insights from Southwest France

Drowning prevention on coastal beaches relies heavily on lifeguard vigilance and timely intervention. However, traditional rescue data collection methods often suffer from inefficiencies, delayed reporting, and a lack of real-time analytics. To modernize lifeguard operations across the beaches of southwest France, we developed an end-to-end open-source data pipeline powered by R, Shiny, and ruODK.

At the core of this system is ruODK, an R package that facilitates seamless integration with Open Data Kit (ODK), a widely used tool for field data collection. Lifeguards use tablets running ODK Collect to log rescue incidents in real time, which are then ingested directly into an R-managed database. The data is processed, analyzed, and visualized through a Shiny dashboard, offering lifeguards and supervisors instant access to key operational insights, trend analysis, predictive models and customizable reports.

By leveraging R’s data manipulation capabilities (tidyverse) alongside Shiny’s interactivity, we achieved a fully automated and scalable monitoring system that replaces paper-based logs with a dynamic, data-driven approach. Initial deployments in 2023 (on five beaches) demonstrated significant improvements in efficiency and situational awareness, prompting an expansion to 80 beaches by 2025. The system’s open-source nature ensures cost-effectiveness, reproducibility, and adaptability for other regions and applications.

This project exemplifies how R and Shiny can power real-time decision-making in public safety operations. It also highlights the untapped potential of ruODK for bridging field data collection with analytical pipelines—showcasing an impactful use case of Shiny in production.

Kia Mack - Kent Wildlife Trust

Building the Kent BNG Register: Shiny for UI-First Development in a Small Charity Tech Team

R Shiny is a powerful and beginner-friendly tool for rapidly developing interactive applications, but is it the best choice for UI-first web design?

In this talk, we share our experience building the Kent Biodiversity Net Gain Site Register, a user-authenticated web portal that links the demand and supply of biodiversity credits. We’ll discuss the ways in which Shiny was a great fit—allowing rapid prototyping, seamless integration with R’s data analysis tools, and reactive programming. We’ll explore why it suited a small conservation charity with a two-person team, enabling us to build a functional, data-driven application without the need for specialist web development skills.

However, we’ll also examine its limitations, from performance bottlenecks to challenges in creating a polished, responsive UI. We’ll share the strategies we used to overcome these issues, including optimising reactive dependencies, using custom CSS and JavaScript for a more refined UI, implementing caching and database indexing to improve performance, and leveraging Shiny modules to enhance scalability.

Whether you’re considering Shiny for a large-scale project or looking for ways to improve an existing app, this talk will provide practical insights into where Shiny excels and what can be learned from mainstream web development languages to improve our use of Shiny.

Natalia Petersen - NHS England

Hackathon to Streamline the National Disease Registration Service Cancer Treatments Shiny App

The Cancer Treatments dashboard is an interactive tool, built in Shiny, produced by the National Disease Registration Service (NDRS), within NHS England. The Shiny app displays graphs and tables presenting statistics on surgery, chemotherapy, and radiotherapy treatments for patients diagnosed with cancer in England.

Users can select to view the data by demographic factors such as ethnicity and stage at diagnosis, and by geography, via dropdown menus. The app is refreshed annually and is publicly available, aimed at supporting the understanding of cancer treatments for both technical and non-technical audiences. The previous Shiny code was long and repetitive, making it difficult to navigate, challenging to de-bug, time-consuming to run, and prone to human error due to limited automation.

To address these concerns, whilst also delivering improvements to the user interface, the team took part in a targeted hackathon day where individuals each took on a specific workflow and set of objectives, guided by user feedback. The re-developed app is now built on the NDRS Shiny app template, ensuring consistent styling. Bespoke, reusable functions are sourced throughout the code, allowing for modularisation, and graphs are built using the Plotly package to improve usability and interactivity. All code required for producing the publication is available on GitHub, increasing transparency and scope for reuse.

Through collaborative effort, careful division of labour, communication in person and online, and application of Reproducible Analytical Pipeline principles we were able to successfully and quickly deliver improvements to the Shiny app, which will be published May 2025.

Rhian Davies - The Strategy Unit, NHS

The Accidental Engineers: Managing Shiny Apps, Pipelines, and Tech Debt in the NHS

How big should the hospitals of the future be? That’s the question we’re trying to answer. Our team has built a complex statistical model with over 100 parameters, using 140 million rows of patient data to help healthcare leaders plan for future demand. The model incorporates uncertainty, allowing users to explore different policy scenarios and compare their hospital against national benchmarks. But while the maths is complicated, the hardest part isn’t the modelling, it’s making sure everything keeps running smoothly.

What started as a small data science project has grown into a sprawling web of interconnected tools, and some days, it feels like we’ve become accidental software engineers. Maintaining multiple Shiny apps, APIs, and pipelines across R, Python, and PySpark means we’re now juggling Databricks workflows, GitHub Actions, Azure Blob Storage, and Posit Connect deployments. Every month, we release a new version of the model while ensuring legacy versions are maintained and compatible. And with technical debt piling up, we’re starting to ask: do we keep patching things, or should we tear it all down and start again?

This talk is an honest reflection on the challenges of managing large-scale Shiny apps in a high-pressure environment, how we balance new development with maintenance, and what we’ve learned along the way. The code for all our tools is available publicly on GitHub.

Samer Hijjazi - MD Anderson Cancer Center

From Clicks to Insights: Harnessing RSelenium in R Shiny Applications

This talk explores the opportunity to incorporate the RSelenium package into R Shiny applications. RSelenium is a package which allows users to perform web automation and advanced web scraping. In comparison to rvest, RSelenium can give you the ability to web scrape data from more difficult websites. This talk would teach the R community a lot about web automation, as well as showing another creative way of using R Shiny applications.

For updates and revisions to this article, see the original post

R Package Quality: Validation and beyond!

Thu, 19 Jun 2025 23:59:00 +0000

This is part one of a five-part series of related posts on validating R packages.

As is often the case, it’s pretty easy to talk about “good” R packages. We can even wave our hands and talk about packages following “software standards” or “best practices”. But what does that mean?

Most of us would agree that packages like {Rcpp} or {dplyr} are solid. At the other end of the spectrum, we could point to outdated, poorly tested or unmaintained packages as “risky”. But the reality is that most R packages fall somewhere in between.

Packages in this middle ground may excel in some respects whilst falling short in others, or be perfectly adequate for specific use cases whilst being unsuitable for mission-critical applications.

The primary objective of this post is to help organisations and individual practitioners develop a clearer, more systematic understanding of the packages they depend on. Any scoring system will have limitations: some genuinely high-quality packages might receive unexpectedly low scores due to specific circumstances, whilst some packages with significant underlying issues might score well on surface-level metrics. This doesn’t diminish the considerable value of a consistent, structured framework for package assessment.

In developing Litmus, our solution for R package assessment and validation, we’ve had to wrestle with these concepts in great detail. We have come up with a framework that we believe addresses the challenges presented by package validation. In the coming series of Litmus blog posts, we will be examining in detail the choices we made to balance the need for both robustness and flexibility in R package quality assessment.

Before examining the specifics of how we evaluate and score packages, it’s crucial to understand the foundational principles that underpin our methodology. In this post, we will be digging into the core principles of our approach.

Guideline 1: Scores are not static

At first glance, this principle might appear counterintuitive, but it reflects a fundamental reality: the standards we apply to R packages today cannot reasonably be identical to those we might have employed in 2015, nor should they remain unchanged looking forward to 2030.

Consider the obvious evolution in scale: package download numbers have increased dramatically over the past decade, reflecting both the growth of the R community and the maturation of package distribution infrastructure. Since the number of downloads is a metric of package popularity, what counts as a high or low number of downloads needs to be adjusted periodically. More subtly but equally importantly, the general tooling ecosystem has improved dramatically. Modern development practices now routinely include automated testing via GitHub Actions, comprehensive code coverage analysis, automated dependency checking, and sophisticated static analysis tools. Packages developed today have access to these resources in ways that simply weren’t feasible or standard practice a decade ago.

Furthermore, our scoring approach is explicitly tied to specific package versions. When a maintainer releases a new version of a package, potentially addressing security vulnerabilities, improving documentation, adding new features, or enhancing test coverage, the previous version often becomes a less optimal choice despite having been perfectly adequate when it was current.

Solution: We implement an annual comprehensive audit of our scoring mechanisms. This yearly review process serves multiple functions: updating the underlying data used to generate scores where relevant (such as adjusting download thresholds to reflect ecosystem growth), introducing new scoring criteria as best practices evolve, and retiring metrics that may have become less relevant or discriminatory.
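As a rough illustration of why download thresholds need rebasing, the sketch below scores a package by where its downloads fall within a reference year's distribution. The download counts and the `download_score()` helper are made up for this example and are not part of Litmus.

```r
# Hypothetical yearly download counts for a handful of packages.
downloads_2015 <- c(200, 1500, 8000, 40000, 250000)
downloads_2025 <- c(2000, 15000, 90000, 600000, 5000000)

# Score a package by the proportion of reference packages with
# the same number of downloads or fewer (empirical CDF).
download_score <- function(pkg_downloads, reference_downloads) {
  ecdf(reference_downloads)(pkg_downloads)
}

# The same raw count ranks very differently against each year's data.
score_vs_2015 <- download_score(40000, downloads_2015)
score_vs_2025 <- download_score(40000, downloads_2025)
c(vs_2015 = score_vs_2015, vs_2025 = score_vs_2025)
```

Refreshing the reference data once a year is one simple way to keep a popularity metric meaningful as the ecosystem grows.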

Guideline 2: Scores shouldn’t change often

While we acknowledge that scores are transient, they shouldn’t change often or dramatically. For example, it makes sense to audit our scoring mechanism for downloads yearly and adjust the criteria. This would change package scores, but only in a small way.

Solution: We maintain disciplined annual audits of our scoring mechanisms, with changes implemented deliberately and with clear documentation of the rationale. Between these annual reviews, scoring criteria remain stable unless critical issues are identified.

Guideline 3: Cutoffs depend on use cases

In an ideal world, we should “hand analyse” all packages, spending time assessing each package individually. From a practical perspective, focusing our attention on the borderline packages, those that are almost good enough or just good enough to make the cut, makes sense. However, what constitutes “borderline” varies dramatically depending on the intended application. A package being considered for use in a regulatory submission to the FDA faces entirely different quality requirements compared to one being used in an MSc Statistics project or an exploratory data analysis. The former context demands extensive validation, comprehensive documentation, and demonstrated stability, whilst the latter might reasonably accept some additional risk in exchange for cutting-edge functionality or convenience.

Solution: Rather than imposing universal “risky” package thresholds, we advocate for situation-dependent cutoffs that reflect the specific requirements and risk tolerance of different use cases. We provide guidance for establishing appropriate thresholds for common scenarios whilst recognising that organisations may need to customise these based on their specific regulatory, commercial, or academic contexts.

See our post on Risk Appetite for more on this.

Guideline 4: “Good” packages may have serious issues

It’s crucial to recognise that even the most well-regarded packages can face problems that lie entirely outside their maintainers’ direct control. For example, a package might depend on a system library that subsequently reveals a security vulnerability, or one of its dependencies might become unmaintained. Alternatively, changes in the broader R ecosystem—such as modifications to base R or updates to critical dependencies—might create compatibility issues that haven’t yet been addressed. These scenarios highlight why a single numerical score, whilst valuable for initial triage, cannot capture the full complexity of package risk assessment. Some issues represent genuine “showstoppers” that require immediate attention regardless of a package’s overall score.

Solution: Whilst maintaining our commitment to clear, interpretable numerical scores for initial assessment, we supplement these with specific flags for “showstopper” issues that require immediate human review. These might include known security vulnerabilities, dependencies on risky packages, or compatibility issues with current R versions.
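One way to picture this is an assessment record that carries the score and the flags side by side, so a high score can never hide a showstopper. The `assess_package()` helper and its field names below are hypothetical, purely for illustration.

```r
# Hypothetical assessment record: a numeric score plus showstopper flags.
assess_package <- function(score, flags = character()) {
  list(
    score = score,
    flags = flags,
    # Any flag triggers human review, regardless of the score.
    needs_review = length(flags) > 0
  )
}

# A well-scoring package that still needs a human to look at it.
good_but_flagged <- assess_package(
  score = 0.92,
  flags = "known security vulnerability in a system library"
)
good_but_flagged$needs_review
```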

Guideline 5: Avoid cliff edges

Regardless of your statistical persuasion, we can all agree that having a super hard cut-off of “p = 0.05” is silly. The idea that “p = 0.05000001” is “not significant”, but “p = 0.04999999” can change the world, doesn’t really make sense. The same idea should apply to scores. Where possible, the scoring mechanism should be smooth and continuous.

Solution: We employ continuous, smooth scoring functions wherever possible. For example, rather than awarding full points for packages with >80% test coverage and zero points for those with <80%, we use gradual scoring curves that reward improvements at all levels whilst still recognising meaningful distinctions in quality.
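To make the contrast concrete, here is a minimal sketch comparing a hard 80% cut-off with a smooth alternative. The logistic curve and its `midpoint` and `steepness` values are illustrative assumptions, not the curve Litmus actually uses.

```r
# Cliff-edge scoring: all or nothing at an 80% coverage threshold.
score_step <- function(coverage, cutoff = 0.8) {
  as.numeric(coverage >= cutoff)
}

# Smooth alternative: a logistic curve that rises gradually.
# `midpoint` and `steepness` are illustrative tuning values.
score_smooth <- function(coverage, midpoint = 0.6, steepness = 12) {
  1 / (1 + exp(-steepness * (coverage - midpoint)))
}

# Compare the two approaches either side of the 80% threshold.
coverage <- c(0.50, 0.79, 0.81, 0.95)
data.frame(
  coverage = coverage,
  step = score_step(coverage),
  smooth = round(score_smooth(coverage), 2)
)
```

With the step function, moving from 79% to 81% coverage flips the score from zero to full marks. With the smooth curve, the same change nudges the score only slightly, while improvements at every level are still rewarded.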

Guideline 6: Not all scores are created equally

A score based on whether or not there is a maintainer should count more towards an overall score than a score based on whether or not there is a website URL. The former is more important than the latter in most cases, and thus should contribute more towards an overall score, if this overall score is to be considered useful.

Solution: We use a scoring strategy that weights individual metrics sensibly within categories, and the categories are in turn weighted to reflect their relative importance. We will discuss this strategy in more detail in a later blog post, but here is the general idea.

We think of package quality as having four attributes:

Documentation (weight 15%): Assesses the quality and completeness of the package documentation. This is clearly subjective, as a package with full documentation could still have “bad” or outdated documentation. Nevertheless, packages that lack examples in their help pages, vignettes or NEWS files have lower scores.
Code (weight 50%): Evaluates the quality and structure of the package code. Key components of this score include package dependencies (always a controversial topic), the number of exported objects, vulnerabilities, and test coverage.
Maintenance (weight 20%): Reviews standard maintenance aspects of the package, including update frequency, bug management, and number of contributors.
Popularity (weight 15%): Reviews the package’s popularity. This includes package downloads over the last year and reverse dependencies. The idea is that these are strong indicators that the community has already placed trust in that package.

These numbers can of course be adjusted.
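As a minimal sketch of how the four category weights combine into one number, assume each category has already been scored between 0 and 1. The category scores and the `overall_score()` helper below are hypothetical.

```r
# Category weights quoted above; they must sum to 1.
weights <- c(
  documentation = 0.15,
  code = 0.50,
  maintenance = 0.20,
  popularity = 0.15
)
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Combine per-category scores (each in [0, 1]) into one overall score.
overall_score <- function(category_scores, weights) {
  stopifnot(setequal(names(category_scores), names(weights)))
  sum(category_scores[names(weights)] * weights)
}

# Hypothetical category scores for an imaginary package.
example_scores <- c(
  documentation = 0.9,
  code = 0.6,
  maintenance = 0.8,
  popularity = 0.4
)
overall_score(example_scores, weights)
```

Because code carries half the weight, a mediocre code score drags the overall result down far more than a weak popularity score does.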

Implementation Considerations and Future Development

This scoring framework represents an ongoing effort to bring greater systematisation and transparency to R package quality assessment. As the R ecosystem continues to evolve, we anticipate that both our methodology and our understanding of what constitutes package quality will require ongoing refinement. We welcome feedback from the community about both the theoretical framework presented here and its practical implementation. Particular areas where community input would be valuable include the appropriate weightings for different quality attributes, the identification of additional metrics that might enhance assessment accuracy, and the development of context-specific guidance for different usage scenarios. Our commitment to annual methodology review ensures that this framework will adapt to reflect changes in best practices, tooling availability, and community standards whilst maintaining the stability and predictability that users require for practical decision-making.

Get in Touch

If you’re interested in learning more about R validation and how it can be used to unleash the power of open source in your organisation, contact us.

References

Risk Appetite in R packages
White paper from the Validation Hub on assessing R package accuracy
Case Studies from various companies. Our approach builds on these ideas.

Shiny in Production 2025: Full Length Talks

Tue, 17 Jun 2025 23:59:00 +0000

We are pleased to announce the full line-up for this year’s Shiny in Production conference! The conference includes nine full-length talks (25 minutes each) and a lightning talk session (5 minutes per talk). We’ll cover the lightning talks in a separate blog.

Talks

Cameron Race - Head of Children and Schools Statistics and Product Manager

shinyGovstyle: A ‘Shiny’ Secret Weapon for Production-Ready Government Public Services

In the UK, we are required to make public sector websites accessible to all users. While there is a wealth of UK government data publicly available through a number of existing digital services, it can be tough to engage with. Government analysts are increasingly turning to R Shiny to enhance their data dissemination, making it more engaging for users, but with hundreds of analysts working in silos across government, how can analysts build full digital services in a way that carries the same consistency, trustworthiness and authority as a domain such as GOV.UK?

Charlie Gao - Posit Software, PBC

Advances in the Shiny Ecosystem

Charlie Gao, Senior Software Engineer on Posit’s open source team, will review some of the latest high-performance async tooling developed by Posit to support R Shiny in terms of performance, scalability and user experience.

Colin Fay - ThinkR

After {shiny} — Bringing R to Mobile with webR

As the use of mobile devices becomes increasingly central to how users interact with data products, the R community has long sought ways to bring R-powered applications into the mobile space. Historically, this has meant adapting {shiny} apps for smaller screens—either through responsive design or packages like {shinyMobile}. While effective for certain use cases, these approaches are fundamentally web-based, requiring a server and a stable internet connection, and lacking access to native device features.

This talk presents a new path forward: Rlinguo, a fully native mobile application built with webR, a version of R compiled to WebAssembly. Unlike traditional {shiny}-based solutions, Rlinguo runs R directly on the device, without a server. It works offline, stores data locally, and can leverage native mobile APIs—pushing the boundaries of what’s possible with R in a mobile context.

Through this case study, we’ll explore the architecture behind Rlinguo, contrast it with the {shiny} model, and discuss what it means for the future of R development. Topics will include:

What it takes to embed R in a mobile app using webR
Technical and design trade-offs between web-based and native solutions
Practical applications for offline, device-integrated R tools

Whether you’re building with {shiny} today or simply curious about the next evolution of R in production, this session offers a look at where R can go when it steps beyond the browser.

Gabriela De Lima Marin - Brazilian Network Information Centre

A Collaborative Initiative for Mapping and Georeferencing Public Schools in Brazil

This project presents a collaborative initiative aimed at improving the geolocation accuracy of Brazilian public schools through an interactive Shiny web application.

By integrating existing location data from the Brazilian School Census with APIs from Google, Microsoft, and OpenStreetMap, we established an innovative workflow to assign accurate geographic coordinates to schools previously lacking precise location data.

The Shiny application provides a user-friendly interface allowing school administrators and education managers to visually verify and manually adjust school locations via interactive maps. Over the past two years, this approach enabled the precise geolocation of previously unlocated schools and significantly enhanced the accuracy of geolocation data of schools.

The geolocation data collected and validated through this project will be openly shared with relevant governmental stakeholders, promoting transparency and supporting evidence-based decision-making. Moreover, the project exemplifies how collaborative data science and innovative web technology—particularly R Shiny—can be effectively leveraged in public administration, enabling managers, stakeholders, and the community to directly contribute to data accuracy and positively influence educational outcomes in Brazil.

Jack Anderson - National Disease Registration Service, NHS England

Transforming the reporting of national patient outcomes with Shiny: 30-day mortality post-Systemic Anti-Cancer Therapy

In June 2020, the National Disease Registration Service began reporting 30-day mortality post-Systemic Anti-Cancer Therapy (SACT) Case-Mix Adjusted Rates (CMAR) to NHS trusts in England. This work applies logistic regression to report trust-level case-mix adjusted 30-day mortality rates, which enable comparisons between trusts and with the national average. Historically, results were shared as an Excel workbook with an accompanying companion brief and FAQ document, and each report was shared in isolation from previous releases. Since April 2023, implementation of R Shiny has enabled 30-day mortality rates to be reported seamlessly on an interactive, publicly accessible dashboard. Utilising the Plotly and DT packages, dynamic funnel plots and data tables are tailored to user needs through Shiny input pickers, which reactively subset and summarise data visualisations based on user selections.

This enables NHS trust users to flexibly review their 30-day mortality outcomes against those of other trusts, their wider Cancer Alliance, and national averages, both overall and stratified by key patient demographics.

The Shiny dashboard also enables users to view current and previous CMAR reports together in one place and includes download button functionality for documentation and underlying data. With dedicated tabs for summary data, trust exclusions, and trust response statements, Shiny allows for end-to-end exploration of CMAR outcomes, making it easier for users to gain insight into clinical practice. The resulting Shiny dashboard supports clinical governance within trusts and enables clinical colleagues to better understand their patient outcomes within their wider context.

Laura Mawer & Marcus Palmer - Datacove, Harrison-Palmer Limited

Using Shiny for Python to Power AI-Driven University Application Forecasting

Universities face growing uncertainty in student recruitment, making accurate forecasting critical for strategic and financial planning. Athena is an AI-powered prediction tool that leverages Shiny for Python to provide real-time insights into application trends. By combining machine learning (Random Forest models), trend analysis, and interactive scenario planning, Athena enables universities to test recruitment strategies, adjust campaign spending, and instantly see the projected impact on future application numbers.

This talk will explore how Shiny for Python was used to develop a fully interactive forecasting tool without requiring extensive front-end development. We will discuss why Shiny for Python was chosen, how it integrates with a machine learning pipeline, and how it powers real-time scenario analysis with dynamic dashboards. Additionally, we’ll demonstrate how AI-generated recommendations via an API enhance decision-making, providing actionable insights tailored to user-selected scenarios.

Attendees will gain practical knowledge on building AI-driven, interactive applications using Shiny for Python, implementing predictive models, and designing intuitive decision-support tools for non-technical users. The session will conclude with a live demo, showing Athena in action and sharing best practices for deploying Shiny for Python in production. This talk is designed for developers, data scientists, engineers, and senior decision-makers looking to leverage AI-powered forecasting, business intelligence, and strategic planning in a real-world application.

Nic Crane - NC Data Labs

htmlwidgets Are a Secret Sauce in R – Can LLMs Make Them the Perfect Condiment?

htmlwidgets quietly power some of the most compelling Shiny apps out there, but writing them from scratch can be fiddly and time-consuming. In this talk, we’ll kick things off by taking an audience-sourced ingredient list and asking a large language model to whip up a fresh htmlwidget. Then we’ll plate up a version we prepared earlier - also model-generated - but chopped, seasoned, and finished with our own touches. Along the way, we’ll explore how LLMs can assist in crafting htmlwidgets that reflect your flavour of R - from tidy eval to package structure - rather than sticking to a bland house style.

Why JR’s Training is Different

Mon, 09 Jun 2025 23:59:00 +0000

At Jumping Rivers, we believe training should be more than just a tick-box exercise. It should be transformative. Whether you’re learning R, Python, SQL, Git or Posit for the first time or diving into advanced topics like machine learning and Quarto, our courses are built to help you actually use what you learn — not just watch someone code. View our upcoming catalogue here!

What Sets Us Apart?

Expert Trainers: Our trainers aren’t just good with code — they’re professional data scientists who solve real-world problems for industry clients every day. From building dashboards to optimising machine learning models, they bring this experience straight into the classroom.
High-Quality Content: We regularly update our material to reflect the latest best practices, tools, and workflows. No tired examples. No generic slides. Just practical, polished, and engaging content that’s been road-tested across sectors.
Hands-On and Practical: Expect live coding, interactive exercises, and the space to ask questions. You’ll finish the course with code you wrote and skills you can apply immediately.
Tailored for You: We design our sessions around your level, your pace, and your goals. Whether you’re an academic, analyst, developer, or decision-maker, you’ll get value from day one.

What People Are Saying

“Genuinely one of the best courses I’ve attended — well-paced, friendly, and full of real-world examples.”

“The trainer was fantastic — incredibly knowledgeable and approachable. I finally feel confident using Git and R together.”

“I loved how applied it all was. We weren’t just learning syntax — we were solving actual problems.”

“Great value. I’ve already used what I learned in a project at work.”

Upcoming Live Training (with Super Early Bird Tickets 🎉)

Here are all the upcoming courses for July and August. Please note all times are UK time.

7-8th July

Introduction to R - 09:00-12:30

Introduction to R - 13:30-17:00

Ready to Skill Up?

✅ Small class sizes

✅ Live sessions with expert trainers

✅ Immediate impact on your day-to-day work

✅ Super early bird tickets — limited and going fast!

🎟 Sign up now and invest in training that actually moves the needle. Whether you’re upskilling your team or boosting your own confidence, you’re in the right place.

Rethinking Image Formats

Thu, 05 Jun 2025 23:59:00 +0000

Adding images to a web page used to be straightforward. You’d add the img tag to the HTML, set the src attribute to the appropriate URL and, hopefully, write some informative alt text. (You might also add some CSS, either inline or via a stylesheet.)

<img src="plot.png" alt="Scatter plot of age vs score. Line of best fit runs through the points, and an outlier can be seen at age 28, score 40." />

It’s slightly more complicated today, with monitor and browser technology changing the requirements, at least if you are using raster images (like JPEGs, PNGs and GIFs) and want things to look good for all your users. High density screens on smartphones have been popular for a while but 4k and 5k monitors are also becoming more affordable. To make text easy to read, these are often set to 200% scaling so that one measured pixel corresponds to 2 real pixels in each dimension. (For smartphones and tablets this scaling can even be 300%, though their true pixel counts are lower than those of 4k and 5k monitors.) A result of all this is that, for images not to look pixelated on these screens, they need twice as many pixels in each direction - that’s four times the number of pixels for a given image display size. So what can we do about this?

Using the srcset Attribute

Fortunately, browsers added the srcset attribute to make it easier for the developer to specify multiple images to use. The browser then picks the “best” option for a given user based on the information given in the srcset attribute and information the browser already has about the device on which the page is being viewed. The simplest way to utilise this attribute is to specify an image that is twice as large in the srcset property alongside a “2x” marker. By convention, we name the larger image the same as the smaller image, but with @2x in the name just before the extension:

<img src="plot.png" srcset="plot@2x.png 2x" alt="Scatter plot of age vs score. Line of best fit runs through the points, and an outlier can be seen at age 28, score 40.">

This tells the browser to serve the base image to users with “regular” screens and the larger image to those with scaled screens. You could also add a “3x” version here if you wanted, though that would require an image with nine times as many pixels as the base image. The actual file size may not be nine times that of the base image, because compression algorithms scale well, but it will still be considerably bigger.
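If you generate your HTML from R, the @2x naming convention is easy to automate. The sketch below uses two hypothetical helpers, `density_name()` and `img_tag()`, written in base R for illustration. It does not escape the alt text, so treat it as a starting point rather than production code.

```r
# Insert a density suffix such as "@2x" before the file extension.
density_name <- function(file, factor) {
  paste0(
    tools::file_path_sans_ext(file), "@", factor, "x.",
    tools::file_ext(file)
  )
}

# Build an <img> tag with a srcset entry for each density factor.
# NOTE: `alt` is not escaped, so avoid double quotes in it.
img_tag <- function(file, alt, factors = 2) {
  srcset <- paste(
    sprintf("%s %sx", density_name(file, factors), factors),
    collapse = ", "
  )
  sprintf('<img src="%s" srcset="%s" alt="%s">', file, srcset, alt)
}

# Print a tag offering 2x and 3x versions of the same plot.
cat(img_tag("plot.png", alt = "Scatter plot of age vs score", factors = c(2, 3)))
```

Remember that each factor multiplies the pixel count by its square, so the 2x and 3x files need four and nine times as many pixels as the base image.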

The shortcoming with the above syntax is that it’s not really targeting the right thing. It tells the browser to choose based only on scaling factors and not on the actual rendered image sizes. An image could be set to display at 600 “CSS” pixels on a wide screen, like a desktop monitor, and 300 CSS pixels on a narrower one, like a phone. For a phone with 2 times scaling the 600 pixel image would then look fine but the browser doesn’t inherently know that the 1200 pixel image is unnecessary. So it will (probably) load the 1200 pixel image, making page-load slower than necessary and potentially gobbling up more of the user’s mobile data than warranted.

The specification for srcset offers an alternative that seems to solve this issue: just directly list the widths of available images by specifying a number and the letter “w”:

<img
 src="plot.png"
 srcset="plot-small.png 300w, plot.png 600w, plot-large.png 1200w"
 alt="Scatter plot of age vs score. Line of best fit runs through the points, and an outlier can be seen at age 28, score 40.">

If the browser knows what size the img element will be rendered at, the sizes of the image options and the pixel density of the screen, it can pick the best image for the job. The catch is that, at least when the browser sees the img tag for the first time, it won’t know what size it will be rendered at unless we specifically tell it. We can do that using the sizes attribute on the img element. Unfortunately, for responsive layouts this can get very messy and very confusing very quickly.

If you want to get into the nitty gritty of using srcset with sizes then there is a great article on CSS Tricks that goes into way more detail than we have space for here. Let’s, instead, look at alternative ways of reducing the burden of large images.

Using Vector Graphics

The solution that makes life easy… when it’s applicable. Instead of using a PNG (or JPEG), use an SVG - a scalable vector graphic.

Advantages of SVG

Instead of storing data about the colours of millions of pixels, these files store a set of instructions for constructing an image. This is usually the perfect solution for company logos and most common chart types because they can be scaled however you like precisely because they’re just a list of instructions. No need to serve multiple images.
They can be added to the page in a number of ways, including using a simple img tag.
With a bit of JavaScript they can be made interactive and they’re easy to animate.

Shortcomings of SVG

They’re essentially useless for detailed images, like photography.
Fonts may not be rendered properly when added through the src attribute of an image tag if that font isn’t already on the user’s system. A work-around for this is to open a vector-image editor and find the option for rendering text as paths. While this will likely increase the file size a bit and cause minor imperfections in text rendering, it may be more problematic that this adds an extra step in the workflow when the SVGs are generated programmatically.

Illustrative example

Use the controls below to change between image formats and scaling to see the effect. It should be apparent that when you scale up a PNG or JPEG the image becomes more blurred and that the SVG, for the most part, remains crisp regardless of the scale-factor. (You may notice small artefacts with the SVG text when scaled up. These are seen because the characters are rendered using SVG paths rather than fonts, as described in the previous section.)

[Interactive demo: controls for selecting an image format and a scale factor]

Using New Image Formats

Given the above, you may think the available image options for the web look something like this:

JPEG (with lossy compression) for images with (up to) millions of colours;
PNG for images with large consistent blocks of colours (like logos) or images that require transparency;
SVG for vector graphics;
GIF for your favourite animated meme.

But for images that can’t be easily represented in vector format there are several newer raster image formats: JPEG XL, WebP, AVIF and HEIC (A.K.A. HEIF) that offer better compression (lossy and lossless) than PNG, JPEG and GIF. Of these new formats, only WebP and AVIF have meaningful browser support, but that support is actually very good: currently 95.4% for WebP and 93.5% for AVIF. In fact, you may think support is good enough for both formats to not need to provide a fallback. However, if you want to, you can use the picture and source elements to cover even more browsers:

<picture>
 <source srcset="/images/home/whale-deep-dive-light-blue.webp 1x, /images/home/whale-deep-dive-light-blue@2x.webp 2x" type="image/webp">
 <img src="/images/home/whale-deep-dive-light-blue.png" alt="Jumping Rivers' cartoon whale with Moon in background">
</picture>

In the above example we use the srcset attribute to provide two different sizes in the WebP format and the img tag to provide a PNG fallback for older browsers (we assume users of older browsers aren’t using modern high-definition screens). The alt text also still needs to be included in the img tag rather than moved into the source or picture tags.

When it comes to choosing between WebP and AVIF, WebP has marginally better browser support, but consensus is that AVIF offers better compression. This is maybe not surprising since it’s a much newer format than WebP, which actually turns fifteen in 2025. The downside to that is that we have found support for AVIF in editing tools to be much lower than it is for WebP. That landscape is always changing, however. WebP has one other advantage over AVIF: it supports lossy images with transparency so if you need small image sizes and transparency it’s the only format in town.

Both WebP and AVIF support image animation but, as you will see in the next section, there’s another alternative for replacing our old friend the GIF.

The example below shows a 300-pixel-wide image of The Catalyst building in Newcastle, where Jumping Rivers is headquartered. You can choose between viewing a lossless PNG, lossless WebP, lossy JPEG, and a lossy WebP image. The two lossless formats should look the same, but the WebP image is about 20% smaller in file size than the PNG. The lossy images both have “medium” levels of compression so should be of roughly comparable quality, but not identical (since they use different compression algorithms). The lossy WebP image is only about one third the file size of the JPEG!

[Interactive demo: controls for selecting an image format and a scale factor]

Using Videos Instead of GIFs

GIFs, particularly animated GIFs, have been a big part of internet culture. However, they are a very old format with large file sizes and poor colour gamuts: they are limited to a maximum of just 256 different pixel colours. All modern browsers support video natively through the video element, and videos offer much better compression and huge colour palettes.

<video src="assets/hex-dissolve.mp4" aria-label="Litmusverse hex sticker animation" autoplay="true" loop="true" muted="true"></video>

The aria-label attribute is used like the alt text of an img element. The other attributes should be fairly self-explanatory: autoplay tells the browser to play the video automatically, loop to loop the video around back to the start when it finishes and muted not to play any sound. The last of these is required because, thankfully, browsers will no longer autoplay videos with sound.

For updates and revisions to this article, see the original post

Custom PowerPoints Using {officer}

Thu, 22 May 2025 23:59:00 +0000

From a purely design perspective, Quarto’s standard PowerPoint output falls short. It is limited to seven layout options, with the most complex being “Two Content.” The {officer} R package offers a powerful alternative for those seeking full control and customisation.

Why PowerPoint?

At work, I use a Linux operating system (OS), and at home, I use macOS. Within my little bubble, it’s easy to forget how much of the market share Microsoft still holds. It’s estimated that around 70% of the desktop operating system market share belongs to Microsoft. Many of the clients I work with prefer Microsoft outputs, such as PowerPoint, over HTML or PDF. Aside from company alignment with Microsoft, there are a few practical reasons why using PowerPoint with Quarto can be advantageous:

No need to be a CSS / LaTeX whizz-kid to produce professional-looking slides
Possible (and easy) to edit after rendering the doc!

What is {officer}?

From davidgohel.github.io/officer

The officer package lets R users manipulate Word (.docx) and PowerPoint (.pptx) documents. In short, one can add images, tables and text into documents from R. An initial document can be provided; contents, styles and properties of the original document will then be available.

This means for this workflow, Quarto is sidestepped altogether, and we focus entirely on R scripts and R coding.

How?

There are a few ways to use {officer} - I’ll walk through the approach that I’ve found to be most effective.

Layout templates

First - you’ll need a PowerPoint presentation that contains template layout slides. There are no limits to these slides: the format can be as custom as you like and there can be as many layouts as you want. Remember - this file doesn’t need any actual slides, it only needs layouts! To create a layout:

Enter “Slide Master” mode
Add any content (headers, footers, styling etc) you want to appear on each slide to the “Slide Master”
Create a new Layout Slide

To insert content from R, the easiest way is via placeholders. These can be text, tables, images and more. To add a placeholder:

Click “Insert Placeholder” and choose the content type
If it’s a text placeholder, you can customise the formatting of the text

You can see below that I’ve added some basic Jumping Rivers styling to mine, and added two placeholders: a text placeholder for a title and an image placeholder for a plot.

In order to access these placeholders easily from R, it’s better to rename them:

Home tab
Click the “Select” dropdown
Click “Selection pane”
Select your placeholder and rename

Here I’ve named my image placeholder “plot”, and my text placeholder for the slide title, “title”. Note that it’s also a good idea to name your layout - just right click and hit rename. In this demo I’ve just left it as “Title Slide”.

The R code

Now that I’ve got my template set up, the rest is in R. First, we load {officer} and read the PowerPoint document in as an R object.

library("officer")
doc = read_pptx("mytemplate.pptx")

If you’ve forgotten your layout / placeholder names, access them through layout_summary() and layout_properties():

layout_summary(doc)
layout_properties(doc, layout = "Title Slide", master = "Office Theme")

Before any content can be added, content is needed! Let’s use the {palmerpenguins} package to create a simple plot of “Adelie” penguins data

library("palmerpenguins")
library("dplyr")
library("ggplot2")

adelie_plot = penguins |>
  filter(species == "Adelie") |>
  ggplot(aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point() +
  theme_linedraw() +
  theme(
    # Make the background transparent
    plot.background = element_rect(fill = "transparent", colour = NA),
    # Match the panel colour to the slide
    panel.background = element_rect(fill = "#F1EADE", colour = NA)
  ) +
  labs(
    x = "Bill Length (mm)",
    y = "Flipper Length (mm)"
  )

I can add empty slides to the document using the add_slide() function. Here I simply choose a layout from my .pptx file to use.

doc = add_slide(doc, layout = "Title Slide", master = "Office Theme")
doc

Then, using the ph_with() function, I can insert R objects into my placeholders by name

doc = ph_with(
 doc,
 value = "Adelie",
 location = ph_location_label("title")
)
# Add the plot
doc = ph_with(
 doc,
 value = adelie_plot,
 location = ph_location_label("plot")
)

To create the PowerPoint, use print()

print(doc, "penguins.pptx")

And there we have it! I’ve used only two placeholders here to keep the example simple, but in reality there is no limit.

Looping

It’s easy to make use of programming when using purely R code to generate PowerPoints. For instance, we could stick our code into a for loop, and add a slide for each penguin species:

# Read in doc again
# this resets the doc object to the original file
doc = read_pptx("mytemplate.pptx")

for (penguin_species in c("Adelie", "Chinstrap", "Gentoo")) {
  doc = add_slide(doc, layout = "Title Slide", master = "Office Theme")
  # Add the title using the iterator value
  doc = ph_with(
    doc,
    value = penguin_species,
    location = ph_location_label("title")
  )
  # Create the plot using the iterator value
  penguin_plot = penguins |>
    filter(species == penguin_species) |>
    ggplot(aes(x = bill_length_mm, y = flipper_length_mm)) +
    geom_point() +
    theme_linedraw() +
    theme(
      plot.background = element_rect(fill = "transparent", colour = NA),
      panel.background = element_rect(fill = "#F1EADE", colour = NA)) +
    labs(
      x = "Bill Length (mm)",
      y = "Flipper Length (mm)")
  # Add the plot for the current species
  doc = ph_with(
    doc,
    value = penguin_plot,
    location = ph_location_label("plot")
  )
}
# Output to a file
print(doc, "penguins_loop.pptx")

Conclusion

There are a few drawbacks to this method:

It is quite annoying to insert large amounts of text using just an R script
Content added to the “Slide Master” slide cannot be moved or edited on the output file
The web version of PowerPoint doesn’t have the Slide Master functionality

However, I think the pros outweigh the cons:

Completely (and I mean completely) custom layouts
I haven’t covered this here but it’s really easy to convert ggplots to vector graphics in DrawingML format so they can be edited in PowerPoint
Easier to programmatically generate lots of slides
It can be edited after rendering by anyone!
It can be styled before rendering by anyone!


Shiny in Production 2025: Workshops

Tue, 20 May 2025 23:59:00 +0000

Shiny in Production is heading back to The Catalyst in Newcastle upon Tyne this October! We’ve got a great mix of workshops and a full day of talks, with speakers being announced soon. You’ll find all the workshop details below, and you can sign up now on the conference website. Whether you’re just getting started with Shiny or have been using it for years, come join us for a great hands-on experience with Shiny and other web-based development tools.

Day one of the conference (Wednesday 8th October) will consist of three parallel workshops running from 13:30 to 17:00, followed by a drinks reception in the evening: a great opportunity for networking and debriefing from the day’s learning.

Workshop 1: End-to-End testing for {shiny} with Playwright and {golem} - Colin Fay

A Shiny application that dazzles in development can still fall apart in production if user journeys break, data pipelines drift, or browsers behave unexpectedly. Automated end-to-end (E2E) testing is the safety net that keeps released apps robust, and Playwright is quickly becoming the gold-standard tool for doing it across Chrome, Firefox and WebKit. In this hands-on workshop we’ll walk through a workflow for writing, running and maintaining Playwright tests that keep your Shiny apps ship-shape long after launch. Here’s what we’ll tackle:

why E2E testing matters even when you already have unit tests
installing and configuring Playwright in a golem project using {pw}
scripting core user flows—clicks, inputs…
validating data and UI state with snapshots and assertions
running tests headlessly in CI pipelines (GitHub Actions, GitLab CI, Posit Connect)
handling Shiny specificity
debugging failed tests

For this workshop, bring a laptop and a Shiny app you care about. You’ll leave with a working Playwright test harness you can drop straight into your projects—plus the confidence to deploy on Friday without fear.

By the end of the workshop, participants will…

understand the role of end-to-end testing in the Shiny deployment pipeline
be able to install Playwright and scaffold tests from R
write expressive Playwright scripts that capture user journeys in a Shiny app
run tests in parallel across browsers locally and in continuous-integration systems

About the speaker

Colin Fay is a Lead Developer at ThinkR, a French agency specializing in all things R. By day, he helps companies unlock R’s full potential by building tools, architecting infrastructure, and developing data and software engineering solutions. His expertise spans web applications (frontend & backend), R in production, and scalable software development. By night, he’s an open-source enthusiast, international speaker, and long-distance runner. A passionate advocate for the R community, he actively contributes to open-source projects and shares his knowledge through talks and workshops worldwide. Colin is the main developer of {golem}, a framework for building robust Shiny applications, and the lead author of Building Production-Grade Shiny Apps (https://engineering-shiny.org/index.html).

Workshop 2: Asynchronous Shiny - Dr Russ Hyde

Imagine you couldn’t register to attend “Shiny in Production” if someone else was in the process of registering, and you had to wait until they had finished before you could click to “Buy tickets on EventBrite”. This kind of “blocking” shouldn’t happen in modern web applications but is surprisingly common in Shiny applications. It happens because a single R process handles all of the server-side processing for multiple users—one long-running task can prevent any other task from proceeding, hampering interactivity both between and within user-sessions.

Fortunately, Shiny’s support for asynchronous programming can alleviate this problem. In the asynchronous approach, you start tasks running without having to wait for them to complete. But, this requires a change in mindset for many programmers and there are a few concepts to understand before you can take advantage of this approach. So, what are you waiting for? Sign up for this workshop!

By the end of the workshop, participants will…

understand how within-session and between-session blocking can arise in a Shiny app
understand the basics of asynchronous computation
solve between-session blocking with future/promise
solve blocking the modern way, with ExtendedTask

About the speaker

Russ has previously worked in molecular biology and bioinformatics. He holds a PhD in Molecular Physiology and an MSc in Mathematics. Russ is an author of several CRAN packages and a mentor in the R-for-data-science community.

Workshop 3: Figma and User-Interface Design for Shiny - Pedro Silva

Applications should look attractive, be engaging, and work intuitively for users. All of these aspects benefit from spending time focussing on user-interface (UI) and user experience (UX) design during app development. Indeed, we find that clients provide lots of feedback on the look and feel of an app, and that it is useful to prepare a view of the overall design even before any interactive functionality is implemented, so that design feedback can be obtained as early as possible.

Graphical tools like Figma allow the designer to build both coarse- and fine-grained illustrations of how an application or website will look, and simulate the user workflow through the application. The designs can be shared with clients, and feedback gathered through comments pinned to the design.

This workshop requires no prior experience in UI/UX design and will guide you through your first steps in Figma, demonstrating how to quickly prepare design ideas for Shiny applications. We’ll also get you started with creating some components—reusable modules of your design that can transition into different states. You will need a Figma account to participate; there is a free-tier that is sufficient for the workshop.

By the end of the workshop, participants will…

create simple wireframe designs in Figma
set font styles and colour palettes consistently across your design
use the bootstrap UI kit in Figma
create small components with a simple transition into an alternative state
use CSS to replicate a simple Figma design in Shiny

About the speaker

Pedro is a full stack developer with over 15 years of experience in the field, loves front-end and R Shiny development, and is a moonlight practitioner of JavaScript dark arts.

What’s next?

Early bird tickets for the conference are still available at the time of writing, so don’t miss out! The full line up of speakers will be announced in the coming weeks. Still not convinced? Head over to our YouTube channel to take a look at talks from previous years to see what we have in store.


Advanced Testing in Python

Thu, 08 May 2025 23:59:00 +0000

Writing tests is one of the best ways to keep your code reliable and reproducible. This post builds on our previous blog about Python testing with pytest Part 1, and explores some of the more advanced features it offers. From parametrised fixtures to mocking and other useful pytest plugins, we will show how to make your tests more reproducible, easier to manage and demonstrate how writing simple tests can save you time in the long run.

Testing in Python

When we write code, it is important to ensure it behaves as expected, which is why we test it. Testing (and re-testing) our code should be a regular practice, ideally done thoroughly, quickly, and reliably after every change.

To achieve this, we write additional code to verify the behavior of our main code. We use specific terms to differentiate between these two types of code:

Production Code: the code that fulfills the purpose of the software, and is run by the user.
Test Code: additional code only used to test the production code.

The directory structure for production and testing code typically looks as follows:

./advanced_pytest/
├── map.py # production code
├── tests/
│   ├── parametrised_fixture.py
│   └── test_map.py
└── venv

where the main functions, in our case map.py, are in the root directory and the tests are stored under tests.

Parametrised fixtures

In Part 1 we introduced the concept of fixtures in pytest. Now, let’s explore parametrised fixtures, a powerful feature that allows us to run the same test logic with different inputs. This helps avoid code duplication while testing various scenarios without rewriting your tests.

import pytest

@pytest.fixture(params=[1, 2, 3])
def input_value(request):
    return request.param

def test_increment(input_value):
    assert input_value + 1 > input_value

This test will run three times, once for each value in the params list (1, 2, and 3). By parametrising the fixture, we effectively reuse the same test logic across multiple inputs. This makes your tests more compact and helps catch potential issues that might only appear with certain values.

Mocking

Mocking is the process of replacing a real object with a pretend object, which records how it is called and can assert if it is called incorrectly. In Python, mocking can be performed via the unittest.mock module. We can create a mock version of a function as follows:

# ./tests/test_mock_function.py
from unittest.mock import Mock
mock_function = Mock(name="my_function", return_value=2)

This creates a new object called mock_function which can be used in place of any other function. The name="my_function" argument is a label for the mock_function which is useful when debugging. The return_value=2 argument for Mock means that any time that mock_function() is called, it will return 2, regardless of any other arguments passed to mock_function().

We can use our mock_function in a test:

# ./tests/test_mock_function.py (continued)
def test_mock_function_works():
    assert mock_function() == 2
    assert mock_function(123, "abc") == 2

Running the test script shows that mock_function() always returns 2.

python -m pytest tests/test_mock_function.py
============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.3.5, pluggy-1.5.0
rootdir: /PATH/pytest-advanced-blog-post
collected 1 item

tests/test_mock_function.py . [100%]

============================== 1 passed in 0.02s ===============================
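The description above says that a mock records how it is called and can assert if it is called incorrectly. As a minimal sketch of what that looks like in practice, the snippet below reuses the same mock_function and inspects its recorded calls with attributes and methods from unittest.mock. The specific arguments (such as zoom=12) are arbitrary values chosen for illustration.

```python
from unittest.mock import Mock, call

mock_function = Mock(name="my_function", return_value=2)

# Call the mock a few times with different (arbitrary) arguments
mock_function()
mock_function(123, "abc")
mock_function(zoom=12)

# The mock keeps a record of every call it received
n_calls = mock_function.call_count
last_call = mock_function.call_args
all_calls = mock_function.call_args_list

# These assertions pass because they match recorded calls
mock_function.assert_called_with(zoom=12)  # compares against the most recent call
mock_function.assert_any_call(123, "abc")  # compares against any recorded call

# An assertion that does not match the most recent call raises AssertionError
try:
    mock_function.assert_called_with(zoom=13)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```

Inside a pytest test you would normally let the AssertionError propagate so that the test fails; it is caught here only to keep the sketch runnable as a plain script.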

Mocking External Dependencies

When testing functions that interact with external systems (such as APIs or databases), it’s important to isolate the code being tested. We want to avoid having our tests make real calls to remote resources, as this could cause failures due to issues like internet outages or slow database responses. Instead, we use mocks. Tests written for pytest can use the standard library’s unittest.mock module directly (here we use its patch function).

Let’s consider an example of some code (map.py) that retrieves and displays a static map image of a geographic location (Paris in this case).

import requests


def map_at(lat, long, satellite=False, zoom=12, size=(400, 400)):
    base = "https://static-maps.yandex.ru/1.x/?"
    params = dict(
        z=zoom,
        size=str(size[0]) + "," + str(size[1]),
        ll=str(long) + "," + str(lat),
        l="sat" if satellite else "map",
        lang="en_US",
    )
    return requests.get(base, params=params, timeout=60)


paris_map = map_at(48.853, 2.3499)

import IPython

IPython.core.display.Image(paris_map.content)

In this example there is a single function map_at() that could be tested. Additional code in the script makes use of that function (paris_map = map_at(...)). The way the script is written means that whenever it is loaded as a module (import map), all of the top-level commands will be evaluated. In particular, when a test script loads this module, the commands paris_map = map_at(...) and ...Image(paris_map.content) will run. You don’t want this to happen. That is, you don’t want to run all of the code in your analysis scripts just to test that the functions within them work correctly: it will make your testing routine take a long time.

The top-level code that displays a map of Paris is script-specific. It should run when map.py is run as a script, but not when map.py is imported. The standard Python way to prevent script-specific code from running when a module is imported is to wrap it in the following block:

if __name__ == "__main__":
 # script-specific commands go here

To make testing easier, we can make map.py a little more import-safe:

import requests
import IPython


def map_at(lat, long, satellite=False, zoom=12, size=(400, 400)):
    # Function body is unchanged
    # ...
    return requests.get(base, params=params, timeout=60)


if __name__ == "__main__":
    paris_map = map_at(48.853, 2.3499)
    IPython.core.display.Image(paris_map.content)
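The effect of the guard can be checked with a small standard-library sketch. It writes a hypothetical stand-in for map.py (called guard_demo.py here, with a made-up expensive_call() function) to a temporary directory, then loads it once as an imported module and once as a script. The file name, function name and contents are assumptions for illustration only.

```python
import importlib.util
import runpy
import tempfile
from pathlib import Path

# Hypothetical stand-in for map.py: one function plus guarded script code
SOURCE = '''
calls = []

def expensive_call():
    calls.append("called")

if __name__ == "__main__":
    expensive_call()
'''

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "guard_demo.py"
    path.write_text(SOURCE)

    # Importing: __name__ is "guard_demo", so the guarded block is skipped
    spec = importlib.util.spec_from_file_location("guard_demo", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    calls_on_import = list(module.calls)

    # Running as a script: __name__ is "__main__", so the guarded block runs
    script_globals = runpy.run_path(str(path), run_name="__main__")
    calls_as_script = list(script_globals["calls"])
```

After this runs, calls_on_import is empty while calls_as_script contains one entry, which is the behaviour we want from map.py when a test file imports it.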

Now we can load the functions from map.py without having to run all the other script-specific code within it. The test file (test_map.py) for map.py would then be:

import requests

from unittest.mock import patch

from map import map_at

def test_build_default_params():
    with patch.object(requests, "get") as mock_get:
        map_at(51.0, 0.0)
        mock_get.assert_called_with(
            "https://static-maps.yandex.ru/1.x/?",
            params={
                "z": 12,
                "size": "400,400",
                "ll": "0.0,51.0",
                "l": "map",
                "lang": "en_US",
            },
            timeout=60,
        )

This test checks the behavior of the map_at function. Using the patch.object() context manager from unittest.mock, the test mocks the requests.get function to prevent actual network calls. It ensures that when the map_at function is called with specific coordinates, it generates the correct HTTP GET request with the expected URL and parameters (such as zoom level, map type, and language).

Similarly, you can patch a function using the context manager with patch.object(my_module, "original_function", mock_function) and this will mean that any calls to my_module.original_function() will be replaced with calls to mock_function().
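As a minimal sketch of this replacement form, the example below patches os.getcwd from the standard library rather than a module of our own. The describe_cwd() helper and the path returned by the mock are hypothetical, chosen only for illustration.

```python
import os
from unittest.mock import Mock, patch


def describe_cwd():
    """Hypothetical production function that depends on os.getcwd()."""
    return f"Working in {os.getcwd()}"


# The replacement: a mock that always returns a fixed, made-up path
fake_getcwd = Mock(name="getcwd", return_value="/hypothetical/project")

# Inside the with block, os.getcwd is swapped for fake_getcwd
with patch.object(os, "getcwd", fake_getcwd):
    patched = describe_cwd()

# Outside the block, the original os.getcwd is restored automatically
restored = describe_cwd()
```

Because the patch is applied through a context manager, the original function comes back as soon as the block exits, so one test cannot leak its mock into another.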

Mocking is important in testing because it isolates the code being tested from external dependencies, such as APIs, databases, or file systems. This allows tests to run faster, as they do not rely on slow or unreliable external services. Mocking also ensures tests are more predictable and repeatable by simulating specific responses or error conditions, without making real network requests or modifying external data. This makes tests more focused on the logic of the code itself, while avoiding unintended side effects.
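To illustrate simulating specific responses and error conditions, here is a small sketch that uses the return_value and side_effect features of Mock. The fetch_status() function is hypothetical: it takes the HTTP getter as an argument so that the example needs only the standard library, which is a different design from map_at() above. The URL is a placeholder.

```python
from unittest.mock import Mock


def fetch_status(get, url):
    """Hypothetical helper: return the status code, or None on a timeout."""
    try:
        response = get(url, timeout=60)
    except TimeoutError:
        return None
    return response.status_code


# Simulate a successful response: the returned object has status_code == 200
ok_get = Mock(name="get")
ok_get.return_value.status_code = 200

# Simulate an outage: calling the mock raises TimeoutError
failing_get = Mock(name="get", side_effect=TimeoutError("simulated outage"))

status_ok = fetch_status(ok_get, "https://example.com/map")
status_failed = fetch_status(failing_get, "https://example.com/map")
```

Both branches of fetch_status() are exercised without any network access, and the error path is tested deterministically rather than by waiting for a real timeout.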

Useful pytest Plugins

Pytest’s functionality can be extended through a rich ecosystem of plugins. Here are some useful plugins:

pytest-xdist: Enables parallel test execution, speeding up test runs.

 pip install pytest-xdist
 pytest -n auto

pytest-cov: Provides code coverage reports.

 pip install pytest-cov
 pytest --cov=your_package

pytest-mock: Simplifies mocking by integrating with unittest.mock.

 pip install pytest-mock

By integrating these advanced pytest features, you can make your tests more efficient, reproducible, and easier to manage. Don’t hesitate to experiment with parametrised fixtures, mocking, and useful plugins like pytest-cov and pytest-xdist to level up your testing.


Announcing the Jumping Rivers Dashboard Gallery

Tue, 15 Apr 2025 23:59:00 +0000

At Jumping Rivers we love data dashboards and are delighted to announce the release of a gallery to showcase our application-development skills.

Tools like Shiny, Dash, Streamlit and Observable have simplified the process of making interactive, visual, data products.

Despite this simplification, our clients often approach us with challenges that step beyond what is easily achieved in these dashboard frameworks. They may have accessibility requirements, or need applications to be responsive to a user’s browser size or device. They may need a user interface that matches their branding, or that is easy to use. Data itself is sometimes a challenge, and some clients need a data pipeline developing, for pre-processing or validation, so that their applications can work more effectively. There are niche skills involved in data presentation and visualisation, and we have a wealth of experience with charts, tables, maps and have built a range of custom data-widgets for clients.

Our dashboard gallery contains several applications that highlight our expertise across the Jumping Rivers data science team. Within it, you can find a list of the applications that are available to view, with links to each and some technical information. All the applications are publicly accessible. In the coming months we will be adding further applications to the gallery.

At Jumping Rivers we have worked with Shiny for many years (including Shiny for Python), and have several training courses, dozens of blog posts and host our annual conference “Shiny in Production” on this tool. Consequently, many of the gallery applications are built using Shiny. But our team also boasts expertise with the data visualisation library D3.js and a range of JavaScript frameworks, and so there is a Vue.js application and a timeline developed in D3 presented within the showcase too.

Included in our initial gallery collection are:

our Litmus dashboard which displays risk- and quality-scores for use when validating R packages (see our recent blog post);

a map application displaying the “Integrated Care Board” boundaries and population sizes for NHS England;

a timeline showing the history of the R language, built with D3.js;

a fun quiz to determine which cat you are;

an airport departure-board with interactively-selectable columns.

Please explore our dashboard gallery, and if you or your team have a project that would benefit from our expertise in dashboard development, deployment and the underlying infrastructure please contact us.


What's new in R 4.5.0?

Thu, 10 Apr 2025 23:59:00 +0000

R 4.5.0 (“How About a Twenty-Six”) was released on 11th April, 2025. Here we summarise some of the interesting changes that have been introduced. In previous blog posts we have discussed the new features introduced in R 4.4.0 and earlier versions (see the links at the end of this post).

The full changelog can be found at the r-release ‘NEWS’ page and if you want to keep up to date with developments in base R, have a look at the r-devel ‘NEWS’ page.

penguins

Who doesn’t love a new dataset?

One of the great things about learning R for data science is that there is a collection of datasets available to work with, built into the base installation of R. The Palmer Penguins dataset has been available via an external package since 2020, and has been added to R v4.5.0 as a base dataset.

This dataset is useful for clustering and classification tasks and was originally highlighted as an alternative to the iris dataset.

In addition to the penguins dataset, there is a related penguins_raw dataset. This may prove useful when teaching or learning data cleaning.

use()

If you have worked in languages other than R, its approach to importing code from packages may seem strange. In a Python module, you would either import a package and then use functions from within the explicit namespace for the package:

import numpy
numpy.array([1, 2, 3])
# array([1, 2, 3])

Or you would import a specific function by name, prior to its use:

from numpy import array
array([1, 2, 3])
# array([1, 2, 3])

In an R script, we either use explicitly-namespaced functions (without loading the containing package):

penguins |>
 dplyr::filter(bill_len > 40)

Or we load a package, adding all its exported functions to our namespace, and then use the specific functions we need:

library("dplyr")
penguins |>
 filter(bill_len > 40)

The latter form can cause some confusion. If you load multiple packages, there may be naming conflicts between the exported functions. Indeed, there is a filter() function in the base package {stats} that is overridden when we load {dplyr} - so the behaviour of filter() differs before and after loading {dplyr}.

R 4.5.0 introduces a new way to load objects from a package: use(). This allows us to be more precise about which functions we load, and from where:

# R 4.5.0 (New session)
use("dplyr", c("filter", "select"))

# Attaching package: ‘dplyr’
# 
# The following object is masked from ‘package:stats’:
#
# filter
#

penguins |>
 filter(bill_len > 40) |>
 select(species:bill_dep)

# species island bill_len bill_dep
# 1 Adelie Torgersen 40.3 18.0
# 2 Adelie Torgersen 42.0 20.2
# 3 Adelie Torgersen 41.1 17.6
# 4 Adelie Torgersen 42.5 20.7
# 5 Adelie Torgersen 46.0 21.5
# 6 Adelie Biscoe 40.6 18.6

Note that only those objects that we use() get imported from the package:

# R 4.5.0 (Session continued)
n_distinct(penguins)
# Error in n_distinct(penguins) : could not find function "n_distinct"

A feature similar to use() has been available in the {box} and {import} packages for a while. {box} is a particularly interesting project, as it allows more fine-grained control over the import and export of objects from specific code files.

Parallel downloads

Historically, the install.packages() function worked sequentially - both the downloading and installing of packages were performed one at a time. This means it could be slow to install many packages.

We often recommend the {pak} package for installing packages because it can download and install packages in parallel.

But as of R 4.5.0, install.packages() (and the related download.packages() and update.packages()) are capable of downloading packages in parallel. This may speed up the whole download-and-install process. As described in a post on the R-project blog by Tomas Kalibera, the typical speed-up expected is around 2-5x (although this is highly variable).

C23

C23 is the current standard for the C language. Much of base R and many R packages require compilation from C. If a C23 compiler is available on your machine, R will now preferentially use that.

grepv()

For pattern matching in base R, grep() and related functions are the main tools. By default, grep() returns the index of any entry in a vector that matches some pattern.

penguins_raw$Comments |> grep(pattern = "Nest", x = _)
# [1] 7 8 29 30 39 40 69 70 121 122 131 132 139 140 163 164 193 194 199
# [20] 200 271 272 277 278 293 294 299 300 301 302 303 304 315 316 341 342

We have been able to extract the values of the input vector, rather than the indices, by specifying value = TRUE in the arguments to grep():

penguins_raw$Comments |>
 grep(pattern = "Nest", x = _, value = TRUE)
# [1] "Nest never observed with full clutch." 
# [2] "Nest never observed with full clutch." 
# [3] "Nest never observed with full clutch." 
# [4] "Nest never observed with full clutch." 
# [5] "Nest never observed with full clutch." 
# [6] "Nest never observed with full clutch. Not enough blood for isotopes."

Now, in R 4.5.0, a new function grepv() has been introduced which will automatically extract values rather than indices from pattern matching:

penguins_raw$Comments |>
 grepv(pattern = "Nest", x = _)
# [1] "Nest never observed with full clutch." 
# [2] "Nest never observed with full clutch." 
# [3] "Nest never observed with full clutch." 
# [4] "Nest never observed with full clutch." 
# [5] "Nest never observed with full clutch." 
# [6] "Nest never observed with full clutch. Not enough blood for isotopes."

Contributions from R-Dev-Days

Many of the changes that are described in the “R News” for the new release came about as contributions from “R Dev Day”s. These are regular events that aim to expand the number of people contributing code to the core of R. In 2024, Jumping Rivers staff attended these events in London and Newcastle (prior to “SatRDays” and “Shiny In Production”, respectively). Dev days are often attached to a conference and provide an interesting challenge to anyone interested in keeping R healthy and learning some new skills.

Trying out R 4.5.0

To take away the pain of installing the latest development version of R, you can use Docker. To use the devel version of R, you can use the following commands:

docker pull rstudio/r-base:devel-jammy
docker run --rm -it rstudio/r-base:devel-jammy

Once R 4.5 is the released version of R and the r-docker repository has been updated, you should use the following command to test out R 4.5.

docker pull rstudio/r-base:4.5-jammy
docker run --rm -it rstudio/r-base:4.5-jammy

An alternative way to install multiple versions of R on the same machine is using rig.

Shiny in Production 2024 Videos

Tue, 08 Apr 2025 23:59:00 +0000

2024 Videos

Considering a ticket for Shiny in Production 2025 but unsure what to expect? Maybe you attended in past years but missed out in 2024, or you simply want a refresher on last year’s highlights. Whatever the case, the video player below has you covered!

Explore six in-depth talks, four lightning talks, and a bonus talk from Shiny in Production 2024… or binge-watch them all!

Details for 2025

Feeling inspired after watching the videos? Great news—you can join us for Shiny in Production 2025!

This year’s conference takes place on October 8th–9th, and as of now, early-bird tickets are still available. Stay updated on workshops, speakers, and other key details on our dedicated Shiny in Production 2025 site, or grab your tickets on Eventbrite.

As with the 2024 event, there will also be a satellite R Dev Day starting the afternoon before the conference. Here you can join others making contributions to base R or to infrastructure that supports such contributions. Read organiser Heather Turner’s blog post about the 2024 event to get an idea of what to expect.

For updates and revisions to this article, see the original post

Visualising R Package Risk Assessments using Litmus

Mon, 07 Apr 2025 23:59:00 +0000

A few years ago, we started working with a global pharma company who brought us a particularly thorny challenge. They wanted to use R for FDA submissions—but every package they introduced had to pass through a slow, resource-intensive process to be risk assessed and approved. They’re sadly unable to be gung-ho about what R tooling they use, needing instead to be thoughtful and meticulous, considering the statistical rigour, reproducibility, stability and security before including the tools in their production environment. In practice, this meant that it would take up to two years for them to be able to approve a new R package for use. Ouch.

After performing an audit of their process, we identified a few areas where we could create efficiencies. Our goal: automate everything that could be automated, reducing the manual burden on reviewers while improving consistency and traceability. Development began in earnest last year, and the result is the Litmusverse, a suite of R packages that allows us to risk assess your R package collection, report on the findings and rescue high-risk packages that are business critical.

Everything is then packaged into one easy-to-use application.

Does your package pass the {litmus} test?

What is the Litmusverse? {litmus} grabs your R package metadata and generates valuable quality insights. {litmus.score} transforms these outputs into targeted quality scores—code, documentation, popularity, maintenance—plus an overall package rating. {litmus.report} delivers this intelligence in PDFs for permanent records. {litmus.dashboard} offers a comprehensive overview, empowering R library managers with better decision-making tools and streamlined record-keeping.

Our approach is agnostic regarding the package source - it doesn’t matter if your package is hosted on CRAN, Bioconductor or an internal repository. We can risk assess and remediate it all the same. You can read more about our approach to risk assessment in a recent blog post.

Our aim is to help clients curate a risk-assessed collection of packages, to continue driving innovation using R. Keep an eye out for upcoming blogposts outlining the details of our approach. In the meantime…

Give our dashboard a spin!

We have prepared a Shiny app that allows you to interact with a collection of packages that we have assessed and scored, using {litmus} tools and our new scoring strategy. We’ll be publishing more details about our approach to scoring in the coming weeks. In the app, you will be able to assess the high-level qualities of a package collection, including the distribution of scores:

If you click on ‘Package List’ you’ll be able to see the collection’s metrics in a detailed, sortable table:

If you click on an individual row in this table, it will take you through to a detailed breakdown for the individual package, providing an overview of its score within the collection:

You can also drill down into a visual representation of each feature within the context of the collection of packages:

Ready to put your packages to the test?

The free version of our app allows you to view a subset of CRAN packages. If you are keen to unlock the full potential of Litmus, i.e. customise the package list that is displayed, include your own internally developed packages or non-CRAN packages, record decisions about including a package in your environment, retrieve PDF reports for long-term storage, and remediate business critical packages, we’re ready to help.

Get in touch with us to discuss how we can help you curate a robust R ecosystem using the Litmusverse. As official Posit partners, we are also at the ready to assist you with setting up your ideal R Development environment. For more information about our other Data Science and Data Engineering services, please visit the Jumping Rivers website.

To find out more about how we can facilitate your organisation’s adoption of open-source, please contact us.

Should I Use Your R Package?

Mon, 31 Mar 2025 23:59:00 +0000

The answer to this simple, innocuous question is: it depends.

It depends on the package in question, of course. Perhaps less obviously, but just as importantly, it depends on who’s asking the question.

We’re sure if we asked you about “package quality”, we would all come up with what makes a good package:

Documentation
Unit tests
Author credibility
Does the package have a web page?
Security vulnerabilities
Bug closure rate
Are there multiple maintainers?
Does the package have any reverse dependencies?

We could (and have) come up with another twenty of these attributes. With 95% confidence, we’re sure that most people would agree that everything we’ve thought of is important. But with 100% confidence, we are certain we would disagree on how substantial these characteristics are. Surely, unit testing is more important than the popularity of the package? But how important is the documentation quality relative to the number of maintainers?

It all depends on why we are asking. It’s all about your risk appetite.

What is Risk Appetite?

Risk appetite is all about the risks you are and aren’t willing to take. It ranges from “Our packages need to be vaguely sensible, not compromise our system and have a place where I can log bugs” to “if our packages aren’t thoroughly tested and proven to be fit for purpose, I can’t use them in production”. The former is fairly easy to report on, whereas the latter is quite a bit more complicated.

The Risk Seekers!

Who amongst us wouldn’t want a top-quality R package? Who are the risk seekers? Most of us, at some point or another. If you are experimenting with building Shiny applications, as long as the package is “secure”, any old package is fine - you just want to experiment. Likewise, if you are an academic and you want to compare your method to one already published, as long as the package is “correct”, that’s good enough.

During our training courses, we are often asked this question about quality. How bad can a package be to be usable? A thought experiment we like to do is “suppose you had an R package, with only one version. It’s never updated, no one has heard from the maintainer in ten years. But it provides code for an algorithm you want to use. What would you do?” The obvious answer for those who have a high risk appetite is “something is better than nothing” and “proceed with caution”.

Risk Averse

There are lots of examples of where we are (and should be) risk-averse when it comes to R packages. For example:

In the pharmaceutical industry, we need reassurance that the statistics used in reporting are correct. It’s vital that these packages are highly regulated!
Accuracy and stability are crucial for official Government reports on the state of the economy. A minor bug could have significant consequences.
Banks also work in a regulated environment, running complex models, so have to be careful about the accuracy of their data.

Another crucial aspect is that not only do they need to consider what packages they are using, but also demonstrate this thinking in an auditable manner. This is not dissimilar from the ISO 9001 process. In the context of the Pharmaceutical industry, the holy grail is using R packages in FDA submissions for new therapies.

The R Validation Hub is Paving the Way

The pharmaceutical industry is the first to address these requirements in a meaningful way. The R Validation Hub put out a white paper which addresses the use of R and its packages for statistical analysis in pharmaceutical regulatory submissions, proposing a risk-based approach for validating R packages within validated infrastructure. The paper suggests that base R packages present minimal risk, whilst contributed packages require risk assessment based on their purpose, maintenance practices, community usage, and testing protocols.

The proposed framework classifies packages as either “Intended for Use” (loaded directly by users) or “Imports” (supporting dependencies), focusing validation efforts primarily on the former. Risk assessment should evaluate whether packages are statistical or non-statistical in nature, examine development practices, consider community adoption metrics, and review testing coverage. Organisations can use this assessment to determine package inclusion in validated systems and identify additional testing requirements, with high-risk packages needing more rigorous validation.

The approach required for those not working in regulated industries will probably not be as serious as this, but this gives an idea of what the gold standard for R package validation should be, which we can draw inspiration from for less strict applications. They’ve also created some helpful tools, like {riskmetric} which allows us to pull metadata about packages, and create quality scores for these data.

How Do We Enable Risk Assessment for Everyone Across the Risk Spectrum?

This is the question we have been grappling with over the past few months. How do we gather all of the information required to make informed decisions about including packages in production environments, using a flexible framework that meets the needs of everyone on the risk appetite spectrum? Especially considering…

There are so many packages on CRAN!

This is both a blessing and a curse, as anyone who’s ever worked in a regulated environment can tell you. The obvious answer is to automate, automate, automate! This is exactly what we’ve done in the creation of the Litmus package validation framework.

Our process relies on automation wherever possible:

We have written code based on {riskmetric} that pulls package metadata from CRAN, git repositories and Posit Package Manager to provide a comprehensive overview of the package’s qualities
We have created a framework to analyse and score packages based on these data
We have created reporting and dashboarding workflows that allow us to generate package- and collection-level overviews of the scores for each package
We’ve implemented automatic acceptance/rejection of a package based on client-specified criteria
Our process also enables automated reporting of any additional manual steps taken to save a package from the bin, for example writing additional remedial tests or documentation
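To make the idea of client-specified criteria a little more concrete, here is a minimal sketch of how the same package could be accepted under one risk appetite and rejected under another. It is written in Python purely for brevity and is not the Litmus scoring strategy: the category names, weights, thresholds and the overall_score and decide helpers are all hypothetical.

```python
# Hypothetical weights per risk appetite; each set sums to 1.
WEIGHTS = {
    "risk_seeking": {"code": 0.4, "documentation": 0.2, "popularity": 0.3, "maintenance": 0.1},
    "risk_averse": {"code": 0.4, "documentation": 0.3, "popularity": 0.1, "maintenance": 0.2},
}

# Hypothetical minimum overall score required for automatic acceptance.
THRESHOLDS = {"risk_seeking": 0.4, "risk_averse": 0.75}


def overall_score(scores, appetite):
    """Weighted average of the category scores (each between 0 and 1)."""
    weights = WEIGHTS[appetite]
    return sum(weights[name] * scores[name] for name in weights)


def decide(scores, appetite):
    """Accept the package if its overall score clears the appetite's threshold."""
    if overall_score(scores, appetite) >= THRESHOLDS[appetite]:
        return "accept"
    return "reject"


# Made-up category scores for a single package.
package_scores = {"code": 0.8, "documentation": 0.5, "popularity": 0.9, "maintenance": 0.3}

for appetite in WEIGHTS:
    score = overall_score(package_scores, appetite)
    print(appetite, round(score, 2), decide(package_scores, appetite))
```

With these made-up numbers the package clears the risk-seeking threshold but not the risk-averse one. A rejected package would then be a candidate for the manual remediation steps described above.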

Keep an eye out for future blogs on this topic, as we dive a little deeper into the underlying principles driving our approach to package validation.

Does Your Package Pass the Litmus Test?

Ready to find out how we can help you validate your R package collection? Check out the Litmusverse and Get in touch.

Sparklines in Reactable Tables in Shiny Apps

Thu, 27 Mar 2025 23:59:00 +0000

This is the third blog in a series about the {sparkline} R package for inline data visualisations. You can read the first one about getting started with the package here and the second one about embedding them in HTML tables with the {reactable} package here.

In this blog I am taking it a step further and demonstrating how to use our sparkline reactable table in a Shiny app. Thankfully {reactable} has some helpful functions that make this super easy! I will also look at using a dynamic traffic light image in a reactable table at the end.

Reactable Sparkline Table

I’m going to start where we ended the last blog. The following code creates a {reactable} table using a small mock dataset, with a few {sparkline} visualisations in the columns.

library(sparkline)
library(reactable)
library(dplyr)

data = tibble(
 names = c("x", "y", "z"),
 values = c(list(rnorm(10)), list(rnorm(10)), list(rnorm(10)))
 ) |>
 mutate(box = NA,
 line = NA,
 bar = NA)

table = reactable(data,
 columns = list(
 values = colDef(show = FALSE),
 box = colDef(cell = function(value, index) {
 sparkline(data$values[[index]], type = "box")
 }),
 line = colDef(cell = function(value, index) {
 sparkline(data$values[[index]], type = "line")
 }),
 bar = colDef(cell = function(value, index) {
 sparkline(data$values[[index]], type = "bar")
 })
 )
 )

Using sparklines in a Shiny App

This is made very easy by two {reactable} functions which follow the traditional Shiny naming convention. In the server we’ll use renderReactable (which uses htmlwidgets::shinyRenderWidget under the hood) to create our table. Then in the UI we’ll use reactableOutput (which uses htmlwidgets::shinyWidgetOutput) to display it.

To demonstrate this I am using a basic shiny app with a sparkline bullet chart in a reactable table then a screenshot of the result.

# Server
library(shiny)
library(reactable)
library(sparkline)
library(dplyr)

server <- function(input, output) {

  output$sparkline_table <- renderReactable({

    # Per-species summaries used by the bullet chart
    d = iris |>
      group_by(.data$Species) |>
      mutate(mean = mean(.data$Sepal.Length),
             lower_range = range(.data$Sepal.Length)[1],
             upper_range = range(.data$Sepal.Length)[2],
             bullet = NA)

    # The table is the last expression, so it is what gets rendered
    reactable(
      d,
      defaultColDef = colDef(show = FALSE),
      columns = list(
        Species = colDef(show = TRUE),
        Sepal.Length = colDef(show = TRUE),
        bullet = colDef(
          cell = function(value, index) {
            sparkline(c(d$mean[[index]],
                        d$Sepal.Length[[index]],
                        d$upper_range[[index]],
                        d$lower_range[[index]]),
                      type = "bullet")
          },
          show = TRUE
        )
      )
    )
  })
}

# UI
ui <- fluidPage(

 titlePanel("Hello Sparkline!"),

 sidebarLayout(
 sidebarPanel = sidebarPanel(
 sliderInput(inputId = "rows",
 label = "Number of rows:",
 min = 1,
 max = 50,
 value = 30)
 ),

 mainPanel = mainPanel(

 reactableOutput(outputId = "sparkline_table")

 ))
)

# Launch the app
shinyApp(ui = ui, server = server)

Dynamic Image in a Reactable Table

Another thing that you can do with {reactable} is create dynamic image columns. To show this, I’ve created a traffic light visualisation with 3 levels:

Level 1 (Green):

Level 2 (Amber):

Level 3 (Red):

For this example I’m only going to include the code required to create the {reactable} table but following the steps above will work for a shiny app as well, ensuring that the images are available to the app at the path you pass to the table.

The key here is to use a reactable column definition whose cell argument is a function. This function takes the value and creates an HTML image tag with the path to the correct SVG file (PNG and JPEG will work the same).

library(tibble)
library(htmltools)
library(reactable)

data <- tibble(
 Value = 1:3,
 `Traffic Light` = 1:3
)

path = "/blog/sparkline-reactable-shiny/images/"

table =
 reactable(data,
 defaultColDef = colDef(align = "center"),
 columns = list(`Traffic Light` = colDef(
 cell = function(value) {
 src = paste0(path, value, ".svg")
 image = img(src = src, style = "height: 40px;")
 tagList(
 div(
 style = "display: inline-block; width: 60px",
 image)
 )
 })
 )
 )

In this blog we have looked at embedding sparkline reactable tables into a shiny app and using another type of dynamic image inside a reactable table. This brings me to the end of the series on {sparkline}, with a notable cameo from {reactable} and a bit of {shiny} too. Stay tuned for similar data science blogs.

Sparklines in Reactable Tables

Thu, 13 Mar 2025 23:59:00 +0000

This is the second blog in a series about the {sparkline} R package for inline data visualisations. You can read the first one here. In this post I will be demonstrating how you can include sparklines inside HTML tables.

Reactable

{reactable} is an R package for producing HTML tables, commonly used in Shiny.

To create an HTML reactable table, all we need to do is pass a data.frame object to the reactable function. These tables have a nice, simple default look, but we can also add our own styles very easily. In our first example of a table I am just using R’s built-in iris dataset.

library(reactable)

reactable(iris)

A few things that can be easily added to reactable tables are filters, sortable columns, searchable columns, a default page size, borders, striped rows and text wrapping. Along with these arguments we can of course implement our own styling with CSS.

reactable(
 iris,
 striped = TRUE, searchable = TRUE,
 filterable = TRUE, bordered = TRUE,
 defaultPageSize = 8
)

Sparklines in Reactable Tables

Box, Line and Bar Charts

When it comes to embedding sparklines in reactable tables we need to add a new column to our table, which we will then overwrite in the columns argument of reactable.

In the first example I am using a mock dataset with 3 observations, ‘x’, ‘y’ and ‘z’. Each one is just a list containing 10 values generated by rnorm. Then I am using dplyr’s mutate function to add three placeholder columns (box, line and bar) full of NA values.

Now on the reactable side, I am again using the reactable function, this time with the columns argument, which takes a “Named list of column definitions”. For each sparkline I use colDef to supply a cell function which takes value and index arguments. Inside it, I call the sparkline function and pass data$values[[index]] along with the type to determine which chart I’d like. You can also set column preferences in colDef; I have used it here to hide the values column.

library(sparkline)
library(dplyr)

data = tibble(
 names = c("x", "y", "z"),
 values = c(list(rnorm(10)), list(rnorm(10)), list(rnorm(10)))
 ) |>
 mutate(box = NA,
 line = NA,
 bar = NA)

table = reactable(data,
 columns = list(
 values = colDef(show = FALSE),
 box = colDef(cell = function(value, index) {
 sparkline(data$values[[index]], type = "box")
 }),
 line = colDef(cell = function(value, index) {
 sparkline(data$values[[index]], type = "line")
 }),
 bar = colDef(cell = function(value, index) {
 sparkline(data$values[[index]], type = "bar")
 })
 )
 )

Bullet Chart

In our final example, I am again using the iris data, but this time I’m adding summary values for each species: the mean and the range (minimum and maximum) of the Sepal.Length column. These values will be used to create a bullet graph. In a bullet graph, an observed value (the ‘performance’) is compared against a target value, alongside an illustration of the data spread (here the range). In a given row of the figure, the value of Sepal.Length for a specific iris will be presented as the performance; the target that this is compared against is the mean for the relevant species, and the minimum and maximum for that species provide the two range values.

Creating the reactable table is slightly different to our previous example, where I just passed a list of values to the sparkline function. For a bullet graph I need to pass in a vector in the form c(target, performance, range1, range2). I can access the values via d$ (or another form of extraction) and specify which row I need with [[index]].

d = iris |>
 group_by(.data$Species) |>
 mutate(mean = mean(.data$Sepal.Length),
 lower_range = range(.data$Sepal.Length)[1],
 upper_range = range(.data$Sepal.Length)[2],
 bullet = NA)

iris_table = reactable(d, defaultColDef = colDef(show = FALSE),
 columns = list(
 Species = colDef(show = TRUE),
 Sepal.Length = colDef(show = TRUE),
 bullet = colDef(cell = function(value, index) {
 sparkline(c(d$mean[[index]],
 d$Sepal.Length[[index]],
 d$upper_range[[index]],
 d$lower_range[[index]]), type = "bullet")
 }, show = TRUE)
 ))

In this blog we have implemented box-plots, bar, line and bullet graphs into reactable tables. Other options can be found on the jQuery Sparklines website or in the previous blog. Stay tuned for the next blog in this series on using sparkline reactable tables in Shiny apps.

Shiny in Production 2025: Abstracts Deadline Extension

Tue, 11 Mar 2025 23:59:00 +0000

Call for Abstracts Deadline Extended

Good news! We’re extending the deadline for abstract submissions for Shiny in Production 2025 by two weeks. You now have until 11:59 PM BST on 3rd April 2025 to submit your proposal.

This extension gives you extra time to refine your ideas and submit a strong proposal for the conference, which will take place on 8th-9th October 2025 in Newcastle upon Tyne, UK.

Why Submit?

Shiny in Production is the premier event for developers, data scientists, and industry professionals using {shiny} in production environments. If you have insights, case studies, or innovative applications of Shiny, this is your chance to share your expertise with the community.

Topics of Interest

We invite abstracts on a wide range of topics, including but not limited to:

AI & Machine Learning in Shiny: Integrating predictive models, LLMs, and generative AI into Shiny applications.
Shiny for Large Enterprises: How big companies successfully deploy and maintain Shiny apps.
Data Storytelling with Shiny: Making complex data insights accessible through compelling visual narratives.
Shiny vs. Other Web Frameworks: A comparison of when and why to choose Shiny over alternatives.
Beyond Dashboards / Creative Uses of Shiny: Exploring non-traditional applications like simulations, process automation, and interactive reports.
Python: Developing Python Shiny apps
Automated Testing and Continuous Deployment: Best practices for maintaining high-quality applications through automated workflows.

To get an idea of past topics, check out our YouTube channel, where we have playlists of talks from Shiny in Production 2022, 2023 and 2024.

Submission Guidelines

To submit your abstract, please follow these guidelines:

Abstract length: Up to 250 words.
Deadline: Submissions must be received by 11:59PM BST on ~~20th March 2025~~ 3rd April 2025.
Submission portal: Submit your abstract here.

Important Dates

Abstract submission deadline: ~~20th March 2025~~ 3rd April 2025
Notification of acceptance: mid-April 2025
Conference dates: 8th-9th October 2025

For more information, visit our conference website.

Vetiver: MLOps for Python

Thu, 27 Feb 2025 23:59:00 +0000

This post is the fourth in our series on MLOps with vetiver:

Part 1: Vetiver: First steps in MLOps
Part 2: Vetiver: Model Deployment
Part 3: Vetiver: Monitoring Models in Production
Part 4: Vetiver: MLOps for Python (this post)

Parts 1 to 3 introduced the {vetiver} package for R and outlined its far-reaching applications in MLOps. But did you know that this package is also available in Python? In this post we will provide a brief outline to getting your Python models into production using vetiver for Python.

Installation

Like any other Python package on PyPI, vetiver can be installed using pip. Let’s set up a virtual environment and install all of the packages that will be covered in this blog:

python -m venv venv/
source venv/bin/activate
pip install vetiver pandas pyjanitor scikit-learn pins

Check out our previous blog about virtual environments in Python for more details.

Data

We will be working with the World Health Organisation Life Expectancy data which provides the annual average life expectancy in a number of countries. This can be downloaded from Kaggle:

import pandas as pd

url = "https://www.kaggle.com/api/v1/datasets/download/kumarajarshi/life-expectancy-who"
data = pd.read_csv(url, compression = "zip")
data.head()
#> Country Year ... Income composition of resources Schooling
#> 0 Afghanistan 2015 ... 0.479 10.1
#> 1 Afghanistan 2014 ... 0.476 10.0
#> 2 Afghanistan 2013 ... 0.470 9.9
#> 3 Afghanistan 2012 ... 0.463 9.8
#> 4 Afghanistan 2011 ... 0.454 9.5
#> 
#> [5 rows x 22 columns]

Let’s drop missing data, clean up the column names and select a subset of the variables to work with:

import janitor

data = data.dropna()
data = data.clean_names(strip_underscores=True)
data = data[[
 "life_expectancy",
 "percentage_expenditure",
 "total_expenditure",
 "population",
 "bmi",
 "schooling",
]]
data.head()
#> life_expectancy percentage_expenditure ... bmi schooling
#> 0 65.0 71.279624 ... 19.1 10.1
#> 1 59.9 73.523582 ... 18.6 10.0
#> 2 59.9 73.219243 ... 18.1 9.9
#> 3 59.5 78.184215 ... 17.6 9.8
#> 4 59.2 7.097109 ... 17.2 9.5
#> 
#> [5 rows x 6 columns]

Vetiver is compatible with models built in scikit-learn, PyTorch, XGBoost and statsmodels. The actual modelling process is not so important in this blog. We will be more interested in how we go about taking this model into production using vetiver. So let’s go with a simple K-Nearest Neighbour model built using scikit-learn:

from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

target = "life_expectancy"
covariates = [
 "percentage_expenditure",
 "total_expenditure",
 "population",
 "bmi",
 "schooling",
]
y = data[target]
X = data[covariates]

model = Pipeline(
 [
 ("transform", StandardScaler()),
 ("model", KNeighborsRegressor()),
 ]
)
model.fit(X, y)
#> Pipeline(steps=[('transform', StandardScaler()),
#> ('model', KNeighborsRegressor())])

Let’s break down what’s happened here:

We selected our target variable (life expectancy) and the covariates (features) that will be used to predict the target.
We constructed a modelling pipeline which includes:
- Preprocessing of input data via standardisation.
- K-Nearest Neighbours regression.
In the final step, we fitted our model to the training data.
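For intuition about what the two pipeline steps do, the sketch below re-implements them in plain Python on a tiny made-up dataset. This is not how scikit-learn is implemented: the fit_scaler, standardise and knn_predict helpers and the toy numbers are hypothetical, and k=2 is used only because the toy data is so small (KNeighborsRegressor defaults to five neighbours).

```python
import math
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Return the column means and population standard deviations."""
    columns = list(zip(*rows))
    return [fmean(c) for c in columns], [pstdev(c) for c in columns]


def standardise(row, means, sds):
    """Rescale each feature to mean 0 and standard deviation 1."""
    return [(x - m) / s for x, m, s in zip(row, means, sds)]


def knn_predict(train_X, train_y, new_row, k=2):
    """Average the targets of the k nearest training rows (Euclidean distance)."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], new_row))
    return fmean([target for _, target in ranked[:k]])


# Toy training data: (schooling, bmi) -> life expectancy. Values are made up.
X = [[10.0, 19.0], [12.0, 25.0], [15.0, 30.0], [8.0, 17.0]]
y = [60.0, 70.0, 80.0, 55.0]

# Step 1: learn the scaling from the training data and apply it.
means, sds = fit_scaler(X)
X_scaled = [standardise(row, means, sds) for row in X]

# Step 2: predict for a new row, scaled in exactly the same way.
prediction = knn_predict(X_scaled, y, standardise([11.0, 22.0], means, sds))
print(prediction)
```

Standardising first matters for a distance-based model like this one: without it, a feature measured on a large scale (such as population) would dominate the distance calculation.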

Usually at this point we would evaluate how our model performs on some unseen test data. However, for brevity we’ll now go straight to the MLOps steps.

MLOps

In a typical MLOps workflow, we are setting up a continuous cycle in which our trained model is deployed to a cloud environment, monitored in this environment, and then retrained on the latest data. The cycle repeats so that we are always maintaining a high model performance and avoiding the dreaded model drift (more on this later).

From the diagram above, the crucial steps that set this workflow apart from a typical data science project are model versioning, deployment and monitoring. We will go through each of these in turn using vetiver.

Before we can begin, we must convert our scikit-learn model into a “vetiver model”:

import vetiver

v_model = vetiver.VetiverModel(model, model_name="KNN", prototype_data=X)
print(type(v_model))
#> <class 'vetiver.vetiver_model.VetiverModel'>
print(v_model.description)
#> A scikit-learn Pipeline model
print(v_model.metadata)
#> VetiverMeta(user={}, version=None, url=None, required_pkgs=['scikit-learn'], python_version=(3, 10, 12, 'final', 0))

Our VetiverModel object contains model metadata and dependencies (including the Python packages used to train it and the current Python version). The model_name will be used to identify the model later on, and the prototype_data will provide some example data for the model API (more on this below).

Model versioning

In a cycle where our model is continuously being retrained, it is important to ensure that we can retrieve any models that have previously been deployed. Vetiver utilises the pins package for model storage. A pin is simply a Python object (could be a variable, data frame, function, …) which can be stored and retrieved at a later time. Pins are stored in “pins boards”. Examples include:

Local storage on your device
Google Drive
Amazon S3
Posit Connect

Let’s set up a temporary pins board locally for storing our model:

from pins import board_temp

model_board = board_temp(
 versioned=True, allow_pickle_read=True
)
vetiver.vetiver_pin_write(model_board, v_model)
#> Model Cards provide a framework for transparent, responsible reporting. 
#> Use the vetiver `.qmd` Quarto template as a place to start, 
#> with vetiver.model_card()
#> Writing pin:
#> Name: 'KNN'
#> Version: 20250220T141808Z-af3d5

Enabling allow_pickle_read will allow quick reloading of the model later on, whenever we need it.

At this stage our VetiverModel object is now stored as a pin, and we can view the full list of “KNN” model versions using:

model_board.pin_versions("KNN")
#> created hash version
#> 0 2025-02-20 14:18:08 af3d5 20250220T141808Z-af3d5

As expected, we only have one version stored so far!
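The version identifier in the output above appears to combine a creation timestamp with a short hash. As a small sketch, assuming that "timestamp-hash" layout (inferred from the output shown here rather than taken from a documented contract of the pins package) and assuming the trailing Z denotes UTC, the hypothetical parse_pin_version helper below splits a version string into its parts:

```python
from datetime import datetime, timezone


def parse_pin_version(version):
    """Split a version string of the assumed form '<timestamp>-<hash>'."""
    stamp, _, short_hash = version.partition("-")
    # Assumption: the trailing 'Z' means the timestamp is in UTC.
    created = datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return created, short_hash


created, short_hash = parse_pin_version("20250220T141808Z-af3d5")
print(created.isoformat(), short_hash)
```

Something along these lines could be handy when filtering a long list of versions by date, for example to find the model that was live at a particular time.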

Model deployment

If we want to share our model with other users (colleagues, stakeholders, customers) we should deploy it to an endpoint on the cloud where it can be easily shared. To keep things simple for this blog, and to ensure the code examples provided here are fully reproducible, we will just deploy our model to the localhost.

First we have to construct a model API. This is a simple interface which takes some input and gives us back some model predictions. Crucially, APIs can be hosted on the cloud where they can receive input data via HTTP requests.

Our VetiverModel object already contains all of the info necessary to build an API using the FastAPI framework:

app = vetiver.VetiverAPI(v_model, check_prototype=True)

Running app.run(port=8080) will start a local server for the model API on port 8080. We are then presented with a simple graphical interface in which we can run basic queries and generate predictions using our model. The prototype_data argument which we defined when constructing our VetiverModel (see above) is used here to provide some example input data for queries:

Alternatively we can also submit queries from the command line. The graphical interface above provides template curl commands which can be copied into the command line and executed against the model. For example, the input data shown in the above screenshot can be fed into the model via a POST request:

curl -X POST "http://127.0.0.1:8080/predict" \
 -H "Accept: application/json" \
 -H "Content-Type: application/json" \
 -d '[{"percentage_expenditure":71.27962362,"total_expenditure":8.16,"population":33736494,"bmi":19.1,"schooling":10.1}]'

The same command would work for querying APIs on the cloud as long as the IP address for the API endpoint (here it is http://127.0.0.1, which points to the localhost) is updated accordingly.
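The same request can also be sent from Python. The sketch below uses only the standard library and mirrors the curl command above; the build_predict_request and send helpers are hypothetical names introduced for illustration, and the call that actually sends the request is left commented out because it needs the model API to be running.

```python
import json
from urllib import request


def build_predict_request(rows, base_url="http://127.0.0.1:8080"):
    """Build (but do not send) a POST request for the /predict endpoint."""
    return request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(rows).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def send(req):
    """Send the request and decode the JSON response from the API."""
    with request.urlopen(req) as response:
        return json.loads(response.read())


# The same input row as in the curl example.
rows = [{
    "percentage_expenditure": 71.27962362,
    "total_expenditure": 8.16,
    "population": 33736494,
    "bmi": 19.1,
    "schooling": 10.1,
}]
req = build_predict_request(rows)
print(req.get_method(), req.full_url)

# predictions = send(req)  # uncomment once the API is running on port 8080
```

To query a cloud deployment instead, only the base_url argument would need to change.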

Deploying your model locally is a great way to test that your API behaves as you expect. What’s more, it’s free and does not require setting up an account with a cloud provider! But how would we go about deploying our model to the cloud?

If you already have a server on Posit Connect, it’s just a case of running vetiver.deploy_rsconnect() (see the Posit vetiver documentation for more details). If you don’t have Posit Connect, not to worry! Instead you can start by running:

vetiver.prepare_docker(model_board, "KNN")

This command is doing a lot of heavy lifting behind the scenes:

Lists the Python package dependencies in a vetiver_requirements.txt file.
Stores the Python code for the model API in an app.py file.
Creates a Dockerfile containing the Python version requirement for the model and the docker commands for building and running the API. An example is shown below:

# # Generated by the vetiver package; edit with care
# start with python base image
FROM python:3.10

# create directory in container for vetiver files
WORKDIR /vetiver

# copy and install requirements
COPY vetiver_requirements.txt /vetiver/requirements.txt

#
RUN pip install --no-cache-dir --upgrade -r /vetiver/requirements.txt

# copy app file
COPY app.py /vetiver/app/app.py

# expose port
EXPOSE 8080

# run vetiver API
CMD ["uvicorn", "app.app:api", "--host", "0.0.0.0", "--port", "8080"]

With these files uploaded to the cloud server of your choosing, the docker build command will take care of the rest. This process can be automated on AWS, Google Cloud Run, Azure, and many other cloud platforms.

Model monitoring

Success! Your model is now deployed and your users are interacting with it. But this is only the beginning…

Data changes! Over time you will notice various aspects of your data changing in unexpected ways:

The way the data is distributed may change (data drift).
The relationship between the target variable and covariates may change (concept drift).

These two processes will conspire to create model drift, where your model predictions start to drift away from the true values. This is why MLOps is not simply a one-off deployment. It is a continuous cycle in which you will be retraining your model on the latest data on a regular basis.

While we will not be providing a full worked example of model drift here, we will just mention some helpful functions provided by vetiver to deal with this problem:

vetiver.compute_metrics(): computes key metrics at specified time intervals, allowing us to understand how the model performance varies over time.
vetiver.pin_metrics(): stores the model metrics in a pins board for future retrieval.
vetiver.plot_metrics(): plots the metrics over time.

You can get an idea of how these Python methods can be used by reading our previous blog post where we monitored the model’s performance using vetiver for R.

The metrics can be entirely defined by the user, and might include the accuracy score for a classification model and the mean squared error for a regression model. We can also make use of predefined scoring functions from the sklearn.metrics library.

For more on model monitoring, check out the Posit vetiver documentation.

Summary

Hopefully by reading this post you will have a better understanding of MLOps and how to get started with MLOps in Python. Most importantly, you don’t have to be an expert in AWS or Azure to get started! Vetiver provides intuitive, easy-to-use functions for learning the crucial steps of MLOps including versioning your model, building a model API, and deploying your model using docker or Posit Connect.

For some further reading, check out:

Our previous blog posts on vetiver with R.
The Posit vetiver documentation.

For updates and revisions to this article, see the original post

Shiny in Production 2025: Call for Abstracts

Mon, 17 Feb 2025 23:59:00 +0000

Call for abstracts now open

We are excited to announce the Call for Abstracts for Shiny in Production 2025, to be held on 8th-9th October 2025 in Newcastle upon Tyne, UK. This event brings together industry experts, data scientists, and developers to explore the latest advancements and best practices in deploying Shiny applications in production settings.

About the Conference

As Shiny continues to revolutionise data visualisation and interactive web applications, the need for robust, scalable, and efficient production environments is more critical than ever. This conference aims to address these needs by providing a platform for knowledge sharing, collaboration, and innovation.

Whether you’re a seasoned {shiny} user who wants to network and share knowledge, someone who’s just getting started and wants to learn from the experts, or anybody in between, if you’re interested in {shiny}, this conference is for you.

Topics of Interest

We invite abstracts on a wide range of topics, including but not limited to:

AI & Machine Learning in Shiny: Integrating predictive models, LLMs, and generative AI into Shiny applications.
Shiny for Large Enterprises: How big companies successfully deploy and maintain Shiny apps.
Data Storytelling with Shiny: Making complex data insights accessible through compelling visual narratives.
Shiny vs. Other Web Frameworks: A comparison of when and why to choose Shiny over alternatives.
Beyond Dashboards / Creative Uses of Shiny: Exploring non-traditional applications like simulations, process automation, and interactive reports.
Python: Developing Shiny apps in Python.
Automated Testing and Continuous Deployment: Best practices for maintaining high-quality applications through automated workflows.

To get an idea of past topics, check out our YouTube channel, where we have playlists of talks from Shiny in Production 2022, 2023 and 2024.

Submission Guidelines

To submit your abstract, please follow these guidelines:

Abstract length: Up to 250 words.
Deadline: Submissions must be received by 11:59PM on 20th March 2025.
Submission portal: Submit your abstract here.

Important Dates

Abstract submission deadline: 20th March 2025
Notification of acceptance: mid-April 2025
Conference dates: 8th-9th October 2025

For more information, visit our conference website.

Sparkline Package for Inline Visualisations

Thu, 13 Feb 2025 23:59:00 +0000

Introduction

Sparkline

This is the introductory part of a blog series; the second part will cover embedding sparklines in HTML tables. The CRAN {sparkline} package allows you to make small inline HTML charts in R using jQuery.

Charts Available With {sparkline}

You can make the following charts with the package:

Line
Bar
Tristate
Discrete
Bullet
Pie
Box Plot

The omnipotent.net website has a great feature for viewing the different types of chart. I will show a few examples here, along with the code for producing them.

Note the documentation for all of the different charts is great and can be found here.

All of the plots from this package use the sparkline function, and we pass the type of chart we want as the type argument (the default is line). The function takes a vector or list for the values argument; depending on the type of chart we are creating, this can be either the data to plot or specifications for the plot.

Note: I am using spk_add_deps so I can display them in this blog.

Line

library(sparkline)

data = list(1, 2, 3, 5, 2, 3, 4, 6, 9, 2, 4, 6)

line1 = sparkline(values = data, type = "line") |>
 spk_add_deps()

We can remove the fill and the spots using the following arguments.

line2 = sparkline(values = data, type = "line", spotRadius = "", fillColor = "")

Bar

bar1 = sparkline(values = data, type = "bar")

We can change the bar colors, spacing and width.

bar2 = sparkline(values = data, type = "bar", barColor = "red", barSpacing = "10", barWidth = 8)

Box Plots

For box plots, the data passed to the values argument will be used to calculate the chart. If you want to pass in pre-computed values such as the median, max or min, you can do this by setting raw = TRUE and passing them in as the values argument.

box = sparkline(values = data, type = "box")

Bullet Graph

This is one of the examples where the values passed correspond to specifications for the plot and should be ordered as: target, performance, range1, range2, range3.

bullet1 = sparkline(values = c(7, 5, 10, 6, 3), type = "bullet")

bullet2 = sparkline(values = c(7, 5, 10, 6, 3), type = "bullet", rangeColors = c("lightgrey", "grey", "slategrey"))

See you in the next blog post about embedding sparklines in html tables.

Porting a Shiny App to Observable Framework: Part 2

Thu, 30 Jan 2025 23:59:00 +0000

Preamble

This post, Part 2 in a series of two, looks at styling and deploying the Observable Framework app we built in Part 1. Codeblocks with burgundy backgrounds refer to specific tagged commits in the accompanying GitHub repository.

Styling the App with CSS

We can add a stylesheet by referencing it through the “style” property in the configuration file: observablehq.config.js. That config file can be used to define various attributes for our project, including what title and favicon should be displayed in the browser tab, where the root of the source code is (root: "src") and where, relative to that root, the stylesheet is stored (style: "style/style.css").

You can go crazy here with your CSS or keep it simple. Since this is just meant as a quick demonstration we’ll do the latter: we’ll tweak the appearance of controls, add Jumping Rivers fonts and colours and rearrange the layout for wider screens:

src/style/style.css

@import url("https://fonts.googleapis.com/css2?family=Outfit:wght@100..900&display=swap");


body {
 font-family: "Outfit", sans-serif;
 position: relative;
 color: #0c293d;
 background-color: #fcfbfa;
}

main {
 display: grid;
 justify-content: center;
 align-items: center;
 column-gap: 3em;
 grid-template-columns: 350px 500px;
 grid-template-areas:
 "title title"
 "controls chart"
 "controls count";
}

main > h1 {
 font-weight: 600;
 grid-area: title;
 text-align: center;
}

main > div {
 display: none;
}

main > div:has(form) {
 display: unset;
 grid-area: controls;
 padding-top: 1em;
}

main > div:has(figure) {
 display: flex;
 justify-content: center;
 grid-area: chart;
}

main > p {
 grid-area: count;
 text-align: center;
}

input[type="number"] {
 text-align: right;
}

main form[class^="inputs"]:has(input[type="number"]) {
 display: inline-flex;
 flex-direction: column;
 width: calc(50% - 1em);
 margin-right: 1em;
 margin-bottom: 1em;
}

main form[class^="inputs"]:has(input[type="number"]) label {
 width: 100%;
}

main form[class^="inputs"]:has(select, input[type="range"], input[type="text"], input[type="radio"]) {
 display: flex;
 width: 100%;
 flex-direction: column;
 margin-bottom: 0.5em;
}

main form[class^="inputs"]:has(select, input[type="range"], input[type="text"], input[type="radio"]) > * {
 width: 100%;
}

[aria-label="tip"] {
 fill-opacity: 0.8;
}

[aria-label="tip"] text tspan:first-child {
 font-weight: bold;
}

@media (max-width: 950px) {
 main {
 padding: 0 1em;
 grid-template-columns: unset;
 grid-template-areas:
 "title"
 "controls"
 "chart"
 "count";
 }
}

To keep things succinct, our stylesheet makes use of the relatively new CSS :has pseudo-class (Firefox was the last major browser to add support, in late 2023). If you need to support older browsers you’d have to find another way of doing things. Using :has allows us, for example, to target elements with specific descendants without relying too much on the generated classes remaining unchanged and without manually adding explicit ids or classes to those target elements.
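If older browsers matter, one option is to detect support at runtime and fall back to alternative styling. The following is a minimal sketch rather than part of the app: the function name supportsHasSelector and the "no-has" class are hypothetical, and it relies on the standard CSS.supports() method with its selector() syntax.

```javascript
// Returns true when the current environment understands the :has() selector.
// Outside a browser (e.g. in Node) there is no CSS global, so this returns false.
function supportsHasSelector() {
  return (
    typeof CSS !== "undefined" &&
    typeof CSS.supports === "function" &&
    CSS.supports("selector(:has(*))")
  );
}

// Hypothetical fallback hook: flag the page so that alternative rules,
// written without :has, can be scoped under a ".no-has" class.
if (typeof document !== "undefined" && !supportsHasSelector()) {
  document.documentElement.classList.add("no-has");
}
```

The fallback rules themselves would still need to be written by hand, most likely using explicit classes on the target elements.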


git switch --detach styles

Tidying Up

All that’s left now to “complete” our app is to tidy up a few loose ends, removing some comments and files that are no longer helpful. This amounts to:

Updating the README
Updating and pruning the observablehq.config.js configuration file
Deleting a JavaScript file we don’t use
Removing an irrelevant image file


git switch --detach tidy

Deployment

You can build a static version of the app using:

npm run build

This is only static in the sense that the output files can be served by essentially any old server; there’s no need to have a server that can process R scripts (or Python, Rust etc.) or build HTML from markdown. You won’t get the hot reloading that you get with npm run dev as you make changes, but the output, which by default gets dumped in a dist/ directory, can be deployed almost anywhere. That includes Observable cloud, which is super-easy to do. Run

npm run deploy

You’ll be asked to sign in if you haven’t already: you can use your GitHub credentials for this, if you like. After that you’ll get a few simple questions to answer about naming, visibility and the like and then - within a minute or so - it’s done, with a link to the deployed app printed to the terminal. View our app. The Observable website has further instructions if you want to go down the route of automated deploys and/or GitHub actions.


git switch --detach deploy

Final Thoughts

This was a fun thing to try and didn’t take especially long to implement. The way you can add scripts for data generation and things “just work” is really neat. Having the whole of d3 and Observable Plot available without having to do explicit installs and imports is also helpful. Because of these things, setup of a new project can be really quick. Deployment to Observable cloud is also super speedy and other deployment targets shouldn’t be difficult, either.

On the negative side I’m not convinced by the use of markdown files for generating dashboards. For anything complex, HTML (or a framework that uses HTML-based template syntax like Vue or Svelte) just seems more logical to me. I also haven’t yet been converted over to the notebook style of development with fenced JavaScript blocks.

In short, the speed at which a new project can be set up can make Observable Framework a good solution for prototyping dashboards and interactive websites. Simple deployment options make it easy to share such prototypes with other stakeholders. For production applications I’m not sure what Observable Framework offers that can’t be built in a more maintainable way with popular, “traditional”, JavaScript frameworks. These can still use Observable Plot, which I do think works nicely and which I will definitely be using again: you just have to explicitly add it to the project and import it where needed.

Shiny in Production 2025

Thu, 23 Jan 2025 23:59:00 +0000

The fourth instalment of Shiny in Production is back this October, hosted at the Catalyst in Newcastle upon Tyne, with the super early bird deadline on the 31st of January!

Set in the heart of Newcastle, this conference dives into the world of {shiny} and other web-focused R packages. Whether you’re a seasoned {shiny} user looking to connect and share insights, a beginner eager to learn from experts, or anyone in between, this event is tailored for anyone passionate about {shiny}.

The two-day program includes an afternoon of hands-on workshops, followed by a full day of engaging conference talks. You can choose a ticket for the conference only or bundle it with one of the workshops for a deeper learning experience.

In addition, attendees can also join the “R Dev Day,” a satellite event running alongside Shiny in Production.

For more information, check out the conference website and to buy tickets go to the eventbrite page.

Porting a Shiny App to Observable Framework: Part 1

Thu, 16 Jan 2025 23:59:00 +0000

Preamble

This post, Part 1 in a series of two, looks at porting the functional code of a Shiny app - written in R - into JavaScript code to be used in an Observable Framework application. Part 2 will look at styling and deploying the ported application.

Background and Motivation

If you’re interested in interactive data visualisation you’ve probably heard of the d3 JavaScript library, even if you’ve never used it or don’t know any JavaScript. Mike Bostock, the creator of d3, and colleagues followed this up with d3.express, which was quickly renamed to Observable. In Mike’s words:

It’s for exploratory data analysis, for understanding systems and algorithms, for teaching and sharing techniques in code, and for sharing interactive visual explanations. To make visualization easier—to make discovery easier—we first need to make coding easier.

If you’re not familiar with Observable, think of Jupyter notebooks or Mathematica but with JavaScript (sort of).

And following on from Observable came Observable Plot:

Observable Plot is a free, open-source, JavaScript library for visualizing tabular data, focused on accelerating exploratory data analysis. It has a concise, memorable, yet expressive interface, featuring scales and layered marks in the grammar of graphics style popularized by Leland Wilkinson and Hadley Wickham and inspired by the earlier ideas of Jacques Bertin.

If you like ggplot2 and like the look of d3 but are put off by the idea of having to dive deep into hardcore JavaScript and low-level SVG primitives, Observable Plot could be just the thing for you. Even if you don’t know JavaScript, if you can read JSON and have some experience with reactive programming in a notebook, then I suspect you could probably pick up Observable Plot in the Observable environment fairly quickly.

More recently, the Observable team released Observable Framework (often shortened to just “Framework” with a capital “F”). In their own words:

Observable Framework is an open-source static site generator for data apps, dashboards, reports, and more. Framework includes a preview server for local development, and a command-line interface for automating builds & deploys.

You write simple Markdown pages — with interactive charts and inputs in reactive JavaScript, and with data snapshots generated by loaders in any programming language (SQL, Python, R, and more) — and Framework compiles it into a static site with instant page loads for a great user experience. Since everything is just files, you can use your preferred editor and source control, write unit tests, share code with other apps, integrate with CI/CD, and host projects anywhere.

Having a background in both data science and web development, I’ve spent many hours with the {ggplot2} and {shiny} packages and many more wrangling and visualising data using d3. I’ve also dabbled with the Observable environment but, until now, never used Observable Plot. With the addition of Observable Framework, this seemed like an opportune time to take a look at both and see how they compare to Shiny.

The Shiny App

To pick a suitable app to experiment with I scoured the Shiny gallery page. I wanted a “Goldilocks” example: not really simple but not highly complex, either. And, obviously, something with a chart. The Movie explorer seemed to fit the bill perfectly: a single chart but with lots of permitted modifications. Perfect for some reactive programming. A zoomed out screenshot of the app (below) shows that it is, perhaps, too tall. This means that users would have to scroll to see those controls lying at the bottom, putting the top of the chart out of view.

GitHub

You can follow along with this blog post yourself by adding and removing code, step-by-step. You can also clone our repository from GitHub:

git clone https://github.com/jumpingrivers/observable-framework-movie-explorer.git

The “main” branch here is in the “final” state of the app but there are also tags marking the commits for the end of each step we take, that can be easily switched to, as noted at the end of each section with the short code blocks that look like this.

Creating the Default Framework App

The website for Observable Framework has an excellent Getting started guide. Here we’ll just steal from step 1 of that. You’ll need a fairly recent version (version 18 or above at the time of writing) of Node.js installed.

At the command line, in the parent directory for your future project, run

npx @observablehq/framework@latest create

and simply accept all the default values by pressing Enter.

To get a live-updating preview of the site run

npm run dev

This will launch your default browser and you’ll now have something that looks like the image below.


git switch --detach start

Generating the Data File

From following the above we end up with a bunch of stuff we don’t actually need for our app. But some of it is useful for pointing us in the right direction; we’ll clear the rest out later.

In the “src/data” directory there’s a file with the slightly odd-looking “.csv.js” extension. The .js extension tells Observable Framework that the content of the file is JavaScript. Observable then knows to execute the file using the node CLI. The .csv extension is used for the generated file name, i.e. Observable Framework sees launches.csv.js and passes it to node; the output from the script is then saved to a file called launches.csv.

But it’s not just the .js extension Framework knows what to do with. It also understands that .py is Python, .rs is Rust and .go is Go. And, most importantly for us, it knows to use Rscript when a file has the extension .R. To work, these scripts (called data loaders in the Framework documentation) need to write to standard output. In an R script we can do that explicitly with the print function.
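To make the “write to standard output” idea concrete, here is a minimal sketch of a JavaScript data loader. The file name (src/data/example.csv.js), the toCsv helper and the rows are all made up for illustration; the only thing taken from the text above is that whatever the script writes to standard output ends up in the generated file.

```javascript
// Hypothetical data loader, e.g. saved as src/data/example.csv.js.
// The rows below are made-up illustration data.
const rows = [
  { title: "Film A", reviews: 120 },
  { title: "Film B", reviews: 45 },
];

// Build a CSV string: a header line followed by one line per row.
// Note: this naive version does not quote values that contain commas.
function toCsv(records) {
  const header = Object.keys(records[0]).join(",");
  const lines = records.map((r) => Object.values(r).join(","));
  return [header, ...lines].join("\n") + "\n";
}

// Whatever is written to standard output becomes the generated file
process.stdout.write(toCsv(rows));
```

Our R script further down follows the same pattern, with print playing the role of process.stdout.write.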

All we need now is our own data. Helpfully the data and code for the Shiny app is MIT-licensed and on GitHub.

The top of the server.R file looks like this:

Top of the Shiny app's server.R file

library(ggvis)
library(dplyr)
if (FALSE) {
 library(RSQLite)
 library(dbplyr)
}

# Set up handles to database tables on app start
db <- src_sqlite("movies.db")
omdb <- tbl(db, "omdb")
tomatoes <- tbl(db, "tomatoes")

# Join tables, filtering out those with <10 reviews, and select specified columns
all_movies <- inner_join(omdb, tomatoes, by = "ID") %>%
 filter(Reviews >= 10) %>%
 select(ID, imdbID, Title, Year, Rating_m = Rating.x, Runtime, Genre, Released,
 Director, Writer, imdbRating, imdbVotes, Language, Country, Oscars,
 Rating = Rating.y, Meter, Reviews, Fresh, Rotten, userMeter, userRating, userReviews,
 BoxOffice, Production, Cast)

We can use this as a starting point but:

We don’t need the {ggvis} library;
We have to reference our own copy of the movies.db SQLite database;
The src_sqlite function is deprecated;
It turns out there’s more data in the all_movies object than we actually need.

The following code, which I put in a file called movies.json.R and placed in the “src/data” directory alongside the movies.db database, deals with all these issues:

src/data/movies.json.R

library(dplyr)
library(RSQLite)
library(dbplyr)

# Hack to find the database path
script_directory = gsub("--file=", "", commandArgs()[4])
db_path = file.path(dirname(script_directory), "movies.db")

# Updated code to no longer use deprecated function
conn = dbConnect(RSQLite::SQLite(), db_path)
omdb = tbl(conn, "omdb")
tomatoes = tbl(conn, "tomatoes")

# Removed films without a BoxOffice value
# Select only the variables we actually use
all_movies = inner_join(omdb, tomatoes, by = "ID") %>%
 filter(Reviews >= 10 & !is.na(BoxOffice)) %>%
 select(Title, Runtime, Genre, Released, Director, Oscars,
 Rating = Rating.y, Meter, Reviews, BoxOffice, Cast)

# Convert data to a JSON string
json = all_movies %>%
 collect() %>%
 jsonlite::toJSON()

# Tidy up database connection
dbDisconnect(conn)

# Print data
print(json)

If you followed the server.R code from the original Shiny app then hopefully most of these changes make sense. The exception is probably the “Hack to find the database path”:

script_directory = gsub("--file=", "", commandArgs()[4])
db_path = file.path(dirname(script_directory), "movies.db")

I know I’ve put the database in the same directory as my R script but the script needs to know the path relative to where it’s executed from. This isn’t actually obvious at this point. But we can find the path from the execution location to the script using the commandArgs function. The rest then is just some ugly code to take the output of the commandArgs function, find the script relative to the execution location and then replace the script file name with the database file name that we know lives in the same directory.

Since writing this code, it’s been pointed out to me that a cleaner solution is to use here::here:

db_path = here::here("src", "data", "movies.db")

The downside to this is that the documentation for the function states that the “package is intended for interactive use only”, so use at your own risk.

We can test our script by installing dplyr, RSQLite and dbplyr as necessary and then running (from the root of the project):

Rscript src/data/movies.json.R > movies.json

This will create a JSON file, movies.json, with our data in the root of the project. You can delete this as it’s not needed.

We also no longer need the initial files in the “src/data” directory — events.json and launches.csv.js — and can delete them.

That’s everything we want to do in R covered. Now to actually build the movies app.


git switch --detach data

The Markdown File

For a simple app made of a single page the expectation is that the content of the app is placed inside a markdown file called index.md directly inside of the “src” directory of the project. This already exists in our generated project, alongside another couple of markdown files we can safely delete.

So now we write the “content” of our app in the index.md file, in place of the original content we generated in the “Getting Started” section. Since this is a markdown file, you may think it would end up containing a load of markdown syntax. It turns out that in our case the file mostly looks like blocks of JavaScript… because that’s what it is.

The page starts, however, with explicit HTML markup: <h1></h1>. That’s because Observable Framework automatically turns headings created using # markdown syntax into anchor points (i.e. links to that specific part of the page). This is useful for writing “Help” or other documentation, as you can easily link to specific parts of the page, but isn’t particularly useful here.

As already noted, most of the markdown file is “fenced” blocks of JavaScript using the syntax ```js…```. The critical thing here to understand is that these blocks are actually executed in the browser. They are not simply there for displaying code to the user. Framework is reactive by default and the bit I had (and still have, if we’re honest) to get my head around is that each fenced block forms a “cell”. The thing that made most sense to me was thinking of cells in Excel: you change the value in a cell and the values of other cells that depend on it automatically update regardless of where the cell is positioned in the two-dimensional grid of the spreadsheet. Still, with Framework, I’m not sure how much “stuff” should go in a single cell: What is the best practice here? Does it matter so long as the output is correct in terms of both value and position on the page? Is there any significant effect on performance? How do the answers to the previous questions change when we go from creating notebooks to creating dashboards?
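The spreadsheet analogy can be sketched in a few lines of plain JavaScript. This is a toy model only and is not how Framework’s runtime is implemented: createSheet, define and get are hypothetical names, and values are simply recomputed on demand. It does show the two properties that matter here: a cell’s value depends on other cells, and the order in which cells are defined is irrelevant.

```javascript
// A toy "sheet": each cell stores the names of its inputs and a function
// that computes its value from them. Illustration only.
function createSheet() {
  const cells = new Map();

  function define(name, inputs, compute) {
    cells.set(name, { inputs, compute });
  }

  // Compute a cell on demand from the current values of its inputs
  function get(name) {
    const { inputs, compute } = cells.get(name);
    return compute(...inputs.map(get));
  }

  return { define, get };
}

const sheet = createSheet();

// "total" is defined before the cells it depends on
sheet.define("total", ["price", "quantity"], (p, q) => p * q);
sheet.define("price", [], () => 2.5);
sheet.define("quantity", [], () => 4);

console.log(sheet.get("total")); // 10

// Changing an input changes the value of the dependent cell
sheet.define("quantity", [], () => 6);
console.log(sheet.get("total")); // 15
```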

My current thinking on this can be summarised roughly as “create blocks of stuff that looks like it goes together and seems to work, with some cells dealing with the UI and some cells responsible for the graphic”. So let’s cover each block/cell in turn.

Building the UI

The first cell covers the loading of the data and some basic processing of it:

// Load the data from the file we generated
const movies = await FileAttachment('./data/movies.json').json();

// Sort the data by number of oscars won. This ends up putting the
// multi-oscar-winning movies at the end of the data array so that
// they get drawn last in our scatter plot and thus appear on top
movies.sort((a, b) => a.Oscars - b.Oscars);

// Modify/extend our data objects for easier future use
movies.forEach(function(d) {
 // Add a Boolean stating whether or not the movie won any Oscars
 d.OscarWinner = d.Oscars > 0;
 // Convert the release date string to a JS Date object
 d.Released = new Date(d.Released);
 // Add a property that is just the four-digit year of release
 d.YearReleased = d.Released.getFullYear();
 // Add an array of Genres and remove any excess whitespace
 d.Genres = d.Genre?.split(',').map(s => s.trim()) || [];
 // Convert the Director string to lowercase for simpler searching
 d.Director = d.Director?.toLowerCase() || '';
 // Convert the Cast string to lowercase for simpler searching
 d.Cast = d.Cast?.toLowerCase() || '';
 // Turn the BoxOffice revenue figures into millions of dollars
 d.BoxOffice = (d.BoxOffice || 0) / 1e6;
});

// Create an array containing all the different genres found in the
// dataset and sort alphabetically
const genres = Array.from(
 movies.reduce(function(set, d) {
 d.Genres.forEach(g => set.add(g));
 return set;
 }, new Set())
)
.sort((a, b) => a.localeCompare(b));

// Extract a two-element array giving the earliest and latest
// release years of films in the dataset
const yearExtent = d3.extent(movies, d => d.YearReleased);

Most of this is “vanilla” JavaScript but there are a couple of functions that aren’t: FileAttachment and d3.extent. FileAttachment is a function created specifically for Observable notebooks that also works with Observable Framework. It simplifies the code required to load data files like JSON, CSV and XLSX. It doesn’t need to be explicitly imported into a Framework markdown file. The same is true of d3.extent (and all other methods of the d3 library). This method takes an input array of data and an “accessor function” that is applied to each element of the array. The return value is then a two-element array of the minimum and maximum values returned when the accessor function is applied to each element of the input array.
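If d3.extent is new to you, the following simplified stand-in shows roughly what it computes. It is a sketch for illustration, not d3’s actual implementation: the real function passes extra arguments to the accessor, and the film records here are made up.

```javascript
// Simplified stand-in for d3.extent (illustration only)
function extent(values, accessor = (d) => d) {
  let min;
  let max;
  for (const item of values) {
    const v = accessor(item);
    // Skip missing values, as d3.extent does
    if (v == null || Number.isNaN(v)) continue;
    if (min === undefined || v < min) min = v;
    if (max === undefined || v > max) max = v;
  }
  return [min, max];
}

// Made-up records with the same property name as our movie data
const films = [
  { Title: "Film A", YearReleased: 1994 },
  { Title: "Film B", YearReleased: 1972 },
  { Title: "Film C", YearReleased: 2008 },
];

const [earliest, latest] = extent(films, (d) => d.YearReleased);
console.log(earliest, latest); // 1972 2008
```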

The second JavaScript cell creates some, not especially interesting, utility functions and an object that are used later on in the construction of the controls and graphics. This is all vanilla JavaScript.

// Create function for defining a middle grey of varying opacity
const gy = 150;
const getGrey = opacity => `rgba(${gy},${gy},${gy},${opacity})`;

// Create function for converting a Boolean value to a text label
const getWonOscarText = bool => bool ? 'Won Oscar(s)' : 'Didn\'t Win an Oscar';

// Create an array of objects that can be used to map between data properties
// and their more human-friendly labels and vice-versa
const axisVariables = [
 {name: 'Tomatometer', prop: 'Meter'},
 {name: 'Numeric Rating', prop: 'Rating'},
 {name: 'Number of Reviews', prop: 'Reviews'},
 {name: 'Box-office revenue ($million)', prop: 'BoxOffice'},
 {name: 'Year', prop: 'Released'},
 {name: 'Length (minutes)', prop: 'Runtime'},
];

In the third block we finally start to build the user interface, adding all the controls for our sidebar. This is the point where we start utilising the power of Framework through the in-built view function and Inputs object.

In Observable, a view is a user interface element that directly controls a value in the notebook. A view consists of two parts:

The view, which is typically an interactive DOM element […].

The value, which is any JavaScript value.

For the Inputs methods the first argument typically represents the allowed values for the control and a second argument provides additional details using an object. The declaration order transfers to the order in which the corresponding UI elements appear in the HTML and thus the ordering, top to bottom, in the sidebar panel. We change the order here from the original Shiny example to something that seems a bit more logical. Specifically, the select menus for choosing the two axes are moved from the bottom of the controls to the top.

const xVariable = view(
 Inputs.select(axisVariables, {
 label: 'X-axis Variable',
 format: d => d.name,
 value: axisVariables.find((d) => d.prop === 'Meter')
 })
);

const yVariable = view(
 Inputs.select(axisVariables, {
 label: 'Y-axis Variable',
 format: d => d.name,
 value: axisVariables.find((d) => d.prop === 'Reviews')
 })
);

const reviewsMin = view(
 Inputs.range(
 [10, 300],
 { label: 'Minimum number of reviews on Rotten Tomatoes', step: 1, value: 80 }
 )
);

const yearMin = view(
 Inputs.number(
 yearExtent,
 { label: 'Earliest release year', step: 1, value: 1970 }
 )
);

const yearMax = view(
 Inputs.number(
 yearExtent,
 { label: 'Latest release year', step: 1, value: yearExtent[1] }
 )
);

const dollarsMin = view(
 Inputs.number(
 [0, 800],
 { label: 'Minimum box-office revenue ($million)', step: 10, value: 0 }
 )
);

const dollarsMax = view(
 Inputs.number(
 [0, 800],
 { label: 'Maximum box-office revenue ($million)', step: 10, value: 800 }
 )
);

const oscarsMin = view(
 Inputs.radio(
 [0, 1, 2, 3, 4],
 { label: 'Minimum number of Oscars won', value: 0 }
 )
);

const selectedGenre = view(
 Inputs.select(['All'].concat(genres), {
 label: 'Genre',
 value: 'All'
 })
);

const directorText = view(
 Inputs.text({
 label: 'Director name contains',
 value: ''
 })
);

const castText = view(
 Inputs.text({
 label: 'Cast contains',
 value: ''
 })
);

Inside a block in which a variable (or const) is declared using view, that variable will be an object representing that view. In other code blocks, however, that variable name can be used to directly retrieve the value associated with that view: there’s no requirement to de-reference the object. This can help make code look a lot nicer but can also be confusing. For instance, a view can be declared as a const (it is an object whose properties are still mutable) but the value of the variable with the same name changes in other blocks.

You may also notice that minimum and maximum values are set with separate controls. This is because, despite the name, browser-native range inputs only support a single handle.

Our page now has a title and our controls, plus a footer we’ll get rid of later.


git switch --detach ui

Building the Graphic

We then add a block to process our data based on the values of our inputs:

const data = movies.filter(function(d) {
 return (
 d.Reviews >= reviewsMin
 && d.Oscars >= oscarsMin
 && d.YearReleased >= yearMin && d.YearReleased <= yearMax
 && (selectedGenre === 'All' || d.Genres.includes(selectedGenre))
 && d.Director.includes(directorText.toLowerCase())
 && d.Cast.includes(castText.toLowerCase())
 && d.BoxOffice >= dollarsMin && d.BoxOffice <= dollarsMax
 );
});

const xLabel = xVariable.name;
const yLabel = yVariable.name;

Finally, we can add the code to render our scatter chart using Observable Plot:

Plot.plot({
 width: 500,
 height: 500,
 color: {
 type: 'categorical',
 range: [getGrey(1), 'orange'],
 domain: [getWonOscarText(false), getWonOscarText(true)], // Required for when filtering on oscar wins
 legend: true,
 },
 grid: true,
 marks: [
 Plot.axisX({ labelAnchor: 'center', labelArrow: 'none', label: xLabel }),
 Plot.axisY({ labelAnchor: 'center', labelArrow: 'none', label: yLabel }),
 Plot.dot(
 data,
 {
 x: xVariable.prop,
 y: yVariable.prop,
 stroke: d => getWonOscarText(d.OscarWinner),
 fill: getGrey(0.4),
 r: 4,
 channels: {
 filmTitle: { value: 'Title', label: '' },
 year: { value: 'YearReleased', label: '' },
 revenue: { value: 'BoxOffice', label: '' },
 },
 tip: {
 format: {
 filmTitle: true,
 year: d => `Year of release: ${d}`,
 revenue: d => `Revenue: $${d.toFixed(d < 10 ? 1: 0)} million`,
 x: false, y: false, stroke: false
 }
 }
 }
 ),
 ]
})

We finish with a line of markdown that includes inline JavaScript using ${} syntax:

Number of movies selected: ${ d3.format(',')(data.length) }

This line simply prints the number of movies plotted at any given time.

Our app is now fully interactive but everything is arranged down a single column, regardless of screen width!

Next Up

We’ve now got a functioning app but the layout isn’t great and we haven’t yet deployed it anywhere useful. We’ll cover both of these things in Part 2.

For updates and revisions to this article, see the original post

Creating an animated Christmas tree in R

Tue, 24 Dec 2024 23:59:00 +0000

With Christmas tomorrow we have decided to create an animated Christmas Tree using {ggplot2}, {sf} and {gganimate}.

First we need a tree. To do this we have used an {sf} polygon, passing the coordinates of the Christmas tree to st_polygon as a list containing a matrix. We can then use geom_sf to add this layer onto a ggplot object.

library(ggplot2)
library(gganimate)
library(sf)

tree_coords =
 list(
 matrix(
 c(-4, 0,
 -2.22, 2,
 -3.5, 2,
 -1.5, 4,
 -2.5, 4,
 -0.8, 6,
 -1.5, 6,
 0, 8,
 1.5, 6,
 0.8, 6,
 2.5, 4,
 1.5, 4,
 3.5, 2,
 2.22, 2,
 4, 0,
 -4, 0),
 ncol = 2, byrow = TRUE
 )
 )

tree = st_polygon(tree_coords)

gg_tree = ggplot() +
 geom_sf(aes(), data=tree)

gg_tree

Okay, so now we have a tree shape. Now we need to make it a little more Christmassy by:

Changing the colour: fill = "forestgreen", color = "darkgreen"
Adding the trunk: geom_rect(aes(xmin = -0.75, xmax = 0.75, ymin = -2, ymax = 0), fill = "saddlebrown", color = "sienna4")
Adding a star on the top: geom_point(aes(x = 0, y = 8), color = "gold", shape = 8, size = 7, stroke = 3)
Removing the axes: theme_void()
Setting the plot limits: coord_sf(xlim = c(-6, 6), ylim = c(-4, 10))
Adding a Christmas message: annotate("text", x = 0, y = 9.5, label = "Merry Christmas \n From Jumping Rivers!", size = 6)

Now our tree looks like this:

gg_tree = ggplot() +
 geom_sf(aes(), data=tree, fill = "forestgreen", color = "darkgreen") +
 geom_rect(aes(xmin = -0.75, xmax = 0.75, ymin = -2, ymax = 0), fill = "saddlebrown", color = "sienna4") +
 geom_point(aes(x = 0, y = 8), color = "gold", shape = 8, size = 7, stroke = 3) +
 theme_void() +
 coord_sf(xlim = c(-6, 6), ylim = c(-4, 10)) +
 annotate("text", x = 0, y = 9.5, label = "Merry Christmas \n From Jumping Rivers!", size = 6)

gg_tree

Next we need to use {sf} again to make some lights for the tree then {gganimate} to make the lights flash.

Placing the points within the boundaries of the tree was trickier than we expected, until we stumbled upon st_sample. We can pass it a polygon and it will create some sample points within the boundaries. We also create a vector to colour the points.

points = st_sample(tree, 75)
colours = sample(c("red", "yellow", "blue"), 75, replace = TRUE)

gg_tree = ggplot() +
 geom_sf(aes(), data=tree, fill = "forestgreen", color = "darkgreen") +
 geom_sf(aes(), data=points, color = colours) +
 geom_rect(aes(xmin = -0.75, xmax = 0.75, ymin = -2, ymax = 0), fill = "saddlebrown", color = "sienna4") +
 geom_point(aes(x = 0, y = 8), color = "gold", shape = 8, size = 7, stroke = 3) +
 theme_void() +
 coord_sf(xlim = c(-6, 6), ylim = c(-4, 10)) +
 annotate("text", x = 0, y = 9.5, label = "Merry Christmas \n From Jumping Rivers!", size = 6)

gg_tree

We can now animate it to make the lights sparkle using transition_time and ease_aes:

gg_tree +
 transition_time(1:75) +
 ease_aes('linear')

Lastly, have a great Christmas and New Year from the Jumping Rivers team!


Why would I use R for music?

Thu, 19 Dec 2024 23:59:00 +0000

With Christmas around the corner, and in the spirit of spreading some joy out into the world, I decided not to write about shiny, or data workflows, or developments in base R for a change. Rather, this post is about something that brings me joy: music.

Not that R doesn’t bring me joy. Hey, I’ve ‘done data’ in other languages and in the point-and-click world. Solving the data problem with R brings a very different kind of joy….

As with most of my blogs, this one started with a daft project. I wanted to make an app that printed out musical notation, with randomly-sampled notes, that I could use as improvisation prompts when playing piano at my local experimental music open mic. A problem we’ve all faced.

This felt like something I could build in shiny, though it proved a little more difficult than I expected. Solving the problem completely might need a second blog post, an htmlwidget, and a bit of JavaScript knowledge.

Here, we’ll talk about music in R, what packages are available, how to represent musical notation, and what people are actually doing with music data in R. We’ll maybe round off with a public domain Christmas carol or two, for good measure.

Computer World: Musical scores, sequencers and beeping chips

Home computers have been making music since the 1970s. At a time when dedicated sound chips belonged to a distant future, electromagnetic interference from bit switches in an Altair was hacked to play “Fool on the Hill” through a neighbouring radio (outlined in “Bits and Pieces”, KB McAlpine, p154). Even earlier than this, people had made music on research computers at universities (“Bits and Pieces”, p12). The development, hand-in-hand, of electronic music, computer sound chips, music software and video games is a fascinating story. But that’ll have to wait for another day.

Fundamental to those developments was a simple question: how do you represent a piece of music inside a computer? Converting this representation into sound is a separate issue, because there are things you can do with music beyond listening to it. You can compare different aspects of a collection of songs (keys, harmonies, lyrics etc), you can (attempt to) get a computer to compose new music, or you can rearrange a given piece or print out sheet music for musicians to play from. For example, Kris Shaffer has analysed chords in 100 rock songs using R, and Bruna Wundervald and Julio Trecenti have presented an analysis of chords, lyrics and Spotify data using packages from the r-music organization.

Nowadays, most of the music stored on your computer will be stored as recordings, such as mp3s. This wasn’t originally the case: early games encoded music directly using note pitches and durations - much like you find in sheet music. A modern view of this representation is provided by the Humdrum format. The following contains the chorus melody for “Jingle Bells”.

**kern
=1
4e
4e
2e
=2
4e
4e
2e
=3
4e
4g
4c
4d
=4
1e
=5
*-

We can view that melody in an online tool, and we get traditional music notation back out:

The pairs “4g”, “2e”, “1e”, and so on, represent the duration (4, 2, 1 in increasing length; 4 being a crotchet or ‘quarter-note’) and pitch (e, g). The “=1” lines separate bars, and the "**kern" and "*-" delimit the whole sequence. To represent multiple notes playing at the same time, you can use additional vertical tracks (spines) to represent the additional notes. The syntax can get pretty complicated but so does sheet music….
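To make that token structure concrete, here is a minimal base R sketch that splits simple kern tokens into a duration and a pitch. Note that parse_kern() is a hypothetical helper written for illustration, not part of any Humdrum tooling, and it only copes with the plain tokens used above (no dots, accidentals or octave changes).

```r
# The note tokens from the "Jingle Bells" example, with the bar lines
# ("=1", "=2", ...) and the "**kern" / "*-" delimiters left out
tokens = c("4e", "4e", "2e", "4e", "4e", "2e", "4e", "4g", "4c", "4d", "1e")

# Hypothetical helper: leading digits are the duration, the rest is the pitch
parse_kern = function(x) {
  data.frame(
    duration = as.integer(sub("^([0-9]+).*$", "\\1", x)),
    pitch = sub("^[0-9]+", "", x)
  )
}

kern_notes = parse_kern(tokens)

# Convert durations to crotchet beats: "4" is one beat, "2" is two, "1" is four
kern_notes$beats = 4 / kern_notes$duration

# Total number of beats across the four bars
sum(kern_notes$beats)
```

Holding the melody as a data frame like this is what makes the "things you can do with music beyond listening to it" possible: counting pitches, comparing durations, and so on.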

Solid State “S”urvivor: Sounds in R - {beepr}, {audio}, {tuneR}

When it evolved from S in 1993, creating music might not have been on the horizon for R.

R wasn’t really on my horizon at the time either. I was at school, and spent quite a bit of spare time writing music in OctaMED on a Commodore Amiga - again, involving multiple vertical tracks of pitches and durations.

Can R even make a sound? Aside from the groans that Error in mean[1:3]: object of type 'closure' is not subsettable can evoke?

Yes it can. There are a few packages available for producing sound in R. My favourite is {beepr}. If you’ve got a long-running script burning away on your computer, what better way to celebrate its completion than with a fanfare, or with the Super Mario Bros “Level Complete” tune:

source("my-beautiful-script.R")
beepr::beep(sound = "mario")

Doodly-doodly-doo!

You could similarly have a cymbal crash when you’ve finally loaded that big dataset if you install {drumr}:

cars = {Sys.sleep(5); mtcars}
drumr::beat("crash")

We aren’t going to go any further into emitting sounds or analysing music from R here. But there are a few packages like {audio} and {tuneR} that can be used for this purpose.

Replicas: Representing music and making sheet music in R

{tabr} is a CRAN package providing the ability to handle musical scores as data. It also provides the ability to render sheet music from this data, by integrating with a system dependency ‘LilyPond’. Once you have installed both LilyPond and {tabr}, you can construct sheet music from R. The syntax for encoding melodies in {tabr} is similar to, but not the same as, that used in Humdrum, above.

library("tabr")
melody = as_music("e4 e e2 e4 e e2 e4 g c d e1")
plot_music(melody)

So again, we encode notes with both pitch and duration, though now the duration comes after the pitch (‘e4’ is a crotchet E). We don’t need to specify the duration of a note if it is the same as that of the preceding note. {tabr} has added a time-signature and tempo using some default values. This particular tempo might not help Santa get his sleigh off the ground though - that’s about half the speed at which Bing Crosby recorded it. The notes are written out in a lower octave than in the Humdrum example, too.

We can fix all that though. While we’re at it let’s make that final run a bit sassier:

melody <- as_music(
 "e'4 e' e'2 e'4 e' e'2 e'4 g' c'~ c'8 d'8 e'1",
 tempo = "2 = 120"
)
plot_music(melody)

You can find out the syntax used in the music strings using the tabrSyntax data-frame.

tabrSyntax

##                 description    syntax    example
## 1                note/pitch a b ... g          a
## 2                     sharp         #         a#
## 3                      flat         _         a_
## 4  drop or raise one octave    , or '    a, a a'
## 5             octave number   0 1 ...   a2 a3 a4
## 6                tied notes         ~       a~ a
## 7             note duration       2^n 1 2 4 8 16
## 8               dotted note         .     2. 2..
## 9                     slide         -         2-
## 10                     bend         ^         2^
## 11          muted/dead note         x         2x
## 12     slur/hammer/pull off        ()      2( 2)
## 13                     rest         r          r
## 14              silent rest         s          s
## 15       expansion operator         * ceg*8, 1*4
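One rule mentioned earlier, that a note without a duration takes the duration of the note before it, can be mimicked in a few lines of base R. This is only a sketch of the rule as described in this post, applied to the first “Jingle Bells” string; it is not how {tabr} parses music strings internally, and it ignores octave marks, rests and the rest of the syntax in the table.

```r
# The first "Jingle Bells" string from above, split into one token per note
tokens = strsplit("e4 e e2 e4 e e2 e4 g c d e1", " ")[[1]]

# Separate each token into its pitch letter and (possibly missing) duration
pitch = sub("[0-9]+$", "", tokens)
duration = sub("^[a-g]", "", tokens)

# A note without a duration inherits the one from the note before it
for (i in seq_along(duration)) {
  if (duration[i] == "" && i > 1) {
    duration[i] = duration[i - 1]
  }
}

# The same melody with every duration written out in full
explicit = paste0(pitch, duration, collapse = " ")
explicit
```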

For guitarists, there’s also the ability to plot out guitar tab (hence the name; strangely the notes have been transposed by an octave):

plot_music_guitar(melody, header=list(title = "Jingle Bells"))

It should be noted that {tabr} is not as flexible as LilyPond when it comes to creating musical scores, and indeed, the author recommends that “If you are only creating sheet music on a case by case basis, write your own LilyPond files manually”. The truth is, I got a lot of errors while experimenting with {tabr}, but it was still a fun experiment.

Blue Lines: Adding scores to an app

I originally wanted to randomly generate musical phrases that I could interpret myself. And {tabr} looked like a good fit for just printing out notes to an app.

We sample from two octaves of the ‘white notes’ of the C major scale:

# C major notes from G below middle-C
notes <- c("g", letters[1:7], letters[1:7]) |>
 paste(
 c(rep("", 3), rep("'", 7), rep("''", 5)),
 sep = ""
 )

To get a valid musical string, we can do the following:

sample_notes = function(x, n) {
 sample(x, size = n, replace = TRUE) |> paste("4", sep = "")
}

rand_melody = sample_notes(notes, 8)
rand_melody

## [1] "b4" "c''4" "f'4" "g''4" "e'4" "b4" "a'4" "b4"

rand_melody |> as_music() |> plot_music()

As a way of sampling melodies this is as simple as it gets. And it works in an app quite nicely too:

library("shiny")
library("tabr")

ui = fluidPage(
 plotOutput("music")
)

server = function(input, output, session) {
 melody = reactive({
 invalidateLater(10000) # sample a new melody every 10s
 sample_notes(notes, 8)
 })
 output$music = renderPlot({
 melody() |> as_music() |> plot_music()
 })
}

# Launch the app (assumes notes and sample_notes() from above are defined)
shinyApp(ui, server)

There were a couple of issues with the app (and I made it a bit more complicated before I realised this).

The main issue was that, if I deployed to shinyapps.io, LilyPond wasn’t available - so to use the app for real, I would have had to take a laptop, rather than just my phone, to the open-mic with me - and I’m a rather heavy-handed pianist so something expensive could well have broken….

The other issue was that rendering the music was a little slow and updating the score was glitchy - a png is created on the server side and transferred to the browser every 10 seconds. There are JavaScript libraries that can render musical scores, for example, the Humdrum library has a JavaScript plugin. Using such a library would mean that our shiny app could transfer some Humdrum notation to the browser, which might speed up rendering. The website for the Humdrum plugin includes an example of how to use it in a Shiny app - however, extending these examples to dynamically update after a new melody was sampled didn’t work for me. So, my next project is to work out how to write an htmlwidget package for Humdrum….

Endtroducing: Why didn’t the app deploy?

When you deploy an app to shinyapps.io, any packages it depends upon are installed on the shinyapps.io server. This would typically include {shiny}, {bslib} and a few other app-related things, but could include packages for any number of other things: numerics, data processing, visualisation. Many of these packages will depend on system libraries - the {quarto} package requires the Quarto command-line tool to be installed on a machine, for example. These system dependencies are encoded in the SystemRequirements section of the R package DESCRIPTION file, the same content you see on CRAN when looking at a single package. For {quarto}, for example, the SystemRequirements state “Quarto command line tool (https://github.com/quarto-dev/quarto-cli).”
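If you want to look at this field yourself, base R can read it. The sketch below parses a small, made-up DESCRIPTION file with read.dcf(); only the SystemRequirements text is taken from the {quarto} example above, while the package name and version are placeholders. It then reads the same field from an installed package with utils::packageDescription().

```r
# An illustrative DESCRIPTION file: the package name and version are
# placeholders, only the SystemRequirements text comes from {quarto}
description_lines = c(
  "Package: examplepkg",
  "Version: 0.0.1",
  "SystemRequirements: Quarto command line tool (https://github.com/quarto-dev/quarto-cli)."
)

# read.dcf() parses the "Field: value" format used by DESCRIPTION files
con = textConnection(description_lines)
desc = read.dcf(con)
close(con)

desc[1, "SystemRequirements"]

# For an installed package the field can be read directly. {stats} ships
# with R and declares no system requirements, so the result here is NA
stats_sysreqs = utils::packageDescription("stats", fields = "SystemRequirements")
stats_sysreqs
```

Swapping "stats" for the name of a package installed on your machine shows what that package declares.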

Now, the SystemRequirements is a freely-structured text field. As a package author you can write whatever you want in there, and it is up to the users of your package to ensure that their system has the SystemRequirements available. This makes sense because on different operating systems, the system libraries have different names. But it’s a little problematic when attempting to deploy to a server - if you need an R package that has a system-requirement that isn’t already available on that server, and you can’t log in to the server to install system libraries, how do you ensure it gets installed?

The {pak} package helps here. This provides an enhanced way to install R packages. When {pak} installs packages, it uses the free-text SystemRequirements field to determine the OS-specific system libraries that an R package needs. It does this by making use of rules specified in the r-hub/r-system-requirements repository. This is outlined in a blog post by Hugo Gruson.

Ultimately, what happened is that while {pak} was installing the R packages for my {tabr}-dependent app to shinyapps.io, it saw that there was a dependency on LilyPond but, because there is no LilyPond rule at r-hub/r-system-requirements, it couldn’t work out what libraries or system tools it needed to install. So {tabr} installed, but the ‘lilypond’ library that it depends upon didn’t.


Diffify & Posit Package Manager

Thu, 12 Dec 2024 23:59:00 +0000

The latest release of Posit Package Manager introduces several enhancements, including:

Python Git Builders: Build Python packages (wheels) directly from Git.
Blocklists: Easily block specific packages or versions.
Improved Documentation: Clearer and more accessible information.

All great stuff, I’m sure, but most of these don’t directly impact the end user. The exception is the ability to add custom metadata to a package page.

What is Package Metadata?

Custom metadata lets administrators define key-value pairs for packages. For instance, you could tag packages as part of the tidyverse with is_tidyverse: TRUE|FALSE. Other use cases include:

Assigning scores to packages.
Linking additional documentation.

Metadata can apply globally (e.g., all versions of {dplyr}) or to specific versions in a repository.

How to Add Metadata

Metadata is added via the API. Start by creating a token:

# Care should be taken over expires and scope
rspm create token --scope=global:admin --expires=never --description="Allows global admin access"

## Generated an access token. Be sure to record this token immediately since you will not be able to retrieve it later.
# eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

This will generate an access token (e.g., eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...). As with most tokens, it’s not retrievable, so put it somewhere secure.

Test the token on the API page, ensuring you prefix it with Bearer when authorizing.

For example, a /verify-auth GET request with a valid token should return a 200 response, confirming successful authorization.

Linking with diffify.com

The diffify.com website has a predictable URL structure: diffify.com/language/package-name/old-version/new-version, where

language: either r or python
package-name: name of the R or Python package
versions: Optional, specify one or both.
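Because the structure is so regular, the links are easy to generate. Here is a minimal sketch of a helper that assembles one; diffify_url() and its argument names are hypothetical, invented for this example and not part of any package.

```r
# Hypothetical helper that assembles a diffify.com URL from its parts
diffify_url = function(package, language = c("r", "python"),
                       old_version = NULL, new_version = NULL) {
  language = match.arg(language)
  # c() silently drops NULL values, so missing versions are left out
  parts = c("https://diffify.com", language, package, old_version, new_version)
  paste(parts, collapse = "/")
}

# A link to the package as a whole
diffify_url("datasauRus")

# A link that pins one version
diffify_url("datasauRus", old_version = "0.1.2")
```

A function like this could be applied over a vector of package names to prepare the values for a bulk metadata upload.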

diffify.com & Posit Package Manager

Adding the diffify links to PPM is performed using the

POST /metadata/records/

and/or

POST /metadata/records/bulk

API calls. Depending on how precise you want to be you can either add a global meta tag, e.g.

Diffify: https://diffify.com/r/datasauRus

which would work for all versions of the package. This is less work for the admin, but the user has to perform an extra click.

Or specify the version number in the URL,

Diffify: https://diffify.com/r/datasauRus/0.1.2

which is more work for the admin, but nicer for the user. The end result after this hard work is a nice link near the top of the page.

To learn more about diffify.com, check out our other blog posts!


Positron vs RStudio - is it time to switch?

Thu, 05 Dec 2024 23:59:00 +0000

Positron is the new beta Data Science IDE from Posit. Though Posit have stressed that maintenance and development of RStudio will continue, I want to use this blog to explore if Positron is worth the switch. I’m coming at this from the R development side but there will of course be some nuances from other languages in use within Positron that require some thought.

And I hope to put out another version of this for Python!

A “polyglot” IDE

Whilst RStudio is an IDE aimed at Data Science using R, Posit say that Positron is an IDE aimed at “Data Science” using any programming language i.e. a “polyglot” IDE. At the moment, it’s just R and Python but with the possibility to extend. Its current target audience is those Data Scientists who think RStudio is too niche yet VS Code is too general.

Everything inside the RStudio window, for all its beauty, is run using one R process. This is why when R crashes, RStudio does too. However, Positron is built using the same base as VS Code (a fork of Code OSS) which enables Positron to run R (and Python) through communication with a kernel. Sparing you the gory details, for us programmers it means we have the incredible ability to be able to switch between not only versions of R, but other languages too. All through just two clicks of a button!

Settings and the command palette

Like RStudio, there is a command palette to manage settings and initiate operations. Though I confess, I didn’t actually know this about RStudio until I wrote this blog. That’s also the key difference. In Positron, the command palette is the primary way to manage settings, and there’s a very clear prompt at the top of the screen. In RStudio it feels more like a hidden feature.

Also, by default Positron does not save your .RData to your workspace, nor does it ask you! You can change this if you want.

Workspaces / R projects

R projects are no longer the main way of grouping files. Instead, Positron uses workspaces. A workspace is analogous to any folder on your device. By default the working directory is set to whichever folder you have open. I’ve found this useful, as it means I don’t need to create an .Rproj file to reap (most of) the benefits of project-based development. As you can see below, there are a LOT of hints that opening a folder is the best way to work in Positron.

If you still need an R project file, then Positron provides the ability to create these too (but it doesn’t really mean anything in Positron).

Layout

The biggest difference in layout is the addition of the sidebar to the left. This houses the (more advanced) file explorer, source control, search and replace, debug and extensions. We’ll talk about each one of these in turn throughout the blog.

The file explorer is a big plus for me. Firstly, it is just easier to work with and takes up less real estate. But it also directly integrates with the source control and the R interpreter. This means you have live feedback for the git status of your files and if your interpreter has detected any problems. Whilst this is nice, it does mean Positron will nearly always indicate there are problems with your code before any code has been run.

For the configuration of the panes etc, check out the layout options in the command palette. I’m using the “Side-by-Side Layout” and have dragged the “variables” and “plots” panes adjacent to the console.

Extensions

As Positron is made from the same stuff as VS Code, we now get VS Code extensions, but only from the OpenVSX marketplace. Still, there’s nearly everything you could ever want in there. Including themes, rainbow CSV, and Git integrations.

Using Git

I think this one will divide people. I very much enjoy the RStudio Git GUI - the simplicity of it is probably its best feature and definitely what I will miss the most. However, it was limited. Positron’s “source control” section gives you far more control over what you can do using Git without having to head to the terminal.

As well as Positron’s built-in Git support, there are extensions too. There’s a GitLab workflow extension for viewing merge requests, issues and more and about a million extensions for GitHub. I’m particularly enjoying the Git Graph extension, which allows me to view the branch graph in a separate tab. Please enjoy this ridiculous example of a git branch graph.

Data explorer

Posit have pushed this element of Positron a lot and, to be fair, it is an upgrade on the RStudio data explorer. There aren’t too many additional features compared to RStudio - it’s probably more of a win for Python users, who won’t be used to a data explorer. In my opinion, the welcome new additions are:

The column summary on the left-hand side, which makes for quicker browsing of data.
The UI design in general. For instance, having filters as tabs across the top instead of above their respective column makes so much sense.
Multi-column sorting (!!)
Much quicker loading of larger data sets into the explorer view.

Debugging and testing

The interface for R package testing has greatly improved, in that there now is one. You can view all tests from the “Testing” section of the sidebar whilst being able to jump to and run any tests from this section.

There is now a completely separate interface for debugging too, with separate sections for the environment state and call stack. Too many times have I mistaken my debug environment for my global in RStudio! During Posit conf, it was announced that within debug mode users can now jump to and from C code as well, though I haven’t tested this out yet.

R-package development

For a more comprehensive analysis of full R package development see this blog by Stephen Turner.

What’s not quite there?

For all the good there are a few things that just aren’t quite there yet:

So far there’s no support for RStudio addins.
Most of the functions that make calls to {rstudioapi} work (e.g. those in {testthat}), but there are some that don’t.
The big annoying one for me at the moment is that the console doesn’t retain code formatting and colour for the results and code once the code has been run. There is an issue about this and a fix is coming apparently.

Conclusion

Positron is still a beta product, but I’m going to be switching from RStudio for most of my programming. I would, however, say to anyone thinking of making the switch that it’s taken me a couple of weeks to get used to the layout, and I’m still not sure I have my settings nailed down. But that will come in time.


R Dev Day @ SIP 2024

Thu, 14 Nov 2024 23:59:00 +0000


This year Shiny in Production hosted an “R Dev Day” split over the two days before the pre-conference workshops. R Dev Days are a new initiative of the R Contribution Working Group, providing an opportunity for R developers to get involved in contributing to the R Project. R Dev Day will be back at SIP 2025, so read on to find out what participants got up to and consider coming along next year!

Translation

An R user’s local environment, or locale, sets their preferred human language. If translations are available, R will display messages, errors and warnings in that language. So one important way that the community contributes to R is to develop and maintain these translations.
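You can see the translations in action by asking R for a different language within a session. The sketch below is a minimal example, assuming your build of R has translation support and the French message catalogue installed; if not (or if your locale is "C"), the message simply stays in English.

```r
# Remember the current setting so that it can be restored afterwards
old_language = Sys.getenv("LANGUAGE", unset = NA)

# Ask for French messages; this only has an effect if R was built with
# translation support and the French catalogue is installed
Sys.setenv(LANGUAGE = "fr")

# Capture an error message from base R; it is in French when the
# translation is available and in English otherwise
msg = tryCatch(log("a"), error = function(e) conditionMessage(e))
msg

# Restore the previous setting
if (is.na(old_language)) {
  Sys.unsetenv("LANGUAGE")
} else {
  Sys.setenv(LANGUAGE = old_language)
}
```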

At the R Dev Day, Gabriela de Lima Marin learnt how to contribute translations via R’s Weblate, which provides a user-friendly browser interface for translation. In the first session, she worked in the conventional way, translating one string at a time. In the second session, she explored translating messages in bulk using machine translation. The second method was a little faster, but the automatic translations required careful review - sometimes they had the meaning completely wrong!

Overall, Gabriela translated over 200 messages at the R Dev Day! If you want to start contributing translations, you can find links to resources on issue 2 of the r-dev-day repository.

Translation dashboard

The R Contributor site hosts a translations dashboard to show the status of translations in the development version of R (R-devel) and on Weblate. Contributors can update translations on Weblate at any time, then these translations are collated around once a quarter to update R-devel, which will become the next major/minor release of R that is usually released in April. Mario Reiman, Md Mursalin Hossain Rabbi and Murad Khalilov reviewed the open issues on the translation dashboard GitHub repository and picked two to work on - #9: avoid using {stringr}, to reduce the number of dependencies required by the R scripts that are run using GitHub actions to update the data sources, and #38: switch from using the {flexdashboard} package to using Quarto to create the dashboard. Good progress was made on both fronts during the R Dev Day and work will continue to integrate these updates.

When Mario cloned the translations dashboard repository on Windows, he faced difficulties due to version-controlled files containing ? and & characters. Investigating further, we discovered these were supplementary files from the R Markdown rendering, that weren’t needed any more. This led to Heather Turner and Cam Race reviewing the GitHub actions that rendered the dashboard and adapting them to remove old files from the repository before rebuilding. They did a wider review of the GitHub actions and found several had stopped working, meaning the dashboards were not fully updating daily, when scheduled. Heather continued work on this on the train home from SIP 2024 and got them all working again by the end of the journey!

Bug in Cairo graphics with R

Bugs in base R are tracked on R’s Bugzilla. There are many ways that contributors can help with reported bugs: reviewing the reports to assess if the issue is a valid bug that has not yet been fixed in R-devel; creating a simple reproducible example (or reprex); debugging the R or underlying C code to analyse the root cause of the bug; discussing how to fix the bug, or proposing an update to the source code to fix a bug. For R Dev Days, a number of bug reports are selected where there is a clear next step for contributors to make.

At R Dev Day @ SIP 2024, Ella Kaye and George Stagg looked at Bug 16721, which is an issue affecting Cairo graphics in R (< 4.5.0). In an image plot that is expected to be a full block of colour, a white stripe would appear, as in the example below:

The Cairo device is implemented in the {grDevices} package, which is part of base R. Ella and George built R from source so they were able to debug both the R and C code that gets called in the reprex above. They had to troubleshoot some issues that cropped up when building R on Ella’s computer, including complications working with multiple versions of R. Sorting these issues took most of the first session, but Ella appreciated the opportunity to learn some best practices from George, as the more experienced developer. In the second session they were able to focus on the debugging. Following advice that had been given by R Core member Paul Murrell in advance of the R Dev Day, they tried print debugging, i.e. adding a print statement to the source code to print out key information, while plotting a thin rectangle with grid::grid.rect(). The hypothesis was that nothing would be drawn when the width was less than a pixel. They managed to create an example that plotted nothing in a Cairo graphics device, yet plotted a thin black rectangle in a Quartz device. They looked more closely at C code for the Quartz device and discovered it had a specific workaround with the comment:

in the case of borderless rectangles snap them to pixels.
this solves issues with image() without introducing other artifacts.

So they worked on updating the C code for the Cairo device, to use the same workaround as the Quartz device. Rebuilding R with this change fixed the issue in both the original reprex and their simpler grid.rect() example!

In a plot twist, they discovered the Xlib device had different behaviour again, showing no issue with the original reprex, but failing on the grid.rect() example. Digging into the code again, they found that Quartz rounded values to whole pixels, while Xlib truncated values. They shared their findings on Bugzilla at the end of the R Dev Day and have since had some feedback from Paul Murrell on the next steps to get a fix accepted into base R.
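The difference between rounding and truncating is easy to see with a little arithmetic. The following is purely illustrative, using made-up pixel coordinates; the real device code is written in C and is not reproduced here.

```r
# Left and right edges (in pixels) of two thin rectangles: the first is
# 0.6 pixels wide, the second 0.2 pixels wide (made-up coordinates)
rects = data.frame(left = c(10.2, 10.2), right = c(10.8, 10.4))

# Snap both edges to whole pixels, first by rounding, then by truncating
rects$width_rounded = round(rects$right) - round(rects$left)
rects$width_truncated = trunc(rects$right) - trunc(rects$left)

rects
```

With rounding, the wider of the two rectangles keeps a width of one pixel while the narrower one collapses to nothing. With truncation both collapse, which is consistent with different devices disagreeing about whether a very thin rectangle gets drawn.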

R Dev Container

As noted above, building R on your own computer can be a big timesink for contributors. An alternative (currently only recommended for contributions that don’t involve C code) is to use the R Dev Container: a development environment for R that can be launched in the browser using GitHub Codespaces or Gitpod. This has the pre-requisites for building R already installed and is isolated from the user’s computer, avoiding many of the issues of building R on your own machine. It comes with documentation and a few helpers, so you can launch the container, get a copy of the source code for R and build R in around half an hour.

Although it was designed to be used in the browser, some contributors prefer to use the container on their own machine, to avoid using up their internet data or their free time/space allowance on GitHub Codespaces or Gitpod. Unfortunately, the Dev Container is currently built with a specific operating system and for a specific architecture, so it does not work well across platforms.

At the R Dev Day, Seb Mellor looked into building the Docker container for arm64 architecture, so that it would work better on recent macOS computers. The steps for building a Docker container are specified in a Dockerfile. Previous work by others had found the existing Dockerfile would work on arm64 up until the step where it tried to install the r-base-dev package from the Ubuntu repository. Seb tested the container at this point and confirmed you could still build R in the container, but it was missing the pre-installed version of R that is usually there. If we could build the container on an arm64 machine, then we could build the r-base-dev package as part of the Docker build, but Seb noted arm64 machines are not available on GitHub Actions for non-Enterprise customers. So he investigated some alternatives, concluding that an arm64 dev container could be cross-compiled with additional research, or emulated with a very long build time.

When Seb reported back he said he found it odd that there wasn’t an arm64 build of the r-base-dev package, so Heather did some further investigation and found that we could get it from a Personal Package Archive (PPA) maintained by Michael Rutter, who compiles the packages for the official Ubuntu repository. This should solve a large part of the problem, so we have a strong lead going forward. This work is being tracked on issue #112 of the r-dev-env GitHub repository.

R Dev Guide

Even from this handful of tasks that were selected for R Dev Day @ SIP 2024, you can see there is much to learn about contributing to R. One of the first initiatives of the R Contribution Working Group was to create an R Development Guide (or “R Dev Guide” for short), to document some of the processes. Like the translations dashboard and the R Dev Container, this is a resource maintained by the contributor community.

At the R Dev Day, Cam Race worked on two issues related to the R Dev Guide. In both cases, some initial work had been done by others at a previous R Dev Day, so his task was to review their contribution and continue where they left off. The first issue was to add a new section on websites relevant to R contributors, particularly those under the r-project.org domain. The second issue was to improve the documentation on how to contribute to R’s documentation, including adding some examples of successfully closed bugs. Cam opened two pull requests to propose his changes (#186 and #188 respectively), along with another pull request to fix minor issues such as broken links.

Getting involved

As this post shows, there is a large range of activities to get involved in at an R Dev Day, suiting different levels of skills and experience. R Dev Day @ SIP 2025 will take place on the afternoon of Tuesday 8 October and the morning of Wednesday 9 October. We’d love for next year’s R Dev Day to be bigger and better - if you’re inspired to come along, the registration form is open already!

Meanwhile, for news of other R contributor events and links to resources to help you get started with contributing to base R at any time, head to the R Contributor Site: contributor.r-project.org.

For updates and revisions to this article, see the original post

Training Lineup for 2025: January-June

Thu, 07 Nov 2024 23:59:00 +0000

All of our public training courses for the first half of 2025 are now open for registration! Head over to the public courses page on our website to book in and start building your programming skills in the new year! Below is a list of all of our upcoming courses with a description, bookable dates, course level and a link to the course webpage to find out more!

There is still time to book yourself on to the final public courses of 2024. We are running Reporting with Quarto and Advanced Machine Learning with Tidymodels, both on 18th November.

R Stats and Programming

Introduction to R

Course level: Foundation

Upcoming course dates: 15th January 2025 & 22nd April 2025

R is a versatile language for statistical computing and graphics. In this course you will learn the advantages of using R and how to get started. You will gain familiarity with the RStudio interface and learn the R basics. Also included is an introduction to the Tidyverse and how to use various packages for data storage, visualisation and manipulation. This course provides a great foundation to begin your R journey!

Data Wrangling in the Tidyverse

Course level: Foundation

Upcoming course dates: 22nd January 2025 & 29th April 2025

If you work with data, you probably spend a lot of time cleaning it and wrangling it into the correct shape. This course will show you how you can use R to efficiently clean and wrangle your data into a format that’s ready for analysis. You will learn about the Tidyverse, what tidy data really is, and how to practically achieve it with packages such as {dplyr}, {tidyr}, {lubridate} and {forcats}.

Programming with R

Course level: Intermediate

Upcoming course dates: 29th January 2025 & 20th May 2025

The benefit of using a programming language such as R is that we can automate repetitive tasks. This course covers the fundamental techniques such as functions, for loops and conditional expressions. By the end of this course, you will understand what these techniques are and when to use them. This is a one-day intensive course on R.

R Best Practices

Course level: Intermediate

Upcoming course dates: 12th February 2025

So you can write code? Great. But can you write code which is easy to read, simple to maintain, and reproducible? Under the pressure of deadlines even the best of us can fall victim to bad practices. In this course we motivate the importance of good practices, and show how we can make best practices second nature by incorporating them into our normal workflow.

Data Visualisation with ggplot2

Course level: Intermediate

Upcoming course dates: 5th February 2025 & 10th June 2025

Want to learn how to effectively visualise your data in R using the elegant {ggplot2} package? With {ggplot2} it’s easy to customise everything from plot layouts and themes to scales, colours, and more! This course will comprehensively take you through basic plot types such as bar and line charts as well as cover more advanced topics such as interactive graphics with {plotly}.

Statistical Modelling with R

Course level: Intermediate

Upcoming course dates: 26th February 2025 & 3rd June 2025

From the very beginning, R was designed for statistical modelling. Out of the box, R makes standard statistical techniques easy. This course covers the fundamental modelling techniques. We begin the day by revising hypothesis tests, before moving onto ANOVA tables and regression analysis. The class ends by looking at more sophisticated methods such as clustering and principal components analysis (PCA).

Machine Learning and Bayesian Techniques

Machine Learning with Tidymodels

Course level: Intermediate

Upcoming course dates: 4th March 2025 & 17th June 2025

Machine learning is the process of applying statistical techniques to gain systematic information about a quantity of interest. We will be specifically focusing on how we can use the {tidymodels} suite of packages to implement these techniques. We cover key reasons for model fitting, such as prediction and inference, on quantitative and qualitative responses.

Advanced Machine Learning with Tidymodels

Course level: Advanced

Upcoming course dates: 18th March 2025 & 24th June 2025

This course builds on the material covered in our Machine Learning with Tidymodels course. We take a look at how we can fit linear discriminant analysis (LDA) models using {discrim}, assess model reliability using V-fold cross validation, pre-process data, fit tree-based models and more. If you wish to explore the abundance of model fitting techniques {tidymodels} has to offer, then this course is certainly for you!

Introduction to Bayesian Inference using RStan

Course level: Intermediate

Upcoming course dates: 13th January 2025

Despite the promise of big data, inferences are often limited by its systematic structure. Only by carefully modelling this structure can we take full advantage of the data. Stan is a platform for facilitating this modelling, providing an expressive modelling language to implement state-of-the-art algorithms, to draw subsequent Bayesian inferences. This course will teach participants how to interface with Stan through R!

Automatic Reporting

Reporting with Quarto

Course level: Intermediate

Upcoming course dates: 25th March 2025 & 24th June 2025

Do you create interactive documents that always need to be updated when the data changes? Then this course is for you. In this course you will learn how to use Quarto to create high quality, dynamic, fully reproducible documents. Quarto is a multi-language open source publishing tool that allows for the creation of dynamic content with Python, R, Julia and Observable.

Python

Introduction to Python

Course level: Foundation

Upcoming course dates: 26th February 2025 & 13th May 2025

Python is a general-purpose programming language popular among data scientists and statisticians. In this one-day introductory course, participants will learn to import, summarise and visualise their data. At each step, we avoid using “magic code”, and stress the importance of understanding what Python is doing.

Programming with Python

Course level: Intermediate

Upcoming course dates: 4th March 2025 & 3rd June 2025

The benefit of using a programming language such as Python is that we can automate repetitive tasks. This course covers the fundamental techniques such as functions, for loops and conditional expressions. By the end of this course, you will understand what these techniques are and how they can be applied to solve real-world data wrangling tasks.

Data Visualisation with Python

Course level: Intermediate

Upcoming course dates: 18th March 2025 & 17th June 2025

Python has a number of packages for the effective creation of graphics to communicate your data insights. This course will examine two popular libraries for creating static 2D plots: Matplotlib and Seaborn. During the training session, we’ll cover plotting basics and customisation of figures with Matplotlib, before moving onto complex statistical visualisations with Seaborn.

SQL

Introduction to SQL

Course level: Foundation

Upcoming course dates: 12th February 2025

The Structured Query Language (SQL) defines a standard for communicating with a relational database. In this half-day introductory course, participants will learn the basic SQL syntax for data extraction, filtering and insertion. We will then discuss some considerations for working with databases on the cloud, and finish by learning basic techniques for joining tables.

The course can be taken either independently or as a precursor to our Intro to SQL with R and Intro to SQL with Python courses (see below).

An Introduction to SQL with R

Course level: Intermediate

Upcoming course dates: 15th April 2025

Using databases is a fundamental part of a data scientist’s role. The main focus of this training course is to introduce SQL databases, write your first SQL queries, and show how R can be used to retrieve and manipulate data stored in a relational database. The course uses both the {DBI} and {dbplyr} packages.

We use the PostgreSQL database as an example for public courses. For in-house training, we are happy to adapt the course to match your database requirements.

Introduction to SQL with Python

Course level: Intermediate

Upcoming course dates: 15th April 2025

Using databases is a fundamental part of a data scientist’s role. This training course introduces SQL databases and the SQL command syntax, and shows how Python can be used to retrieve and manipulate data held in a relational database. The course also discusses how SQLAlchemy can be used to define and interact with databases using object-oriented Python code.

We use a PostgreSQL database as an example, and communicate with this using a psycopg2 connection.

So what now?

If you’re interested in attending any of our public courses, then you can head straight over to the public booking page! If you’re looking for training for your team, or maybe even something a bit more bespoke, then get in touch and we’ll see what we can do! All of our training courses (including courses not mentioned above) can be found in our course catalogue.


Vetiver: Monitoring Models in Production

Thu, 31 Oct 2024 23:59:00 +0000

This post is the third in our series of blogs on MLOps with vetiver:

Part 1: Vetiver: First steps in MLOps
Part 2: Vetiver: Model Deployment
Part 3: Vetiver: Monitoring Models in Production (this post)
Part 4: Vetiver: MLOps for Python

In Parts 1 and 2, we introduced the {vetiver} package and its use as a tool for streamlined MLOps. Using the {palmerpenguins} dataset as an example, we outlined the steps of training a model using {tidymodels} then converting this into a {vetiver} model. We then demonstrated the steps of versioning our trained model and deploying it into production.

Getting your first model into production is great! But it’s really only the beginning, as you will now have to carefully monitor it over time to ensure that it continues to perform as expected on the latest data. Thankfully, {vetiver} comes with a suite of functions for this exact purpose!

Preparing the data

A crucial step in the monitoring process is the introduction of a time component. We will be tracking key scoring metrics over time as new data is collected, therefore our analysis will now depend on a time dimension even if our deployed model has no explicit time dependence.

To demonstrate the monitoring steps, we will be working with the World Health Organisation Life Expectancy data which tracks the average life expectancy in various countries over a number of years. We start by loading the data:

# Download and unzip the data; mode = "wb" keeps the zip file intact on Windows
download.file(
  "https://www.kaggle.com/api/v1/datasets/download/kumarajarshi/life-expectancy-who",
  "archive.zip",
  mode = "wb"
)
unzip("archive.zip")
life_expectancy = readr::read_csv("./Life Expectancy Data.csv")

We will attempt to predict the life expectancy using the percentage expenditure, total expenditure, population, body-mass-index (BMI) and schooling. Let’s select the columns of interest, tidy up the variable names and drop any missing values:

life_expectancy = life_expectancy |>
  janitor::clean_names(
    case = "snake",
    abbreviations = c("BMI")
  ) |>
  dplyr::select(
    "year", "life_expectancy",
    "percentage_expenditure",
    "total_expenditure", "population",
    "bmi", "schooling"
  ) |>
  tidyr::drop_na()

life_expectancy
#> # A tibble: 2,111 × 7
#> year life_expectancy percentage_expenditure total_expenditure population
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2015 65 71.3 8.16 33736494
#> 2 2014 59.9 73.5 8.18 327582
#> 3 2013 59.9 73.2 8.13 31731688
#> 4 2012 59.5 78.2 8.52 3696958
#> 5 2011 59.2 7.10 7.87 2978599
#> 6 2010 58.8 79.7 9.2 2883167
#> 7 2009 58.6 56.8 9.42 284331
#> 8 2008 58.1 25.9 8.33 2729431
#> 9 2007 57.5 10.9 6.73 26616792
#> 10 2006 57.3 17.2 7.43 2589345
#> # ℹ 2,101 more rows
#> # ℹ 2 more variables: bmi <dbl>, schooling <dbl>

The data contains a numeric year column which will come in handy for monitoring the model performance over time. However, the {vetiver} monitoring functions will require this column to use <date> ("YYYY-MM-DD") formatting and it will have to be sorted in ascending order:

life_expectancy = life_expectancy |>
  dplyr::mutate(
    year = lubridate::ymd(year, truncated = 2L)
  ) |>
  dplyr::arrange(year)

life_expectancy
#> # A tibble: 2,111 × 7
#> year life_expectancy percentage_expenditure total_expenditure
#> <date> <dbl> <dbl> <dbl>
#> 1 2000-01-01 54.8 10.4 8.2 
#> 2 2000-01-01 72.6 91.7 6.26
#> 3 2000-01-01 71.3 154. 3.49
#> 4 2000-01-01 45.3 15.9 2.79
#> 5 2000-01-01 74.1 1349. 9.21
#> 6 2000-01-01 72 32.8 6.25
#> 7 2000-01-01 79.5 347. 8.8 
#> 8 2000-01-01 78.1 3557. 1.6 
#> 9 2000-01-01 66.6 35.1 4.67
#> 10 2000-01-01 65.3 3.70 2.33
#> # ℹ 2,101 more rows
#> # ℹ 3 more variables: population <dbl>, bmi <dbl>, schooling <dbl>
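
To see what this conversion does in isolation, here is a small base R sketch of the same idea, assuming a numeric vector of years. It swaps lubridate::ymd() for as.Date() purely for illustration, and the years vector is made up.

```r
# Made-up numeric years, in the same descending order as the raw data
years = c(2015, 2014, 2013)

# Append "-01-01" so that each year becomes the first day of that year
dates = as.Date(paste0(years, "-01-01"))

# Sort in ascending order, as required for monitoring
dates = sort(dates)

class(dates)
format(dates)
```

Each year is mapped to 1st January of that year, which is why every row in the output above shows a date of the form YYYY-01-01.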

Finally, let’s imagine the year is currently 2002, so our historical training data should only cover the years 2000 to 2002:

# Keep only the "historical" data, comparing against an explicit Date
historic_life_expectancy = life_expectancy |>
  dplyr::filter(year <= as.Date("2002-01-01"))

Later in this post we will check how our model performs on more recent data to illustrate the effects of model drift.

Training our model

Before we start training our model, we should split the data into “train” and “test” sets:

library("tidymodels")

data_split = rsample::initial_split(
  historic_life_expectancy,
  prop = 0.7
)
train_data = rsample::training(data_split)
test_data = rsample::testing(data_split)

The test set makes up 30% of the original data and will be used to score the model on unseen data following training.
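
Conceptually, the split above randomly assigns 70% of the rows to the training set. A minimal base R sketch of the same idea is shown below, assuming a toy data frame (toy_data is made up) and a fixed seed so that the split is reproducible. It is an illustration of the concept rather than a description of how {rsample} is implemented.

```r
set.seed(42)  # fix the random seed so the same rows are picked each time

toy_data = data.frame(id = 1:10, value = rnorm(10))

# Randomly pick 70% of the row indices for training
n_train = round(0.7 * nrow(toy_data))
train_idx = sample(seq_len(nrow(toy_data)), size = n_train)

toy_train = toy_data[train_idx, ]
toy_test = toy_data[-train_idx, ]

nrow(toy_train)
nrow(toy_test)
```

Setting a seed before a random split is worth considering in real projects too, since otherwise the scoring metrics will change slightly every time the code is rerun.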

The code cell below handles the steps of setting up a trained model in {vetiver} and versioning it using {pins}. For a more detailed explanation of what this code is doing, we refer the reader back to Part 1.

We will again use a basic K-nearest-neighbour model, although this time we have set up the workflow as a regression model since we are predicting a continuous quantity. Note that this requires the {kknn} package to be installed.

# Train the model with {tidymodels}
model = recipe(
  life_expectancy ~ percentage_expenditure +
    total_expenditure + population + bmi + schooling,
  data = train_data
) |>
  workflow(nearest_neighbor(mode = "regression")) |>
  fit(train_data)

# Convert to a {vetiver} model
v_model = vetiver::vetiver_model(
  model,
  model_name = "k-nn",
  description = "life-expectancy"
)

# Store the model using {pins}
model_board = pins::board_temp(versioned = TRUE)
vetiver::vetiver_pin_write(model_board, v_model)

Here the model {pins} board is created using pins::board_temp() which generates a temporary local folder.

At this point we should check how our model performs on the unseen test data. The mean absolute error (mae), root-mean-squared error (rmse) and R² (rsq) can be computed over a specified time period using vetiver::vetiver_compute_metrics():

metrics = augment(v_model, new_data = test_data) |>
  vetiver::vetiver_compute_metrics(
    date_var = year,
    period = "year",
    truth = life_expectancy,
    estimate = .pred
  )

metrics
#> # A tibble: 9 × 5
#> .index .n .metric .estimator .estimate
#> <date> <int> <chr> <chr> <dbl>
#> 1 2000-01-01 46 rmse standard 4.06 
#> 2 2000-01-01 46 rsq standard 0.836
#> 3 2000-01-01 46 mae standard 3.05 
#> 4 2001-01-01 44 rmse standard 4.61 
#> 5 2001-01-01 44 rsq standard 0.844
#> 6 2001-01-01 44 mae standard 3.43 
#> 7 2002-01-01 36 rmse standard 4.14 
#> 8 2002-01-01 36 rsq standard 0.853
#> 9 2002-01-01 36 mae standard 3.04

The first line of code here sends new data (in this case the unseen test data) to our model and generates a .pred column containing the model predictions. This output is then piped to vetiver::vetiver_compute_metrics() which includes the following arguments:

date_var: the name of the date column to use for monitoring the model performance over time.
period: the period ("hour", "day", "week", etc) over which the scoring metrics should be computed. We are restricted by our data to using "year"; for more granular data it may be more sensible to monitor the model over shorter timescales.
truth: the actual values of the target variable (in our example this is the life_expectancy column of the test data).
estimate: the predictions of the target variable to compare the actual values against (in our example this is the .pred column computed in the previous step).
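
To make these metrics concrete, the sketch below computes them by hand for a tiny made-up data frame, grouped by year. The data and the score_period() helper are hypothetical. R² is computed here as the squared correlation between the truth and the estimate, which we understand to be the definition used by {yardstick}; treat that as an assumption.

```r
# Made-up truth and prediction values for two yearly periods
scored = data.frame(
  year = as.Date(c("2000-01-01", "2000-01-01", "2000-01-01",
                   "2001-01-01", "2001-01-01", "2001-01-01")),
  truth = c(60, 70, 80, 62, 72, 82),
  pred = c(62, 69, 77, 66, 70, 76)
)

# Hypothetical helper: compute the three metrics for one period
score_period = function(df) {
  error = df$truth - df$pred
  data.frame(
    n = nrow(df),
    rmse = sqrt(mean(error^2)),     # root-mean-squared error
    mae = mean(abs(error)),         # mean absolute error
    rsq = cor(df$truth, df$pred)^2  # squared correlation
  )
}

# Apply the helper to each year and stack the results
by_year = lapply(split(scored, scored$year), score_period)
manual_metrics = do.call(rbind, by_year)
manual_metrics
```

Each row of manual_metrics corresponds to one period, mirroring the one-row-per-metric-per-period layout returned by vetiver::vetiver_compute_metrics().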

We will come back to these metrics later in this post, so for now let’s store them along with our model using {pins}:

pins::pin_write(model_board, metrics, "k-nn")

We will skip over the details of deploying our model since this is already covered in Part 2.

Monitoring our model

Over time we may notice our model start to drift, where its predictions gradually diverge from the truth as the data evolves. There are two common causes of this:

Data drift: the statistical distribution of an input variable changes.
Concept drift: the relationship between the target and an input variable changes.

Taking the example of life expectancy data:

A country’s expenditure is expected to vary over time due to changes in government policy and unexpected events like pandemics and economic crashes. This is data drift.
Advances in medicine may mean that life expectancy can improve even if BMI remains unchanged. This is concept drift.
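
The two kinds of drift can be illustrated with a small simulation. This is a sketch using made-up data and a deliberately simple linear model, not the life expectancy model from this post; all names are hypothetical.

```r
set.seed(1)

# Train a simple linear model on inputs between 0 and 1
x_train = runif(200, min = 0, max = 1)
train = data.frame(x = x_train, y = x_train^2 + rnorm(200, sd = 0.05))
fit = lm(y ~ x, data = train)

# Mean absolute error of the fitted model on a new data set
mean_abs_error = function(new_data) {
  mean(abs(new_data$y - predict(fit, newdata = new_data)))
}

# Baseline: new data drawn from the same process as the training data
x_same = runif(200, min = 0, max = 1)
baseline = data.frame(x = x_same, y = x_same^2 + rnorm(200, sd = 0.05))

# Data drift: the input distribution moves to between 2 and 3
x_shift = runif(200, min = 2, max = 3)
data_drift = data.frame(x = x_shift, y = x_shift^2 + rnorm(200, sd = 0.05))

# Concept drift: same inputs as the baseline, but the relationship
# between x and y has shifted upwards by 3
concept_drift = data.frame(
  x = x_same,
  y = x_same^2 + 3 + rnorm(200, sd = 0.05)
)

drift_mae = c(
  baseline = mean_abs_error(baseline),
  data_drift = mean_abs_error(data_drift),
  concept_drift = mean_abs_error(concept_drift)
)
drift_mae
```

In both drift scenarios the error is much larger than the baseline. With data drift the model is asked to extrapolate to inputs it never saw, and with concept drift the relationship it learned no longer holds.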

Going back to our model which was trained using data from 2000 to 2002, let’s now check how it would perform on “future” data up to 2010:

# Generate "new" data from 2003 to 2010
new_life_expectancy = life_expectancy |>
  dplyr::filter(
    year > as.Date("2002-01-01") &
      year <= as.Date("2010-01-01")
  )

# Score the model performance on the new data
new_metrics = augment(v_model, new_data = new_life_expectancy) |>
  vetiver::vetiver_compute_metrics(
    date_var = year,
    period = "year",
    truth = life_expectancy,
    estimate = .pred
  )

new_metrics
#> # A tibble: 24 × 5
#> .index .n .metric .estimator .estimate
#> <date> <int> <chr> <chr> <dbl>
#> 1 2003-01-01 141 rmse standard 5.21 
#> 2 2003-01-01 141 rsq standard 0.760
#> 3 2003-01-01 141 mae standard 3.64 
#> 4 2004-01-01 141 rmse standard 5.14 
#> 5 2004-01-01 141 rsq standard 0.761
#> 6 2004-01-01 141 mae standard 3.60 
#> 7 2005-01-01 141 rmse standard 5.83 
#> 8 2005-01-01 141 rsq standard 0.684
#> 9 2005-01-01 141 mae standard 4.19 
#> 10 2006-01-01 141 rmse standard 6.23 
#> # ℹ 14 more rows

Let’s now store the new metrics in the model {pins} board (along with the original metrics):

vetiver::vetiver_pin_metrics(
  model_board,
  new_metrics,
  "k-nn"
)

We can now load both the original and new metrics then visualise these with vetiver::vetiver_plot_metrics():

# Load the metrics
monitoring_metrics = pins::pin_read(model_board, "k-nn")

# Plot the metrics
vetiver::vetiver_plot_metrics(monitoring_metrics) +
  scale_size(name = "Number of\nobservations", range = c(2, 4)) +
  theme_minimal()

The size of the data points represents the number of observations used to compute the metrics at each period. Up to 2002 we are using the unseen test data to score our model; after this we are using the full available data set.

We observe an increasing model error over time, suggesting that the deployed model should only be trained using the latest data. For this particular data set it would be sensible to retrain and redeploy the model annually.
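
One simple way to act on this is to compare each new MAE against a baseline and flag periods that exceed a tolerance. The sketch below uses the MAE values printed earlier in this post; the 20% tolerance and the variable names are assumptions chosen for illustration, not a {vetiver} feature.

```r
# MAE on the held-out test data (2000 to 2002), copied from the output above
baseline_mae = mean(c(3.05, 3.43, 3.04))

# MAE on the "new" data for the years shown in the printed output
new_mae = data.frame(
  year = as.Date(c("2003-01-01", "2004-01-01", "2005-01-01")),
  mae = c(3.64, 3.60, 4.19)
)

# Assumed tolerance: flag any period whose MAE is over 20% above baseline
tolerance = 1.2
new_mae$retrain = new_mae$mae > tolerance * baseline_mae

new_mae
```

With these numbers only 2005 is flagged. A sensible tolerance will depend on the application and on how costly a poor prediction is.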

Summary

In this blog we have introduced the idea of monitoring models in production using the Vetiver framework. Using the life expectancy data from the World Health Organisation as an example, we have outlined how to track key model metrics over time and identify model drift.

As you start to retire your old models and replace these with new models trained on the latest data, make sure to keep ALL of your models (old and new) versioned and stored. That way you can retrieve any historical model and establish why it gave a particular prediction on a particular date.
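
As a rough sketch of what retrieving a historical version can look like with {pins}, the example below writes two versions of a stand-in object to a temporary versioned board and reads one back. The pin name "demo-model" and the list contents are made up; a real workflow would store {vetiver} models rather than plain lists.

```r
library("pins")

# Temporary versioned board, like the one used earlier in this post
demo_board = board_temp(versioned = TRUE)

# Write two versions of a stand-in "model" (plain lists for illustration)
pin_write(demo_board, list(trained_on = "2000-2002"), "demo-model", type = "rds")
pin_write(demo_board, list(trained_on = "2003-2005"), "demo-model", type = "rds")

# List every stored version of the pin
versions = pin_versions(demo_board, "demo-model")
versions

# Read back one specific version by its identifier
one_version = pin_read(demo_board, "demo-model", version = versions$version[1])
one_version$trained_on
```

Because the board is versioned, writing a new model does not overwrite the old one, so earlier versions remain available for auditing.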

The {vetiver} framework also includes an R Markdown template for creating a model monitoring dashboard. For more on this, check out the {vetiver} documentation.

The next post in our Vetiver series will provide an outline of the Python framework. Stay tuned for that sometime in the new year!


Highlights from Shiny in Production (2024)

Thu, 17 Oct 2024 23:59:00 +0000

Hot on the heels of Shiny in Production 2022 & 2023, we were excited to dive back into all things Shiny for a third consecutive year. In this post we recap the highlights from the two days of talks and workshops.

Workshops

As with previous iterations of the conference, we began on Day 1 with an afternoon of insightful workshops:

Level up your plots: Tips, tricks and resources for crafting compelling visualisations with R and ggplot2: Following her stand-out talk at Shiny in Production 2023, we were delighted to welcome back Cara Thompson for both a talk AND a workshop this year! Cara’s hands-on workshop offered attendees a chance to craft appealing and informative visualisations of their data without compromising on accessibility.
Building Responsive Shiny Applications: Our very own Shiny expert, Pedro Silva, shared some responsive design principles and best practices for Shiny developers to build fluid web pages that run on various screen sizes from desktops to mobile devices.
Asynchronous Shiny: Our data scientist and trainer, Russ Hyde, introduced the idea of asynchronous programming, providing attendees with the basics to solve between-session and within-session blocking in a Shiny app.
Building Apps for Humans: Osheen MacOscar (another JR data scientist!) explored the basics of human-computer interaction and outlined how layout, colour, size and motion in a Shiny interface can be used to enhance the user experience.

Noticing a trend here? At Jumping Rivers we offer training and upskilling in all things Shiny! If you’re keen to learn more about Shiny (or data science more generally) check out our full list of available training courses here.

Talks

On Day 2 we enjoyed talks from some fabulous speakers across a range of industries!

Keynote: Cara Thompson (Data visualisation consultant)

Data-To-Wow: Leveraging Shiny as a no-code solution for high-end parameterised visualisations

The vast majority of data visualisations start from the data, and while you may not know exactly how the final image will look at the start, you can tweak and refine your way to a result that looks good. But Cara had a slightly different challenge: take an existing data visualisation the client has designed, and recreate it in {ggplot2} so the plots can be quickly generated from any future data.

Cara guided us through how she tackled some of the challenges encountered along the way, such as creating your own {ggplot2} geom in order to draw straight lines between points when the plot uses polar coordinates. There were also lessons in why we shouldn’t always rely on a single numerical summary like the mean in a plot, when the raw data has the potential to show us patterns we’d ordinarily lose.

But in order to be useful for the client, all this hard work needs to be easy to use. When the client has no prior experience with running R code or using an IDE like RStudio, Shiny becomes a valuable tool for allowing anyone to run R code without needing to know how to run it. To make it as easy as possible to run the Shiny application, Cara provided her client with a desktop shortcut: clicking on it automatically launches the Shiny application in a background R process and displays the app in a web browser as normal. The net result is that the client can run the Shiny app locally on their computer, just like they would any other software application.

Pedro Silva (Jumping Rivers)

Convincing IT that R packages are safe

When IT departments are responsible for ensuring the security and integrity of the systems they manage, it’s understandable that IT managers will be reluctant to install software if they can avoid it. Combine this with the nature of open-source software often being maintained by thousands of volunteers—with some operating entirely under online pseudonyms—and they can also start to view some software with great scepticism. The issue becomes even more serious when you work in a heavily regulated industry—such as banking, pharmaceuticals or critical national infrastructure—where the systems could be scrutinised in an audit or the consequences of a compromised system can be severe.

Pedro provided an insight into the need to validate R packages, and the solutions Jumping Rivers is currently working on with organisations in industry. The aim is to provide information that summarises the risk of using any R package on CRAN based on the quality of its development. Users can specify additional and stricter test criteria for what should be checked and apply weighting of what testing criteria should have a greater influence on the final risk summary scores.

With this information, organisations will be able to determine if the package they want to use is safe enough to use. Where packages are identified to carry too much risk, they can invest time in fixing the issues in the weakest areas of the package, such as increasing test coverage on some of the package methods.

Pedro rounded off the talk by demonstrating the use of Quarto to generate summary reports and a Shiny dashboard that allows users to explore breakdowns of validation scores from different packages.

Vikki Richardson (Audit Scotland)

Faster than a Speeding Arrow - R Shiny Optimisation In Practice

When the size of your data is large enough to cause considerable loading times, it’s time to start optimising how your data is handled. Vikki talked us through how her team went about cutting their application loading times.

There are many strategies for trying to make a more performant application. The most obvious is to throw more compute resources at the problem; have more instances of the application run so concurrent users each get a faster experience. But it doesn’t necessarily solve the underlying issue, it inflates your compute costs, and as a solution, it lacks a certain elegance.

Data caching with the {memoise} package can provide a great speed-up, but with Arrow there were further gains to be found. Vikki explained how her team were able to interface with Apache Arrow using {dplyr} syntax, and highlighted some of the drawbacks of this method, such as certain {dplyr} verbs not being supported by Apache Arrow. The end result was certainly impressive: the combination of caching and Apache Arrow saw data processing times slashed from several minutes to under 2 seconds.
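
For readers who have not used {memoise} before, here is a minimal sketch of function caching. It is not the code from the talk; slow_summary() is a made-up stand-in for an expensive data-processing step.

```r
library("memoise")

n_calls = 0

# Stand-in for an expensive data-processing step
slow_summary = function(n) {
  n_calls <<- n_calls + 1  # count how often the real work is done
  Sys.sleep(0.5)           # simulate a slow computation
  sum(seq_len(n))
}

# Wrap the function so that repeated calls with the same input are cached
cached_summary = memoise(slow_summary)

first = cached_summary(1000)   # slow: runs the function
second = cached_summary(1000)  # fast: returns the cached result

n_calls
```

The second call returns the stored result without rerunning the function body, which is why the counter only increases once.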

Gareth Burns (Exploristics)

Shiny in Secondary Education: Supplementing traditional learning resources to allow students to explore statistical concepts

Gareth describes this project as “a passion project that wouldn’t have been successful without Shiny”. It all started with a call to action by Steve Mallett, which led to Gareth volunteering his time to develop a Shiny application for science communication in secondary schools that makes learning fun, engaging and interactive.

The app addressed a number of issues with how the existing workshops were performed, such as removing the need for most physical materials to be sent to different locations. There were valuable lessons along the way too, such as the importance of making your application robust to the mischievous minds of secondary-level students who will find creative ways to break your work, and ensuring your data visualisations will be understandable to your target audience. There were also ideas for minimising potential human data-input errors by capturing the data within the application itself.

The live demonstration of the Shiny application showcased some well-designed custom-made widgets and modules, crafted using self-made HTML, CSS and JavaScript. Gareth also shared ideas for gamifying the activity to make the application more engaging for young students. The slides can be found here and the (messy) code is on GitHub.

Tan Ho (Zelus Analytics)

A minimum viable Shiny infrastructure for serving 95,000 monthly users

How do you support many – many – users of a Shiny app? Tan took us through the lifetime of the DynastyProcess Fantasy Football app, which he originally built with his friend Joe Sydlowski. Neither of them had made a Shiny app before, and Tan had never written any R, yet within 2 years of running they had 200,000 unique users per month. Tan presented a series of top tips to ensure that your app keeps running, grows with its audience, and gets you that data science job that you dreamed of. 1) Try running your own Shiny Server; this cheap option could help you scale up your app when you need to. 2) Don’t do too much inside your app; push as much data processing outside of it as possible. 3) Log everything. 4) Listen to your users. But most importantly, start from where you are because “there’s always much to learn”.

Talk materials available here

Katy Morgan (Government Internal Audit Agency)

More than just a chat bot: Tailoring the use of Generative AI within Government Internal Audit Agency with user-friendly R shiny applications

Katy presented an insight into three Shiny apps that are used while making government audits. These are used at different stages of the audit process and make use of ChatGPT. For example, when thinking about the risks within a project, what are the possible causes of, events associated with, and consequences of those risks? A series of predefined ChatGPT prompts are used to suggest text that expert auditors can make use of within their workflow. The apps are deployed on Azure App Service and make use of the golem framework and Docker to simplify development, deployment and authentication.

Lightning Talks

This year also featured a segment for lightning talks with a twist: all speakers would be presenting from auto-scrolling slides which would change every 20 seconds! This turned out to be much less dramatic than anticipated, with our lightning speakers all staying perfectly on time…

Here’s a brief synopsis of each talk.

The SK8 Project: A scalable institutional architecture for managing and hosting Shiny applications

David Carayon (INRAE) started his talk by noting that, while Shiny is a great tool for building web apps, it’s not always easy to share these with colleagues. In particular, paid solutions such as Posit and AWS are not always feasible for Shiny users. Enter SK8, which offers a cost-effective solution for deploying Shiny apps to the web using Kubernetes. The deployment process involves an automated CI/CD pipeline which unpacks the app dependencies and creates a Dockerfile which is deployed by Kubernetes to the web. In the space of just a few years, the service has grown to 100 deployed applications! The talk materials can be found here.

Monitoring and improving Posit Workbench usage behaviour at Public Health Scotland

At PHS there are over 450 active users of Posit. Alasdair Morgan showed us how he has been reporting on Posit Workbench usage habits, with the aim of keeping costs down by avoiding wastage of the allocated resources. User activity is tracked by Azure logs and reported using R Markdown. These reports include hard-hitting visualisations of the proportion of allocated memory and CPUs that are actually being used. Remembering that this was a Shiny conference, Alasdair showed us a dashboard highlighting some of the worst offenders! (anonymised of course…)

Alasdair’s light-hearted examination of the user habits at PHS went down very well with our audience and went on to take the prize for best lightning talk!

Using Google Lighthouse to analyse Shiny Applications

Fresh from his workshop the day before, Osheen MacOscar introduced us to Google Lighthouse, an open source tool for assessing various metrics of web-based apps including load speed, interactivity and accessibility. Selecting 134 Shiny apps from Appsilon, Osheen showed that only 40 apps were listed as having “good” performance. Osheen went on to show that as complexity is added (such as interactive plots and widgets) performance can decrease due to slower load times. However, this is not always a bad thing since widgets can also improve the user experience. In summary, Lighthouse is a great tool for assessing apps but we shouldn’t let it stop us from adding (useful) complexity to our apps.

Risk Assessment as a Service (Project Roll-out)

Another of our data scientists, Astrid Radermacher, discussed the importance of risk assessment in various industries and our efforts at JR to roll out package validation as an automated service. Our process involves assessing the package (checking if it is secure and well maintained) and generating a report. If the package passes our checks, it can be included in the client’s regulated environment. Otherwise we can offer manual remediation. Having largely focused on package assessment, our next steps will be on improving package remediation and scaling the automated service to different user operating systems. We look forward to onboarding additional clients with the service in early 2025 and releasing to open source in the longer term.

Chagas Diagnostic Algorithms: an online application to estimate cost and effectiveness of diagnostic algorithms for Chagas disease

Juan Vallarta (FIND) began by outlining the challenges in diagnosing Chagas disease (a global disease which is particularly prevalent in Latin America). Diagnosis is often financially and logistically challenging and can either be conducted in a lab setting or more rapidly onsite. He then presented an online tool for estimating the cost and effectiveness of different diagnostic algorithms, taking into account the sensitivity and specificity of each test. The results can be viewed in a user interface and downloaded as an HTML report. The app has been deployed globally, not just for Chagas but for other diseases including COVID-19 and mpox.

rainbowR

Our final lightning talk was given by Ella Kaye (University of Warwick) who spoke about rainbowR, which aims to connect, support and promote LGBTQ+ users within the R community. Since 2017 the rainbowR community has grown to over 130 members and runs monthly meetups. The community also manages a social data project (tidyRainbow) which collates LGBTQ+ datasets. To find out how to join and contribute, check out rainbowr.org/.

What happens next?

We want to say thank you to the sponsors of the event for your support in making it possible!

Thanks also to our speakers and attendees who travelled from near and far to make it another memorable conference! Check out our YouTube channel where the talk recordings will be released in the coming weeks!

We’re already planning on running Shiny in Production again! The 2025 edition will run on the 8th & 9th of October and you can already grab your super early bird tickets here. We can’t wait to share more details with you soon!

For updates and revisions to this article, see the original post

First Steps in Python Testing

Thu, 05 Sep 2024 23:59:00 +0000

Programming is a craft, and in data science we often spend countless hours coding. There isn’t a magic shortcut to improving your programming skills. But, like any craft, improvement comes from practice: challenging yourself, exploring related skills, learning from others, and teaching.

Testing code using automated tools is common throughout the software development industry. This technique can improve the quality of the code you write as a data scientist. Testing helps refine your code, supports redesign, prevents errors, and makes it harder to write single-use code.

Here, we introduce the pytest framework and show how it can be used to test Python functions. If you don’t use a testing framework as part of your daily workflow, try experimenting with the techniques presented here the next time you write or extend a function.

About pytest

pytest is a software testing framework: a command-line tool that automatically finds the tests you’ve written, runs them, and reports the results. It is known for its simplicity, scalability, and powerful features such as fixture support and parametrization. Compared to the unittest module in Python’s standard library, it has a more concise syntax and a rich plugin ecosystem.

Getting started with pytest

Before we start writing tests, it’s important to set up a clean, isolated environment where we can install and manage packages. This is done using a virtual environment.

We first navigate to the project directory and create a virtual environment for our project. We then activate the virtual environment (the second command below) and install pytest.

$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ pip install pytest

We have everything set up to use pytest in our project. When we are done working in the virtual environment, we can deactivate it by simply running:

$ deactivate

Now that your environment is set up, let’s explore the basics of pytest.

What is a test?

A test is a small piece of code (usually a function) that checks whether another piece of code is working as expected. For example, imagine you wrote a function to calculate the mean of a list of numbers. A test would check if the function correctly computes the mean for different inputs.

Let’s create a simple function that calculates the mean of a list of numbers x and save it in my_functions.py:

# ./my_functions.py
def calculate_mean(x):
    return sum(x) / len(x)

A very nice property of pytest is test discovery: a set of naming conventions that tells pytest where to search for tests and how to run them. The name of any file that contains test functions should start with test_, and the test functions in that file should be named in the same way. pytest will then automatically find these functions and run them.
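As a rough illustration of these naming conventions, the sketch below mimics pytest’s default patterns using only the standard library. The helper names looks_like_test_file and looks_like_test_function are hypothetical, and this is a simplification rather than pytest’s actual collection code. By default pytest also accepts file names ending in _test.py.

```python
from fnmatch import fnmatchcase

# pytest's default file name patterns (the python_files setting)
DEFAULT_FILE_PATTERNS = ("test_*.py", "*_test.py")


def looks_like_test_file(filename):
    """Hypothetical helper: does the file name match a default pattern?"""
    return any(fnmatchcase(filename, p) for p in DEFAULT_FILE_PATTERNS)


def looks_like_test_function(name):
    """Hypothetical helper: by default, test function names start with 'test'."""
    return name.startswith("test")


print(looks_like_test_file("test_my_functions.py"))     # True
print(looks_like_test_file("my_functions.py"))          # False
print(looks_like_test_function("test_calculate_mean"))  # True
```

So a file called my_functions.py is never collected as a test file, which is why the function under test and its tests can safely live side by side.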

Now, let us write a test for this function using pytest. Create a file named test_my_functions.py:

# ./test_my_functions.py
from my_functions import calculate_mean


def test_calculate_mean():
    x = [1, 2, 3, 4, 5]
    result = calculate_mean(x)
    expected = 3.0
    assert result == expected

In this example, test_calculate_mean() is a test function. It checks if calculate_mean([1, 2, 3, 4, 5]) returns 3.0. When we run pytest, it will check if the assert statement holds true.

$ pytest test_my_functions.py

============================= test session starts ==============================
test_my_functions.py . [100%]

============================== 1 passed in 0.01s ===============================

We can see that the test has successfully passed. In the output, the dot (.) after test_my_functions.py indicates that the test has passed.
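Tests are not only for the happy path. As a minimal sketch, we might also check what calculate_mean does with an empty list: len([]) is 0, so the division raises a ZeroDivisionError. The test name below is our own choice, and the function is repeated here only so that the example is self-contained. pytest.raises is the context manager pytest provides for asserting that an exception is raised.

```python
import pytest


def calculate_mean(x):
    # Same function as in my_functions.py, repeated to keep the sketch self-contained
    return sum(x) / len(x)


def test_calculate_mean_empty_list():
    # The test passes only if the expected exception is raised inside the block
    with pytest.raises(ZeroDivisionError):
        calculate_mean([])
```

Whether an empty input should raise an error or return some default value is a design decision. Writing the test forces us to make that decision explicit.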

Now, let’s have a look at an example of a failing test. Consider the following test function which is in the file test_failing.py.

# ./test_failing.py
def test_addition():
    result = 2 + 2
    expected = 5
    assert result == expected

We run pytest from the command-line and investigate the output.

$ pytest -v test_failing.py
============================= test session starts ==============================
test_failing.py::test_addition FAILED [100%]

=================================== FAILURES ===================================
________________________________ test_addition _________________________________

    def test_addition():
        result = 2 + 2
        expected = 5
>       assert result == expected
E       assert 4 == 5

test_failing.py:4: AssertionError
=========================== short test summary info ============================
FAILED test_failing.py::test_addition - assert 4 == 5
============================== 1 failed in 0.03s ===============================

This time pytest provides us with a message giving information on the error and also highlights any reasons that have caused the test to fail. The -v or --verbose command-line flag is used to reveal more verbose output.

The assert statement

The assert statement is used to verify that a given condition is True. If the condition is False, the test fails. In our first example, the statement assert result == expected asserts that the result from calculate_mean(x) equals 3.0. If the condition does not hold, pytest reports a failure.
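One thing to watch out for is exact equality with floating-point results. The sketch below, which uses an example input of our own, compares within a tolerance using math.isclose from the standard library and attaches a message to the assert so that a failure is easier to read. pytest also offers pytest.approx for the same purpose.

```python
import math


def calculate_mean(x):
    # Same function as in my_functions.py, repeated to keep the sketch self-contained
    return sum(x) / len(x)


def test_calculate_mean_floats():
    result = calculate_mean([0.1, 0.2, 0.3])
    # Exact equality can fail for floats because of rounding,
    # so compare within a relative tolerance instead.
    assert math.isclose(result, 0.2, rel_tol=1e-9), f"expected about 0.2, got {result}"
```

The text after the comma is optional. If the assertion fails, that message appears in the failure report.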

Pytest fixtures

Suppose you had written several functions that all work on some non-trivial dataset, and you want to write a test-function for each. In each test-function, you would have to create a dataset of the required form, pass it into the function-under-test, and then compare the output to some expected value. The code for creating a test-dataset may get duplicated between the different test-functions.

Fixtures in pytest are helper functions which are used to set up conditions that we want to be available for multiple tests. This might involve putting together some test data, or preparing some other state before a test runs (connecting to a database, creating a temporary file). Fixtures are run before (and sometimes after) the actual test functions. The @pytest.fixture decorator is used to tell pytest that a function is a fixture. Fixtures can perform actions (like setting up a database connection), and can inject data into a test function.

To illustrate let us consider a fixture that provides us with a list of numbers in our test file test_my_functions.py:

# ./test_my_functions.py
import pytest

from my_functions import calculate_mean


@pytest.fixture
def sample_numbers():
    return [1, 2, 3, 4, 5]


def test_calculate_mean(sample_numbers):
    result = calculate_mean(sample_numbers)
    expected = 3.0
    assert result == expected

By using @pytest.fixture, we have defined a sample_numbers fixture that returns the list [1, 2, 3, 4, 5]. This fixture can be used in any test function by adding it as an argument. Fixtures are especially useful when you need to set up more complex objects that multiple tests will use.

The test output would be:

$ pytest -vv test_my_functions.py

============================= test session starts ==============================
test_my_functions.py::test_calculate_mean PASSED [100%]

============================== 1 passed in 0.00s ===============================
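Fixtures can also tidy up after a test. The following is a minimal sketch of a fixture with setup and teardown steps, assuming a hypothetical read_numbers helper that reads one number per line from a text file. Code before the yield runs before the test, and code after it runs once the test has finished. tmp_path is a built-in pytest fixture that provides a temporary directory unique to each test.

```python
import pytest


def calculate_mean(x):
    # Same function as in my_functions.py, repeated to keep the sketch self-contained
    return sum(x) / len(x)


def read_numbers(path):
    """Hypothetical helper: read one number per line from a text file."""
    return [float(line) for line in path.read_text().splitlines() if line.strip()]


@pytest.fixture
def numbers_file(tmp_path):
    # Setup: create a small data file before the test runs
    path = tmp_path / "numbers.txt"
    path.write_text("1\n2\n3\n4\n5\n")
    yield path
    # Teardown: runs after the test has finished
    path.unlink()


def test_mean_from_file(numbers_file):
    assert calculate_mean(read_numbers(numbers_file)) == 3.0
```

Note that a fixture can itself request other fixtures (here, tmp_path) simply by naming them as arguments.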

Parametrization

Parametrization is an important feature of pytest which allows us to run a test with multiple sets of parameters. This is helpful when we want to check the same logic under different conditions without writing separate test functions.

Here is how we can test calculate_mean from the test_my_functions.py file, by considering multiple inputs using parametrization:

# ./test_my_functions.py
import pytest

from my_functions import calculate_mean


@pytest.mark.parametrize("numbers, expected", [
    ([1, 2, 3, 4, 5], 3.0),
    ([10, 20, 30], 20.0),
    ([7, 14, 21], 14.0),
    ([5, 5, 5, 5], 5.0),
])
def test_calculate_mean_parametrized(numbers, expected):
    result = calculate_mean(numbers)
    assert result == expected

In this example, @pytest.mark.parametrize allows us to test calculate_mean with four different lists. Each tuple in the list passed to parametrize represents a different test case with its own numbers and expected values.

Then to run the test we use:

$ pytest -v test_my_functions.py

========================================= test session starts =========================================
test_my_functions.py::test_calculate_mean_parametrized[numbers0-3.0] PASSED [ 25%]
test_my_functions.py::test_calculate_mean_parametrized[numbers1-20.0] PASSED [ 50%]
test_my_functions.py::test_calculate_mean_parametrized[numbers2-14.0] PASSED [ 75%]
test_my_functions.py::test_calculate_mean_parametrized[numbers3-5.0] PASSED [100%]

========================================== 4 passed in 0.01s ==========================================

The output is slightly different here because we are testing for different scenarios and the result is given for each of them.
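The automatically generated labels such as numbers0 are not very descriptive. As a sketch, each case can be given a readable name by wrapping it in pytest.param with an id. The test cases and id strings below are our own choices for illustration.

```python
import pytest


def calculate_mean(x):
    # Same function as in my_functions.py, repeated to keep the sketch self-contained
    return sum(x) / len(x)


@pytest.mark.parametrize(
    "numbers, expected",
    [
        pytest.param([1, 2, 3, 4, 5], 3.0, id="consecutive"),
        pytest.param([5, 5, 5, 5], 5.0, id="all-equal"),
        pytest.param([-2, 2], 0.0, id="mixed-signs"),
        pytest.param([42], 42.0, id="single-value"),
    ],
)
def test_calculate_mean_with_ids(numbers, expected):
    assert calculate_mean(numbers) == expected
```

With the -v flag, each case is then reported under its id, for example test_calculate_mean_with_ids[consecutive], which makes a failing case easier to spot.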

Test organization

In the above, our test scripts (test_my_functions.py and test_failing.py) and Python modules (my_functions.py) were all in the same directory. We used this approach for simplicity (as our focus was on how to write and run tests). In a larger project you may have many test scripts and Python modules, and this approach will quickly become difficult to manage.

To keep your project organised, it’s a good practice to place all tests in a tests/ directory. This way, when we run pytest we receive a summary of all the project’s tests. On making this change, the file structure for the above example is:

./intro-to-python/
├── my_functions.py
├── tests/
│   ├── test_failing.py
│   └── test_my_functions.py
└── venv/

However, there is a small problem here. The my_functions.py module must be imported by the test_my_functions.py test script. But if we call pytest tests/ from the project root, the project root isn’t automatically added to the Python search path (the collection of directories from which the running Python session can import packages and modules), so my_functions.py can’t be imported by test_my_functions.py.

A simple solution for this is to use the following command instead of pytest tests/:

$ python -m pytest tests/

When we run pytest through python -m, the current directory is added to the Python search path, so any Python modules in it can be imported.

A more robust solution (and one we would recommend for larger projects) is to place your Python modules in a package structure, though that is beyond the scope of this introduction to pytest.
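To see the search path issue in isolation, here is a small standard-library sketch. It recreates the layout above in a temporary directory and asks Python’s import machinery whether my_functions can be found in a given list of directories. The can_import and demo helpers are hypothetical, and the two directory lists only approximate what ends up on the search path in each case. This is an illustration, not what pytest does internally.

```python
import tempfile
from importlib.machinery import PathFinder
from pathlib import Path


def can_import(module_name, search_dirs):
    """Hypothetical helper: is module_name found in any of the directories?"""
    spec = PathFinder.find_spec(module_name, [str(d) for d in search_dirs])
    return spec is not None


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Recreate the project layout: module in the root, tests in tests/
        root = Path(tmp)
        tests_dir = root / "tests"
        tests_dir.mkdir()
        (root / "my_functions.py").write_text(
            "def calculate_mean(x):\n    return sum(x) / len(x)\n"
        )
        (tests_dir / "test_my_functions.py").write_text(
            "from my_functions import calculate_mean\n"
        )
        # Roughly the situation with `pytest tests/`: only tests/ is searched
        only_tests = can_import("my_functions", [tests_dir])
        # Roughly the situation with `python -m pytest tests/`: the root is searched too
        with_root = can_import("my_functions", [tests_dir, root])
        return only_tests, with_root


print(demo())  # (False, True)
```

In a real session you can inspect the search path directly with import sys; print(sys.path).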

Ready to start testing your code? Enjoy your journey into Python testing, and happy coding!

Shiny in Production 2024: Full speaker lineup

Thu, 08 Aug 2024 23:59:00 +0000

We are pleased to announce the full line-up for this year’s Shiny in Production conference! This year, we’re introducing a new lightning talk session. These short 5-minute talks will allow us to showcase many more uses of Shiny in Production. The conference will still feature 6 full-length talks alongside the lightning talk session.

Talks

Cara Thompson - Freelance Data Consultant

Data-To-Wow: Leveraging Shiny as a no-code solution for high-end parameterised visualisations

You’ve created a prototype visualisation, fine-tuned it so it looks amazing and perfectly on-brand, and turned the plot code into a function so that you can run it again on different data and highlight different aspects of the story. Others on the team have seen how good the outputs look and they want in on the magic! But they don’t want to learn R.

This talk will offer a behind-the-scenes look at the process of creating a Shiny App that functions as a black box to get straight from the data to high-end parameterised visualisations. We’ll start by looking at creating parameterised plot functions using ggplot, before exploring how to bring the data and parameterisation into Shiny to create a seamless no-code data-to-viz workflow for the users.

Gareth Burns - Exploristics

Shiny in Secondary Education: Supplementing traditional learning resources to allow students to explore statistical concepts

The Statisticians in the Pharmaceutical Industry (PSI) Schools Outreach initiative aims to promote data literacy and statistical concepts to the next generation of Statisticians and Data Scientists. Volunteers attend secondary schools to present specialised workshops which are designed to be interactive, engaging and aligned to the national curriculum for different age groups.

The PSI Visualisation Special Interest Group (VIS SIG) created a Shiny application to supplement an existing workshop for Asthma. This workshop aims to introduce the students to analysis of continuous data and make them think about concealing treatment assignment and consider false positive and false negative results. The application allowed electronic data capture and gave students the ability to dynamically explore their own data, reinforcing the statistical concepts and making learning more engaging and accessible.

Each school is different in terms of class size, computer resources and student abilities, therefore the application needed to be flexible to account for this and enable independent set up by a volunteer instructor. User experience and accessibility were fundamental in the design concepts to ensure the application was appropriate for a classroom environment and data visualisations were at an appropriate level for students.

In this presentation we discuss the range of issues involved in getting a Shiny application, built by a team of volunteers, into a classroom setting. This includes flexible project management for a team of volunteers, use of persistent storage to enable multiple simultaneous users and use of Shiny modules to make code flexible and scalable for future workshops.

Cassio Felix Jardim - Data4Shiny

Creating any User Interface in Shiny: The Importance of CSS in Shaping Shiny Apps’ User Interface

The main goal of this presentation is to use CSS concepts to assist in building User Interfaces for Dashboards constructed through Programming Languages, in particular the R language and its Dashboard creation package (the shiny package).

The presentation aims to demonstrate that CSS is crucial for organizing the elements of our Dashboard on the screen and also for the aesthetic aspect of the Dashboard User Interface.

Through the concepts of CSS Flexbox and CSS Grid, the presentation will take on a tutorial format where the entire process of constructing the user interface of any dashboard will be covered from start to finish. The main idea is to consider elements of storytelling, UI Design, and UX Design in the process of building a Dashboard.

The Shiny package and its entire ecosystem include various packages that bridge the gap between Data Science and Web Design, especially languages like HTML, CSS, and JavaScript. Creating this “bridge” between the worlds of Data Science and Web Design is my main objective.

Katy Morgan - Government Internal Audit Agency

More than just a chat bot: Tailoring the use of Generative AI within Government Internal Audit Agency with user-friendly R shiny applications

Generative AI offers huge potential for driving creativity by suggesting new ideas and perspectives and can also improve efficiency by rapidly processing and extracting insights from large volumes of text. However, using a chatbot-style tool such as ChatGPT can be overwhelming as users have to work out, through trial and error, which questions and instructions give them the outputs they need. The Government Internal Audit Agency’s data analytics team has created two R shiny web applications, each of which simplifies the user’s experience of using generative AI by providing a user-friendly interface and implementing a set of standardised prompts. The Risk Engine walks the user through a stepwise process to explore and articulate the potential risks that might impact any given business objective. The Writing Engine enables users to analyse and generate text in several ways, including generating a draft audit report from rough notes, and summarising common themes from a set of audit reports. This presentation will cover the process of developing and deploying the web applications and the challenges we faced along the way, describing how we tailored the appearance and functionality of the apps to best meet user needs.

Keith Newman - Jumping Rivers

Title coming soon

Following a PhD in statistics at Newcastle University, Keith developed software to improve road safety modelling. He enjoys creating Shiny apps and teaching the use of R.

Vikki Richardson - Audit Scotland

Faster than a Speeding Arrow - R Shiny Optimisation In Practice

The task of optimising your R Shiny apps for great performance can be challenging. Ensuring your code is efficient, using promises where you can, caching resources, and reducing the number of widgets or reactive variables can all help. But datasets can’t be squeezed any more – or can they? By storing larger chunks of data in Arrow format and using the Arrow package for manipulation, we were able to speed up some slower computations by at least one order of magnitude - often more.

This presentation will cover a case study of migrating a financial data auditing system to Arrow data storage. Because of Arrow, we were able to drop from two Connect servers to one, making management very happy with the cost savings - and delighting our users with the new, snappier application.

Lightning talks

Yigit Aydede - Saint Mary’s University

Transforming Community Understanding: A Shiny Application for Real-Time Crime and Real Estate Market Insights in Nova Scotia

This presentation showcases the Nova Scotia Property Insights (NSPI) application, a Shiny-based tool designed to provide comprehensive neighborhood insights through the integration of crime statistics and real estate market data. NSPI leverages the power of interactive maps to offer users a dynamic and engaging experience, facilitating informed decision-making for residents, potential homebuyers, policymakers, and researchers.

The core functionality of NSPI includes real-time visualization of crime data and property market trends across Nova Scotia neighborhoods. Users can select specific areas on the map to view detailed statistics within customizable radii, offering a granular perspective on local conditions. The application features a user-friendly interface with multiple tabs, including crime type comparisons, real estate market analysis, and historical data trends.

One of the key innovations of NSPI is its ability to allow users to perform side-by-side neighborhood comparisons. By simply clicking on different map areas, users can generate comparative reports that highlight variations in crime rates and property values. This feature is particularly valuable for those considering relocation or investment in Nova Scotia.

The presentation will delve into the technical aspects of developing NSPI, including data integration, user authentication, and the creation of a responsive UI. Additionally, we will discuss the challenges encountered and the solutions implemented to ensure data accuracy and user engagement.

Abbie Brookes & Jeremy Horne - Datacove

Shiny Policies: Dashboards to Aid British Government Decisions

In collaboration with Natural England, Datacove developed a bespoke Shiny dashboard for informed government decision-making, covering Health and Wellbeing, Nature, and Sustainability (HWNS). This presentation will outline three major topics: project and data management, our approach to customization, and the route taken to enhance usability.

The first phase involved project and data management to establish clear expectations. By engaging with Natural England stakeholders, we ensured that the envisioned product met their specific needs and provided a tangible preview of the dashboard’s functionality and design. We connected to government APIs and used R to extract, process, and transform multiple sources of HWNS data, bringing this information into one place for localised decision-making.

In the second phase, we focused on customisation to ensure seamless integration with Natural England’s existing webpage. Using the brand guidelines and custom CSS/JavaScript, we ensured that the dashboard had the same look and feel as other products built outside of Shiny. This step was crucial in maintaining a cohesive user experience by complementing their established digital ecosystem, making the dashboard easy to access and increasing the likelihood of use.

In the third phase, we emphasized making the dashboard accessible to all, regardless of data literacy. We implemented user-friendly design principles, pre-calculated dynamic stats, and intuitive navigation. For example, we built interactive charts using libraries such as Leaflet and Highcharts, which ensured that comparisons were clear and easy to dynamically explore. We will demonstrate our tips for easy interactive visualisations.

Throughout the project, we adopted best practices in data interpretation and are looking forward to sharing our insights at Shiny in Production.

David Carayon - INRAE

The SK8 project: A scalable institutional architecture for managing and hosting Shiny applications

Introducing the SK8 Project (Shiny Kubernetes Service), where data scientists, statisticians and engineers from INRAE, the French national research institute for agriculture, food and environment, have teamed up to create a new solution for managing and hosting Shiny applications.

Shiny has become very popular in our institute, widely used for sharing, showcasing, and democratizing scientific work. However, the enduring challenge of establishing scalable, secure, and sustainable hosting for these apps had yet to be addressed.

So, after realizing that different research labs had each implemented their own local and makeshift solutions, we put on our thinking caps and decided to craft an open-source institutional solution. Our mission? Break down silos, unite the R community at INRAE, and make hosting applications easy for Shiny developers with no IT backgrounds.

The SK8 infrastructure allows Shiny code to be hosted on a GitLab instance open to all INRAE staff. We’ve got pipelines (GitLab CI/CD), stability ({renv}), containerization with Docker, scalability and seamless deployment in a Kubernetes cluster. All of this is developed, managed, and maintained by the SK8 team using open-source solutions.

Using SK8 is a piece of cake – just toss your application code into a dedicated GitLab project and hit the “play” button.

In this talk, we will be speaking about the project itself, the ecosystem that’s making it all happen and how you could replicate this in your own company.

Juan Ramon Vallarta Robledo - FIND

Chagas diagnostic algorithms: an online application to estimate cost and effectiveness of diagnostic algorithms for Chagas disease

Chagas disease, caused by the Trypanosoma cruzi parasite, is a significant public health concern in Latin America, with an estimated 6-7 million people affected and increasing incidence rates worldwide. Examining the available diagnostic tests and their cost-effectiveness is essential for improving early diagnosis, which is crucial in managing the disease and preventing severe chronic conditions. To address this, FIND, a non-profit organization dedicated to facilitating equitable access to reliable diagnosis, developed Chagaspathways to provide guidance for Chagas disease testing.

The application is entirely built using Shiny and incorporates a separate R library (patientpathways), developed by FIND, that contains all the analysis algorithms. It is designed to let users select different scenarios and specify parameters about the target population they are analyzing, like prevalence, testing costs, and the type of test used. The results show the recommended testing approach, the expected number of diagnosed cases, the cost per diagnosed case, along with the positive and negative predictive values. A comprehensive outcomes table is included in the results section and users have the option to download the results as an HTML report, to help them with further dissemination.
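To give a feel for the quantities listed above, here is a minimal Python sketch of the textbook formulas for positive and negative predictive value, together with a simple cost per diagnosed case for a single test. This is not the patientpathways code, which is not shown in the abstract. The function and parameter names are hypothetical, and we assume that a diagnosed case means a true positive.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value for a single test."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


def cost_per_diagnosed_case(n_tested, cost_per_test, sensitivity, prevalence):
    """Total testing cost divided by the expected number of true positives."""
    diagnosed = n_tested * prevalence * sensitivity
    return n_tested * cost_per_test / diagnosed


ppv, npv = predictive_values(sensitivity=0.9, specificity=0.9, prevalence=0.5)
cost = cost_per_diagnosed_case(
    n_tested=1000, cost_per_test=2.0, sensitivity=0.8, prevalence=0.1
)
print(round(ppv, 3), round(npv, 3), round(cost, 2))
```

In the second call, 1,000 people are tested at a cost of 2 each, and 80 cases are expected to be found, giving a cost of 25 per diagnosed case. Lower prevalence pushes this cost up, which is one reason the choice of testing approach depends on the target population.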

The Chagaspathways application is designed to be a user-friendly tool for public health professionals, recommending the most economical testing approaches to maximize resources and achieve the best results for patients and healthcare infrastructures. The application is intended to expand its scope to cover additional diseases, aiming to become an essential asset in global health initiatives for disease diagnostic modeling.

Shiny in Production 2024: Workshops

Thu, 04 Jul 2024 23:59:00 +0000

Shiny in Production is returning to the Catalyst, Newcastle upon Tyne, for its third instalment this October. We’ve expanded the itinerary this year, with four workshops to choose from as well as a day of talks, with speakers soon to be announced. Full details of the workshops are below, and you can head over to the conference website to register. Join us for an immersive experience tailored for both beginners and advanced users of Shiny and other web-based R packages.

The first day of the conference (Wednesday 9th October) will consist of the four parallel workshops, followed by a drinks reception in the evening: a great opportunity for networking and debriefing from the day’s learning.

Level up your plots: Tips, tricks and resources for crafting compelling visualisations - Cara Thompson

Data visualisations are a great asset in getting people talking about your findings. From making the patterns in the data easy to see, to making a big visual statement and keeping people talking beyond the end of your presentation, transforming your plots from functional to aesthetically pleasing and visually compelling is about so much more than making things pretty.

In this workshop, we’ll explore how we can make the most of colours, different plot types, text, and interactivity to maximise the impact of our visualisations. Here’s where we’re looking to boost your dataviz confidence:

crafting intuitive dataviz-friendly colour palettes without compromising on accessibility (or creativity!)
selecting the right type of dataviz for your data and your story
making the most of typography to optimise text hierarchy and readability
using annotations wisely to both help interpretation and declutter the visualisations
turning your ggplot into an interactive plot for additional data exploration
packaging up your decisions for easy reuse across plots (and projects!)

This is intended as a hands-on workshop, so bring along a laptop, a plot you’re working on or a research question, and some data. Throughout the workshop, I will highlight free resources for each of these aspects of dataviz development. The aim is for you to leave with a plot that you’d be happy to publish, and with some resources you can continue to build on.

About the speaker

Cara is a freelance data consultant with an academic background, specialising in dataviz and in “enhanced” reproducible outputs. She lives in Edinburgh, Scotland, and is passionate about maximising the impact of other people’s expertise.

Building Responsive Shiny Applications - Pedro Silva

The diverse range of devices used for modern web browsing presents challenges when designing an application that works well for all users. Enter responsive design: the practice of building fluid web pages that “work” on huge 4k and 5k monitors, tiny smartphones and all things in between. This course will look at responsive design principles and best practices for Shiny developers, covering page layout, easy-to-add widgets and some simple CSS tricks for when built-in solutions don’t quite cut it.

By the end of the workshop, participants will…

know what responsive web design is
know how to use flexible grids to adjust page layout for mobile, tablet and desktop
be able to use HTML5 elements and Shiny Widgets to use limited space efficiently and effectively
know how to add CSS and JavaScript snippets to an app for finer customisations
understand how to test Shiny apps on various screen sizes from desktop to mobile

About the speaker

Pedro is a full stack developer with over 15 years of experience in the field, loves front-end and R Shiny development, and is a moonlight practitioner of JavaScript dark arts.

Asynchronous Shiny - Russ Hyde

By the end of the workshop, participants will…

understand how within-session and between-session blocking can arise in a Shiny app
understand the basics of asynchronous computation
solve between-session blocking with future/promise
solve blocking the modern way, with ExtendedTask

About the speaker

Building Apps for Humans - Clarissa Barratt

Frameworks like Shiny and Dash can help those with a scientific or mathematical background communicate their research in a way that’s interactive and engaging. But while these tools can make constructing a graphical user interface quicker and easier, there’s no guarantee that the end product is going to be optimised for human use.

This workshop is aimed at scientists (and the curious) who are interested in learning some basics of human-computer interaction and gaining an understanding of how science itself can assist with the development of better user interfaces that, in turn, lead to improved user experiences.

By the end of the workshop, participants will…

understand the benefits that come from designing applications with the human mind in mind
know how the layout, colour, size and motion of interface and graphical components can be used to enhance (or detract from) a user’s experience
understand the importance of providing users with feedback so they can tell both whether their actions have been successful and what the current state of the application is
be able to identify some common problems found in web applications

About the speaker

While working towards her PhD in applied mathematics, Clarissa discovered her love of science communications. Her goal is to make data science accessible to everyone, and to encourage people to engage with the goings-on at Jumping Rivers.

What’s next?

Early bird tickets for the conference are available until the end of July, so don’t miss out! The full line-up of speakers will be announced in the coming weeks. Still not convinced? Head over to our YouTube channel to take a look at line-ups from previous years to see what we have in store.

For updates and revisions to this article, see the original post

A timeline of R's first 30 years

Thu, 27 Jun 2024 23:59:00 +0000

August 2023 marked the thirtieth anniversary of the first public release of the R programming language. To celebrate this, and to show how far the language has evolved across those three decades, the timeline below shows some landmark events, packages and papers (with some Jumping Rivers items thrown in for good measure). Have we missed any of your personal favourites? Let us know via our social media channels and we’ll see if we can squeeze them in. On browsers that support it, double click/tap on any image or video on the timeline to see it full screen.

You can also view the timeline as a standalone page.

For updates and revisions to this article, see the original post

Vetiver: Model Deployment

Thu, 20 Jun 2024 23:59:00 +0000

This is Part 2 of a series of blogs on {vetiver}:

Part 1: Vetiver: First steps in MLOps
Part 2: Vetiver: Model Deployment (this post)
Part 3: Vetiver: Monitoring Models in Production
Part 4: Vetiver: MLOps for Python

Introduction

In our previous blog, we provided an overview of MLOps and the {vetiver} package, creating and deploying a simple model locally. In this post, we’ll show you how to deploy a model to production using Posit Connect, SageMaker, and Docker.

What is Docker

Docker is an open-source platform that allows developers to build, deploy, and run containers. These containers bundle application source code with the operating system libraries and dependencies needed to run that code.

Previously, we discussed deploying a Shiny Application using Docker. Similarly, we can deploy a set of APIs to access our model.

Creating a Dockerfile

The {vetiver} package simplifies creating a Dockerfile. We simply run:

vetiver::vetiver_prepare_docker(
 pins::board_connect(),
 "colin/k-nn",
 docker_args = list(port = 8080)
)

This command accomplishes several tasks:

Uses the {renv} package to create a list of R package dependencies required to run your model.
Creates a file named plumber.R containing the necessary code to deploy an API, essentially just vetiver_api().
Generates the Dockerfile.

The Dockerfile includes several components. The first component sets the R version, specifies the package repository, and crucially, installs the necessary system libraries.

FROM rocker/r-ver:4.4.0
ENV RENV_CONFIG_REPOS_OVERRIDE https://packagemanager.rstudio.com/cran/latest

RUN apt-get update -qq && apt-get install -y --no-install-recommends \
 ...

The second component copies the renv.lock file and installs the required R packages:

COPY vetiver_renv.lock renv.lock
RUN Rscript -e "install.packages('renv')"
RUN Rscript -e "renv::restore()"

Finally, we have the plumber/API section

COPY plumber.R /opt/ml/plumber.R
EXPOSE 8080
ENTRYPOINT ["R", "-e", "pr <- plumber::plumb('/opt/ml/plumber.R'); pr$run(host = '0.0.0.0', port = 8080)"]

which runs the API on port 8080.

The container is built via

docker build --tag my-first-model .

The --tag flag allows you to name your Docker image. You can inspect your stored Docker images with:

docker image list
REPOSITORY       TAG      IMAGE ID       CREATED              SIZE
my-first-model   latest   792af21c775a   About a minute ago   1.33GB

To run the image, use

docker run --rm --publish 8080:8080 my-first-model

Posit Connect / SageMaker

We can also trivially publish the model to Posit Connect via

vetiver::vetiver_deploy_rsconnect(board = pins::board_connect(), "colin/k-nn")

Similarly, we can publish to SageMaker using the function vetiver_deploy_sagemaker().

For updates and revisions to this article, see the original post

Vetiver: First steps in MLOps

Thu, 13 Jun 2024 23:59:00 +0000

This is Part 1 of a series of blogs on {vetiver}. Future blogs will be linked here as they are released.

Part 1: Vetiver: First steps in MLOps (This post)
Part 2: Vetiver: Model Deployment
Part 3: Vetiver: Monitoring Models in Production
Part 4: Vetiver: MLOps for Python

Most R users are familiar with the classic workflow popularised by R for Data Science. Data scientists begin by importing and cleaning the data, then iteratively transform, model, and visualise it. Visualisation drives the modeling process, which in turn prompts new visualisations, and periodically, they summarise their work and report results.

This workflow stems partly from classical statistical modeling, where we are interested in a limited number of models and understanding the system behind the data. In contrast, machine learning prioritises prediction, necessitating the consideration and updating of many models. Machine Learning Operations (MLOps) expands the modeling component of the traditional data science workflow, providing a framework to continuously build, deploy, and maintain machine learning models in production.

Data: Importing and Tidying

The first step in deploying your model is automating data importation and tidying. Although this step is a standard part of the data science workflow, a few considerations are worth highlighting.

File formats: Consider moving from large CSV files to a more efficient format like Parquet, which reduces storage costs and simplifies the tidying step.

Moving to packages: As your analysis matures, consider creating an R package to encourage proper documentation, testing, and dependency management.

Tidying & cleaning: With your code in a package and tests in place, optimise bottlenecks to improve efficiency.

Versioning data: Ensure reproducibility by including timestamps in your database queries or otherwise ensuring you can retrieve the same dataset in the future.
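
As a minimal sketch of the versioning idea, the snippet below filters a small, made-up table by a fixed snapshot timestamp, so that re-running the code later returns the same rows even if newer records have been added. The records data frame, its loaded_at column and the snapshot_time value are assumptions for illustration only; in practice the same filter would usually sit in the WHERE clause of a database query.

```r
# Hypothetical table of records, each with a load timestamp
records = data.frame(
  id = 1:4,
  loaded_at = as.POSIXct(
    c("2024-04-01 09:00:00", "2024-04-15 09:00:00",
      "2024-05-10 09:00:00", "2024-06-01 09:00:00"),
    tz = "UTC"
  )
)

# Fix the snapshot time once and store it alongside the model
snapshot_time = as.POSIXct("2024-05-01 00:00:00", tz = "UTC")

# Only rows loaded on or before the snapshot are used for training
training_data = records[records$loaded_at <= snapshot_time, ]
training_data$id
```

Recording the snapshot time with the fitted model makes it possible to rebuild the same training set later.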

Modelling

This post isn’t focused on modeling frameworks, so we’ll use {tidymodels} and the {palmerpenguins} dataset for brevity.

library("palmerpenguins")
library("tidymodels")
# Remove missing values
penguins_data = tidyr::drop_na(penguins, flipper_length_mm)

We aim to predict penguin species using island, flipper_length_mm, and body_mass_g. A scatter plot indicates this should be feasible: it points to an obvious separation of Gentoo from the other species, but pulling apart Adelie and Chinstrap looks a little trickier.

Modelling-wise, we’ll again keep things simple - a straightforward nearest neighbour model, where we use the island, flipper length and body mass to predict species type:

model = recipe(species ~ island + flipper_length_mm + body_mass_g,
 data = penguins_data) |>
 workflow(nearest_neighbor(mode = "classification")) |>
 fit(penguins_data)

The model object can now be used to predict species. Reusing the same data that the model was fitted on (so the estimate is optimistic), we have an accuracy of around 95%.

model_pred = predict(model, penguins_data)
mean(model_pred$.pred_class == as.character(penguins_data$species))
#> [1] 0.9474

Vetiver Model

Now that we have a model, we can start with MLOps and {vetiver}. First, collate all the necessary information to store, deploy, and version the model.

v_model = vetiver::vetiver_model(model,
 model_name = "k-nn",
 description = "blog-test")
v_model
#> 
#> ── k-nn ─ <bundled_workflow> model for deployment 
#> blog-test using 3 features

The v_model object is a list with six elements, including our description.

names(v_model)
#> [1] "model" "model_name" "description" "metadata" "prototype" 
#> [6] "versioned"

v_model$description
#> [1] "blog-test"

The metadata contains various model-related components.

v_model$metadata
#> $user
#> list()
#> 
#> $version
#> NULL
#> 
#> $url
#> NULL
#> 
#> $required_pkgs
#> [1] "kknn" "parsnip" "recipes" "workflows"

Storing your Model

To deploy a {vetiver} model object, we use a pin from the {pins} package. A pin is simply an R (or Python!) object that is stored for reuse at a later date. The most common use case of the {pins} package (at least for me) is caching data for a Shiny application or Quarto document.

However, we can pin any R object, including a pre-built model. We pin objects to “boards”, which can exist in many places, including Azure, Google Drive, or a simple S3 bucket. For this example, I’m using Posit Connect:

vetiver::vetiver_pin_write(board = pins::board_connect(), v_model)

To retrieve the object, use:

# Not something you would normally do with a {vetiver} model
pins::pin_read(pins::board_connect(), "colin/k-nn")
#> $model
#> bundled workflow object.
#> 
#> $prototype
#> # A tibble: 0 × 3
#> # ℹ 3 variables: island <fct>, flipper_length_mm <int>, body_mass_g <int>
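
To experiment with pins without access to a Posit Connect server, a temporary local board can stand in. The sketch below assumes the {pins} package is installed and uses an arbitrary pin name, "cars-sample"; it stores a small data frame rather than a {vetiver} model, but the write-then-read pattern is the same.

```r
# A temporary board that lives in a temp directory for this session
board = pins::board_temp()

# Pin a small data frame under an illustrative name
cars_sample = head(mtcars)
pins::pin_write(board, cars_sample, name = "cars-sample", type = "rds")

# Read the pinned object back from the board
cars_restored = pins::pin_read(board, "cars-sample")
all.equal(cars_sample, cars_restored)
```

Swapping pins::board_temp() for another board constructor changes where the object is stored without changing the rest of the code.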

Deploying as an API

The final step is to construct an API around your stored model. This is achieved using the {plumber} package. To deploy locally, i.e. on your own computer, we create a plumber instance and pass the model using {vetiver}:

plumber::pr() |>
 vetiver::vetiver_api(v_model) |>
 plumber::pr_run()

This deploys the APIs locally. When you run the code, a browser window will likely open. If it doesn’t, simply navigate to http://127.0.0.1:7764/__docs__/.

If the API has successfully deployed, then

base_url = "127.0.0.1:7764/"
url = paste0(base_url, "ping")
r = httr::GET(url)
metadata = httr::content(r, as = "text", encoding = "UTF-8")
jsonlite::fromJSON(metadata)

should return

#$status
#[1] "online"
#
#$time
#[1] "2024-05-27 17:15:39"

The API also has the endpoints metadata and pin-url, allowing you to programmatically query the model. The key endpoint for MLOps is predict. This endpoint allows you to pass new data to your model and predict the outcome:

url = paste0(base_url, "predict")
endpoint = vetiver::vetiver_endpoint(url)
pred_data = penguins_data |>
 dplyr::select("island", "flipper_length_mm", "body_mass_g") |>
 dplyr::slice_sample(n = 10)
predict(endpoint, pred_data)

Summary

This post introduces MLOps and its applications. In the next post, we’ll discuss deploying models in production.

For updates and revisions to this article, see the original post

June 2024 Training Update

Thu, 06 Jun 2024 23:59:00 +0000

Our courses for the second half of 2024 have now been released. We have everything from the very basics of R and Python for data science, to advanced statistical modelling and machine learning. Interested in dashboards and reporting? We have courses on reporting with Quarto, as well as both introductory and advanced Shiny. Already know the basics but want to hone your skills? We have plenty of intermediate courses for you, as well as a course to take a look at some best practices in R and Python.

R

Introduction to R

Course Level: Foundation

Upcoming course dates: 3rd July, 7th October

Programming with R

Course Level: Intermediate

Upcoming course dates: 15th July, 21st October

Data Wrangling in the Tidyverse

Course Level: Foundation

Upcoming course dates: 10th July, 16th October

Data Visualisation with ggplot2

Course Level: Intermediate

Upcoming course dates: 22nd July, 4th November

R Best Practices

Course Level: Intermediate

Upcoming course dates: 22nd July

Object Oriented Programming in R

Course Level: Advanced

Upcoming course dates: 15th July

The training course will cover R object-oriented programming techniques. We’ll discuss what OOP is and the different varieties within R. Beginning with the popular S3 and S4 OOP frameworks, we’ll finish with the new {R6} package that is used extensively in Shiny applications. By the end of the course, participants will be able to use OOP within their own code.

Shiny

Introduction to Shiny

Course Level: Intermediate

Upcoming course dates: 10th July, 7th October

Do you want to provide interactive visualisation and data exploration features for users who do not have R and data science skills? Discover how easy it can be to use R and {shiny} to create your own apps and dashboards for exploring data without relying on web development or external BI tools. We will show you various examples of input widgets and outputs to display tables and visualisations.

Advanced Concepts in Shiny

Course Level: Advanced

Upcoming course dates: 23rd September, 14th October

Take your interactive {shiny} skills to the next level by creating more robust, responsive and maintainable applications. In this course, we’ll visit more advanced topics that can be used to improve the experience for both those producing the apps and those using them. Subjects will cover: additional ways to react to and validate user inputs; restructuring your app with modules; and an introduction to testing your {shiny} apps.

Python

Introduction to Python

Course Level: Foundation

Upcoming course dates: 9th September, 14th October

Programming with Python

Course Level: Intermediate

Upcoming course dates: 16th September, 23rd October

Data Visualisation with Python

Course Level: Intermediate

Upcoming course dates: 17th June, 23rd September, 11th November

Python Best Practices

Course Level: Intermediate

Upcoming course dates: 22nd July

Reporting

Reporting with Quarto

Course Level: Intermediate

Upcoming course dates: 24th June, 23rd September, 18th November

Machine Learning

Machine Learning with Tidymodels

Course Level: Intermediate

Upcoming course dates: 16th September, 11th November

Advanced Machine Learning with Tidymodels

Course Level: Advanced

Upcoming course dates: 23rd September, 18th November

SQL

An Introduction to SQL with R

Course Level: Intermediate

Upcoming course dates: 2nd October

We use the PostgreSQL database as an example for public courses. For in-house training, we are happy to adapt the course to match your database requirements.

Introduction to SQL with Python

Course Level: Intermediate

Upcoming course dates: 2nd October

We use a PostgreSQL database as an example, and communicate with this using a psycopg2 connection.

Statistics

Statistical Modelling with R

Course Level: Intermediate

Upcoming course dates: 9th September, 23rd October

Introduction to Bayesian Inference using RStan

Course Level: Intermediate

Upcoming course dates: 1st July, 14th October

Introduction to Bayesian Inference using PyStan

Course Level: Intermediate

Upcoming course dates: 15th July, 21st October

The course will teach participants how to interface with Stan through Python!

For updates and revisions to this article, see the original post

Shiny in Production 2024: Call for Abstracts

Thu, 30 May 2024 23:59:00 +0000

Call for abstracts now open

We are excited to announce the Call for Abstracts for Shiny in Production 2024, to be held on 9th-10th October 2024 in Newcastle upon Tyne, UK. This event brings together industry experts, data scientists, and developers to explore the latest advancements and best practices in deploying Shiny applications in production settings.

About the Conference

Topics of Interest

We invite abstracts on a wide range of topics, including but not limited to:

Scalable Architectures: Techniques for scaling Shiny applications to handle large datasets and high user loads.
Security Best Practices: Ensuring the security and privacy of data within Shiny applications.
Performance Optimisation: Strategies for improving the speed and responsiveness of Shiny apps.
Integration with Other Technologies: Combining Shiny with other tools and platforms for enhanced functionality.
Python: Developing Python Shiny apps
Case Studies: Real-world examples of successful Shiny deployments in various industries.
Automated Testing and Continuous Deployment: Best practices for maintaining high-quality applications through automated workflows.

To get an idea of past topics, check out our YouTube channel, where we have playlists of talks from Shiny in Production 2022 and 2023.

Submission Guidelines

To submit your abstract, please follow these guidelines:

Abstract Length: Up to 250 words.
Deadline: Submissions must be received by 11:59 on 30th June 2024.
Submission Portal: Submit your abstract here.

Important Dates

Abstract Submission Deadline: 30th June 2024
Notification of Acceptance: 1st August 2024
Conference Dates: 9th-10th October 2024

For more information, visit our conference website.

For updates and revisions to this article, see the original post

SatRdays London 2024: Thanks for coming!

Thu, 09 May 2024 23:59:00 +0000

We wanted to say a huge thank you to everybody who attended SatRdays London 2024! It was brilliant to see you all there, and we hope you enjoyed the day as much as we did. Thank you to all of our speakers for your contributions; it was great to see such a range of talks and hear about the different ways you can use R in your fields.

Of course the day wouldn’t have been the same without our generous sponsors, so we want to say a huge thank you to CUSP London for providing the excellent venue, as well as R Consortium for your generous support.

Couldn’t make it on the day? Keep your eyes peeled on our blog and social media, as we’ll be releasing recordings of the talks on our YouTube channel in the coming months. Can’t wait that long? Check out last year’s SatRdays London recordings as well as those from our Shiny in Production conference from 2022 and 2023.

What’s next?

Registration is now open for Shiny in Production 2024! This event consists of an afternoon of Shiny-based workshops including:

Level Up Your Plots with Cara Thompson
Building Responsive Shiny Apps
Asynchronous Shiny
Building Apps for Humans

The workshops are followed by a day of talks from Shiny experts across a variety of industries. If you’re interested in submitting an abstract, head over to the conference website now. Submissions are open until 30th June!

For updates and revisions to this article, see the original post

What's new in R 4.4.0?

Thu, 25 Apr 2024 23:59:00 +0000

R 4.4.0 (“Puppy Cup”) was released on the 24th April 2024 and it is a beauty. In time-honoured tradition, here we summarise some of the changes that caught our eyes. R 4.4.0 introduces some cool features (one of which is experimental) and makes one of our favourite {rlang} operators available in base R. There are a few things you might need to be aware of regarding handling NULL and complex values.

The full changelog can be found at the r-release ‘NEWS’ page and if you want to keep up to date with developments in base R, have a look at the r-devel ‘NEWS’ page.

A tail-recursive tale

Years ago, before I’d caused my first stack overflow, my Grandad used to tell me a daft tale:

It was on a dark and stormy night,
And the skipper of the yacht said to Antonio,
"Antonio, tell us a tale",
So Antonio started as follows...
It was on a dark and stormy night,
And the skipper of the yacht .... [ad infinitum]

The tale carried on in this way forever. Or at least it would until you were finally asleep.

At around the same age, I was toying with BASIC programming and could knock out classics such as

>10 PRINT "Ali stinks!"
>20 GOTO 10

Burn! Infinite burn!

Those were two example processes that demonstrate recursion. Antonio’s tale quotes itself recursively, and my older brother will be repeatedly mocked unless someone intervenes.

Recursion is an elegant approach to many programming problems - this usually takes the form of a function that can call itself. You would use it when you know how to get closer to a solution, but not necessarily how to get directly to that solution. And unlike the un-ending examples above, when we write recursive solutions to computational problems, we include a rule for stopping.

An example from mathematics would be finding zeros for a continuous function. The sine function provides a typical example:

We can see that when x = π, there is a zero for sin(x), but the computer doesn’t know that.

One recursive solution to finding the zeros of a function, f(), is the bisection method, which iteratively narrows a range until it finds a point where f(x) is close enough to zero. Here’s a quick implementation of that algorithm. If you need to perform root-finding in R, please don’t use the following function. stats::uniroot() is much more robust…

bisect = function(f, interval, tolerance, iteration = 1, verbose = FALSE) {
  if (verbose) {
    msg = glue::glue(
      "Iteration {iteration}: Interval [{interval[1]}, {interval[2]}]"
    )
    message(msg)
  }
  # Evaluate 'f' at either end of the interval and return
  # any endpoint where f() is close enough to zero
  lhs = interval[1]; rhs = interval[2]
  f_left = f(lhs); f_right = f(rhs)

  if (abs(f_left) <= tolerance) {
    return(lhs)
  }
  if (abs(f_right) <= tolerance) {
    return(rhs)
  }
  stopifnot(sign(f_left) != sign(f_right))

  # Bisect the interval and rerun the algorithm
  # on the half-interval where y=0 is crossed
  midpoint = (lhs + rhs) / 2
  f_mid = f(midpoint)
  new_interval = if (sign(f_mid) == sign(f_left)) {
    c(midpoint, rhs)
  } else {
    c(lhs, midpoint)
  }
  bisect(f, new_interval, tolerance, iteration + 1, verbose)
}

We know that π is somewhere between 3 and 4, so we can find the zero of sin(x) as follows:

bisect(sin, interval = c(3, 4), tolerance = 1e-4, verbose = TRUE)
#> Iteration 1: Interval [3, 4]
#> Iteration 2: Interval [3, 3.5]
#> Iteration 3: Interval [3, 3.25]
#> Iteration 4: Interval [3.125, 3.25]
#> Iteration 5: Interval [3.125, 3.1875]
#> Iteration 6: Interval [3.125, 3.15625]
#> Iteration 7: Interval [3.140625, 3.15625]
#> Iteration 8: Interval [3.140625, 3.1484375]
#> Iteration 9: Interval [3.140625, 3.14453125]
#> Iteration 10: Interval [3.140625, 3.142578125]
#> Iteration 11: Interval [3.140625, 3.1416015625]
#> [1] 3.141602

It takes 11 iterations to get to a point where sin(x) is within 10⁻⁴ of zero. If we tightened the tolerance, had a more complicated function, or had a less precise starting range, it might take many more iterations to approximate a zero.
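
For comparison, here is how the same zero could be found with stats::uniroot(), the base-R function recommended above. This is only a sketch: the tol value is an arbitrary choice, and the comments describe the elements of the list that uniroot() returns.

```r
# Search for a zero of sin(x) between 3 and 4
root = stats::uniroot(sin, interval = c(3, 4), tol = 1e-8)

root$root    # estimated location of the zero (close to pi)
root$f.root  # value of sin() at that estimate
root$iter    # number of iterations used
```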

Importantly, this is a recursive algorithm - in the last statement of the bisect() function body, we call bisect() again. The initial call to bisect() (with interval = c(3, 4)) has to wait until the second call to bisect() (interval = c(3, 3.5)) completes before it can return (which in turn has to wait for the third call to return). So we have to wait for 11 calls to bisect() to complete before we get our result.

Those function calls get placed on a computational object named the call stack. For each function call, this stores details about how the function was called and where from. While waiting for the first call to bisect() to complete, the call stack grows to include the details about 11 calls to bisect().
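
The growth of the call stack can be observed directly with base R's sys.nframe(), which reports how many function calls are currently on the stack. The stack_depth() helper below is a hypothetical function written for this illustration; it recurses n times and reports the depth at the bottom of the recursion.

```r
# Return the depth of the call stack once the recursion bottoms out
stack_depth = function(n) {
  if (n == 0) {
    depth = sys.nframe()
    return(depth)
  }
  stack_depth(n - 1)
}

shallow = stack_depth(0)  # no recursive calls
deep = stack_depth(10)    # ten recursive calls

# Each recursive call adds one frame to the stack
deep - shallow
#> [1] 10
```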

Imagine our algorithm didn’t just take 11 function calls to complete, but thousands, or millions. The call stack would get really full and this would lead to a “stack overflow” error.

We can demonstrate a stack-overflow in R quite easily:

blow_up = function(n, max_iter) {
  if (n >= max_iter) {
    return("Finished!")
  }
  blow_up(n + 1, max_iter)
}

The recursive function behaves nicely when we only use a small number of iterations:

blow_up(1, max_iter = 100)
#> [1] "Finished!"

But the call stack gets too large and the function fails when we attempt to use too many iterations. Note that R raises an error about the size of the call stack before we actually reach its limit, so the R process can continue after exploding the call stack.

blow_up(1, max_iter = 1000000)
# Error: C stack usage 7969652 is too close to the limit
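
One way to sidestep the problem, in any version of R, is to rewrite the recursion as a loop so that no extra calls are placed on the stack. The sketch below does this for the bisection algorithm; bisect_loop() and its max_iter argument are hypothetical names introduced here, and the logic otherwise mirrors the recursive bisect() above.

```r
# Loop-based bisection: same algorithm, but no recursive calls,
# so the call stack stays at a constant depth
bisect_loop = function(f, interval, tolerance, max_iter = 1000) {
  lhs = interval[1]
  rhs = interval[2]
  for (iteration in seq_len(max_iter)) {
    f_left = f(lhs)
    f_right = f(rhs)
    # Return any endpoint where f() is close enough to zero
    if (abs(f_left) <= tolerance) {
      return(lhs)
    }
    if (abs(f_right) <= tolerance) {
      return(rhs)
    }
    stopifnot(sign(f_left) != sign(f_right))

    # Keep the half-interval where y = 0 is crossed
    midpoint = (lhs + rhs) / 2
    if (sign(f(midpoint)) == sign(f_left)) {
      lhs = midpoint
    } else {
      rhs = midpoint
    }
  }
  stop("No zero found within ", max_iter, " iterations")
}

zero = bisect_loop(sin, interval = c(3, 4), tolerance = 1e-4)
zero
```

The loop is less elegant than the recursive version, which is part of the appeal of the tail-call support described next.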

In R 4.4, we are getting (experimental) support for tail-call recursion. This allows us (in many situations) to write recursive functions that won’t explode the size of the call stack.

How can that work? In our bisect() example, we still need to make 11 calls to bisect() to get a result that is close enough to zero, and those 11 calls will still need to be put on the call-stack.

Remember the first call to bisect()? It called bisect() as the very last statement in its function body. So the value returned by the second call to bisect() was returned to the user without modification by the first call. So we could return the second call’s value directly to the user, instead of returning it via the first bisect() call; indeed, we could remove the first call to bisect() from the call stack and put the second call in its place. This would prevent the call stack from expanding with recursive calls.

The key to this (in R) is to use the new Tailcall() function. That tells R “you can remove me from the call stack, and put this call on instead”. Our final line in bisect() should look like this:

bisect = function(f, interval, tolerance, iteration = 1, verbose = FALSE) {
  # ... snip: the rest of the body is unchanged ...
  Tailcall(bisect, f, new_interval, tolerance, iteration + 1, verbose)
}

Note that you are passing the name of the recursively-called function into Tailcall(), rather than a call to that function (bisect rather than bisect(...)).

To illustrate that the stack no longer blows up when tail-call recursion is used, let’s rewrite our blow_up() function:

# R 4.4.0
blow_up = function(n, max_iter) {
  if (n >= max_iter) {
    return("Finished!")
  }
  Tailcall(blow_up, n + 1, max_iter)
}

We can still successfully use a small number of iterations:

blow_up(1, 100)
#> [1] "Finished!"

But now, even a million iterations of the recursive function can be performed:

blow_up(1, 1000000)
#> [1] "Finished!"

Note that the tail-call optimisation only works here because the recursive call was made as the very last step in the function body. If your function needs to modify the value after the recursive call, you may not be able to use Tailcall().
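
When a function does modify the value after the recursive call, it can often be restructured by carrying the partial result along in an extra argument. The sketch below shows this for a hypothetical sum_to() function; the names sum_to(), sum_to_tail() and acc are introduced here for illustration, and the second function assumes R >= 4.4.0.

```r
# Not tail-recursive: the addition happens after the recursive call
# returns, so each call has to stay on the stack
sum_to = function(n) {
  if (n <= 0) {
    return(0)
  }
  n + sum_to(n - 1)
}

# Tail-recursive rewrite (requires R >= 4.4.0): the running total travels
# in `acc`, so nothing is left to do once the recursive call returns
sum_to_tail = function(n, acc = 0) {
  force(acc)  # evaluate now, so lazy arguments don't pile up
  if (n <= 0) {
    return(acc)
  }
  Tailcall(sum_to_tail, n - 1, acc + n)
}

sum_to(10)
sum_to_tail(10)
sum_to_tail(1e5)
```

Both versions agree for small inputs, but only the second keeps the call stack flat as n grows.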

Rejecting the NULL

Missing values are everywhere.

In a typical dataset you might have missing values encoded as NA (if you’re lucky) and invalid numbers encoded as NaN. You might have implicitly missing rows (for example, a specific date missing from a time series) or factor levels that aren’t present in your table. You might even have empty vectors, or data-frames with no rows, to contend with. When you write functions and data-science workflows where the input data may change over time, programming defensively and handling these kinds of edge cases means your code will throw up fewer surprises in the long run. You don’t want a critical report to fail because a mathematical function you wrote couldn’t handle a missing value.

When programming defensively with R, there is another important form of missingness to be cautious of …

The NULL object.

NULL is an actual object. You can assign it to a variable, combine it with other values, index into it, pass it into (and return it from) a function. You can also test whether a value is NULL.

# Assignment
my_null = NULL
my_null
#> NULL

# Use in functions
my_null[1]
#> NULL
c(NULL, 123)
#> [1] 123
c(NULL, NULL)
#> NULL
toupper(NULL)
#> character(0)

# Testing NULL-ness
is.null(my_null)
#> [1] TRUE
is.null(1)
#> [1] FALSE
identical(my_null, NULL)
#> [1] TRUE

# Note that the equality operator shouldn't be used to
# test NULL-ness:
NULL == NULL
#> logical(0)
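
The logical(0) result matters in practice because if() needs a single TRUE or FALSE. The sketch below (the variable names are illustrative) shows the failure that follows from testing a NULL with ==, alongside two safer tests.

```r
maybe_user = NULL

# `maybe_user == "Russ"` is logical(0), which if() cannot use
outcome = tryCatch(
  if (maybe_user == "Russ") "match" else "no match",
  error = function(e) conditionMessage(e)
)
outcome
#> [1] "argument is of length zero"

# Safer alternatives that always give a single TRUE or FALSE
is.null(maybe_user)
#> [1] TRUE
identical(maybe_user, "Russ")
#> [1] FALSE
```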

R functions that are solely called for their side-effects (write.csv() or message(), for example) often return a NULL value. Other functions may return NULL as a valid value - one intended for subsequent use. For example, list-indexing (which is a function call, under the surface) will return NULL if you attempt to access an undefined value:

config = list(user = "Russ")

# When the index is present, the associated value is returned
config$user
#> [1] "Russ"

# But when the index is absent, a `NULL` is returned
config$url
#> NULL

Similarly, you can end up with a NULL output from an incomplete stack of if / else clauses:

language = "Polish"

greeting = if (language == "English") {
  "Hello"
} else if (language == "Hawaiian") {
  "Aloha"
}

greeting
#> NULL
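
One way to guard against this is to finish the stack with a final else clause. The sketch below wraps the same logic in a hypothetical function, choose_greeting(); falling back to the English greeting is an arbitrary choice made for illustration, and raising an error with stop() would be just as reasonable.

```r
# Hypothetical helper wrapping the if / else stack from above
choose_greeting = function(language) {
  if (language == "English") {
    "Hello"
  } else if (language == "Hawaiian") {
    "Aloha"
  } else {
    # Fallback, so the function never silently returns NULL
    "Hello"
  }
}

choose_greeting("Hawaiian")
choose_greeting("Polish")
```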

A common use for NULL is as a default argument in a function signature. A NULL default is often used for parameters that aren’t critical to function evaluation. For example, the function signature for matrix() is as follows:

matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

The dimnames parameter isn’t really needed to create a matrix, but when a non-NULL value for dimnames is provided, the values are used to label the row and column names of the created matrix.

matrix(1:4, nrow = 2)
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
matrix(1:4, nrow = 2, dimnames = list(c("2023", "2024"), c("Jan", "Feb")))
#>      Jan Feb
#> 2023   1   3
#> 2024   2   4
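
The same pattern can be used in your own functions. As a rough sketch, the hypothetical function describe_scores() below takes an optional title argument that defaults to NULL and only uses it when the caller supplies a value.

```r
# Hypothetical function: `title` is optional, so it defaults to NULL
describe_scores = function(scores, title = NULL) {
  summary_text = paste("mean score:", mean(scores))
  if (is.null(title)) {
    summary_text
  } else {
    paste0(title, " (", summary_text, ")")
  }
}

describe_scores(c(2, 4, 6))
describe_scores(c(2, 4, 6), title = "Round 1")
```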

R 4.4 introduces the %||% operator to help when handling variables that are potentially NULL. When working with variables that could be NULL, you might have written code like this:

# Remember there is no 'url' field in our `config` list

# Set a default value for the 'url' if one isn't defined in
# the config
my_url = if (is.null(config$url)) {
 "https://www.jumpingrivers.com/blog/"
} else {
 config$url
}
my_url
#> [1] "https://www.jumpingrivers.com/blog/"

Assuming config is a list:

when the url entry is absent from config (or is itself NULL), then config$url will be NULL and the variable my_url will be set to the default value;
but when the url entry is found within config (and isn’t NULL) then that value will be stored in my_url.

That code can now be rewritten as follows:

# R 4.4.0
my_url = config$url %||% "https://www.jumpingrivers.com/blog"
my_url
#> [1] "https://www.jumpingrivers.com/blog"

Note that the right-hand side is only used when the left-hand value evaluates to NULL. A call to c() with no arguments returns NULL, but typed empty vectors such as numeric(0) are not NULL:

# R 4.4.0
NULL %||% 1
#> [1] 1

c() %||% 1
#> [1] 1

numeric(0) %||% 1
#> numeric(0)

This operator has been available in the {rlang} package for eight years and is implemented in exactly the same way. So if you have been using %||% in your code already, the base-R version of this operator should work without any problems, though you may want to wait until you are certain all your users are using R >= 4.4 before switching from {rlang} to the base-R version of %||%.
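
If your code has to run on older versions of R and you would rather not depend on {rlang}, one option is to define the operator yourself when it is missing. This is only a sketch, not the official implementation: the definition below is an assumption based on the behaviour described above, and the settings list is made up for illustration.

```r
# Define %||% only when base R does not already provide it (R < 4.4)
if (!exists("%||%", envir = baseenv(), inherits = FALSE)) {
  `%||%` = function(x, y) {
    if (is.null(x)) y else x
  }
}

settings = list(user = "Russ")

# No 'url' entry, so the default on the right-hand side is used
settings$url %||% "https://www.jumpingrivers.com/blog/"

# The 'user' entry exists, so its value is kept
settings$user %||% "anonymous"
```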

Any other business

A shorthand hexadecimal format (common in web-programming) for specifying RGB colours has been introduced. So, rather than writing the 6-digit hexcode for a colour “#112233”, you can use “#123”. This only works for those 6-digit hexcodes where the digits are repeated in pairs.
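
To see how the shorthand maps onto the full hexcode, here is a small sketch that does the expansion by hand. The helper expand_hex() is hypothetical and is not part of R; it simply doubles each hexadecimal digit, so it also works in versions of R that predate the shorthand.

```r
# Hypothetical helper: double each hex digit, so "#123" becomes "#112233"
expand_hex = function(short_hex) {
  gsub("([0-9A-Fa-f])", "\\1\\1", short_hex)
}

expand_hex("#123")

# The expanded form can be passed to colour functions in any R version
grDevices::col2rgb(expand_hex("#123"))
```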

Parsing and formatting of complex numbers has been improved. For example, as.complex("1i") now returns the complex number 0 + 1i, whereas previously it returned NA.

There are a few other changes related to handling NULL that have been introduced in R 4.4. The changes highlight that NULL is quite different from an empty vector. Empty vectors contain nothing, whereas NULL represents nothing. For example, whereas an empty numeric vector is considered to be an atomic (unnestable) data structure, NULL is no longer atomic. Also, NCOL(NULL) (the number of columns in a matrix formed from NULL) is now 0, whereas it was formerly 1.
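
A quick way to explore the difference is to compare an empty numeric vector with NULL directly. The first three comparisons in this sketch behave the same in every version of R; the last two calls are the ones whose results changed in R 4.4, as described above.

```r
empty = numeric(0)

# Both have length zero ...
length(empty)
length(NULL)

# ... but only one of them is NULL
is.null(empty)
is.null(NULL)

# An empty vector still has a type of its own
typeof(empty)
typeof(NULL)

# Version-dependent results:
# FALSE and 0 in R >= 4.4, TRUE and 1 in earlier versions
is.atomic(NULL)
NCOL(NULL)
```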

sort_by() is a new function for sorting objects based on values in a separate object. This can be used to sort a data.frame based on its columns (they should be specified as a formula):

mtcars |> sort_by(~ list(cyl, mpg)) |> head()
## mpg cyl disp hp drat wt qsec vs am gear carb
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
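
For comparison, the same ordering can be produced in earlier versions of R with order(). This is a sketch of an equivalent approach rather than a description of how sort_by() is implemented; the variable name sorted_cars is just for illustration.

```r
# Base R equivalent that also works before R 4.4:
# order() returns the row positions sorted by cyl, then by mpg
sorted_cars = mtcars[order(mtcars$cyl, mtcars$mpg), ]
head(sorted_cars)
```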

Try the latest version out for yourself

To take away the pain of installing the latest development version of R, you can use Docker. The following commands pull and run the devel version of R:

docker pull rstudio/r-base:devel-jammy
docker run --rm -it rstudio/r-base:devel-jammy

Once R 4.4 is the released version of R and the r-docker repository has been updated, you should use the following command to test out R 4.4.

docker pull rstudio/r-base:4.4-jammy
docker run --rm -it rstudio/r-base:4.4-jammy

SatRdays London 2024: Registration Closing Soon

Fri, 12 Apr 2024 23:59:00 +0000

SatRdays registration will be closing soon!

Here’s a reminder of what we have lined up for you.

We’ll be welcoming 9 fantastic speakers from across a variety of industries to give you an insight into how you can use R for many different applications, including use case examples such as modelling humanitarian crises and risks to road users, as well as systems involving high performance computing, and general overviews of new additions to the tidyverse, quarto and much more!

Check out the abstracts below. Don’t miss out on this excellent opportunity: sign up now on the website and get 20% off the ticket price!

Andrie de Vries - Posit

Lessons learnt from Product Management, applied to Data Science

As a Data Scientist you build data products all the time. You may even have worked with a Product Manager to create analyses and dashboards for decision making.

But are you applying the skills of product management in your data science role?

In this talk Andrie provides an overview of Product Management (PM), and what he’s learnt over two decades of managing products, ranging from hardware (Psion PDAs) to software (Microsoft R Open, Posit Workbench) and hosted services (MRAN).

Every product manager must consider the new product adoption life cycle, managing the stages from finding the first innovators, managing growth and ultimately the end-of-life process.

During this process you must manage your product so that it’s usable (customers want it), feasible (you can build it) and valuable (you can do this sustainably). Many frameworks exist to think about discovering what customers want, the jobs they must get done, forming a value proposition, managing a product roadmap, working with dev teams to build it, and working with marketing and sales to create a compelling sales pitch.

As a data scientist, you can benefit from product management knowledge by thinking of your app as a product. You must convince your users (internal customers) to use this app (at the cost of changing their workflow).

I will leave you with a map to get started with classic resources, including Geoffrey Moore, Marty Cagan, Teresa Torres, April Dunford and Lenny’s Podcast.

Hannah Frick - Posit

Survival analysis is coming to tidymodels

If you have time-to-event data, such as data on customer churn, data on the lifetime of machines, or similar, survival analysis with its censored regression models gives you the ability to include all your observations in the model appropriately, including those where you may not have observed the event yet.

The tidymodels framework is a collection of packages for safe, performant, and expressive supervised predictive modeling on tabular data. The framework’s consistency makes switching between models easy, its guardrails against common pitfalls such as overfitting due to data leakage make it safe. It covers the entire modeling workflow: preprocessing and feature engineering, models, resamples, performance metrics, and tuning.

We are now extending support for survival analysis across the entire tidymodels framework with dedicated models and metrics, allowing the same ease and expressiveness as for classification and regression, across all steps of the modeling process.

Charlie Gao - Hibiki AI Limited

‘mirai’ for Shiny and Plumber Applications

‘mirai’ is Japanese for ‘future’. Some of the existing solutions for parallelization in R have not fundamentally changed in 20 years. The technologies behind ‘mirai’ are, in contrast, modern and minimalist, and provide a level of performance that will be noticeable for demanding, client-facing workloads typical of Shiny and Plumber applications.

As a scheduler for distributed tasks, ‘mirai’ currently powers the high performance computing needs for the ‘targets’ reproducible-workflow ecosystem, whether locally, on traditional HPC clusters or the cloud. It has undergone the validation required to reliably handle demanding scientific workloads such as clinical trials simulations. At R Project Sprint 2023, it was integrated as a backend for the base R ‘parallel’ package at the request of R-Core.

The same industrial-strength, yet incredibly lightweight solution is now available to power large-scale Shiny and Plumber applications.

This presentation demonstrates how ‘mirai’ works in typical example situations which benefit from parallelization of computations, and the different ways they may be distributed to background processes on the same machine or across a network of servers.

A particular highlight will be the zero-configuration TLS option. This ‘just works’ to protect remote connections using single-use certificates generated on-the-fly. This was developed under an R Consortium infrastructure grant that aims to make such technologies available to the wider R community.

Michael Hogers - NPL Markets Ltd

Modular Shiny(Proxy) - a SaaS setup

I aim to provide a talk that shows how one can use R, Shiny and ShinyProxy (or other deployment methods) to create a modular SaaS platform whose modules can later be swapped out for new languages or frameworks. The key ingredients are: use a database back-end across Shiny modules, deploy modules as relatively small apps to dedicated URL endpoints, use a shared UI library across Shiny modules, and package your Shiny apps (+ use CI/CD) while keeping business logic separated so that the business logic functions can be exported later on.

Matthew Lam & Matthew Law - Mott MacDonald

How Mott MacDonald unlocks the power of geospatial data with R

Mott MacDonald is a global engineering, management, and development consultancy with a broad portfolio of projects across various engineering disciplines. Geospatial data plays an instrumental role in supporting projects in these sectors, enabling us to understand the world around us so that we can make better informed decisions, improve efficiencies, and drive digital innovation.

In this presentation, we will illustrate how we use R at Mott MacDonald to harness the power of geospatial data with two examples – Risk Modelling for Ash Dieback and Creative Geospatial Visualisation for Impactful Communication.

The Ash Dieback Pipeline is a computer vision project which attempts to identify trees with the Ash Dieback disease from video footage of roadways around the UK. We intend to showcase how we use R to process a variety of geospatial datasets and attempt to model the risk to road users associated with a diseased tree remaining untreated.

Our work at Mott MacDonald often involves wrangling complex datasets to answer multifaceted questions. R provides excellent toolkits for integrating, analysing, and visualising geospatial datasets. We intend to demonstrate how R can be used for creative visualisation of geospatial data to extract and communicate actionable insights.

Through these examples, we hope to outline our team’s maturity journey towards building multilingual spatial data science capabilities alongside traditional GIS platforms.

Myles Mitchell - Jumping Rivers

Using R to teach R

At Jumping Rivers, we teach over forty courses covering data science topics, including programming, data visualisation and machine learning, in R as well as Python, Tableau, Git, Docker and Stan. Most courses follow the same template: static notes, live coding scripts and presentation slides. For every taught course we also have to spin up a bespoke virtual environment, collect feedback and generate certificates.

In this talk, I will explain how we have used R to streamline the course writing process, automate the course build and deployment to Posit Workbench, and conduct post-course administrative tasks. With over 100 courses taught every year, each step in this pipeline must be rigorously tested so that, on the day, the trainer can focus on the attendees without having to worry about technical issues.

I will draw on our process’s successes (and shortcomings) and share some take-home lessons applicable to any big coding project, including packaging of source code, automated testing and scheduled builds.

Nicola Rennie - Lancaster University

Typst or LaTeX? Styling PDF documents with Quarto extensions

Quarto is an open-source scientific and technical publishing system that allows you to combine text with code to create fully reproducible documents in a variety of formats. The addition of custom styling to documents can make them look more professional and recognisable. In this talk, I’ll give an overview of ways to create customised PDF documents using Quarto. Until recently, this meant getting to grips with LaTeX. Now, there’s a new kid on the block: Typst. Typst is an open-source typesetting system that is designed to be as powerful as LaTeX while being much easier to learn and use.

Extensions are a powerful way to modify and extend the behaviour of Quarto, including adding styling to your documents with LaTeX or Typst. To demonstrate the differences between LaTeX and Typst, I’ll walk through the process of converting a LaTeX-based style extension to Typst, allowing users to easily switch between them. We’ll compare the two – discussing error messages (we all get them!), render time, and customisability along the way.

Matt Thomas - British Red Cross

Where data meets disaster: A journey through the British Red Cross’s ‘humaniverse’

The ‘Humaniverse’ is a suite of R packages produced by the British Red Cross’s data scientists for sharing humanitarian data and tools. Open data and analyses are vital for 21st Century humanitarianism and these packages have transformed the speed and scale at which we can provide answers about emerging and ongoing humanitarian crises in the UK. In this talk, I will offer an overview of the Humaniverse and will share some of the ways we have used this infrastructure to inform how the British Red Cross supports people affected by disasters, displacement, and health crises. I will cover our core R packages, discuss how and why we work in the open, demonstrate some of the analyses and apps we’ve built using this infrastructure, and share our ambitions for the future of the Humaniverse.

For updates and revisions to this article, see the original post

Reading large spatial data

Thu, 04 Apr 2024 23:59:00 +0000

I love playing with spatial data. Perhaps it’s because I enjoy exploring the outdoors, or because I spend hours playing Geoguessr, or maybe it’s just because maps are pretty, but there’s nothing more fun than tinkering with location data.

However, reading in spatial data, especially large data sets, can sometimes be a pain. Here are some simple things to consider when working with spatial data in R and breaking large data sets into more manageable chunks.

Choose the right resolution

Before you even start playing with your data, ask yourself if you’ve got the appropriate data set for the job. Spatial data can come in different resolutions, and depending on the type of analysis or visualisation you are doing you might not need really accurate boundaries. Choosing a smaller file at the cost of a little accuracy can massively reduce the file size and read in times. Of course, you don’t always have the luxury of choosing your data, but if you can it can make a big difference.

For example, I live in the UK so I often use the Open Geography portal. This is hosted by the Office for National Statistics (ONS) and provides free and open access geographic data for the UK. The ONS provide boundaries for each geography at both full resolution and generalised formats that provide a smoothing of the full boundaries. The full resolution is the highest resolution data available, which can result in very large file sizes. Generalised formats preserve much of the original detail but are much smaller in size providing a good compromise.

For the types of visualisations I make, generalised data is sufficient. As a small example, I downloaded the UK Lower layer Super Output Areas datasets with Full, Generalised and Super Generalised boundaries and calculated the file sizes and time to read in. I also plotted the three different resolutions with geom_sf() so you can compare how they look.

File size and read times for various resolutions of the same data set.
Resolution	Generalised to	File size	Time to read
Full	0 m	546 MB	2 s
Generalised	20 m	50 MB	750 ms
Super generalised	200 m	16 MB	620 ms


The full resolution file is more than 10 times bigger than the generalised one, but visually it’s hard to see the difference between the boundaries. Remember that higher resolution data sets will also take longer to render when you plot them.
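
If you want to make a similar comparison for your own files, base R provides file.size() and system.time(). The sketch below is not the code used to produce the table: measure_read() is a hypothetical helper, and a temporary CSV file read with read.csv() stands in for the spatial data so that the example runs on its own. For a spatial file you would pass its path and a reader such as sf::st_read instead.

```r
# Hypothetical helper: report the size of a file and how long it takes to read
measure_read = function(path, read_fun) {
  elapsed = system.time(read_fun(path))[["elapsed"]]
  data.frame(
    file = basename(path),
    size_mb = file.size(path) / 1024^2,
    read_seconds = elapsed
  )
}

# Stand-in data so that the example is self-contained
demo_path = tempfile(fileext = ".csv")
write.csv(mtcars, demo_path)

read_stats = measure_read(demo_path, read.csv)
read_stats
```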

Read only what you need with SQL queries

Sometimes you only need a subset of the data you’ve been given. Let’s say I have data for the UK, but I only need the LSOAs in Wales. It’s inefficient to load the entire data set, to then immediately throw away most of the rows. It would make much more sense for me to only load into memory the rows that I need. In R, we use the st_read() function from {sf} to parse spatial data. The query argument in st_read() allows for reading just parts of the file into memory using SQL queries to filter the data on disk.

The format of the query is SELECT columns FROM layer WHERE condition. So to select all columns from my LSOA layer with code starting with “W” (for Wales 🏴󠁧󠁢󠁷󠁬󠁳󠁿) we would use the following query.

library("sf")
st_read("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg",
 query = "SELECT * FROM LSOA_2021_EW_BGC_V3 WHERE LSOA21CD LIKE 'W%'",
 quiet = TRUE)
## Simple feature collection with 1917 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 146615.2 ymin: 164586.3 xmax: 355312.8 ymax: 395982.3
## Projected CRS: OSGB36 / British National Grid
## First 10 features:
## LSOA21CD LSOA21NM BNG_E BNG_N LONG LAT
## 1 W01000003 Isle of Anglesey 001A 244606 393011 -4.33934 53.41098
## 2 W01000004 Isle of Anglesey 001B 242766 392434 -4.36671 53.40525
## 3 W01000005 Isle of Anglesey 005A 259172 377173 -4.11332 53.27280
## 4 W01000006 Isle of Anglesey 006A 240111 379172 -4.39991 53.28535
## 5 W01000007 Isle of Anglesey 009A 240423 370062 -4.39067 53.20362
## 6 W01000008 Isle of Anglesey 008A 253221 372359 -4.20027 53.22795
## 7 W01000009 Isle of Anglesey 007C 237013 376457 -4.44494 53.26002
## 8 W01000010 Isle of Anglesey 002A 250545 382806 -4.24524 53.32103
## 9 W01000011 Isle of Anglesey 008B 254994 372208 -4.17367 53.22708
## 10 W01000012 Isle of Anglesey 006B 246450 374976 -4.30288 53.24954
## GlobalID SHAPE
## 1 {C18AD6F8-CD89-453E-A34A-B9ACE9B58203} MULTIPOLYGON (((244811.2 39...
## 2 {0ED47DC7-B1FE-4E63-84A6-995B701A39C0} MULTIPOLYGON (((241027.3 39...
## 3 {EA47EE1B-C4F6-442F-B2F3-58EEA678DB1E} MULTIPOLYGON (((259509.4 37...
## 4 {8FA5312C-C4CF-4B38-B7BD-44C824D15ED0} MULTIPOLYGON (((241039 3817...
## 5 {A6509D8B-7C3D-4260-BA0F-BCECFFBEEA66} MULTIPOLYGON (((245072.1 37...
## 6 {73F66A80-7EE7-4A98-899C-9702711DA427} MULTIPOLYGON (((253481.6 37...
## 7 {7AB711C3-C230-4236-877B-8746ED3E1DCA} MULTIPOLYGON (((235911.7 37...
## 8 {B5A77ECA-8CDC-4F82-B07E-3305622D1175} MULTIPOLYGON (((251300.9 38...
## 9 {F7037CC7-D94A-4B6B-BEDA-7F02CF2CC5A5} MULTIPOLYGON (((256049.2 37...
## 10 {A1A42AA3-FF44-4F13-9B60-82F9E7FB5681} MULTIPOLYGON (((246333.7 37...

Here * means SELECT all columns, and LIKE is used to match strings against a pattern in the OGR SQL dialect.
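
If you build queries like this for several layers or prefixes, it can help to assemble the string in one place. The following is a minimal sketch: build_prefix_query() is a hypothetical helper, not part of {sf}, and it pastes its inputs into the query without any escaping, so it should only be used with names you trust.

```r
# Hypothetical helper that assembles an SQL query for st_read().
# The values are pasted in as-is, with no escaping.
build_prefix_query = function(layer, column, prefix, columns = "*") {
  sprintf(
    "SELECT %s FROM %s WHERE %s LIKE '%s%%'",
    paste(columns, collapse = ", "), layer, column, prefix
  )
}

# Reproduces the query used above
build_prefix_query("LSOA_2021_EW_BGC_V3", "LSOA21CD", "W")

# Selecting named columns instead of everything
build_prefix_query("LSOA_2021_EW_BGC_V3", "LSOA21CD", "W",
                   columns = c("LSOA21CD", "SHAPE"))
```

The resulting string can be passed to the query argument of st_read(). When listing columns by name, remember to include the geometry column (called SHAPE in the output above) if you need the shapes themselves.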

But what if you don’t know the name of the layer you want to read in? You can use st_layers() to identify the layer(s) of interest without reading in the entire data set.

st_layers("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg")
## Driver: GPKG 
## Available layers:
## layer_name geometry_type features fields
## 1 LSOA_2021_EW_BGC_V3 Multi Polygon 35672 7
## crs_name
## 1 OSGB36 / British National Grid

But what if you don’t know the names of your columns? You can look at the first polygon only to get an idea of the structure without loading the entire data set. Use the feature ID attribute to read in just the first row of your data with WHERE FID = 1.

st_read("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg",
 query = "SELECT * FROM LSOA_2021_EW_BGC_V3 WHERE FID = 1",
 quiet = TRUE)
## Simple feature collection with 1 feature and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 531948.3 ymin: 181263.5 xmax: 532308.9 ymax: 182011.9
## Projected CRS: OSGB36 / British National Grid
## LSOA21CD LSOA21NM BNG_E BNG_N LONG LAT
## 1 E01000001 City of London 001A 532123 181632 -0.09714 51.51816
## GlobalID SHAPE
## 1 {1A259A13-A525-4858-9CB0-E4952BA01AF6} MULTIPOLYGON (((532105.3 18...

This only reads in the top row of the data, which is LSOA E01000001 in the City of London.

Spatial filtering

In the last example, we filtered our dataset before reading it into R by using some of the metadata that was attached to our spatial polygons. But what if you don’t have any columns that provide a useful filter? You can also filter by the spatial properties of your data. Let’s try and read in only the Welsh LSOAs again, but this time, using the spatial property only.

First we need to create a polygon that we want our LSOAs to overlap with. A boundary for Wales is available within the countries data set on the Open Geography portal.

library(dplyr)
library(ggplot2)

uk = st_read("data/Countries_December_2022_GB_BGC.gpkg")

wales = filter(uk, CTRY22NM == "Wales")

We then turn that geometry into a well-known text string. This is simply a text representation of the polygon. We use st_geometry() to grab the geometry column of the data frame, and then st_as_text() to convert to a well-known text string.

wales_wkt =
 wales |>
 st_geometry() |>
 st_as_text()

This well-known text is just a string defining the outline of the polygon we want to use as our bounding box (here Wales). It looks a bit like this.

"MULTIPOLYGON (((313022.3 384930.5, 312931.3 385007.4, 312644.5 38519.8, ...)))"

We can then use that string in the wkt_filter argument of st_read() to only read in LSOAs that overlap with the Wales polygon.

wales_lsoa =
 st_read("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg",
 wkt_filter = wales_wkt)

We can see that only the LSOAs that overlap with the polygon of Wales have been read in. Note that spatial intersection can be a little bit complicated. We’ve actually read in some English LSOAs along the Wales/England border in addition to the Welsh LSOAs because these technically overlap with the Wales polygon on the border itself. So it’s not perfect, but as a tool for selecting the rows of interest, before reading into R’s memory - it’s still pretty handy.
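
If the extra border areas matter for your analysis, one option is to tidy them up after reading by filtering on the code prefix. The sketch below uses a toy data frame with made-up contents as a stand-in for the object returned by st_read(), so that it runs without the data file; the same kind of row subsetting can be applied to the real wales_lsoa object.

```r
# Toy stand-in for the result of the spatial filter: two Welsh codes
# plus one made-up English code picked up along the border
border_read = data.frame(
  LSOA21CD = c("W01000003", "W01000004", "E01999999"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose code starts with "W"
welsh_only = border_read[startsWith(border_read$LSOA21CD, "W"), , drop = FALSE]
welsh_only
```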

For another great example of using Spatial SQL to read in data efficiently, check out this nice blog post by Rob Williams.

Hopefully these quick tips will help you the next time you’re working with spatial data. If you want to learn more, check out our course on Spatial Data Analysis with R.


SatRdays London 2024: Sponsors

Tue, 19 Mar 2024 23:59:00 +0000

SatRdays London wouldn’t be possible without our sponsors, so we wanted to take the time to tell you a little bit about them.

Don’t miss out on this great chance to learn from R experts and network with fellow data science enthusiasts! Tickets are on sale now!

CUSP London

CUSP London is the Centre for Urban Science and Progress based at King’s College London, UK. Their mission is to support interdisciplinary research and innovation using Data Science in and for London.

Find them on X(/Twitter) @CuspLondon

R Consortium

Find them on X(/Twitter) @RConsortium

Jumping Rivers

Jumping Rivers is an analytics company specialising in creating bespoke solutions for modern business problems. Their team of data science and engineering experts come from many different backgrounds, and their wealth of knowledge and experience allows them to think outside the box and solve problems in new and innovative ways.

Find them on X(/Twitter) @jumping_uk and Mastodon @jumpingrivers@fosstodon.org.


Spring clean your R packages

Thu, 07 Mar 2024 23:59:00 +0000

Did you know we maintain a number of public R packages on our GitHub page? Some of these packages were developed way back in 2019. Since then, the standards in R package development have changed a little and we thought it was time to have a little spring clean of our packages.

In this blog post, we’ll be using functions from the {usethis} package to spruce up some of our old packages. We’ve chosen five quick improvements which you should be able to implement in 15 minutes or less. Grab your duster and come along.

Rename master to main

We wrote a whole blog post on why it’s a good idea to move the default branch name from master to the more neutral name, main. Luckily, renaming a single repository is straightforward and this one command will basically do everything for you.

usethis::git_default_branch_rename()

Tidy your description file

Your DESCRIPTION file is arguably the most important file in your R package as it defines the purpose of your code and contains important metadata. Take a minute to check that the key fields are still correct, in particular the contact email address, description and any URLs.

You can then run

usethis::use_tidy_description()

which will put the fields in a standard order and alphabetise the dependencies. It’s looking tidier already. 😌

Migrate to GitHub Actions

Travis CI used to be the most popular tool for continuous integration in the #RStats community. In recent years, many R package developers have moved away from Travis CI to GitHub Actions. Dean Attali wrote a detailed guide explaining the migration process in full. However, for most simple packages, all we need to do is delete the existing .travis.yml file, and then run

usethis::use_github_action("check-standard")

to set up the standard GitHub action. This action will run R CMD check using R-latest on Linux, Mac, and Windows. This is a good baseline if you plan on submitting your package to CRAN. It will also add a lovely badge to your README.md that will show users that your package is passing the check.

If your R package has tests, you might also want to run

usethis::use_github_action("test-coverage")

which will calculate your test coverage and report to codecov.io.

Create a hex sticker

We all know that the most important part of any R package is the hex sticker. If you don’t already have one, you can easily create one in R with {hexSticker}.

You can choose any image or plot to position on your sticker. You can then customise it by changing the colours, fonts and adding a url.

library("hexSticker")

# "hoover.png" is the image placed on the sticker
sticker(subplot = "hoover.png", s_x = 1, s_y = .75,
 h_fill = "#4898a8", h_color = "#516e7a",
 package = "springClean", p_size = 20,
 url = "jumpingrivers.com", u_color = "#FFFFFF", u_size = 6,
 filename = "sticker.png")

You can add the hex sticker as a logo to your package with another helpful {usethis} function.

usethis::use_logo("sticker.png")

Contributing and Code of Conduct

One of the great things about R package development is that it’s a team effort. If you want people to contribute to the development of your R packages, you need to tell them how to contribute. It’s also a good idea to add a code of conduct, to set an example of how we should work together.

At Jumping Rivers, we follow the standard contributing and CoC guides that the tidyverse developers use. Again, {usethis} provides functions that make adding these files to your package really easy.

usethis::use_tidy_contributing()
usethis::use_coc(contact = "hello@jumpingrivers.com")

Get involved

That’s it for our quick spring clean. We always welcome new contributors to our R packages. If you have any issues or want to make a PR head over to our GitHub page.

And if we’ve inspired you to dust off your old R packages and give them some love, let us know on Twitter/X.


An introvert's guide to networking at a conference

Thu, 29 Feb 2024 23:59:00 +0000

Uh oh. Conferences. I’ve been trying not to think about them, but I can’t put it off any longer.

That’s okay! It might not be too late to book on to the conferences you want to attend. Some will even still have early bird offers available if you’re quick.

Oh that’s not a problem. I’m already booked on, travel is arranged, my talk is written. It’s just …

… just what?

… there’s so many people and events. My manager said “they’re great networking opportunities” so I think she’s expecting me to come back with a whole list of new contacts too.

Oh! Well I love attending conferences. Spending time with interesting people, and meeting new folk leaves me all energised and buzzy. But conferences aren’t just for extroverts like me.

Setting your manager’s possible expectations aside for a second, just focus on what you can benefit from when networking: Meeting the people behind interesting projects that could be helpful to you; encountering contacts with interesting job opportunities; finding like-minded people to attend the social events with.

Maybe I can help you out with a few tips?

That would be great! How should I prepare?

Before you get there, you can use social media. Conferences often have #hashtags associated with them. If you use social media, posting about the event can be a great, low-stress way to engage with other attendees.

I often post something like

“Excited to be on my way to Seattle for #positconf2024 ✈️ Anyone else there this year?”

when I’m travelling.

That way, if someone I already know is attending, we can make plans to meet up and hang out together.

That’s a nice idea. But what happens if I don’t already know people who are going?

Check the event schedule for the conference. There is often a welcome event aimed at people who are new to the conference.

The day before the RSS International Conference, the Young Statisticians arrange a workshop to help early career statisticians get to know the organisers and make some conference friends before the main event starts.

Sounds useful. But I’m going to encounter people at some point and I never know what to say.

First of all, people love talking about themselves, and they like you for taking an interest and listening to them. Think up and remember a few standard questions that can get the ball rolling and see where the conversation goes from there.

Where do you work?
What parts of the work do you enjoy?
Did you travel far to get here?
Did you go to University? Where did you study?
What’s the funniest issue you’ve encountered in your work?

Oh so I just need to remember some questions to ask?

Yes, and come up with some nice icebreakers if you can.

Or nicebreakers. See what I did there?

You must be very pleased with yourself.

I am actually.

Well don’t forget the time will come where you have to introduce yourself in return. A little preparation goes a long way here.

Prepare a short sentence explaining who you are, what you do and what you’re interested in. At networking events, you’ll constantly be asked these questions. Preparing a two-minute introduction and practising it slowly out loud can help you feel more confident in the moment.

And it might sound silly, but make a conscious effort to say your name and company slowly when you first introduce yourself. I find myself saying “Hi, I’m Rhian and I work for Jumping Rivers” so many times that I naturally rush over the words, creating an awkward situation where the other person didn’t catch my name.

So I should rehearse by talking to myself in the mirror like they do on The Sims?

If that helps you, sure. It doesn’t have to be perfectly scripted, but making a coherent introduction of yourself creates a good first impression.

When I’m there, won’t everyone know each other already?

Remember, plenty of people go to conferences alone and don’t know anyone else attending. If there’s a pub quiz, or other team-based social activity, it’s okay to turn up by yourself. It’s the organisers’ job to make you feel welcome and help you find others to chat with.

There are also going to be events such as poster sessions and welcome drinks, where it’s entirely normal for anyone to be interacting with new people. The individuals you do meet may go on to introduce you to other people they know at the conference.

Oh yeah, this schedule has a few of those sessions with free snacks and drinks.

Take it easy on the free wine and coffee. It’s easy to use alcohol and copious amounts of coffee to boost your networking bravery at the poster session. However, too much caffeine or alcohol can cause social anxiety and leave you feeling queasy the next morning.

Are you speaking from personal experience here?

Maybe 😳

What if I don’t drink alcohol?

There’s also no obligation to drink alcohol. Younger generations are increasingly becoming non-drinkers, and most events will have a decent selection of non-alcoholic drinks. Some conferences organise alcohol-free socials too, like an evening walking-tour or a morning run group.

These networking events sound busy.

Yes, some of these events can be crowded and noisy. For some people, that can be a bit overwhelming. It’s perfectly okay to step outside for fresh air or to sit in a quieter corner to regain some energy. In fact, it can be a good idea because it’s unlikely that you’ll be the only person to do this—other likeminded introverts will also have the same idea, giving you an opportunity to meet someone new in a quieter location.

This is a tricky one. Some people will be aiming to create business connections and opportunities, and some will be in a more relaxed and sociable mode. This is a situation where you have to read the intentions of the other person and adapt.

It’s totally okay to talk about non-work topics at any point during the conference. Chatting about hobbies and finding common interests is a great way to connect with people. If someone wants to talk to you about a business opportunity, they will likely steer the conversation back to work, and move on if they don’t think you can help them. Don’t take this personally! People have different reasons for attending conferences.

It can also depend on the conference. Some massive machine learning and AI expos can be very business focused, whilst smaller academic or programming conferences can have more of a community feel. If you aren’t sure about the vibe, you can always ask someone who has attended that conference before.

How do I maintain contact with the people I meet?

Traditionally, you would exchange business cards. In Japan there’s even an etiquette for receiving a business card where you take a moment to admire and study the person’s details on the card you’ve just received—they’d never take a business card and rush it straight into their wallet alongside countless other business cards to be forgotten.

But in this modern world, business cards are becoming less common. So if you don’t have cards to hand out, try connecting with them over LinkedIn—the LinkedIn app allows you to share a QR code to link straight to your profile. Or if either of you don’t use LinkedIn, you can exchange email addresses. If the other person agrees, consider taking a selfie with them and emailing it to their address straight away—not only have you exchanged details, but provided a visual memento and opened a conversation channel for further discussions after the conference.

I’m not used to a conference this big. There’s multiple talks going on at the same time. What’s the etiquette if I want to be in different sessions for different talks?

Look at the conference programme in advance to plan which talks you want to attend. Big conferences often have “streams” meaning multiple talks will be happening concurrently.

The start and end times of the talks are often scheduled to match across sessions, so you can switch between streams in the gaps between talks. It’s quite common to see a few people leave at the end of a talk to jump to the other stream. If you want to change session, go at the same time as this crowd. Just move briskly and quietly so you cause minimal disruption, and sit strategically close to the exit for a quick escape.

But sometimes it can be a bit awkward to sneak out of the back mid-session, especially if the room is small or talks are overrunning, so sometimes it’s easier just to commit yourself to a single stream per session.

Okay, gotcha. Plan the sessions to attend in advance. While I’m doing that, are there any useful sessions I should consider?

Some conferences have specific networking sessions for people that already have things in common like the “Birds of a Feather” sessions at Posit Conference. These are short, one-off meetups for people who have something in common, e.g. R educators, people working with R in insurance, R-Ladies or R users from Africa. These groups are often smaller and you already have something in common, making networking a little less daunting and the connections a little more relevant.

Right, I’ve been through the whole schedule. I think I’ve picked something for each session now, and wow, I’m going to be busy!

Great! Just remember that you don’t have to go to every session and social event. It’s totally okay to skip a session and recharge. Conferences can be exhausting, even I need a quiet minute to myself sometimes. Grab yourself a cup of tea and find a quiet corner to have a little break from the stimulus. Or head back to your hotel room for an afternoon nap. Then you’ll be refreshed for the next session. (Just remember to set an alarm… ⏰)

Well thank you for the advice. I’m not sure I can think of any more questions.

Well on the topic of questions, I have one last tip. If you enjoyed someone’s talk, but don’t want to ask a question in the session, you can go and talk to them in the coffee break following their talk. From the speaker’s perspective, it’s nice to chat with people who got value from your talk, and you’ll have another friendly face in the lunch queue.

Do you have any suggestions for good conferences to go to?

How about North East Data Scientists Meetup, Leeds Data Science Meetup or AI in Production in June 2026? You can register to attend by visiting those websites. They’re being organised by Jumping Rivers.

I’ll take a look, thanks.

And hey, if I see you there, you’ll have an extroverted friend who can introduce you to everyone at the conference.

oh no 😰


SatRdays London 2024: Speakers

Tue, 27 Feb 2024 23:59:00 +0000

SatRdays London is fast approaching and we are happy to announce our full lineup of speakers for the event! Read on for more info. If you want to join in the fun, head over to the conference website to sign up!

We’ll be at the amazing Bush House, courtesy of CUSP London on 27th April 2024!

Andrie de Vries - Posit

Applying product management in data science

Andrie is Director of Product Strategy at Posit (formerly RStudio) where he works on the Posit commercial products. He started using R in 2009 for market research statistics, and later joined Revolution Analytics and then Microsoft, where he helped customers implement advanced analytics and machine learning workflows. To keep healthy, he practices yoga and does some recreational running and canoeing.

Hannah Frick - Posit

Survival analysis is coming to tidymodels

If you have time-to-event data, such as data on customer churn, data on the lifetime of machines, or similar, survival analysis with its censored regression models gives you the ability to include all your observations in the model appropriately, including those where you may not have observed the event yet.

The tidymodels framework is a collection of packages for safe, performant, and expressive supervised predictive modeling on tabular data. The framework’s consistency makes switching between models easy, its guardrails against common pitfalls such as overfitting due to data leakage make it safe. It covers the entire modeling workflow: preprocessing and feature engineering, models, resamples, performance metrics, and tuning.

We are now extending support for survival analysis across the entire tidymodels framework with dedicated models and metrics, allowing the same ease and expressiveness as for classification and regression, across all steps of the modeling process.

Charlie Gao - Hibiki AI Limited

‘mirai’ for Shiny and Plumber Applications

‘mirai’ is Japanese for ‘future’. Some of the existing solutions for parallelization in R have not fundamentally changed in 20 years. The technologies behind ‘mirai’ are, in contrast, modern and minimalist, and provide a level of performance that will be noticeable for demanding, client-facing workloads typical of Shiny and Plumber applications.

As a scheduler for distributed tasks, ‘mirai’ currently powers the high performance computing needs for the ‘targets’ reproducible-workflow ecosystem, whether locally, on traditional HPC clusters or the cloud. It has undergone the validation required to reliably handle demanding scientific workloads such as clinical trials simulations. At R Project Sprint 2023, it was integrated as a backend for the base R ‘parallel’ package at the request of R-Core.

The same industrial-strength, yet incredibly lightweight solution is now available to power large-scale Shiny and Plumber applications.

This presentation demonstrates how ‘mirai’ works in typical example situations which benefit from parallelization of computations, and the different ways they may be distributed to background processes on the same machine or across a network of servers.

A particular highlight will be the zero-configuration TLS option. This ‘just works’ to protect remote connections using single-use certificates generated on-the-fly. This was developed under an R Consortium infrastructure grant that aims to make such technologies available to the wider R community.

Michael Hogers - NPL Markets Ltd

Modular Shiny(Proxy) - a SaaS setup

I aim to provide a talk that displays how one can use R, Shiny and ShinyProxy (or other deployment methods) to create a modular SaaS platform that later allows you to swap out modules of the platform with new languages or frameworks. The key ingredients are: use a database back-end across Shiny modules, deploy modules as relatively small apps to dedicated URL endpoints, use a shared UI library across Shiny modules and package your Shiny apps (+ use CI/CD) while keeping business logic separated so that business logic functions can be exported later on.

Matthew Lam & Matthew Law - Mott MacDonald

How Mott MacDonald unlocks the power of geospatial data with R

Mott MacDonald is a global engineering, management, and development consultancy with a broad portfolio of projects across various engineering disciplines. Geospatial data plays an instrumental role in supporting projects in these sectors, enabling us to understand the world around us so that we can make better informed decisions, improve efficiencies, and drive digital innovation.

In this presentation, we will illustrate how we use R at Mott MacDonald to harness the power of geospatial data with two examples – Risk Modelling for Ash Dieback and Creative Geospatial Visualisation for Impactful Communication.

The Ash Dieback Pipeline is a computer vision project which attempts to identify trees with the Ash Dieback disease from video footage of roadways around the UK. We intend to showcase how we use R to process a variety of geospatial datasets and attempt to model the risk to road users associated with a diseased tree remaining untreated.

Our work at Mott MacDonald often involves wrangling complex datasets to answer multifaceted questions. R provides excellent toolkits for integrating, analysing, and visualising geospatial datasets. We intend to demonstrate how R can be used for creative visualisation of geospatial data to extract and communicate actionable insights.

Through these examples, we hope to outline our team’s maturity journey towards building multilingual spatial data science capabilities alongside traditional GIS platforms.

Myles Mitchell - Jumping Rivers

Using R to teach R

At Jumping Rivers, we teach over forty courses covering data science topics, including programming, data visualisation and machine learning, in R as well as Python, Tableau, Git, Docker and Stan. Most courses follow the same template: static notes, live coding scripts and presentation slides. For every taught course we also have to spin up a bespoke virtual environment, collect feedback and generate certificates.

In this talk, I will explain how we have used R to streamline the course writing process, automate the course build and deployment to Posit Workbench, and conduct post-course administrative tasks. With over 100 courses taught every year, each step in this pipeline must be rigorously tested so that, on the day, the trainer can focus on the attendees without having to worry about technical issues.

I will draw on our process’s successes (and shortcomings) and share some take-home lessons applicable to any big coding project, including packaging of source code, automated testing and scheduled builds.

Nicola Rennie - Lancaster University

Typst or LaTeX? Styling PDF documents with Quarto extensions

Quarto is an open-source scientific and technical publishing system that allows you to combine text with code to create fully reproducible documents in a variety of formats. The addition of custom styling to documents can make them look more professional and recognisable. In this talk, I’ll give an overview of ways to create customised PDF documents using Quarto. Until recently, this meant getting to grips with LaTeX. Now, there’s a new kid on the block: Typst. Typst is an open-source typesetting system that is designed to be as powerful as LaTeX while being much easier to learn and use.

Extensions are a powerful way to modify and extend the behaviour of Quarto, including adding styling to your documents with LaTeX or Typst. To demonstrate the differences between LaTeX and Typst, I’ll walk through the process of converting a LaTeX-based style extension to Typst, allowing users to easily switch between them. We’ll compare the two – discussing error messages (we all get them!), render time, and customisability along the way.

Matt Thomas - British Red Cross

Where data meets disaster: A journey through the British Red Cross’s ‘humaniverse’

The ‘Humaniverse’ is a suite of R packages produced by the British Red Cross’s data scientists for sharing humanitarian data and tools. Open data and analyses are vital for 21st Century humanitarianism and these packages have transformed the speed and scale at which we can provide answers about emerging and ongoing humanitarian crises in the UK. In this talk, I will offer an overview of the Humaniverse and will share some of the ways we have used this infrastructure to inform how the British Red Cross supports people affected by disasters, displacement, and health crises. I will cover our core R packages, discuss how and why we work in the open, demonstrate some of the analyses and apps we’ve built using this infrastructure, and share our ambitions for the future of the Humaniverse.


A Blog Post About the Blog

Thu, 15 Feb 2024 23:59:00 +0000

If you’re a regular visitor to our blog you may have noticed some recent changes we hope will make it easier to find what you’re looking for (or interesting stuff you weren’t).

The blog “home” page, “/blog” now shows a card for every post with title, author, excerpt, tags and image. (But don’t worry, we won’t clog your browser up by trying to force every image to load at once.) This new layout means no more hunting backwards and forwards through pages to find what you’re looking for. Moreover, the new search bar can help you find things with a couple of taps on the keyboard. The URL is updated when you finish a search so you can easily share the results with others. For example, here are all cards that mention Shiny in Production.

The tags pages have also been updated in a similar fashion, with the addition of a search bar if there are more than five results listed.

And, finally, we’ve added brand-new author pages so you can quickly find all blog posts written (or co-written) by any of our team here at Jumping Rivers.

Feel free to tell us what you think via the usual social media channels and let us know if there’s something you think we’re missing.


Parquet vs the RDS Format

Thu, 01 Feb 2024 23:59:00 +0000

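This is part of a series of related posts on Apache Arrow. Other posts in the series are:

Understanding the Parquet file format
Reading and Writing Data with {arrow}
Parquet vs the RDS Format (This post)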

The benefit of using the {arrow} package with parquet files is that it enables you to work with ridiculously large data sets from the comfort of an R session. Using the NYC-Taxi data from the previous blog post, we can perform standard data science operations, such as

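library("arrow")
# Path to the NYC-Taxi parquet files; assumed to be the
# location used for the download in the previous post
nyc_data = "data/nyc-taxi"
nyc_taxi = open_dataset(nyc_data)
nyc_taxi |>
 dplyr::filter(year == 2019) |>
 dplyr::group_by(month) |>
 dplyr::summarise(trip_distance = max(trip_distance)) |>
 dplyr::collect()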

with a speed that seems almost magical. When your dataset is as large as the NYC-Taxi data, standard file formats, such as CSV files and R binary files, simply aren’t an option.

However, let’s suppose you are in the situation where your data is inconvenient - not big, just a bit annoying. For example, if we take a single year and a single month

taxi_subset = open_dataset(nyc_data) |>
 dplyr::filter(year == 2019 & month == 1) |>
 dplyr::collect()

The data is still large, with around eight million rows

nrow(taxi_subset)

and takes around 1.2GB of RAM when we load it into R. The data isn’t big, just annoying! In this situation, should we use the native binary format or stick with parquet?

In theory, we could use CSV, but that’s really slow!
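As an aside, the in-memory size quoted above can be checked with base R. The snippet below is a minimal sketch using a small stand-in data frame (an assumption for illustration, not the taxi subset itself).

```r
# Stand-in data; the post's taxi_subset is far larger (around 1.2GB)
df = data.frame(id = seq_len(1e5), value = runif(1e5))

# object.size() estimates the memory used by an R object
size = object.size(df)
format(size, units = "MB")
```

Swapping `df` for `taxi_subset` would give the figure for the real data.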

RDS vs Parquet

The RDS format is a binary file format, native to R. It has been part of R for many years, and provides a convenient method for saving R objects, including data sets.

The obvious question is which file format should you use for storing tabular data? RDS or parquet? For this comparison, I’m interested in the following characteristics:

the time required to save the file;
the file size;
the time required to load the file.

I’m also a firm believer in keeping things stable and simple. So if both methods are roughly the same, or even if parquet is a little better, then I would stick with R’s binary format. Consequently, I don’t really care about a few MBs or seconds.

Reading and writing the data

To save the taxi data subset, we use saveRDS() for the rds format and write_parquet() for the parquet format. The default compression method used by RDS is gzip, whereas parquet uses snappy. As you might guess, the gzip method produces smaller files, but takes longer.

saveRDS(taxi_subset, file = "taxi.rds")
# Default parquet compression is "snappy"
tf1 = tempfile(fileext = ".parquet")
write_parquet(taxi_subset, sink = tf1, compression = "snappy")
tf2 = tempfile(fileext = ".gzip.parquet")
write_parquet(taxi_subset, sink = tf2, compression = "gzip")

Reading in either file type is also straightforward

readRDS("taxi.rds")
# Need to use collect() to make the comparison fair
# file_path is the path to a parquet file, e.g. tf1 or tf2 above
open_dataset(file_path) |>
 dplyr::collect()
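To give a flavour of how write times, read times and file sizes might be measured, here is a minimal base R sketch. It uses a small stand-in data frame and only the RDS format; the object names are assumptions for illustration, and this is not the exact benchmarking code behind the results in this post.

```r
# A small stand-in data frame; the post uses the (much larger) taxi subset
set.seed(1)
df = data.frame(id = seq_len(1e5), value = rnorm(1e5))

rds_file = tempfile(fileext = ".rds")

# system.time() returns user, system and elapsed times; we keep elapsed
write_time = system.time(saveRDS(df, file = rds_file))[["elapsed"]]
# Inside a function call we need <- rather than = for assignment
read_time = system.time(df_read <- readRDS(rds_file))[["elapsed"]]

# File size in megabytes
size_mb = file.size(rds_file) / 1024^2

c(write = write_time, read = read_time, size_mb = size_mb)
```

The same pattern applies to write_parquet() and open_dataset(). Repeating each measurement a few times and averaging helps smooth out the variability in write times.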

Results

Each test was run a couple of times, and the average is given in the table below. The read times and size were fairly deterministic, but the write times had massive variability.

Method	Compression	Size (MB)	Write Time (s)	Read Time (s)
RDS	gzip	115	27	5.7
Parquet	snappy	143	4	0.3
Parquet	gzip	105	12	0.4

For me the results suggest that for files of this size, I would consider using the native binary R format only if

the writing and reading file times weren’t an issue;
and/or using the native binary R format (and the implied stability) was really important.

However, parquet and {arrow} do look appealing.

When Should we use Parquet over RDS?

The above timings are for a particular size data set (110MB). However, a few quick experiments show the performance improvement is fairly consistent for different file sizes:

Writing (parquet vs rds): around 6 times faster using snappy, and twice as fast using gzip;
Reading (parquet vs rds): around 16 times faster using parquet.

So to answer the question, when should we use parquet over rds? For me that depends. If it was for a standard analysis, and the files were fairly modest (less than 20 MB), I would probably just go for an RDS file. However, if I had a Shiny application, then this would significantly lower the threshold where I would use parquet, for the simple reason that one second on a web application feels like a lifetime. Remember that if you are using {pins}, then pin_write() can handle parquet files without any issue.
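As a rough sketch of the {pins} route, the example below writes a data frame to a temporary board as a parquet pin and reads it back. The board, the pin name and the use of mtcars are assumptions for illustration; in practice you would point at your own board.

```r
library("pins")

# A temporary board, used here purely for illustration
board = board_temp()

# type = "parquet" stores the pin as a parquet file
# rather than the default rds file
pin_write(board, mtcars, name = "mtcars-parquet", type = "parquet")

# Reading the pin back returns a data frame
cars = pin_read(board, "mtcars-parquet")
nrow(cars)
```

Note that a parquet backend needs to be installed for this pin type to work.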


Events at Jumping Rivers 2024

Thu, 25 Jan 2024 23:59:00 +0000

SatRdays London 2024

Once again, we’re partnering up with CUSP London to bring you a day of R themed talks in the centre of the UK capital. We’ll be returning to the amazing Bush House to hear experts in all things R share their knowledge and experience. The day is a great opportunity to meet like-minded data science enthusiasts - whether you’re brand new to R and data science, or have been working in the field for years, the wide range of talks and networking opportunities make this a conference for all.

Still not sure? Take a look at some of last year’s talks on our YouTube channel! We recently closed for abstract submissions, so watch this space, as we’ll be releasing the final speaker lineup soon! Head over to the conference website for more details, and to register!

Shiny in Production 2024

Shiny in Production is returning in 2024, and we’re looking forward to bringing you a wide range of speakers and workshops on all things Shiny (as well as other web-based R and visualisation themes)! We recently released the recordings on YouTube, so head over to see what you can expect!

We’re currently accepting abstracts for next year’s Shiny in Production! If you want to get involved, head over to the conference website to submit your work!


Reading and Writing Data with {arrow}

Thu, 18 Jan 2024 23:59:00 +0000

This is part of a series of related posts on Apache Arrow. Other posts in the series are:

Understanding the Parquet file format
Reading and Writing Data with {arrow} (This post)
Parquet vs the RDS Format

What is (Apache) Arrow?

Apache Arrow is a cross-language development platform for in-memory data. As it’s in-memory (as opposed to data stored on disk), it provides additional speed boosts. It’s designed for efficient analytic operations, and uses a standardised language-independent columnar memory format for flat and hierarchical data. The {arrow} R package provides an interface to the ‘Arrow C++’ library - an efficient package for analytic operations on modern hardware.

There are many great tutorials on using {arrow} (see the links at the bottom of the post for example). The purpose of this blog post isn’t to simply reproduce a few examples, but to understand some of what’s happening behind the scenes. In this particular post, we’re interested in understanding the reading/writing aspects of {arrow}.

Getting started

The R package is installed from CRAN in the usual way

install.packages("arrow")

Then loaded using

library("arrow")

This blog post uses the NYC Taxi data. It’s pretty big - around 40GB in total. To download it locally,

data_nyc = "data/nyc-taxi"
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
 dplyr::filter(year %in% 2012:2021) |>
 write_dataset(data_nyc, partitioning = c("year", "month"))

Once this has completed, you can check everything has downloaded correctly by running

nrow(open_dataset(data_nyc))
## [1] 1150352666

Loading in data

Unsurprisingly, the first command we come across is open_dataset(). This opens the data and (sort of) reads it in.

library("arrow")
open_dataset(data_nyc)
## FileSystemDataset with 120 Parquet files
## vendor_name: string
## pickup_datetime: timestamp[ms]
## dropoff_datetime: timestamp[ms]
## passenger_count: int64
## trip_distance: double
## ...

Reading is a lazy action. This allows us to manipulate much larger data sets than R could typically deal with. The default print method lists the columns in the data set, with their associated type. These data types come directly from the C++ API so don’t always have a corresponding R type. For example, the year column is an int32 (a 32 bit integer), whereas passenger_count is int64 (a 64 bit integer). In R, these are both integers.
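To see the lazy behaviour on something smaller than the taxi data, here is a minimal sketch that uses the built-in mtcars data written to a temporary directory (both are assumptions for illustration).

```r
library("arrow")

# Write a small built-in data set to a temporary directory as parquet
tmp = tempfile()
write_dataset(mtcars, tmp, format = "parquet")

ds = open_dataset(tmp)
# The schema lists Arrow types, not R types
ds$schema

# Building a query does not read the data...
query = ds |>
  dplyr::filter(cyl == 4) |>
  dplyr::select(mpg, cyl)
class(query)

# ...the work only happens when we call collect()
result = dplyr::collect(query)
nrow(result)
```

Until collect() is called, `query` is just a description of the work to be done, which is why the same approach scales to data that would not fit in memory.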

As you might guess, there’s a corresponding function write_dataset(). Looking at the (rather good) documentation, we come across a few concepts that are worth exploring further.

File formats

The main file formats associated with {arrow} are

parquet: a format designed to minimise storage - see our recent blog post that delves into some of the details surrounding the format;
arrow/feather: in-memory format created to optimise vectorised computations;
csv: the world runs on csv files (and Excel).

The common workflow is storing your data as parquet files. The Arrow library then loads the data and processes the data in the arrow format.

Storing data in the Arrow format

The obvious thought (to me at least) was, why not store the data as arrow? Ignoring for the moment that Arrow doesn’t promise long-term archival storage using the arrow format, we can do a few tests.

Using the NYC-taxi data, we can create a quick subset

# Replace format = "arrow" with format = "parquet"
# to create the corresponding parquet equivalent
# data_nyc is the download location defined earlier
open_dataset(file.path(data_nyc, "year=2019")) |>
 write_dataset("data/nyc-taxi-arrow", partitioning = "month",
 format = "arrow")

A very quick, but not particularly thorough test suggests that

the arrow format requires ten times more storage space. So for the entire nyc-taxi data set, parquet takes around 38GB, but arrow would take around 380GB.
storing as arrow makes some operations quicker. For the few examples I tried, there was around a 10% increase in speed.

The large storage penalty was enough to convince me of the merits of storing data as parquet, but there may be some niche situations where you might switch.
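If you want to try a similar storage comparison on your own data, a minimal sketch might look like the following. The stand-in data frame and the `dir_size()` helper are hypothetical, and the ratio you see will depend heavily on the data and the compression settings, so don’t expect to reproduce the factor of ten above.

```r
library("arrow")

# Stand-in data with some repetition, so compression has something to do
set.seed(1)
df = data.frame(
  id = seq_len(1e5),
  group = sample(c("a", "b", "c"), 1e5, replace = TRUE)
)

# Hypothetical helper: total size (in MB) of all files under a directory
dir_size = function(path) {
  files = list.files(path, recursive = TRUE, full.names = TRUE)
  sum(file.size(files)) / 1024^2
}

parquet_dir = tempfile()
arrow_dir = tempfile()
write_dataset(df, parquet_dir, format = "parquet")
write_dataset(df, arrow_dir, format = "arrow")

sizes = c(parquet = dir_size(parquet_dir), arrow = dir_size(arrow_dir))
sizes
```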

Hive partitioning

Both open_dataset() and write_dataset() functions mention “Hive partitioning” - in fact we sneakily included a partitioning argument in the code above. For the open_dataset() function, it guesses if we use Hive partitioning, whereas for the write_dataset() function we can specify the partition. But what actually is it?

Hive partitioning is a method used to split a table into multiple files based on partition keys. A partition key is a variable of interest in your data, for example, year or month. The files are then organised in folders. Within each folder, the key has a value that is determined by the name of the folder. By partitioning the data in this way, we can make it faster to do queries on data slices.

Suppose we wanted to partition the data by year, then the file structure would be

taxi-data
  year=2018
    file1.parquet
    file2.parquet
  year=2019
    file4.parquet
    file5.parquet

Of course, we can partition by more than one variable, such as both year and month

taxi-data
  year=2018
    month=01
      file01.parquet
    month=02
      file02.parquet
      file03.parquet
    ...
  year=2019
    month=01
      ...

See the excellent vignette on datasets in the {arrow} package.
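To make the idea concrete, here is a minimal sketch that partitions the built-in mtcars data by cyl (an assumption for illustration) and then reads a single slice back.

```r
library("arrow")

tmp = tempfile()
write_dataset(mtcars, tmp, partitioning = "cyl", format = "parquet")

# One folder per value of the partition key
list.files(tmp, recursive = TRUE)

ds = open_dataset(tmp)
# The cyl column is recovered from the folder names
"cyl" %in% names(ds)

# Filtering on the partition key only needs the matching folder
six_cyl = ds |>
  dplyr::filter(cyl == 6) |>
  dplyr::collect()
nrow(six_cyl)
```

Because the value of cyl is encoded in the folder name, a filter on cyl lets Arrow skip the files in the other folders.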

Example: Partitioning

Parquet files aren’t the only files we can partition. We can also use the same concept with CSV files. For example,

tmp_dir = tempdir()
write_dataset(palmerpenguins::penguins,
 path = tmp_dir,
 partitioning = "species",
 format = "csv")

This looks like

list.files(tmp_dir, recursive = TRUE, pattern = "\\.csv$")
## [1] "species=Adelie/part-0.csv" "species=Chinstrap/part-0.csv"
## [3] "species=Gentoo/part-0.csv"

You can also partition using the group_by() function from {dplyr}

palmerpenguins::penguins |>
 dplyr::group_by(species) |>
 write_dataset(path = tmp_dir, format = "csv")

In my opinion, while it makes conceptual sense to partition CSV files, in practice it’s probably not worthwhile. If you are partitioning CSV files to get speed benefits, you might as well use parquet.

Single files vs dataset APIs

When reading in data using Arrow, we can either use the single file function (these start with read_) or use the dataset API (these start with open_).

For example, using read_csv_arrow() reads the CSV file directly into memory. If the file is particularly large, then we’ll run out of memory. One thing to note is the as_data_frame argument. By default this is set to TRUE, meaning that read_csv_arrow() will return a tibble. The upside of this is that we have a familiar object. The downside is that it takes up more room than Arrow’s internal data representation (an Arrow Table).
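The effect of the as_data_frame argument can be seen in a minimal sketch. The temporary CSV file built from mtcars is an assumption for illustration.

```r
library("arrow")

csv_file = tempfile(fileext = ".csv")
write.csv(mtcars, csv_file, row.names = FALSE)

# Default: as_data_frame = TRUE returns a familiar data frame (a tibble)
cars_df = read_csv_arrow(csv_file)
class(cars_df)

# as_data_frame = FALSE keeps the data as an Arrow Table
cars_tbl = read_csv_arrow(csv_file, as_data_frame = FALSE)
class(cars_tbl)
```

An Arrow Table can still be passed to {dplyr} verbs and only converted to a data frame with collect() at the end.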

This blog post by François Michonneau goes into far more detail, and discusses the R and Python implementations of the different APIs.

Acknowledgements

This blog was motivated by the excellent Arrow tutorial at Posit Conf 2023, run by Steph Hazlitt and Nic Crane. The NYC dataset came from that tutorial, and a number of the ideas that I explored were discussed with the tutorial leaders. I also used a number of resources found on various corners of the web. I’ve tried to provide links, but if I’ve missed any, let me know.


Security Headers for Shiny Applications

Thu, 11 Jan 2024 23:59:00 +0000

Over the last few years, we have been performing audits on Posit set-ups, Shiny Applications and general R set-ups. One of our standard checks is to examine the server headers of a Shiny Server. Numerous websites do this check for you, but as we have an R-based/Quarto workflow, it was helpful to write a quick R package.

The package isn’t on CRAN, but is on the R-universe, so installing is straightforward

install.packages("serverHeaders",
 repos = c("https://jumpingrivers.r-universe.dev",
 "https://cloud.r-project.org"))

There are only a couple of exported functions. The core function is check(). As an example, let’s use jumpingrivers.com.

# check returns an invisible data frame of results
serverHeaders::check("jumpingrivers.com")
## 
## ── Checking Server ──
## 
## ✔ Status code: 301 → 301 → 200
## ✔ SSL available
## ✔ SSL redirection successful: http -> https
## ✔ content-security-policy: Policy present but not parsed
## ✔ content-type: charset set
## ✔ permissions-policy: Value present but not verified
## ✔ referrer-policy: Acceptable setting found
## ✔ strict-transport-security: max_age = 365 days and is greater than 1 year
## ✔ x-content-type-options: Acceptable setting found
## ✔ x-frame-options: Acceptable setting found

The output to the console highlights key server headers that we are interested in. Of course, the definition of key is open to a lot of discussion, but we just used securityheaders.com for guidance.

Comments on jumpingrivers.com

Before we go further, it’s worth noting that a few years ago we decided to move from WordPress to a static site generator - Hugo. We made this decision based on

static sites are faster;
static sites are easier to maintain;
our previous site (WordPress) had to be constantly updated; dealing with numerous WordPress plugins always worried us - too much for what is essentially a simple site.

One of the most important consequences of having a static site is that the attack surface is significantly reduced.

Status codes

The first header is the status code. You’re probably familiar with a status code of 200 indicating a successful request, and the dreaded 404 indicating a missing page. However, when we look at jumpingrivers.com, we actually got three status codes: 301, 301, and then the magical 200. This is fairly standard. What happens is that jumpingrivers.com is actually the same as http://jumpingrivers.com. This redirects (code 301) to https://jumpingrivers.com, which in turn redirects (the second 301) to https://www.jumpingrivers.com, which finally returns the 200.

A “bad” site wouldn’t redirect to the “https” version.
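If you want to see a redirect chain like this from R, one option is the {httr} package, whose response objects keep the headers of every hop. The snippet below is only a minimal sketch and not necessarily how {serverHeaders} works internally; status_chain() is a hypothetical helper name, and the offline list simply mimics the shape of an {httr} response.

```r
# Hypothetical helper: collapse the status code of every hop into one string.
# `all_headers` is a list with one entry per hop, each holding a `status`
# element (the shape of the `all_headers` field of an {httr} response).
status_chain = function(all_headers) {
  codes = vapply(all_headers, function(hop) as.integer(hop$status), integer(1))
  paste(codes, collapse = " -> ")
}

# With network access and {httr} installed (not run here):
# resp = httr::GET("http://jumpingrivers.com")
# status_chain(resp$all_headers)

# Offline stand-in with the same shape as resp$all_headers
hops = list(list(status = 301L), list(status = 301L), list(status = 200L))
status_chain(hops)
## [1] "301 -> 301 -> 200"
```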

Content security policy

We’ve covered Content Security Policies (or CSP) in previous blog posts. Being explicit about where external resources, e.g. JavaScript, are loaded from gives applications an extra layer of security.

For example, we can state that JavaScript can only be loaded from jumpingrivers.com and example.com. Any JavaScript resource that is loaded from another site is automatically blocked by the browser. This safeguards against attacks such as cross-site scripting.
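To make that concrete, the value of such a policy is just a string listing the allowed origins after the script-src directive. Here is a small sketch that builds one in R; csp_script_src() is a hypothetical helper for illustration, and how the header is actually attached depends on your web server or proxy.

```r
# Hypothetical helper: build a script-src directive from allowed origins
csp_script_src = function(origins) {
  paste("script-src", paste(origins, collapse = " "))
}

policy = csp_script_src(c("https://jumpingrivers.com", "https://example.com"))
policy
## [1] "script-src https://jumpingrivers.com https://example.com"

# The server would then send this as the value of the
# Content-Security-Policy response header.
```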

As jumpingrivers.com is a static site (we use Hugo), we don’t need to worry about cross-site scripting quite as much; it’s probably overkill. However, adding CSP to our site has highlighted exactly where we load external resources from and has encouraged us to keep resources local where possible.

Permissions policy

Permissions policy is similar to CSP. Essentially, we specify which browser features our website is allowed to use. For example, would we expect to use a camera or microphone? Again, for our static site this is overkill, but for a Shiny application it’s certainly something you should consider.

Referrer policy

When someone clicks a link on a site that takes them to another domain, the destination site receives information about where that user came from. This is how we get website analytics about our site traffic.

This isn’t too important for a site like jumpingrivers.com as we don’t have anything private on our site - everything is open to the world! However, if your URL contains potentially private information that you don’t want to be leaked, e.g. example.com/private-info, then you should set the Referrer Policy.

For jumpingrivers.com, we set it to no-referrer-when-downgrade. This means when going from https to http, we won’t send the referrer header. Other than that, we’ll send the full path.

Strict transport security

This header informs browsers that a site should only be accessed using HTTPS. Once set, any future visits will automatically convert http to https. Recall from the status codes that typing jumpingrivers.com into a browser first resolves to http://jumpingrivers.com; after the first visit, this header closes that gap because the browser goes straight to https.
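The header carries a max-age value in seconds, which is where a figure such as "365 days" comes from. As a rough sketch, here is how that value could be pulled out and converted in base R; hsts_max_age_days() is a hypothetical helper and the header string is an example value, not a copy of our server's response.

```r
# Hypothetical helper: extract max-age (seconds) and convert it to days.
# Returns NA if the header has no max-age directive.
hsts_max_age_days = function(header) {
  found = regmatches(header, regexpr("max-age=[0-9]+", header, ignore.case = TRUE))
  if (length(found) == 0) {
    return(NA_real_)
  }
  seconds = as.numeric(sub("max-age=", "", found, ignore.case = TRUE))
  seconds / (60 * 60 * 24)
}

# Example header value (an assumption for illustration)
hsts_max_age_days("max-age=31536000; includeSubDomains")
## [1] 365
```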

X content type options

This stops a browser from trying to MIME-sniff the content type. This should be set to x-content-type-options: nosniff.

X frame options

This tells the browser whether or not you want to allow your site to be framed. At jumpingrivers.com this is set to DENY.

Shiny servers

The {serverHeaders} package checks common security-related headers. There are certainly others, but the headers described above are the important ones. Many Shiny applications we work with contain sensitive data, help make business-critical decisions and/or are fundamental to a business process. As such, spending some time securing your server is to be recommended (a little bit of understatement here).
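If you just want a quick feel for which of these headers a server is missing, a few lines of base R will do. This is a simplified sketch and not the logic used by {serverHeaders}; missing_headers() and the example list are hypothetical, and in practice the named list could come from something like httr::headers(httr::GET(url)).

```r
# The headers discussed in this post (lower case, as header names are
# case-insensitive)
expected_headers = c(
  "content-security-policy", "permissions-policy", "referrer-policy",
  "strict-transport-security", "x-content-type-options", "x-frame-options"
)

# Hypothetical helper: which expected headers are absent from a named list?
missing_headers = function(headers) {
  setdiff(expected_headers, tolower(names(headers)))
}

# Made-up response headers for illustration
example_headers = list("X-Frame-Options" = "DENY",
                       "X-Content-Type-Options" = "nosniff")
missing_headers(example_headers)
```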

Acknowledgements

This package is based on a package originally created by Bob Rudis - hdrs.


Effect of Shiny Widgets with Google Lighthouse

Thu, 14 Dec 2023 23:59:00 +0000

Part 1: Using Google Lighthouse for Web Pages
Part 2: Analysing Shiny App start-up Times with Google Lighthouse
Part 3: Effect of Shiny Widgets with Google Lighthouse (This post)

This is the third blog in our series on the Google Lighthouse tool. In Part 1, we looked at what Lighthouse is and how it can be used to assess the start-up times of webpages, and in Part 2, we used Lighthouse to test Shiny apps and performed some analysis on the 2021 Shiny App Contest submissions. In this final part I am going to create a few Shiny Apps with different content and use Lighthouse to see the differences. I have creatively named my apps app1, app2, …, app6.

The Apps

The default app (app1) I’m using as a baseline is shiny::runExample("01_hello"). It is just a simple app with a slider input and a histogram, where the slider input dictates the number of histogram bins. It looks like this:

To actually see what factors cause changes in load time, I’m going to be building upon this app incrementally. So, app2 is identical to the first, except that it has a {plotly} histogram instead of a base hist() plot.

For app3 I am adding a simple data table using shiny::renderTable() on top of the second app. App 3 looks like this:

Then app4 is the same as the third only we are replacing the data table with a {DT} data table using DT::renderDT() and DT::DTOutput().

For app5 I have added a date input widget in the sidebar. In the 6th and final app, I have changed the {plotly} histogram to a reactive object (so it’s not computed twice) and rendered it twice, the original place and in the sidebar to see if that has any impact on the scores. The final app looks like this:

So now I have a series of 6 apps of increasing complexity. I can now test to see what impact each component I have added has on the Lighthouse reports. I will test each app 10 times to make the results more reliable and so we can see the variance in the Lighthouse reports. The main things I’m looking at from the report are the following Lighthouse metrics (covered in part 1 of the series):

FCP (First Contentful Paint - ms)
SI (Speed Index - ms)
LCP (Largest Contentful Paint - ms)
TTI (Time to Interactive - ms)
TBT (Total Blocking Time - ms)
CLS (Cumulative Layout Shift)
Score
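With repeated runs per app, a natural first step is to summarise each metric by its mean and spread. The sketch below shows one way to do that in base R; the column names are hypothetical and the numbers are made-up placeholder values, not measurements from this analysis.

```r
# Placeholder data: three runs each for two apps (values are NOT real results)
runs = data.frame(
  app = rep(c("app1", "app2"), each = 3),
  tti_ms = c(1000, 1100, 1200, 2000, 2300, 2600)
)

# Mean and standard deviation of Time to Interactive for each app
means = tapply(runs$tti_ms, runs$app, mean)
sds = tapply(runs$tti_ms, runs$app, sd)

run_summary = data.frame(
  app = names(means),
  mean_tti_ms = as.vector(means),
  sd_tti_ms = as.vector(sds)
)
run_summary
```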

Histogram showing all metrics measured in the Lighthouse report

To get a feel for the data obtained, here is a histogram for each of the metrics reported by Lighthouse across the different apps:

With this we can see the spread of score in each metric. This plot gave me a good idea of how to further explore the data.

Time to Interactive vs Speed Index

In this scatter plot of TTI (Time to Interactive - ms) vs SI (Speed Index - ms), we can see the times increasing with each iteration of the app. There is a two-fold difference in TTI between the simplest and the most-complex apps. We can see groupings in the data like {app1}, {app2, app3} and {app4, app5, app6}. This suggests that the first {plotly} graph and the {DT} data table are the most influential components.

Boxplot of App vs Speed Index

This box plot of App vs Speed Index shows that as we iterate on the app, it’s not just the loading times that increase but the variability in loading times as well.

Score vs First Contentful Paint

Here the more complex apps have slight decreases in overall Lighthouse score as the time for first contentful paint increases. Those complex apps also show a wide variability in the overall score. Some runs for an app gained a “Good” user experience rating (90+) while others fell into the “Needs improvement” band (50-89). The first contentful paint scores were relatively constant for a given app, so I investigated why a difference of 10 score points arose for those apps.

Different Lighthouse Scores From the Same App

I mentioned in the last blog that you can get different Lighthouse scores across runs and suggested doing a few reports to get the best results (I included some information on why this might be in part 2). I now have some evidence of it happening and I want to see why the exact same app can go from a Lighthouse score in the high 80s to one in the high 90s. Of the 60 app tests I did, 9 of them had sub 90 scores, all of them coming in apps 4 and 5.

Radar plot

This radar plot compares the mean scores for the runs with Lighthouse scores over 90 vs the ones without. 0% represents the lowest value I recorded for a metric and 100% represents the highest. We can see the sub 90 runs have performed noticeably worse (higher times in each metric, bar score where higher is better), particularly for cumulative layout shift. This metric measures movements in the layout of a page. A good example is clicking a button before a page has fully loaded, only for the page to move so that you click the wrong thing. A better explanation is here.
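The 0% to 100% scale described here is a min-max rescaling of each metric. As a minimal sketch of that idea (my reading of the description, not necessarily the exact code behind the plot), with rescale_pct() as a hypothetical helper name:

```r
# Rescale a numeric vector so its minimum maps to 0 and its maximum to 100.
# Assumes the values are not all identical (otherwise we divide by zero).
rescale_pct = function(x) {
  rng = range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1]) * 100
}

rescale_pct(c(2, 4, 6))
## [1]   0  50 100
```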

Score vs Fetch Time

It seemed odd that there was such a big difference between different runs of the same apps, and also that the most complex app (app6) wasn’t affected in the same way. What else could explain why app4 and app5 had these poor runs?

Perhaps the best explanation for this is a drop in my network speed during the runs…

If we look at the overall scores for the Lighthouse runs against the time when the run was started, there is a clump of sub 90 scores between 15:21 and 15:25. This plot looks very similar to the score vs speed index earlier. I do not have data about my network speed at the time of running the apps, but it looks like there was a dip in network speed at this time. This is backed up by the fact the sixth app has no sub 90 score despite being the most complex.

So even when your app works well, factors beyond your control may affect your Lighthouse results.

Final Words

At our Shiny in Production conference in October, our final keynote speaker, the data visualisation expert Cara Thompson, was asked about her thoughts on interactive visualisations in Shiny apps. In the ensuing discussion, Andrie de Vries from Posit mentioned that {plotly} plots add about a second of loading time each to an app.

Overall the results are pretty straightforward: adding more widgets to your Shiny app is going to slow it down. We can clearly see that adding interactive elements such as {plotly} plots and {DT} tables to your apps will slow them down. I’m not going to recommend not using them, because they don’t add that much time if you use them sensibly. One of the main points of Shiny is interactivity after all - you may as well have a markdown report otherwise.

That being said, don’t have a hundred plotlys in your app, because it will be slow. By all means, put a {plotly} in because it “looks cool” but just remember you are sacrificing a little bit of performance. At the same time maybe think twice about putting a widget when something static would be better for the user.

Retrospectively I wish I had made a few more apps with more interactive content and tried some interactive maps, as I imagine that maps would have a big impact on load times. Why not just add more apps to this analysis and generate more reports? The 60 Lighthouse reports that were covered here were run consecutively on the same day. Including additional Lighthouse reports after those initial reports may introduce extra complications to the analysis, due to internet speed variability, updates to Lighthouse etc.

All being said, Lighthouse is just one tool for assessing the user-experience of your app, and it won’t tell you if your app is “good” or not. Having a fast and efficient app is important for usability, but how enjoyable and easy it is to use are more important for users. This type of feedback is only going to be obtained via user testing and asking users what they are gaining from your app.


Analysing Shiny App start-up Times with Google Lighthouse

Thu, 07 Dec 2023 23:59:00 +0000

This is part two of a three part series on Lighthouse for Shiny Apps.

Part 1: Using Google Lighthouse for Web Pages
Part 2: Analysing Shiny App start-up Times with Google Lighthouse (This post)
Part 3: Effect of Shiny Widgets with Google Lighthouse

Intro

In the last blog I spoke about using Google Lighthouse to test the speed of web pages. I wanted to build upon that and use Lighthouse to test some Shiny apps.

To get a feel for Shiny’s performance in a Lighthouse analysis, I needed a lot of Shiny apps that I could test and create a dataset from. I used the entries to the 2021 Shiny app contest, a competition where people enter Shiny apps to be judged on technical merit and artistic achievement. I used the 2021 apps as there has unfortunately not been a competition since. A full list of the submissions can be found on the Posit Community website.

To actually obtain data from these apps I used Google Lighthouse in the same way I described for general web pages in the previous blog in this series. This generated a Lighthouse report for each app.

Google Lighthouse

Testing a single app from the contest was exactly the same as testing a normal webpage. I simply ran:

lighthouse --output json --output-path data/output_file.json url

Where url is the app I’m testing. You can also test in browser using devtools (as demonstrated in the last blog), but I was testing a lot of apps so I needed to do it programmatically.
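Once a report has been saved as JSON, it can be read back into R to build up a dataset. The following is a minimal sketch using {jsonlite}; read_lighthouse() is a hypothetical helper, and the field names (categories, audits, numericValue) are the ones I understand the Lighthouse JSON report to use, so check them against your own output file.

```r
library("jsonlite")

# Hypothetical helper: pull a few headline numbers out of one report.
# The performance score is stored on a 0-1 scale, so we multiply by 100.
read_lighthouse = function(path) {
  report = fromJSON(path, simplifyVector = FALSE)
  data.frame(
    score = report$categories$performance$score * 100,
    fcp_ms = report$audits[["first-contentful-paint"]]$numericValue,
    tti_ms = report$audits[["interactive"]]$numericValue
  )
}

# Usage, assuming the file written by the command above exists:
# read_lighthouse("data/output_file.json")
```

Applying this over a folder of reports, e.g. with lapply() followed by do.call(rbind, ...), gives one row per app.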

Before we get into the data it’s important to point out that Google Lighthouse scores do vary; you may run a report on an app that I’ve covered and get a different score. There are a number of reasons for this covered here, so the devs recommend running multiple tests. I’d also like to point out I have only run the report once for each app due to length of time it would take to run reports on all the apps a few times.

App data

The entries to the 2021 Shiny app contest were great! There were loads of unique and interesting apps and, given it was 2021, plenty of COVID- and election-related ones. I ran Lighthouse reports locally on 268 of the Shiny app contest submissions (some of the links were broken), and have compiled a few plots to summarise the performance of the apps.

Below is a histogram showing the distribution of overall performance scores for the apps. The Lighthouse docs give the following advice for apps based on performance scores:

90-100 is an app with good performance;
50-89 is an app that needs some improvement;
and 0-49 is an app with poor performance.

As we can see many of the apps (79 / 268) have good performance, whereas the bulk of the apps are in need of some improvement (149 / 268) or have poor overall performance (40 / 268).
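Counts like these can be obtained by binning the scores with cut(). Here is a small sketch using the band boundaries listed above; score_band() is a hypothetical helper and the scores passed to it are made-up values for illustration.

```r
# Assign each performance score (0-100) to one of the three Lighthouse bands.
# right = FALSE makes the bands [0, 50), [50, 90) and [90, 100];
# include.lowest = TRUE closes the top band so a score of 100 is included.
score_band = function(score) {
  cut(score,
      breaks = c(0, 50, 90, 100),
      labels = c("poor", "needs improvement", "good"),
      right = FALSE, include.lowest = TRUE)
}

# Made-up scores for illustration
bands = score_band(c(95, 73, 49, 90, 100))
table(bands)
```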

We can dive deeper into the distribution of the raw values that are used in calculating the overall performance score. The performance score is a weighted sum of some metrics formed from these raw (time) values - see the previous blog for more details on what each of these metrics means. The scores follow a similar trend - most of the measurements fall on the faster side of the spectrum then decrease as the time increases. I think this was to be expected based on the distribution of the performance score seen earlier, as most of the apps scored pretty well.

The Apps

I don’t highlight any of the apps on the lower end of the performance spectrum here, as it would be unfair on the creators. The Shiny app contest has two sections, one for people with less than 1 year’s experience and another for those with more than 1 year’s experience, so people new to Shiny are likely to have been experimenting with what’s possible in an app and not focusing on performance.

That being said I’d like to reiterate what Colin Fay said in his talk “Destroy All Widgets” about being sensible with widget use within apps, and understanding that they can hinder performance and increase wait times when they are not always necessary. For instance do you really need a {plotly} plot or would a ggplot suffice? The same could be said about interactive data tables.

I will highlight a couple of high scoring apps:

This app by Rabii Bouhestine is a really cool Geoguessr-esque game where you are trying to pinpoint the location of world wonders. This app received an overall Lighthouse score of 95!

Another high scoring app is “Mix Things Up” by Sam Parmar, a previous competition winner who was a judge on the 2021 contest. This app is a simple yet efficient way to generate random workouts.

The last one I’m going to highlight is this app by Edgar Cáceres, which is an app for visualising air quality data from the station in La Oroya, Junin, Peru. This app is particularly impressive in its scores as it actually has two interactive {leaflet} plots.

Google Lighthouse is a good starting point for testing the start-up times of your apps, however it is worth noting that the score can be misleading. An app may score very highly but not actually load fast for the user. This may happen, for example, if Lighthouse thinks that the contentful paints have loaded when it was the background for the app. A way to check this is looking at the screenshots of the Google Lighthouse report within the browser. You can do this by adding --view after the url argument when running a test in the terminal. I will be using the next blog in this series to investigate this further.

So if you are developing a desktop Shiny app and want to see how it does, you can use Lighthouse and this blog as a benchmark, although with a pinch of salt as there are many different kinds of apps that we tested - games, data visualisations etc. Roughly, however, if your app scores better than 73 then that’s a good start. If you can’t bring your app load time down for whatever reason, maybe due to data processing for example, then something you can do is use a loading screen to let your app-users know that something is happening. This is covered excellently at the start of this blog on Shiny extensions.

In the final blog in this series, we will be investigating the impact various widgets have on Shiny app Lighthouse scores.


Using Google Lighthouse for Web Pages

Thu, 30 Nov 2023 23:59:00 +0000

This is part one of a three part series on Lighthouse for Shiny Apps.

Part 1: Using Google Lighthouse for Web Pages (This post)
Part 2: Analysing Shiny App start-up Times with Google Lighthouse
Part 3: Effect of Shiny Widgets with Google Lighthouse

Intro

This blog post was partly inspired by Colin Fay’s talk “Destroy All Widgets” at our “Shiny In Production” conference in 2022. In that talk, Colin spoke about HTML widgets and highlighted how detrimental they can be to the speed of a Shiny app. Speaking of which, the next Shiny In Production conference is taking place on 9th and 10th of October 2024, and recordings for this year’s events are coming soon to our YouTube channel.

I wanted to see if I could measure the speed of a collection of Shiny apps. To do so, I was directed to Google Lighthouse, and this blog is dedicated to using and understanding Lighthouse before I start using it on Shiny apps.

Google Lighthouse

Google Lighthouse is an open source tool which can be used to test webpages (or web hosted apps like Shiny apps). For a specified webpage, Lighthouse generates a report summarising several aspects of that webpage. For Shiny, the most important aspects are summarised in the “Overall Performance Score” and the “Accessibility Score”, with one of the best parts being the feedback given by the report on how you can improve.

Before you can use Lighthouse you must install it (and npm if you don’t already have it):

npm install -g lighthouse

Then to run a Google Lighthouse assessment in the command line you simply run:

lighthouse --output json --output-path data/output_file.json url

Where you specify:

The output format. Both json and csv are available; I used json as more information is stored.
The output path for where you would like the data to be stored.
The url of the Shiny app you would like to test (the location of your deployed app or, if developing locally, the URL that Shiny prints out when the app starts: Listening on http://127.0.0.1:4780).

One cool feature of Lighthouse is that you can test apps in both desktop and mobile settings. The default is mobile but you can specify desktop by adding --preset desktop after the url argument.
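If you would rather drive Lighthouse from R than from the terminal, the same command can be assembled and passed to system2(). This is only a sketch: lighthouse_args() is a hypothetical helper, it uses just the flags mentioned above, and it assumes the lighthouse executable is on your PATH.

```r
# Hypothetical helper: build the argument vector for the lighthouse CLI,
# following the order used in this post (url last, preset after the url)
lighthouse_args = function(url, output_path, desktop = FALSE) {
  args = c("--output", "json", "--output-path", output_path, url)
  if (desktop) {
    args = c(args, "--preset", "desktop")
  }
  args
}

args = lighthouse_args("http://127.0.0.1:4780", "data/output_file.json",
                       desktop = TRUE)
args

# To actually run the report (requires lighthouse and Chrome; not run here):
# system2("lighthouse", args)
```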

When you run the command a new Chrome browser will open with the specified URL, where Lighthouse will run the report. This browser will automatically be closed by Lighthouse when it is finished. For all the Lighthouse demos in this blog I am going to use our website for consistency.

Another way to access Lighthouse is to simply use it in a Chrome browser and open the DevTools panel, as described in the Chrome Developer documentation. A Lighthouse tab should be visible in the “more tabs” section, where you can run performance checks interactively.

From DevTools all you do is tick the boxes to specify the device type and performance metrics you want to assess. Then press “Analyze page load” to start the Lighthouse report generation.

Lighthouse Output

Depending on how you’ve run the Lighthouse report, the way you access the results will be different. Firstly if you have used the terminal and saved the lighthouse output you will have a csv or json file containing the data displayed in the report (json output contains more in depth data).

Alternatively from the terminal you can add --view after the URL and the Lighthouse report will open in your browser to view it when ready. Here is an example of this:

Lastly, if you have run Lighthouse through DevTools in a Chrome browser, the report will become visible in the DevTools panel. Location aside, the report should look identical to the browser version created with the --view option. It should look similar to this:

You may have noticed that I have got different scores in the separate screenshots even though I am using the same URL for both. This gives me a great opportunity to bring up one of the drawbacks of Lighthouse, and that is the variability in results. For example you could run a test on our website and get a different score. There are a number of reasons for this including internet or device performance and browser extensions, so the Lighthouse developers recommend running multiple tests. This topic is covered in more detail here.

Lighthouse Performance Metrics

Lighthouse scores apps on 5 measures: Performance, Accessibility, Best Practices, SEO (search engine optimization) and PWA (progressive web app).

Here, we will look at the overall performance score. This is based on a weighted combination of several different metrics. As of Lighthouse 10 (8 was slightly different) the score is made up of:

10% First Contentful Paint - This is the time from the page starting to load until any part of the page’s content is rendered on the screen. “Content” can be text, images, <svg> elements or non-white <canvas> elements.
10% Speed Index - This is how quickly the contents of a page are visibly populated.
25% Largest Contentful Paint - This metric is the time between the page starting and the largest visible image or text block loading.
30% Total Blocking Time - This is the total amount of time, between first contentful paint and another metric called time to interactive (which measures how long the app takes to become interactive for the user), during which the page was blocked from responding to user input.
25% Cumulative Layout Shift - This is a measure of the largest layout shift which occurs during the lifespan of a page. A good explanation can be found here.

Performance scores lie in a range between 0 (worst) and 100 (best).
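To illustrate how the weights combine, here is a small sketch of the weighted sum. Note that Lighthouse first converts each raw measurement into a metric score between 0 and 1; this sketch takes those metric scores as given, and both performance_score() and the input values are hypothetical.

```r
# Weights for the performance score as listed above (Lighthouse 10)
weights = c(fcp = 0.10, si = 0.10, lcp = 0.25, tbt = 0.30, cls = 0.25)

# Hypothetical helper: combine per-metric scores (each between 0 and 1)
# into an overall score between 0 and 100
performance_score = function(metric_scores) {
  stopifnot(setequal(names(metric_scores), names(weights)))
  round(100 * sum(weights * metric_scores[names(weights)]))
}

# Made-up metric scores: everything perfect apart from Total Blocking Time
performance_score(c(fcp = 1, si = 1, lcp = 1, tbt = 0.5, cls = 1))
## [1] 85
```

Because Total Blocking Time carries the largest weight, halving its metric score alone costs 15 points.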

Lighthouse Performance Suggestions

Another cool feature of Google Lighthouse is the performance improvement suggestions. I am going to use the Surfline website as an example for this section. These suggestions can be found underneath the performance score on the report and should look similar to the image below.

For each suggestion you have the ability to expand for more information along with the visible estimated time savings from implementing the suggestion. These suggestions can be helpful if you want to improve a particular aspect of your website or just generally streamline it.

This was an overview of Google Lighthouse covering the many ways to run reports on web pages and some guidelines for interpreting Lighthouse reports. We can also use it to analyse Shiny applications, which will be covered in the next installment of this blog series.


Training Lineup for 2024: January-June

Tue, 28 Nov 2023 23:59:00 +0000

All of our public training courses for the first half of 2024 are now open for registration! Head over to the public courses page on our website to book in and start building your programming skills in the new year! Below is a list of all of our upcoming courses with a description, upcoming dates, course level and a link to the page to find out more!

R

Introduction to R

Course level: Foundation

Upcoming course dates: 15th January 2024 & 22nd April 2024

Data Wrangling in the Tidyverse

Course level: Foundation

Upcoming course dates: 22nd January 2024 & 29th April 2024

Programming with R

Course level: Intermediate

Upcoming course dates: 29th January 2024 & 20th May 2024

R Best Practices

Course level: Intermediate

Upcoming course dates: 12th February 2024

Data Visualisation with ggplot2

Course level: Intermediate

Upcoming course dates: 5th February 2024 & 10th June 2024

Statistical Modelling with R

Course level: Intermediate

Upcoming course dates: 26th February 2024 & 3rd June 2024

Machine Learning

Machine Learning with Tidymodels

Course level: Intermediate

Upcoming course dates: 4th March 2024 & 17th June 2024

Advanced Machine Learning with Tidymodels

Course level: Advanced

Upcoming course dates: 18th March 2024 & 24th June 2024

Automatic Reporting

Reporting with Quarto

Course level: Intermediate

Upcoming course dates: 25th March 2024 & 24th June 2024

Statistics

Introduction to Bayesian Inference using RStan

Course level: Intermediate

Upcoming course dates: 15th January 2024

Python

Introduction to Python

Course level: Foundation

Upcoming course dates: 26th February 2024 & 13th May 2024

Programming with Python

Course level: Intermediate

Upcoming course dates: 4th March 2024 & 3rd June 2024

Data Visualisation with Python

Course level: Intermediate

Upcoming course dates: 18th March 2024 & 17th June 2024

SQL

Introduction to SQL

Course level: Foundation

Upcoming course dates: 14th February 2024

The Structured Query Language (SQL) defines a standard for communicating with a relational database. In this one-day introductory course, participants will learn the basic SQL syntax for data extraction, filtering and insertion. We will start by querying a local database before connecting to a remote database held on an AWS server. Here, we will stress important considerations when working with shared databases in the cloud.

An Introduction to SQL with R

Course level: Intermediate

Upcoming course dates: 15th April 2024

We use the PostgreSQL database as an example for public courses. For in-house training, we are happy to adapt the course to match your database requirements.

Introduction to SQL with Python

Course level: Intermediate

Upcoming course dates: 15th April 2024

We use a PostgreSQL database as an example, and communicate with this using a psycopg2 connection.

So what now?


Getting started with theme()

Thu, 23 Nov 2023 23:59:00 +0000

The theme() function in {ggplot2} is awesome. Although it’s only one function, it gives you so much control over your final plot. theme() allows us to generate a consistent, in-house style for our graphics, modify the text within our plots and more. Getting comfortable with theme() will really take your {ggplot2} skills up a notch.

Normally, when people want help with an R function I tell them to use the built-in documentation about the function. This is normally done by typing ?function_name into the console. It’s usually pretty informative and often enough to help people understand a new function.

So let’s try this with theme() …

library("ggplot2")
?theme

You probably feel a bit like this now:

If you’re coding along with me you’ll be able to see there’s loads of arguments to theme(). If you’re patient and like counting, you’ll find that there are ninety-nine arguments to the theme() function.

Do you need to know all of these arguments? No.

Do I know all of these arguments? Also no.

What do we want to achieve?

By the end of this blog post, you are going to:

become familiar with a handful of theme() arguments
be able to understand how to modify theme elements
have built the confidence to try modifying aspects of a theme on your own

What I’m not aiming to do:

construct the world’s most elegant {ggplot2} theme
show you every single thing that can be modified via theme()

Basically, I want to give you the tools to make your plots look the way you want them to.

It’s also worth mentioning that this post is peppered with personal opinion; I want you to absorb how I’ve managed to implement my stylistic choices, not take these choices as the “truth”.

Building a basic plot

In order to modify a plot theme, we’re going to need a plot to start with. We’re going to work with a simple scatter plot derived from the Palmer Penguins data set. The data are freely available via the {palmerpenguins} package.

library("tidyr")
library("RColorBrewer") # pkg for nice colours
penguins = palmerpenguins::penguins
palette = brewer.pal(3, "Set2") # pick some nice colours

base_plot = penguins %>%
 drop_na() %>% # remove missing values
 ggplot(aes(x = body_mass_g, y = bill_length_mm, colour = species)) +
 geom_point() +
 scale_colour_manual(values = palette) +
 ggtitle("Do heavier penguins have longer bills?") +
 labs(colour = "Species") +
 xlab("Body mass (g)") +
 ylab("Bill length (mm)")

base_plot

We’ve created a basic scatter plot here. It’s perhaps one you’ve seen before. Perfectly functional, but lacks personality.

Using built in {ggplot2} themes

A good starting point for modifying your plot theme is actually to side-step theme() and use one of the themes provided by {ggplot2}. The themes all have similar names: theme_*(). I personally tend to start with theme_minimal(), but feel free to try some others. For example, theme_classic() or theme_light(). The usage of these themes is super simple, just add it to your plot a bit like a geom_*():

plot = base_plot +
 theme_minimal()

We’ve now got a plot which looks cleaner. There’s still some things that I don’t like. For example, I don’t like having my legend at the side of the plot and I don’t like the grid lines. We can modify these with theme().

Our first theme() modification

The first thing I like to do is move the legend to the bottom of the plot. This is where we start to use the theme() function to modify our plot appearance. This one isn’t too tricky, we just specify (as a string) where we want our legend to sit.

plot +
 theme(
 legend.position = "bottom"
 )

And here’s the result:

Okay, that wasn’t too bad. The other options here would be "top", "left" or "right" to modify the position, or "none" to remove the legend entirely.

I’d like you to meet my friends: the element_*() functions

Most arguments of the theme() function don’t take simple values like character or numeric values. A large number of the arguments take in a list of a specific class, where the list elements describe what the plot looks like. We can generate this list via an element_*() function. The functions are element_blank(), element_rect(), element_line() and element_text(). Each of these functions has arguments which modify a given feature of our plot. I’ll run you through how each of the element_*() functions might be used, but remember that, just like for any other function, the help page (for example ?element_line) will show you more.

Modifying line elements

Now our plot doesn’t have any lines to indicate where the axes are. Suppose I want to make it clear where the axis lines are. Because these are, well, lines, I can use element_line() to include them like so:

plot +
 theme(
 legend.position = "bottom",
 axis.line = element_line(colour = "grey50")
 )

Obviously we all have different favourite colours, so use whatever you like. Neutral colours (usually just shades of grey) are probably best for anything professional or for publication.

So here we’ve specified the colour of the line element which corresponds to axis.line. We can change other characteristics here like linewidth or linetype, but I’ll leave that for you to experiment with later.

Removing plot elements

The next thing I want to modify is the grid lines. I personally don’t like them so I’m going to remove them. This could be done with element_line() by matching the grid line colour to the background colour, but off the top of my head I don’t know what the background colour is, and there’s a much simpler solution: element_blank(). This removes aspects of our plot by generating an empty list entry for that plot component.

plot +
 theme(
 legend.position = "bottom",
 axis.line = element_line(colour = "grey50"),
 panel.grid = element_blank()
 )

Assigning element_blank() to panel.grid removes all grid lines. If we only wanted to remove the minor ones, we would set panel.grid.minor = element_blank(). If we wanted to remove only the vertical lines (for whatever reason), we could set panel.grid.minor.x = element_blank() and panel.grid.major.x = element_blank(). Removing only the horizontal ones is the same, but swap x for y. This shows that although theme() might have 99 arguments, there is structure to the argument names which reduces how many you really need to remember.

Modifying text features

Text features are the next thing I’m going to change. We’re going to do two things at once here:

change the font for all text features with the text argument
change the positioning and size of the plot title with the plot.title argument

We’ll adjust both of these via the element_text() function:

plot +
 theme(
 legend.position = "bottom",
 axis.line = element_line(colour = "grey50"),
 panel.grid = element_blank(),
 text = element_text(family = "lato"), ## modify font
 plot.title = element_text(hjust = 0.5, size = 18) ## modify positioning and size
 )

The family argument lets us specify the font: we’re using the lato font here. I won’t delve too much into fonts, but {showtext} is a package that makes using a wide variety of fonts straightforward. I’d definitely recommend playing with fonts if you’re looking to develop a theme for a corporate identity, or simply add some personality to a plot.

The hjust argument controls the horizontal justification, essentially, the positioning of the text. hjust = 0 is left justification (the text is moved to the left), hjust = 1 is right justification (text moved to the right), and numbers between 0 and 1 will position the text somewhere between the far right and far left. Setting hjust = 0.5 centers the text for us. I’ve not used vjust, but this argument adjusts the vertical justification. Note that on the y axis, hjust moves the text along the axis, and vjust moves the text closer to/further away from the y axis. The size argument is just the font size in pts, something we will all be familiar with. The result is that the text, especially the title, shines through a little more.

Borders and backgrounds

Borders and backgrounds are next on our list of things to modify. We use element_rect() (rect as in rectangle) to change the styling of things such as the background around our legend or the entire plot background. By default, the legend background will be the same as the plot background, so it isn’t actually obvious that the legend even has a background! We’re going to modify the plot background here. The plot is currently sitting on a white background, and the web page you’re viewing also has a white background. This means that the plot melts into the web page. This isn’t necessarily a bad thing, but you might want to frame your plot a bit by putting it on a coloured background. Alternatively, you might have a corporate slide deck with a coloured background, and want the plot to melt into this background.

plot +
 theme(
 legend.position = "bottom",
 axis.line = element_line(colour = "grey50"),
 panel.grid = element_blank(),
 text = element_text(family = "lato"),
 plot.title = element_text(hjust = 0.5, size = 18),
 plot.background = element_rect(fill = "#dffffc", colour = "#dffffc")
 )

The fill argument here changes the actual plot background. The colour argument (color will also work if you’re u-averse) controls the colour of a thin border all the way around the plot. The colour #dffffc is a very pale blue, which ensures that there is sufficient contrast between the points and the background of the plot. This is an important accessibility feature, so do think very carefully about how changing the background colour of your plot may impact the ability of other people to properly absorb the message you are trying to communicate. I’d generally avoid changing the background colour, but it’s useful here for demonstration purposes. To reiterate: if you do change the background colour, take care to ensure that accessibility is not compromised.

That brings to a close our introduction to each of the element_*() functions. I know it was a bit traumatic before, but if you type ?theme into your R console, you’ll notice that the help page tells you which element_*() function you need to use for each theme() argument.

Creating space

The last function that we use to modify aspects of a theme is margin(). margin() is a little bit different to the element_*() functions: instead of controlling colours, fonts and line types, it lets us create or remove space around certain aspects of our theme by modifying distances. If an argument needs to be modified with margin(), it’s likely that the argument name looks like something.margin. You may also notice that element_text() has a margin argument - use margin() here to create space around the text aspects of your plot.

There are 5 arguments to margin(): t, r, b, l and unit. t, r, b and l are short for top, right, bottom and left - to remember the order, it’s just clockwise from the top. You should assign a number to these arguments, then unit is simply the units of these values. unit defaults to pt, which scales well with text, but you can choose something else if that makes more sense to you.

Let’s now modify the space between (a) the plot title and the scatterplot itself and also (b) the legend and the x axis.

plot +
 theme(
 legend.position = "bottom",
 axis.line = element_line(colour = "grey50"),
 panel.grid = element_blank(),
 text = element_text(family = "lato"),
 plot.title = element_text(
 hjust = 0.5, size = 18, margin = margin(b = 30) # modify title-plot spacing
 ),
 plot.background = element_rect(fill = "#dffffc", colour = "#dffffc"),
 legend.margin = margin(t = 15) # modify x axis-legend spacing
 )

Now this is our final plot! Each individual step made a minor adjustment to the plot, but added together, we have a plot with a much improved appearance.

Bringing it all together

An important programming concept is don’t repeat yourself (DRY). This applies to constructing graphics as well. We don’t want to copy and paste our theme for every plot we make. The great thing about {ggplot2} themes is that they can be effortlessly applied to basically any plot. All we have to do is turn our theme into a function, and then it can be used just like any of the “built in” {ggplot2} themes. For example:

my_theme = function(){
 theme_minimal() +
 theme(
 legend.position = "bottom",
 axis.line = element_line(colour = "grey50"),
 panel.grid = element_blank(),
 text = element_text(family = "lato"),
 plot.title = element_text(
 hjust = 0.5, size = 18, margin = margin(b = 30)
 ),
 plot.background = element_rect(fill = "#dffffc", colour = "#dffffc"),
 legend.margin = margin(t = 15)
 )
}

Notice that the first line of the function is theme_minimal(): we used this theme as a starting point for our custom theme.

Then we can apply this to any other plot as we would apply a standard {ggplot2} theme:

my_boxplot = penguins %>%
 drop_na() %>%
 ggplot(aes(x = species, y = flipper_length_mm)) +
 geom_boxplot(fill = "#ff9300") +
 xlab("Species") +
 ylab("Flipper length (mm)") +
 ggtitle("Which type of penguin has the longest flippers?")

my_boxplot + my_theme()

That was easy! We can just add my_theme() to all of our plots to ensure consistent styling. If we wanted to make minor adjustments to the theme for a specific plot, we can just add on the theme() command again and make the required adjustments. If you want to apply the style to all plots in a single script or report, theme_set() is a really handy way to do this.

What next?

The theme() function has a lot of arguments, and it can feel overwhelming if you’re wanting to start modifying your own theme. We’ve managed to gain a little bit of experience with the tools that modify a {ggplot2} theme, and now you should have the confidence to modify other elements on your own, and try out the different arguments in element_*() functions.

If you’re wanting to explore what makes a really good chart, I’d recommend the RSS style guide for practical, actionable advice on constructing publication-ready graphics with examples in both R and Python. Cara Thompson gave the keynote talk at our 2023 edition of Shiny in Production; her talk walked us through 10 important considerations for making text shine within a data visualisation, and her slides are packed with ways to make your plots awesome.

If you feel like going back to basics would really help you out, booking onto one of our upcoming Data Visualisation with {ggplot2} courses is a great way to get to grips with a wide variety of {ggplot2} features.

For updates and revisions to this article, see the original post

Python Virtual Environments and Barbie

Thu, 16 Nov 2023 23:59:00 +0000

Having recently been to see the Barbie movie, it got us thinking: Barbie and Python have more things in common than meets the eye (step aside Ken!). For a start, they are both pioneers in their respective fields: Barbie is a famous fashion doll owned by millions of people around the globe, while Python is a famous programming language with millions of users worldwide. Barbie is well known for her wide range of careers, outfits and accessories. Meanwhile, Python comes in many different versions and has thousands of dedicated libraries and packages.

Crucially, they are both customisable. Barbie can be dressed in different outfits from her wardrobe to meet the demands of her busy schedule, whether that’s a day at the beach, governing her country, or kicking back for a quiet night in. With Python, meanwhile, we can customise our programming environment and switch between different combinations of packages and versions to tackle our data science projects. This is made possible through virtual environments.

What is a virtual environment?

Virtual environments are tools used in software development to create isolated environments for different projects. These environments allow developers to manage dependencies and packages separately for each project. This helps avoid conflicts between different project requirements and keeps everything organised. Each virtual environment is like a contained space where you can install packages without affecting the global Python installation.

While Barbie and Python virtual environments might seem unrelated at first glance, there are some similarities:

Customisation: Just like how you can dress up Barbie in different outfits and accessories, you can customise each Python virtual environment with specific packages and dependencies tailored to the needs of your project.
Isolation: Barbie’s different outfits don’t interfere with each other, just as Python virtual environments keep the dependencies of different projects separate, preventing conflicts.
Organisation: Barbie’s wardrobe allows her clothes to be neatly stored rather than strewn all over the floor. With Python virtual environments we can work with just the project-specific dependencies rather than hundreds of conflicting packages at once (never a good idea).
Portability: If you’re lucky enough to own multiple Barbies, you can try the same outfit on different Barbies. Similarly, with Python you can duplicate an environment to work on the same project across multiple machines and share it with your colleagues.

Virtual environment managers

There are a lot of virtual environment managers out there for Python. Below we will give a basic overview of some of the most popular options and share some useful links for more in-depth information.

Note that some of these tools double as package managers. For more on this, check out our recent blog on Python package managers.

venv

Python’s standard library includes an easy-to-use, lightweight virtual environment module called venv. To create a virtual environment called “myenv”, you can run the following command:

python -m venv myenv

This will generate a folder within the current working directory called “myenv/” (you can call it whatever you like), which will be used to activate the virtual environment and store any packages that are installed into the environment.
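
For a sense of what that folder contains, the same environment can also be created from Python itself through the standard-library venv module. This is a minimal sketch rather than a replacement for the command above: the helper name create_env is our own, and with_pip=False is an assumption made so that the sketch runs offline.

```python
import os
import sys
import tempfile
import venv
from pathlib import Path


def create_env(path):
    """Create a virtual environment at `path` and return two key locations."""
    builder = venv.EnvBuilder(
        # Skip bootstrapping pip so the sketch runs offline;
        # `python -m venv myenv` installs pip by default.
        with_pip=False,
        # Mirror the command-line default: symlinks everywhere except Windows.
        symlinks=(os.name != "nt"),
    )
    builder.create(path)
    root = Path(path)
    # Windows keeps the activation scripts in Scripts/, macOS and Linux in bin/.
    scripts = root / ("Scripts" if sys.platform == "win32" else "bin")
    return {"config": root / "pyvenv.cfg", "scripts": scripts}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        info = create_env(Path(tmp) / "myenv")
        # pyvenv.cfg records which base Python the environment was built from.
        print(info["config"].read_text())
```

The scripts folder returned here is where the activation scripts used in the next step live.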

To activate the environment on Windows:

myenv\Scripts\activate

On macOS and Linux, you have to source the activation script:

source myenv/bin/activate

Once activated, the pip install <pkg> command will now install packages into the virtual environment, keeping them separate from the user’s system environment. If you want to share your development environment with a colleague that’s working on the same project, you can run:

pip freeze > requirements.txt

This will create a file called “requirements.txt” containing a list of installed Python packages and their version numbers. Your colleague can then install these dependencies into their environment by running:

pip install -r requirements.txt
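
If you want to double check that an environment really matches a shared “requirements.txt”, the comparison can be sketched with the standard-library importlib.metadata module. This is an illustration under our own assumptions: the helper names parse_pins and check_pins are hypothetical, and only simple name==version lines (the usual pip freeze output) are handled.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines, ignoring blank lines and # comments."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (wanted, installed)} for every pin that is not satisfied."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is missing from this environment
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems


if __name__ == "__main__":
    # Illustrative pins only; in practice read the text from requirements.txt.
    example = "numpy==1.26.4\npandas==2.1.3  # pinned for the project\n"
    print(check_pins(parse_pins(example)))
```

An empty result means every pinned package is installed at the requested version.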

When you are finished with the environment, it can be deactivated by running:

deactivate

To delete the environment outright, simply delete the “myenv/” folder (or whatever you called it).

virtualenv and virtualenvwrapper

Virtualenv is a third-party library that predates venv. If it’s installed with the virtualenvwrapper extension library, it can provide additional commands and features like quick switching between multiple environments. Virtualenv can be installed with pip:

pip install virtualenv

You can then use the virtualenv command to create a virtual environment from the command line:

virtualenv myenv

Activating and deactivating the environment is similar to venv. With a Unix shell the commands would be:

source myenv/bin/activate
deactivate

and packages can again be installed or uninstalled using pip.

Virtualenvwrapper is a set of extensions for virtualenv that simplify the management of multiple virtual environments. It provides commands to create, delete, and switch between virtual environments easily without having to explicitly state the environment file path. To get started with virtualenvwrapper, you’ll first need to install it using pip:

pip install virtualenvwrapper

We then need to add the code below to the shell startup file (~/.bashrc, ~/.zshrc, ~/.profile, etc.) to set the location where the virtual environments will be stored:

# Virtualenvwrapper settings:
export WORKON_HOME=$HOME/.virtualenvs
source ~/.local/bin/virtualenvwrapper.sh

Note that these commands are specific to the Unix shell. Windows users should investigate the virtualenvwrapper-win package.

In a new shell, you can now create a virtual environment and activate it as follows:

mkvirtualenv myenv
workon myenv

Virtualenvwrapper streamlines the management of virtual environments, making it especially useful when working on multiple Python projects simultaneously.

pyenv

As well as different outfits and accessories for Barbie, there are different iterations of Barbie herself: Marine Biologist Barbie and Art Teacher Barbie to name a few! Python also comes in different versions, and there are many occasions where having multiple Python installations on the same machine can be useful:

You may have upgraded to Python 3.11 but still need Python 3.8 to run some old legacy code.
Your colleagues may be using an older version of Python for a project that you’re working on, and switching to that version to test and debug the project code would be useful.

Pyenv is a tool that allows you to easily switch between multiple Python versions on your system. It also facilitates installing different Python versions and, through the pyenv-virtualenv plugin, supports creating virtual environments for specific Python versions.

The GitHub documentation provides OS-specific instructions for installing pyenv on your machine. Once installed, you can try adding an older Python version using the pyenv install command and then create a virtual environment for that version called “myenv/”:

pyenv install 3.8.6
pyenv virtualenv 3.8.6 myenv

You can activate the environment by running:

pyenv activate myenv

and install packages into the environment using pip.

This is a great way to organise Python projects that not only require different packages but also use specific Python releases. And there is a lot more that you can do with pyenv, like specifying the Python version globally or in the current directory. Note, however, that there are some common pitfalls to be wary of when using pyenv:

It’s easy to think that you’re using your system Python installation when really you’re working with an older version through pyenv.
Be cautious when working with package managers like pip and poetry, which may be installing packages to your system Python installation rather than to the current pyenv version.
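
One quick way to guard against both pitfalls is to ask Python itself which interpreter is running. The sketch below uses only the standard-library sys module; the helper name interpreter_report is our own. Note the assumption baked into the in_virtualenv check: it detects venv and virtualenv style environments, but not conda environments.

```python
import sys


def interpreter_report():
    """Summarise which Python is running and whether it sits inside a virtual environment."""
    return {
        # Full path to the interpreter; a pyenv-managed Python shows up here.
        "executable": sys.executable,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        # Inside a venv/virtualenv, sys.prefix points at the environment
        # while sys.base_prefix still points at the base installation.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

Running pip as python -m pip install <pkg> is a related habit: it ties pip to whichever interpreter the python command resolves to, so packages land where you expect.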

We recommend checking out the pyenv documentation and this useful blog post for more information.

Pipenv

Pipenv is a popular tool for managing both Python dependencies and virtual environments. It combines the functionality of pip and virtualenv into a single tool, and is easy to install through pip:

pip install pipenv

It even integrates with pyenv to work with specific Python versions. To create a virtual environment for Python 3.8 (assuming you have pyenv installed), you can run:

pipenv --python 3.8

This automatically sets up a “Pipfile” within the current folder to manage project dependencies. You can then activate the environment and install packages into the environment using pipenv commands:

pipenv shell
pipenv install <pkg>

When a package is installed using pipenv, it gets added to the Pipfile. Both the package and its dependencies are also stored in a “Pipfile.lock” file with the exact version numbers. These files can be shared with a colleague, who can then duplicate the environment on their machine by running pipenv install.
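
Because “Pipfile.lock” is plain JSON, the pinned versions are easy to inspect. Here is a minimal sketch using the standard-library json module; the helper name locked_versions and the trimmed-down example lock file are hypothetical, and a real lock file carries extra fields such as hashes.

```python
import json


def locked_versions(lock_text):
    """Map each locked package to its pinned version, per Pipfile.lock section."""
    lock = json.loads(lock_text)
    result = {}
    # "default" holds regular dependencies, "develop" holds dev-only ones.
    for section in ("default", "develop"):
        packages = lock.get(section, {})
        result[section] = {
            # Versions are stored as "==x.y.z"; strip the leading "==".
            name: info.get("version", "").lstrip("=")
            for name, info in packages.items()
        }
    return result


# Hypothetical, heavily trimmed lock file for illustration only.
EXAMPLE_LOCK = """
{
  "_meta": {"requires": {"python_version": "3.8"}},
  "default": {"requests": {"version": "==2.31.0"}},
  "develop": {"pytest": {"version": "==7.4.3"}}
}
"""

print(locked_versions(EXAMPLE_LOCK))
# {'default': {'requests': '2.31.0'}, 'develop': {'pytest': '7.4.3'}}
```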

For more information about pipenv, Pipfiles and all of pipenv’s commands you can take a look at the official website.

conda

Conda is a cross-platform package and environment manager primarily used in data science and scientific computing. It allows you to create isolated environments with different Python versions and libraries.

Conda is included as part of the Anaconda distribution. The fastest way to obtain it is by installing the Miniconda distribution, which acts as a smaller version of Anaconda that includes conda and Python. You can check out the installation instructions in our previous blog for more info.

By default, you will be working in the conda “base” environment. To create a new environment called “myenv” with Python version 3.8:

conda create --name myenv python=3.8

You can then activate the environment by running:

conda activate myenv

and deactivate the environment by running:

conda deactivate

To install packages into the currently-active environment, you should use the conda install command. For example, NumPy and Pandas can be installed by running:

conda install numpy pandas

When you install packages into a conda environment, the package source files are retained inside a package cache folder within the conda installation directory. This allows you to quickly install the same package across multiple environments without having to perform multiple downloads.

It’s possible to export your conda environment to a YAML file which can then be shared with a colleague:

conda env export > environment.yml

Your colleague can add the environment to their machine by running:

conda env create -f environment.yml

For this to work, both you and your colleague need to have conda installed. Conda can also be used for R and other languages, and downloads its packages from secure repositories that are maintained by the community. For more on conda, check out the official documentation.

Poetry

Poetry is a modern dependency management and packaging tool for Python projects. It not only creates virtual environments but also simplifies the management of dependencies and project packaging.

Check out our previous blog for installation instructions. To create a new poetry project, run:

poetry new myproject

This initialises a project in the “myproject/” folder and automatically sets up a virtual environment for it. To add a package you can run:

poetry add <pkg>

To activate the virtual environment, run the following command from within the myproject/ folder:

poetry shell

When you install packages these are added to a “pyproject.toml” file. There is also a “poetry.lock” file which lists all dependencies plus all of their dependencies with the exact versions. By sharing the project folder and files with a colleague, they can run poetry install within the folder to duplicate the environment on their machine.

We highly recommend poetry if you’re starting on a new Python project from scratch. It helps with not only the environment management, but also installing the project dependencies and organising the project folder. It can even be used to package your project and publish it to the Python Package Index (PyPI) if you want to make it publicly-available. Check out the excellent documentation for more info.

Virtual environments with Jupyter

Hopefully we’ve convinced you that virtual environments are as invaluable to Python development as Barbie’s wardrobe is to Barbie! You may now be thinking about how to incorporate some of the options presented in this blog into your development workflow.

Before we conclude, it’s worth mentioning how to add a virtual environment to Jupyter, since this is one of the most popular IDEs for developing and testing Python code. To be able to use your virtual environments within a Jupyter notebook or the JupyterLab IDE you need to:

Activate your virtual environment
Install the Python package ipykernel into your virtual environment using the relevant command:
- pip install ipykernel
- conda install ipykernel
- etc
Then run

python -m ipykernel install --user --name=<env>

replacing <env> with the name of your virtual environment.

Next time you open a Jupyter notebook or JupyterLab, you should see your environment in the list of available kernels.
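
Behind the scenes, the install command registers a “kernel spec”: a small kernel.json file telling Jupyter which interpreter to launch. The sketch below builds roughly what that file contains for the current interpreter. It is an illustration only: the helper name kernel_spec is our own, and the file written by ipykernel may include additional fields.

```python
import json
import sys


def kernel_spec(display_name):
    """Build a kernel.json-style dict that launches ipykernel with this interpreter."""
    return {
        # Jupyter replaces {connection_file} with a real path at launch time.
        "argv": [sys.executable, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }


if __name__ == "__main__":
    print(json.dumps(kernel_spec("myenv"), indent=1))
```

The important detail is the first entry of argv: it is the Python executable inside your virtual environment, which is why notebooks using that kernel see the environment’s packages.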

Conclusion

Choosing the right virtual environment manager for your Python project depends on your specific requirements and preferences. Each of the tools discussed in this post has its own strengths and use cases as summarised by the table below:

Environment manager	Quick/easy installation	Package manager	Quick multi-environment switching	Python version manager	Multi-language support	Packaging and publishing to PyPI
venv	✅	❌	❌	❌	❌	❌
virtualenv	✅	❌	✅	❌	❌	❌
pyenv	❌	❌	✅	✅	❌	❌
pipenv	✅	✅	❌	✅	❌	❌
conda	❌	✅	✅	✅	✅	❌*
poetry	❌	✅	✅	✅	❌	✅

* publishes to conda channels

Ultimately, the key is to ensure that your Python projects remain isolated, maintainable, and compatible with the required dependencies. Experiment with these tools and discover which one best fits your development workflow.

SatRdays London 2024

Tue, 14 Nov 2023 23:59:00 +0000

SatRdays is returning to London on 27th April 2024! We’re collaborating once again with CUSP London to bring SatRdays to the amazing Bush House venue, and we can’t wait to see you there. More information will be released in the coming weeks - in the meantime, check out the recordings of last year’s talks for an idea of what to expect!

Call for abstracts NOW OPEN

Interested in getting involved? We’re looking to feature talks from R users from a wide range of industries, public services and academia! To apply to be a speaker, please submit your abstract (max 250 words) using this form by 17th January 2024.

Sluggish system or client code?

Thu, 02 Nov 2023 23:59:00 +0000

Over the course of several weeks, we worked to deploy a one-stop data science platform for data analysis and visualisation for one of our clients. This platform consisted of interconnected applications, which are the motor that enables the productivity of the data scientists sitting at the wheel.

The components of the platform were:

GitLab: where data scientists can develop and share their code using all the benefits of Git version control.
Posit Workbench: which hosts development environments such as RStudio on beefy servers, with far more computational power compared to IDEs on local machines.
Posit Connect: which allows data scientists to easily share data, dashboards and reports. It allows the sharing of documents, reports, interactive web applications, as well as hosting Application Programming Interfaces (APIs).
Posit Package Manager: which allows for the organisation, centralisation and distribution of code packages. It provides a mirror of R and Python packages, downloaded from external sources such as CRAN (the Comprehensive R Archive Network). It also provides a way for internally-developed R and Python packages to be shared, if the client wishes.

Our deployment philosophy

When we deploy these components together, we do so in such a way that they enhance each other’s functionality; the whole is greater than the sum of its parts. For instance, we:

Allow users to use the same authentication across all of these applications.
Ensure that users are able to publish documents from Workbench to Connect out-of-the-box. Users don’t have to worry about specifying the correct URLs or ports for all of this to work.
Ensure that users in Workbench can access any package they need (developed internally, or from popular external package repositories such as CRAN) via Package Manager, without any extra configurations required.

Having all of this set up out of the box means users can get straight to exploring and using the many ways in which Posit can increase productivity, without spending time on set-up.

We also carry out disaster recovery, ensuring that in the event of the unexpected (server failure or data corruption, for example), we can recover all data from a backup.

Finally, we carry out security hardening. Each component in our system is checked to ensure it operates to appropriate security standards. This means our infrastructure is secured to UK Government (National Cyber Security Centre) standards, and certified by CREST-accredited cloud security professionals.

Workbench system performance

One of the key selling points of doing computations on a cloud-hosted server – as opposed to a data scientist’s laptop – is that it’s possible to access very powerful machines in the cloud. This improves the speed at which data scientists’ code gets executed, meaning quicker iterations on analysis. Where commands take longer than a few seconds, it can distract from analysts’ train of thought.

If users perceive that the system provided to them is less-than-performant, they may not use it. It is important that we demonstrate to our users that the platform we provide has excellent performance.

Fast Feedback

Now, back to the client project. We had nearly finished the project, and had given the client a testing environment. Out of the blue, we received this message:

Hi all, is there any reason why rowwise() is performing particularly slow in Workbench?

Time on Workbench: 5.3 minutes

Time to execute example code on client’s laptop: 8 seconds

Oh no!

We were shocked. We pride ourselves on providing applications that are useful to our data scientist users. It seemed that – even though the CPUs in our cloud instance are far more powerful than those in a typical laptop – our system was the less performant. Clearly there must be some configuration wrong – something we can change to put things right!

What we tried first

We tried everything we could think of to trace the root cause of the problem. We tried evaluating the code on our laptops. We tried other Workbench servers.

… Both showed that running on Workbench on powerful machines was much slower than on laptops. We tested across many Workbench servers and against laptops! It perplexed us!

Ok, what next?

There were a few more places that we could look:

The specifications of the CPUs involved on our servers, compared to our laptops, to see if it would explain the slowdown. It did not.
Trying in R sessions outside of Workbench, to see if somehow it was a slowdown related to the RStudio Workbench application. It was not.

R packages

One thing left to try: was the version of the R package in question, {dplyr}, the same on all machines?

The Workbench servers we provided had newer versions of that package past v1.1.0, while on all of our laptops, we had older versions cached. This may have been because we had just set up the server and users were just getting started with using it and installing the packages they needed, so they would tend to have the later versions of packages installed. On their laptops, they may have installed {dplyr} or {tidyverse} some time ago.
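
A mismatch like this is easy to miss when version numbers are compared by eye or as plain strings. As a small illustration (written in Python, with hypothetical helper names), comparing dotted versions numerically avoids a classic trap. The sketch assumes plain numeric versions such as 1.0.10; pre-release or development suffixes would need a proper version parser.

```python
def version_tuple(version):
    """Turn a simple dotted version such as '1.0.10' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def is_older(version, reference):
    """True if `version` comes before `reference` when compared numerically."""
    return version_tuple(version) < version_tuple(reference)


print(is_older("1.0.10", "1.1.0"))  # True: 1.0.10 predates 1.1.0
print("1.0.10" < "1.0.9")           # True as plain strings, which is misleading
print(is_older("1.0.10", "1.0.9"))  # False: numerically 1.0.10 is the later release
```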

By downgrading the version of {dplyr}, it turned out we were able to execute the given reproducible example in 3 seconds – faster than the client’s existing solution!

Obvious solution: downgrade? Check Diffify first

You may think that the obvious solution would be to encourage the client to downgrade the version of {dplyr} that they use in production to one before v1.1.0, which would be much faster in using dplyr::rowwise().

However, we had one last thing to check: what features would be lost if we did this? There could be improvements in the later versions which we would lose by downgrading. Would this break existing code?

Enter Diffify.

Diffify provides a comparison between different versions of R packages stored on CRAN or Python packages stored on PyPI. It allows users to select the versions of packages that they want to compare, and presents the differences in a human readable way, making it easy to pick out anything relevant quickly.

It does this by looking at things such as:

NEWS files included with packages,
Changes in functions included as part of the package
Arguments which functions take.
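
To give a flavour of the last kind of check, here is a toy sketch of comparing the arguments of two versions of a function, using Python’s standard-library inspect module. This is not how Diffify is implemented; the helper argument_changes and both summarise_* functions are hypothetical and exist purely for illustration.

```python
import inspect


def argument_changes(old_func, new_func):
    """Report parameters added to or removed from a function between versions."""
    old = set(inspect.signature(old_func).parameters)
    new = set(inspect.signature(new_func).parameters)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Two hypothetical versions of the same function, for illustration only.
def summarise_v1(data, na_rm=False):
    return data


def summarise_v2(data, by=None, keep_empty=False):
    return data


print(argument_changes(summarise_v1, summarise_v2))
# {'added': ['by', 'keep_empty'], 'removed': ['na_rm']}
```

A removed argument is the kind of change that breaks existing code on upgrade or downgrade, which is why it is worth surfacing before switching versions.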

Diffify was useful in this case! Had we downgraded {dplyr}, we would have lost recent performance improvements in other {dplyr} functions. A particular example to note is the case_when() function. In version 1.1.0, this function was significantly slower, an important fact given that the client was moving across to case_when() as an alternative to rowwise(), which is being deprecated. Downgrading to version 1.1.0 would have meant losing access to these improvements.

The release notes said:

Fixed a major performance regression in case_when(). It is still a little slower than in dplyr 1.0.10, but we plan to improve this further in the future (#6674).

So, perhaps the client’s current approach of using rowwise() and their future approach of using case_when() would both perform well on v1.0.10. But this has to be tested.

Final recommendation we made to client

For this particular function, rowwise(), it turns out the key determinant of performance is the version of the {dplyr} package being used. Although downgrading the version would solve this particular problem, it’s important to make sure that doing so doesn’t affect other functions under active development, such as case_when().

In fact, the functions used in the client’s previous approach were moving to a suspended development stage. In this case, downgrading would have solved a problem that would soon no longer exist, and introduced a new problem for the code migrated to the better supported case_when() function.

Summary

Here we see some extra support we provided our client for a problem we hadn’t anticipated at the beginning of our project. Sometimes the issue appears to be in one place, but further investigation reveals it’s in another. We are glad we have a good relationship with our client, who mentioned the slowdown to us, allowing us to get to problem solving.

How can we help?

If you are looking for a data science platform, or require support maintaining your existing set-up, get in touch! As Full Service Certified Posit Partners, we are trusted by Posit to provide installation, support and maintenance services on their products, as well as resell Posit licenses at no extra cost, but with great deals on our services.

For updates and revisions to this article, see the original post

Highlights from Shiny in Production (2023)

Thu, 19 Oct 2023 23:59:00 +0000

Following on from the success of Shiny in Production 2022, last week we were delighted to host the conference for the second time. The event took place at the Catalyst in Newcastle and featured two days of workshops and talks spanning all things Shiny!

On day one, we held three interactive workshops:

Introduction to Shiny for Python - Our guest speaker, Andrie de Vries from Posit, ran a workshop introducing the Python implementation of Shiny. Andrie covered the basic building blocks of a Shiny application in Python through a nice mix of presentations and hands-on exercises.
Building Responsive Shiny Applications - Our JR data scientist and trainer, Keith Newman, ran a workshop looking at how to build responsive Shiny applications. Keith covered responsive design principles and best practices as well as some CSS tricks for when built-in solutions don’t quite cut it.
Shiny Testing - Russ Hyde, another JR data scientist and trainer, ran a workshop on automated testing in production-grade Shiny. Russ demonstrated how to utilise {shinytest2}, {testServer} and {testthat} to make app development a happier and more predictable experience.

If you’re keen to learn more about Shiny and other web frameworks (or something else entirely!) check out our full list of available training courses.

Day two featured a series of talks by prominent Shiny experts from across a range of industries:

Keynote: George Stagg (Posit)

Shiny Without a Server: webR and Shinylive

Our opening keynote was given by George Stagg, a senior software engineer at Posit. George began by providing some motivation for webR and Shinylive, which allow users to run R and Shiny code in the web browser without the need for an expensive server. WebR supports graphics, presentation slides with Quarto, and interactive code in the browser. George went on to emphasise three main use-cases for Shinylive:

Building apps in the browser and sharing with colleagues
Migrating an existing Shiny app to Shinylive using the {shinylive} package
Embedding Shiny apps in presentation slides using the Quarto extension to Shinylive

Before finishing, George noted that Shinylive is still experimental and should not be used for hosting apps that contain hardcoded secrets and passwords.

Talk materials available here

Liam Kalita (Jumping Rivers)

The Road to Easier Shiny App Deployments

Liam spoke about his experiences assisting clients with bringing their apps to production. He outlined some of the most common reasons that an app can fail at deployment, including missing dependencies, incorrect credentials for external databases, and insufficient system resources. He then shared some top tips to be more proactive:

Add continuous integration / continuous deployment (CI/CD) checks that have to pass before deployment can happen
Containerise the app using tools like docker to create a portable environment that can be used across different machines
Use monitoring and alerting to track demand and performance

Liam finished by emphasising the importance of deployment logs and avoiding hardcoded secrets.

Chris Brownlie (Barnett Waddingham)

Anatomy of a Shiny app

Chris took us on a tour of the building blocks of {shiny} to explore what really goes on under the hood of a Shiny app. We learnt about the responsibilities of some of the main components used in Shiny, such as ShinySession, ReactiveEnvironment & ReactiveVal and how they fit together. Chris showed us how “de-magic-ifying” shiny can help us to improve our app design and avoid common pitfalls, as well as aid beginners learning shiny. The talk wrapped up with some very relatable comments Chris found whilst digging through the source code, showing that even developers of large tools such as Shiny sometimes have to resort to a “copy/paste job” 😭.

Talk materials available here

Naomi Bradbury, Clareece Nevill and Janion Nevill (University of Leicester)

Health Data Scientists Developing Production Grade Shiny Apps

Naomi, Clareece and Janion told us the story of how they unexpectedly became developers of a suite of shiny apps for healthcare researchers. They started out with a couple of simple proof of concept apps created as part of a mini project. However, as more researchers realised how useful their apps were, they started getting emails with queries, issues and even feature requests. That’s when they realised they had inadvertently become software developers and maintainers. We heard about the lessons they learnt along the journey, including how valuable it is to include software engineer expertise early on when developing apps, and that prototypes can always become production.

Colin Gillespie (Jumping Rivers)

Securing Shiny Dashboards

Colin started by covering common security pitfalls in Shiny apps, such as SQL injection attacks and hard-coded secrets included in a repository, comically pointing out some of the things you can actually find on GitHub. He introduced our Shiny Health Check service, where we will assess your app and help improve aspects like security, code structure and version control workflows. Colin finished his talk with various policies that can be implemented to improve general web security.

Talk materials available here

Tan Ho (Zelus Analytics)

Effective Logging for Shiny

Tan spoke about his troubles with logs being difficult to find and not necessarily useful, and his subsequent journey to find a better solution. He went into the philosophy of logging and why humans and machines need different kinds of logs, breaking it down to the most basic question: “What are we trying to find out from the logs?” Tan also covered the options for logging at package level versus logging in production Shiny apps.

Talk materials available here

Anna Skrzydło (Appsilon)

3 reasons why nobody uses your app

Anna Skrzydło gave a relatable story of “Three reasons why users don’t use your app”. Spoiler alert: the main reasons are:

They don’t think they need your app
They can’t use the app
They don’t trust the app

When users don’t think they need your app, it’s possible you haven’t solved their problem. Anna suggested using user interviews, prototyping and user personas to identify the core of the problem. If the user is struggling to use the app correctly, you can use usability heuristics to improve the user experience, rather than just offering training. Finally, if the user doesn’t trust your app, fix bugs quickly and communicate clearly when changes are coming, giving users time to prepare.

Keynote: Cara Thompson (Freelance Data Consultant)

Dynamic annotations: tips and tricks to make text shine without stealing the show

Our closing keynote was from data visualisation expert Cara Thompson. Cara gave us a whirlwind tour of detailed plot styling and taught us how to decrease reliance on text through a worked example using the Great British Bake Off data set. We had plenty of “Aha” moments watching the plot evolve from a plain {ggplot2} graphic to something that told a real story. Key takeaways were to write the text in the order you speak it, and use text hierarchy to present your story in an organised way. You can read all of Cara’s top tips in her slides.

Talk materials available here

What happens next?

We want to say thank you to the sponsors of the event for your support in making it possible!

Thanks also to our speakers for their incredibly insightful presentations and workshops, and of course to all our attendees who travelled from near and far to make Shiny in Production such a memorable event! The talk recordings will be released on our YouTube channel in the coming weeks, so keep your eyes peeled for that!

We had such a great time running the Shiny in Production conference, that we’re planning on doing it all again next year! Shiny in Production 2024 will be taking place on 9th & 10th October 2024 - Super Early Bird tickets are available now - Look out for more details coming soon!

Can’t wait that long? We’ll be hosting SatRdays London 2024 on April 27th, in collaboration with CUSP London. More details will be announced in an upcoming blog!

Silver Sponsors

For updates and revisions to this article, see the original post

An Introduction to Python Package Managers

Thu, 05 Oct 2023 23:59:00 +0000

Python is a general purpose, high level language which, thanks to its simplicity and versatility, has become very popular, especially within the data science community. The extensive Python community has developed and contributed thousands of libraries and packages over the years in a plethora of different disciplines to aid developers with their applications. Managing these packages can be a challenging task without the correct tools. That’s where Python package managers come in. In this blog post we will explore what a package manager is and why they are important. We will then cover some popular examples, including how to use them, how to install them and the pros and cons of each.

Whilst we will briefly touch on virtual environments in places, we will explore these in more depth in an upcoming post.

What is a Python Package Manager?

Python package managers are essential tools that help developers install, manage, and update external libraries or packages used in Python projects. These packages can contain reusable code, modules, and functions developed by other programmers, making it easier for developers to build applications without reinventing the wheel. Package managers automate the process of fetching, installing, and handling dependencies, streamlining the workflow and ensuring a smooth development experience.

Managing Package Dependencies

One of the key challenges in software development is dealing with dependencies — the external libraries and packages that your project relies on. Python package managers help alleviate this challenge by managing dependencies automatically. When you install a package, the package manager will also fetch and install any dependencies required by that package, recursively handling all transitive dependencies whilst making sure all package versions integrate with each other.

Additionally, package managers provide support for creating virtual environments. Virtual environments enable developers to create isolated and self-contained environments for each project, ensuring that the dependencies installed for one project do not interfere with another.
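As a rough illustration of the recursive handling of transitive dependencies described above, the sketch below walks a small dependency graph and produces an install order. The graph is simplified and illustrative only, and the sketch ignores version constraints and assumes there are no circular dependencies. Real package managers do considerably more.

```python
# Simplified, illustrative dependency graph: package -> direct dependencies.
# This is an assumption for the example, not a real package's metadata.
DEPENDENCIES = {
    "app": ["pandas"],
    "pandas": ["numpy", "python-dateutil"],
    "python-dateutil": ["six"],
    "numpy": [],
    "six": [],
}


def resolve(package, graph, seen=None):
    """Return an install order: dependencies first, then the package itself.

    Assumes the graph contains no cycles.
    """
    if seen is None:
        seen = []
    for dep in graph.get(package, []):
        if dep not in seen:
            resolve(dep, graph, seen)
    if package not in seen:
        seen.append(package)
    return seen


install_order = resolve("app", DEPENDENCIES)
print(install_order)
```

Each package appears only once in the result, and always after everything it depends on.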

Popular Python Package Managers

There are many different Python package managers out there. Attempting to write about all of these would lead to an almost never-ending blog post and no one would want to read that! Instead, we will talk about some of the most popular options that a lot of Python developers use. These are: pip, conda and poetry. Each has its advantages and disadvantages, which we will talk through below.

pip

The most widely used Python package manager is pip (short for “pip installs packages”). It comes pre-installed with Python versions 3.4 and later. Pip allows developers to easily install packages from the Python Package Index (PyPI) and other repositories. It also handles package versioning, so you can install specific versions of packages when needed.

How to Install pip

Typically, once you have installed Python, pip is installed by default. If this is not the case, there are two ways to install pip:

ensurepip
get-pip.py

`ensurepip`

The ensurepip module has been part of the Python standard library since Python 3.4. Instructions for each OS are given below:

Windows

In your preferred terminal run:

py -m ensurepip --upgrade

macOS

In your preferred terminal run:

 python3 -m ensurepip --upgrade

Linux

In your preferred terminal run:

 python3 -m ensurepip --upgrade

`get-pip.py`

An alternative way to install pip is by using a Python script get-pip.py.

Windows

Firstly download get-pip.py by visiting bootstrap.pypa.io/get-pip.py
Open the Command Prompt, navigate to the directory where you have downloaded get-pip.py and then run:

 py get-pip.py

macOS

Open your preferred terminal
Download get-pip.py:

 curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

Install pip by running:

 python3 get-pip.py

Linux

Open your preferred terminal
Download get-pip.py:

 wget https://bootstrap.pypa.io/get-pip.py

Install pip by running:

 python3 get-pip.py

To check pip has installed run:

pip3 --version
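The same check can also be made from within Python. The sketch below uses the standard library importlib.metadata module (available from Python 3.8) to look up the installed version of a package; the helper name installed_version is our own and not part of pip.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if pip is installed in this environment, else None.
print(installed_version("pip"))
```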

How to use pip

To install a package:

pip3 install package_name

To uninstall a package:

pip3 uninstall package_name

To upgrade a package:

pip3 install --upgrade package_name

Pip is one of the easier Python package managers to get started with. It is most likely already installed with Python and is simple to use. When you install a package with pip, it will also install any other packages that the desired package depends on. However, when you upgrade a package, pip may not automatically update all of its dependencies, which can lead to conflicts.

conda

While pip is excellent for most projects, there are cases when you may need a more comprehensive package manager like conda. Conda is primarily associated with Anaconda and Miniconda, two Python distributions aimed at scientific computing and data science. Conda can manage not only Python packages but also non-Python libraries and binary packages. Furthermore, conda excels at handling dependencies and managing virtual environments (which will be discussed in a later blog).

How to install conda

Conda can be installed in two ways: by installing either Anaconda or Miniconda. We will only consider installing Miniconda in this blog.

Windows

Download the Miniconda installer for Windows from docs.conda.io/en/latest/miniconda.html.
Run the installer and follow the prompts to install Miniconda

macOS

Download the Miniconda installer for macOS from docs.conda.io/en/latest/miniconda.html
Open a terminal of your choice and navigate to the directory containing the downloaded installer
Run the installer script:

 bash Miniconda3-latest-MacOSX-x86_64.sh

Follow the prompts to install Miniconda

Linux

Download the Miniconda installer for Linux from docs.conda.io/en/latest/miniconda.html
Open a terminal of your choice and navigate to the directory containing the downloaded installer
Run the installer script:

 bash Miniconda3-latest-Linux-x86_64.sh

Follow the prompts to install Miniconda

To check that Conda has installed run:

conda --version

How to use conda

To install a package:

conda install package_name

To uninstall a package:

conda remove package_name

To upgrade a package:

conda update package_name

By default, conda will give preference to packages that are included in the Anaconda distribution. If you need to install PyPI packages that are not in the default conda distribution, you can install pip by running conda install pip, then follow the pip instructions above. This will install a version of pip within your conda environment. You need to be careful when using pip inside conda. For more information, Anaconda have written a useful blog on the subject, including some best practices.

poetry

Poetry is a modern and comprehensive Python package manager that combines dependency management and project packaging. It aims to simplify the workflow of managing dependencies and version control, making it an attractive choice for Python developers.

How to install poetry

Installation of poetry is slightly more involved than pip and conda, but thankfully poetry have released a Python script to aid in installation which can be accessed at install.python-poetry.org

Windows

If you are comfortable with using PowerShell, download and execute the installer script by running:

 (Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -

Otherwise, copy and paste the content of the python script from install.python-poetry.org into a file called get-poetry.py and run:

 py get-poetry.py

The installer script will have created a poetry wrapper at %APPDATA%\Python\Scripts. This path needs to be added to your PATH environment variable if it has not already been added. You can find out more information on how to edit the PATH variable in this blog post
You may need to restart your machine before the command poetry will work

macOS

Using your preferred terminal download and execute the installer script by running:

 curl -sSL https://install.python-poetry.org | python3 -

The installer script will have created a poetry wrapper at $HOME/.local/bin. This path needs to be added to your $PATH if it has not already been added. To do this run:

 vim ~/.zshrc

Press i (to enter insert mode) and add the following line to the file:

 export PATH="$HOME/.local/bin:$PATH"

Press Esc and then enter :wq (which will write and quit the file)

To make the poetry command recognisable finally run:

 source ~/.zshrc

Linux

Using your preferred terminal download and execute the installer script by running:

 curl -sSL https://install.python-poetry.org | python3 -

The installer script will have created a poetry wrapper at $HOME/.local/bin. This path needs to be added to your $PATH if it has not already been added. To do this run:

 vim ~/.bashrc

Press i (to enter insert mode) and add the following line to the file:

 export PATH="$HOME/.local/bin:$PATH"

Press Esc and then enter :wq (which will write and quit the file)

To make the poetry command recognisable finally run:

 source ~/.bashrc

To check that poetry has installed run:

poetry --version
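If the poetry command is not found, it can help to confirm whether the install directory is actually on your PATH. The following is a small diagnostic sketch using only the Python standard library; the helper name on_path is our own, and $HOME/.local/bin is the macOS/Linux location mentioned above.

```python
import os
import shutil


def on_path(directory, path_value):
    """Check whether `directory` is one of the entries in a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return target in entries


local_bin = os.path.join(os.path.expanduser("~"), ".local", "bin")
print("~/.local/bin on PATH:", on_path(local_bin, os.environ.get("PATH", "")))
# shutil.which returns the full path of the executable, or None if not found.
print("poetry executable:", shutil.which("poetry"))
```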

How to use poetry

First you need to create a new project:

poetry new project_name

To install a package:

poetry add package_name

To uninstall a package:

poetry remove package_name

To upgrade a package:

poetry update package_name

Package Management in Project Workflows

Python package managers are a critical component of project workflows and can be used in various ways:

Setting up development environments: Package managers help developers create consistent development environments across different machines by specifying the package version numbers. In pip this is in the form of a requirements.txt file, in conda this is an environment.yml file and in poetry this is a pyproject.toml file (this includes more than just python packages).
Continuous Integration (CI) and Deployment: Package managers facilitate the installation of dependencies in CI systems and deployment servers, ensuring that the application runs as expected in these environments.
Version Control: Similar to setting up a development environment, by including a requirements.txt, environment.yml or pyproject.toml file in version control systems like Git, developers can ensure that collaborators and other team members have the same environment setup.

Creating and using these files is pretty straightforward. Let’s take a look at how to do this in pip, conda and poetry.

pip

When using pip, it is easy to create a requirements file. All you need to do is run the following command in the terminal.

pip3 freeze > requirements.txt

This will produce a file with content which will look something similar to the example below.

flake8==4.0.1
numpy==1.25.2
pandas==2.1.0
scikit-learn==1.3.0

pip freeze will produce a list of all the packages you have installed, along with their dependencies and the versions for each package. This list is then written to a file called requirements.txt by using the > operator to redirect the output from pip freeze.
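To show how simple the pinned name==version format is, here is a short sketch that reads such text into a dictionary. It only handles exact pins like those in the example above; real requirements files can also contain comments, other version specifiers, URLs and environment markers, which this sketch skips or does not handle. The function name parse_requirements is our own.

```python
def parse_requirements(text):
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, pinned = line.partition("==")
        pins[name.strip().lower()] = pinned.strip()
    return pins


example = """\
flake8==4.0.1
numpy==1.25.2
pandas==2.1.0
scikit-learn==1.3.0
"""

pins = parse_requirements(example)
print(pins)
```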

To install all the packages and versions from a requirements file within a directory in pip you can execute the following command in the terminal.

pip3 install -r requirements.txt

We use the same command as before when installing a package, however a flag -r is needed to tell pip to look inside requirements.txt and pull all the packages and versions from this file.

conda

In pip, a requirements.txt file is used to store package versions (in practice the file could be given any name, but requirements.txt is the standard choice). In conda, a YAML file is used instead, typically named environment.yml. YAML (which stands for YAML Ain’t Markup Language) files are human-readable and are often used for configuration. Within conda, YAML files store the necessary information about your conda environment, including the packages for the project you are working on and the version of Python being used (in practice this could be another coding language). To create an environment.yml file in conda, use the following command.

conda env export > environment.yml

This will produce an environment.yml file which will be similar to the example below.

name: <environment_name>
channels:
 - defaults
dependencies:
 - flake8=4.0.1
 - numpy=1.25.2
 - pandas=2.1.0
 - python=3.10.8
 - scikit-learn=1.3.0

conda env export is similar to pip freeze and will export all the relevant packages from your environment with the relevant versions, however instead of this being a list, it is in a format suitable for a YAML file. Also like pip, we use the > operator to write the information from conda env export into environment.yml.

To install all the packages and their dependencies with the specific versions from an environment.yml file use the following command.

conda env create -f environment.yml

This will create a new conda environment with all the packages and versions specified in environment.yml. If you want to know more about python environments we will talk more about these along with their uses in an upcoming blog.

poetry

By default, when you create a new poetry project (using poetry new <PROJECT-NAME>) a pyproject.toml file will be generated. Once you have added packages to your poetry project, your pyproject.toml file will look similar to the example below:

[tool.poetry]
name = "<environment_name>"
version = "0.1.0"
description = ""
authors = ["Jane Doe <jane.doe@123evergreenterrace.com>"]

[tool.poetry.dependencies]
python = "^3.10"
numpy = "^1.25.2"
pandas = "^2.1.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

The Python packages you install can be seen under [tool.poetry.dependencies] along with the Python version. You can add extra requirements to the pyproject.toml file by either manually editing it, or by using poetry add <package>. If you want to manually edit the TOML file, note that the caret notation ^ means “greater than or equal to this version, but below the next major version” (for versions 1.0 and above), e.g. python = "^3.10" allows Python 3.10 or above, but not Python 4.
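It is worth noting that, per the Poetry documentation, a caret constraint carries an upper bound as well as a lower bound: updates are allowed as long as they do not change the leftmost non-zero component of the version. The sketch below expands a caret constraint into that range. It is our own illustration rather than Poetry's code, and it assumes a purely numeric version with at most three components.

```python
def caret_range(constraint):
    """Expand a caret constraint such as "^1.25.2" into (lower, upper) bounds."""
    parts = [int(p) for p in constraint.lstrip("^").split(".")]
    lower = parts + [0] * (3 - len(parts))
    # Find the leftmost non-zero component; if every given component is zero,
    # fall back to the last component that was written.
    for i, value in enumerate(parts):
        if value != 0:
            break
    else:
        i = len(parts) - 1
    # Bump that component by one and zero everything to its right.
    upper = lower[:i] + [lower[i] + 1] + [0] * (3 - i - 1)

    def fmt(numbers):
        return ".".join(str(n) for n in numbers)

    return ">=" + fmt(lower), "<" + fmt(upper)


for example in ["^3.10", "^1.25.2", "^0.2.3"]:
    print(example, "->", caret_range(example))
```

So python = "^3.10" accepts any 3.x release from 3.10 onwards, while a package pinned as "^0.2.3" would only accept 0.2.x releases from 0.2.3 onwards.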

When you install dependencies in a poetry project, the exact version numbers of the installed packages and their dependencies are added to a “poetry.lock” file located in the same directory. The pyproject.toml and poetry.lock files can then be shared with a colleague, who can install the dependencies by running the following command in the same directory as the files:

poetry install

To use the dependencies installed by poetry, you need to activate the poetry environment by running:

poetry shell

You will now be using the same development environment as any colleagues that are working on the same project. We will learn more about poetry and other virtual environments in an upcoming blog.

If instead you have a requirements.txt file, we can still install the packages and relevant versions using poetry. This can be done as follows:

poetry add $( cat requirements.txt )

This will add each package and version to the pyproject.toml file. We can use $( cat requirements.txt ) to feed each line of requirements.txt to poetry add.

Conclusion

Package manager	Easy to install	Online support	Latest packages always available	Virtual environment manager	Handles package dependencies	Small installation size	Multi-platform	Access to PyPI	Easy Python package publishing
pip	✅	✅	✅	❌	❌	✅	❌	✅	❌
conda	✅	✅	❌	✅	✅	❌	✅	✅*	❌
poetry	❌	✅	✅	✅	✅	✅	❌	✅	✅

*by using pip

Installing Python package managers is a straightforward process that varies slightly based on your operating system. Regardless of whether you’re using Windows, macOS, or Linux, setting up these tools is a small investment that pays off in significantly improved project management and development practices. The table above summarises some pros and cons of each package manager we have covered. I would not recommend installing all three package managers at once, as it may become confusing to remember what you have installed with which package manager. Instead, choose whichever you like the look of best and try that one first. Personally I would recommend installing either pip or conda if this is your first introduction to Python, and poetry if you are working on a collaborative project. However, choose the package manager that best suits your needs and enjoy the benefits of efficient dependency management and streamlined development workflows.

For updates and revisions to this article, see the original post

Shiny in Production: Sponsors

Thu, 28 Sep 2023 23:59:00 +0000

There are only two weeks left to go until Shiny in Production 2023! The events team are hard at work getting things ready for the day, and we wanted to take this opportunity to say a huge thank you to our event sponsors!

National Innovation Centre for Data

The National Innovation Centre for Data (NICD) was created in 2019 with £30 million of funding from the government and Newcastle University. Based in the state-of-the-art Helix science district in Newcastle, our mission is to transfer data skills to the UK workforce. Our team of PhD-level data scientists work to ensure that organisations across the country are equipped to reap the benefits of the global data-driven revolution.

Silver Sponsors + Drinks Reception

Royal Statistical Society

Founded in 1834, the Royal Statistical Society (RSS) are one of the world’s leading organisations advocating for the importance of statistics and data. They’re a professional body for all statisticians and data analysts – wherever they may live.

They have more than 10,000 members in the UK and across the world. As a charity, they advocate for the key role of statistics and data in society, and work to ensure that policy formulation and decision making are informed by evidence for the public good.

Silver Sponsors

Newcastle University Solve

Newcastle University Solve (NU Solve) has been helping businesses, public sector organisations and industries to find answers to complex challenges for more than three decades. We emerged out of the Industrial Statistics Research Unit, which had successfully engaged with enterprises since 1984.

Posit

Posit’s mission is to create open-source software for data science, scientific research, and technical communication. They do this to enhance the production and consumption of knowledge by everyone, regardless of economic means.

R Consortium

For updates and revisions to this article, see the original post

Reproducible reports with Jupyter

Thu, 21 Sep 2023 23:59:00 +0000

Jupyter notebooks are a useful tool for Python users of all levels. They allow us to mix together plain text (formatted as Markdown) with Python code. This is beneficial for beginners and experienced data scientists alike:

Beginners that are learning Python for the first time can use Markdown cells to annotate code and record notes.
By splitting up their code into chunks, developers can write and test their code in a modular manner.
Jupyter notebooks are open-source and a convenient format for developers to share reports containing live code, equations, visualisations and narrative text with colleagues.

In this post, we will go deeper with these ideas and show you how to create reproducible HTML and PDF reports with Jupyter. This blog is a follow-up to Quarto for the Python user, which explained how to generate reproducible reports from plain text files with Quarto.

What is Quarto?

Quarto is free-to-use, open-source software based on Pandoc that enables users to convert plain text files into a range of formats, including PDF, HTML and PowerPoint presentations. These documents can contain a mixture of narrative text, Python code, and figures that are dynamically generated by the embedded code.

This has many use-cases:

Your company may have a weekly board meeting to go over the latest sales figures. By having a Quarto presentation that pulls in the latest company sales data, you can regenerate the presentation slides each week at the click of a button.
As a researcher you may be preparing a report for publication. By having the code that generates data tables and figures embedded within the report, regenerating the draft as the experimental data floods in is a breeze!

In our recent blog post, Quarto for the Python user, we used Quarto to render dynamic reports that mix together Python code and narrative text. We used Quarto’s standard workflow, which starts from plain text .qmd files. In this post we will extend these ideas to Jupyter Notebooks.

Starting with .ipynb notebook files, the Quarto workflow is:

A Jupyter kernel is used to interpret the Python code cells and Quarto generates a Markdown document.
The Markdown document includes the text, code, and any figures or results that were generated by the code.
This is then converted into the desired output format (PDF, HTML, etc) using Pandoc.
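The workflow above is normally triggered with a single quarto render command. As a sketch, the snippet below builds that command in Python so that rendering could be scripted. The file name report.ipynb is a placeholder and the helper names are our own; the --execute flag asks Quarto to run the notebook cells during rendering rather than reuse stored outputs.

```python
import subprocess


def quarto_render_command(notebook, output_format="html", execute=True):
    """Build the command line for rendering a notebook with Quarto."""
    command = ["quarto", "render", notebook, "--to", output_format]
    if execute:
        # Run the code cells while rendering instead of using saved outputs.
        command.append("--execute")
    return command


def render(notebook, output_format="html"):
    """Run Quarto; requires the quarto CLI to be installed and on the PATH."""
    subprocess.run(quarto_render_command(notebook, output_format), check=True)


# "report.ipynb" is a hypothetical file name used only for illustration.
command = quarto_render_command("report.ipynb")
print(" ".join(command))
```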

Prerequisites

We will be using VS Code to edit and render our Jupyter notebook (the only other IDE with support for both Jupyter and Quarto is JupyterLab). Before you can work with Jupyter in VS Code, you will need to install the Jupyter extension. This can be located in VS Code by clicking “Settings” -> “Extensions” then typing “jupyter” into the extensions search bar. Select the “Jupyter” extension by Microsoft and click “Install”.

You will also need to install Quarto. You can then find the Quarto extension in VS Code by typing “quarto” into the extensions search bar. Select the “Quarto” extension and click “Install”.

Finally, to reproduce the examples covered in this post, you will need to install the Python dependencies by running the following command from your terminal:

python3 -m pip install ipykernel nbclient nbformat pandas papermill plotly statsmodels

These dependencies are required for creating an interactive Plotly figure in Jupyter and rendering the notebook from the command line.
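If you want to confirm that the installation worked before opening a notebook, a quick check along the following lines may help. It uses the standard library importlib.util.find_spec to see whether each module can be located, without importing it. The helper name missing_modules is our own.

```python
from importlib.util import find_spec

# Module names matching the packages installed by the pip command above.
REQUIRED = [
    "ipykernel",
    "nbclient",
    "nbformat",
    "pandas",
    "papermill",
    "plotly",
    "statsmodels",
]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing modules:", missing_modules(REQUIRED))
```

An empty list means everything needed for the examples can be found by the Python interpreter you ran the check with.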

Setting up a virtual environment

In case you’d like to follow along with these examples using a virtual environment, we will provide brief instructions for setting up a kernel on Jupyter. If you’re happy to just use your system Python installation then you can move onto the next section.

To create a virtual environment, run the following command from your command terminal:

python3 -m venv venv

This will create a folder called “venv” which can be used to activate the virtual environment (you can call it whatever you like). To activate it, run:

source venv/bin/activate

Now install the Python dependencies into your environment by running the pip command shared above. You can now add this environment to your list of Jupyter kernels by running:

ipython kernel install --user --name=venv

This will add a kernel called “venv”. Next time you open a Jupyter notebook, you should now be able to select this kernel from the list of options.
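If you are unsure whether a notebook is really using the virtual environment, a quick check can be run in a Python cell. The snippet below is a minimal sketch that relies only on the standard library; the helper name in_virtual_environment is our own and not part of Jupyter or Quarto.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if the interpreter is running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

When the “venv” kernel is selected, the interpreter path printed here should sit inside the “venv” folder created above.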

Rendering a report

We will generate a report about Mario Kart 64 world records. Please refer to our previous post for a recap of the YAML header, Markdown syntax and code chunk options (we will only briefly cover these topics here).

Setting up Jupyter

Within VS Code, create a Jupyter notebook by clicking “File” -> “New File…” -> “Jupyter Notebook (.ipynb support)”. Within the notebook, you can select the kernel by clicking “Select Kernel” and choosing an option from the available list (for example, your system Python installation or a virtual environment). For this post, we used Python 3.10.

Header settings

The first code cell should be changed to a Raw NB Convert cell. In VS Code, the cell type can be changed by clicking the text in the bottom-right corner of the cell (this will read “Python” for a Python code cell). To select a raw cell, type “raw” in the search bar and click the option that appears.

The raw NB convert cell acts as the YAML header of the Quarto report. This is where we include settings such as the title and default output format. Our example is given below:

---
title: "Reporting on Mario Kart 64 World Records"
author: "Parisa Gregg & Myles Mitchell"
date: "1 Aug 2023"
format: html
execute:
  eval: true
jupyter: python3
---

This sets the default output format to HTML and ensures that the code cells are evaluated on execution. Remember to include the fencing (---) for YAML code.
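It can help to see what the raw cell looks like on disk. An .ipynb file is a JSON document, and the header is simply the first entry in its list of cells. The sketch below builds a tiny notebook by hand using only the standard library and checks that the first cell is a raw cell holding the YAML header. The file name header_demo.ipynb and the trimmed-down header are illustrative assumptions; in practice VS Code writes this structure for you.

```python
import json
import tempfile
from pathlib import Path

# The YAML header, including the --- fencing and the indented execute option
yaml_header = "\n".join([
    "---",
    'title: "Reporting on Mario Kart 64 World Records"',
    "format: html",
    "execute:",
    "  eval: true",
    "jupyter: python3",
    "---",
])

# A minimal notebook in the version 4 notebook format
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "raw", "metadata": {}, "source": yaml_header},
        {"cell_type": "markdown", "metadata": {}, "source": "## Abstract"},
    ],
}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "header_demo.ipynb"  # hypothetical file name
    path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

first_cell = loaded["cells"][0]
print(first_cell["cell_type"])
print(first_cell["source"])
```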

Adding text and code

The remainder of the report will be built from a mixture of Markdown and Python code cells:

Markdown cells are used for narrative text in the report.
Python cells are used for displaying Python code and generating dynamic content (e.g., figures, tables and inline results).

Try copying the following into a Markdown cell. This adds the Abstract, Introduction and the beginning of the Methods section:

## Abstract

Investigating how the world record for Rainbow Road in Mario Kart 64
developed over time.

## Introduction

Mario Kart 64 is a racing video game developed and published by
[Nintendo](https://en.wikipedia.org/wiki/Nintendo) for the
[Nintendo 64](https://en.wikipedia.org/wiki/Nintendo_64).

Players can choose from eight characters to race as, including:

- Mario
- Toad
- Princess Peach

The game consists of 16 tracks to race around. World records can be
set for either one lap or a full race (three laps) of the course. As
players have competed for faster times, several track shortcuts have
been discovered. There are separate world records for both _with_ and
_without_ the use of a shortcut.

## Methods

We loaded a dataset of [Mario Kart 64](https://mkwrs.com/) world
records. This data is from [tidytuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-05-25/readme.md)
with credit to [Benedikt Claus](https://github.com/benediktclaus).

For this investigation we are interested in the world records for
Rainbow Road over a three-lap course. The dataset was loaded and
filtered using pandas:

By running the Markdown cell, the text will be rendered so it includes subheadings, bullet points, italic text formatting and hyperlinks.

Next we may wish to display the code used for loading and filtering the data. Try copying this code into a Python cell:

import pandas as pd

# Load the records data
records = pd.read_csv(
 "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-25/records.csv"
)
# Filter the data
rainbow_road = records.loc[
 (records["track"] == "Rainbow Road") &
 (records["type"] == "Three Lap")
].reset_index()
# View the data
rainbow_road.head()

Running this should produce the expected Pandas output, including the first five rows of the rainbow_road data.

Let’s now include some results, starting with a Markdown cell to add the Results section header and opening text:

## Results

The figure below shows the development of world records for the Rainbow Road
track on Mario Kart 64 from 1997 to 2021.

We could insert the figure as a PNG or PDF image. But to make this report reproducible, let’s dynamically generate the figure using a Python code cell:

#| echo: false
#| fig-cap: "Progress of Rainbow Road world records, with and without allowing shortcuts."
#| fig-width: 8
#| label: wr-plot
import plotly.express as px

px.line(
 rainbow_road,
 x="date",
 y="time",
 color="shortcut",
 title="Progress of Rainbow Road N64 World Records",
 line_shape="hv",
 markers=True
)

The code chunk options at the top of this cell will make the code invisible in the rendered document and set the figure caption, width, and label to our liking. Plotly is used to visualise the world record for Rainbow Road over time. Try running this code within your notebook to check that it generates a figure like the one below:

Finally, let’s quote the longest time a world record was held for using inline code. Copy this code into a Python cell:

#| echo: false
from IPython.display import display, Markdown

max_duration = rainbow_road.record_duration.max()
display(Markdown(
f"""
The longest a 3 lap world record was held
for on Rainbow Road is {max_duration} days
({round(max_duration/365,1)} years).
"""
))

Running this should add the sentence “The longest a 3 lap world record was held for on Rainbow Road is 2214 days (6.1 years).”, where the numbers 2214 and 6.1 have been calculated by Python. If more data is added, these numbers can be updated automatically by re-rendering the notebook.
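One way to keep this kind of inline text easy to check is to move the formatting into a small function, which can then be tested outside the notebook. The following is a sketch only: the function name record_sentence is our own, and the value 2214 is the maximum duration quoted above rather than something computed here.

```python
def record_sentence(max_duration_days: int) -> str:
    """Build the sentence reporting the longest-held world record."""
    # Convert days to years, rounded to one decimal place
    years = round(max_duration_days / 365, 1)
    return (
        "The longest a 3 lap world record was held for on Rainbow Road "
        f"is {max_duration_days} days ({years} years)."
    )


print(record_sentence(2214))
```

Inside the notebook, the returned string could be passed to display(Markdown(...)) exactly as before.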

Rendering your notebook

You should now have a complete notebook with a YAML header, Markdown text and Python code cells. To see how it should look, you can view our notebook here.

To render the report from the command line:

quarto render <notebook>.ipynb --to html will render the document as HTML.
quarto preview <notebook>.ipynb will generate a live preview which can be viewed as you edit the notebook.
quarto render <notebook>.ipynb --execute will execute the code cells as the output is generated. Without this, you will need to ensure that you have run the code cells in the notebook manually, before quarto is used to render it.

Upon rendering, an HTML document like the one here should be created.

It’s also possible to render the notebook with the VS Code UI. Provided you have the Quarto extension installed, there should be options to “Render”, “Render All”, “Render HTML”, “Render PDF”, and “Render DOCX”:

Note that the HTML plot generated by Plotly cannot be displayed in a DOCX or PDF document. Instead we would have to use a static image format like PNG or PDF.

Cell embedding

In Quarto 1.3 a new feature was added that enables you to embed external Jupyter notebook cells in a Quarto document. This is particularly useful if you have results from different notebooks that you want to extract into a report.

As well as investigating the world records set on Rainbow Road, we have also been looking at those set on Choco Mountain. The results for Choco Mountain are in a separate choco_mountain.ipynb notebook. We might now want to summarise our various Mario Kart results in a single .qmd report (see our previous post for a guide to .qmd reports).

Rather than having to replicate our plotting code, we can embed the relevant cells from our rainbow_road.ipynb and choco_mountain.ipynb notebooks directly into the .qmd report:

---
title: "Reporting on Mario Kart 64 World Records"
author: "Myles Mitchell & Parisa Gregg"
date: "14 June 2023"
format: html
---

## Rainbow Road

The figure below shows the development of world records for the
Rainbow Road track on Mario Kart 64 from 1997 to 2021.

{{< embed rainbow_road.ipynb#wr-plot >}}


## Choco Mountain

The figure below shows the development of world records for the
Choco Mountain track on Mario Kart 64 from 1997 to 2021.

{{< embed choco_mountain.ipynb#wr-plot >}}

Here we have used the “wr-plot” label to reference the code cells that produce the Plotly figures in the Rainbow Road and Choco Mountain reports. These code cells are now embedded in the .qmd report and the figures will be visible in the rendered document (as can be seen here).

Parameterised Reports

Above we produced a report for the Rainbow Road world records on Mario Kart 64. There are 16 tracks in total in the game. What if we wanted to replicate this report for each track? With Quarto and Jupyter notebooks we can define a set of parameters to easily create different variations of a report.

To parameterise a Jupyter notebook we need to create a cell with a “parameters” tag. To add a parameters tag to a Python cell in VS Code, click on “…” (More Actions) in the cell tool bar and select “Add Cell Tag”:

To add a parameters tag we then just type “parameters” into the pop up box:

The cell should now have a “parameters” tag:

If we want to have the track as a parameter in the report, we can define a track variable in the tagged cell (as above):

track = "Rainbow Road"
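Behind the scenes, the tag is stored in the cell’s metadata within the notebook’s JSON. The sketch below shows how a tagged cell could be located with the standard library alone. The notebook dictionary is a hand-written stand-in for a real .ipynb file, and find_parameter_cells is a hypothetical helper of our own.

```python
# A hand-written stand-in for the JSON content of a notebook
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {"tags": ["parameters"]},
            "outputs": [],
            "source": 'track = "Rainbow Road"',
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "print(track)",
        },
    ],
}


def find_parameter_cells(nb: dict) -> list:
    """Return the code cells that carry the "parameters" tag."""
    return [
        cell
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
        and "parameters" in cell.get("metadata", {}).get("tags", [])
    ]


for cell in find_parameter_cells(notebook):
    print(cell["source"])
```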

We can then use this variable in the remainder of our notebook. For example, it can be used to set the track filter in the data-loading code:

# Load the records data
records = pd.read_csv(
 "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-25/records.csv"
)
# Filter the data
course_records = records.loc[
 (records["track"] == track) &
 (records["type"] == "Three Lap")
].reset_index()

The full code for our parameterised mario_kart.ipynb notebook can be found here. In this example we have used "Rainbow Road" as the default value for our track parameter. Running the following will therefore generate a report for Rainbow Road:

quarto render mario_kart.ipynb --execute

If we want to report on the "Moo Moo Farm" world records instead, we can pass this to the track parameter on the command line using the -P flag:

quarto render mario_kart.ipynb -P track:"Moo Moo Farm" --execute

You may have noticed that running the above command actually inserts a cell defining the track variable as “Moo Moo Farm” into mario_kart.ipynb.

# Injected Parameters
track = "Moo Moo Farm"
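To produce a report for several tracks in one go, the render command can be wrapped in a short Python script. This is a sketch under a few assumptions: the list of tracks is just an example, the output file names are our own choice (passed through Quarto’s --output option), and DRY_RUN is a flag we have added so that the commands are only printed unless you opt in to running them.

```python
import subprocess

DRY_RUN = True  # set to False to actually call quarto
tracks = ["Rainbow Road", "Moo Moo Farm", "Choco Mountain"]


def render_command(notebook: str, track: str) -> list:
    """Build the quarto command that renders one track's report."""
    # e.g. "Moo Moo Farm" -> "moo_moo_farm.html"
    output = track.lower().replace(" ", "_") + ".html"
    return [
        "quarto", "render", notebook,
        "-P", f"track:{track}",
        "--execute",
        "--output", output,
    ]


for track in tracks:
    command = render_command("mario_kart.ipynb", track)
    print(" ".join(command))
    if not DRY_RUN:
        # Arguments are passed as a list, so no shell quoting is needed
        subprocess.run(command, check=True)
```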

posit::conf(2023)

Thu, 14 Sep 2023 23:59:00 +0000

Our bags are packed, flights are booked, and we’re ready to head stateside for posit::conf(2023). We’re excited to be sponsoring the event this year, as well as presenting a few talks ourselves. You’ll be able to find Colin, Liam and Rich at the Jumping Rivers exhibition stand for the week, so come along, say hello, and get your hands on one of our coveted JR coasters.

The Road to Easier Shiny App Deployments - Liam Kalita

15:00 CDT - Tuesday 19th September

We’re often helping developers to assess, fix and improve their Shiny apps, and often the first thing we do is see if we can deploy the app. If you can’t deploy your Shiny app, it’s a waste of time. If you can deploy it successfully, then at the very least it runs, so we’ve got something to work with. There are a bunch of reasons why apps fail to deploy. They can be easy to fix, like hardcoded secrets, fonts, or missing libraries. Or they can be intractable and super frustrating to deal with, like manifest mismatches, resource starvation, and missing libraries. At the end of this talk, I want you to know how to identify, investigate and proactively prevent Shiny app deployment failures from happening.

Getting the Most Out of Git - Colin Gillespie

16:00 CDT - Tuesday 19th September

Did you believe that Git will solve all of your data science worries? Instead, you’ve been plunged HEAD~1 first into merging (or is that rebasing?) chaos. Issues are ignored, branches are everywhere, main never works, and no one really knows who owns the repository.

Don’t worry! There are ways to escape this pit of despair. Over the last few years, we’ve worked with many data science teams. During this time, we’ve spotted common patterns and also common pitfalls. While one size does not fit all, there are golden rules that should be followed. At the end of this talk, you’ll understand the processes other data science teams implement to make Git work for them.

For updates and revisions to this article, see the original post

Shiny in Production: Full speaker lineup

Thu, 07 Sep 2023 23:59:00 +0000

We are pleased to announce the full line-up for this year’s Shiny in Production conference! Don’t miss out on this excellent set of talks and workshops - head over to the conference website to sign up now!

Workshops

This year’s workshops consist of two delivered by our JR trainers, and one by a special guest, Andrie de Vries of Posit!

Andrie de Vries - Posit

Introduction to Shiny for Python

This workshop provides an introduction to coding a web application using Shiny for Python. It is aimed at providing R users, who are already familiar with Shiny, the tools and understanding to write similar apps using Python. In addition to using Shiny for Python yourself, this will also give you the capability to discuss Shiny with your Python colleagues, for example when you work in a bi-lingual data science team.

About the speaker

Andrie is Director of Product Strategy at Posit (formerly RStudio) where he works on the Posit commercial products. He started using R in 2009 for market research statistics, and later joined Revolution Analytics and then Microsoft, where he helped customers implement advanced analytics and machine learning workflows. To keep healthy, he practices yoga and does some recreational running and canoeing.

Keith Newman - Jumping Rivers

Building Responsive Shiny Applications

About the speaker

Following a PhD in statistics at Newcastle University, Keith developed software to improve road safety modelling. He enjoys creating Shiny apps and teaching the use of R.

Russ Hyde - Jumping Rivers

Shiny Testing

Automated testing plays a vital role in any production-grade software project. But what benefit does well-tested code bring to a project, and how do you write a good test suite for your shiny application? In this workshop, we demonstrate how to document the behaviour of an application using browser-driven end-to-end tests and show that lower-level, module- or function-focussed, tests make development a happier and more predictable experience. The tools used here (shinytest2, testServer) all build upon the testthat package.

About the speaker

Russ has previously worked in molecular biology and bioinformatics. He holds a PhD in Molecular Physiology and an MSc in Mathematics. Russ is the author of several CRAN packages and a mentor in the R-for-data-science community.

Talks

The second day of the conference will consist of a great line up of talks from people from a variety of industries!

Keynote: George Stagg - Posit

R Shiny without a server: webR and Shinylive

WebAssembly (Wasm) is a technology that enables software that’s normally compiled for a specific computer system to instead run anywhere, including inside web browsers. WebR is a version of the R interpreter compiled for Wasm, bringing this technology to the R world.

Earlier this year, the initial version of webR was released and users have already begun building new interactive experiences with R on the web. The latest release, version 0.2.0, includes improvements to graphics, accessibility and internationalisation, developer API updates, and introduces a new webR REPL app. The release also includes expanded support for Wasm R packages, including the ability to run fully client-side Shiny apps.

In this talk, I’ll introduce webR with some simple examples and discuss some details of how the system works. I’ll talk about how JavaScript APIs can be used to integrate webR into wider web applications and describe webR’s communication channel. Finally, I’ll give a description of how Shiny apps can be run using webR without an R server, ending with a demo of an in-development “Shinylive for R”.

About the speaker

George is a software engineer working on the webR project as part of the Open Source Team at Posit Software PBC. A former academic, George also has experience with teaching and research in computational mathematics, statistics and physics. When not working with software, George enjoys hacking hardware, photography, fantasy & sci-fi, and tinkering with electronic synthesisers.

Keynote: Cara Thompson - Freelance Data Consultant

Dynamic annotations: tips and tricks to make text shine without stealing the show

Data. It’s complicated! And there are often many facets to our data stories, which we need to present succinctly enough for our readers to want to engage with. On top of that, it changes! If we want our apps to reflect up-to-date data, how do we make sure the annotations stay up to date, and don’t end up off the edge of the plot or on top of each other with the next batch of data?

In this talk, we will explore how to make text work for us, by first considering how much of it we really need. Once we’ve decluttered and explored how we can use colours to make our plots less text-dependent, we’ll look at how to optimise text hierarchy in descriptions and in-plot annotations to keep the main thing the main thing, and how to create dynamic content and alignments for our titles, subtitles, axes, and annotations. Finally, we’ll explore coding tricks to apply these typography tips to tables and interactive plots, giving readers additional information on demand. Throughout the talk, I will share the packages and code snippets used to create and modify plots in R straight from readily available data, as well as tools which we can use to check for accessibility in our dataviz design decisions.

About the speaker

Naomi Bradbury, Clareece Nevill & Janion Nevill - Complex Reviews Support Unit

Health Data Scientists Developing Production Grade Shiny Apps

In 2017 a group of Biostatisticians at the University of Leicester embarked on two mini-projects to investigate the feasibility of using R Shiny apps to perform meta-analyses. Six years later, we have produced a suite of four R Shiny apps that automate network meta-analysis and diagnostic test accuracy meta-analysis, allowing researchers and healthcare professionals to access cutting-edge statistical analysis techniques without the need to write their own code. We also have two further apps currently in development. Our apps have approximately 1,000 user hours each month across the globe, and they have been used to conduct numerous published meta-analyses and in the development of clinical guidelines.

In this talk, we will give an overview of the project and the lessons we have learned (and are still learning) along the way as research scientists entering the world of software development, and how the power of R Shiny has enabled us to achieve this. We will discuss:

How we transitioned from working individually on apps to creating a research software development team
Leveraging existing R analysis packages to avoid ‘reinventing the wheel’ and ensuring our users could be confident in the accuracy of the results delivered
Developing novel data visualisations available within the MetaInsight and MetaDTA apps
Creating a living network meta-analysis of treatments for COVID-19 during the pandemic using a combination of R Shiny, Python and a Raspberry Pi
The changing landscape of Shiny app development and journal publication of apps across the time span of the project.

About the speakers

Naomi, Clareece and Janion are part of the Complex Reviews Support Unit (CRSU) based at the University of Leicester. The CRSU began in 2015 as a support group of experts in the field of evidence synthesis, but now includes a strong interdisciplinary team primarily tasked with developing and maintaining the CRSU’s suite of Shiny apps for assisting evidence synthesis.

Chris Brownlie - Barnett Waddingham

Anatomy of a Shiny app

Have you ever wondered what really goes on under the hood of a Shiny app? What the building blocks are and how they fit together to enable us to build reactive web apps using R? Shiny apps are made up of a collection of objects that all link with each other and external sources to make the app work. These objects and methods interact in various ways in order to: start up the app, build the reactive graph, handle reactivity and much more. In addition to this, the inner workings of Shiny rely on the use of other, less well-known R packages.

In this presentation I’ll be exploring the building blocks of Shiny - such as the Shiny Session, reactive context and reactive log - as well as the key functions provided by Shiny’s dependencies, giving a high-level overview of how they fit together and what they are each responsible for in the lifecycle of an app. I’ll also discuss how understanding these can be useful when debugging or monitoring a production Shiny app.

About the speaker

Chris is an analytics consultant in the Management Decision Analytics team at Barnett Waddingham, specialising in R and Shiny-app development. He comes from a background in data science and formerly worked in the public sector. Besides coding he enjoys rugby, reading fantasy books and spending time with his dog, Nero.

Colin Gillespie - Jumping Rivers

Securing Shiny Dashboards

Shiny apps, Rmarkdown reports and flask dashboards provide a rich user experience for relatively little development time. Often this experience is created by utilising third-party Javascript functions, CSS files, fonts and images, but every external file we use means we implicitly trust the authors. The NHS and thousands of other government websites can attest that this is an issue; in 2018, they ran scripts that made their visitors use their computing power to mine cryptocurrencies.

This talk will look at how organisations can improve their Shiny application security. We’ll discuss general procedures for securing your overall workflow, such as security audits of your R packages and general Git security. We’ll then see how Content Security Policies (CSPs) can be leveraged in Shiny apps, which allow a website to specify what external content a site can access. This talk will discuss implementing these precautions within Shiny and Posit Connect. We’ll demonstrate that securing and monitoring your applications is relatively straightforward.

About the speaker

Colin has been using R since 1999. He’s the author of a number of R packages and has published the book Efficient R Programming with O’Reilly.

Tan Ho - Zelus Analytics

Effective Logging for Shiny

This talk will share some strategies I’ve found effective in setting up logging for Shiny apps to help with debugging applications both in development and when deployed to production.

About the speaker

Tan is a data nerd from Ottawa, Canada who loves R, Shiny, fantasy football and carving pumpkins! By day, he’s an ML engineer for Zelus Analytics. In his spare time, he maintains DynastyProcess.com Trade Calculator (a Shiny app that serves over 200,000 unique monthly users), develops nflverse R packages, and mentors in the R4DS Slack Community.

Liam Kalita - Jumping Rivers

The Road to Easier Shiny App Deployments

There are a bunch of reasons why apps fail to deploy. They can be easy to fix, like hardcoded secrets, fonts, or missing libraries. Or they can be intractable and super frustrating to deal with, like manifest mismatches, resource starvation, and missing libraries.

At the end of this talk, I want you to know how to identify, investigate and proactively prevent Shiny app deployment failures from happening.

About the speaker

Liam has been the InfoSec Lead at Jumping Rivers since the start of 2023, specialising in compliance, security controls, and policies (GDPR, Cyber Essentials, ISO 27001). With a previous 2 years in infrastructure support and consultancy, he ensures secure Shiny app and Posit platform deployments, and promotes a culture of security awareness within the company.

Anna Skrzydło - Appsilon

3 reasons why nobody uses your app

You’ve built a great app. You are sure that once your coworkers start using it, their lives will be so much easier. You are waiting for some signs of success: your happy colleagues praising the app or recommending it to others. But … it doesn’t come. Does it sound familiar? Have you ever wondered why nobody is using your app? Come to my talk and wonder no more. During my talk I will present 3 main reasons for low user adoption: don’t need the app, can’t use the app and don’t trust the app. I will share not only examples, but also recipes for dealing with each of those situations.

About the speaker

Anna is a Delivery Manager, Business Analyst, and R/Shiny developer with over 10 years of professional experience leading software and Data Science projects, facilitating user workshops and mentoring Project Managers. She is a regular speaker at industry conferences, including WhyR, UseR and Data Science Summit. She is also a dog lover, salsa dancer and stand-up comedy fan.

For updates and revisions to this article, see the original post

Using Stan to analyse global UFO sighting reports

Thu, 31 Aug 2023 23:59:00 +0000

UFO sighting data

A recent #TidyTuesday data set piqued my interest. It’s a rather large collection of worldwide reportings of UFO sightings.

Interesting.

You can download the data yourself and load it into R:

library("readr")

ufo_sightings = read_csv("ufo_sightings.csv")

ufo_sightings contains information about thousands of UFO sightings. Each sighting contains information such as the date and time (reported_date_time, day_part) of the sighting, the location (city, state, country) of the sighting, and other information such as a freetext summary. The summary column is by far the most interesting …

library("dplyr")
ufo_sightings %>%
 select(city, day_part, summary) %>%
 head()
# A tibble: 6 × 3
 city day_part summary
 <chr> <chr> <chr>
1 Pinehurst night Saw multi color object above horizon.
2 Rapid City nautical dusk An object in the shape of a straight line about an …
3 Cleveland night Tone in the air.
4 Bloomington afternoon Black tic-tac shaped ufo. Moved with insane speed
5 Irvine night Two alien were scanning me
6 Moore morning Long cigar solid shaped craft with light beam

What do we want to achieve?

The goal here is to fit a simple Bayesian model which will allow us to understand the historical counts of reported UFO sightings. The Bayesian approach to modelling is a probabilistic approach to modelling that has some advantages:

we are able to incorporate meaningful prior information about model parameters
including uncertainty in our predictions is natural and automatic.

A common drawback of Bayesian methods is the lack of fast and simple-to-use software to fit such models. With modern tools such as Stan, fitting Bayesian models is less of a headache!

We’re going to use Stan to fit our model, but I’ll be sparing you the details of the program, as well as many other details — we’ve linked to a Github repo at the end of this post with full analysis scripts. The purpose of this post is to give a high level overview of how we can fit flexible regression models for count data using Stan. We also touch upon how to work with Markov chain Monte Carlo (MCMC) output within a {tidyverse} framework towards the end of this post.

What is Stan?

Stan is a free, open source, C++ program used for specifying and fitting Bayesian models. Stan uses state of the art MCMC algorithms to fit your Bayesian models, so it is efficient and numerically stable. We don’t really have the time to delve too much into the reasons for using Stan today, but this previous post goes into considerable detail! If you’ve used JAGS or PyMC3 before, the concept behind Stan is similar; you specify your Bayesian model in the Stan language, and Stan takes care of the MCMC algorithm for you. Installing Stan is simple in R; calling

install.packages("rstan", dependencies = TRUE)

will install Stan for you (as well as the {rstan} package).

A note on how to interpret the analysis

It’s definitely worth stating up front that the purpose of this post is to have a little play with Stan and a fun data set. This data set contains reported UFO sightings; they’re not confirmed UFO sightings. Therefore, we can only make statements about reports and not the numbers of UFOs.

Let’s take a peek at the data

Prior to modelling the yearly counts, we should have a little look at the data. This can help us make informed modelling choices later down the line.

The number of sightings per year isn’t directly recorded in this data, but we can wrangle this out of the raw data with a few {dplyr} commands. We also only look at the GB data.

library("tibble")
library("lubridate")
library("dplyr")
library("tidyr") # provides complete() and full_seq()
sights_per_year = ufo_sightings %>%
  filter(country_code == "GB") %>%
  mutate(year_of_sighting = year(reported_date_time)) %>%
  summarise(
    sightings_per_year = length(year_of_sighting),
    .by = "year_of_sighting"
  ) %>%
  complete(
    year_of_sighting = full_seq(year_of_sighting, 1),
    fill = list(sightings_per_year = 0)
  )

We’ve now got annual counts of the number of sightings reported in Great Britain. The first date in the data set is 1938, so the counts start from then.
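The key step is filling in the years with no reports, so that they appear as zeros rather than being silently dropped. For readers more at home in Python, the same idea can be sketched with the standard library. The years in the example are made up for illustration and are not taken from the UFO data, and counts_per_year is a name of our own.

```python
from collections import Counter


def counts_per_year(years):
    """Count events per year, filling any missing years with zero."""
    counts = Counter(years)
    first, last = min(counts), max(counts)
    # Walk the full sequence of years so that gaps become explicit zeros
    return {year: counts.get(year, 0) for year in range(first, last + 1)}


example_years = [1938, 1940, 1940, 1943]  # made-up sighting years
print(counts_per_year(example_years))
# {1938: 1, 1939: 0, 1940: 2, 1941: 0, 1942: 0, 1943: 1}
```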

Okay, let’s plot the data:

library("ggplot2")
sights_per_year %>%
 ggplot() +
 geom_point(aes(x = year_of_sighting,
 y = sightings_per_year)) +
 xlab("Year of reported sighting") +
 ylab("Number of recorded sightings per year") +
 ggtitle("Yearly UFO sighting reports for Great Britain") +
 theme_minimal()

There are a few interesting features of the data set:

The data produces a complex pattern; this might be tricky to model!
The number of reports pre 2000 was generally small.
The number of reports increases from roughly the year 1995 until the late 2000s.
From 2010 onward, the number of reports is in rapid decline.

To be honest, I’m not really sure why we see a rapid increase of sightings from the mid 90s onwards. There are a few potential reasons for this:

There was an increase in UFO traffic over Earth from the mid 1990s to 2010 (maybe 👽)
The emergence of the internet brought like-minded people together, improving the ease of reporting (plausible)
The 1996 blockbuster Independence Day had some kind of effect on people (plausible)

The statistical approach

Our approach here will be to model the number of UFO sighting reports over time with a Negative Binomial regression model, using spline terms to flexibly model the non-linear trend. This wasn’t the first idea I had, but after a little bit of frustration, contemplation and iteration, this gave a reasonably good fit to the data.

Some previous approaches (we won’t delve into the details) involved simpler Poisson models.

The statistical model is as follows:

$ y\mid \lambda(X) \sim \text{NegBin} (\lambda(X), \phi) $

$ \log \lambda(X) = \alpha + X\beta $

Where $y$ is the number of sightings, $X$ is a matrix of spline terms (derived from the times at which sightings were observed), $\lambda(X)$ is the expected number of sightings and $\phi $ is a dispersion parameter. A model block in Stan for this statistical model might look a bit like this:

// Stan model block
model {
  // likelihood
  y ~ neg_binomial_2_log(alpha + X * beta, exp(log_phi));
  // prior
  alpha ~ normal(m_alpha, s_alpha); // intercept
  beta ~ normal(m_beta, s_beta); // spline coefficients
  log_phi ~ normal(m_phi, s_phi); // log dispersion term
}

There’s quite a bit going on here. Let’s break the model down a little:

The neg_binomial_2_log distribution is used to specify the likelihood. This is an alternative parameterisation of the Negative Binomial distribution; the first parameter of neg_binomial_2_log is $ \log {E (y \mid X, \alpha, \beta, \phi)} $, the second parameter is simply phi.
The 2 in neg_binomial_2_log tells us that this distribution is parameterised by the mean and dispersion (rather than by the shape and scale parameters).
The log tells us that we’re actually parameterising by the log mean, rather than raw mean.
alpha is an intercept term (for a linear predictor); beta is a vector of regression coefficients.
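
As a quick sanity check on this parameterisation, the sketch below simulates from the same mean-and-dispersion form in R. It relies on rnbinom() with its mu and size arguments, whose variance is mu + mu^2 / size, the same form as Stan’s neg_binomial_2. The values of log_mu and phi are made up for illustration and are not estimates from the UFO data.

```r
set.seed(42)
log_mu = log(20)  # log mean (assumed value, for illustration only)
phi = 5           # dispersion (assumed value)

# R's (mu, size) form matches the mean/dispersion parameterisation
y_sim = rnbinom(1e5, mu = exp(log_mu), size = phi)

sim_mean = mean(y_sim)
sim_var = var(y_sim)
# theoretical variance: mu + mu^2 / phi
theory_var = exp(log_mu) + exp(log_mu)^2 / phi
c(sim_mean = sim_mean, sim_var = sim_var, theory_var = theory_var)
```

Smaller values of phi give more overdispersion; as phi grows, the variance approaches the mean, which is the Poisson case.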

In our approach, we’re going to use R to generate spline terms for us, then pass these spline terms to Stan (and thus populate X). If you’re unfamiliar with splines, they’re clever devices which allow us to model non-linear behaviour. This crs vignette provides an introduction. Another approach would be to write a Stan function to construct the splines, as in this Stan case study. The advantage of the approach we’ve taken is that splines are not a hard-coded feature of the model, so we could use this Stan program for a more general Negative Binomial regression. The downside is that, if someone used this Stan model as part of a workflow not performed with R, we would have to carefully verify that the splines have been constructed in the same way as in the R implementation.

Constructing the Splines

From a data preparation perspective, the trickiest thing is probably constructing the B spline basis functions (the other parts of our Stan program can simply be specified). However, the bs() function from the {splines} package takes the hard work out of this. The following function call constructs our splines for us; we’ve specified that we want to use 10 ‘knots’ which are a part of the specification of our spline terms. We could try many numbers of knots and use model selection methods to pick the best number (or even model averaging methods), but we later see that 10 knots provides a reasonable fit of the data.

library("splines")
year_range = range(sights_per_year$year_of_sighting)
B = bs(sights_per_year$year_of_sighting,
 knots = seq(from = year_range[1],
 to = year_range[2],
 length = 10),
 degree = 3,
 intercept = TRUE)

Once we’ve done this, we’re basically ready to run our Stan program. All we need to do is collect all of our data together in a list, which we’ve called stan_data.
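
To get a feel for what bs() returns, here is a minimal sketch on a toy range of years. The years vector and the choice of 8 interior knots are assumptions for illustration, not the real data. With intercept = TRUE, the number of columns is the number of interior knots plus the degree plus one, and every row of the basis sums to 1.

```r
library("splines")

years = 1950:2014  # hypothetical range of years, not the real data
# 8 evenly spaced interior knots (boundary knots default to range(years))
interior_knots = seq(min(years), max(years), length.out = 10)[2:9]
B_toy = bs(years, knots = interior_knots, degree = 3, intercept = TRUE)

dim(B_toy)                   # one row per year, one column per basis function
row_totals = rowSums(B_toy)  # with intercept = TRUE each row sums to 1
```

Each column of the matrix is one basis function evaluated at every year, and it is these columns that the beta coefficients in the Stan model multiply.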

Performing the inferences

To perform our inferences, we’re going to use Stan with help from {rstan}. {rstan} provides an interface to Stan from R, as well as some other handy features like plotting functions. A bit of trial and error led me to use a thinning factor of 10 in the MCMC scheme, and a warmup period of 1000 proved to be adequate, so we’ll use these numbers again, and aim to have 4 chains of length 5000 with (approximately) no autocorrelation. If you’ve never used MCMC methods before, we typically specify a warmup (or “burn in”) period to account for the fact that an MCMC chain must “converge” to the region of high posterior density from the chain’s starting point. The thinning factor is used to account for the fact that Markov chains exhibit a dependence structure (like a time series might); if we keep only every thin-th iteration, we can reduce, or even eliminate, the autocorrelation in the chain. These steps allow us to better assess the quality of the MCMC scheme and also reduce computational overheads. If we didn’t thin, and kept the warmup period, we could end up with a very memory-intensive MCMC chain.

This is achieved with the following code. Our Stan program is in the file stan/nbin_reg.stan.

library("rstan")
options(mc.cores = 4) ## run chains in parallel (using 4 cores)
target_iter = 5000
warmup = 1000
thin = 10
total_iter = warmup + thin * target_iter

K = ncol(B)
stan_data = list(
 N = nrow(sights_per_year),
 K = K,
 y = sights_per_year$sightings_per_year,
 X = B,
 # priors
 m_alpha = 0,
 s_alpha = 1,
 m_beta = rep(0, K),
 s_beta = rep(1, K),
 m_phi = 0,
 s_phi = 1
 )

fit = stan(
 "stan/nbin_reg.stan",
 data = stan_data,
 chains = 4,
 iter = total_iter,
 warmup = warmup,
 thin = thin
)
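
To illustrate why thinning helps, the toy sketch below uses a simulated AR(1) series as a stand-in for an autocorrelated MCMC chain. It is not output from the UFO model, and the autocorrelation of 0.9 is an assumed value. Keeping every 10th value reduces the lag-1 autocorrelation from roughly 0.9 to roughly 0.9^10.

```r
set.seed(1)
n = 50000
rho = 0.9  # assumed lag-1 autocorrelation, chosen for illustration
chain = as.numeric(arima.sim(model = list(ar = rho), n = n))

thin = 10
thinned = chain[seq(from = 1, to = n, by = thin)]  # keep every 10th value

# lag-1 autocorrelation before and after thinning
lag1 = function(x) acf(x, lag.max = 1, plot = FALSE)$acf[2]
lag1_full = lag1(chain)
lag1_thinned = lag1(thinned)
c(full = lag1_full, thinned = lag1_thinned)
```

The trade-off is that we have to run thin times as many iterations to end up with the same number of stored draws, which is why total_iter above is warmup + thin * target_iter.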

Making the Stan output a bit more usable

The object fit is a stanfit object (an S4 class). These can be a bit awkward to work with, but {tidyverse} fans will find the {tidybayes} package offers a natural approach to working with MCMC output. Suppose our Stan program performs in-sample prediction at the years at which we observed the data via the following generated quantities block. The _rng suffix on neg_binomial_2_log_rng tells us we are performing random sampling from the neg_binomial_2_log distribution.

generated quantities {
int y_pred[N];
y_pred = neg_binomial_2_log_rng(log_lambda, exp(log_phi));
}

We might want to plot summaries of the distribution of y_pred over time (e.g. posterior quantiles as a function of time). In its raw format, wrangling this data from fit is a bit clunky. However, the tidybayes::spread_draws() function makes this simple! The only unusual thing to remember is that, if y_pred is a vector or array (in Stan), then we need to append [condition] to the variable name (in R) to preserve the fact that y_pred consists of many draws of an array of dimension N.

library("dplyr")
library("tidybayes")
## [condition] tells tidybayes to group by index of y_pred
tidy_fit = fit %>%
 spread_draws(y_pred[condition])
head(tidy_fit)

# A tibble: 6 × 5
# Groups: condition [1]
 condition y_pred .chain .iteration .draw
 <int> <dbl> <int> <int> <int>
1 1 0 1 1 1
2 1 0 1 2 2
3 1 0 1 3 3
4 1 3 1 4 4
5 1 0 1 5 5
6 1 2 1 6 6

We see here that, although we only have one “statistical” variable (y_pred), we have quite a few pieces of metadata. Firstly, we have condition - this is the element of y_pred that we have repeated samples from. .chain refers to the MCMC chain, .iteration is the draw within that chain, and .draw is essentially a unique id for each row of tidy_fit. The y_pred column is the randomly drawn value of y_pred at the chain-iteration combination.

The nice thing here is that because our Stan output is a tibble, we can use all of our favourite {tidyverse} tools to summarise the Stan output.

For example, to see which years had the largest posterior mean number of sightings, we can use the following snippet:

tidy_fit %>%
 reframe(mean = mean(y_pred),
 # condition indexes the rows of sights_per_year
 year = sights_per_year$year_of_sighting[condition]) %>%
 distinct() %>%
 arrange(-mean) %>%
 head(5)

# A tibble: 5 × 3
 condition mean year
 <int> <dbl> <dbl>
1 45 119. 2006
2 44 118. 2005
3 46 114. 2007
4 43 109. 2004
5 47 103. 2008

From this, we can see that the UFO heyday was the mid 2000s. Of course, plotting the data and predictions will give us a more complete picture. Similar logic would allow us to determine which spline terms were the most important; one approach is to grab the $ \beta$ terms (coefficients of spline terms) and order them by $ | E(\beta \mid \mathcal{D}) | $:

fit %>%
 spread_draws(beta[condition]) %>%
 summarise(mean_beta = mean(beta)) %>%
 arrange(-abs(mean_beta)) %>%
 head(3)

# A tibble: 3 × 2
 condition mean_beta
 <int> <dbl>
1 10 4.11
2 3 -2.21
3 5 -2.04

We see here that the 10th spline term is the most important, followed by the 3rd and 5th. Because MCMC algorithms are stochastic, your results might be slightly different to mine, but the main messages should be very similar.
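
The same group-and-summarise pattern extends to posterior intervals. The sketch below is written in base R so that it runs without a fitted model; toy_draws is a made-up stand-in for tidy_fit, and summarise_element() is a hypothetical helper rather than part of {tidybayes}.

```r
# Toy stand-in for tidy_fit: two elements of y_pred with 101 draws each (made up)
toy_draws = data.frame(
  condition = rep(1:2, each = 101),
  y_pred = c(0:100, 100:200)
)

# posterior mean and central 89% interval for one element of y_pred
summarise_element = function(x) {
  c(mean = mean(x),
    lower = unname(quantile(x, 0.055)),
    upper = unname(quantile(x, 0.945)))
}

# apply the summary to the draws of each element, one row per element
toy_summary = t(sapply(split(toy_draws$y_pred, toy_draws$condition),
                       summarise_element))
toy_summary
```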

Again, those familiar with {tidyverse} packages will find that {tidybayes} makes plotting posterior summaries of the data relatively straightforward:

tidy_fit %>%
 # condition indexes the rows of sights_per_year, so we can look up the year
 mutate(year_of_sighting = sights_per_year$year_of_sighting[condition]) %>%
 ggplot(aes(x = year_of_sighting, y = sightings_per_year)) +
 stat_lineribbon(aes(y = y_pred), .width = c(.97, .89, .73, .5), colour = "grey10") +
 scale_fill_brewer() +
 geom_point(aes(x = year_of_sighting, y = sightings_per_year), data = sights_per_year) +
 xlab("Year of reported sighting") +
 ylab("Number of recorded sightings per year") +
 ggtitle("UFO sighting reports for Great Britain,\nwith posterior summaries superimposed") +
 guides(fill = guide_legend(title = "Posterior\ncoverage")) +
 theme_minimal()

From our plot, we see that indeed, the mid 2000s were the peak for UFO sightings, and the model has captured this quite well. Uncertainty quantification is also good; we see that only a small number of points lie outside the 89% and 97% predictive bands. The median line (50%) follows the trend of the data closely and is also fairly smooth!

Summary

We’ve had a whirlwind tour of fitting flexible models for count data in Stan, and how to process the output using R and {tidybayes} to communicate our findings. UFO sightings certainly boomed during the 2000s, but in recent years, the skies appear to be somewhat empty. We only performed the analysis for the GB subset of the data. What would be interesting (but would take a while!) would be to construct a joint model for UFO sightings across all countries. We could then, for example, cluster the posterior distributions for curves to identify similar trends. This could allow us to investigate the Independence Day effect; if the effect is real, we would expect to see similar patterns in countries where Independence Day was popular. Of course, the same effect could be explained by other hypotheses!

We didn’t show you all the code to run the Stan model, but you can find a complete R script and Stan file to perform the analysis in our blogs repo. As mentioned, the MCMC algorithm is stochastic, so there may be small discrepancies between your results and mine.

If you think Stan is awesome and want to learn more, then why not consider attending one of our Rstan or PyStan courses? Our courses are a great hands-on and interactive way of getting up-and-running and fitting models with Stan!

Talks to watch at the RSS International Conference 2023

Tue, 29 Aug 2023 23:59:00 +0000

The Royal Statistical Society International conference is next week from 4-7 September 2023, hosted in Harrogate. Jumping Rivers are exhibiting at the conference, as well as delivering workshops and talks. The draft program can now be viewed online, so we wanted to let you know where you can find us at the event and some of the other sessions we are looking forward to.

Highlights

Teaching statistics interactively with webR

If you teach statistics using R and want to make your sessions more engaging, this talk is one to watch. Nicola Rennie will introduce webR and demonstrate its potential to revolutionise the way we teach data science.

GitHub: Version control for research, teaching and industry

Open-source coding practices are an integral part of software and model development in all applications of data science. In this session, the panel will discuss how GitHub can be used to develop models and applications more effectively across teaching, research, and industry.

How to avoid becoming an ornamental data scientist

The RSS Data Science and AI Section have toured the country asking practitioners and companies about their hopes and fears about a career in data science and AI. In this session they will outline how to become efficient, effective and ethical in your application of the statistical and algorithmic tools of the trade.

Jumping Rivers Talks and Events

Throughout the week, you can find Rhian Davies and Jack Kennedy at the Jumping Rivers exhibition stand. Come along and say hello and pick up one of our Jumping Rivers coasters!

Pre-conference workshop for early career researchers

On Monday 4th September, Jack and Rhian will both be presenting at the pre-conference workshop for early career researchers. Jack will be kicking off the afternoon, introducing the Young Statisticians section and welcoming the young statisticians to the conference, while Rhian will be delivering a skills workshop later in the afternoon on Building your data science portfolio.

Making Maps! Visualising spatial data in R

On Tuesday 5th September, Rhian will be delivering the workshop Making Maps! Visualising spatial data in R, which will cover the fundamentals of working with geospatial data in R. Can’t attend? We offer a training course on spatial data analysis.

Activities to reach a broader audience: RSS Ambassadors’ tips for communicating statistics

Communication is a big part of what we do at Jumping Rivers. In this session, Rhian and the other RSS Statistical Ambassadors will be sharing their tips and tricks for communicating statistical concepts. This interactive session will give audience members an opportunity to practise their communication skills.

Getting your work to work

On Wednesday 6th September, Rhian will be taking part in the panel on getting your work to work, talking about cleaning up messy code. She’ll share actionable tips to help you refactor your code and make it easier for collaborators to work with it.

Our ISO 27001 Certification

Thu, 24 Aug 2023 23:59:00 +0000

Hello from the Jumping Rivers team! Today, we’re taking a moment to chat about our recent achievement – becoming ISO certified.

What is ISO 27001 and Why Does It Matter?

ISO 27001 is an internationally recognised standard for information security management systems (ISMS). It provides a systematic approach to managing sensitive company information, ensuring its confidentiality, integrity, and availability. The standard outlines a framework that helps organisations identify and manage information security risks, implement appropriate controls, and continuously improve their security posture.

In today’s digitally driven world, where data breaches and cyberattacks are rampant, ISO 27001 offers a proactive approach to safeguarding sensitive information. It not only helps companies protect their own data but also builds trust with clients, partners, and stakeholders by demonstrating a commitment to maintaining robust information security practices.

Why We Chose the ISO Path

A couple of reasons nudged us towards these certifications:

The clients we interact with often required them.
It presented a brilliant opportunity for a bit of introspection. Were our current security practices up to scratch? We were keen to find out.

Our Route to Certification

While it was an enlightening six months, it wasn’t without its hurdles. We had to sift through our security practices and ensure they were robust. The real task, however, was fostering a company-wide understanding that security isn’t just an IT department’s concern – it’s everyone’s business. We enlisted the help of a consultant who really knew their stuff. They guided us through the intricacies of the ISO standards, ensuring we were on the right track.

The Statement of Applicability: An Analogy

Personally, my favourite exercise in the standard is the Statement of Applicability (SoA). Think of the SoA in the context of building a house. Imagine you’re constructing a new home and you want it to be safe and secure for your family. You wouldn’t just randomly choose security measures; you’d assess the risks, identify potential vulnerabilities, and then decide which security features to include.

Similarly, the Statement of Applicability is like the blueprint for securing your organisation’s digital “house.” It’s a crucial component of ISO 27001 implementation. The SoA lists the specific controls from the ISO 27001 standard that your organisation has chosen to implement based on its unique risk profile. These controls act as the security measures that protect your sensitive information. Just as you wouldn’t install an alarm system in your home if you live in a crime-free neighbourhood, you wouldn’t implement certain controls if they aren’t relevant to your organisation’s operations and risks.

The SoA ensures that your information security efforts are targeted, effective, and aligned with your business objectives. It’s a dynamic document that evolves as your organisation grows, risks change, and technology advances. Just as you might update your home security system as new threats emerge, you’ll revise your Statement of Applicability to adapt to evolving cybersecurity challenges.

An example of a control we’ve excluded from our Statement of Applicability is “Cabling Security,” which pertains to safeguarding power and telecommunications cabling carrying data or supporting information services. This control emphasises protection against interception, interference, or damage to physical cabling infrastructure.

Our decision to exclude this control stems from our company’s primary mode of operation, which is rooted in remote work and cloud-based infrastructure. Given that we extensively leverage major cloud providers for our server architecture, our reliance on physical on-site cabling is significantly limited. The inherent nature of cloud-based systems means that the responsibility for cabling security largely falls under the purview of these established providers.

By creating a well-thought-out Statement of Applicability, you’re essentially tailoring your security “blueprint” to fit your organisation’s needs, making your ISO 27001 implementation not just a compliance exercise, but a strategic decision that aligns with your business goals and risk appetite.

The Post-Certification Landscape

Since waving our ISO certificates about:

We’ve noticed more of a focus on processes across the company. They have become clearer and more streamlined. It’s less winging it, and more standardised, easy-to-follow instructions.
The procurement process with clients? It’s been smoother sailing. That certification tends to be the seal of approval many are looking for.

Staying the Course

We’re not ones to become complacent. We have a risk treatment plan in place to implement over the coming year up to our next audit, as well as regular internal audits on the horizon, so we’re all set to keep our standards sky-high.

Best Practices for Data Cleaning and Preprocessing

Thu, 17 Aug 2023 23:59:00 +0000

As data scientists, we often find ourselves immersed in a vast sea of data, trying to extract valuable insights and hidden patterns. However, before we embark on the journey of data analysis and modeling, we must first navigate the crucial steps of data cleaning and preprocessing. In this blog post, we will explore the significance of data cleaning and preprocessing in data science workflows and provide practical tips and techniques to handle missing data, outliers, and data inconsistencies effectively.

Why Data Cleaning and Preprocessing Matter

Data cleaning and preprocessing are fundamental steps in the data science process. High-quality data is essential for accurate analysis and modeling.

Improved Accuracy: Incomplete data can lead to biased results and inaccurate models.
Better Insights: Preprocessed data reveals more profound insights, patterns, and trends. Removing noise allows us to focus on the meaningful aspects of the data.
Model Performance: Machine learning models rely on clean data.

In this blog, we’ll embark on a journey of data processing with the R programming language. To navigate this journey, the {tidyverse} package, a powerhouse of interconnected tools, will allow us to efficiently examine our data. Let’s dive into the world of R and witness the magic of turning raw data into meaningful insights.

Load in the Required Packages

# Install the packages once if needed, then load them
# install.packages(c("tidyverse", "janitor"))
library(tidyverse)
library(janitor)

Create or load your data

df_1 <- tibble(
 id = 1:5,
 name = c("Alice", "Bob", "Amber", "Fred", "Eve"),
 Age = c(25, 31, NA, 23, NA),
 gender = c("Female", "Male", NA, "Male", "Female"),
 Score = c(80, 91, 87, 77, NA)
)

df_2 <- tibble(
 id = 6:7,
 name = c("Jenny", "Dave"),
 Age = c(29, 11),
 gender = c("Female", "Male"),
 Score = c(40, 70)
)

df_1
## # A tibble: 5 × 5
## id name Age gender Score
## <int> <chr> <dbl> <chr> <dbl>
## 1 1 Alice 25 Female 80
## 2 2 Bob 31 Male 91
## 3 3 Amber NA <NA> 87
## 4 4 Fred 23 Male 77
## 5 5 Eve NA Female NA
df_2
## # A tibble: 2 × 5
## id name Age gender Score
## <int> <chr> <dbl> <chr> <dbl>
## 1 6 Jenny 29 Female 40
## 2 7 Dave 11 Male 70

Addressing Data Inconsistencies:

Suppose the dataset combines data from different sources, which are stored differently. We can standardise these inconsistencies as follows:

Data Standardisation: We can standardise the names to follow a consistent format. In the example below, the column names “Age” and “Score” have been standardised to “age” and “score” using clean_names() from the {janitor} package. This lowercase naming convention is consistent with the other column names.

df_1 <- clean_names(df_1)
df_2 <- clean_names(df_2)

df_1
## # A tibble: 5 × 5
## id name age gender score
## <int> <chr> <dbl> <chr> <dbl>
## 1 1 Alice 25 Female 80
## 2 2 Bob 31 Male 91
## 3 3 Amber NA <NA> 87
## 4 4 Fred 23 Male 77
## 5 5 Eve NA Female NA
df_2
## # A tibble: 2 × 5
## id name age gender score
## <int> <chr> <dbl> <chr> <dbl>
## 1 6 Jenny 29 Female 40
## 2 7 Dave 11 Male 70

Data Integration: When combining data from multiple sources, ensure that all data fields align correctly.

Let’s combine the data frames df_1 and df_2 vertically by stacking their rows on top of each other to create a unified data frame, df.

df <- bind_rows(df_1, df_2)

df
## # A tibble: 7 × 5
## id name age gender score
## <int> <chr> <dbl> <chr> <dbl>
## 1 1 Alice 25 Female 80
## 2 2 Bob 31 Male 91
## 3 3 Amber NA <NA> 87
## 4 4 Fred 23 Male 77
## 5 5 Eve NA Female NA
## 6 6 Jenny 29 Female 40
## 7 7 Dave 11 Male 70
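
Before stacking rows, it is worth checking that the fields really do align. The sketch below uses small base R copies of the two sources (src_1 and src_2 are hypothetical names) and compares column names and column types before combining.

```r
# Toy copies of the two sources (base R data frames, values taken from above)
src_1 <- data.frame(id = 1:2, name = c("Alice", "Bob"), age = c(25, 31))
src_2 <- data.frame(id = 6:7, name = c("Jenny", "Dave"), age = c(29, 11))

same_names <- identical(names(src_1), names(src_2))
same_types <- identical(sapply(src_1, class), sapply(src_2, class))

# only stack the rows once both checks pass
stopifnot(same_names, same_types)
combined <- rbind(src_1, src_2)
combined
```

This matters because bind_rows() matches columns by name and fills columns missing from one input with NA, so a misspelt column name shows up as an unexpected extra column rather than an error.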

Managing Outliers:

Let’s assume that there are some extreme outliers in the dataset. We can deal with outliers as follows:

Visual Inspection: Plotting a scatter plot may reveal outliers as data points far away from the general trend. We can visually inspect these data points and decide how to deal with them. Deletion of outliers is only recommended when the data point is seen as a data-entry mistake, rather than unusual. However, getting the record corrected would be a better solution!

ggplot(df, aes(x = age, y = score)) +
 geom_point() +
 geom_smooth(method = "lm", se = FALSE) +
 labs(title = "Scatter Plot of Age vs. Score",
 x = "Age", y = "Score")

We see that there is one potential outlier. Typically, Score increases with Age, but Jenny’s score is very low, given her age.
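
A numerical complement to the visual check is to look at the residuals from the fitted line. This is a minimal sketch in base R using only the rows where both age and score are observed; it is one possible heuristic, not a formal outlier test.

```r
# Complete rows of the example data (rows with a missing age or score are left out)
scores <- data.frame(
  name = c("Alice", "Bob", "Fred", "Jenny", "Dave"),
  age = c(25, 31, 23, 29, 11),
  score = c(80, 91, 77, 40, 70)
)

fit_lm <- lm(score ~ age, data = scores)
scores$residual <- resid(fit_lm)

# the observation furthest from the fitted line
furthest <- scores$name[which.max(abs(scores$residual))]
furthest
```

With so few points, the suspect observation also pulls the fitted line towards itself, so residuals like these should be read alongside the plot rather than instead of it.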

Handling Missing Data:

Missing data is a common challenge in real-world datasets. Ignoring missing values or handling them poorly can lead to skewed conclusions. Some methods of handling missing data are:

Deletion: Remove rows or columns with missing values. This should only be done when the “missing-ness” is not related to the outcome of interest.
Imputation: Replace missing values with statistical measures such as the mean, median, or mode.
Advanced Techniques: Machine learning-based imputation methods, like K-nearest neighbors (KNN) or regression imputation, can be used for more accurate filling of missing values. This is the gold standard for imputation methods, and is most likely to reduce the bias in our models and findings.

Below are some common techniques for handling missing data. Here, missing data is addressed using mean and median imputation, replacing gaps in ‘age’, ‘score’, and ‘gender’ columns with appropriate measures. Subsequently, categorical variables are converted to factors and integers to ensure accurate analysis. The code also showcases advanced transformations such as encoding categorical variables as binary features and performing data splitting for machine learning models.

df <- df %>%
 mutate(age = replace_na(age, mean(age, na.rm = TRUE)),
 score = replace_na(score, median(score, na.rm = TRUE)),
 gender = replace_na(gender, "Unknown"))

df
## # A tibble: 7 × 5
## id name age gender score
## <int> <chr> <dbl> <chr> <dbl>
## 1 1 Alice 25 Female 80 
## 2 2 Bob 31 Male 91 
## 3 3 Amber 23.8 Unknown 87 
## 4 4 Fred 23 Male 77 
## 5 5 Eve 23.8 Female 78.5
## 6 6 Jenny 29 Female 40 
## 7 7 Dave 11 Male 70
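
For comparison, the deletion option from the list above can be sketched in base R. The raw data frame below is a copy of the combined example data before imputation; counting the missing values per column first shows how much data deletion would cost.

```r
# The combined example data before imputation (base R copy)
raw <- data.frame(
  id = 1:7,
  age = c(25, 31, NA, 23, NA, 29, 11),
  gender = c("Female", "Male", NA, "Male", "Female", "Female", "Male"),
  score = c(80, 91, 87, 77, NA, 40, 70)
)

n_missing <- colSums(is.na(raw))              # missing values per column
complete_rows <- raw[complete.cases(raw), ]   # keep rows with no missing values
n_missing
nrow(complete_rows)
```

Here deletion would discard two of the seven rows, which is a large share of such a small data set.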

Let’s explore some data cleaning and processing steps using the {tidyverse} package. The {tidyverse} package is an umbrella package; it imports useful packages for us. The ones we rely on below are {dplyr} and {tidyr}. Now let’s begin:

When working with data in R, it’s important to ensure that the data is in the right format for analysis and visualisation. Factors are data types in R that are used to represent categorical variables. Let’s convert the gender column to a factor and the age column to an integer. By converting the gender column to a factor, we’re telling R that the variable is categorical and has a limited set of possible values. Factors also help ensure that the data is treated correctly in statistical analyses and modeling.

df <- df %>%
 mutate(gender = as.factor(gender),
 age = as.integer(age))

df
## # A tibble: 7 × 5
## id name age gender score
## <int> <chr> <int> <fct> <dbl>
## 1 1 Alice 25 Female 80 
## 2 2 Bob 31 Male 91 
## 3 3 Amber 23 Unknown 87 
## 4 4 Fred 23 Male 77 
## 5 5 Eve 23 Female 78.5
## 6 6 Jenny 29 Female 40 
## 7 7 Dave 11 Male 70
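
Note that as.integer() truncates rather than rounds, which is why the imputed age of 23.8 became 23 above. A small sketch of the difference:

```r
mean_age <- mean(c(25, 31, 23, 29, 11))  # the imputed age used above
truncated <- as.integer(mean_age)         # drops the fractional part
rounded <- as.integer(round(mean_age))    # rounds to the nearest whole number first
c(truncated = truncated, rounded = rounded)
```

If rounding is what you want, call round() before converting.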

Encoding Categorical Variables

Many models don’t work with factors (categorical variables) straight out of the box. A simple workaround is to convert factors to a series of binary variables:

df_encoded <- df %>%
 mutate(is_female = as.numeric(gender == "Female"))

df_encoded
## # A tibble: 7 × 6
## id name age gender score is_female
## <int> <chr> <int> <fct> <dbl> <dbl>
## 1 1 Alice 25 Female 80 1
## 2 2 Bob 31 Male 91 0
## 3 3 Amber 23 Unknown 87 0
## 4 4 Fred 23 Male 77 0
## 5 5 Eve 23 Female 78.5 1
## 6 6 Jenny 29 Female 40 1
## 7 7 Dave 11 Male 70 0
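
The snippet above creates a single indicator. To get one binary column per level, one option in base R is model.matrix(); this is a sketch of that alternative, using a copy of the gender column.

```r
gender <- factor(c("Female", "Male", "Unknown", "Male", "Female", "Female", "Male"))

# "- 1" drops the intercept so that every level gets its own 0/1 column
indicators <- model.matrix(~ gender - 1)
colnames(indicators)
head(indicators, 3)
```

For a regression model with an intercept, one level is usually dropped to avoid perfectly collinear columns; model.matrix(~ gender) does this automatically.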

Data Transformation

Sometimes our models work better with transformed data. For example, if the distribution of a feature is highly skewed, a log or square root transform can improve the symmetry of its distribution:

# Apply square root transformation to age
df_encoded <- df_encoded %>%
 mutate(sqrt_age = sqrt(age))

df_encoded
## # A tibble: 7 × 7
## id name age gender score is_female sqrt_age
## <int> <chr> <int> <fct> <dbl> <dbl> <dbl>
## 1 1 Alice 25 Female 80 1 5 
## 2 2 Bob 31 Male 91 0 5.57
## 3 3 Amber 23 Unknown 87 0 4.80
## 4 4 Fred 23 Male 77 0 4.80
## 5 5 Eve 23 Female 78.5 1 4.80
## 6 6 Jenny 29 Female 40 1 5.39
## 7 7 Dave 11 Male 70 0 3.32
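
To see the effect on skewness directly, the sketch below applies square root and log transforms to a simulated right-skewed feature. The data are simulated, and the skewness() helper is defined here for illustration rather than taken from a package.

```r
set.seed(123)
x <- rlnorm(10000, meanlog = 0, sdlog = 1)  # simulated right-skewed feature

# sample skewness: third central moment over the cubed standard deviation
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)
skew_sqrt <- skewness(sqrt(x))
skew_log <- skewness(log(x))
c(raw = skew_raw, sqrt = skew_sqrt, log = skew_log)
```

For this kind of data the square root reduces the skew and the log removes almost all of it; which transform is appropriate depends on the feature and on the model.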

Feature Engineering

Feature engineering is just making new columns from old ones. For example, score per age could be found as:

# Create new feature: score_per_age
df_encoded <- df_encoded %>%
 mutate(score_per_age = score / age)

df_encoded
## # A tibble: 7 × 8
## id name age gender score is_female sqrt_age score_per_age
## <int> <chr> <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 Alice 25 Female 80 1 5 3.2 
## 2 2 Bob 31 Male 91 0 5.57 2.94
## 3 3 Amber 23 Unknown 87 0 4.80 3.78
## 4 4 Fred 23 Male 77 0 4.80 3.35
## 5 5 Eve 23 Female 78.5 1 4.80 3.41
## 6 6 Jenny 29 Female 40 1 5.39 1.38
## 7 7 Dave 11 Male 70 0 3.32 6.36

Data Splitting

This is an essential step for many machine learning models; we split the data into a training set to train the model on, and a test set to allow us to test model predictions. The tidymodels package offers a consistent and streamlined approach to data splitting and other aspects of modeling workflows, making it a powerful tool for data scientists.

# Install the tidymodels package once if needed, then load it
# install.packages("tidymodels")
library(tidymodels)

# Create a split index using initial_split
split_data <- initial_split(df_encoded, prop = 0.5)
split_data
## <Training/Testing/Total>
## <3/4/7>

# Extract the training and testing data sets
train_data <- training(split_data)
test_data <- testing(split_data)

train_data
## # A tibble: 3 × 8
## id name age gender score is_female sqrt_age score_per_age
## <int> <chr> <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 6 Jenny 29 Female 40 1 5.39 1.38
## 2 5 Eve 23 Female 78.5 1 4.80 3.41
## 3 7 Dave 11 Male 70 0 3.32 6.36
test_data
## # A tibble: 4 × 8
## id name age gender score is_female sqrt_age score_per_age
## <int> <chr> <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 Alice 25 Female 80 1 5 3.2 
## 2 2 Bob 31 Male 91 0 5.57 2.94
## 3 3 Amber 23 Unknown 87 0 4.80 3.78
## 4 4 Fred 23 Male 77 0 4.80 3.35

We use the initial_split() function to split the df_encoded dataframe into training and testing sets. The prop argument specifies the proportion of data to allocate to the training set. In this case, we’ve gone for a 50:50 split; with 7 rows, that gives 3 training rows and 4 testing rows. The training() and testing() functions are then used to extract the training and testing data sets.
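
The split is random, so the rows that land in each set will change from run to run unless a seed is set. The base R sketch below shows the idea behind a random split with a fixed seed; it is an illustration, not how initial_split() is implemented. The seed value is arbitrary.

```r
set.seed(2023)  # fix the seed so the same rows are selected on every run
n <- 7          # number of rows in the example data
train_idx <- sample(seq_len(n), size = floor(0.5 * n))
test_idx <- setdiff(seq_len(n), train_idx)

length(train_idx)  # rows for training
length(test_idx)   # rows for testing
```

Calling set.seed() before initial_split() makes that split reproducible in the same way.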

Advanced Data Cleaning and Processing Techniques

Data cleaning and preprocessing have evolved beyond the traditional methods. Advanced techniques such as Time-Series Imputation and Deep Learning-Based Outlier Detection can handle complex scenarios and yield more accurate results:

Time-Series Imputation: Missing values can disrupt patterns. Techniques like forward-fill, backward-fill, or using the last observation carried forward can be effective.
Deep Learning-Based Outlier Detection: Autoencoders can identify subtle outliers in high-dimensional data.
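
As a minimal sketch of forward-fill and backward-fill, here is a base R version on a made-up series. The locf() helper is written for this example; in a {tidyverse} workflow, tidyr::fill() does the same job on data frame columns.

```r
# last observation carried forward (forward-fill) for a vector
locf <- function(x) {
  idx <- cumsum(!is.na(x))   # position of the latest observed value so far
  idx[idx == 0] <- NA        # leading gaps have nothing to carry forward
  x[!is.na(x)][idx]
}

readings <- c(NA, 1, NA, NA, 4, NA)          # made-up series with gaps
forward_filled <- locf(readings)
backward_filled <- rev(locf(rev(readings)))  # backward-fill via reversal
rbind(readings, forward_filled, backward_filled)
```

Forward-fill leaves a leading gap untouched and backward-fill leaves a trailing one, so the two are sometimes applied one after the other.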

Deeper Dive into Feature Engineering:

Feature engineering goes beyond data cleaning — it’s about creating new attributes to improve model performance, for example:

Polynomial Features: Transforming features into higher-degree polynomials can capture non-linear relationships.
Interaction Features: Multiplying or combining features can reveal interactions between them.
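
Both ideas are one line each in R. The sketch below uses three rows of the example data in a base R data frame; the new column names are chosen for illustration.

```r
features <- data.frame(age = c(25, 31, 23), score = c(80, 91, 77))

features$age_squared <- features$age^2                  # polynomial feature
features$age_x_score <- features$age * features$score   # interaction feature
features
```

In a {dplyr} pipeline the same columns could be created inside mutate(), as in the earlier feature engineering example.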

Advanced data cleaning steps involve more specialised techniques that can help you handle complex scenarios.

Automation and Tools:

For R users, the journey of data cleaning and preprocessing becomes even more seamless due to powerful libraries and tools tailored to your needs. The {tidyverse} suite of packages offers {dplyr} for efficient data manipulation, {tidyr} for tidying up messy datasets, and {stringr} for handling text data, among others. Whether it’s imputing missing values, encoding categorical variables, or standardising features, R’s automation libraries such as {tidyverse} empower you to focus on extracting insights rather than getting caught up in manual data cleaning tasks. With these tools by your side, you can navigate the data preprocessing landscape with confidence and efficiency.

In conclusion, data cleaning and preprocessing are essential steps that pave the way for accurate analysis and reliable insights. By following the best practices outlined in this blog post, you can ensure that your data is well-prepared for modeling and analysis.

By addressing missing values, outliers, and inconsistencies, you’re laying a strong foundation for impactful data-driven decision-making. As you delve into more advanced techniques, explore feature engineering, and embrace automation, you’ll unlock even more potential from your data. So, whether you’re a data scientist, researcher, or business professional, embracing these practices will undoubtedly contribute to the success of your data-driven endeavours.

Remember that the effort invested in cleaning and preprocessing data is an investment in the quality of your results.

Happy data cleaning and preprocessing!

SatRdays London 2023 - Recordings

Thu, 27 Jul 2023 23:59:00 +0000

The recordings from this year’s SatRdays London conference are here! Over the next couple of weeks, we will be releasing the recordings of some of the excellent talks from the conference!

You’ll be able to find them all in this playlist as they’re released, and keep an eye on our Twitter feed to see them as they come out.

As a quick reminder, here’s a list of talks from the event:

Speakers

Keynote Speakers

Julia Silge (Posit): What is “production” anyway? MLOps for the curious
Oliver Hawkins (Financial Times): Why R is good for journalism

Contributed Talks

Botan Ağın & Michael Stevens (SamKnows): AutRmatic reporting: billions of internet measurements, hundreds of reports and one repository to rule them all
Vyara Apostolova & Laura Cole (National Audit Office): ScRutinising government spending (not recorded)
Andrew Collier (Fathom Data): Sidekicks of the Tidyverse
Jack Davison (Ricardo Energy & Environment): “Put it on a map!” – Developments in Air Quality Data Analysis
Russ Hyde (Jumping Rivers): Does code quality even matter in data science?
Ella Kaye & Heather Turner (University of Warwick): Sustainability and EDI (Equality, Diversity and Inclusion) in the R Project

Sponsors

Huge thanks again to all of the event sponsors, in particular CUSP London, who provided the venue and AV support, which allowed us to share these with you.

For updates and revisions to this article, see the original post

Generate multiple presentations with Quarto parameters

Thu, 20 Jul 2023 23:59:00 +0000

Parameterised reporting is a powerful technique that allows you to create dynamic and customisable reports by incorporating user-defined parameters. These parameters act as placeholders that can be easily modified to generate tailored reports based on specific inputs or conditions, enabling seamless updates to reports without the need for manual modifications. Quarto, a modern and flexible document generation tool, provides excellent support for parameterised reporting.

In this blog, we will be looking at a Quarto Reveal JS presentation as an example. By defining parameters within a Quarto presentation, you can easily add flexibility and interactivity to your presentations, allowing you to tailor the content to the specific needs or preferences of your audience.

Creating a Reveal JS presentation

You can easily create a Reveal JS presentation in RStudio with File > New File > Quarto Presentation > Reveal JS. This will create a Quarto file (let’s call it slides.qmd) as usual. We are going to be using a slightly modified version of the TidyTuesday data set on UFO sightings. The CSV file for the data set is available on our GitHub. In this data set, we have information on UFO sightings between 2019 and 2022 from different US states.

We will update the YAML for our presentation to add a title as well as update the theme to make it look a little bit nicer.

---
title: "UFO Sightings"
format:
  revealjs:
    theme: simple
---

We’ll also add some general package-loading and data-reading code to the top of our presentation.

library("dplyr")
library("ggplot2")

ufo_sightings = readr::read_csv("ufo_sightings.csv")

Let’s say we wanted to include a histogram of UFO sighting durations in our presentation. For Idaho in 2022, the code would look something like this:

ufo_subset = ufo_sightings %>%
  dplyr::filter(
    year == 2022,
    state_name == "Idaho"
  )

to create a subset of the full UFO sighting data set and then:

ufo_subset %>%
  ggplot(aes(x = duration_seconds)) +
  geom_histogram(fill = "#c74a4a") +
  labs(x = "Duration (seconds)", y = "") +
  scale_x_continuous(labels = scales::comma_format()) +
  theme_minimal()

to create the histogram itself. We can then add this to a slide of our presentation and add the heading “Sighting duration”.

Now, what if we wanted to create this plot for another state-year combination? This is where we need parameters.

Using parameters in Quarto

To add parameters to your Quarto document or presentation, you need to use the params option in the YAML. We want to be able to generate our report flexibly with different combinations of US state and year, so we will create a parameter for each of them. We will use Idaho and 2022 as the default values for these parameters.

---
title: "UFO Sightings"
format:
  revealjs:
    theme: simple
params:
  state: "Idaho"
  year: 2022
---

These parameters are then stored in the params list, which is accessible from within your presentation. So we can now update our code from before to rely on params$year and params$state instead of the hard-coded year and state.

ufo_subset = ufo_sightings %>%
  dplyr::filter(
    year == params$year,
    state_name == params$state
  )

Now, our plot will automatically update each time we re-generate the presentation with different parameters.

Before we go through how to actually generate the presentation with different values than the defaults, let’s first also add a subtitle to the presentation which will change as the parameters change. So, something like:

---
title: "UFO Sightings"
subtitle: "`r params$state`, `r params$year`"
format:
  revealjs:
    theme: simple
params:
  state: "Idaho"
  year: 2022
---

Render with parameters

To render our presentation with different parameters, we have a few different options.

If you prefer to render your presentation using an R function, you can use quarto::quarto_render() to render your presentation. You’ll just need to provide the input .qmd file, as well as a list of the parameters with the execute_params argument. So, if you wanted to generate the presentation for Alabama in 2021 this time, your command would look something like:

quarto::quarto_render(input = "slides.qmd",
                      execute_params = list("year" = 2021,
                                            "state" = "Alabama"))

If you’d rather render your presentation from the command line, you can also easily do so with the quarto render command paired with the -P parameter flag.

quarto render slides.qmd -P year:2021 -P state:"Alabama"

You can also supply a YAML file of key:value pairings when rendering your presentation with parameters. Simply create a file called params.yml, and define your parameters. To change the parameters for a new run, you can just update your YAML file.

Your YAML file would look something like:

# in params.yml
year: 2020
state: 'Alabama'

and then, to render:

quarto render slides.qmd --execute-params params.yml

Rendering multiple parameter combinations at once

Being able to render a presentation with different parameters is useful. But let’s say you needed a single presentation for each combination of state and year. You’d end up needing to manually render 250 separate presentations. So, we want to automate rendering multiple combinations of parameters at once.

Instead of rendering 250 files, let’s take a sample of 3 states and 2 years, so we’ll end up with 6 presentations in total. We then create a tibble of the year-state combinations we want to generate presentations for:

years = unique(ufo_sightings$year)[1:2]
states = unique(ufo_sightings$state_name)[1:3]

params = tidyr::crossing(
  year = years,
  state = states
)
params

You can then either build a for loop or use the {purrr} package to iterate over the state-year combinations. If you want to learn more about iteration, check out our Programming with R and our Functional Programming with purrr courses.

Here, we’re using the walk2() function from {purrr} to iterate over the different year-state combinations to create multiple files. The walk2() function lets you iterate over two inputs simultaneously and is used when your function has a side effect, such as writing a file, rather than wanting the output returned as an R object.

We also include our input parameters in the output file name to allow us to distinguish between the multiple output files:

purrr::walk2(params$year, params$state, ~quarto::quarto_render(
  input = "slides.qmd",
  execute_params = list("year" = .x,
                        "state" = .y),
  output_file = glue::glue("{.y}_{.x}.html")
))

Running this command, you end up with 6 aptly named output files, one per state-year combination, each following the pattern <state>_<year>.html.

And there you have it! Generating multiple presentations or reports at once is a fairly straightforward process when using Quarto to render your outputs. You can of course extend this logic and create much more in-depth reports or presentations with different kinds of outputs, including text summaries, which depend on input parameters.

To see the full code behind this blog post, as well as some further examples in a more fleshed out Quarto report, check out the blogs repo on our GitHub.

For updates and revisions to this article, see the original post

Shiny in Production 2023

Tue, 18 Jul 2023 23:59:00 +0000

With the early bird deadline approaching, we thought now would be a great time to tell you a bit more about what to expect at this year’s Shiny in Production!

As with last year’s conference, SIP2023 will take place over a day and a half at the Catalyst in Newcastle upon Tyne, UK. The first day (Thursday 12th October) will consist of three parallel workshops, followed by a drinks reception in the evening, a great opportunity for networking and debriefing from the day’s learning.

The second day (Friday 13th October) will be full of talks from speakers across industry, telling us all about all things Shiny and other web based R tools. If you want more of an idea of what to expect, check out the playlist of talks from last year’s conference.

So far this year we have announced 3 of our invited speakers, and we are currently reviewing the submissions from the recent call for abstracts. The line up so far is:

Anna Skrzydło (Appsilon) - 3 reasons why nobody uses your app
George Stagg (Posit) - Title TBC
Cara Thompson (Freelance data consultant) - Dynamic annotations: tips and tricks to make text shine without stealing the show

For updates and revisions to this article, see the original post

Changing the world with Data: An outreach event

Thu, 06 Jul 2023 23:59:00 +0000

Earlier this year, two data scientists from Jumping Rivers ran an outreach activity for 14-19 year olds across the UK, in collaboration with the youth charity Speakers for Schools.

The three-hour workshop focussed on how to create visualisations that are both visually appealing and useful to the viewer. We started with a few visualisations built from questions we asked at sign-up (the attendees’ favourite fast food restaurants and snacks), then showed examples that challenge the view that data visualisation is all bar charts and scatter plots - think football pundit analysis and tube maps!

The session culminated with us turning the tables - we became the clients, specifically, two biologists, recently returned from their studies of some penguin species in Antarctica! We asked them to design a dashboard to help us explore all of the data we had collected.

Those of you familiar with the R programming language will probably recognise the reasoning for this particular leap in professions. The {palmerpenguins} package, originally published in 2020 by Allison Marie Horst, Alison Presmanes Hill and Kristen B. Gorman, is an excellent resource for teaching data exploration and visualisation! And it even comes with some adorable artwork that you can download to help.

Artwork by @allison_horst

We developed a Shiny app that could be accessed by attendees via their browsers, which allowed them to create their own visualisations of the penguin data. With options to change the size, shape and colour of the points based on the data, the ability to play around with the axes, colour palettes and themes, as well as to add customised labels and choose to split by year, or add a line of fit, there was a lot to keep you busy!

We wanted to make sure we had enough options to make it fully customisable, while also giving attendees the opportunity to make something truly awful, if they wanted, to demonstrate some dos and don’ts of data visualisation.

While we can’t share the creations here, we were very impressed with the creativity that we saw throughout the whole workshop. The collaborative whiteboards were well used for introductions via the medium of sketched visualisations!

We hope that the attendees of this workshop enjoyed themselves as much as we did! If you’re interested in learning how to make great visualisations, or how to create Shiny apps, take a look at our course page. We offer courses on visualisation in R and Python, as well as a variety of courses on Shiny, from an introduction for Shiny beginners, to more advanced concepts and web accessibility.

For updates and revisions to this article, see the original post

July Training Update

Tue, 04 Jul 2023 23:59:00 +0000

Embark on your programming odyssey with our extensive range of courses! Never written a line of code in your life? No stress - we offer a mix of introductory courses for beginners as well as more advanced courses for those looking to expand their knowledge further.

Over the summer and autumn months, we will be offering training in the popular programming languages Python and R, plus additional courses on Quarto, Git and SQL.

R

We have something for everyone with our R courses, whether it’s statistical modelling and machine learning you’re after, or data visualisation with {ggplot2} and {shiny}.

Statistical Modelling with R

Course Level: Intermediate

Next course date: 17th July 2023 (DEADLINE 10th July)

Data Visualisation with ggplot2

Course Level: Intermediate

Next course date: 4th September 2023

Spatial Data Analysis with R

Course Level: Intermediate

Next course date: 18th September 2023

As spatial data sets get larger, more sophisticated software needs to be harnessed for their analysis. R is now a widely used open source software platform for working with spatial data thanks to its powerful analysis and visualisation packages. The focus of this course is providing participants with the understanding needed to apply R’s powerful suite of geographical tools to their own problems.

Introduction to Shiny

Course Level: Intermediate

Next course date: 2nd October 2023

Time Series Analysis with R

Course Level: Intermediate

Next course date: 30th October 2023

Predicting the future is a tough problem. Time series analysis makes it possible to assess whether or not predictions are possible and, if they are, build a model which can generate informed predictions for the future with realistic estimates of uncertainty. This training course will introduce participants to the packages in the Tidyverts.

Building an R Package

Course Level: Advanced

Next course date: 1st November 2023

This is a one-day intensive course on building a package in R. The focus will be on getting a working R package ready for distribution. This includes automating package setup and consistent package structure with {usethis}. You will be able to use the {testthat} workflow to create tests for packages.

Machine Learning with Tidymodels

Course Level: Intermediate

Next course date: 6th November 2023

Advanced Machine Learning with Tidymodels

Course Level: Advanced

Next course date: 8th November 2023

Python

With our Python courses, you will start from programming basics and work your way up to data visualisation and machine learning.

Introduction to Python

Course Level: Foundation

Next course date: 7th August 2023

Programming with Python

Course Level: Intermediate

Next course date: 21st August 2023

Data Visualisation with Python

Course Level: Intermediate

Next course date: 4th October 2023

Machine Learning with Python

Course Level: Intermediate

Next course date: 16th October 2023

Python (along with R) has become the dominant language in machine learning and data science. This course will equip you with the knowledge and tools to undertake a variety of tasks in a standard machine learning pipeline. We stress the importance of data preparation, both in terms of data standardisation and feature selection, before tackling model building.

Other courses

We are also offering several language-agnostic courses spanning automated reporting with Quarto, version control with Git, and relational databases with SQL.

Reporting with Quarto

Course Level: Intermediate

Next course date: 14th August 2023

Git for Me

Course Level: Foundation

Next course date: 6th September 2023

When working on data analysis projects, version control is essential for tracking project progress and assisting project collaboration. During this course we will show you multiple ways to integrate version control into your project with Git. You will gain an understanding of how to use online code sharing websites such as GitHub / GitLab, along with the best practices while doing so.

Introduction to SQL

Course Level: Foundation

Next course date: 20th September 2023

For updates and revisions to this article, see the original post

Fullscreen Ahead for Shiny Applications

Thu, 08 Jun 2023 23:59:00 +0000

Browsers have been implementing variations on a JavaScript fullscreen API for over a decade. Unfortunately, for much of that time the APIs varied across browsers. This made actually using it in production somewhat cumbersome.

Finally, with the release of Safari 16.4 in March of this year, the latest versions of all major desktop browsers now support a single, standardized interface. Legacy versions of Safari for desktop are still in use and there’s still no support at all for the Fullscreen API on iPhones; so while you can cover most users with the standardized API, it should still be for progressive enhancement and not as a fundamental requirement for operation of an application.

In this post I’m going to show how we can enhance a toy Shiny application with fullscreen behaviour using only a few lines of JavaScript. Unfortunately, I did have issues using the fullscreen API with the browser that comes with RStudio: while at least some of the methods exist, calling them led to errors being thrown. Because of this, we will launch the app we build straight into the system’s default browser.

You can find all the code on our GitHub blog repository under “fullscreen-shiny”.

The Shiny App

For the toy Shiny application we’ll use the txhousing dataset from {ggplot2}. The full R code is provided below, but the ui function is the most relevant bit:

library imports

# ./app.R
library("shiny")
library("tidyverse")
library("glue")
# Launch in system's default browser
options(shiny.launch.browser = .rs.invokeShinyWindowExternal)

ui function

ui = fluidPage(
  tags$head(
    tags$link(rel = "stylesheet", href = "style.css"),
    tags$script(src = "fullscreen.js")
  ),
  titlePanel(title = "Texas housing dashboard"),
  sidebarPanel(selectInput(
    "city", "City", unique(txhousing$city), selectize = FALSE
  )),
  mainPanel(
    tags$div(
      plotOutput("salesPlot", height = "100%"),
      "class" = "plot-container",
      "tabindex" = "0"
    ),
    tags$div(
      plotOutput("volumePlot", height = "100%"),
      "class" = "plot-container",
      "tabindex" = "0"
    ),
    tags$div(
      plotOutput("medianPlot", height = "100%"),
      "class" = "plot-container",
      "tabindex" = "0"
    ),
    tags$div(
      plotOutput("listingsPlot", height = "100%"),
      "class" = "plot-container",
      "tabindex" = "0"
    )
  )
)

server function

server = function(input, output, session) {
  # Rescale the large monetary columns and build a proper date column
  baseData = txhousing %>%
    mutate(
      volume = volume / 1000000,
      median = median / 1000,
      date = as.Date(glue("{year}-{month}-01"), "%Y-%m-%d")
    )

  data = reactive({
    baseData %>%
      filter(city == input$city)
  })

  dates = as.Date(c("2000-01-01", "2015-07-01"), "%Y-%m-%d")

  formatLabels = function(label) {
    str_pad(label, 6, pad = " ")
  }

  createPlot = function(data, yProp, yTitle) {
    ggplot(data) +
      geom_line(aes(x = date, y = .data[[yProp]])) +
      labs(x = "Date",
           y = yTitle) +
      scale_x_date(limits = dates,
                   expand = expansion(mult = c(0.025, 0))) +
      scale_y_continuous(
        labels = formatLabels,
        limits = c(0, NA),
        expand = expansion(mult = c(0, 0.025))
      ) +
      theme(
        text = element_text(size = 14, colour = "black"),
        axis.text = element_text(family = "mono", size = 12),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank()
      )
  }

  output$salesPlot = renderPlot({
    createPlot(data(), "sales", "Number of sales\n")
  })

  output$volumePlot = renderPlot({
    createPlot(data(), "volume", "Total value of sales\n(millions)")
  })

  # The median column was scaled to thousands in baseData above
  output$medianPlot = renderPlot({
    createPlot(data(), "median", "Median sale price\n(thousands)")
  })

  output$listingsPlot = renderPlot({
    createPlot(data(), "listings", "Total active listings\n")
  })
}

shinyApp call

shinyApp(ui = ui, server = server)

The accompanying CSS file is very short:

www/style.css

h2 {
 margin-top: 5px;
}

.plot-container {
 height: 190px;
 cursor: pointer;
 padding-top: 5px;
 margin-bottom: 15px;
}

.plot-container:fullscreen {
 cursor: default;
}

.plot-container:last-child {
 margin-bottom: 5px;
}

Open this app in a desktop browser and you should see something like this:

One thing of note from the ui function: I set the heights of the plots to be 100% of their containers:

tags$div(
  plotOutput("salesPlot", height = "100%"),
  "class" = "plot-container",
  "tabindex" = "0"
),

The heights of the containers themselves were then set in the CSS file:

.plot-container {
 height: 190px;
 cursor: pointer;
 padding-top: 5px;
 margin-bottom: 15px;
}

From the above R-code snippet you will also see that I gave the containers a tabindex value of “0”. I’ll explain why later.

Aside: ugly hacks

Notice that the four plots are all created as separate images, not as a single matrix. This is so that they can separately be fullscreened, as we’ll see shortly. However, because the charts are independent of each other and the y axes have different units and labels, out of the box the horizontal axes did not line up. To get around these issues I implemented a few hacks in the server function. There are probably better solutions out there, but I:

(pre-)padded the y-axis labels with whitespace so all labels had the same number of characters,

formatLabels = function(label) {
  str_pad(label, 6, pad = " ")
}

set the axis-label font to “mono” so all the equal-length labels took up the same space,

axis.text = element_text(family = "mono", size = 12),

added a newline character at the end of the shorter y-axis labels so that they took up two lines of space like the longer y-axis labels.

output$salesPlot = renderPlot({
  createPlot(data(), "sales", "Number of sales\n")
})

The Basic JavaScript

While it’s perfectly possible to use the fullscreen API with only vanilla JavaScript, Shiny already adds jQuery to the page (aliased as $) so we’ll use it for convenience and brevity. We’ll begin by using the ready method to ensure the code inside the supplied function isn’t run until the page has loaded and our plot containers are a part of it:

$(function() {
  'use strict';

  // Interesting code goes here
});

The first thing we can do is check if the fullscreen API is actually supported. If it’s not we can give up straight away.

if (!document.fullscreenEnabled) {
  return;
}
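
As a sketch only (it is not part of the fullscreen.js built in this post), the same check can be wrapped in a function that takes the document as an argument, which makes it easy to try outside a browser. The names fullscreenSupported and requestFullscreenCompat are hypothetical. The webkit-prefixed fallbacks are an optional extra for the legacy Safari versions mentioned earlier, and assume those versions expose webkitFullscreenEnabled and webkitRequestFullscreen.

```javascript
// Hypothetical helper, not part of the article's fullscreen.js.
// Returns true if the standard flag, or the legacy WebKit-prefixed flag, is set.
function fullscreenSupported(doc) {
  return Boolean(doc.fullscreenEnabled || doc.webkitFullscreenEnabled);
}

// Hypothetical helper: call whichever request method the element provides.
// Returns undefined when neither method exists.
function requestFullscreenCompat(element) {
  if (typeof element.requestFullscreen === 'function') {
    return element.requestFullscreen();
  }
  if (typeof element.webkitRequestFullscreen === 'function') {
    return element.webkitRequestFullscreen();
  }
  return undefined;
}
```

In the browser you would call fullscreenSupported(document) in place of reading document.fullscreenEnabled directly. Whether the prefixed fallback is worth the extra code depends on how many of your users are on older browsers.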

Now we’ll add a helper function to check whether fullscreen mode is already in action:

function isFullscreen() {
  return !!document.fullscreenElement;
}

This function is very simple and isn’t necessary, but (I think) it does make the later code we’ll see a little easier to read.

Now let’s use jQuery again to grab our plot containers:

const $plotContainers = $('.plot-container');

and add a very simple event handler to them for when they are double-clicked on:

$plotContainers.on('dblclick', function() {
  if (isFullscreen()) { return; }
  this.requestFullscreen();
});

The first line of the body checks we’re not already in fullscreen. The second line uses the special this variable. Inside jQuery event handlers, this refers to the specific document element on which the event listener was triggered so all we need to do with it is requestFullscreen. And that’s it! Double-click on/near a plot and it will go fullscreen and look something like this:

You’ll see — if you try this for yourself — that not only does the container resize, the plot does shortly after. I didn’t have to write any JavaScript to make the latter trick happen. The only thing I had to do was, as mentioned earlier, set the plots to be 100% the height of their container (the width already is by default) in the R code:

plotOutput("salesPlot", height = "100%"),

When the browser puts the plot container into fullscreen it forces that element to be 100% wide and tall (superseding the “190px” value I set in the CSS). After that happens, the Shiny JavaScript code magically (I’m 90% sure it’s not actually magic) notices the image is too small and requests a new, bigger, one from the server.
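
If you want to react to the transition yourself (say, to show or hide a hint), the standard API fires a fullscreenchange event on the document. The sketch below is not how Shiny detects the resize, which this post does not cover. It is a separate illustration, and watchFullscreen is a hypothetical name. The document is passed in as an argument so the function can be tried outside a browser.

```javascript
// Hypothetical helper, not part of the article's fullscreen.js.
// Calls `callback` with the element that is now fullscreen,
// or with null when fullscreen mode has just been exited.
function watchFullscreen(doc, callback) {
  doc.addEventListener('fullscreenchange', function () {
    callback(doc.fullscreenElement || null);
  });
}

// In the browser, usage might look like:
// watchFullscreen(document, function (element) {
//   console.log(element ? 'entered fullscreen' : 'left fullscreen');
// });
```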

There’s one other tiny little tweak that’s worth mentioning. The CSS sets the cursor style to pointer for the plot containers (hoping to remind an informed user the plot can be blown up if double clicked). The following rule makes use of the :fullscreen pseudoclass to unset it again when (double)-clicking no longer has an effect:

.plot-container:fullscreen {
 cursor: default;
}

You could, of course, use double-click to exit fullscreen, too. But the browser will provide the user with means to exit (press Esc, click a button) and using double-click to make something smaller doesn’t feel intuitive to me.
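
For completeness, the double-click-to-exit variant could look something like the sketch below. It is not used in this post’s script, and toggleFullscreen is a hypothetical name. The document and element are passed in as arguments so the logic can be tried outside a browser.

```javascript
// Hypothetical alternative to the dblclick handler body.
// `doc` is the document; `element` is the plot container that was double-clicked.
function toggleFullscreen(doc, element) {
  if (doc.fullscreenElement) {
    // Something is already fullscreen: leave fullscreen mode
    return doc.exitFullscreen();
  }
  // Nothing is fullscreen yet: enlarge the clicked element
  return element.requestFullscreen();
}

// In the browser this could be wired up as:
// $plotContainers.on('dblclick', function () {
//   toggleFullscreen(document, this);
// });
```

With this variant the :fullscreen cursor rule would no longer be accurate, since double-clicking still has an effect in fullscreen.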

Adding keyboard functionality

You’ll recall I mentioned adding a tabindex value of “0” to each of the plot containers. This means they can be focused by a keyboard user who uses the “Tab” key to move around the page.

With a little extra JavaScript we can make the fullscreen behaviour keyboard accessible:

$(document).on('keydown', function(event) {
  const code = event.originalEvent.code;
  if (code !== 'Enter' || isFullscreen()) { return; }
  const focus = document.activeElement;
  if ($plotContainers.toArray().includes(focus)) {
    focus.requestFullscreen();
  }
});

The first line inside the event handler checks which key has been pressed. If that key is “Enter” and we’re not already in fullscreen we check which element currently has focus. If that element is one of our plot containers we make it fullscreen.
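
The decision described above can also be pulled out into a small pure function, which makes the logic easy to check without a browser. This is only a sketch: shouldGoFullscreen is a hypothetical name and does not appear in the article’s script.

```javascript
// Hypothetical helper isolating the decision made in the keydown handler.
// code:              the key code from the event, e.g. 'Enter'
// alreadyFullscreen: whether an element is currently fullscreen
// focused:           the element that currently has keyboard focus
// containers:        array of plot container elements
function shouldGoFullscreen(code, alreadyFullscreen, focused, containers) {
  // Ignore every key except Enter, and do nothing if already fullscreen
  if (code !== 'Enter' || alreadyFullscreen) { return false; }
  // Only act when the focused element is one of the plot containers
  return containers.includes(focused);
}
```

Inside the handler, this would be called with event.originalEvent.code, isFullscreen(), document.activeElement and $plotContainers.toArray().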

The whole JavaScript script

For convenience and clarity, here’s the full fullscreen.js script, with comments added:

www/fullscreen.js

$(function () {
  'use strict';

  // If fullscreen is not supported jump right out
  if (!document.fullscreenEnabled) {
    return;
  }

  // Simple helper to return a Boolean indicating whether
  // already in fullscreen mode
  function isFullscreen() {
    return !!document.fullscreenElement;
  }

  // Get all the plot containers
  const $plotContainers = $('.plot-container');

  // Make plots go fullscreen when double-clicked
  $plotContainers.on('dblclick', function () {
    if (isFullscreen()) { return; }
    this.requestFullscreen();
  });

  // Add keyboard controls
  $(document).on('keydown', function (event) {
    // Get name of key pressed
    const code = event.originalEvent.code;
    // If the user presses something other than Enter or
    // we're already in fullscreen we can jump straight out...
    if (code !== 'Enter' || isFullscreen()) { return; }
    // Find the element that currently has focus
    const focus = document.activeElement;
    // If that element is one of our plots...
    if ($plotContainers.toArray().includes(focus)) {
      // ...make it fullscreen
      focus.requestFullscreen();
    }
  });
});

Quick notes on accessibility

While we’ve added both mouse and keyboard controls for entering fullscreen, you only know how they work — and that they exist at all — because I’ve outlined them in this article! That is, to keep things simple I’ve omitted instructions in the actual app. In the real world, how fullscreen can be entered should be made clear to all users of the app, not just those who’ve read an accompanying blog post.

For similar reasons, I’ve omitted alt text from the charts, which is also bad for accessibility. You should see our earlier blog post on “Alt Text in R: Plots, Reports, and Shiny” for advice on how to do alt text well.

Finally, the items being made fullscreen here are graphics. But any element can be made fullscreen in browsers that support the API. That includes elements containing descendant focusable elements. In that case be sure to check the behaviour of these elements isn’t adversely affected by the change and ensure they are still accessible to both mouse and keyboard users.

For updates and revisions to this article, see the original post

June Training Update

Tue, 06 Jun 2023 23:59:00 +0000

This summer, we have public courses to take you all the way from the very basics of R, through to using R for statistical modelling, with some data wrangling and intermediate programming in between. Wherever you are on your R journey, take a look at our upcoming courses to see if we can help you on your way.

Introduction to R

Course Level: Foundation

Next course date: 26th June 2023

Data Wrangling in the Tidyverse

Course Level: Foundation

Next course date: 3rd July 2023

Programming with R

Course Level: Intermediate

Next course date: 10th July 2023

Statistical Modelling with R

Course Level: Intermediate

Next course date: 17th July 2023

For updates and revisions to this article, see the original post

Shiny in Production 2022: A recap

Thu, 25 May 2023 23:59:00 +0000

With the planning for this year’s Shiny in Production conference well under way, we thought now was a good time for a little recap of what happened last year.

Day One: Workshops

The first day of the conference consisted of three workshops delivered by our very own JR trainers.

The most popular workshop of the day was Introduction to RStudio (now Posit) Connect, which we will be running again in the 2023 conference. This workshop demonstrated a few different workflows to allow you to host, share and scale content such as APIs, Shiny applications and R Markdown documents with RStudio Connect.

We also ran a workshop on an Introduction to Tableau, demonstrating the basics of using this software to summarise and interactively visualise data. If you’re interested in learning more about Tableau, take a look at our Tableau courses, Introduction to Tableau and Data Exploration with Tableau.

Last but certainly not least, we ran a workshop on Automated Reporting with Quarto. Quarto is a brand new open source publishing system that allows you to dynamically create static or interactive documents and automatically update reports when data changes. This workshop demonstrated how to make a range of outputs, from simple documents to presentations and dashboards. If you’re interested in learning more about Quarto, we have a Reporting with Quarto course, which you might want to check out.

Day Two: Talks

Last year’s speakers set the bar high for our first conference, and we’re excited to see what this year’s speakers bring to the table! For a full rundown of the talks, take a look at the highlights blog, which was put together by some of our JR data scientists who were in attendance.

If it’s more than just highlights that you’re after, we have a playlist of talk recordings on our YouTube channel.

If this has whetted your appetite, our Early Bird tickets are now available! If you want to take an even more active role in the conference, we’re now accepting abstracts. All of the details are on our conference website, so head over there to sign up!

For updates and revisions to this article, see the original post

Conference and useR Group Sponsorship Opportunities

Tue, 23 May 2023 23:59:00 +0000

Here at Jumping Rivers, we love data science. One of the huge benefits of data science is transparency. Take R, for example: being an open-source language, it immediately gives something back to the community that propels it to the top of the data science ladder. Just like the world of data science, our ethos is transparency and giving back to the community.

Conference sponsorship

To help support the community, we are offering automatic sponsorship of any R conference. All the organisers need to do is complete a quick questionnaire and the money is sent on its way. We have sponsored several events in the past, which can be found on the community page of our website. We have also sponsored several SatRdays events over the last few years.

So if you are organising an R conference, feel free to tap us for sponsorship! We’re particularly proud of how frictionless we’ve made the process.

useR / R-Ladies Groups

We also offer sponsorship for useR and R-Ladies groups in Europe! We currently sponsor:

So if you want sponsorship for your group, just complete this quick form.

For updates and revisions to this article, see the original post

Why should I use R: Handling Dates in R and Excel: Part 3

Thu, 18 May 2023 23:59:00 +0000

This is part 3 of an ongoing series on why you should use R. Future blogs will be linked here as they are released.

Part 1: Why should I use R: The Excel R Data Wrangling comparison: Part 1
Part 2: Why should I use R: The Excel R plotting comparison: Part 2
Part 3: Why should I use R: Handling Dates in R and Excel: Part 3 (This post)

Dates in Excel

Here we will explore the various ways to handle dates in Excel and R. Dates are a crucial part of data analysis and are used in various fields such as biology, healthcare, and social sciences. However, working with dates can be challenging, especially when dealing with large datasets or multiple formats.

In Excel, there are several functions available to handle dates, such as DATE, YEAR, MONTH, and DAY. Excel also provides various formatting options to customise the display of dates. However, Excel has some limitations when it comes to complex date calculations, and it can be time-consuming to work with dates in large datasets.

In contrast, R has a robust set of tools for handling dates, including the {lubridate} package, which simplifies the manipulation of dates and times. Additionally, R allows for efficient handling of dates in large datasets, making it a powerful tool for time-series analysis. Whether you are working with dates in Excel or R, this blog will provide you with the basic tools and techniques to handle dates efficiently and accurately. So let’s get started!

Handling dates using {lubridate}

The {lubridate} package provides a range of functions that simplify common tasks. {lubridate} makes working with dates and times more intuitive and less error-prone, allowing users to focus on their analysis rather than the difficulties of date manipulation.

The {lubridate} package provides:

User-friendly syntax: consistent and intuitive syntax which makes it easier to understand and write code for date operations.
Comprehensive functionality: offers a range of built-in functions for common date operations. It allows us to parse dates from different formats and extract information such as year, month and day. This functionality saves time and effort compared to manual calculations in Excel.
Date representation: {lubridate} ensures consistent date representation by building on R’s standard classes: Date, which stores dates as the number of days since 1 January 1970, and POSIXct, which stores date-times as the number of seconds since 1 January 1970 (UTC).

Converting dates:

In Excel, to convert a string into a date format, you can use the DATEVALUE() function. For example, if your date is in cell A2, you can use =DATEVALUE(A2) to convert it into a date format. In R, you can use the as_date() function to convert a string into a date format. For example, if your date is "2023-01-18", you can use as_date("2023-01-18").

R

In R, running class() on as_date("2023-01-18") returns the class, or data type, of the object. In this case it returns “Date”, since as_date("2023-01-18") converts the given string into a date object. Note that an unquoted 2023-05-16 is evaluated as arithmetic (2023 minus 5 minus 16), which is why its class is “numeric”.

class(2023-05-16)
## [1] "numeric"
lubridate::as_date("2023-01-18")
## [1] "2023-01-18"
class(lubridate::as_date("2023-01-18"))
## [1] "Date"
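
The parsing helpers mentioned earlier can be sketched in the same way. This is a minimal example, assuming we already know the order of day, month and year in each input string; the variable names and example strings are illustrative only.

```r
library("lubridate")

# The same day written in three different orders (illustrative strings)
iso_date = ymd("2023-01-18")  # year-month-day
uk_date = dmy("18/01/2023")   # day-month-year
us_date = mdy("01/18/2023")   # month-day-year

# All three helpers return the same Date object
iso_date == uk_date
uk_date == us_date
class(us_date)
```

Choosing the helper by the order of the components keeps the intent visible in the code, which is harder to see from a cell format in Excel.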

Calculating time intervals:

In Excel, you can use the DATEDIF() function to calculate the time difference between two dates in various units (years, months, etc.). For example, if you want to calculate the number of days between two dates in cells A2 and B2, you can use = DATEDIF(A2,B2,"d"). In R, using {lubridate}, you can calculate the difference in dates using the interval() function. Let’s calculate the difference between the two dates specified (January 18, 2023 and May 16, 2023) in terms of days.

Excel

The screenshot shows how you would use = DATEDIF() in a cell to calculate the interval between two dates.

R

The following code performs the same action in R, taking the start date and end date and calculating the difference. We then convert the difference to days using as.numeric().

library("lubridate")
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
## date, intersect, setdiff, union

start_date = as_date("2023-01-18")
end_date = as_date("2023-05-16")

diff_date = interval(start_date, end_date) |>
  as.duration() |>
  as.numeric(unit = "days")

diff_date
## [1] 118
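
DATEDIF() can also report other units, such as whole months. As a rough sketch of the R side, and assuming the same two dates as above, an interval can be divided by a period to count complete months, and time_length() gives the length in days directly. The names span, whole_months and n_days are illustrative.

```r
library("lubridate")

start_date = as_date("2023-01-18")
end_date = as_date("2023-05-16")
span = interval(start_date, end_date)

# Number of complete calendar months that fit inside the interval
whole_months = span %/% months(1)
whole_months

# Length of the interval in days, without going through as.duration()
n_days = time_length(span, unit = "day")
n_days
```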

Formatting dates:

Dates in Excel can be formatted using the Format Cells feature. For example, you can format a date as dd-mmm-yyyy to display it as "16-May-2023". In R, you can use the format() function to format a date in various ways.

Excel

The following gif shows the manual process of formatting a date in Excel using the Format>Format cells process.

R

The following lines of code accomplish the same thing.

date = lubridate::as_date("2023-05-16")
# %b gives the abbreviated month name in the current locale (English shown here)
date_formatted = format(date, "%d-%b-%Y")
date_formatted
## [1] "16-May-2023"
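
Month names produced by %b depend on the locale of the machine running the code. A small sketch of some purely numeric format strings, which do not depend on the locale, is shown below; the variable names are illustrative.

```r
report_date = as.Date("2023-05-16")

# Day/month/year with a four-digit year
slash_format = format(report_date, "%d/%m/%Y")
slash_format

# Compact form, handy for file names
compact_format = format(report_date, "%Y%m%d")
compact_format

# Day of the year, zero-padded to three digits
day_of_year = format(report_date, "%j")
day_of_year
```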

Overall, Excel and R have different syntax and functions for handling dates, but both can be used effectively for data analysis and manipulation. It’s important to choose the tool that is best suited for your specific needs and workflow.

Extracting components of a date:

In R, you can extract different components of a date, such as the year, month, or day, using various functions. For example:

my_date = lubridate::as_date("2023-05-16")
year(my_date)
## [1] 2023
month(my_date)
## [1] 5
day(my_date)
## [1] 16
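
Beyond year, month and day, {lubridate} has a few more extractor functions that can be useful for grouping. The following is a short sketch using the same date; week_start = 1 is set explicitly so that Monday counts as day 1 regardless of any session options.

```r
library("lubridate")

my_date = as_date("2023-05-16")

# Quarter of the year (1 to 4)
date_quarter = quarter(my_date)
date_quarter

# Day of the year (1 to 366)
date_yday = yday(my_date)
date_yday

# Day of the week as a number, counting Monday as 1
date_wday = wday(my_date, week_start = 1)
date_wday
```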

In Excel, you can extract different components of a date using the YEAR(), MONTH(), and DAY() functions.

The Movies Data

Let’s dive into more advanced examples of working with dates in R and Excel. In our previous blog series comparing Excel and R, we utilised a dataset called “movies data” which consists of five columns: country, year, highest movie profit, number of movies produced, and number of employees involved in the production. We’ve added two new columns to our dataset called start_date and end_date.

library(readr)
movies_data = read_csv("blog-data.csv")
## Rows: 6 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (4): Year, Highest_profit, Number_movies, no_employees
## date (2): start_date, end_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(movies_data)
## # A tibble: 6 × 7
## Country Year Highest_profit Number_movies no_employees start_date end_date 
## <chr> <dbl> <dbl> <dbl> <dbl> <date> <date> 
## 1 England 2011 100 3 1500 2011-01-16 2011-08-19
## 2 America 2012 150 2 2000 2012-03-21 2012-09-21
## 3 America 2013 300 4 4000 2013-01-01 2012-11-12
## 4 England 2013 130 2 4020 2013-01-04 2013-05-04
## 5 South K… 2013 177 3 5300 2013-01-28 2013-09-22
## 6 America 2014 350 1 3150 2014-01-01 2014-12-12

Let’s say we wanted to calculate the duration in days for each movie production, and then find the average duration per country.

R

In R, we can accomplish this by using the {lubridate} and {dplyr} packages. The first portion of this code takes the start and end dates of the movie production as dates, and then calculates the time between the dates, converting it to a numeric type. The second part then calculates the mean production time as a summary statistic.

library(lubridate)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
## filter, lag
## The following objects are masked from 'package:base':
## 
## intersect, setdiff, setequal, union

movies_data = movies_data |>
  mutate(start_date = as_date(start_date),
         end_date = as_date(end_date),
         duration = as.numeric(end_date - start_date))

average_duration = movies_data |>
  group_by(Country) |>
  summarise(average_duration = mean(duration, na.rm = TRUE))
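
Before averaging, it can be worth checking that every end date falls after its start date, since a negative duration would pull a country's mean down. The sketch below is one way to flag such rows. It rebuilds only the two date columns from the head(movies_data) output printed above, and the names production_dates and suspect_rows are illustrative.

```r
library("lubridate")
library("dplyr")

# Dates copied from the head(movies_data) output above
production_dates = tibble(
  start_date = as_date(c("2011-01-16", "2012-03-21", "2013-01-01",
                         "2013-01-04", "2013-01-28", "2014-01-01")),
  end_date = as_date(c("2011-08-19", "2012-09-21", "2012-11-12",
                       "2013-05-04", "2013-09-22", "2014-12-12"))
)

# Keep only the rows where the production appears to end before it starts
suspect_rows = production_dates |>
  mutate(duration = as.numeric(end_date - start_date)) |>
  filter(duration < 0)

suspect_rows
```

Whether such a row is a data-entry slip or something to exclude is a judgement call for the analyst; the point is only that the check is a single filter() in R.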

Excel

In Excel, you would need to use formulas and functions such as DATEDIF() and AVERAGEIF() to achieve similar results. Let’s take a moment to refresh our memory on how the movie data is structured within an Excel sheet.

The following are the steps to accomplish the above task in Excel:

In cell H2, enter the formula to calculate the duration:

=G2-F2

Press Enter to calculate the duration for the first row.
Drag the formula down from cell H2 to fill the formula for the remaining rows.
In cell I2, enter the formula to calculate the average duration per country, where country refers to the range of country values and duration refers to the range of duration values.

=AVERAGEIF(country, A2, duration)

Press Enter to calculate the average duration for the first country.
Drag the formula down from cell I2 to fill the formula for the remaining countries.

In this Excel approach, we used formulas and functions such as subtraction and AVERAGEIF() to perform the calculations. While it is possible to achieve the desired results in Excel, the process involves multiple steps and formulas, and it may become more complex as the dataset grows. R, however, simplifies the process with its built-in date functions, resulting in cleaner and more efficient code.

Advantages of using R:

Flexibility: R allows us to work with date objects and apply operations on them such as subtraction in order to calculate duration.
Vectorised operations: R allows us to apply calculations to the entire column at once.
Data Manipulation: The {dplyr} data manipulation package in R makes it easier to perform complex tasks on the entire dataset, such as aggregating the data based on country and then determining the average duration.

By utilising R’s ecosystem of packages such as the {lubridate} package for handling dates in R, we can handle complex date calculations efficiently and easily.

Using {dplyr} and {lubridate}

Let’s say we wanted to find the average number of employees on set for movies that were released in the year with the highest average profit.

library("lubridate")
library("dplyr")

# Extract the year from the start_date column
movies = movies_data |>
  mutate(start_year = year(start_date))

# Calculate the average highest profit per movie for each year
profit = movies |>
  group_by(start_year) |>
  summarise(avg_profit = mean(Highest_profit))

# Determine the year with the highest average profit
profit_year = profit |>
  filter(avg_profit == max(avg_profit)) |>
  pull(start_year)
profit_year
## [1] 2014
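
The final step, averaging the number of employees for that year, could be sketched as follows. This is a self-contained version that rebuilds only the three columns it needs from the head(movies_data) output printed earlier; the name avg_employees is illustrative, and %in% is used so that the filter still works if two years tie for the highest average profit.

```r
library("lubridate")
library("dplyr")

# Columns copied from the head(movies_data) output above
movies = tibble(
  Highest_profit = c(100, 150, 300, 130, 177, 350),
  no_employees = c(1500, 2000, 4000, 4020, 5300, 3150),
  start_date = as_date(c("2011-01-16", "2012-03-21", "2013-01-01",
                         "2013-01-04", "2013-01-28", "2014-01-01"))
) |>
  mutate(start_year = year(start_date))

# Year (or years, if tied) with the highest average profit
profit_year = movies |>
  group_by(start_year) |>
  summarise(avg_profit = mean(Highest_profit)) |>
  filter(avg_profit == max(avg_profit)) |>
  pull(start_year)

# Average number of employees for movies started in that year
avg_employees = movies |>
  filter(start_year %in% profit_year) |>
  summarise(avg_employees = mean(no_employees)) |>
  pull(avg_employees)

avg_employees
```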

Now let’s try to visualise how we would approach this task in Excel. To replicate the task above, we would need multiple functions such as MAX() and AVERAGEIFS(), as well as date manipulation functions to extract the relevant data before calculating averages. Excel’s formula-based design might require multiple steps and complex formulas, which makes the process time-consuming and prone to errors. Handling dates in Excel is a challenge on its own, so while it is possible to perform these calculations in Excel, it may not be as efficient and straightforward as in R.

Both R and Excel have their strengths in data manipulation and analysis. Excel is commonly used due to its user-friendly, easily accessible system, making it suitable for quick, basic tasks. However, when it comes to complex data analysis and advanced programming capabilities, R proves to be the superior choice. R, with its packages such as {lubridate} and {dplyr}, provides intuitive syntax specifically designed for handling dates. Its flexibility allows for seamless integration with other statistical and visualisation packages. The ability to write reproducible scripts in R enhances collaboration, documentation, and automation.

In addition to the advantages of using {lubridate}, there are also several base R datetime functions that provide flexibility in handling dates. Functions such as as.Date() and difftime() allow for date manipulations. Base R provides a solid foundation for date operations, and when combined with additional packages like {lubridate}, it offers a powerful suite of tools for working with dates.
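
As a brief illustration of the base R route, the earlier interval calculation can be repeated with as.Date() and difftime() alone. This is a minimal sketch using the same two dates as before; no packages are needed, and the names gap and n_days are illustrative.

```r
start_date = as.Date("2023-01-18")
end_date = as.Date("2023-05-16")

# difftime() returns an object that carries its units
gap = difftime(end_date, start_date, units = "days")
n_days = as.numeric(gap)
n_days

# Subtracting two Date objects gives the same number of days
as.numeric(end_date - start_date)
```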

While Excel remains useful for basic tasks, R’s approach makes it the preferred tool for complex data manipulation and analysis. Its flexibility, extensive community support, and comprehensive packages make R the go-to choice for handling date operations, as well as other advanced data analysis tasks.

If you’re interested in learning more about using R for data analysis, take a look at our training course offerings; there’s something for all levels.

For updates and revisions to this article, see the original post

SatRdays London 2023: Thanks for coming!

Thu, 11 May 2023 23:59:00 +0000

SatRdays returned to London last month, with a day packed full of talks from expert speakers across a variety of sectors! We’d like to take this opportunity to say a huge thank you to everybody involved in making the day a success!

Speakers

Huge thanks to all of our speakers for your contributions for the day. It was great to see such a varied line up of talks, covering everything from Sidekicks of the Tidyverse, to Sustainability and EDI in the R Project, with much more in between. We’ll be adding any available talk materials to the SatRdays website in the coming days.

Keynotes

Julia Silge (Posit): What is “production” anyway? MLOps for the curious
Oli Hawkins (Financial Times): Why R is good for journalism

Contributed talks

Michael Stevens & Botan Ağın (SamKnows): AutRmatic reporting: billions of internet measurements, hundreds of reports and one repository to rule them all
Russ Hyde (Jumping Rivers): Does code quality even matter in data science?
Ella Kaye & Heather Turner (University of Warwick): Sustainability and EDI (Equality, Diversity and Inclusion) in the R Project
Andrew Collier (presenter) & Bianca Peterson (Fathom Data): Sidekicks of the Tidyverse
Jack Davison (Ricardo Energy & Environment): “Put it on a map!” – Developments in Air Quality Data Analysis
Vyara Apostolova & Laura Cole (National Audit Office): ScRutinising government spending

Sponsors

Thanks again to our sponsors for supporting the day in various ways!

CUSP London

The Centre for Urban Science and Progress (CUSP) London provided our incredible venue, Bush House London, as well as all of the AV support and catering throughout the day.

Based in London, UK, their mission is to support interdisciplinary research and innovation using Data Science in and for London.

Jumping Rivers

If you’re here, you probably already know who we are, but just in case: we’re Jumping Rivers, and we were the event organisers for SatRdays 2023.

Jumping Rivers is an analytics company whose passion is data and machine learning. We help our clients move from data storage to data insights.

Posit

Posit supported the conference in many ways, including sending us our keynote speaker, Julia Silge, all the way over from the US!

Posit aim to create open-source software for data science, scientific research, and technical communication, to enhance the production and consumption of knowledge by everyone, regardless of economic means.

R Consortium

The R Consortium is the organisation behind SatRdays as a worldwide event.

You

Of course, a conference would not work with nobody there to see it, so thank you to all of our attendees, both in-person and virtual! We hope that you enjoyed the day as much as we did, and that you got a chance to network and learn and maybe share a bit of your own knowledge too!

What’s next?

SatRdays recordings

Keep an eye out here and on our social media, as we’ll be sharing the recordings of some of the talks, so if you missed out on the day, you don’t have to miss out completely!

Shiny in Production conference

We also have our Shiny in Production conference coming up later this year! This conference takes a look at Shiny and other web-based R packages, and includes an afternoon of workshops as well as a day of talks from subject experts. Take a look at the conference website to find out what we have lined up so far, and keep an eye out on our social media for further announcements.

For updates and revisions to this article, see the original post

Diffify - the anniversary update!

Tue, 02 May 2023 23:59:00 +0000

We’ve just passed an important milestone for diffify: our app for tracking Python and R package releases has just turned 1 year old! To mark this exciting occasion we are delighted to announce an “anniversary update” featuring numerous quality of life improvements. This post will outline the latest changes and tease at some exciting developments in the works…

First, though, we would like to take this opportunity to thank everyone that continues to use the app and welcome any new users to the service. Your continued feedback via social media and GitHub has played a major role in shaping the last year of development.

Anniversary update

Let’s start by going through the changes introduced by today’s anniversary update!

Latest package releases

When you navigate to the R and Python homepages, you will notice a new window titled “Latest Releases and Updates”:

This lists any new or updated packages that have been published in the past day or so. See a package that you’re using? Just click on it and you will be redirected to the diffify summary with the latest changes.

Package dependencies

In response to user feedback, we have added cross-links for package dependencies. Let’s check out the changes between versions 3.6.3 and 3.7.1 of the matplotlib package:

We see that the version requirement has changed for the numpy and pyparsing packages. You may now be wondering what’s changed in the latest versions of those packages? Just click the link icons and that will open a new tab with the two latest versions diffed.

Quick disclaimer that not all package dependencies will have cross-links. We can only provide cross-links for packages that are actually tracked by diffify, which includes:

All R packages published on CRAN (this does not include base-R packages)
Any Python package that is in the top 5000 PyPI packages list and has an accessible wheel file on PyPI

News layout

We have made some changes to the way we display news for R packages. Let’s check out the changes between versions 1.0.7 and 1.0.10 of {dplyr}. As before, the news can be accessed for all versions since (but not including) the earlier version:

However, you’ll notice we now have an accordion layout with the version tabs listed vertically. You are now free to have as many of these versions open as you like, and scrolling through these will feel just like scrolling through a NEWS.md file.

Dark theme

Last but not least … we now have a dark theme! Just click the theme dropdown at the top of the page, select “Theme: Dark” and enjoy this lower-light setting:

On the topic of themes, we have also improved the default theme by incorporating beneficial features from the old boosted contrast theme.

Other recent changes

In case you missed them, here are some other improvements that have been made over the past six months or so.

Maintainer section

Just below the version dropdowns you will notice a new maintainer section:

If you maintain a package that is featured on diffify, you can generate a diffify badge to copy into your GitHub repository. Simply click “Get a badge”, then paste the copied HTML code directly into an HTML or Markdown file (perhaps your package README).

As an example, here’s the badge generated for the {dplyr} package:

Clicking this icon will redirect users to the {dplyr} page on diffify.

Python content

We have expanded the list of Python packages that are tracked by diffify to cover the top 5000 packages on PyPI according to download counts. We are still only tracking packages that have a wheel file on PyPI, but will look to expand this to zips and tars within the next month.

Usability

We are continuing to optimise the usability and performance of the app. Recent improvements include:

text wrapping on narrow screens
smoother transitions using the backward and forward navigation buttons
improvements to keyboard navigation.

Exciting times ahead…

In the coming months we will be releasing two public APIs to accompany diffify. We will release dedicated blogs to coincide with those releases, but here’s a quick overview to whet your appetite:

Next month we will release an API which will allow R package authors to submit development versions to diffify. Package authors and users will then be able to use diffify to view changes between published versions and the latest development version.
The second API, which will take a little longer to develop, will act as a command-line interface for submitting queries to diffify. This will allow you to check whether installing the latest version of a package could break your code.

We can’t wait to share more when these release!

Wrapping up

That’s all from us for today. Thanks again for your continued feedback on the app, and please stay tuned for more updates…

For further reading, you can check out our previous blog posts here!

For updates and revisions to this article, see the original post

How to create a clickable word cloud with wordcloud2 and Shiny

Thu, 27 Apr 2023 23:59:00 +0000

Word clouds are a visual representation of text data where words are arranged in a cluster, with the size of each word reflecting its frequency or importance in the data set. Word clouds are a great way of displaying the most prominent topics or keywords in free text data obtained from websites, social media feeds, reviews, articles and more. If you want to learn more about working with unstructured text data, we recommend attending our Text Mining in R course.

Usually, a word cloud will be used solely as an output. But what if you wanted to use a word cloud as an input? For example, let’s say we visualised the most common words in reviews for a hotel. Imagine we could then click on a specific word in the word cloud, and it would then show us only the reviews which mention that specific word. Useful, right?

This blog will take you through creating a clickable word cloud in a Shiny app, where the user can click any word in the word cloud to filter an output table. We will be using the 2021 TidyTuesday Netflix titles data set and the {wordcloud2} package to create our word cloud. We will then integrate it in a Shiny app with a reactively filtered {DT} table output.

Creating a word cloud with {wordcloud2}

{wordcloud2} is an R package which creates HTML-based word clouds, based on wordcloud2.js. The main function is simply called wordcloud2() and takes a word count data frame as an input i.e. one column containing the words, one column containing the frequencies of those words.

Before creating the word cloud, we need to read in our data using the {tidytuesdayR} package. If you want to see the full source code for the final Shiny app, check out our GitHub.

tuesdata = tidytuesdayR::tt_load("2021-04-20")
netflix_titles = tuesdata$netflix_titles

To create our word count data frame, we will use a combination of {dplyr} and {tidytext} functions. We filter out words that appear 10 times or fewer across the titles to prevent our word cloud from being too crowded.

library("dplyr")
library("tidytext")

word_counts = netflix_titles %>%
  unnest_tokens("word", title) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  filter(n > 10)

word_counts %>%
  arrange(desc(n))
## # A tibble: 157 × 2
## word n
## <chr> <int>
## 1 love 151
## 2 2 115
## 3 christmas 78
## 4 story 67
## 5 life 65
## 6 world 63
## 7 movie 60
## 8 time 54
## 9 de 46
## 10 american 45
## # ℹ 147 more rows

Then we just need to pass this word count data frame into the wordcloud2() function. We’re using a custom colour palette instead of the default one. wordcloud2() requires a colour palette vector of the same length as the data set, so you can use the rep_len() function to achieve this.

library("wordcloud2")

my_palette = c("#355070",
               "#6d597a",
               "#b56576",
               "#e56b6f",
               "#eaac8b")

my_wordcloud = wordcloud2(
  word_counts,
  color = rep_len(my_palette,
                  nrow(word_counts)))
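
To see what rep_len() is doing here, the following small sketch recycles the same five colours to a length of seven. The length of seven is an arbitrary stand-in for nrow(word_counts), and the name colours is illustrative.

```r
my_palette = c("#355070", "#6d597a", "#b56576", "#e56b6f", "#eaac8b")

# Recycle the five colours to cover seven (hypothetical) words
colours = rep_len(my_palette, 7)
colours

# The sixth and seventh entries wrap around to the start of the palette
colours[6] == my_palette[1]
colours[7] == my_palette[2]
```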

The {wordcloud2} package contains two functions for incorporating word clouds in a Shiny app: wordcloud2Output() and renderWordcloud2(). These work in the same way as most *Output() and render*() functions.

library("shiny")
ui = fluidPage(
  wordcloud2Output("wordcloud")
)

server = function(input, output) {
  output$wordcloud = renderWordcloud2(my_wordcloud)
}

shinyApp(ui, server)

Binding a JavaScript click event to a Shiny input

Now to the key part of this blog post. We want to be able to click on a word in the word cloud, and use the clicked word as an input in Shiny. We need to write some JavaScript for this, which will be wrapped in the HTML() function within a script tag (tags$script()). We are writing an anonymous function, i.e. an unnamed function, which will be run whenever we click on a word in the word cloud. The function will extract the text content of the label produced when we hover over a word, and then cast this to a Shiny input called clicked_word.

ui = fluidPage(
  tags$script(HTML(
    "$(document).on('click', '#canvas', function() {
      // Read the hover label and pass it to Shiny as input$clicked_word
      var word = $('#wcLabel').text();
      Shiny.onInputChange('clicked_word', word);
    });")),
  wordcloud2Output("wordcloud")
)

Now, we can use input$clicked_word in our Shiny server to filter the Netflix titles to retain only the titles which contain that specific word. We use a combination of {dplyr} and {stringr} to do this. The input also contains the count, e.g. “love:151”, so we first need to use a regular expression to remove the colon and any numbers after it.

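library("stringr")

server = function(input, output) {
  output$wordcloud = renderWordcloud2(my_wordcloud)

  filtered_netflix = reactive({
    # Wait until a word has been clicked before filtering
    req(input$clicked_word)
    # Strip the trailing ":<count>" from the label text
    clicked_word = str_remove(input$clicked_word, ":[0-9]+$")

    netflix_titles %>%
      filter(str_detect(tolower(title), clicked_word)) %>%
      select(title, everything(), -show_id)
  })
}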
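
The label clean-up and matching can be tried outside of Shiny. The sketch below assumes the label has the form word:count with no space before the number, as the regular expression above expects. The helper clean_label() and the toy titles are hypothetical and are not taken from the Netflix data.

```r
library("stringr")

# Hypothetical helper: strip the trailing ":<count>" from the hover label
clean_label = function(label) {
  str_remove(label, ":[0-9]+$")
}

# Toy titles for illustration only
titles = c("Love Actually", "Lovely Bones", "Home Alone 2")
word = clean_label("love:151")

# Plain substring match: also picks up "lovely"
substring_hits = titles[str_detect(tolower(titles), fixed(word))]
substring_hits

# Whole-word match using regex word boundaries
whole_word_hits = titles[str_detect(tolower(titles), paste0("\\b", word, "\\b"))]
whole_word_hits
```

Whether the substring or the whole-word behaviour is preferable depends on the app; the whole-word version is closer to how the word counts were produced, since those were built from individual tokens.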

The final step is to create an output table of the filtered data. We use the renderDT() and DTOutput() functions from {DT} to do this, but you can use any package for creating tables.

library("DT")

ui = fluidPage(
  # <...> UI elements from the previous steps
  DTOutput("filtered_tbl")
)

server = function(input, output) {
  # <...> server code from the previous steps
  output$filtered_tbl = renderDT(filtered_netflix())
}

Now, you should have an interactive word cloud input which allows you to filter a table based on whichever word you click! You can of course use the word input for something else, for example, you could re-render the word cloud every time you click a word to show you the words which are most often used together with your clicked word, or you could use the input to create some further visualisations.

If you’re interested in learning more about Shiny, check out our Shiny in Production conference, taking place October 12th-13th in Newcastle upon Tyne. We’ll be focussing on all things shiny as well as other web-based R packages, with an afternoon of workshops run by our JR trainers, followed by a day of talks from R experts!

For updates and revisions to this article, see the original post

What's new in R 4.3.0?

Thu, 20 Apr 2023 23:59:00 +0000

Logic will get you from A to B. Imagination will take you everywhere. (Einstein)

R can already take you everywhere. With it we can learn about the minutest particles and the largest galaxies. So, to celebrate the release of R 4.3 (“Already Tomorrow”, on April 21st, 2023), let’s reverse Einstein’s quote and take you from A to B with logic.

Two modes of comparison

In R, almost all of your data will be stored as a vector. Even if your vector holds a single value it is still considered to be a vector by R. This is unlike many other languages, and getting comfortable “thinking for the whole vector” can gain you efficiencies from several viewpoints. Your code will be more concise and it may even run quicker, when compared with an iterative approach to the same problem.

1:10 # A vector of integers
## [1] 1 2 3 4 5 6 7 8 9 10
is.vector(1:10)
## [1] TRUE
sum(1:10) # A vectorised computation
## [1] 55

integer(0) # An empty vector of integers
## integer(0)
1L # A single integer, stored as a vector
## [1] 1

But the conciseness that R’s vectorised operations provide may trip you up unexpectedly. A typical case is when you think you are working with a scalar (a length-1 vector) but you are actually working with an empty or multivalued vector.

The logical values in R (TRUE, FALSE) are a little bit special. A vector of logical values might be used to represent some quality in a dataset, for example, to select those rows of a dataset that are to be kept in dplyr::filter().

library("tidyverse")
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

head(diamonds$cut == "Ideal") # A logical vector
## [1] TRUE FALSE FALSE FALSE FALSE FALSE
filter(diamonds, cut == "Ideal") # Subsetting a data-frame using a logical vector
## # A tibble: 21,551 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
## 10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63
## # ℹ 21,541 more rows

head(diamonds$carat > 0.3)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE
filter(diamonds, carat > 0.3)
## # A tibble: 49,737 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 2 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 3 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
## 4 0.31 Very Good J SI1 59.4 62 353 4.39 4.43 2.62
## 5 0.31 Very Good J SI1 58.1 62 353 4.44 4.47 2.59
## 6 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
## 7 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 8 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
## 9 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## 10 0.32 Good H SI2 63.1 56 403 4.34 4.37 2.75
## # ℹ 49,727 more rows

But there are places where you use logical values, where it would make no sense (and could potentially be dangerous) to use a multivalued logical vector. We use if (...) {} and while (...) {} statements for flow control in R. The conditional expression in these statements (the ... in if (...) {}) should always evaluate to a logical scalar: either TRUE or FALSE.

When R 4.2.0 was released, stricter guarantees were placed on the length of these conditional expressions. We mentioned this in an earlier blog post. So in addition to getting an error when the conditional is empty, we now get an error when the conditional is too long:

# Comparison with an empty logical vector:
if (logical(0)) {
  print("I didn't expect to get here")
}
## Error in if (logical(0)) {: argument is of length zero

# Comparison with an over-sized logical vector:
numbers <- c(1, 3, 5, 6)

print(numbers %% 2 == 0) # Determine if even
## [1] FALSE FALSE FALSE TRUE

if (numbers %% 2 == 0) {
  print("Should we ever be allowed to get here?")
}
## Error in if (numbers%%2 == 0) {: the condition has length > 1

Previously, R would use the first entry in a non-scalar conditional vector to decide whether to enter the if or while block.
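
One way to avoid this error is to reduce the logical vector to a single value before it reaches the condition. The sketch below uses any() and all(), which always return a length-one result; which of the two is appropriate depends on the question being asked. The name is_even is illustrative.

```r
numbers <- c(1, 3, 5, 6)
is_even <- numbers %% 2 == 0

# Collapse the logical vector to a scalar before using it in if ()
if (any(is_even)) {
  print("At least one number is even")
}

if (!all(is_even)) {
  print("Not every number is even")
}
```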

Strictly comparing

So, we have two main ways of using a logical vector, one of which now requires that the vector is a scalar.

Another place where it is really important to know the length of your vectors is when combining logical values together.

R has a number of ways to combine logical values together that build on the AND and OR operations in Boolean algebra:

all and any for combining the values in a single vector (are all of the values TRUE; are any of the values TRUE)
&, && (representing “AND”), |, and || (for “OR”) for combining two different vectors

is_april = TRUE
is_r_released = TRUE
is_already_tomorrow = FALSE

# Logical AND within a single vector
all(c(is_april, is_r_released, is_already_tomorrow))
## [1] FALSE

# Logical OR within a single vector
any(c(is_april, is_r_released, is_already_tomorrow))
## [1] TRUE

# Logical AND between vectors
is_april & is_r_released
## [1] TRUE
is_april && is_already_tomorrow
## [1] FALSE

# Logical OR between vectors
is_april | is_r_released
## [1] TRUE
is_april || is_already_tomorrow
## [1] TRUE

For scalars, there’s no difference between the single-character operators (&, |) and the two-character operators (&&, ||). So why have a pair of operators for each concept?

&& and || are intended for use solely with scalars; they return a single logical value.
& and | work with multivalued vectors; they return a vector whose length matches their input arguments.

Since they always return a scalar logical, you should use && and || in your if/while conditional expressions (when needed). If an & or | is used, you may end up with a non-scalar vector inside if (...) {} and R will throw an error.
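The scalar operators also evaluate their arguments lazily: if the left-hand side of && is FALSE, the right-hand side is never evaluated. That makes them handy for guarding a conditional. A minimal sketch, where is_positive_number() is a hypothetical helper written for this illustration:

```r
# Hypothetical helper: TRUE only for a single, positive number
is_positive_number <- function(x) {
  # Each check only runs if the ones before it were TRUE, so `x > 0`
  # is never reached for a non-numeric or non-scalar input
  is.numeric(x) && length(x) == 1 && x > 0
}

is_positive_number(3)        # TRUE
is_positive_number("a")      # FALSE
is_positive_number(c(1, 2))  # FALSE
```

Because the length check comes before the comparison, the final && never sees a vector with more than one entry.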

To illustrate the difference between the scalar operators and vectorised operators, here’s an example:

x = c(TRUE, TRUE, FALSE, FALSE)
y = c(TRUE, FALSE, TRUE, FALSE)

The vectorised operators apply AND/OR on matched pairs of elements:

x & y # c(x[1] && y[1], x[2] && y[2], ...)
## [1] TRUE FALSE FALSE FALSE

x | y # c(x[1] || y[1], x[2] || y[2], ...)
## [1] TRUE TRUE TRUE FALSE

In R 4.2.0, a warning is thrown when a non-scalar input is passed to the scalar operators, but a scalar logical is still returned (here, the result of x[1] && y[1]). In earlier versions of R, no warning was printed.

# R 4.2
x && y
[1] TRUE
Warning messages:
1: In x && y : 'length(x) = 4 > 1' in coercion to 'logical(1)'
2: In x && y : 'length(x) = 4 > 1' in coercion to 'logical(1)'

This could lead to hidden bugs. For example, if you used this code in an if conditional, a warning would be printed when a non-scalar vector was used but the code would continue happily:

# R 4.2
if (x && y) {
 print("The world can't end today...")
}
[1] "The world can't end today..."
Warning messages:
1: In x && y : 'length(x) = 4 > 1' in coercion to 'logical(1)'
2: In x && y : 'length(x) = 4 > 1' in coercion to 'logical(1)'

In R 4.3.0, this warning has been elevated to an error and no value is returned:

# R 4.3
x && y
Error in x && y : 'length = 4' in coercion to 'logical(1)'

This more strict version of the scalar comparison operators will help catch those bugs where you didn’t realise a logical variable could contain more than one entry.
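If the stricter operators do flag some of your code, the fix is usually to say explicitly what you meant. A short sketch of three possible intentions, using the x and y vectors from above (the variable names on the left are illustrative):

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

# "Are x and y both TRUE at every position?"
all_pairs_true <- all(x & y)     # FALSE

# "Is there any position where x and y are both TRUE?"
any_pair_true <- any(x & y)      # TRUE

# "Only the first entries matter" (what older versions of R did silently)
first_pair_true <- x[1] && y[1]  # TRUE
```

All three run without warnings or errors, because each one hands a scalar to if or to &&.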

To check whether the strict comparison operators will affect your existing code, before upgrading to R 4.3.0, you can set an environment variable before running it:

# In R:
Sys.setenv("_R_CHECK_LENGTH_1_LOGIC2_" = TRUE)

A more logical flow

Where else do we work with scalars in R? Many functions expect certain arguments to be scalars. For example, the seq() function complains with non-scalar arguments:

seq(from = 1:3, to = 4)
## Error in seq.default(from = 1:3, to = 4): 'from' must be of length 1

seq(from = 1, to = 4:5)
## Error in seq.default(from = 1, to = 4:5): 'to' must be of length 1

There are several other places where R will throw an error if we provide a value that is of the wrong size:

a_data_frame[[column_index]] # column_index must be a scalar
a_matrix[rows, cols] = value # the number of replaced elements must be a multiple of length(value)

There are other places where R will throw a warning, and try to gracefully handle values that are of an unexpected size:

# R's recycling rules are used to match the size of the vector input
c(1, 3, 5) * c(2, 3) # c(1 * 2, 3 * 3, 5 * 2)
## Warning in c(1, 3, 5) * c(2, 3): longer object length is not a multiple of
## shorter object length
## [1] 2 9 10

# The smaller vector was recycled to match the size of the larger
# c(1, 3, 5) * c(2, 3, 2)
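If silent recycling is not what you want, one option is to check the lengths up front. A minimal sketch, where multiply_strict() is a hypothetical helper rather than anything built into R:

```r
# Hypothetical helper: multiply two vectors, allowing only equal lengths
# or a scalar on one side
multiply_strict <- function(a, b) {
  stopifnot(length(a) == length(b) || length(a) == 1 || length(b) == 1)
  a * b
}

multiply_strict(c(1, 3, 5), 2)           # 2 6 10
multiply_strict(c(1, 3, 5), c(2, 3, 2))  # 2 9 10

# A length mismatch is now an error instead of a warning
result <- tryCatch(
  multiply_strict(c(1, 3, 5), c(2, 3)),
  error = function(e) "length mismatch"
)
result                                   # "length mismatch"
```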

An interesting case is the : operator, which like seq(), can be used to create sequences of numbers.

3:5
## [1] 3 4 5

If we provide a non-scalar on either side of the operator, R will warn us:

# R 4.2
(1:2) : 5
[1] 1 2 3 4 5
Warning message:
In (1:2):5 : numerical expression has 2 elements: only the first used

# R 4.2
1 : (4:6)
[1] 1 2 3 4
Warning message:
In 1:(4:6) : numerical expression has 3 elements: only the first used

Now, because the output should be a single sequence, R has to pick a specific value for the start- and the end-point of that sequence from the arguments provided. It uses the first entry in each argument. So,

(1:2) : 5 is equivalent to 1:5; and
1 : (4:6) is equivalent to 1:4.

If your code is providing non-scalar arguments to :, there may be a bug in your code or the packages that it depends upon. R 4.3.0 has introduced a more strict setting, which will catch the use of non-scalar values when constructing sequences with the : operator.

Much like with the stricter logic comparisons described above, the R developers have introduced this as an optional setting. After setting the environment variable _R_CHECK_LENGTH_COLON_ to a true value, R will throw an error whenever an oversized argument is passed into a:b.

# R 4.3
# Without the check enabled:
(1:2) : 5
[1] 1 2 3 4 5
Warning message:
In (1:2):5 : numerical expression has 2 elements: only the first used

# With the strict check enabled:
Sys.setenv("_R_CHECK_LENGTH_COLON_" = TRUE)
(1:2) : 5
Error in (1:2):5 : numerical expression has length > 1
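When the strict check does catch something, the remedy is to choose the end-point explicitly. A small sketch of a few possible intentions; the ends vector and the variable names are illustrative only:

```r
ends <- 4:6

# Keep the old behaviour, but say so explicitly
to_first <- 1:(ends[1])    # 1 2 3 4

# Perhaps the largest value was intended
to_largest <- 1:max(ends)  # 1 2 3 4 5 6

# Or one sequence per end-point
per_end <- lapply(ends, function(end) 1:end)
lengths(per_end)           # 4 5 6
```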

And finally: Extracting from a pipe

Have you started using the native pipe yet? In our blog post to celebrate the release of R 4.2.0, we showed this example:

mtcars |> lm(mpg ~ disp, data = _)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Coefficients:
## (Intercept) disp 
## 29.59985 -0.04122

Here the pipe |> passes the value on its left-hand side into the function on the right. By default that value will be used as the first argument to the right-hand function. But when an underscore is present, the piped-in value will replace that underscore. So the above is equivalent to:

lm(mpg ~ disp, data = mtcars)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Coefficients:
## (Intercept) disp 
## 29.59985 -0.04122

What if you want to extract values that are output by a pipeline, for example the coef entry from the linear model above? One way would be to store the results in a variable and extract the coef from that:

model = mtcars |> lm(mpg ~ disp, data = _)
model$coef
## (Intercept) disp 
## 29.59985476 -0.04121512

Or you could wrap the pipeline in parentheses:

(
 mtcars |> lm(mpg ~ disp, data = _)
)$coef
## (Intercept) disp 
## 29.59985476 -0.04121512

R 4.3.0 provides a much neater solution: the underscore _ can now be used to extract from the final value of a pipeline:

mtcars |> lm(mpg ~ disp, data = _) |> _$coef
(Intercept) disp
29.59985476 -0.04121512
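The placeholder can also sit at the head of a longer chain of extractions. The sketch below assumes R 4.3.0 or later for the first pipeline; the second uses the coef() accessor function instead and only needs R 4.2.0. The full element name coefficients is used here, since $coef relies on partial matching:

```r
# R >= 4.3.0: `_` at the head of a chain of extractions
slope <- mtcars |>
  lm(mpg ~ disp, data = _) |>
  _$coefficients[["disp"]]

# R >= 4.2.0: the same value via the coef() accessor function
coefs <- mtcars |>
  lm(mpg ~ disp, data = _) |>
  coef()

all.equal(slope, coefs[["disp"]])  # TRUE
```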

Trying the latest version out for yourself

To take away the pain of installing the latest development version of R, you can use Docker. The following commands pull and run the devel version of R:

docker pull rstudio/r-base:devel-jammy
docker run --rm -it rstudio/r-base:devel-jammy

See the r-docker project for more details.

Why should I use R: The Excel R plotting comparison: Part 2

Thu, 13 Apr 2023 23:59:00 +0000

This is part 2 of an ongoing series on why you should use R. Future blogs will be linked here as they are released.

Part 1: Why should I use R: The Excel R Data Wrangling comparison: Part 1
Part 2: Why should I use R: The Excel R plotting comparison: Part 2 (This post)
Part 3: Why should I use R: Handling Dates in R and Excel: Part 3

Why create plots in R and not Excel? To a programmer this may seem like a very obvious question, but it is still commonly asked by Excel users: if you have a data set, can’t you just select it, hit a couple of buttons and generate plots? This is one of the trickiest questions to answer, especially if you have limited Excel experience, as many new-age data scientists do. Hopefully, some of the reasons below will encourage you to make the switch from Excel to R.

Reproducibility

How do you view the code used to generate the Excel graph? Are you able to tell exactly what’s going on? Are you able to control and modify all of the aesthetics of the plot, such as changing the length of the axis ticks, or changing the font? If yes, are you able to share your work with a colleague and have them easily replicate your plot without you telling them where to click and which modification should be applied?

With R, all of these things are possible. You automatically have all the code visible in the form of scripts. The easy-to-read syntax allows you to track what the code is doing without having to be concerned about any hidden functions or modifications happening in the background.

Understanding changes

In Excel it is challenging to eye-ball which changes have been made to a graph, especially if these were minor changes. With R (and some easy to use version control systems), you can see exactly which files were changed. Also, in Excel, a user would usually draw a graph on a single Excel document, and if the same graph is required on a different data set, it is common to copy-and-paste a bunch of manipulations and configurations to another document. Such repeated human interaction is prone to introducing errors, as well as consuming a large amount of time. With R we can avoid this by creating functions, which can be used to run the same code on different data sets simply by changing the input, thereby producing reliable outputs and saving us a lot of time.

Extensibility

Yes, Excel has a wide range of basic graphics available, but R has a lot more. Excel has been around for a while, so it has some decent tools that have been developed over the years. R, however, is open source, and therefore extensions are widely available - it’s even fairly easy to make your own. R also has thousands of packages that can be used to easily produce graphics without all the pre-graph work to create some really crafty stuff. With that being said, Excel is perfectly sufficient when creating basic, simple, straightforward plots. But what if we’re not looking to be basic?

The simplicity of R

{ggplot2} is a plotting package in R that provides us with commands to create complex plots. R’s command line interface lets you quickly select x- and y-axis labels, colour by variables, modify grid lines and much more. Each item is added in a new layer, which allows us to add in and remove graph elements without affecting the rest of the plot. Interested in changing the colour gradient/scale of your plot? No problem, just use a package called {RColorBrewer}, which helps you select sensible colour schemes for your plots. Interested in changing the title of your plot? Simply add a layer called ggtitle(), and so much more.

The comparison

Let’s create some simple plots in Excel and then create a similar plot in R using the {ggplot2} functions. Hopefully, by the end of this post, we’ll have motivated you to switch to R. Now, let’s get started by loading the data and packages. The data set that we’ve used below is data from a selection of movies, and is comprised of five columns: country, year, highest profit gained per movie, number of movies produced and number of employees on set during production.

library("ggplot2") # For plotting
library("viridis") # Provides a range of colour palettes
library("readr") # For loading data 
library("tidyverse") # For data wrangling

movies_data <- read_csv("blog_data.csv")

Let’s start by creating a scatter plot, in which we compare the number of employees present in the different countries within each year.

Scatter Plot

Excel

The scatter plot generated in Excel was simple to create, but everything had to be done manually: selecting the data and the variables for the x- and y-axis and then selecting the type of plot. I was also required to manually change the axes titles. If we were interested in changing the grid lines, this would have to be done manually too. Looking at this plot, is this something that you are able to easily recreate? Would you know where to point and click to generate this visualisation?

R

Here we created a similar plot in R using the {ggplot2} functions. Because the code is visible we can easily recreate the plot above, but also, we are able to conveniently see which functions and aesthetics were applied to our plot.

ggplot(data = movies_data, aes(x = Year, y = no_employees)) +
 geom_point(aes(colour = Country)) +
 labs(x = "Years",
 y = "Number of employees",
 colour = "Country") +
 theme_bw()
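The earlier point about wrapping plotting code in a function, so that the same plot can be drawn for different data sets, can be sketched as follows. This is only an illustration: plot_employees() is a hypothetical helper, and toy_data is invented here so that the example runs without the blog's CSV file.

```r
library("ggplot2")

# Hypothetical helper: draws the scatter plot above for any data frame
# that has Year, no_employees and Country columns
plot_employees <- function(data) {
  ggplot(data = data, aes(x = Year, y = no_employees)) +
    geom_point(aes(colour = Country)) +
    labs(x = "Years",
         y = "Number of employees",
         colour = "Country") +
    theme_bw()
}

# Toy data, invented purely for illustration
toy_data <- data.frame(
  Year = c(2000, 2001, 2000, 2001),
  no_employees = c(10, 12, 7, 9),
  Country = c("A", "A", "B", "B")
)

p <- plot_employees(toy_data)
```

Calling plot_employees() on a second data set with the same columns would reproduce the plot with no pointing and clicking.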

Theming system in {ggplot2}

Theme arguments specify the non-data features that you can control. For example, the axis.text argument controls the appearance of the axis text such as the font size, colour and face of text. The axis.ticks.x controls the ticks on the x-axis and so on. The theme() function allows you to override the default theme elements, like theme(plot.title = element_text(colour = "red")). Complete themes, like theme_bw(), set all of the theme elements to values designed to work together.
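Because themes are ordinary R objects, a complete theme and a set of overrides can be combined once and then reused across plots. A minimal sketch, in which house_theme and the toy data frame are made up for this example:

```r
library("ggplot2")

# Hypothetical "house style": a complete theme plus a few overrides
house_theme <- theme_bw() +
  theme(
    plot.title = element_text(colour = "red"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Toy data, invented purely for illustration
toy <- data.frame(x = 1:3, y = c(2, 4, 3))

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  ggtitle("Toy example") +
  house_theme
```

Any other plot can pick up the same look by adding house_theme as its final layer.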

We can take this plot even further. Let’s say we were interested in creating the same plot as above, but with each country having its own plotting panel within the same visualisation. We can use the facet_wrap() function from the {ggplot2} package:

ggplot(data = movies_data, aes(x = Year, y = no_employees)) +
 geom_point() +
 facet_wrap(~Country, ncol = 4) +
 labs(x = "Years",
 y = "Number of employees") +
 theme_bw() +
 theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

We have also utilised the axis.text.x element to adjust the angle and position of the x-axis labels to ensure that they are legible. Are you able to create this in Excel without copying and pasting the graphs? If so please do show us how you were able to do this.

Now, let’s proceed to create a histogram using Excel and R. Looking at the theme() function alone, we can see that R has a lot more features available that we are able to modify, such as axes text, fonts, legend size and grid lines. As a data enthusiast, which graph looks more aesthetically pleasing to you?

Histogram Plot

Excel

The histogram generated below was a bit more time consuming. Firstly, we had to change the size of the bars in a normal bar graph in order to generate a histogram. The colours of each column had to manually be selected and applied. Adding a legend to this plot was also a manual process. Looking at this plot, is this something that you are able to easily recreate?

Now, let’s generate a histogram using R and its {ggplot2} functions.

R

Once again, we can easily control all of the variables and aesthetics of the histogram generated using {ggplot2}. Here we used scale_fill_viridis(), a function from the {viridis} package that works with {ggplot2}, to set the colours of the histogram bars. We also used theme_classic() to create a classic-looking plot with x- and y-axis lines and no gridlines, and edited the size, colour and font of the axis text (axis.text).

ggplot(data = movies_data, aes(x = Highest_profit)) +
 geom_histogram(aes(fill = Country)) +
 # The legend belongs to the fill aesthetic, so it is labelled via fill
 labs(x = "Yearly profit (in million dollars)", y = "Count", fill = "Country") +
 scale_fill_viridis(discrete = TRUE) +
 theme_classic() +
 theme(
 axis.text = element_text(size = 10, colour = "black", family = "serif")
 )

Now, let’s move on and generate our last plot.

Line Plot

Excel

The line plot was the most complex plot to create. Firstly, the data within the year column had to be rearranged in ascending order, otherwise Excel put the earlier years after the later years. Excel was also unable to draw a separate line for each country on the same graph, as some countries did not have data for all the years. After a lot of frustration with Excel, we attempted to create a very basic line plot in R.

R

With only three lines of code and very little frustration, we were easily able to recreate the line graph above in R.

ggplot(data = movies_data, aes(x = Year, y = Number_movies)) +
 geom_line(aes(colour = Country)) +
 labs(x = "Years", y = "Number of movies produced")

Now, let’s add some more aesthetics to our plot as we did for the previous ones by changing the font size (axis.title and axis.text), changing the panel border (panel.border), as well as editing the legend size (legend.key.size). Here we decided to use the theme_dark() function in R to create a dark background, which is commonly used to make thin coloured lines pop out.

ggplot(data = movies_data, aes(x = Year, y = Number_movies)) +
 geom_line(aes(colour = Country)) +
 labs(x = "Years", y = "Number of movies produced", colour = "Country") +
 theme_dark() +
 theme(
 # linewidth replaces the size argument for lines from {ggplot2} 3.4.0 onwards
 panel.border = element_rect(colour = "black", fill = NA, linewidth = 2),
 axis.title = element_text(size = 12, face = "bold", family = "Arial"),
 axis.text = element_text(size = 10, colour = "black", family = "Arial"),
 legend.key.size = unit(0.50, "cm")
 )

When comparing R and Excel, it’s important to define the level of information you are looking for. If you want to run basic statistics quickly or create a very basic graph, Excel may be the better choice, thanks to its easy point-and-click system. Before plotting a graph, ask yourself: “How detailed does my visualisation need to be? Am I creating a plot for a publication or not?” In Excel we can easily select a chunk of data and make a simple chart. However, when making more comprehensive plots, using Excel can be extremely frustrating and time-consuming. It all comes down to what you need your graphics to do. For those planning to publish large amounts of complicated data, the time spent in R creating impressive visual representations will certainly be worthwhile. R is also not difficult to use, and gives you more options for customisation than Excel.

R and Excel are beneficial in different ways. Excel starts off easier to learn and is the go-to program when we are first exposed to computers, and some of us end up being stuck there. However, R is designed to be reproducible, which is clearly of high importance. It’s not a question of choosing between R and Excel, but of deciding which program to use for different needs.

If you’re interested in learning how to create graphs using R, then attend our Data visualisation with ggplot2 course.

For updates and revisions to this article, see the original post

We’re a British Data Awards 2023 Finalist

Tue, 11 Apr 2023 23:59:00 +0000

We’re delighted to announce that we’ve been named a Finalist in the British Data Awards 2023.

The British Data Awards is an annual quest to discover and celebrate data success stories. Organisations taking part this year range from FTSE 100 heavyweights, public sector pioneers, technology unicorns, fast-growing scale-ups, essential Not-For-Profits, and everything in between.

A record 226 entries were received this year which means that competition to be named a Finalist proved to be particularly tough, so we’re especially pleased to be announced as a Finalist.

Jason Johnson, Co-Founder of Predatech and British Data Awards judge said: “Judging the British Data Awards this year wasn’t easy given the high standard of entries. All our Finalists should be incredibly proud of their data success stories and for helping to showcase the best that the world of data has to offer. I look forward to celebrating your achievements in May.”

Our nominations

Data for Good Consulting Initiative of the Year (sponsored by The Dot Collective)

Jumping Rivers has been working on a project for the World Health Organisation Europe, streamlining and maintaining their COVID19 vaccination programme monitoring application. You may have read a little about this project in our recent blogs on offloading Shiny’s workload and working smarter, not harder with automated workflows. This is a great example of data being utilised to track and develop global initiatives. The maintenance that Jumping Rivers has performed on this app allows it to be quick, flexible and robust to changes in the data. The automation also allows the staff at the WHO/Europe to concentrate their efforts on important initiatives, rather than spending their time cleaning and managing data.

Rising Star of the Year

Jack Walton is a finalist for the Rising Star of the Year award! Jack is very community driven, leading data science meetups and contributing to open source projects and online support networks. He inhabits a unique space in the industry, between data science and data engineering, carving out a position for himself acting as an intermediary between the two areas, and allowing for greater collaboration across the company.

The British Data Awards 2023 will announce Winners across some 22 categories. A number of Highly Commended awards will also be presented. This year, ‘Data for Good Initiative of the Year’ and ‘Innovation of the Year’ received the most entries. Other categories include ‘Data Leader of the Year’ and ‘Technology Company of the Year’, while new categories including ‘Climate Change Initiative of the Year’ were introduced to help showcase and celebrate the work of a diverse group of organisations. The British Data Awards 2023 judging panel included:

Roshan Awatar: Group Director, Data & Analytics at Sky
Lynne Bailey: Chief Data Officer at KPMG UK
Neil Carden: CEO at Forth Point | A Blend360 Company
Dr Sophie Carr: Founder at Bays Consulting
Caroline Carruthers: CEO at Carruthers and Jackson
Christina Finlay: Director of Data & Analytics, Nest
Roxane Heaton: Chief Information Officer at Macmillan Cancer Support
Natalie Jakomis: Director of Data & Analytics at Rightmove
Jason Johnson: Co-Founder at Predatech
Natasha Lauer: Head of Marketing at Soda
Dr Jo Watts: CEO & Founder at effini

Finalists will be celebrated, and Winners announced, at an awards ceremony taking place in London on the 11th May.

SatRdays London is now Hybrid!

Thu, 06 Apr 2023 23:59:00 +0000

SatRdays London is fast approaching, and we have a couple of exciting announcements to share with you!

Full program available now

The full list of speakers and their abstracts can now be found in a downloadable program on our conference website, along with the schedule for the day and the registration options.

Registration deadline extension

The registration deadline has now been extended to the 21st April, so you can register all the way up to the day before the event!

Virtual tickets now available

Most excitingly, we are pleased to announce that this will now be a hybrid event! If you aren’t able to make it to London for the day, there’s no need to miss out. You can sign up in the same place (via the website), and select the “Virtual only” option! You will then be able to watch live on the day, and join in on the Q&A sessions with our speakers.

We’re really looking forward to hosting you all, whether at the incredible Bush House in London or virtually, so please book your place now to make sure you don’t miss out on our excellent line-up of speakers. We have a great range of talk topics: from R in journalism and MLOps, to sustainability and EDI in the R project, and from air quality analysis to scrutinising government spending. There will be something for everyone at this month’s SatRdays London event!

Optimising tooltip design with modern CSS

Thu, 30 Mar 2023 23:59:00 +0000

In my blog post on improving the responsiveness of Shiny applications I mentioned a recent project I was involved with as part of a collaboration with Utah Tech University. Part of that project involved the construction of interactive Sankey (or to be extra-precise “alluvial”) diagrams using the d3 JavaScript library. One of the requirements was that the user could hover over a link or node in the diagram and see all the connections to or from that link or node highlighted. The image below shows a cropped section of one such Sankey, constructed using data from the diamonds dataset in R’s {ggplot2} package. The data used was handy for illustrative purposes here - whether a Sankey diagram is a good way of visualising that data is largely moot for the discussion that follows.

While colour-highlighting can be a great way of emphasizing part or parts of a chart or diagram, it doesn’t usually add precise information, which was important to the client. To add this precise information we used tooltips. But to make them as effective as possible we had to spend a bit of time refining their design.

The power of tooltips

Tooltips typically give you precise information next to your cursor or where you’ve just tapped. That is to say, where you happen to be looking. This is great because, to put it bluntly, your peripheral vision is rubbish. Don’t worry, mine is too. As Jeff Johnson outlines in Designing with the Mind in Mind (third edition, chapter 5), the centre of our visual field - the fovea - contains around 158,000 cone cells per square millimetre and around half of the visual cortex is then devoted to processing information coming from the fovea. Yet the fovea makes up only about 1% of the retina! The rest of the retina contains only around 9,000 cones per square millimetre and, on top of that, data from these cells is compressed (multiple cones and rods connect to each ganglion cell) before being sent to the brain.

Peripheral vision is useful for detecting motion. That’s great if you want to see that apex predator sneaking up on you or that application in your macOS Dock that really wants you to give it some attention (thanks Apple! 🙄). So peripheral vision guides attention. By using tooltips close to your cursor you don’t have to worry about your user’s focus and attention having to dart from one bit of the screen to another (and maybe getting lost on the outward or return journey).

Returning to our Sankey diagram, by adding a tooltip relating to the item being hovered over we can see precise information relating to that item - “details on demand”.

The problem with tooltips

There’s a problem. And that problem is the same as the advantage given before: the tooltip is placed right where you happen to be looking. This is bad because there’s a reasonable chance some useful information has just been occluded. If we compare the two images above we can see that, in the latter case, the “Good” node has been completely hidden while the vertical extents of the “Very Good” and “Fair” nodes are also no longer obvious. With a tooltip that follows the cursor, the user can move around a bit, but it’s not ideal, especially with the thinner links.

One obvious option is to make the background translucent. With an HTML tooltip we can do that by setting its CSS background-color property to a colour with an alpha channel value less than one. Let’s try rgba(255, 255, 255, 0.3):

This does a pretty good job of fixing the occlusion problem but now the text on the diagram interferes with the text of the tooltip, making them both hard to read where they overlap. One could conceivably hide the Sankey text when the tooltip is visible, but that is likely to lead to an annoying flashing behaviour as that text comes and goes with cursor movement.
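To see why the overlapping text stays visible, it helps to look at the arithmetic of standard alpha compositing: each colour channel of the result is a weighted mix of the tooltip background and whatever lies underneath. A rough sketch of that calculation in R, where blend() is a hypothetical helper and values are on the 0 to 255 scale:

```r
# Hypothetical helper: mix one colour channel of a translucent foreground
# with the background beneath it
blend <- function(foreground, background, alpha) {
  alpha * foreground + (1 - alpha) * background
}

# White tooltip background (255) at alpha 0.3 over black diagram text (0)
blend(255, 0, alpha = 0.3)    # 76.5: still a very dark grey

# The same tooltip background over a white part of the page
blend(255, 255, alpha = 0.3)  # 255: unchanged
```

With an alpha of 0.3 the diagram text is only lightened slightly, so it still competes with the tooltip text on top of it.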

Combatting occlusion without introducing interference

Now, chances are you’ve sat through a few video conferences over the last few years. If you have, there’s a high chance you’ve seen the use of blurring background filters - the application detects what, in the video feed, is the human and what is the background and blurs the latter. The viewer of the feed then sees the human clearly while the background is much less clear. There’s still a general feeling for what’s there but not the details. You might be able to make out that there’s a bookshelf with books and trinkets but not the titles of the books or the precise nature of the trinkets.

This got me wondering whether I could do something similar for tooltips. I already knew there was a CSS filter property with a blur() function so I thought I’d try that:

Oops. That did not improve readability at all. Obviously I don’t want to blur the text in the tooltip. So my next thought was to make the text container transparent then place a separate, translucent, background element directly behind it and apply the blur filter to that. Thankfully I didn’t have to go to that faff as I discovered there also exists a backdrop-filter property. The interactive graphic below shows it in action, with the updating CSS below showing what might be used with a tooltip that has an HTML class of "tooltip". (Note that in this example the blur is measured in pixels and the image size varies with screen width, so the optimal blur size here may vary for you depending on the dimensions of your browser window.)

.tooltip {
 background-color: rgba(255, 255, 255, 0.3);
 backdrop-filter: blur(2px);
}

This does exactly what I want: The text of the tooltip is completely readable (for me at least, more on that in the Q & A section below). The “Good” node is still visible as a distinct entity from the “Fair” and “Very Good” nodes. Ok, I can’t read the label any more, but I at least know there is something of interest there.

What is actually happening? The CSS blur function applies a Gaussian blur to the target element’s background with the standard deviation specified as the argument (e.g. two pixels). Large areas of flat colour are only really affected at the edges - in entirely non-scientific terms I like to think of it as some of the colour from one pixel being smudged into neighbouring and nearby pixels while the nearby pixels smudge the same colour back into the original pixel for no net effect. Text is basically all edges and so is completely smudged - black text on a white background becoming a grey “blob”.

A more scientific explanation would probably include talk of Fourier transforms, the frequency domain and low-pass filters.
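For the curious, the "smudging" can be imitated in one dimension with a few lines of R. This is only a rough sketch of a Gaussian blur, not how a browser implements it; gaussian_kernel() and blur() are helpers written for this illustration, and the two signals stand in for a wide block of colour and a thin stroke of text.

```r
# Normalised 1-D Gaussian weights with standard deviation `sigma` (in pixels)
gaussian_kernel <- function(sigma, radius = ceiling(3 * sigma)) {
  weights <- dnorm(-radius:radius, sd = sigma)
  weights / sum(weights)
}

# Blur a signal by taking a Gaussian-weighted average around each pixel
blur <- function(signal, sigma) {
  kernel <- gaussian_kernel(sigma)
  radius <- (length(kernel) - 1) / 2
  # Pad both ends by repeating the edge values
  padded <- c(rep(signal[1], radius), signal,
              rep(signal[length(signal)], radius))
  blurred <- stats::filter(padded, kernel, sides = 2)
  as.numeric(blurred[(radius + 1):(radius + length(signal))])
}

block <- c(rep(0, 20), rep(1, 20), rep(0, 20))  # wide block of colour
stroke <- c(rep(0, 29), 1, 1, rep(0, 29))       # thin stroke, like text

max(blur(block, sigma = 2))   # the middle of the block keeps full intensity
max(blur(stroke, sigma = 2))  # the stroke's peak drops to well under half
```

The block survives with softened edges, while the thin stroke is spread out into a faint smear, which is the effect seen behind the tooltip.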

Q and A

Does this work on all browsers?

The backdrop-filter property does not work on Safari at the time of writing; it’s simply ignored. However, there is a vendor-prefixed version - -webkit-backdrop-filter - that does work. So a little update to the CSS code (that I already sneaked in behind the scenes to the example above) can make this work across all modern browsers (as far as I’m aware):

.tooltip {
 background-color: rgba(255, 255, 255, 0.3);
 -webkit-backdrop-filter: blur(2px);
 backdrop-filter: blur(2px);
}

What about accessibility?

While we like to cover accessibility issues in our blog posts, a thorough treatment of tooltip accessibility is beyond the scope of this post. However, it is worth mentioning that some people may struggle with the reduced contrast that can come with a translucent tooltip background. So it might be worth considering offering users an override to make the background opaque. Alternatively, your CSS code can check if your user has informed their operating system or browser that they prefer increased contrast. When they do, you then override the applied styles:

.tooltip {
 background-color: rgba(255, 255, 255, 0.3);
 -webkit-backdrop-filter: blur(2px);
 backdrop-filter: blur(2px);
}

@media (prefers-contrast: more) {
 .tooltip {
 background-color: white;
 -webkit-backdrop-filter: none;
 backdrop-filter: none;
 }
}

Probably not. It works here because the encodings I want to keep visible are large blocks of colour and those I want “obscured” are smaller. Because of this, I’m guessing this design style might work effectively with some thematic maps. On the other hand, where the data is encoded using small elements (e.g. a scatter plot with lots of small points of varying colour) the result might not be ideal.

For updates and revisions to this article, see the original post

SatRdays London 2023: Sponsors

Tue, 28 Mar 2023 23:59:00 +0000

SatRdays London 2023 is fast approaching!

Don’t miss out! Ticket sales close at midnight on 8th April!

On 22nd April 2023 we will be hosting SatRdays London, an inclusive, low-cost event that gives R users an opportunity to network and learn from other experts across sectors. In a recent blog post, we introduced all of the speakers for the event! This week, it’s the sponsors’ turn.

CUSP London

The Centre for Urban Science and Progress (CUSP) are based in London, UK. Their mission is to support interdisciplinary research and innovation using Data Science in and for London, bringing together multi-disciplinary teams of academics, a group of associates from external partners and students working with CUSP.

CUSP London, hosted in the King’s Department of Informatics, is part of an international network of multidisciplinary institutes led by CUSP New York at the NYU Tandon School of Engineering. They welcome new academic collaborators and external partners into the CUSP family locally and internationally.

CUSP are generously providing the venue for SatRdays London 2023.

Jumping Rivers

Jumping Rivers is an analytics company based in the North East who specialise in creating bespoke solutions for modern business problems. Their team is made up of experts in data science and data engineering from many different backgrounds, and their wealth of knowledge and experience allows them to think outside the box and solve problems in new and innovative ways.

Posit

Posit (formerly RStudio) are a US-based company who develop R and Python-based tools to help you produce higher quality analysis faster. As well as creating open-source tools for all to use and develop, Posit are very active in the data science community, hosting their own annual conference, as well as supporting conferences around the world, including SatRdays London!

R Consortium

The central mission of the R Consortium is to work with and provide support to the R Foundation and to the key organisations developing, maintaining, distributing and using R software through the identification, development and implementation of infrastructure projects. Its members include leading institutions and companies dedicated to the use, development and growth of R.


Alt Text in R: Plots, Reports, and Shiny

Thu, 23 Mar 2023 23:59:00 +0000

What is alt text?

Alt text (short for alternative text) is text that describes the appearance and purpose of an image. Alt text has multiple purposes, the main one being that it helps visually impaired users to better understand your content when the alt text is read aloud by screen readers. Alt text is also shown in place of an image if it fails to load, which means that users with a poor internet connection are more likely to be able to engage with your content.

How do I write alt text?

There are already a lot of good resources on how to write alt text, and so that’s not the main focus of this blog post. This Medium article by Amy Cesal describes a simple formula for helping you to write alt text for charts, which I’ve found really helpful. Liz Hare also recently gave a talk on alt text for R-Ladies New York, and the slides are an excellent resource.

Can I automate writing alt text?

One of the often-cited arguments for using programming languages, such as R, is that they allow you to automate processes. And so you may very well be wondering “can I use R to automate the writing of alt text?” Before I answer that question, let me remind you of the phrase just because you can, doesn’t mean you should.

The examples of automated alt text to describe plots that I’ve seen tend to describe which variables are on the x and y axes, the range of the data, the chart title, and maybe the colours in the plot. Some make attempts to describe a trend line. What’s almost always missing is the “why”. It’s very difficult to automate a description of what you’re trying to communicate to the person interpreting a plot with only a list of plot components.

One R package that may be useful as a starting point for writing alt text is the {BrailleR} package. It has support for generating alt text for both base R and {ggplot2} graphics, using the VI() function. Since it’s still missing the “what am I supposed to be seeing” message, and it doesn’t always get it right, I’d encourage you never to rely 100% on automated alt text. It could provide a starting point for you to check, edit, and include the take-home message of your graphics.

After you’ve written the alt text for your image, you need to actually add it to your document or app. If you’re directly writing HTML code, it’s usually quite straightforward - and this guide for improving accessibility with alt text gives a great overview. Today, this blog post will show you how to include alt text in your web applications and documents when you’ve built them in R.

How do I add alt text in R?

With R, you can create static plots, documents, presentations, web applications, and many other output types - and they all need alt text! We’ll go through how you do that for the most common output types in R.

{ggplot2}

It goes without saying that {ggplot2} is one of the most popular packages for creating plots in R. So it’s likely that you’ll be adding alt text to a plot created with {ggplot2}. Within the labs() function in {ggplot2}, there’s an argument alt (introduced in version 3.3.4) - and this is where you can add alt text.

library("ggplot2")

# `lemurs` is assumed to be a data frame with a species `name` and a count `n`
g <- ggplot(lemurs, aes(x = name, y = n)) +
  geom_col() +
  labs(
    x = "",
    y = "Number of lemurs",
    title = "Lemurs at Duke Lemur Center",
    alt = "A bar chart titled Lemurs at Duke Lemur Center. On the x-axis three species of lemurs are shown including the Crowned lemur, Gray mouse lemur, and Ring-tailed lemur. On the y-axis the count of the number of each species is shown. The number of lemurs ranges from just under 2500 for Crowned lemurs, to almost 12500 for Gray mouse lemurs. The number of Crowned lemurs is significantly lower than the other two species shown."
  )

If you save the plot as a variable, you can extract the alt text with:

get_alt_text(g)

There are a couple of reasons for using the alt argument in {ggplot2}:

It’s usually easier to write the alt text when you’re making the plot, rather than when you’re compiling the outputs since it’s fresher in your mind.
The string passed to alt automatically gets passed as the image’s alt text if you use the plot in a Shiny app (more on that later…)

Quarto and R Markdown

Both R Markdown and Quarto (next generation R Markdown) allow you to create outputs in HTML format, such as documents or presentations. Although MS Word and Adobe are starting to allow you to add alt text to Word documents and PDFs, neither R Markdown nor Quarto has support for this yet (although hopefully they will in the future). HTML outputs are more accessible in general, so I’d recommend HTML outputs where possible anyway.

If you’re creating a plot within a code chunk, you can use the fig.alt option in R Markdown to pass in a character string of alt text:

```{r, fig.alt="A bar chart titled Lemurs at Duke Lemur Center. On the x-axis three species of lemurs are shown including the Crowned lemur, Gray mouse lemur, and Ring-tailed lemur. On the y-axis the count of the number of each species is shown. The number of lemurs ranges from just under 2500 for Crowned lemurs, to almost 12500 for Gray mouse lemurs. The number of Crowned lemurs is significantly lower than the other two species shown."}
g
```

and in Quarto, the idea is similar but the syntax is slightly different:

```{r}
#| fig-alt: "A bar chart titled Lemurs at Duke Lemur Center. On the x-axis three species of lemurs are shown including the Crowned lemur, Gray mouse lemur, and Ring-tailed lemur. On the y-axis the count of the number of each species is shown. The number of lemurs ranges from just under 2500 for Crowned lemurs, to almost 12500 for Gray mouse lemurs. The number of Crowned lemurs is significantly lower than the other two species shown."
g
```

Adding alt text directly into the code chunk options has always felt slightly clunky, and so what you can do instead is store your alt text in a variable and reference it in the code block option. Alternatively, you can make good use of the get_alt_text() function in {ggplot2} in your R Markdown chunk options:

```{r, fig.alt=ggplot2::get_alt_text(g)}
g
```

In Quarto, you need to be a bit more explicit about the fact you’re calling a function, but it’s still pretty straightforward:

```{r}
#| fig-alt: !expr ggplot2::get_alt_text(g)
g
```

If you inspect the HTML of your R Markdown / Quarto output (by right-clicking and selecting Inspect, or using the Ctrl+Shift+I keyboard shortcut), you can see that the alt text has been added to the image.

If you’re creating an output format that doesn’t allow you to add alt text, such as PDF, you should still add a description of the image somewhere. You could pass in ggplot2::get_alt_text(g) to the fig.cap chunk option as an alternative.

If you’re adding an image outside of a code chunk, you can add alt text to images in Quarto using:

![](lemur.png){fig-alt="A drawing of a lemur."}

and replacing fig-alt with alt works for R Markdown.

Shiny

Finally, on to adding alt text in Shiny apps! Since Shiny makes it easy to build web applications straight from R, it’s important that you know how to add alt text to Shiny apps from R. If you haven’t made your plots with {ggplot2}, haven’t added your alt text to the alt argument in labs(), or need your alt text to update, read on!

Most plots in Shiny apps are generated within a renderPlot() call, with the first argument being the code that generates the plot. renderPlot() also has an alt argument (added in version 1.5.1) where you can pass in a character string of alt text for your plot:

renderPlot({
 # code to generate plot goes here
 },
 alt = "alt text goes here"
)

However, most plots in Shiny apps have some sort of reactivity associated with them - when a user changes an input value, the plot updates. This means that the alt text should update as well. Luckily, you can pass in a reactive() to the alt argument in renderPlot():

renderPlot({
 # code to generate plot goes here
 },
 alt = reactive({
 # code to add alt text goes here
 })
)

This means you can pass in different strings of alt text depending on which input values a user has selected. Depending on what the plot contains and what the user inputs do, you could construct the alt text based on the inputs. Even better, create a look-up table that returns human-written alt text based on a combination of input variables.
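As a rough sketch of the look-up table idea, the base R code below stores one human-written description per combination of inputs and falls back to a generic description when no entry exists. The input names (species and chart_type), the helper get_lookup_alt() and the placeholder descriptions are all hypothetical; in a real app the descriptions would be written by a person who has looked at each plot.

```r
# Hypothetical look-up table: one human-written description per combination
# of two assumed Shiny inputs, `species` and `chart_type`.
alt_text_lookup <- data.frame(
  species = c(
    "Crowned lemur", "Crowned lemur",
    "Ring-tailed lemur", "Ring-tailed lemur"
  ),
  chart_type = c("bar", "line", "bar", "line"),
  alt = c(
    "Human-written description of the Crowned lemur bar chart goes here.",
    "Human-written description of the Crowned lemur line chart goes here.",
    "Human-written description of the Ring-tailed lemur bar chart goes here.",
    "Human-written description of the Ring-tailed lemur line chart goes here."
  ),
  stringsAsFactors = FALSE
)

# Return the description matching the current inputs, or a generic fallback
# so that the plot never ends up without any alt text at all.
get_lookup_alt <- function(species, chart_type, lookup = alt_text_lookup) {
  is_match <- lookup$species == species & lookup$chart_type == chart_type
  if (!any(is_match)) {
    return("A plot of lemur data. No detailed description is available for this selection.")
  }
  lookup$alt[is_match][1]
}

get_lookup_alt("Ring-tailed lemur", "line")

# Inside a Shiny server function this could be wired up as (input IDs assumed):
# output$plot <- renderPlot(
#   { ... },
#   alt = reactive(get_lookup_alt(input$species, input$chart_type))
# )
```

Keeping the descriptions in a table rather than scattered through if statements makes it easier to review them in one place and to spot input combinations that still lack a description.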

If you want to read more about accessibility in Shiny then check out our previous blog post on the topic.

I hope this blog post has convinced you that writing alt text is worthwhile, and not too tricky to add into your R developed documents and apps!


How to customise the style of your {shinydashboard} Shiny app

Thu, 16 Mar 2023 23:59:00 +0000

Using {shinydashboard} is great for creating dashboard prototypes with a header-sidebar-body layout. You can quickly mock up a professional looking dashboard containing a variety of outputs, including plots and tables.

However, after a while, you’ll probably have had enough of the “50 shades of blue” default theme. Or, you might have been asked to follow company branding guidelines, so you need to replace the default colours with custom ones.

This blog will take you through three different options when customising a {shinydashboard}. First, we’ll look at using the colour and theme options available in the package. Then, we’ll show you how to use the {fresh} package to be able to use completely custom colour palettes. Finally, we will look at using custom CSS to give you even more control of the overall style of your dashboard.

The {shinydashboard} package

Before we get started with styling our dashboard, let’s do a quick refresher of what {shinydashboard} is and how to use it. {shinydashboard} is a package which provides a simple dashboard layout consisting of a header, sidebar, and body. The code below creates an empty dashboard, using the main layout functions from {shinydashboard}: dashboardHeader(), dashboardSidebar(), and dashboardBody(), all wrapped inside of dashboardPage().

library("shinydashboard")
library("shiny")

ui = dashboardPage(
 dashboardHeader(),
 dashboardSidebar(),
 dashboardBody()
)

server = function(input, output, session) {

}

shinyApp(ui, server)

The package is really good at this basic type of layout, and includes ways to enhance it, for example by adding tabs to your app using the menuItem() function. The box(), infoBox(), and valueBox() functions also offer ways of storing outputs in different kinds of containers.

Sticking to quite a rigid layout is what makes {shinydashboard} so great - you don’t have to fiddle around with adjusting the width and height of divs, deciding if you want a sidebar and which side the sidebar should be on etc. Instead, you can just use the default layout which is enough for most dashboards.

However, this rigidity is also the main weakness of {shinydashboard}. If you want to move beyond the basic layout, it may require hacky solutions and can sometimes be downright impossible.

Despite this, it is possible to customise {shinydashboard} using built-in functions and arguments. Let’s take a look at how, using an example dashboard which displays and compares some summary statistics for rental properties in the Bay Area of California, US. All code used in this blog post can be found on our GitHub.

Using built-in colours and skins

Our example app currently uses the {shinydashboard} default colours. The only styling I have done is set the fill colour of my bar chart to match the colour of the value boxes.

The first thing we can customise is the dashboard “skin”, which is the colour of the dashboard header at the top of the app. The skin argument in dashboardPage() can be one of “blue” (the default), “black”, “purple”, “green”, “red”, or “yellow”. We will set the skin to be “purple”:

dashboardPage(
 skin = "purple",
 ...
)


The other main thing we might want to recolour is the value boxes. There is a color argument in the valueBox() function, which offers slightly more colour choices than the skin (15 instead of 6). Luckily, there is a purple in the list of valid colours. For all 6 of the value boxes in the app, we will need to add color = "purple" as an argument:

valueBox(
 color = "purple",
 ...
)


Using the {fresh} package

The {fresh} package is an add-on package which helps you style your {shiny} apps, including apps built with {shinydashboard}.

{shinydashboard} is built using AdminLTE, an open source dashboard and control panel theme built on top of Bootstrap. Therefore, functions in {fresh} used to customise {shinydashboard} themes follow the pattern adminlte_*. We will use adminlte_color() to customise our default colours.

At the top of our app, we need to create a new theme, my_theme, using the create_theme() function. In our theme, we are going to change the default AdminLTE colour called “light-blue” to use our company colour instead:

my_theme = create_theme(
 adminlte_color(
 light_blue = "#4898a8"
 )
)

We then need to tell {shinydashboard} to use this theme, by placing a call to use_theme() in the dashboard body.

dashboardBody(
 use_theme(my_theme),
 ...
)

Now, if we change our value boxes to have color = "light-blue" and remove any skin argument in dashboardPage(), the dashboard picks up our company colour.

Being able to use any custom colours is definitely a step up from relying on the built-in colour choices of {shinydashboard}. However, let’s take it even one step further and fully customise the look of our {shinydashboard} using CSS.

Using CSS

CSS (Cascading Style Sheets) is the language used to style HTML elements on any webpage. Normally when you build Shiny apps you don’t have to worry about CSS, which is one of the reasons why Shiny is so easy to get started with. But at some point you’re going to want more control of how your Shiny app looks, and then it’s probably time to learn some CSS.

The main way of including CSS in your Shiny app is by creating a CSS file (a file with the .css extension) and placing it in a folder called www/ in the same folder where your Shiny app lives. We will call this file styles.css by convention.

We are going to use this CSS file to modify two things:

The font of the app: We want to use a custom font Prompt
The colour of the input slider bar: We want it to match the colour of the rest of the app

Once we have identified the elements and the associated properties we want to modify, our CSS file ends up looking like this:

@import url('https://fonts.googleapis.com/css2?family=Prompt&display=swap');

.irs--shiny .irs-bar, .irs--shiny .irs-single {
 border-color: #4898a8;
 background: #4898a8;
}

body, h2, .main-header .logo {
 font-family: 'Prompt', sans-serif;
}

The first line imports our custom font called Prompt from Google Fonts.

The next four lines select the elements of the slider we want to change, and set the border colour as well as background colour to be our company colour (#4898a8).

The last four lines select the body text, our H2 heading, as well as the header text in the top left corner and set the font to be our custom font.

Finally, for a {shinydashboard}, you will need to reference the CSS file in the dashboard body (similar to where we called use_theme() in the {fresh} example). With a stylesheet called “styles.css”, it would look like this:

dashboardBody(
 includeCSS("www/styles.css"),
 ...
)

With the stylesheet in place, the input slider bar now uses our company colour and the text in the app is set in the Prompt font.

Conclusion

There are many ways to customise a {shinydashboard} Shiny app. If you are content with a few different colours, you can stick to the default colour palettes, but if you want to use custom colours you should consider using the {fresh} package. If you want full control of the look and feel of your dashboard, you might want to consider learning CSS and creating your own stylesheet! Although, if you wanted to create a very custom-looking dashboard, you might be better off not using {shinydashboard} at all…


Network Error Logging - Important Insights

Thu, 09 Mar 2023 23:59:00 +0000

This is the second in the series of blog posts about using server headers:

Content Security Policies
Network Error Logging - this one!

Heads up! We’re about to launch WASP, a Web Application Security Platform. The aim of WASP is to help you manage (well, you guessed it) the security of your application using Content Security Policy and Network Error Logging. We’ll be chatting about it more in a full blog post nearer the time.

What is Network Error Logging?

At the time of writing, Network Error Logging (NEL) is still an experimental header from the W3C. It’s a feature of most browsers that lets a website or application opt in to receiving reports from the browser about failed network fetches. Its aim is to let us, the developers, know when a user has failed to reach the application. For instance, NEL would have let W3C know that when I visited their Network Error Logging page, I had a 503…

Why do you need Network Error Logging?

Not being able to load your application (Shiny, R Markdown or Quarto, for example) due to a network failure is possibly the worst experience a user can have on your website (apart from XSS attacks or similar). To understand these errors, we need support from the browser. Why? Because this information never reaches the server, so server metrics cannot capture it.

Since we are setting Network Error Logging at the server layer, we can gain additional insights into how our application is functioning in real life. This level of detail is particularly important now that we are able to quickly create Shiny dashboards and R Markdown and Quarto documents. Once you throw in Posit Connect, you can generate a large amount of web content in a short space of time.

Activating the Report-To header

There are two steps to activating NEL for your site. First, it requires the Report-To header. We chatted a little bit about its predecessor, report-uri, in the Content Security Policy blog. The Report-To header allows us to specify groups of endpoints to use within the Content Security Policy and Network Error Logging headers. This means we can send our CSP and NEL reports to different endpoints for separate processing. An example Report-To header would look like this:

Report-To: {
  "group": "csp-endpoint",
  "max_age": 17280000,
  "endpoints": [
    { "url": "https://jumpingrivers.com/csp-reports" }
  ]
},
{
  "group": "nel-endpoint",
  "max_age": 17280000,
  "endpoints": [
    { "url": "https://jumpingrivers.com/nel-reports" }
  ]
}

In this set-up, we’ve configured the browser to send reports to the endpoints for 17280000 seconds (200 days). After this, you’ll have to re-issue the Report-To header to begin receiving reports again.
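As a quick check on the arithmetic, and as one possible way of assembling the header value, here is a small base R sketch. The helpers days_to_seconds() and report_to_group() are hypothetical names introduced for illustration; how the resulting string is attached to a response depends on your web server or proxy and is not shown here.

```r
# Convert a lifetime in days to the number of seconds expected by max_age.
days_to_seconds <- function(days) {
  days * 24 * 60 * 60
}

# Build the JSON object for a single endpoint group (hypothetical helper).
report_to_group <- function(group, url, max_age_days) {
  sprintf(
    '{"group": "%s", "max_age": %d, "endpoints": [{"url": "%s"}]}',
    group,
    as.integer(days_to_seconds(max_age_days)),
    url
  )
}

# The header value is a comma-separated list of group objects.
header_value <- paste(
  report_to_group("csp-endpoint", "https://jumpingrivers.com/csp-reports", 200),
  report_to_group("nel-endpoint", "https://jumpingrivers.com/nel-reports", 200),
  sep = ", "
)

days_to_seconds(200)
cat("Report-To:", header_value, "\n")
```

Expressing the lifetime in days and converting it in code makes it harder to mistype a long number of seconds.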

Activating the NEL header

The NEL header is pretty simple. There are only two required fields:

report_to: The endpoint group name to send the NEL reports to.
max_age: How long, in seconds, the browser should keep using the endpoint.

If we want to send NEL reports to the nel-endpoint group, then our NEL header looks like this:

NEL: {
 "report_to": "nel-endpoint",
 "max_age": 17280000
}

The report format

Let’s say we’ve set NEL up on our website. A user trying to access a page on the website has received a 400 error code. The browser will send a POST request of Content-Type: application/reports+json with a format similar to:

{
  "age": 15,
  "type": "network-error",
  "url": "https://jumpingrivers.com/example",
  "body": {
    "elapsed_time": 354,
    "method": "POST",
    "phase": "application",
    "protocol": "http/1.1",
    "referrer": "https://jumpingrivers.com/example",
    "sampling_fraction": 1,
    "server_ip": "115.554.22.87",
    "status_code": 400,
    "type": "http.error"
  }
}

The top-level “body” key contains the actual network error report whilst the other top-level keys are meta info about the report. The meta info includes:

age - The time, in ms, between the error being encountered and the browser sending the report.
type - Type of report. Always “network-error” for NEL reports.
url - The URL where the error occurred.

Within the body itself, there are a few important keys we should know about:

referrer - This is the URL from which the user has come. If this and the top-level url are the same, the error happened whilst the user was on the same page.
status_code - The status code that the browser received from the server. In this case, it’s a 400.
elapsed_time - How long it took the browser to abort the process after it started, in ms. For us, this is 354ms.
type - The type of network error. See a full list of the error types here. We’ve got http.error, which means the browser successfully received a response, but it had a 4xx or 5xx status code.
server_ip - The server IP the browser is trying to resolve to.

Note, the report does not get sent as soon as the user gets the network error. The browser will batch reports and send periodically. As well as this, no information is kept about the end-user, just the network error.
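To show how these keys might be used once a report arrives, here is a minimal base R sketch that condenses a single report into a few fields for triage. It assumes the JSON has already been parsed into a nested list (for example with jsonlite::fromJSON(txt, simplifyVector = FALSE)); the helper summarise_nel_report() and its status classes are hypothetical, and the example simply reuses the values from the report above.

```r
# Summarise one NEL report that has been parsed into a nested list.
summarise_nel_report <- function(report) {
  body <- report$body
  status <- body$status_code

  # Assumption: a missing or zero status code means no HTTP response arrived.
  status_class <- if (is.null(status) || status == 0) {
    "no HTTP response"
  } else if (status >= 500) {
    "5xx server error"
  } else if (status >= 400) {
    "4xx client error"
  } else {
    "other"
  }

  list(
    url = report$url,
    error_type = body$type,
    status_class = status_class,
    # TRUE when the user was already on the page where the error occurred
    same_page = identical(body$referrer, report$url),
    elapsed_seconds = body$elapsed_time / 1000
  )
}

# The example report from above, written as the list a JSON parser would return.
example_report <- list(
  age = 15,
  type = "network-error",
  url = "https://jumpingrivers.com/example",
  body = list(
    elapsed_time = 354,
    method = "POST",
    phase = "application",
    protocol = "http/1.1",
    referrer = "https://jumpingrivers.com/example",
    sampling_fraction = 1,
    server_ip = "115.554.22.87",
    status_code = 400,
    type = "http.error"
  )
)

report_summary <- summarise_nel_report(example_report)
str(report_summary)
```

Because browsers batch reports, a real endpoint would typically receive several reports per request and apply a function like this to each one.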

Need help setting up Network Error Logging? Please get in contact.


SatRdays London 2023: Speakers

Tue, 07 Mar 2023 23:59:00 +0000

SatRdays London is fast approaching, and we are happy to announce our full lineup of speakers for the event! Read on for more info. If you want to join the fun, head over to the conference website to sign up!

Keynote Speakers

Julia Silge - Posit

Julia Silge is a data scientist and software engineer at Posit PBC (formerly RStudio) where she works on open source modeling and MLOps tools. She is an author, an international keynote speaker, and a real-world practitioner focusing on data analysis and machine learning. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences.

Oliver Hawkins - Financial Times

Oliver Hawkins works as an editorial data scientist for the visual and data journalism team at the Financial Times. He has previously worked as a statistical researcher and a data scientist for the House of Commons Library, and as a data journalist for the BBC. He is interested in statistics, machine learning and data visualisation.

Contributed talks

Botan Ağın and Michael Stevens - SamKnows

AutRmatic reporting: billions of internet measurements, hundreds of reports and one repository to rule them all

SamKnows has been pioneering internet performance measurements for over 14 years. The reason we exist is to provide a source of truth for how the internet is really performing. The data we collect can be used as a common language between government regulators, internet service providers, academics, and content providers to optimise and improve internet performance for everyone.

Day to day SamKnows uses R to handle a huge range of automated and self-serve workloads. Keeping track of each report’s recipients, delivery schedule, dependencies and deployment procedure can be tricky, especially in the nightmare scenario of suddenly needing to migrate all of your jobs to a new server or cloud environment.

In this presentation, we will talk about how we structure our regularly-scheduled reports as standardised entities within a monorepo. We will explain how this approach reduces the latency in setting up a report, makes it easier for new team members to contribute, and lets us uphold standards while retaining the flexibility to deliver work in diverse formats with a range of complexity levels and opportunities for manual intervention. We will go into detail on specific workflows that take the terabytes of data collected by SamKnows from cloud and on-premises data sources, process them into an R Markdown document, formatted spreadsheet, and raw CSV output, and distribute them through cloud file storage, FTP servers, email, Slack and more.

Vyara Apostolova and Laura Cole - National Audit Office

ScRutinising government spending

“The National Audit Office supports Parliament in holding government to account both via its Financial Audit and Value for Money work. The Analysis Hub is a central team that utilises a range of analytical techniques to support both strands of work. The proposed presentation will showcase two examples of how we in the Analysis Hub use R to support our mission to hold government to account.

We use R to reproduce complex models that departments employ to produce accounting estimates for their financial accounts. Our R reproductions allow us to assess if departments have implemented their selected methodology correctly and to highlight any model integrity issues. We also implement additional sensitivity testing, including via Monte Carlo simulations to capture the uncertainty around model outputs. The presentation will cover an overview of our approach and a demo of a reproduction of a dummy model.

We have also built a R-shiny app, Covid-19 Cost tracker, that brings together data from across the UK government on the costs of measures in response to the Covid-19 pandemic. It is one of the very few sources of comprehensive information on Covid-19 related spending and the only one as an interactive tool. With it the public can examine spending by department and category of spend as well as interact with bubble graphs to explore the costs of individual policies. The presentation will include an overview of how the data analytics team and audit team collaborated to produce the output and a demo of the app.”

Andrew Collier - Fathom Data

Dark Corners of the Tidyverse

“In the realm of the Tidyverse, there are functions which are always in the spotlight. These are the titans: well known and loved, frequently invoked and virtually indispensable. There are other, lesser-known functions which stand quietly in the shadows. Unacknowledged, somewhat obscure and almost forgotten. Waiting for their moment to shine.

I’ll talk about five of these Unsung Heroes of the Tidyverse, lauding their virtues and showing how they can help you succeed on your next Data Science quest.”

Jack Davison - Ricardo Energy & Environment

“Put it on a map!” – Developments in Air Quality Data Analysis

“An understanding of air quality is crucial as it can have significant public health, environmental and economic effects. However, air quality data is complex, constantly changing in space and time, and influenced by a myriad of factors such as meteorology and human activity. This makes air quality analysis challenging, and communicating the results of this analysis more challenging still!

Just over a decade ago, the {openair} package was authored to provide an open-source toolkit to help air quality practitioners get the most out of their data, and is still used widely in academia, consultancy and industry today. While {openair} itself has not changed hugely in recent years, much thought has been put into extending it through leveraging more recent tools and packages.

In this talk I will discuss how we have recently married {leaflet} and {openair} to create effective, interactive air quality maps. In particular, I’ll discuss the development of the {openairmaps} package – a toolset which makes it easy to create interactive “directional analysis” maps to help explore the geospatial context of pollution monitoring data.”

Russ Hyde - Jumping Rivers

Does code quality even matter in data science?

“It depends!
If you need to quickly summarise some data for an ad-hoc request, then knock out the code in whatever manner gets the job done.

But what happens when you start getting a lot of similar requests, or you are working on a more substantial project, or you are collaborating within a larger team? Now, productivity should be viewed ‘across the team’ and ‘across all projects’. What can you do to help yourself and your colleagues, and what tools exist to help?

Code quality concerns those aspects of software that make it easier to work with, easier to explain to others and easier to maintain or extend.

In this talk, I’ll take you through the source code for an evolving analysis project. We’ll discuss how to (and how not to) modularise code. Along the way, we’ll talk about actions and calculations, body-tweaking, duplicate stomping and a few tools that help automate the boring low-level stuff that teams sometimes disagree about.”

Ella Kaye and Heather Turner - University of Warwick

Sustainability and EDI (Equality, Diversity and Inclusion) in the R Project

The R Project is over 20 years old, but its future is not secure - many of the R Core Team are nearing retirement and there are not enough new contributors to sustain the work. We present a number of initiatives, organised under Heather Turner’s ‘Sustainability and EDI (Equality, Diversity and Inclusion) in the R Project’ fellowship, to encourage and train a new, more diverse, generation of contributors. These include R contributor office hours, collaboration campfires, bug BBQs, translatathons and an updated R development guide. This presentation is also a call to action to encourage others to get involved in supporting this language, a fundamental piece of software in many disciplines, used by an estimated 2 million people.

For updates and revisions to this article, see the original post

Content Security Policy - Why You Need It

Thu, 02 Mar 2023 23:59:00 +0000

This is the first in a series of blog posts about server headers

Content Security Policies - this one
Network Error Logging

Heads up! We’re about to launch WASP, a Web Application Security Platform. The aim of WASP is to help you manage (well, you guessed it) the security of your Posit Connect application using Content Security Policy and Network Error Logging. More details soon, but if this interests you, please get in touch.

This blog post is aimed at those who are somewhat tech literate but not necessarily a security expert. We’re aiming to introduce the concept of Content Security Policy and teach some of the technical aspects.

In 2018, a hacking group called Magecart exploited a vulnerability on the British Airways website that allowed them to inject JavaScript. The JavaScript code was used to send customer data to a malicious server, succeeding in skimming the credit cards of 380,000 transactions before the breach was discovered. This type of attack comes under the umbrella of cross-site scripting (XSS) - where malicious code (often client-side JavaScript) is injected into the browser.

What is Content Security Policy?

Content Security Policy (CSP) is a feature of modern (ish) browsers that allows a developer to protect an application through the use of the Content-Security-Policy HTTP header. It’s used to give applications an extra layer of security, safeguarding against attacks such as cross-site scripting. In this blog we’re going to take you through some of the basics of Content Security Policy and show you why it’s a necessity for modern applications.

How will Content Security Policy help me?

In one way or another, you have made it to this blog post on jumpingrivers.com. This means your browser has already loaded a tonne of assets that this page needs to look and act in the way it does (JavaScript, fonts, stylesheets). Without CSP, the browser will trust and not question any loaded resources from any source. If there are any vulnerabilities with this page, an attacker could run client-side JavaScript to import content hosted from their own source; for instance, a fake form or a malicious click event to skim user details or steal data from a database, just like with British Airways. Your browser simply says “Yes, why wouldn’t I trust this code?”. This is where CSP comes into play.

How does CSP link to R?

Have you ever used the {shiny}, {quarto} or {rmarkdown} R packages to make web applications or documents? If you then took the extra step to deploy your app, you should be asking the question “How safe is it to deploy this?”. {shiny}, {quarto} and {rmarkdown} pull in a lot of external resources; css, JavaScript etc. This leaves them vulnerable to cross-site scripting attacks, just like British Airways. Using CSP, we can protect our {shiny} / {rmarkdown} documents against these attacks.

The technical basics

A Content Security Policy HTTP header is set on the server side, but protects the client side. A CSP header is split into directives - each directive enabling you to specify an allow list (in some cases, a deny list) of valid sources for content that the browser can (or is not allowed to) load. For instance, one of the more common directives, script-src, allows us to specify valid sources for scripts. Any scripts that are from a source not listed within this directive will be blocked from executing in the browser. A basic CSP header using script-src might be

Content-Security-Policy: script-src 'self'

The metasource, 'self', tells the browser to allow scripts to be loaded from our domain. As there are no other sources listed with it, we are telling the browser to only allow scripts to be loaded from our domain. There are other metasources:

'self': Content from the same domain,
'none': Nobody can include this functionality. In the case above, this would mean we accept scripts from no sources.
A nonce / hash: Accept code with a specific nonce / hash.

Of course, we can also specify specific URL / domains. For instance,

Content-Security-Policy: script-src 'self' https://posit.co/

would allow loading of scripts from our own domain, and Posit. Other common directives include

default-src: Default values for *-src directives.
font-src: Valid sources for fonts loaded using the @font-face CSS at-rule.
frame-src: Valid sources for embedded frame contents.
img-src: Valid origins from which images can be loaded.
navigate-to: Restricts the URLs to which a document can initiate navigation.
style-src: Valid sources for stylesheets.
media-src: Valid sources for loading media using <audio>, <video> and <track> elements.

For a full list, see the MDN Web Doc.
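To make the structure of the header value concrete, here is a minimal base R sketch that assembles a policy string from a named list of directives. The helper name build_csp() and the chosen directives are our own illustration, not part of any package: each directive name is followed by its space-separated sources, and directives are separated by semicolons.

```r
# Hypothetical helper: turn a named list of directives into a CSP header value
build_csp = function(directives) {
  parts = vapply(
    names(directives),
    # Each directive is its name followed by its space-separated sources
    function(name) paste(c(name, directives[[name]]), collapse = " "),
    character(1)
  )
  # Directives are separated by semicolons
  paste(parts, collapse = "; ")
}

policy = build_csp(list(
  "default-src" = "'self'",
  "script-src" = c("'self'", "https://posit.co/"),
  "img-src" = "'self'"
))
policy
## [1] "default-src 'self'; script-src 'self' https://posit.co/; img-src 'self'"
```

The resulting string is what would follow Content-Security-Policy: in the response header. How it gets attached to a response depends on your server setup, so that part is left out here.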

Reporting Content-Security-Policy violations

If an attacker had found any vulnerabilities on our site, then using the directives above we would be blocking a good bunch of potential attacks for users on modern browsers. However, users on browsers (mainly Internet Explorer) that still do not support the CSP directives you’ve chosen are still at threat. It’s important that we understand which CSP directives are being targeted on our site, to protect the vulnerable on old browsers.

Directives are split into two categories; blockers and reporters. Blockers block input into the application (think script-src) and reporters deliver reports about the blocks. This allows us to understand which of our CSP directives are being targeted.

The most important reporting directive is report-to. However, its predecessor, report-uri, still plays a crucial role. In fact, browsers will fall back to report-uri if they can’t find report-to. We’ll go into more detail on the differences between the two in a later blog, but for now we’ll look into report-uri (it’s a tad simpler).

The report-uri directive allows us to specify the URL(s) to which our CSP violations should be reported. These URLs are usually API endpoints, which process the report JSON. The following HTTP header would POST any violations to the csp-reporting endpoint on our domain

Content-Security-Policy: script-src 'self'; report-uri /csp-reporting

Any reports sent to this endpoint via report-uri will be Content-Type: application/csp-report (the newer report-to mechanism uses application/reports+json) and contain four important pieces of information (plus some others):

blocked-uri: URI of the blocked resource
document-uri: URI of the document in which the violation occurred
original-policy: The original Content Security Policy
violated-directive: The CSP directive that was violated

The format will look something like

{
  "csp-report": {
    "document-uri": "https://magecart.com/example.html",
    "referrer": "",
    "blocked-uri": "https://badwebsite.com/css/style.css",
    "violated-directive": "script-src 'self'",
    "original-policy": "script-src 'self'; report-uri /csp-reporting",
    "disposition": "report"
  }
}

This report indicates that on the page magecart.com/example.html, something has tried to load the file located at badwebsite.com/css/style.css. However, because we have the script-src directive set to 'self', only scripts from our own domain may be sourced.
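Once reports start arriving, we will want to summarise them. The sketch below is one rough way to do that in base R, assuming the JSON has already been parsed into nested R lists (for example with a package such as {jsonlite}). The first report mirrors the example above; the second report and the helper flatten_report() are invented purely for illustration.

```r
# Reports as they might look after parsing the JSON into R lists.
# The second report is made up for this example.
reports = list(
  list("csp-report" = list(
    "document-uri" = "https://magecart.com/example.html",
    "blocked-uri" = "https://badwebsite.com/css/style.css",
    "violated-directive" = "script-src 'self'"
  )),
  list("csp-report" = list(
    "document-uri" = "https://magecart.com/checkout.html",
    "blocked-uri" = "https://badwebsite.com/js/skimmer.js",
    "violated-directive" = "script-src 'self'"
  ))
)

# Hypothetical helper: flatten one report into a one-row data frame
flatten_report = function(report) {
  body = report[["csp-report"]]
  data.frame(
    document_uri = body[["document-uri"]],
    blocked_uri = body[["blocked-uri"]],
    violated_directive = body[["violated-directive"]]
  )
}

# Stack the rows, then count how often each directive was violated
violations = do.call(rbind, lapply(reports, flatten_report))
table(violations$violated_directive)
```

A table like this is a quick way to see which directives are being targeted most often.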

Some limitations

Whilst CSP is a great addition to the security toolbox, there are some “limitations”:

It’s not a magic wand. It’s a control to cut down on your application’s exposure - it will not patch vulnerabilities. Think of it like a firewall - it’s a secondary control, a defence technique. Mostly in case the developers have missed something. If you’re having trouble with any security issues, feel free to get in touch with us for advice.
It’s only useful for client-side attacks on your application. It does not help with server-side, database attacks or anything in between.
OK, this last one isn’t really a limitation, more of a warning. It’s not on by default. The Content-Security-Policy HTTP header has to be added manually with each policy individually specified.

If Content Security Policy or Shiny app security in general interests you or you want more news on WASP, our new Web Application Security Platform, then please email hello@jumpingrivers.com and we can discuss how to set this up for your applications.

For updates and revisions to this article, see the original post

Why should I use R: The Excel R Data Wrangling comparison: Part 1

Thu, 23 Feb 2023 23:59:00 +0000

This is part 1 of an ongoing series on why you should use R. Future blogs will be linked here as they are released.

Part 1: Why should I use R: The Excel R Data Wrangling comparison: Part 1 (This post)
Part 2: Why should I use R: The Excel R plotting comparison: Part 2
Part 3: Why should I use R: Handling Dates in R and Excel: Part 3

The era of data manipulation and analysis using programming languages has arrived. But it can be tough to find the time and the right resources to fully switch over from more manual, time-consuming solutions, such as Excel. In this blog we will show a comparison between Excel and R to get you started!

When choosing between R and Excel, it is important to understand how both solutions can get you the results you need. However, one can make it an easy, repeatable, convenient process, whereas the other can make it an extremely frustrating, time-consuming process prone to human error.

R and Excel

When opening Excel and applying data manipulation techniques to your data, are you easily able to tell what manipulations have been made without clicking on the column or cells? If you were to share these Excel sheets with colleagues are they easily able to replicate your analyses without you telling them where to click or which formulas were applied?

With R all of these are possible. You automatically have all the code visible and in front of you in the form of scripts. Reading and understanding the code is possible because of its easy-to-use, easy-to-read syntax which allows you to track what the code is doing without having to be concerned about any hidden functions or modifications happening in the background.

Most people already learned the basics of Microsoft Excel in school. Once the data has been imported into an Excel sheet, using a point-and-click technique we can easily create basic graphs and charts. R, on the other hand, is a programming language with a steeper learning curve. It will take at most two weeks to become familiar with the basics of the language and the RStudio user interface. Luckily using R can easily become second-nature with practice.

Replicating Analysis

R, while having a slightly steep learning curve, makes it possible to reproduce analyses repeatedly and with different data sets. This is very helpful for large projects containing multiple data sets as it keeps our processes clean and consistent. Excel, however, because of its point-and-click interface, forces us to rely on memory and repetition: we would have to repeat the same analyses multiple times by either copying and pasting or simply repeating the point-and-click process, which can be time-consuming, messy, and prone to human error.

Unlike Excel, R is completely free and benefits from a large community of open-source contributors. To install R and the IDE (RStudio Desktop) to work with R, download and install the relevant versions for your operating system. Once you have successfully installed the IDE, the following user interface will be visible

The area on the left is where you will write R code in scripts, use terminals and run jobs. The right-hand side of the IDE comprises two sections. The top section contains the environment, which stores a list of defined variables and data sets, and also lets you view the history and connect to other databases. The area below contains five different tabs: the Files tab, which lists all of the folders within this project; the Plots tab, which displays any plots that have been generated; the Packages tab, which allows you to manage packages within your environment; the Help tab, which provides a manual; and the Viewer tab, which allows you to view generated interactive content.

Loading the data sets

Excel

The data import steps in Excel are quite straightforward to a day-to-day Excel user. However, the process is certainly not reproducible.

Steps:

Click the Data tab on the Ribbon
Click the Get Data button
Select From File
Select from TEXT/CSV
Select the file and click Import
Click Load

R

There are various ways to import data sets such as local files, online datasets and even through database connections. We will use the read_csv() function from the {readr} package to import our csv files. But first, what are packages? R packages are a collection of R functions, compiled code and sample data that can be installed by R users. Before using an R function such as read_csv() to import the data, we are required to install and load the {readr} package. Packages are great because rather than having to have a huge programme containing everything you could possibly need, the different packages specialise in different things, and can be loaded in as and when you need them, saving a lot of space.

# Installing the package
install.packages("readr")
# Loading the package
library("readr")
# Importing the data
movies_data = read_csv("https://jumpingrivers.com/blog/comparing-r-excel-data-wrangling/movies.csv")

Exploring our data

Before getting started with any data manipulation, let’s explore our data.

Excel

Excel has one basic data structure, which is the cell. These Excel cells are extremely flexible as they store data of various types (numeric, logical and characters). To obtain an overview of the data we could simply just scroll through the Excel data sheet. Now, let’s imagine a data set of 1 million rows and 200 columns, would it still be as easy to scroll through the data sheet to obtain an overview of data? Could we quickly and reliably view all the column names? To me, manually scrolling seems like a very time consuming, unreliable and messy process.

R

To view our data in R, we could simply click on it in the environment or we could call the name of the data set in the script. If we are working with a large data set, we can also view a subset of this data by using functions like head() and tail(). We could also use the colnames() function to programmatically display the variable names within our data.

movies_data
## # A tibble: 26 × 5
## Country Year Highest_profit Number_movies no_employees
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 England 2011 100 3 1500
## 2 America 2012 150 2 2000
## 3 America 2013 300 4 4000
## 4 England 2013 130 2 4020
## # ℹ 22 more rows

str(movies_data) # Displays the structure of the data
## spc_tbl_ [26 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Country : chr [1:26] "England" "America" "America" "England" ...
## $ Year : num [1:26] 2011 2012 2013 2013 2013 ...
## $ Highest_profit: num [1:26] 100 150 300 130 177 350 700 650 230 440 ...
## $ Number_movies : num [1:26] 3 2 4 2 3 1 6 2 1 3 ...
## $ no_employees : num [1:26] 1500 2000 4000 4020 5300 3150 6000 5000 1420 5000 ...
## - attr(*, "spec")=
## .. cols(
## .. Country = col_character(),
## .. Year = col_double(),
## .. Highest_profit = col_double(),
## .. Number_movies = col_double(),
## .. no_employees = col_double()
## .. )
## - attr(*, "problems")=<externalptr>

head(movies_data) # Displays the first six rows of the data
## # A tibble: 6 × 5
## Country Year Highest_profit Number_movies no_employees
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 England 2011 100 3 1500
## 2 America 2012 150 2 2000
## 3 America 2013 300 4 4000
## 4 England 2013 130 2 4020
## # ℹ 2 more rows

tail(movies_data) # Displays the last six rows of data
## # A tibble: 6 × 5
## Country Year Highest_profit Number_movies no_employees
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 England 2021 120 1 1325
## 2 America 2021 800 3 6800
## 3 America 2022 400 2 7200
## 4 China 2021 230 2 3101
## # ℹ 2 more rows

colnames(movies_data) # Displays all the variable names
## [1] "Country" "Year" "Highest_profit"
## [4] "Number_movies" "no_employees"

The movies data comprises five columns: country, year, highest profit gained per movie, number of movies produced and number of employees on set during production. It is clear that R programmatically displays the output of our data whereas Excel requires a lot of eye-balling and manual scrolling. If we were interested in displaying a subset of our data, in a report for example, using R we could simply use the functions above. To do this in Excel we would have to copy and paste the first 6 rows of the data and manually add them to the report document.

Summary Statistics

Now, let’s apply some summary statistics on our data. Summary statistics provide a quick summary of data and are particularly useful for comparing one project to another, or before and after.

Excel

It is well known that Excel has a data storage limit per worksheet: 1,048,576 rows by 16,384 columns. R, by contrast, is made to handle larger data sets. Excel files are also known to crash when they exceed 20 tabs of data. Excel can handle a moderate amount of data, but not much more. This becomes very risky when you unknowingly start to lose data because the file has become too big and is unable to save. To generate summary statistics (such as the minimum and maximum values) of our data in Excel, we followed a few steps:

Scroll to the Home tab
In the Editing group, click the arrow next to AutoSum
Click Min
Click Max
Press Enter

These steps were quite easy to follow, however, I often forget where to click or which tab to select. After discussing this workflow with a colleague, we also discovered slight differences in the steps for different versions of Excel. This did not seem very effective or reproducible to us.

R

summary(movies_data)
## Country Year Highest_profit
## Length:26 Min. :2011 Min. : 11 
## Class :character 1st Qu.:2013 1st Qu.:157 
## Mode :character Median :2017 Median :320 
## Mean :2017 Mean :350 
## 3rd Qu.:2021 3rd Qu.:485 
## Max. :2022 Max. :800 
## Number_movies no_employees 
## Min. :1.00 Min. :1325 
## 1st Qu.:2.00 1st Qu.:2275 
## Median :2.50 Median :4401 
## Mean :2.65 Mean :4338 
## 3rd Qu.:3.00 3rd Qu.:6375 
## Max. :6.00 Max. :7200

# Standard deviation
sd(movies_data$Highest_profit)
## [1] 224

# Lowest value of the Highest profit column
min(movies_data$Highest_profit)
## [1] 11

# Highest value of the Highest profit column
max(movies_data$Highest_profit)
## [1] 800

The dollar symbol, $, used here simply indicates which data set and column we are using for the analysis: the data set goes on the left and the column name on the right. It is evident that the source code of R can be used repeatedly and with different data sets in ways that Excel formulas cannot. R clearly shows the code (instructions), data and columns used for an analysis in ways that Excel does not. If I were to share this script with a colleague they would have a complete understanding of how the summary statistics were generated because of R’s human-readable syntax.
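As a small sketch of both points, the code below wraps the minimum and maximum calculation in a function and applies it to two different data sets. The two data frames are rebuilt by hand from the first and last rows printed earlier, and the function name profit_range() is our own choice for this illustration.

```r
# Two small data sets, copied from the head() and tail() output shown earlier
first_rows = data.frame(
  Country = c("England", "America", "America", "England"),
  Highest_profit = c(100, 150, 300, 130)
)
last_rows = data.frame(
  Country = c("England", "America", "America", "China"),
  Highest_profit = c(120, 800, 400, 230)
)

# Hypothetical helper: data$column picks one column out of a data set
profit_range = function(data) {
  c(min = min(data$Highest_profit), max = max(data$Highest_profit))
}

# The same code runs unchanged on different data sets
profit_range(first_rows)
## min max 
## 100 300
profit_range(last_rows)
## min max 
## 120 800
```

Nothing in profit_range() is specific to one data set, which is exactly what makes the analysis repeatable.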

Data Wrangling

Data manipulation tools assist us with modifying our data to make it easier to read and organise. For example, one of the easiest data manipulation tools in Excel is inserting columns and rows. The purpose of data manipulation is to create a consistent, organised and clean data set. With this in mind, let’s apply the following data manipulations in Excel and then R:

Rename the columns into a consistent format
Arrange the year column in ascending order
Select columns and create a new column
Remove a column from the data
Select only the entries for the year 2021
Remove only the entries from rows 2-4

1. Renaming columns in R and Excel

Excel

Renaming columns in Excel is a completely manual process, which makes it extremely time-consuming and risky, especially if you are working between multiple messy Excel sheets.

R

For data manipulation in R, we use a powerful package called {dplyr}. Let’s install and load the package.

# Installing and loading the package
install.packages("dplyr")
library("dplyr")

To rename the columns, {dplyr} provides two handy functions: rename(), which renames columns one at a time, and rename_with(), which applies a function to every column name. Here we pass rename_with() the name of our data set (movies_data) and the tolower function, which converts every column name to lower case. There are other methods available in other packages which can also clean up column names, but for the purposes of this blog, we will stick with {dplyr}.

# Renaming the column into a consistent format
movies_data = rename_with(movies_data, tolower)

2. Arrange the year column in ascending order

Excel

To sort a column in ascending order, we first had to:

Select the year column
Go to the Sort & Filter menu
Select the option to sort from the smallest to the largest value

R

arrange(movies_data, year)
## # A tibble: 26 × 5
## country year highest_profit number_movies no_employees
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 England 2011 100 3 1500
## 2 America 2011 100 3 1500
## 3 America 2012 150 2 2000
## 4 South Ko… 2012 11 5 1333
## # ℹ 22 more rows

Again, because of Excel’s point-and-click nature, it is impossible to identify, by looking at a column, how the data was modified. If I were to replicate these steps in two years’ time I would likely have forgotten where to point and click. With R however, we have our code which clearly shows each step used to manipulate the data. If I were to return to my script in two years’ time, I would easily be able to replicate the analysis.

3. Selecting and adding a new column

Let’s reduce our data set by first selecting the country, year, number_movies and highest_profit columns. Then we will generate a new column called complete_profit. The complete_profit column should be generated by dividing the highest_profit column by the number_movies column.

Excel

R

movies_data %>%
 select(country, year, number_movies, highest_profit) %>%
 mutate(complete_profit = highest_profit/number_movies)
## # A tibble: 26 × 5
## country year number_movies highest_profit
## <chr> <dbl> <dbl> <dbl>
## 1 England 2011 3 100
## 2 America 2012 2 150
## 3 America 2013 4 300
## 4 England 2013 2 130
## # ℹ 22 more rows
## # ℹ 1 more variable: complete_profit <dbl>

4. Removing a column

Excel

In Excel, inserting or deleting a column is a manual process. First, we select the column then right-click at the top of a column and then select the Delete option.

R

select(movies_data, -year)
## # A tibble: 26 × 4
## country highest_profit number_movies no_employees
## <chr> <dbl> <dbl> <dbl>
## 1 England 100 3 1500
## 2 America 150 2 2000
## 3 America 300 4 4000
## 4 England 130 2 4020
## # ℹ 22 more rows

In R, we simply used the select function from the {dplyr} package to select a column of our data frame. To remove a column we put a - in front of the variable to exclude it from our data.

5. Select only the entries for a particular year

Excel

Here we are interested in extracting the data collected only during the year 2021. Using Excel software, we first sort the year column and then manually select the years that we are interested in. While applying this manual technique of selecting pieces of data that we are interested in, it is very easy to select the wrong data or even accidentally delete data.

R

filter(movies_data, year == 2021)
## # A tibble: 4 × 5
## country year highest_profit number_movies no_employees
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 America 2021 800 3 6800
## 2 England 2021 120 1 1325
## 3 America 2021 800 3 6800
## 4 China 2021 230 2 3101

6. Remove only the row entries from 2-4

Excel

Removing rows in Excel is once again a manual process. We select the rows that we do not want to keep, then right click and delete those rows. These rows are now permanently deleted from the data sheet. If we were interested in adding them back into the sheet, we would have to find it (if we had a back up Excel sheet) and copy and paste it back into our data analysis Excel sheet. If we did not have a back up of the data that we had deleted, then this data would be completely lost.

R

In R we can use the slice() function to return a subset of rows based on their position. If you want to remove rows using slice() instead of retaining them you can just add a - in front of the row indices you’re passing into the function. So, to remove rows 2, 3, and 4:

slice(movies_data, -(2:4))
## # A tibble: 23 × 5
## country year highest_profit number_movies no_employees
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 England 2011 100 3 1500
## 2 South Ko… 2013 177 3 5300
## 3 America 2014 350 1 3150
## 4 South Ko… 2015 700 6 6000
## # ℹ 19 more rows

Using R and Excel

There are multiple ways in which data manipulation is used efficiently in data science. Data formatting is important and must be organised to be read by the various software programs, be it in R or Excel.

Excel is an excellent tool and is easy to use and at times it is the most appropriate tool. Excel is often used for data processing work under general and basic office requirements. However, Excel is limiting in that the data file itself can hold only approximately 1 million rows without the aid of other tools. The basic built in statistical analysis is too simple and has very little practical value. If you are an aspiring data analyst, you will need to expand your toolset and start thinking beyond the rows and columns of a spreadsheet. R functions cover almost any area where data is needed. Getting started with R is very simple especially because of the easy-to-use and understandable syntax. Most importantly, R facilitates reproducible analyses.

A hammer is great for driving nails, but it’s not the only tool out there.

If you’re interested in learning R, then attend our Introduction to R course.

For updates and revisions to this article, see the original post

Shiny in Production 2023: Workshops

Tue, 21 Feb 2023 23:59:00 +0000

Shiny in Production is returning to the Catalyst this October! Our workshop lineup has now been finalised, and our first two speakers are confirmed. If you want to read more about the speakers, or register for the conference, head over to the website. Early bird tickets are now on sale!

For the workshops this year, we see the return of the extremely popular Introduction to Posit (formerly RStudio) Connect, as well as two new Shiny-centred topics.

Shiny and Python

Gone are the days when Shiny was only for R programmers! In the last year, Posit have released Shiny for Python. Further information on the workshop to follow!

Building Responsive Shiny Apps

Shiny Testing

This is the newest of our workshops that we’re planning for the conference, so we’ll have more information on what to expect very soon. If you’re interested in the topic in the meantime, take a look at our recent blog series on end-to-end testing with {shinytest2}.

For updates and revisions to this article, see the original post

Work smarter; not harder: COVID-19 processing for the WHO/Europe

Thu, 16 Feb 2023 23:59:00 +0000

Last night, I filled a washing machine with laundry and scheduled it to finish in the morning. And do you know what I had to do next? Nothing. I simply went to bed. In stark contrast to 100 years ago, I didn’t need to fill a bucket with water, I didn’t spend an hour rubbing clothes against a washboard to agitate away the dirt, and I didn’t need to worry about whether the prolonged contact between a cleaning detergent and my hands was damaging to the skin. Instead, a machine followed its pre-programmed routine, and I slept like a log. And what could possibly be better than an extra hour in bed?

That’s just one of many examples of the small automated processes that appear throughout our lives.

But they all have a common purpose: to make our lives easier.

If you’re a regular on our blog, you may have already read about how we streamlined the data processing on an application we’re maintaining for the World Health Organisation Europe (WHO/Europe). Those steps improved the experience for users of their WHO/Europe COVID-19 Vaccine Programme Monitor, by slashing loading times and improving responsiveness.

But today, I want to tell you about how automation improved the experience for those working behind the scenes of the application. Tasks were completed automatically, taking away opportunities for human error to sneak into our processes. Work was autonomously performed each day, providing early warnings about issues with the latest data. Software was frequently tested on a clean environment, verifying that our work could be reproduced on other systems.

Ultimately, developers and maintainers from both Jumping Rivers and the WHO/Europe spent less time on the trivial and repetitive tasks, and more time making improvements where it really mattered. And by sprinkling a little automation in your work, you might just enhance your productivity too.

Where can we delegate the tasks to?

The aim of these automated workflows is to take some of the menial tasks that are frequently performed, and complete them automatically using a continuous integration and continuous delivery (CI/CD) pipeline. Many options for performing CI/CD pipelines exist already—such as Jenkins, GitLab CI/CD, Bitbucket pipelines, CircleCI to name just a few— but in the case of the WHO/Europe COVID-19 Vaccine Programme Monitor, we utilized GitHub Actions.

In a typical CI/CD pipeline, we are allocated a blank machine, onto which we can install all the software dependencies we need and run the tasks, before the machine is cleaned back out of existence. Now it may sound wasteful to be installing everything from scratch every time a pipeline runs, but there are serious benefits here: starting from scratch is the ultimate check of whether our code is portable and can be run by anyone from any machine. And with a few tricks here and a bit of caching there, set-up times for CI/CD pipelines can actually be very reasonable.

The basic concept of an automated workflow

For GitHub Actions, we specify a few things in a YAML file:

When the workflow should run.
What operating system our virtual machine should use.
What environment variables should be defined.
What tasks should be performed.

What do we automate?

Tests

There are a number of processes that we automate, but we’ll start with the one that most developers will want to automate: testing. It’s a good idea to have tests run when changes are made to the code. After all, if the new code has a mistake, it’s good for your tests to find the error before you go on to build even more code on top of it. So every time changes are pushed to a pull request or the main branch of our git repository, a workflow runs to perform all tests.

Deployments

The WHO/Europe COVID-19 Vaccine Programme Monitor is hosted on shinyapps.io. Originally, when changes were made to the application, someone would have to manually perform the process of publishing the latest version of the application online. Not only is it needlessly inefficient to have a developer spending time on this operation, but it also allows human error to creep in: what if you’re logged into the wrong account, or you overwrite the wrong application, or perhaps you just patched a critical bug in your code repository but forget to publish the fixed app altogether? In this scenario, it’s better to have a pipeline watching over us, ready to step in at the right moment.

A nice feature of shinyapps.io is that multiple apps can be hosted from a single account. We took advantage of this by creating automated workflows that deploy the latest versions of the apps to shinyapps.io every time changes are pushed to the default branch, giving users the newest version of the app at all times.

But to make life easier for ourselves, we also publish versions of the app for every proposed change that we create. Not only does this ensure the app should deploy correctly, but it provides a working version of the application that members of the WHO can view, allowing them to request changes or provide approval before all changes are confirmed. When those changes are incorporated into the main versions of the app, our automated workflows delete these development apps and publish the public version.

Data processing

Our previous blog post on the data processing mentioned how a GitHub Actions workflow now handles data processing outside of the app on a daily schedule. We don’t actually need to push code to GitHub to trigger a workflow; a workflow can be scheduled to start at particular times or at regular intervals. The schedule is defined in a GitHub Actions workflow using a cron expression: a sequence of 5 values that denote the minute, hour, day of month, month, and day of week when a job should occur, specified according to UTC.

Let’s suppose we want to run a job at 09:30 BST (that’s UTC+01), on every weekday (Monday to Friday). We would specify this as:

30 8 * * 1-5

Let’s break that down:

30 8 at the start represents the minutes and hours, so the scheduled time is 08:30 UTC. If you’re working in a BST timezone, that’ll translate to 09:30.
* * means every day of the month and every month of the year, respectively.
1-5 represents the day of the week, where 1 is Monday and 5 is Friday (GitHub Actions counts from 0 for Sunday to 6 for Saturday). So this represents every day from Monday to Friday.

The Crontab.guru website is useful for testing the meaning of a cron expression, or for checking you have constructed your own cron expression correctly.
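To make the five fields concrete, here is a minimal Python sketch that checks whether a UTC timestamp matches an expression like the one above. It is an illustration only: the helper names expand_field and cron_matches are our own, the fields are simply AND-ed together, and features of real cron implementations (step values such as */15, day names, and the special rules that apply when both day fields are restricted) are deliberately left out.

```python
from datetime import datetime, timezone


def expand_field(field, low, high):
    """Expand one cron field ('*', '5', '1-5', '11-12,1-3') into a set of ints."""
    values = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(low, high + 1))
        elif "-" in part:
            start, end = part.split("-")
            values.update(range(int(start), int(end) + 1))
        else:
            values.add(int(part))
    return values


def cron_matches(expression, moment):
    """Return True if a UTC datetime matches a simple five-field expression."""
    minute, hour, day, month, weekday = expression.split()
    # cron counts Sunday as 0, whereas isoweekday() counts Sunday as 7
    cron_weekday = moment.isoweekday() % 7
    return (
        moment.minute in expand_field(minute, 0, 59)
        and moment.hour in expand_field(hour, 0, 23)
        and moment.day in expand_field(day, 1, 31)
        and moment.month in expand_field(month, 1, 12)
        and cron_weekday in expand_field(weekday, 0, 6)
    )


# Monday 13 February 2023 at 08:30 UTC falls on the weekday schedule
monday = datetime(2023, 2, 13, 8, 30, tzinfo=timezone.utc)
print(cron_matches("30 8 * * 1-5", monday))
```

For anything beyond a quick sanity check, Crontab.guru or the scheduler’s own documentation remains the better reference.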

GitHub Actions allows for multiple cron times to be specified, and it will run when any of the listed times are reached. And that’s a good thing, because the keen-eyed among you will have noticed the issue with the cron specification above: daylight saving time.

Suppose we actually want to run it every weekday at 09:30 Europe/London time, which is BST (UTC+01) between the last Sundays in March and October, and GMT (UTC+00) for the rest of the year. We can specify several cron expressions to cover different times across the year.

30 9 * 11-12,1-3 1-5 # 09:30 GMT from 1 Nov to 31 Mar.
30 8 25-31 3 1-5 # 09:30 BST from 25 Mar to 31 Mar.
30 8 * 4-10 1-5 # 09:30 BST from 1 Apr to 31 Oct.
30 9 25-31 10 1-5 # 09:30 GMT from 25 Oct to 31 Oct.

This strategy still isn’t perfect: for the last weeks in March and October, we essentially run the automated workflow twice, separated by an hour, because a fixed cron expression can’t tell on which day daylight saving time changes.
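One way to reason about the overlapping weeks is to compute the change-over dates directly. The sketch below is a hypothetical helper rather than part of our workflow: it finds the last Sunday of March and October, assumes the UK clocks change at 01:00 UTC on those days, and reports whether a given UTC time corresponds to 09:30 in London. The function names are our own.

```python
from datetime import date, datetime, timedelta, timezone


def last_sunday(year, month):
    """Return the date of the last Sunday in the given month."""
    if month == 12:
        last_day = date(year, 12, 31)
    else:
        # Step back one day from the first day of the following month
        last_day = date(year, month + 1, 1) - timedelta(days=1)
    # weekday(): Monday is 0 and Sunday is 6
    days_since_sunday = (last_day.weekday() + 1) % 7
    return last_day - timedelta(days=days_since_sunday)


def is_bst(moment_utc):
    """Return True if a timezone-aware UTC datetime falls within BST."""
    start_day = last_sunday(moment_utc.year, 3)
    end_day = last_sunday(moment_utc.year, 10)
    # Assumption: the clocks change at 01:00 UTC on both dates
    start = datetime(start_day.year, 3, start_day.day, 1, tzinfo=timezone.utc)
    end = datetime(end_day.year, 10, end_day.day, 1, tzinfo=timezone.utc)
    return start <= moment_utc < end


def is_0930_london(moment_utc):
    """Return True if the UTC datetime is 09:30 on the London clock."""
    offset_hours = 1 if is_bst(moment_utc) else 0
    local = moment_utc + timedelta(hours=offset_hours)
    return (local.hour, local.minute) == (9, 30)


print(last_sunday(2023, 3), last_sunday(2023, 10))
```

A check like is_0930_london() could, in principle, sit at the start of a scheduled job so that only one of the two runs in the change-over weeks does any real work; whether that is worth the extra complexity depends on how costly a duplicate run is.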

To further complicate matters, despite our best efforts to ensure the job runs at 09:30 local time, when you’re using the shared resources of GitHub Actions, your job may have to wait in a queue for several minutes (or even hours) if it’s a particularly busy time for their servers. Got a mission-critical workflow that must run exactly on time? Then have the job performed by your own dedicated CI/CD runners.

How do I set up a workflow?

The method used will depend on what CI/CD runner you’ll be using. We’ll discuss a very basic workflow for an R user who has a Shiny app they want to automatically deploy to shinyapps.io using GitHub Actions.

We’re going to start by creating a new Shiny app in RStudio, which will come initialised with a Git repository and will use {renv}. The renv lockfile will already come supplied with the packages needed to run the default “Old Faithful Geyser” app. We’ll also make sure we’ve pushed the repository to GitHub.

Next we’ll need to generate an access token from shinyapps.io, which will allow GitHub Actions access to our account for the purposes of uploading the shiny apps.

Having logged into shinyapps.io, go to the Account → Tokens section of the menu. Click the button to “Add token”, and make a note of the Token and Secret values. For security reasons, the Secret will be hidden until you reveal it.

Now in GitHub, go to the repository’s settings and navigate to the Secrets → Actions menu. Create a new repository secret for each of the name, token and secret values taken from shinyapps.io.

When you’re done, you should have three secrets which you’ve named for use in GitHub Actions. The workflow below expects them to be called SHINYAPPS_NAME, SHINYAPPS_TOKEN and SHINYAPPS_SECRET.

We’ll use an example template from r-lib actions which provides a ready-made GitHub Actions workflow. This will perform a number of steps: creating an Ubuntu instance; pulling the latest version of your code from the main branch on GitHub; installing and preparing R; installing package dependencies from the renv lockfile; and then performing the necessary steps to deploy the application to shinyapps.io. We just need to edit a few lines specifying the APPNAME and SERVER, and store the file in a new directory, .github/workflows/, in the GitHub repository’s root directory.

# .github/workflows/shiny-deploy.yaml
on:
  push:
    branches: [main, master]

name: shiny-deploy

jobs:
  shiny-deploy:
    runs-on: ubuntu-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v3

      - uses: r-lib/actions/setup-pandoc@v2

      - uses: r-lib/actions/setup-r@v2
        with:
          use-public-rspm: true

      - uses: r-lib/actions/setup-renv@v2

      - name: Install rsconnect
        run: install.packages("rsconnect")
        shell: Rscript {0}

      - name: Authorize and deploy app
        env:
          # Provide your app name and deployment server below
          APPNAME: github-deployed-app
          SERVER: shinyapps.io
        run: |
          rsconnect::setAccountInfo("${{ secrets.SHINYAPPS_NAME }}",
                                    "${{ secrets.SHINYAPPS_TOKEN }}",
                                    "${{ secrets.SHINYAPPS_SECRET }}")
          rsconnect::deployApp(appName = "${{ env.APPNAME }}",
                               account = "${{ secrets.SHINYAPPS_NAME }}",
                               server = "${{ env.SERVER }}")
        shell: Rscript {0}

When we commit the new file and push the change to the default branch, GitHub will automatically run the workflow on their servers for us. We can see progress on the “Actions” page of the repository, where it will display whether a pipeline is currently running, or has finished with a pass or fail status. Details for a failing pipeline can be viewed by clicking on the failed pipeline and viewing the output generated during that workflow.

When the pipeline has succeeded, we can view the newly deployed app on shinyapps.io. The app’s deployment address will be of the format https://[USERNAME].shinyapps.io/[APPNAME], where [USERNAME] and [APPNAME] are replaced with the values used in the deployment .yaml file.

What’s the net result?

Creating the automated processes and workflows to manage the WHO/Europe COVID-19 Vaccine Programme Monitor required an investment in time and money. But those short-term costs have generated long-term savings in the maintenance and time required to manage the data processing and the hosting of the dashboard.

It’s important to note that not everything is done automatically for us. As is the way with real world data, there are always going to be a few data quality anomalies that mean members of WHO/Europe will prepare a small amount of the data themselves as part of the overall workflow. This is not necessarily a bad thing; there are many instances where fully automated systems have produced ludicrous results when left to operate unsupervised, so maintaining a human touch can help keep things in check. But with 95% of the work being handled automatically, members from both WHO/Europe and Jumping Rivers are free to focus on other more important matters.

For the last few months, the app has mostly looked after itself in a reliable way. And for an automated process, there can be no higher praise.

For updates and revisions to this article, see the original post

Should I learn Stan?

Thu, 09 Feb 2023 23:59:00 +0000

A little bit about you

Let’s assume you’re familiar with Bayesian statistics; you know what I mean when I say prior, likelihood and posterior. Recall that an MCMC scheme constructs a Markov chain as a method to sample from the posterior density.

You may have used a probabilistic programming language (PPL) in the past, such as BUGS, to perform Bayesian inference. You’ve heard about Stan and want to learn a little more. Or maybe you’re about to step into the Bayesian paradigm and don’t know where to start. You want to know whether you should make the switch from JAGS to Stan, or you’ve used neither JAGS nor Stan and want to know which will suit you best. This post will focus solely on the differences between JAGS and Stan as I have experience with both of them, but there are many more PPLs out there. For example, I have never used Bean Machine, but of all the PPLs, it certainly takes the crown for best name.

Although Stan is a PPL, JAGS technically isn’t a programming language (more on this later). We will use the term “Bayesian modelling software” to talk about them both.

Why use Bayesian modelling software?

When we do a (fully) Bayesian analysis we essentially have two ways to estimate the model parameters. If you have too much spare time on your hands, option A (write a bespoke sampling scheme) might appeal to you. If you have other things to do, and want your inferences to be reliable, then I’d recommend option B: construct your model with purpose-built software.

The advantages of using Bayesian modelling software over hand-coding a bespoke sampler are similar to the advantages of using a package or library over hand-coding any other model. There’s no need to reinvent the wheel when somebody else has done all the hard work.

Differences at a glance

Stan is a free, open source PPL based on C++. It was developed to allow us to conduct Bayesian inference without the need to write bespoke sampling algorithms. Stan is named after Stanislaw Ulam, who helped develop the first Monte Carlo methods in the 1940s. Andrew Gelman, one of the lead Stan developers, thinks that in hindsight, Arianna would have been a better name than Stan, as it was Arianna Rosenbluth who programmed the first MCMC algorithm. A basic Stan program for linear regression looks like this:

// A linear regression in Stan
data {
  int N; // sample size
  vector[N] y; // response variable
  vector[N] x; // predictor variable
}
parameters {
  real alpha; // intercept
  real beta; // slope
  real<lower=0> tau; // precision
}
model {
  // likelihood
  y ~ normal(alpha + beta * x, 1 / sqrt(tau));
  // prior
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  tau ~ gamma(2, 2);
}

Stan might feel intimidating if you’ve never used a statically typed language (such as C++ or Java) before. Statically typed means we must declare the type of all variables in Stan. For example, our sample size, N, is of type int: it is an integer. If we try to set N = 12.5 the Stan program will not run! R programmers like myself often take types for granted, especially numerical types.

Similarly, JAGS is free, written in C++, and allows us to perform Bayesian computation without knowing too much about MCMC schemes. JAGS is an acronym for “Just Another Gibbs Sampler”; we’ll expand on this a bit later. A simple linear regression in JAGS might look like:

## A linear regression in JAGS
model {
  # likelihood
  for (i in 1:N) {
    y[i] ~ dnorm(alpha + beta * x[i], tau)
  }
  # prior
  alpha ~ dnorm(0, 1) # intercept
  beta ~ dnorm(0, 1) # slope
  tau ~ dgamma(2, 2) # residual precision
}

One difference between the two pieces of software is that a Stan program is broken into “blocks”, which allow the user to tell Stan what all the different variables in our code represent. There are more optional blocks than the three shown here. Conversely, the JAGS model is usually just one block. JAGS will work out for itself which of the included variables are known (data) and which are unknown (parameters) based on what data is passed to the JAGS program.

Another difference in the model specification is vectorisation. Stan allows (and encourages) you to vectorise your code. In Stan, we wrote y ~ normal(alpha + beta * x, 1 / sqrt(tau)). This is shorthand for a for loop over i containing y[i] ~ normal(alpha + beta * x[i], 1 / sqrt(tau)). Vectorising gives us cleaner looking code and can bring computational advantages. Conversely, JAGS code is much more difficult to vectorise, thus we must rely on slow for loops more often.

If you’ve used R a lot, the JAGS code might evoke some kind of déjà vu. JAGS code is supposed to look a bit like R code. Distributions are specified by d*(), the types of a variable are interpreted (JAGS figures out if things are real, integer, etc.) and we use # to write comments. With JAGS, the normal distribution is parameterised by the precision rather than the standard deviation used by R’s dnorm(), but otherwise, if you have a basic understanding of R, you will be pretty good at guessing what JAGS code does.
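The two parameterisations describe the same density, which is why the Stan program uses 1 / sqrt(tau) where the JAGS model passes tau directly. As a quick check, the Python sketch below writes the normal log-density both ways and sums it over the data in an explicit loop, mirroring the JAGS for loop. The function names are our own and the code is purely illustrative; neither Stan nor JAGS evaluates the likelihood this way internally.

```python
import math


def normal_logpdf_sd(y, mu, sigma):
    """Normal log-density parameterised by the standard deviation (as in Stan)."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)


def normal_logpdf_precision(y, mu, tau):
    """Normal log-density parameterised by the precision tau = 1 / sigma^2 (as in JAGS)."""
    return 0.5 * math.log(tau / (2 * math.pi)) - 0.5 * tau * (y - mu) ** 2


def regression_loglik(y, x, alpha, beta, tau):
    """Log-likelihood of the linear regression, looping over observations."""
    total = 0.0
    for y_i, x_i in zip(y, x):
        total += normal_logpdf_precision(y_i, alpha + beta * x_i, tau)
    return total


tau = 4.0
sigma = 1 / math.sqrt(tau)
# Both parameterisations give the same value for the same observation
print(normal_logpdf_sd(1.2, 1.0, sigma), normal_logpdf_precision(1.2, 1.0, tau))
```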

Differences in user experience

Running your models

JAGS and Stan can be run on their own, via the command line, but we will likely be pre-processing our data in a more general language like R or Python. An interface between our go-to language and our Bayesian modelling software allows our “main” language to run Stan or JAGS code. For R users, {rstan} provides this functionality for Stan, and {rjags} provides this for JAGS. There are similar interfaces for other languages (e.g. for Python, use PyStan and PyJAGS). These interfaces have a similar feel.

Writing code

A big part of coding up a Bayesian model in Bayesian modelling software is, well, coding up the model.

One thing I really like about Stan is the functions block. Good coding practice tells us we should put commonly used blocks of code into a function. This could be handy if we want to fit a non-linear regression model. In Stan, if we wanted to use the expression $\alpha + e^{\beta x}$ many times we could define this as a function:

functions {
  real non_linear_mean(real alpha, real beta, real x) {
    return alpha + exp(beta * x);
  }
}

This function can be used just like any inbuilt Stan function. JAGS does not have friendly support for user-defined functions or distributions; essentially you need to write your own JAGS module in C++.

Another thing I like about Stan is that syntax highlighting is supported in many popular editors. RStudio has out-of-the-box support for Stan, whilst other popular tools such as Vim and Jupyter have Stan plug-ins. This is because Stan is a language. As far as I can tell, the only editor that supports JAGS syntax is Emacs (and that’s not even out of the box). The lack of support is probably because JAGS is a program and not a language. Personally, I’d find a full data science workflow in Emacs less than ideal. For this post, I used R syntax highlighting on the JAGS code. However, we normally write JAGS code within a string, so the entire model would be highlighted as a string. Stan syntax highlighting is supported by more user-friendly environments and is even supported by Quarto, which is handy if you’re teaching Bayesian modelling!

Getting help

Reaching out online to get help is a huge part of the user experience. The Stan Forums are a hive of activity with questions regularly being answered by the Stan development team themselves. As far as I can tell, the JAGS community does not seem to be as active. Stan even has its own conference!

Differences under the hood

This section is a little technical. The TL;DR is: JAGS has a toolbox of relatively simple sampling schemes; under special circumstances some of these schemes are very effective. Stan uses a behemoth of a sampling scheme called Hamiltonian Monte Carlo (HMC). This is a complex sampling scheme but can be very effective for complex models.

Unpacking the technical stuff

JAGS looks at the Bayesian model you have provided and tailors the type of sampling schemes used to maximise performance. When it can be used, JAGS will use a Gibbs sampler. Gibbs is a computationally cheap sampling scheme but can only be used for a small set of likelihood-prior combinations. When JAGS can’t use simple, tractable sampling methods it uses more general purpose, but often less efficient methods, such as slice sampling.
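To give a flavour of why Gibbs sampling is cheap when it applies, here is a minimal Python sketch for a semi-conjugate model: normally distributed data with unknown mean mu and precision tau, using the same normal(0, 1) and gamma(2, 2) priors as the regression example. Both full conditionals are standard distributions, so each iteration is just two random draws. This is our own illustration (the function name and defaults are assumptions), not a description of how JAGS is implemented.

```python
import random


def gibbs_normal(y, n_iter=2000, m0=0.0, p0=1.0, a=2.0, b=2.0, seed=1):
    """Gibbs sampler for y_i ~ Normal(mu, 1/tau) with semi-conjugate priors.

    Priors: mu ~ Normal(m0, precision p0) and tau ~ Gamma(shape a, rate b).
    Returns a list of (mu, tau) draws.
    """
    rng = random.Random(seed)
    n = len(y)
    sum_y = sum(y)
    mu, tau = 0.0, 1.0  # arbitrary starting values
    draws = []
    for _ in range(n_iter):
        # mu | tau, y is normal with precision p0 + n * tau
        post_precision = p0 + n * tau
        post_mean = (p0 * m0 + tau * sum_y) / post_precision
        mu = rng.gauss(post_mean, post_precision**-0.5)
        # tau | mu, y is gamma with shape a + n/2 and rate b + SS/2
        rate = b + 0.5 * sum((y_i - mu) ** 2 for y_i in y)
        # gammavariate() takes a scale parameter, so pass 1 / rate
        tau = rng.gammavariate(a + n / 2, 1 / rate)
        draws.append((mu, tau))
    return draws


# Simulated data with true mean 3 and standard deviation 0.5 (precision 4)
data_rng = random.Random(42)
y = [data_rng.gauss(3.0, 0.5) for _ in range(200)]
draws = gibbs_normal(y)[500:]  # discard the first 500 draws as burn-in
print(sum(mu for mu, _ in draws) / len(draws))
```

No gradients or tuning parameters are involved, which is the appeal; the limitation is that the full conditionals must be available in closed form.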

The HMC algorithms within Stan are inspired by statistical physics. By default Stan uses the No-U-Turn Sampler (NUTS), a variant of HMC. NUTS utilises Hamiltonian dynamics, which relies on the gradient of the log posterior, $\nabla \log \pi(\theta \mid x)$. This clever mathematics will (hopefully!) produce a statistically efficient MCMC scheme. The downside of employing complex mathematics is that each iteration of the MCMC scheme can be computationally expensive. For more on HMC see Michael Betancourt’s introduction to HMC. The power of HMC is that, because the scheme is statistically efficient, we may not need to run it for as many iterations to obtain satisfactory results, thus the overall run time may be less.
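For readers who like to see the moving parts, the Python sketch below implements plain HMC with a fixed number of leapfrog steps for a one-dimensional standard normal target. It is a teaching toy under several assumptions: the target, step size and trajectory length are chosen by hand, and there is none of the adaptation or tree-building that NUTS performs inside Stan. What it does show is where the gradient of the log posterior enters each iteration.

```python
import math
import random


def log_post(theta):
    """Log-density of the toy target, a standard normal (up to a constant)."""
    return -0.5 * theta**2


def grad_log_post(theta):
    """Gradient of the log-density with respect to theta."""
    return -theta


def hmc_step(theta, rng, step_size=0.2, n_leapfrog=10):
    """One HMC transition using the leapfrog integrator."""
    momentum = rng.gauss(0.0, 1.0)
    current_h = -log_post(theta) + 0.5 * momentum**2
    new_theta, p = theta, momentum
    # Leapfrog: half step for momentum, full steps alternating, final half step
    p += 0.5 * step_size * grad_log_post(new_theta)
    for step in range(n_leapfrog):
        new_theta += step_size * p
        if step < n_leapfrog - 1:
            p += step_size * grad_log_post(new_theta)
    p += 0.5 * step_size * grad_log_post(new_theta)
    proposed_h = -log_post(new_theta) + 0.5 * p**2
    # Metropolis correction for the numerical error of the integrator
    accept_prob = math.exp(min(0.0, current_h - proposed_h))
    return new_theta if rng.random() < accept_prob else theta


def run_hmc(n_iter=5000, seed=1):
    """Run the sampler from theta = 0 and return all draws."""
    rng = random.Random(seed)
    theta, draws = 0.0, []
    for _ in range(n_iter):
        theta = hmc_step(theta, rng)
        draws.append(theta)
    return draws


draws = run_hmc()
print(sum(draws) / len(draws))
```

Each iteration calls grad_log_post() ten times, which is the extra per-iteration cost mentioned above; the pay-off is that successive draws are far less correlated than those from a random-walk proposal.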

One sampling scheme to rule them all?

Suppose we wanted to fit a model in Stan where one or more of the unknowns is discrete. This might be because we have some missing count data, for example. In this instance, $\nabla \log \pi(\theta \mid x)$ will not exist with respect to the discrete unknowns, and therefore HMC, and thus Stan, cannot sample them directly. Algorithms like slice sampling can work in this situation, so JAGS would be an appropriate tool. I also mentioned that JAGS can be fast for simple models. If your Bayesian model exhibits (semi-)conjugacy, JAGS will probably be more efficient as Gibbs sampling can be used.

If your Bayesian model does not exhibit any conjugacy, Stan will probably be a better option. There are also other scenarios where Stan will probably be better than JAGS. The first is when the model is complex to write down; if your model is complex enough to warrant user-defined functions, I’d use Stan.

Stan also has other functionality that JAGS does not. For example, Stan has an extensive math library. This allows us to solve algebraic equations and differential equations. You can use Stan to solve these types of equations as standalone problems. If we have observed some data about a physical system described by differential equations, we can use Stan’s differential equation solvers in a Bayesian framework to conduct uncertainty quantification about the fitted parameters of a differential equation and propagate posterior beliefs into predictions. Pretty cool, right?

One sampling scheme to rule them all?

I was taught JAGS as an undergraduate student but taught myself Stan as a postgraduate researcher. I think for that reason, it’s not entirely fair to compare my learning experiences, but I did find self-teaching Stan to be harder than learning JAGS from a professor. Prior to learning Stan, I had never worked with a statically typed language, which definitely took some getting used to. However, I was keen to learn Stan because my complex models were taking a long time to run in JAGS. I found that, after a lot of teething problems (mostly forgetting to end lines with ;), my Stan implementations of models were much faster than the JAGS equivalents.

So, which sampling algorithm is best? As with everything in statistics, it depends. Jorgen Bolstad’s blog post compared the efficiency of JAGS and Stan for a handful of different Bayesian models. The broad summary is that, from an efficiency perspective, models with conjugacy are better suited to JAGS, whereas non-conjugate models are better suited to Stan. I think from a programming and user-experience perspective, Stan really wins for complex models.

From a programming perspective, after getting over the learning curve, Stan is a better environment for developing Bayesian models. For me, this is because Stan has the flexibility for user-defined functions (which can allow us to specify bespoke distributions as well!).

I can’t tell you what’s going to be best for your particular circumstances, but as a general rule I’d say for simpler models, JAGS is probably better and for complex models, Stan is probably better.

If you do think Stan is the right tool for you, then why not consider attending one of our Stan courses? Our courses are a great hands-on and interactive way of getting up-and-running and fitting models with Stan!


February Training Update

Tue, 07 Feb 2023 23:59:00 +0000

We have a great selection of online public training courses coming up over the next two months, including a variety of R courses, as well as some more stats-heavy courses on Bayesian Inference! Read on for a taste of what’s in store, or head over to our training page for full details and to book!

Bayesian Inference

Our upcoming courses on Bayesian inference take you from an introduction through to implementing models using Stan with R.

Introduction to Bayesian Inference (https://www.jumpingrivers.com/training/course/introduction-bayesian-inference-rstan-monte-carlo/)

Course level: Foundation

Next course date: 20th February 2023

The capturing and quantification of uncertainty is a very important aspect of model-fitting and parameter inference. Bayesian inference represents a fully-probabilistic approach to parameter inference, allowing a practitioner to quantify their uncertainties through probability densities. However, fitting models in a Bayesian framework can be an involved and complicated affair, often necessitating the use of Markov chain Monte Carlo (MCMC) algorithms and their programmatic implementation.

Introduction to Bayesian Inference using RStan

Course level: Intermediate

Next course date: 20th-23rd February 2023

The course will teach participants how to interface with Stan through R!

R

If you already have the basics of R down, and want to get a bit more adventurous with it, take a look at some of our more advanced R courses for plotting and data wrangling. We also offer a course on R best practices, so you can make sure your code stands up to the tests of time.

Data visualisation with ggplot2

Course level: Intermediate

Next course date: 6th-7th March 2023

R Best Practices

Course level: Intermediate

Next course date: 20th-21st March 2023

Data Wrangling in the Tidyverse

Course level: Foundation

Next course date: 27th-28th March 2023


Quarto for the Python user

Thu, 02 Feb 2023 23:59:00 +0000

As data scientists we often need to communicate conclusions drawn from data. Additionally, as more data is collected, our reports invariably need updating. This is where automated reporting tools such as Quarto come in! In this blog post we will look at how Quarto allows us to weave together text and Python code to generate reproducible reports.

What is Quarto?

Quarto is a technical publishing system built on Pandoc. By combining code with plain text, it allows you to create reports that can easily be updated when the data changes. For example, imagine you have to report on the profits of a company each month. With Quarto, you can create your report with any key figures and charts, then with just the click of a button update it each month with new data. You can also create content in a variety of formats, from articles and scientific papers to websites and presentations, in HTML, PDF, MS Word and more.

How does it work?

.qmd: For Quarto we work in a .qmd file. This will contain a mix of markdown and code chunks.
Jupyter: When the file is rendered with Quarto, the code chunks are interpreted by Jupyter. You can also select which Jupyter kernel you want to use.
.md: The code and output, as well as the rest of the content, is then converted to plain markdown.
Pandoc: The markdown file is converted to a variety of other formats using Pandoc.
.html/.pdf/.docx: A .qmd file can be rendered in multiple different formats without having to change any content.

Where do I run Quarto?

There are a couple of IDEs where you can run Quarto with Python. For this post we will be focusing on the Quarto extension for VS Code, which offers an extensive variety of tools for editing your documents. As we will show in an upcoming post, you can also render Quarto documents directly from Jupyter notebooks.

First things first you will need to install Quarto. From VS Code, you can then find the extension by clicking on “Settings”, then “Extensions”, then typing “quarto” into the search bar. Select the “Quarto” extension, click “Install” and after a few seconds you’ll be good to go!

A Quarto document is essentially a text file with a .qmd extension. This can be created in VS Code by clicking on “File”, then “New File…”, then “Quarto Document (qmd)”. Clicking the “Render” button (or using the keyboard shortcut Ctrl+Shift+K) will open a side window with a live preview that will update as you edit the document:

You can also run Quarto via the terminal:

To preview your document as you edit it:

quarto preview <your-doc>.qmd

To convert the document from .qmd into the desired output format:

quarto render <your-doc>.qmd
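If you would rather trigger a render from a Python script (for example as part of a scheduled job), the command can be assembled and run with the standard library. The sketch below is one possible wrapper, with hypothetical function names and a hypothetical report.qmd file; it relies only on the quarto render command shown above and its --to option for choosing the output format.

```python
import shutil
import subprocess


def render_command(document, output_format=None):
    """Build the argument list for `quarto render`."""
    command = ["quarto", "render", document]
    if output_format is not None:
        # --to selects the output format, e.g. html, pdf or docx
        command += ["--to", output_format]
    return command


def render(document, output_format=None):
    """Run `quarto render`, raising an error if the render fails."""
    if shutil.which("quarto") is None:
        raise RuntimeError("The quarto command-line tool was not found on PATH")
    return subprocess.run(render_command(document, output_format), check=True)


# report.qmd is a placeholder file name
print(render_command("report.qmd", "html"))
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems if the file name contains spaces.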

Preparing a document

Let’s use Quarto to write an html web report about penguins! 🐧

If you wish to run the code yourself you will need the pandas, plotly and statsmodels packages. These can be installed with:

python3 -m pip install pandas plotly statsmodels

1) YAML header

To start, we’ll need a YAML header.

YAML is a human readable language often used to write configuration files. In Quarto, it’s used to configure the settings for the presentation and formatting of the documents.

The header is fenced above and below by three hyphens (---). The example below includes some common settings:

---
title: "Reporting on the bill length of penguins"
author: "Myles Mitchell & Parisa Gregg"
date: "14 December 2022"
format: html
jupyter: python3
---

The first three should be self-explanatory!
format sets the preferred output format for your document (html, pdf, docx, …)
jupyter sets the kernel for executing embedded Python code

You don’t have to specify a Jupyter kernel if the first code chunk is in Python; in that case, Quarto will know to use Jupyter (although you may still wish to select a specific kernel).

2) Markdown text

The main body of text is written in markdown syntax. If you haven’t used markdown before, it’s an easy-to-learn language that allows you to combine plain text and blocks of code.

We’ll say a bit more about Python code chunks below, but for a quick guide to markdown basics, the Quarto documentation is a great place to start!

Here’s an opening passage for our report, written in markdown:

## Abstract

Prepare yourself for a life-changing article about penguins...

## Introduction

[Penguins](https://en.wikipedia.org/wiki/Penguin) are a family
(**Spheniscidae**) of aquatic flightless
[birds](https://en.wikipedia.org/wiki/Bird) that live primarily in the
[Southern Hemisphere](https://en.wikipedia.org/wiki/Southern_Hemisphere).
Their diet consists of:

- Krill
- Fish
- Squid
- More fish

There are 18 species of penguin, including:

1. Macaroni penguin (*Eudyptes chrysolophus*)
2. Chinstrap penguin (*Pygoscelis antarcticus*)
3. Gentoo penguin (*Pygoscelis papua*)

We’ve included hyperlinks, bullet points, numbered lists, bold and italic font using the asterisk symbol, and subheadings using the hash symbol.

The screenshot below shows the rendered output so far:

3) Code chunks

We can use code chunks to insert code into the document. These are fenced off by three backticks (```). To specify the language we can include {python} after the first set of backticks.

The Python code is not just for show! It can also be used to dynamically generate content including figures and tables. Let’s use some Python code to include a plot in our document. We’ll start by loading in some data using pandas:

```{python}
import pandas as pd

data = pd.read_csv(
 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv'
)
data.head()
```

The first five rows of the DataFrame will be displayed by data.head() in the rendered document, along with the code used to load in the data:

Now let’s make a plot. Because we’re creating a web document, let’s generate an interactive figure using the plotly library:

```{python}
#| echo: false
#| fig-cap: "Bill length as a function of body mass"
#| fig-width: 8
import plotly.express as px

px.scatter(
 data,
 x="body_mass_g",
 y="bill_length_mm",
 color="species",
 facet_col="year",
 trendline="ols",
)
```

YAML code chunk options can be provided at the top of a code block, and are prefixed with #| followed by a space. Here we have used three options:

Setting echo to false will hide the code chunk in the rendered document
A figure caption will be added by fig-cap
The figure width is controlled with fig-width

Some other common options include:

eval: if false, the code will not be evaluated
warning: if false, warning messages will be hidden
error: if true, the code is allowed to error and the error message will be displayed in the output

4) Inline-ish code

To insert code inline, just use a pair of backticks: `data = pd.read_csv(penguins_url)`. Additionally, if you want the code to have Python formatting you can use `data = pd.read_csv(penguins_url)`{.python}.

You may also wish to execute code inline. Unfortunately, there isn’t a tidy way to add Python-executable code inline as you can with the R language. However, there does exist a workaround where you can create markdown code within a Python codeblock and include values that require Python-execution in the created markdown.

Let’s demonstrate this by adding a sentence stating the average bill length:

```{python}
#| echo: false
from IPython.display import display, Markdown

avg_length = data['bill_length_mm'].mean()
display(Markdown(
f"""
According to our data, the average bill length is
{round(avg_length, 1)} mm.
"""
))
```

We have made use of an f-string to insert a Python variable (rounded to one decimal place) in the sentence. The Markdown() function is used to convert the string into markdown, and this is displayed in the rendered document using display(). If our data changes, we just need to re-render the document and this text will be updated automatically!
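One optional refinement is to build the sentence in a small function, so the wording can be checked without rendering the whole document. The sketch below is a hypothetical variant of the chunk above: it uses the standard library instead of pandas, drops missing values by hand (pandas skips them by default when computing a mean), and uses the :.1f format specifier in place of round().

```python
from statistics import mean


def bill_length_sentence(lengths):
    """Return a markdown sentence reporting the mean bill length in mm."""
    # Drop missing values: None, or NaN (the only value not equal to itself)
    values = [v for v in lengths if v is not None and v == v]
    average = mean(values)
    # :.1f always formats the number with exactly one decimal place
    return f"According to our data, the average bill length is {average:.1f} mm."


print(bill_length_sentence([39.1, 39.5, 40.3]))
```

Inside a Quarto chunk, the returned string could then be passed to display(Markdown(...)) exactly as before.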

The screenshot below shows this sentence (along with our plot) in the rendered document:

Wrapping up

Let’s put all of this together and apply some finishing touches:

---
title: "Reporting on the bill length of penguins"
author: "Myles Mitchell & Parisa Gregg"
date: "14 December 2022"
format: html
jupyter: python3
---

## Abstract

Prepare yourself for a life-changing article about penguins...

## Introduction

[Penguins](https://en.wikipedia.org/wiki/Penguin) are a family
(**Spheniscidae**) of aquatic flightless
[birds](https://en.wikipedia.org/wiki/Bird) that live primarily in the
[Southern Hemisphere](https://en.wikipedia.org/wiki/Southern_Hemisphere).
Their diet consists of:

- Krill
- Fish
- Squid
- More fish

There are 18 species of penguin, including:

1. Macaroni penguin (*Eudyptes chrysolophus*)
2. Chinstrap penguin (*Pygoscelis antarcticus*)
3. Gentoo penguin (*Pygoscelis papua*)

## Methods

To determine whether a higher body mass implies a longer bill, we loaded a
penguins dataset using pandas:

```{python}
import pandas as pd

data = pd.read_csv(
 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv'
)
data.head()
```

## Results

The figure below shows the bill length plotted as a function of the body mass
for three species across a 3-year period.

```{python}
#| echo: false
#| fig-cap: "Bill length as a function of body mass"
#| fig-width: 8
import plotly.express as px

px.scatter(
 data,
 x="body_mass_g",
 y="bill_length_mm",
 color="species",
 facet_col="year",
 trendline="ols",
)
```

```{python}
#| echo: false
from IPython.display import display, Markdown

avg_length = data['bill_length_mm'].mean()
display(Markdown(
f"""
According to our data, the average bill length is
{round(avg_length, 1)} mm.
"""
))
```

Try copying this into your Quarto document or alternatively you can download the full code here. Upon rendering, an html document like the one at this webpage should be created.

Hopefully you can now appreciate the beauty of Quarto! By having the code used to generate the content embedded in the document, our report is fully automated; if the data changes, we just need to click render to update the content. This also makes it easy for a colleague to reproduce the report themselves. And because Quarto uses plain text files, it’s also great for version control with Git!