R Package Quality: Validation and beyond!

As is often the case, it’s pretty easy to talk about “good” R packages. We can even wave our hands and talk about packages following “software standards” or “best practices”. But what does that mean?
Most of us would agree that packages like {Rcpp} or {dplyr} are solid.
At the other end of the spectrum, we could point to outdated, poorly tested or unmaintained packages as “risky”.
The reality, however, is considerably more nuanced: the vast majority of R packages sit somewhere along the continuum between these two extremes. They may excel in certain aspects whilst falling short in others, or they might be perfectly adequate for specific use cases whilst being unsuitable for mission-critical applications.
The primary objective of this post is to help organisations and individual practitioners develop a clearer, more systematic understanding of the packages they depend on. It’s important to acknowledge upfront that any scoring system has limitations: some genuinely high-quality packages might receive unexpectedly low scores due to specific circumstances, whilst some packages with significant underlying issues might score well on surface-level metrics. This doesn’t diminish the considerable value of establishing a consistent, structured framework for package assessment.
In developing Litmus, our solution for R package assessment and validation, we’ve had to wrestle with these concepts in great detail. We have come up with a framework that we believe addresses the challenges presented by package validation. In the coming series of Litmus blog posts, we will be examining in detail the choices we made to balance the need for both robustness and flexibility in R package quality assessment.
Before examining the specifics of how we evaluate and score packages, it’s important to understand the principles that underpin our methodology. In this post, we dig into those core principles.
Guideline 1: Scores are not static
At first glance, this principle might appear counterintuitive, but it reflects a fundamental reality: the standards we apply to R packages today cannot reasonably be identical to those we might have employed in 2015, nor should they remain unchanged looking forward to 2030.
Consider the obvious evolution in scale: package download numbers have increased dramatically over the past decade, reflecting both the growth of the R community and the maturation of package distribution infrastructure. More subtly but equally importantly, the general tooling ecosystem has undergone dramatic improvements. Modern development practices now routinely include automated testing via GitHub Actions, comprehensive code coverage analysis, automated dependency checking, and sophisticated static analysis tools. Packages developed today have access to these resources in ways that simply weren’t feasible or standard practice a decade ago. And since the number of downloads is used as a measure of package popularity, what counts as a “high” or “low” download count needs to be adjusted periodically.
Furthermore, our scoring approach is explicitly tied to specific package versions. When a maintainer releases a new version of a package, potentially addressing security vulnerabilities, improving documentation, adding new features, or enhancing test coverage, the previous version often becomes a less optimal choice despite having been perfectly adequate when it was current.
Solution: We implement an annual comprehensive audit of our scoring mechanisms. This yearly review process serves multiple functions: updating the underlying data used to generate scores where relevant (such as adjusting download thresholds to reflect ecosystem growth), introducing new scoring criteria as best practices evolve, and retiring metrics that may have become less relevant or discriminatory.
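To make this concrete, here is a minimal sketch of how a download threshold could be recalibrated against the ecosystem as a whole during an annual audit. The function, the quantile choices and the simulated data are assumptions for illustration only, not Litmus’s actual implementation.

```r
# A minimal sketch of recalibrating a download-based threshold during an
# annual audit. `yearly_downloads` is assumed to be a numeric vector of
# last-12-month download counts, one per package (for example, gathered with
# the cranlogs package); the quantile choices are illustrative only.
recalibrate_download_thresholds <- function(yearly_downloads,
                                            probs = c(0.25, 0.90)) {
  # Thresholds track the ecosystem, so they grow as overall download volumes
  # grow rather than staying fixed at 2015-era values.
  stats::quantile(yearly_downloads, probs = probs, na.rm = TRUE)
}

# Example with simulated, right-skewed download counts
set.seed(1)
fake_downloads <- round(rlnorm(5000, meanlog = 7, sdlog = 2))
recalibrate_download_thresholds(fake_downloads)
```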
Guideline 2: Scores shouldn’t change often
While we acknowledge that scores are not static, they shouldn’t change often or dramatically. For example, it makes sense to audit our download scoring mechanism each year and adjust the criteria; this would change package scores, but only slightly.
Solution: We maintain disciplined annual audits of our scoring mechanisms, with changes implemented deliberately and with clear documentation of the rationale. Between these annual reviews, scoring criteria remain stable unless critical issues are identified.
Guideline 3: Cutoffs depend on use cases
In an ideal world, we would “hand analyse” all packages, spending time assessing each one individually. In practice, it makes sense to focus our attention on the borderline packages: those that are almost good enough, or only just good enough, to make the cut. However, what constitutes “borderline” varies dramatically depending on the intended application. A package being considered for use in a regulatory submission to the FDA faces entirely different quality requirements compared to one being used in an MSc Statistics project or an exploratory data analysis. The former context demands extensive validation, comprehensive documentation, and demonstrated stability, whilst the latter might reasonably accept some additional risk in exchange for cutting-edge functionality or convenience.
Solution: Rather than imposing universal “risky” package thresholds, we advocate for situation-dependent cutoffs that reflect the specific requirements and risk tolerance of different use cases. We provide guidance for establishing appropriate thresholds for common scenarios whilst recognising that organisations may need to customise these based on their specific regulatory, commercial, or academic contexts.
See our post on Risk Appetite for more on this.
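As a rough illustration of what situation-dependent cutoffs could look like, here is a small sketch; the contexts and threshold values are hypothetical, not recommendations.

```r
# A minimal sketch of use-case-dependent cutoffs. The contexts and threshold
# values below are hypothetical, not recommendations.
risk_cutoffs <- c(
  regulatory_submission = 0.80,  # low risk appetite, e.g. FDA submissions
  internal_production   = 0.60,
  exploratory_analysis  = 0.30   # more risk accepted for convenience
)

acceptable_for <- function(package_score, context = names(risk_cutoffs)) {
  context <- match.arg(context)
  package_score >= risk_cutoffs[[context]]
}

acceptable_for(0.65, "regulatory_submission")  # FALSE: below the 0.80 cutoff
acceptable_for(0.65, "exploratory_analysis")   # TRUE
```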
Guideline 4: “Good” packages may have serious issues
It’s crucial to recognise that even the most well-regarded packages can face problems that lie entirely outside their maintainers’ direct control. For example, a package might depend on a system library that subsequently reveals a security vulnerability, or one of its dependencies might become unmaintained. Alternatively, changes in the broader R ecosystem—such as modifications to base R or updates to critical dependencies—might create compatibility issues that haven’t yet been addressed. These scenarios highlight why a single numerical score, whilst valuable for initial triage, cannot capture the full complexity of package risk assessment. Some issues represent genuine “showstoppers” that require immediate attention regardless of a package’s overall score.
Solution: Whilst maintaining our commitment to clear, interpretable numerical scores for initial assessment, we supplement these with specific flags for “showstopper” issues that require immediate human review. These might include known security vulnerabilities, dependencies on risky packages, or compatibility issues with current R versions.
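The sketch below illustrates one way such flags could sit alongside a numerical score; the flag names and example inputs are hypothetical.

```r
# A minimal sketch of how "showstopper" flags might accompany a numerical
# score. The flag names and example inputs are hypothetical.
assess_package <- function(score, flags = character(0)) {
  showstoppers <- c("known_vulnerability",
                    "risky_dependency",
                    "incompatible_with_current_R")
  list(
    score = score,
    showstoppers = intersect(flags, showstoppers),
    # A high score never hides a showstopper: any flagged issue routes the
    # package to human review regardless of the number.
    needs_human_review = any(flags %in% showstoppers)
  )
}

# A well-scored package with a vulnerable dependency is still flagged
assess_package(score = 0.92, flags = "known_vulnerability")
```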
Guideline 5: Avoid cliff edges
Regardless of your statistical persuasion, we can all agree that a hard cut-off at “p = 0.05” is silly. The idea that “p = 0.05000001” is “not significant”, but “p = 0.04999999” can change the world, doesn’t really make sense. The same idea should apply to scores: where possible, the scoring mechanism should be smooth and continuous.
Solution: We employ continuous, smooth scoring functions wherever possible. For example, rather than awarding full points for packages with >80% test coverage and zero points for those with <80%, we use gradual scoring curves that reward improvements at all levels whilst still recognising meaningful distinctions in quality.
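For instance, a logistic curve gives a smooth alternative to a hard 80% coverage cutoff. This is a minimal sketch; the midpoint and steepness values are illustrative rather than the parameters Litmus actually uses.

```r
# A minimal sketch of a smooth scoring curve for test coverage, contrasted
# with a hard 80% cutoff. The logistic midpoint and steepness are
# illustrative, not Litmus's actual parameters.
coverage_score_hard <- function(coverage) {
  as.numeric(coverage >= 0.80)  # all-or-nothing at 80%
}

coverage_score_smooth <- function(coverage, midpoint = 0.80, steepness = 15) {
  1 / (1 + exp(-steepness * (coverage - midpoint)))  # logistic curve
}

coverage <- c(0.50, 0.79, 0.80, 0.81, 0.95)
coverage_score_hard(coverage)            # 0 0 1 1 1
round(coverage_score_smooth(coverage), 2)
# The smooth score rises gradually through the 80% region, so moving from
# 50% to 79% coverage is still rewarded.
```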
Guideline 6: Not all scores are created equal
A score based on whether or not a package has a maintainer should count for more than a score based on whether or not it has a website URL. In most cases the former is far more important than the latter, and a useful overall score needs to reflect that.
Solution: We create a scoring strategy that weights individual metrics sensibly within categories, and then weights those categories to reflect their relative importance. We will discuss this strategy in more detail in a later blog post, but here is the general idea (with a small sketch of the aggregation after the list below).
We think of package quality as having four attributes:
- Documentation (weight 15%): Assesses the quality and completeness of the package documentation. This is clearly subjective, as a package with complete documentation could still have “bad” or outdated documentation. Nevertheless, packages that lack examples in their help pages, vignettes or NEWS files receive lower scores.
- Code (weight 50%): Evaluates the quality and structure of the package code. Key components of this score include package dependencies (always a controversial topic), the number of exported objects, vulnerabilities, and test coverage.
- Maintenance (weight 20%): Reviews standard maintenance aspects of the package, including the frequency of updates, bug management, and the number of contributors.
- Popularity (weight 15%): Reviews the package’s popularity, including downloads over the last year and reverse dependencies. The idea is that these are strong indicators that the community has already placed trust in the package.
These numbers can of course be adjusted.
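Here is a minimal sketch of how the weighted aggregation could work; only the category weights above come from this post, and the example scores are hypothetical.

```r
# A minimal sketch of the weighted aggregation described above. Only the
# category weights come from this post; the example scores are hypothetical.
category_weights <- c(
  documentation = 0.15,
  code          = 0.50,
  maintenance   = 0.20,
  popularity    = 0.15
)

overall_score <- function(category_scores, weights = category_weights) {
  stopifnot(setequal(names(category_scores), names(weights)))
  # Each category score is assumed to lie between 0 and 1
  sum(weights[names(category_scores)] * category_scores)
}

# Hypothetical package: strong code and maintenance, weaker documentation
overall_score(c(documentation = 0.60, code = 0.90,
                maintenance = 0.80, popularity = 0.70))
#> 0.805
```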
Implementation Considerations and Future Development
This scoring framework represents an ongoing effort to bring greater systematisation and transparency to R package quality assessment. As the R ecosystem continues to evolve, we anticipate that both our methodology and our understanding of what constitutes package quality will require ongoing refinement.
We welcome feedback from the community about both the theoretical framework presented here and its practical implementation. Particular areas where community input would be valuable include the appropriate weightings for different quality attributes, the identification of additional metrics that might enhance assessment accuracy, and the development of context-specific guidance for different usage scenarios. Our commitment to annual methodology review ensures that this framework will adapt to reflect changes in best practices, tooling availability, and community standards whilst maintaining the stability and predictability that users require for practical decision-making.
Get in Touch
If you’re interested in learning more about R validation and how it can be used to unleash the power of open source in your organisation, contact us.
References
- Risk Appetite in R packages
- White paper from the Validation Hub on assessing R package accuracy
- Case Studies from various companies. Our approach builds on these ideas.
