class: center, middle, inverse, title-slide

# 20 years of CRAN
## The Porcelain Jubilee
### Colin Gillespie

---
class: center, middle, inverse

# [jumpingrivers.com/t/rss2017/](jumpingrivers.com/t/rss2017/)

---

# About

* Academic: Senior Statistics lecturer at Newcastle University
  * BSc/PhD @ Strathclyde - at the turn of the century
* Consultant: [Jumping Rivers](www.jumpingrivers.com)
  * Data science consultancy
  * Training: R, Stan, Scala
  * Shiny, package & algorithm development
* Twitter: [@csgillespie](https://twitter.com/csgillespie)

--

* Warning: the talk occasionally veers off topic - sorry

---
background-image: url(graphics/Rlogo_fade.png)

# A brief history of R

- 1993: Research project in Auckland, NZ
- 1995: R released as open-source software

--

- 1997:
  - R core group formed
  - April 1st: Mailing list started
  - April 23rd: CRAN is started (3 mirrors); now `\(\sim\)` 140 mirrors

--

* 2000: R 1.0.0 released
  - I started using R around this time
  - Looking back at my __`commands`__ file from my PhD:

```
Splus like package "R"
R
```

---

# (One of) my first R graphs (Oct 26th, 2000)

```r
a_scan("alpha+beta=10/3dSD_A+B=10.out", list(x=0,y=0,z=0))
z_matrix(a$z,ncol=101,byrow=T)
x <- seq(0, 50)
y <- seq(0, 10, by=0.1)
persp(x,y,z, theta=330, scale=0.8, phi=20,
  xlab = "Counts", ylab = "alpha1", zlab ="Probability",
  ticktype="detailed", col=terrain.colors(20))
```

* R-mailing list: [persp plot question..](http://tolstoy.newcastle.edu.au/R/help/00b/1428.html)
  - OK question, terrible title
* Question answered by Ross Ihaka!
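---

# The same plot in current R syntax

The `_` in the code above was an assignment operator in early versions of R. It has since been removed, so today `a_scan(...)` is read as a call to a function named `a_scan`. Below is a rough sketch of the same plot using `<-`. The original `.out` file isn't available here, so the surface `z` is a made-up Poisson stand-in (an assumption, not the PhD data), and `scale = 0.8` is dropped because `persp()` documents `scale` as a logical.

```r
# Axes as in the original code: 51 counts and 101 values of alpha1
x <- seq(0, 50)
y <- seq(0, 10, by = 0.1)

# Hypothetical surface standing in for the original data file;
# persp() needs z to be a length(x) by length(y) matrix
z <- outer(x, y, function(count, alpha1) dpois(count, lambda = 5 + 2 * alpha1))

# persp() invisibly returns the viewing transformation matrix
vt <- persp(x, y, z, theta = 330, phi = 20,
            xlab = "Counts", ylab = "alpha1", zlab = "Probability",
            ticktype = "detailed", col = terrain.colors(20))
```

Building `z` with `outer()` avoids the `scan()` and `matrix(..., byrow = TRUE)` steps, which only existed to reshape the values read from file.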
---
class: center

# (One of) my first R graphs (Oct 26th, 2000)

<img src="graphics/3d.png" width="60%" />

---
background-image: url(graphics/Rlogo_fade.png)

# A brief history of R

- 1993: Research project in Auckland, NZ
- 1995: R released as open-source software
- 1997:
  - R core group formed
  - April 1st: Mailing list started
  - April 23rd: CRAN is started (3 mirrors); now `\(\sim\)` 140 mirrors
- 2000: R 1.0.0 released
- 2003: R Foundation founded
- 2004: Version 2.0.0
- 2013: Version 3.0.0

---
class: center, middle

# What is CRAN?

---
class: center, middle

# Wikipedia to the rescue

---

## Cran may refer to:

* Calorie restriction with adequate nutrition, a dietary regimen
* C-RAN, a proposed architecture for future cellular telecommunication networks
* CRAN (R programming language), a package archive network for the R programming language
* __Cran (unit), a measurement of uncleaned herring__

---
class: center, middle

> __CRAN__ is a network of ftp and web servers around the world that store identical,
> up-to-date, versions of code and documentation for R

---
class: center, middle

## A collection of R packages

---

# CRAN Packages

![](index_files/figure-html/unnamed-chunk-3-1.svg)<!-- -->

---
<!-- https://unsplash.com/search/package?photo=JuFcQxgCXwA -->
background-image: url(graphics/packages_resize.jpg)
class: center, bottom, inverse

# What is an R package?

---

# An R package

* R code (mainly functions)
  - The `R/` directory
* Documentation
  - `man/`
* Data sets
  - `data/`
* C/Fortran/JavaScript code
  - `src/` for compiled code, `inst/` for other files such as JavaScript
* A list of dependencies
  - `NAMESPACE` file
  - `DESCRIPTION` file

---

# Listing dependencies

* R v2.14 introduced mandatory [namespaces](http://r-pkgs.had.co.nz/namespace.html)

--

* What are namespaces?
  - They provide _spaces_ for _names_
  - They give an object context
* When we ask for function `f()` - it now has a location

---

# Example: the filter() function

* In the __`stats`__ package, `filter()` applies linear filtering to a time series
  - __`stats`__ is loaded by default

--

* In __`dplyr`__, `filter()` manipulates data frames

--

* We can access the functions directly via `::`

```r
stats::filter
dplyr::filter
```

--

* When I call `filter()` which function do I access?
  - The last package attached!

---

# The search path

```r
search()
```

```
##  [1] ".GlobalEnv"        "package:dplyr"     "package:ggplot2"
##  [4] "package:stats"     "package:graphics"  "package:grDevices"
##  [7] "package:utils"     "package:datasets"  ".env"
## [10] "package:methods"   "Autoloads"         "package:base"
```

When I ask for the `filter()` function, R checks each entry on the search path in turn for a copy

- `.GlobalEnv`, then
- `package:dplyr` then
- `package:ggplot2` then ...

So in this case we would stop at __`dplyr`__

---

# To be fair...

```r
library("dplyr")
```

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
##
##     filter, lag
```

```
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
```

---

# In case you've ever wondered

```r
# sd(x) just calls `sqrt(var(x))`
sd
```

```
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
##     na.rm = na.rm))
## <bytecode: 0x42404f0>
## <environment: namespace:stats>
```

--

So what happens if we do

```r
# Redefine var
var = function(x) 1
```

--

Everything still works!

```r
# sd() looks in its own namespace first!
sd(runif(10))
```

```
## [1] 0.2915
```

---

# The `:::` operator

* R has no concept of "private" (unlike Java)
* But packages don't have to export everything
* The `:::` operator retrieves non-exported objects

```r
## Doesn't work
ggplot2::absoluteGrob
```

```r
ggplot2:::absoluteGrob
```

```
## function (grob, width = NULL, height = NULL, xmin = NULL, ymin = NULL,
##     vp = NULL)
## {
##     gTree(children = grob, width = width, height = height, xmin = xmin,
##         ymin = ymin, vp = vp, cl = "absoluteGrob")
## }
## <environment: namespace:ggplot2>
```

---

# Package dependencies

Dependencies are specified in the `DESCRIPTION` file

* `Depends`: adds the package to the search path
  - Strongly discouraged

--

* `Imports`: places the imported package in `<imports:packageName>`
  - Doesn't pollute the `search()` path; the "correct" way
  - `Imports` and `Depends` are installed if missing
  - One of the reasons R is so popular - _it just works_

--

* `Suggests`: packages not required for standard package workflow
  - Needed only for testing, or advanced plotting

--

* `Enhances`: packages listed here are “enhanced” by your package.
  - Very rarely used - 99% of packages have 0 "enhancements"
  - Record for most enhancing package - Philip Leifeld (Glasgow Uni)

---
class: center, inverse, middle

# Back to CRAN

---
background-image: url(graphics/gandalf.jpg)
class: center, inverse

# CRAN is the package gatekeeper

---
background-image: url(graphics/gandalf_fade.jpg)
class: inverse

# CRAN checks

* Documentation
  - Although you can't force _good_ documentation
  - Some packages make creative use of aliases

--

* Poorly written code, e.g. polluting the global environment
* Ensure examples work

--

* Installation
  - Linux, Windows, Mac, Solaris(?)
--

* Time taken to run tests :(
* Formatting of the DESCRIPTION file
* Details in the [R-devel version of the _R Internals_](https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Tools) manual

---

# CRAN

* Volunteers spend a significant amount of time communicating with package authors (remember over 10K packages!)
  - Solving problems
  - Many (the vast majority?) of package authors are __not__ programmers by training
* CRAN runs checks of R packages with each nightly build of R
  - This takes 90 days of compute time

---
class: center, inverse

# Getting a package on CRAN

> “The greatest victory is that which requires no battle.”
>
> ― __Sun Tzu__, _The Art of War_

---

# Use roxygen2

* In-source documentation
* Uses tags

```
#' @title A really good R function
#' @description The best R function. No really, it's just the best
#' @export
the_best_the_very_best = function() return("Trump")
```

* Takes (some of) the pain away from documentation

---
class: center, middle

# Use a test suite

---

# Use continuous integration

- [travis-ci](http://travis-ci.org/) and [appveyor](https://www.appveyor.com/)
- When I change my package, travis-ci
  - builds the package under R-past, R-release and R-devel
  - runs the test suite
  - builds the vignettes
  - runs `R CMD check --as-cran`
  - detects errors early
- Makes it easier to incorporate changes from users

---

# R-hub

[r-hub](https://github.com/r-hub/rhub) looks promising

```r
rhub::platforms()$name
```

```
##  [1] "debian-gcc-devel"              "debian-gcc-patched"
##  [3] "debian-gcc-release"            "fedora-clang-devel"
##  [5] "fedora-gcc-devel"              "linux-x86_64-centos6-epel"
##  [7] "linux-x86_64-centos6-epel-rdt" "linux-x86_64-rocker-gcc-san"
##  [9] "ubuntu-gcc-devel"              "ubuntu-gcc-release"
## [11] "windows-x86_64-devel"          "windows-x86_64-oldrel"
## [13] "windows-x86_64-patched"        "windows-x86_64-release"
```

```r
rhub::check(platform = "windows-x86_64-release")
```

.footnote[
[1] Developed by the R Consortium
]

---
class: center, middle

# Use [win-builder](http://win-builder.r-project.org/)

---
class: center, middle, inverse

# Packages don't exist in isolation

---

# Imports & Depends

![](index_files/figure-html/unnamed-chunk-15-1.svg)<!-- -->

---
background-image: url(graphics/drat.png)

---
background-image: url(graphics/empty_packages.png)

---
background-image: url(graphics/cran_white_edges.png)

---

# Can we detect clusters?

* Use `igraph::cluster_walktrap()`
* Tries to find small, tightly connected subgraphs

.footnote[
[1] Credit: [Andrie de Vries](https://www.slideshare.net/RevolutionAnalytics/the-network-structure-of-cran-2015-0702-final)
]

---
background-image: url(graphics/cran.png)

---

# Think of the package authors

* __Rcpp__ and __ggplot2__ have a large number of reverse dependencies
* Whenever these packages change, their authors need to make sure they don't break other packages

---

# What happens if a package is "wrong"?

---

# The schoolmath package

* [schoolmath](https://cran.r-project.org/web/packages/schoolmath/): Functions and datasets for school mathematics
  - Simple prime number algorithms
  - Two versions: released within 24 hours of each other
* It passes all of the CRAN checks, but...
* Indisputable bugs found
* Who's in charge of removing the package?

---

# Do we need to use CRAN?

---

# Create your own repositories!

* The [drat](https://cran.rstudio.com/web/packages/drat/) package makes this straightforward
  - Create and maintain repositories
  - Since GitHub provides web pages
    - Put your repository there - with version control!
* Create repository

```r
library("drat")
initRepo()
```

* Add to repository

```r
insertPackage("abc-1.0.0.tar.gz")
```

* Install packages using the usual dance routine

```r
install.packages("abc", repos = "...")
```

---

# My set-up for running courses

- Each course has its own package
- We don't bother CRAN, since the packages are of limited interest
- We are in complete control
- Packages are hosted on GitHub
- When we make a change, travis-ci checks the package
  - If the package passes, travis-ci pushes it to our repository
- Combined with [miniCRAN](https://cran.r-project.org/web/packages/miniCRAN/), we have all packages + dependencies on a USB stick!

---
class: center, middle
background-image: url(graphics/bag.jpg)

# But, CRAN does provide trust...

???

Image credit: [Splitshire.com](https://www.splitshire.com/vintage-bag-on-country-roads/)

---

# Pyramid of trust

<img src="graphics/hieRarchy.png" width="60%" style="display: block; margin: auto;" />

.footnote[
Image credit: [rud.is](https://rud.is/b/2017/02/23/on-watering-holes-trust-defensible-systems-and-data-science-community-security/)
]

---
background-image: url(graphics/skyfield.jpg)

# Summary

* One of the reasons R has been a success is CRAN
  - `install.packages("XYZ")` just works
* A large number of people work to make this possible
* Will CRAN be able to keep up as is?
  - perhaps...

--

## Thank you and any questions?

---

## Credits

* Images from:
  - https://unsplash.com/
  - https://www.splitshire.com/
* R logo from [CRAN](https://www.r-project.org/logo/)
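---

# Appendix: checking the namespace lookup

The `sd()` and `var()` slides can be tried directly. This is a minimal sketch using only base R and __`stats`__; the vector `x` and the names `sd_env`, `masked_result`, `sd_result` and `true_result` are made up for illustration.

```r
# sd() is defined in the stats namespace, so names used inside it
# are looked up there before the global environment is consulted
sd_env <- environmentName(environment(sd))

# Mask var() in the calling environment with a deliberately wrong definition
var <- function(x, ...) 1

x <- c(2, 4, 4, 4, 5, 5, 7, 9)

masked_result <- sqrt(var(x))        # uses the masking var() defined above
sd_result <- sd(x)                   # still uses stats::var() internally
true_result <- sqrt(stats::var(x))   # explicit reference via ::

# Tidy up so later code sees the real var() again
rm(var)
```

`sd_result` agrees with `true_result` rather than `masked_result`, because `var` is found in the __`stats`__ namespace before the masking copy is ever reached.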