class: center, middle, inverse, title-slide

# What, why and how of visualisation
## Chapter 3
### Colin Gillespie (@csgillespie)

---
layout: true
background-image: url(assets/white_logo.png)
<div class="jr-header-inverse">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer-inverse"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter3.html</div></div>

---
class: inverse, center, middle

A picture paints a thousand words

__Frederick R. Barnard__

---
layout: true
<div class="jr-header">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter3.html</div></div>

---

# Visualisation

* Data visualisation is an old topic, but in the last few years it has become particularly trendy
* With the correct visualisation, we can highlight key features that don't require any fancy statistics.
--

## Prerequisites (R)

```r
install.packages(c("ggplot2", "hrbrthemes"))
```

```r
library("ggplot2")
library("hrbrthemes")
```

---
layout: true
background-image: url(assets/white_logo.png)
<div class="jr-header-inverse">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer-inverse"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter3.html</div></div>

---
class: inverse, center, middle

# Historical visualisations

---
layout: true
<div class="jr-header">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter3.html</div></div>

---

# John Snow (not __that__ John Snow)

John Snow (who did know something) created a simple dot map to illustrate the cluster of cholera cases around a water pump

<img src="graphics/Snow-cholera-map-1.jpg" width="50%" style="display: block; margin: auto;" />

---

# Hereford Mappa Mundi (1300)

It is the largest medieval [map](https://en.wikipedia.org/wiki/Hereford_Mappa_Mundi) known still to exist

* At the centre of the map is Jerusalem
* In the bottom left is the UK
* At the top is the garden of Eden!

<img src="graphics/hereford.jpg" width="35%" style="display: block; margin: auto;" />

---

# Florence Nightingale

[Florence Nightingale](https://en.wikipedia.org/wiki/Florence_Nightingale): a true pioneer in the graphical representation of statistics

<img src="graphics/Nightingale-mortality.jpg" width="70%" style="display: block; margin: auto;" />

---

# DatasauRus

<img src="graphics/DinoSequential.gif" width="90%" style="display: block; margin: auto;" />

---

# Scatter plots

* A scatter plot has horizontal and vertical axes to plot data points
* Typically they are used to show how one variable is affected by another

---

# Example: Bond

> Has Bond become more violent over time?

--

```r
*ggplot(bond, aes(x = Number, y = Kills)) + # Create a canvas
  geom_point() # Add a layer of points
```

<img src="chapter3_files/figure-html/unnamed-chunk-5-1.svg" width="70%" style="display: block; margin: auto;" />

---

# Example: Bond

> Has Bond become more violent over time?

```r
ggplot(bond, aes(x = Number, y = Kills)) + # Create a canvas
* geom_point() # Add a layer of points
```

<img src="chapter3_files/figure-html/unnamed-chunk-6-1.svg" width="70%" style="display: block; margin: auto;" />

---

# Example: Bond (in colour)

```r
g = ggplot(bond, aes(x = Number, y = Kills)) + # Create a canvas
* geom_point(aes(colour = Actor)) # Add a layer of points
g
```

<img src="chapter3_files/figure-html/3-5-1.svg" width="70%" style="display: block; margin: auto;" />

---

# Exercise

Using the beauty data set, produce some interesting plots, e.g.

* `beauty` vs `students`

```r
# Perhaps use colour?
ggplot(beauty, aes(x = age, y = students)) +
  geom_point()
```

* `beauty` vs `age`, coloured by `tenured`
* Anything else?

---

# Trend lines

```r
(g_smooth = g +
  stat_smooth(colour = "steelblue", se = FALSE, method = "loess"))
```

<img src="chapter3_files/figure-html/unnamed-chunk-8-1.svg" width="70%" style="display: block; margin: auto;" />

---

# Insights

1. The number of kills appears to be increasing with each Bond movie
2. Pierce Brosnan (the green points) liked killing people
3. Daniel Craig's movies didn't contain (that) many deaths

<img src="chapter3_files/figure-html/unnamed-chunk-9-1.svg" width="70%" style="display: block; margin: auto;" />

---

# Styling your plot

* **ggplot2** has the concept of themes that can style the plot

--

* [hrbrthemes](https://github.com/hrbrmstr/hrbrthemes) package

> typography-centric themes and theme components for ggplot2

Basically, it's opinionated

---

# Less is more

![](graphics/data-ink.gif)

---

# Pretty picture

```r
*library("hrbrthemes")
g_smooth +
  theme_ipsum() +
  scale_color_ipsum() +
  labs(x = "Movie number",
       y = "No. of Bond Kills",
       title = "Kills through time",
       subtitle = "Is there a relationship?",
       caption = "Pierce wasn't a nice man!")
```

---

# Pretty picture

```r
library("hrbrthemes")
g_smooth +
* theme_ipsum() +
  scale_color_ipsum() +
  labs(x = "Movie number",
       y = "No. of Bond Kills",
       title = "Kills through time",
       subtitle = "Is there a relationship?",
       caption = "Pierce wasn't a nice man!")
```

---

# Pretty picture

```r
library("hrbrthemes")
g_smooth +
  theme_ipsum() +
* scale_color_ipsum() +
  labs(x = "Movie number",
       y = "No. of Bond Kills",
       title = "Kills through time",
       subtitle = "Is there a relationship?",
       caption = "Pierce wasn't a nice man!")
```

---

# Pretty picture

```r
library("hrbrthemes")
g_smooth +
  theme_ipsum() +
  scale_color_ipsum() +
* labs(x = "Movie number",
*      y = "No. of Bond Kills",
*      title = "Kills through time",
*      subtitle = "Is there a relationship?",
*      caption = "Pierce wasn't a nice man!")
```

---

# Pretty picture

<img src="chapter3_files/figure-html/unnamed-chunk-14-1.svg" width="100%" style="display: block; margin: auto;" />

---

# Exercise / Q & A

* Does alcohol consumption change through time?

```r
g = ggplot(bond, aes(x = Number, y = Alcohol_Units)) +
  geom_point(aes(colour = Actor))
g
```

* Does there appear to be a relationship between alcohol and the number of kills?
  - Add on `stat_smooth()`
* Change the smoothing method from `loess` to `lm`
  - Add on `stat_smooth(method = "lm")`

---

# Histograms

* A histogram is a graphical representation of the distribution of continuous, numerical data
* If the bars are scaled so that their total area is one, the histogram estimates a _probability_ distribution

--

* Divide the entire range into a series of intervals and count how many values fall into each interval
* Bins do not have to be the same size (but they probably should be in general)

--

* Nice example at [Tin Lizzie](http://tinlizzie.org/histograms/)

---

# Histograms: OKCupid

```r
ggplot(cupid, aes(x = age)) +
  geom_histogram(binwidth = 1)
```

<img src="chapter3_files/figure-html/3-8-1.svg" width="80%" style="display: block; margin: auto;" />

---

# Facets

```r
ggplot(cupid, aes(x = age)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ sex) +
  coord_cartesian(xlim = c(18, 80))
```

<img src="chapter3_files/figure-html/unnamed-chunk-17-1.svg" width="80%" style="display: block; margin: auto;" />

---

# Exercise

Generate histograms for

* `height`

```r
ggplot(cupid, aes(x = height)) +
  geom_histogram()
```

* Try different binwidths, e.g. `geom_histogram(binwidth = 1)`
* Use facets to split by `sex`

---

# Insights

1. There are many more male OKCupid users than female (not surprising)
2. The general age distribution looks similar between sexes
3. The most surprising aspect is that the spike at age 42 is due to female users. This could be
  - pure chance
  - users lying about their age
  - a mistake in data entry

---

# Boxplots

<img src="chapter3_files/figure-html/unnamed-chunk-19-1.svg" width="70%" style="display: block; margin: auto;" />

---

# Boxplot: OKCupid

```r
# Missing values are represented with a -1
ggplot(cupid, aes(x = income, y = age)) +
  geom_boxplot()
```

<img src="chapter3_files/figure-html/unnamed-chunk-20-1.svg" width="80%" style="display: block; margin: auto;" />

---

# Barplots

> A bar chart or bar graph is a chart that presents grouped data with rectangular bars with lengths proportional to the values that they represent.

---

# Barplots (Bond)

```r
ggplot(bond) +
  geom_bar(aes(x = Actor))
```

<img src="chapter3_files/figure-html/unnamed-chunk-21-1.svg" width="70%" style="display: block; margin: auto;" />

---

# Barplots (Bond)

```r
ggplot(bond) +
  geom_bar(aes(x = Actor))
```

* Typically, ordering the axis alphabetically isn't best
* We could order the bars either by the number of movies
* Or by the date the actor first appeared

---

# Barplots (Bond)

<img src="chapter3_files/figure-html/unnamed-chunk-23-1.svg" width="100%" style="display: block; margin: auto;" />

---

# Summary

* We need measures of location __and__ spread
  - Also be wary of just point estimates
* Means and variances can be used for streaming data
* Simple charts can highlight interesting features

--

> Garbage in, Garbage out!

--

> "See" you tomorrow, Good night
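
---

# Aside: streaming means and variances

The summary notes that means and variances can be used for streaming data. One way to do this is Welford's online update, which revises the running mean and variance one value at a time without storing the data. The sketch below is an illustration, not code from the course: `update_stats()` is a hypothetical helper and the `kills` values are made up.

```r
# Welford's online update: fold one new value into the running summary
# `state` holds the count (n), the running mean, and the sum of
# squared deviations from the mean (m2)
update_stats = function(state, x) {
  n = state$n + 1
  delta = x - state$mean
  new_mean = state$mean + delta / n
  m2 = state$m2 + delta * (x - new_mean)
  list(n = n, mean = new_mean, m2 = m2)
}

kills = c(4, 7, 11, 9, 26) # Made-up values, not the bond data
state = list(n = 0, mean = 0, m2 = 0)
for (x in kills) {
  state = update_stats(state, x) # Only the summary is kept
}

running_mean = state$mean
running_var = state$m2 / (state$n - 1) # Sample variance
```

The results should agree with `mean(kills)` and `var(kills)`, which need all of the data at once.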