class: center, middle, inverse, title-slide

# Condensing data with numerical summaries
## Chapter 2
### Colin Gillespie (@csgillespie)

---
layout: true
background-image: url(assets/white_logo.png)
<div class="jr-header-inverse">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer-inverse"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter2.html</div></div>

---
class: inverse, center, middle

> James Bond: [to Vesper] Why is it that people who can't take advice always insist on giving it?

---
layout: true
<div class="jr-header">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter2.html</div></div>

---

# Overview

* Data scientists need to condense large amounts of information into a few key summaries
* There are many possible choices when simplifying data
    - "One size __doesn't__ fit all"

--

* Look at standard measures of location and spread
    - Building blocks for many statistical and machine learning algorithms

--

* Some measures fit naturally into a big data scenario while others do not

---

# A [wee](http://www.dsl.ac.uk/entry/snd/wee_n1_adj_adv) bit of maths

* Basic notation
    - Generalise all situations with a simple shorthand
* We replace actual numbers with letters so that we can write general formulae
    - Upper case letter to represent our random variable
    - Lower case to represent sample data
    - Subscripts to distinguish individual observations in the sample

---

# Example

Suppose we ask three people how many mobile phone calls they made yesterday

- We might get the following data: 1, 5, 7
- With another sample we would most likely get different data, say 2, 0, 3
- Algebra represents the general case as `\(x_1\)`, `\(x_2\)`, `\(x_3\)`

--

|                 |              |              |              |
|-----------------|--------------|--------------|--------------|
| 1st sample:     | 1            | 5            | 7            |
| 2nd sample:     | 2            | 0            | 3            |
| `\(\vdots\)`    | `\(\vdots\)` | `\(\vdots\)` | `\(\vdots\)` |
| Typical sample: | `\(x_1\)`    | `\(x_2\)`    | `\(x_3\)`    |

---

# Further generalisation

* We denote a random variable by `\(X\)` and the `\(i\text{th}\)` observation in the sample by `\(x_i\)`.
* Previously, we had 1, 5, 7 so
    * `\(x_1 = 1\)`, `\(x_2 = 5\)` and `\(x_3 = 7\)`

--

* The total number of observations in a sample is the _sample size_
    - Referred to by the letter `\(n\)`; so `\(n = 3\)` above

---

# Sums

* The next important piece of notation to introduce is the symbol `\(\sum\)`
    - Upper case of the Greek letter `\(\sigma\)`, pronounced "sigma"
* It is used to represent the phrase "sum the values"

$$
\sum_{i=1}^n x_i = x_1 + x_2 + \dots + x_n.
$$

- Sum of all the values in our data (from the first `\(i = 1\)` to the last `\(i = n\)`)

--

- Often shortened to `\(\sum x\)`

---

# Measures of location (averages)

- What's a typical observation?

--

- Mean and the median

---

# Sample mean

`$$\bar x = \frac{x_1 + x_2 + \ldots + x_n}{n} = \frac{1}{n} \sum_{i=1}^n x_i$$`

where

* `\(x_i\)` are our data points
* `\(n\)` is the sample size, i.e. the number of data points in our sample

---

# Example

So if our data set was `\(0,3,2,0\)`, then `\(n=4\)`. Hence,

$$\bar x = \frac{1}{n} \sum_{i=1}^n x_i = \frac{0 + 3 + 2 + 0}{4} = 1.25 $$

In statistics, it is common to use a bar over a variable to denote the mean

---

# Example: The beauty data set

* This is __not__ made up

---

# The relationship between beauty and teaching

[Study](https://doi.org/10.3386/w9853) where researchers were interested in whether a lecturer's attractiveness affected their course evaluation!

--

* evaluation: the questionnaire result
* tenured: does the lecturer have tenure; 1 == Yes. In R, this value is stored as a continuous (numeric) variable
* minority: does the lecturer come from an ethnic minority (in the USA)

--

* age: the lecturer's age
* gender: a factor: Female or Male
* students: number of students in the class
* beauty: each of the lecturers' pictures was rated by six undergraduate students: three women and three men

---

# The relationship between beauty and teaching

* The `beauty` data set contains results for 463 classes
* Does attractiveness affect teaching ability?

---

# Example: The beauty data set

After loading the data set into R we can extract a particular column using the dollar operator

```r
## Attractiveness score
beauty$beauty
## Number of students per class
beauty$students
```

---

# Example: The beauty data set

We can use the built-in function `mean()`

```r
## Attractiveness score (normalised)
mean(beauty$beauty)
#> [1] -0.0883
```

--

```r
## Number of students per class
mean(beauty$students)
#> [1] 55.2
```

---

# Sample median

* Occasionally used instead of the mean
    - Particularly when the data have an asymmetric profile or when there are outliers (as indicated by a graph, perhaps)
    - Wages!

--

* Work with _ranked_ observations

--

    - `\(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\)` (notice the brackets __()__)
    - `\(x_{(1)}\)`: minimum
    - `\(x_{(n)}\)`: maximum

---

# Sample median

* Odd sample size
`$$\text{Sample median} = x_{((n+1)/2)}$$`

--

* Even sample size
`$$\text{Sample median} = \frac{1}{2} x_{(n/2)} + \frac{1}{2} x_{(n/2+1)}$$`

---

# Example

For our simple data set `\(\{0, 3, 2, 0\}\)`, to calculate the median we

- re-order it to: `\(\{0, 0, 2, 3\}\)`

--

- take the average of the middle two observations `\((0 + 2)/2\)`, to get 1

---

# Example: The beauty data set

R has a built-in function called `median()`

```r
median(beauty$beauty)
#> [1] -0.156
median(beauty$students)
#> [1] 29
```

---

# Exercise (10 minutes or so)

For the beauty data set, calculate the `mean()` and `median()` for the columns

* tenured
* minority
* age

---
layout: true
background-image: url(assets/white_logo.png)
<div class="jr-header-inverse">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer-inverse"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter2.html</div></div>

---
class: inverse, center, middle

# Measures of spread

---
layout: true
<div class="jr-header">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter2.html</div></div>

---

# Measures of spread

A measure of location is insufficient in itself to summarise data

* It only describes the value of a typical outcome

--

* Compare:
    * 10, 20, 30 and 19, 20, 21
    * Both have the same mean (20) and the same median (20)
    * Yet the two data sets are clearly different

---

# Range

The range is easy to calculate: the largest value minus the smallest

--

`$$\text{Range} = x_{(n)} - x_{(1)}$$`

where `\(x_{(n)}\)` is the largest value in our data set and `\(x_{(1)}\)` is the smallest value.

--

So for our data set of `\(\{0,3,2,0\}\)`:

* the range is `\(3-0=3\)`
* Useful for data checking purposes

---

# Range: issues

1. Unduly influenced by one particularly large or small value, known as an outlier

--

1. Only really suitable for comparing (roughly) equally sized samples as it is more likely that large samples contain the extreme values of a population

--

> Useful for data checking

---

# Example: OKCupid ranges

```r
range(cupid$age)
#> [1]  18 110
```

Perhaps a little on the high side?

--

It turns out that there are:

* 31 sixty-nine-year-olds in the sample
* A single person at 109
* A single person at 110

---

# Exercise: 5-minute thought exercise

In terms of data collection, how would you avoid getting these types of observations?

---

# Sample variance and standard deviation

* The mean and variance are the most popular summaries
* Nice mathematical properties

The sample variance, `\(s^2\)`:

`$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2$$`

--

The formula can be rewritten as

`$$s^2 = \sum_{i=1}^n \frac{x_i^2}{n-1} - \left(\frac{n}{n-1}\right) \bar x^2$$`

---

# Rule of thumb

Approximately 95% of the data is contained in the interval

`$$\bar x \pm 2 \times s$$`

> Tomorrow we'll see where this rule comes from

---

# Example: Beauty data set

```r
sd(beauty$evaluation)
#> [1] 0.555
var(beauty$evaluation)
#> [1] 0.308
```

---

# Quartiles and the interquartile range

* `\(Q_3\)`: Upper quartile = `\(\left(3(n+1)/4\right)^{\text{th}}\)` smallest observation
* `\(Q_1\)`: Lower quartile = `\(\left((n+1)/4\right)^{\text{th}}\)` smallest observation

`$$IQR = Q_3 - Q_1$$`

---

# R example

The quartiles are special cases of percentiles:

* The minimum is the `\(0^{\text{th}}\)` percentile
* `\(Q_1\)` is the `\(25^{\text{th}}\)` percentile
* The median, `\(Q_2\)`, is the `\(50^{\text{th}}\)` percentile
* `\(Q_3\)` is the `\(75^{\text{th}}\)` percentile
* The maximum is the `\(100^{\text{th}}\)` percentile

```r
quantile(cupid$age)
#>   0%  25%  50%  75% 100%
#>   18   26   30   37  110
```

---

# R example

```r
## The middle 95%
quantile(cupid$age, probs = c(0.025, 0.975))
#>  2.5% 97.5%
#>    20    58
```

--

```r
## Rule of thumb
mean(cupid$age) - 2 * sd(cupid$age)
#> [1] 13.4
mean(cupid$age) + 2 * sd(cupid$age)
#> [1] 51.2
```

---

# Exercise

For the beauty data set calculate the `sd()`, `range()` and `quantile()` for

* `tenured`
* `age`
* `beauty`

---
layout: true
background-image: url(assets/white_logo.png)
<div class="jr-header-inverse">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer-inverse"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter2.html</div></div>

---
class: inverse, center, middle

## Streaming data is data that is generated continuously by multiple sources.

---
layout: true
<div class="jr-header">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter2.html</div></div>

---

# The mean and variance

* Suppose we have observed `\(k-1\)` values and our current estimate of the mean is `\(m_{k-1}\)`
    - If it helps, just make `\(k\)` a number, e.g. `\(k = 1000\)`

--

* A new observation, `\(x_k\)`, arrives. How would we update our estimate?

--

* An obvious method is
`$$m_k = \frac{1}{k} ((k-1) \times m_{k - 1} + x_k)$$`

--

* However, this method isn't _numerically stable_
* Basically, `\((k-1) \times m_{k - 1}\)` gets large and we'll lose precision.

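---

# Sketch: the obvious update in R

The obvious update from the previous slide can be sketched in a few lines of R. This is only an illustration: the helper name `naive_update()` is hypothetical (it is not part of R or of the course material), and the data are the phone-call sample 1, 5, 7 used earlier.

```r
## Hypothetical helper: the "obvious" update
## m_k = ((k - 1) * m_{k-1} + x_k) / k
naive_update <- function(m_prev, x_new, k) {
  ((k - 1) * m_prev + x_new) / k
}

## The phone-call sample from earlier in the chapter
x <- c(1, 5, 7)

## Feed the observations in one at a time
m <- 0
for (k in seq_along(x)) {
  m <- naive_update(m, x[k], k)
}

## The running estimate agrees with the built-in mean()
all.equal(m, mean(x))
#> [1] TRUE
```

On a sample this small the precision problem does not show up; it only becomes a concern when the running total is very large compared with each new observation.
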
---

# Streaming data: mean & variance

Instead we should use the incremental update

`$$m_k = m_{k-1} + \frac{x_k - m_{k-1}}{k}$$`

A similar algorithm exists for the [variance](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)

---

# Example: OKCupid

The OKCupid data set contains almost 60,000 individuals.

<img src="chapter2_files/figure-html/unnamed-chunk-11-1.svg" width="70%" style="display: block; margin: auto;" />

---

# The median and quantiles

* Keeping a [running median](https://stackoverflow.com/q/10657503/203420) is a non-trivial task
* We need to maintain a sorted data structure containing the data
* The key issues are the storage cost of the data and the retrieval time

---

# Relevant R functions

<table>
<caption>Summary of R commands in this chapter.</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Command </th>
   <th style="text-align:left;"> Comment </th>
   <th style="text-align:left;"> Example </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> mean </td>
   <td style="text-align:left;"> Calculates the mean of a vector </td>
   <td style="text-align:left;"> mean(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sd </td>
   <td style="text-align:left;"> Calculates the standard deviation of a vector </td>
   <td style="text-align:left;"> sd(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> var </td>
   <td style="text-align:left;"> Calculates the variance of a vector </td>
   <td style="text-align:left;"> var(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> quantile </td>
   <td style="text-align:left;"> Calculates the quantiles of a vector (quartiles by default) </td>
   <td style="text-align:left;"> quantile(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> range </td>
   <td style="text-align:left;"> Calculates the range (minimum and maximum) of a vector </td>
   <td style="text-align:left;"> range(x) </td>
  </tr>
</tbody>
</table>
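
---

# Sketch: a streaming mean and variance

The incremental update `\(m_k = m_{k-1} + (x_k - m_{k-1})/k\)` from the streaming slides can be sketched as below. This is a minimal illustration, not a definitive implementation: the helper name `update_stats()` and the layout of its `state` list are hypothetical. The variance part uses Welford's method, one of the approaches described on the linked Wikipedia page; the slides do not say which variant is intended, so treat that choice as an assumption.

```r
## Hypothetical helper: update the running summaries with one new value.
## state is a list holding
##   k: number of observations seen so far
##   m: running mean
##   s: running sum of squared deviations from the mean
update_stats <- function(state, x_new) {
  k <- state$k + 1
  delta <- x_new - state$m
  m <- state$m + delta / k              # m_k = m_{k-1} + (x_k - m_{k-1}) / k
  s <- state$s + delta * (x_new - m)    # Welford's update (an assumption)
  list(k = k, m = m, s = s)
}

## The small data set used earlier in the chapter
x <- c(0, 3, 2, 0)

## Feed the observations in one at a time, as if they were streamed
state <- list(k = 0, m = 0, s = 0)
for (value in x) {
  state <- update_stats(state, value)
}

running_mean <- state$m
running_var <- state$s / (state$k - 1)  # sample variance, divisor n - 1

## Both agree with the built-in functions
all.equal(running_mean, mean(x))
#> [1] TRUE
all.equal(running_var, var(x))
#> [1] TRUE
```

Only three numbers are stored, however many observations arrive, which is why the mean and variance fit naturally into a streaming setting while the median does not.
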