class: center, middle, inverse, title-slide

# Condensing data with numerical summaries
## Chapter 2
### Colin Gillespie (<a href="https://twitter.com/csgillespie">@csgillespie</a>)

---

layout: true
background-image: url(assets/white_logo.png)

<div class="jr-header-inverse">
        <img class="logo" src="assets/white_logo_full.png"/>
        <span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
      </div>
      
<div class="jr-footer-inverse"><span>&copy; 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter2.html</div></div>
---
class: inverse, center, middle

> James Bond: [to Vesper] Why is it that people who can't take advice always insist on giving it?

---
layout: true

<div class="jr-header">
        <img class="logo" src="assets/white_logo_full.png"/>
        <span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
      </div>
      
<div class="jr-footer"><span>&copy; 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter2.html</div></div>

---

# Overview

* Data scientists need to condense large amounts of information into a few key summaries
  
  * There are many possible choices when simplifying data
  
    - "One size __doesn't__ fit all"

* Look at standard measures of location and spread
  
    - Building blocks for many statistical and machine learning algorithms

*  Some measures fit naturally into a big data scenario while others do not

---

# A [wee](http://www.dsl.ac.uk/entry/snd/wee_n1_adj_adv) bit of maths

* Basic notation
    
    - Generalise all situations with a simple shorthand
    
  * We replace actual numbers with letters in order to be able to write general formulae
  
    - Upper case letter to represent our random variable 
    
    - Lower case to represent sample data
      - Subscripts to distinguish individual observations in the sample
    
---

# Example

Suppose we ask three people how many mobile phone calls they made yesterday
  
  - We might get the following data: 1, 5, 7
  
  - With another sample we would most likely get different data, say 2, 0, 3
    
  - Algebra represents the general case as `$x_1$`, `$x_2$`, `$x_3$`

| | | | |
|------------|---|---|---|
| 1st sample: | 1 | 5 | 7 |
| 2nd sample: | 2 | 0 | 3 |
| `$\vdots$` | `$\vdots$` | `$\vdots$` | `$\vdots$` |
| Typical sample: | `$x_1$` | `$x_2$` | `$x_3$` |

---

# Further generalisation

* We denote a random variable by `$X$` and the `$i\text{th}$` observation in the sample by `$x_i$`.
  
  * Previously, we had 1, 5, 7 so
  
    * `$x_1 = 1$`, `$x_2 = 5$` and `$x_3 = 7$`
--

* The total number of observations in a sample is the _sample size_
  
    - Referred to by the letter `$n$`; so `$n = 3$` above
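
The notation maps directly onto an R vector. A minimal sketch, using the first sample from the example (the object names `x` and `n` are chosen here purely for illustration):

```r
## First sample from the example: calls made by three people
x <- c(1, 5, 7)

## x[1] plays the role of x_1, x[3] the role of x_3
x[1]
#> [1] 1
x[3]
#> [1] 7

## The sample size n
n <- length(x)
n
#> [1] 3
```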

---

# Sums

* The next important piece of notation to introduce is the symbol `$\sum$`
  
    - Upper case of the Greek letter `$\sigma$`, pronounced 'sigma'

* It is used to represent the phrase 'sum the values'

$$
\sum_{i=1}^n x_i = x_1 + x_2 + \dots + x_n.
$$

- Sum of all the values in our data (from the first `$i = 1$` to the last `$i = n$`)
  
--

- Often shortened to `$\sum x$`
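
In R the summation symbol corresponds to `sum()`. The sketch below (with illustrative object names) also spells the sum out as a loop, purely to mirror the index running from `$i = 1$` to `$i = n$`:

```r
x <- c(1, 5, 7)

## sum() adds up every element of the vector
total <- sum(x)
total
#> [1] 13

## The same sum written out term by term, for i = 1, ..., n
running_total <- 0
for (i in seq_along(x)) {
  running_total <- running_total + x[i]
}
running_total
#> [1] 13
```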

---

# Measures of location (averages)

- What's a typical observation?
  
--

- Mean and the median

---

# Sample mean

`$$\bar x = \frac{x_1 + x_2 + \ldots + x_n}{n} = \frac{1}{n} \sum_{i=1}^n x_i$$`

where

* `$x_i$` are our data points
* `$n$` is the sample size, i.e. the number of data points in our sample

---

# Example

So if our data set was `$0,3,2,0$`, then `$n=4$`. Hence,
$$\bar x = \frac{1}{n} \sum_{i=1}^n x_i = \frac{0 + 3 + 2 + 0}{4} = 1.25 $$

In statistics, it is common to use a bar over a variable to denote the 
mean
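
The same calculation can be checked in R. This is a small sketch that builds the mean from its definition and compares it with the built-in `mean()`; the name `x_bar` is just an illustrative choice:

```r
x <- c(0, 3, 2, 0)
n <- length(x)

## Sample mean from the definition: sum of the values divided by n
x_bar <- sum(x) / n
x_bar
#> [1] 1.25

## The built-in function gives the same answer
mean(x)
#> [1] 1.25
```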

---

# Example: The beauty data set

* This is __not__ made up

---

# The relationship between beauty and teaching

[Study](https://doi.org/10.3386/w9853) where researchers were 
interested in whether a lecturer's attractiveness affected 
their course evaluation!

* evaluation: the questionnaire result

* tenured: does the lecturer have tenure; 1 == Yes. In R, this value is stored as a number

* minority: does the lecturer come from an ethnic minority (in the USA)

* age: the lecturer's age

* gender: a factor: Female or Male

* students: number of students in the class

* beauty: each of the lecturers' pictures was rated by six undergraduate students: three women and three men
 
---

# The relationship between beauty and teaching

* The `beauty` data set contains results for 463 classes
  
  * Does attractiveness affect teaching ability?
  
---

# Example: The beauty data set

After loading the data set into R we can extract a particular column using the dollar 
operator

```r
## Attractiveness score
beauty$beauty
## Number of students per class
beauty$students
```

---

# Example: The beauty data set

We can use the built-in function `mean()`

```r
## Attractiveness score (normalised)
mean(beauty$beauty)
#> [1] -0.0883
```
--

```r
## Number of students per class
mean(beauty$students)
#> [1] 55.2
```

---

# Sample median

* Occasionally used instead of the mean
  
    - Particularly when the data have an asymmetric profile or when there are outliers (as indicated by a graph, perhaps)
    
    - Wages!
--

* Work with _ranked_ observations
--

- `$x_{(1)}, x_{(2)}, \ldots, x_{(n)}$`  - notice the brackets __()__
    - `$x_{(1)}$` minimum
    - `$x_{(n)}$` maximum
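
To see why the median is preferred for data like wages, here is a small sketch with made-up salaries (in thousands; the values are hypothetical and not from any data set in this chapter). A single very large value pulls the mean upwards but leaves the median almost untouched:

```r
## Hypothetical salaries in thousands, with one very large value
wages <- c(20, 22, 25, 27, 30, 250)

## Ranked observations: x_(1), ..., x_(n)
sort(wages)

## The mean is dragged towards the outlier
mean(wages)

## The median stays with the bulk of the data
median(wages)
#> [1] 26
```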

---

# Sample median

* Odd sample size

`$$\text{Sample median} =x_{((n+1)/2)}$$`
--

* Even sample size

`$$\text{Sample median} =\frac{1}{2}x_{(n/2)} + \frac{1}{2} x_{(n/2+1)}$$`

---

# Example

For our simple data set `$\{0, 3, 2, 0\}$`, to calculate the median we

- re-order it to: `$\{0, 0, 2, 3\}$`

--
  
  - take the average of the middle two observations `$(0 + 2)/2$`, to get 1
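
The odd and even rules can be written out by hand in R. This is only a sketch to show the mechanics (the names `x_sorted` and `med` are illustrative); in practice the built-in `median()` does the same job:

```r
x <- c(0, 3, 2, 0)
n <- length(x)

## Rank the observations
x_sorted <- sort(x)

if (n %% 2 == 1) {
  ## Odd sample size: the middle ranked value
  med <- x_sorted[(n + 1) / 2]
} else {
  ## Even sample size: average of the two middle ranked values
  med <- (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
}
med
#> [1] 1

median(x)
#> [1] 1
```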

---

# Example: The beauty data set

R has a built-in function called `median()`

```r
median(beauty$beauty)
#> [1] -0.156
median(beauty$students)
#> [1] 29
```

---

# Exercise (10 minutes or so)

For the beauty data set, calculate the `mean()` and `median()` for the
columns

* tenured

* minority

* age

---
layout: true
background-image: url(assets/white_logo.png)

# Measures of spread

---
layout: true

# Measures of spread

A measure of location is insufficient in itself to summarise data

* It only describes the value of a typical outcome

* Compare 10, 20, 30 with 19, 20, 21

* Both have the same mean (20) and the same median (20)

* Yet the two data sets are clearly different: the first is far more spread out
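
A quick check of this comparison in R. The vectors `a` and `b` are illustrative names for the two small data sets; looking at the deviations from the mean shows what the measures of location miss:

```r
a <- c(10, 20, 30)
b <- c(19, 20, 21)

## Identical measures of location
c(mean(a), mean(b))
#> [1] 20 20
c(median(a), median(b))
#> [1] 20 20

## Deviations from the mean tell a different story
a - mean(a)
#> [1] -10   0  10
b - mean(b)
#> [1] -1  0  1
```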

---

# Range
  
The range is easy to calculate - largest minus the smallest

`$$\text{Range }= x_{(n)} - x_{(1)}$$`

where `$x_{(n)}$` is the largest value in our data set and `$x_{(1)}$` is the smallest value.

So for our data set of `$\{0,3,2,0\}$`:

* the range is `$3-0=3$`
  
  * Useful for data checking purposes
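
One thing to be aware of: R's `range()` returns the smallest and largest values rather than their difference. A short sketch of how to get the range as defined above:

```r
x <- c(0, 3, 2, 0)

## range() returns the minimum and the maximum
range(x)
#> [1] 0 3

## Largest minus smallest
diff(range(x))
#> [1] 3
max(x) - min(x)
#> [1] 3
```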

---

# Range: issues

1. Unduly influenced by one particularly large or small value, known as an outlier

1. Only really suitable for comparing (roughly) equally sized samples as 
it is more likely that large samples contain the extreme values of a population

> Useful for data checking

---

# Example: OKCupid ranges

```r
range(cupid$age)
#> [1]  18 110
```

Perhaps a little on the high side?

It turns out that there are:

* 31 sixty-nine-year-olds in the sample
* A single person at 109
* A single person at 110

---

# Exercise: 5 minute thought exercise

In terms of data collection, how would you avoid getting these types of observations?

---

# Sample variance and standard deviation
  
  * The mean and variance are the most popular measures of location and spread
  
  * Nice mathematical properties
  
The sample variance, `$s^2$`:

`$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2$$`

The formula can be rewritten as

`$$s^2= \sum_{i=1}^n \frac{x_i^2}{n-1}  - \left(\frac{n}{n-1}\right) \bar x^2$$`

The sample standard deviation, `$s$`, is the square root of the sample variance.
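
As a sanity check, both forms of the formula can be evaluated in R and compared with the built-in `var()` and `sd()`. This is a sketch on the small data set used earlier; `s2_def` and `s2_alt` are illustrative names:

```r
x <- c(0, 3, 2, 0)
n <- length(x)
x_bar <- mean(x)

## Definition: squared deviations from the mean, divided by n - 1
s2_def <- sum((x - x_bar)^2) / (n - 1)

## Rewritten form
s2_alt <- sum(x^2) / (n - 1) - (n / (n - 1)) * x_bar^2

## Both agree with the built-in function
c(s2_def, s2_alt, var(x))
#> [1] 2.25 2.25 2.25

## The standard deviation is the square root of the variance
c(sqrt(s2_def), sd(x))
#> [1] 1.5 1.5
```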

---

# Rule of thumb

For roughly symmetric, bell-shaped data, approximately 95% of the data is contained in the interval

`$$\bar x \pm 2 \times s$$`

> Tomorrow we'll see where this rule comes from
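
The rule can be tried out on simulated data. The sketch below assumes normally distributed values (the mean of 50, standard deviation of 10 and the seed are arbitrary choices for illustration) and counts the proportion that falls inside the interval; it should come out close to 0.95:

```r
## Simulated data; the parameters are arbitrary illustrative choices
set.seed(1)
x <- rnorm(10000, mean = 50, sd = 10)

## Rule of thumb interval
lower <- mean(x) - 2 * sd(x)
upper <- mean(x) + 2 * sd(x)

## Proportion of observations inside the interval
prop_inside <- mean(x >= lower & x <= upper)
prop_inside
```

For strongly skewed data, or data with outliers, the proportion can differ noticeably from 95%.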

---

# Example: Beauty data set

```r
sd(beauty$evaluation)
#> [1] 0.555
var(beauty$evaluation)
#> [1] 0.308
```

---

# Quartiles and the interquartile range
  
* `$Q_3$`: Upper quartile = `$\left(3(n+1)/4\right)^{\text{th}}$` smallest observation

* `$Q_1$`: Lower quartile = `$\left((n+1)/4\right)^{\text{th}}$` smallest observation

`$$IQR = Q_3 - Q_1$$`
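
A small sketch of these definitions, using a made-up sample of size 7 so that `$(n+1)/4$` is a whole number. Note that R's `quantile()` and `IQR()` interpolate differently by default (`type = 7`); `type = 6` uses the `$(n+1)$`-based positions given above, so the two can disagree on small samples:

```r
## Hypothetical sample, chosen so that (n + 1) / 4 is a whole number
x <- c(1, 3, 5, 7, 9, 11, 13)
n <- length(x)
x_sorted <- sort(x)

## Quartiles straight from the definitions
q1 <- x_sorted[(n + 1) / 4]       # 2nd smallest observation
q3 <- x_sorted[3 * (n + 1) / 4]   # 6th smallest observation
iqr <- q3 - q1
c(q1, q3, iqr)
#> [1]  3 11  8

## type = 6 matches the (n + 1)-based rule
IQR(x, type = 6)

## The default (type = 7) interpolates differently
IQR(x)
```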

---

# R example

The quartiles are special cases of percentiles:

* The minimum is the `$0^{\text{th}}$` percentile
* `$Q_1$` is the `$25^{\text{th}}$` percentile
* The median, `$Q_2$`, is the `$50^{\text{th}}$` percentile
* `$Q_3$` is the `$75^{\text{th}}$` percentile
* The maximum is the `$100^{\text{th}}$` percentile

```r
quantile(cupid$age)
#>   0%  25%  50%  75% 100% 
#>   18   26   30   37  110
```

---

# R example

```r
## The middle 95% 
quantile(cupid$age, probs = c(0.025, 0.975))
#>  2.5% 97.5% 
#>    20    58
```

```r
# Rule of thumb
mean(cupid$age) - 2 * sd(cupid$age)
#> [1] 13.4
mean(cupid$age) + 2 * sd(cupid$age)
#> [1] 51.2
```

---

# Exercise

For the beauty data set calculate the `sd()`, `range()` and `quantile()` for

* `tenured`

* `age`

* `beauty`

---
layout: true
background-image: url(assets/white_logo.png)

## Streaming data is data that is generated continuously by multiple sources.

---
layout: true

# The mean and variance

* Suppose we have observed `$k-1$` values and our current estimate of the mean is `$m_{k-1}$`
 
  - If it helps, just make `$k$` a number, e.g. `$k = 1000$`

* A new observation, `$x_k$` arrives. How would we update our estimate? 
--

* An obvious method is
`$$m_k = \frac{1}{k} ((k-1) \times m_{k - 1} + x_k)$$`

* However, this method isn't _numerically stable_

* The product `$(k-1) \times m_{k - 1}$` grows with `$k$`, so adding a comparatively small `$x_k$` to it loses floating-point precision

---

# Streaming data: mean & variance

Instead we should use

`$$m_k = m_{k-1} + \frac{x_k - m_{k-1}}{k}$$`

We have a similar algorithm for the [variance](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)
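
A minimal sketch of the streaming update in R. The mean uses the stable formula above; for the variance it uses Welford's online algorithm, which is one common choice among those described on the linked page. The function `update_summary()` and the `state` list are hypothetical names made up for this illustration:

```r
## Hypothetical helper: fold one new observation into the running summaries
update_summary <- function(state, x_new) {
  k <- state$k + 1
  delta <- x_new - state$mean
  ## Stable mean update: m_k = m_{k-1} + (x_k - m_{k-1}) / k
  m <- state$mean + delta / k
  ## Welford's update of the sum of squared deviations
  ss <- state$ss + delta * (x_new - m)
  list(k = k, mean = m, ss = ss)
}

## Start with no data, then feed in observations one at a time
state <- list(k = 0, mean = 0, ss = 0)
x <- c(0, 3, 2, 0)
for (x_k in x) {
  state <- update_summary(state, x_k)
}

## Running mean and sample variance after all observations
state$mean
#> [1] 1.25
state$ss / (state$k - 1)
#> [1] 2.25
```

Only three numbers are stored, however many observations arrive, which is what makes the mean and variance a natural fit for streaming data.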

---

# Example: OKCupid

The OKCupid data set contains almost 60,000 individuals.

---

# The median and quantiles

* Keeping a [running median](https://stackoverflow.com/q/10657503/203420) is a non-trivial task
  
  * We need to maintain a sorted data structure containing the data
  
  * The key issues are storage cost of the data and retrieval time
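
To make the storage issue concrete, here is a deliberately naive sketch that keeps every observation in a sorted vector. The helpers `insert_sorted()` and `median_sorted()` are hypothetical names written for this illustration, not part of any package:

```r
## Hypothetical helper: insert a new value into an already sorted vector
insert_sorted <- function(sorted, x_new) {
  ## findInterval() gives the number of sorted values that are <= x_new
  pos <- if (length(sorted) == 0) 0 else findInterval(x_new, sorted)
  append(sorted, x_new, after = pos)
}

## Hypothetical helper: median of a vector that is already sorted
median_sorted <- function(sorted) {
  n <- length(sorted)
  if (n %% 2 == 1) {
    sorted[(n + 1) / 2]
  } else {
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

sorted <- numeric(0)
running_median <- numeric(0)
for (x_k in c(0, 3, 2, 0)) {
  sorted <- insert_sorted(sorted, x_k)
  running_median <- c(running_median, median_sorted(sorted))
}
running_median
#> [1] 0.0 1.5 2.0 1.0
```

Unlike the running mean, this approach has to store every value seen so far, and each insertion copies the vector, so the cost grows with the length of the stream.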

---

# Relevant R functions

<table>
<caption>Summary of R commands in this chapter.</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Command </th>
   <th style="text-align:left;"> Comment </th>
   <th style="text-align:left;"> Example </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> mean </td>
   <td style="text-align:left;"> Calculates the mean of a vector </td>
   <td style="text-align:left;"> mean(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> median </td>
   <td style="text-align:left;"> Calculates the median of a vector </td>
   <td style="text-align:left;"> median(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sd </td>
   <td style="text-align:left;"> Calculates the standard deviation of a vector </td>
   <td style="text-align:left;"> sd(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> var </td>
   <td style="text-align:left;"> Calculates the variance of a vector </td>
   <td style="text-align:left;"> var(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> quantile </td>
   <td style="text-align:left;"> Calculates the quantiles of a vector (quartiles by default) </td>
   <td style="text-align:left;"> quantile(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> range </td>
   <td style="text-align:left;"> Returns the minimum and maximum of a vector </td>
   <td style="text-align:left;"> range(x) </td>
  </tr>
</tbody>
</table>