
class: center, middle, inverse, title-slide

# Margin of Error
## Chapter 5
### Colin Gillespie (<a href="https://twitter.com/csgillespie">@csgillespie</a>)

---

layout: true
background-image: url(assets/white_logo.png)

<div class="jr-header-inverse">
        <img class="logo" src="assets/white_logo_full.png"/>
        <span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
      </div>
      
<div class="jr-footer-inverse"><span>&copy; 2019 Jumping Rivers (jumpingrivers.com)</span><div></div></div>
---
class: inverse, center, middle

### Testing leads to failure, and failure leads to understanding 
### __Burt Rutan__

---
layout: true

<div class="jr-header">
        <img class="logo" src="assets/white_logo_full.png"/>
        <span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
      </div>
      
<div class="jr-footer"><span>&copy; 2019 Jumping Rivers (jumpingrivers.com)</span><div></div></div>
---

# Big picture

> By assuming an underlying normal distribution we can use information from a
sample to inform us about the population

This sounds like a big assumption, but it's not too bad

---

# Introduction & motivating example

We want to compare two advert designs
  
  - At great expense, it has been decided to change the font to Comic Sans
    
  - Does this change work? 
    
--

* Being (data) scientists, we decide to (humanely) experiment on people by randomly showing them the advert

--
  
* From past experience, we know that customers spend 45 seconds (on average) on the site

* After switching to Comic Sans, we recorded the time (in seconds) spent on the site by 20 customers

```
34 51 30 79 54 31 57 62 59 41 77 55 35  3 69 46 47 66 63 59
```

Should we consider switching to Comic Sans?

---

# Introduction & motivating example

* Clearly time will vary visit–by–visit
  
  * The average time is

`$$\bar x = \frac{34 + 51 + 30 + \ldots + 59}{20} = 50.9$$`

* The new website does seem to perform slightly better (compared to 45)
  
  * But we have a very small sample

* How to account for variability?
  
    - Hypothesis test
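
The summary statistics used throughout this chapter can be checked directly in R. This is a minimal sketch; the variable names `x_bar` and `s` are chosen here for illustration.

```r
# Time on site (seconds) for the 20 customers who saw the Comic Sans advert
comic = c(34, 51, 30, 79, 54, 31, 57, 62, 59, 41,
          77, 55, 35, 3, 69, 46, 47, 66, 63, 59)

x_bar = mean(comic)  # sample mean
s = sd(comic)        # sample standard deviation (n - 1 divisor)

x_bar
s
```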

---

# One sample test

* One–sample z–test 
  
    * Compare the mean of a set of sample observations to a target value
--

* The mean in our sample is denoted by `$\bar x$`
  
  * The population mean is denoted as `$\mu$` (pronounced "mu")
  
    * `$\bar x$` is our sample _estimate_ of `$\mu$`

---

# Hypothesis testing

Null hypothesis

`$$H_0: \mu = 45$$`

We usually test against a general alternative hypothesis `$H_1$`
`$$H_1: \mu \ne 45$$`

which says `$\mu$` is not equal to 45

* The null hypothesis is always the dull/boring one

---

# Hypothesis testing

* When performing the hypothesis test, we _assume_ `$H_0$` to be true
  
  * We then ask ourselves the question

> How likely is it that we would observe the data we have, or 
> indeed anything more extreme than this, if the null hypothesis is true?

---

# Hypothesis testing

* Use the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem)

* Although we will not go into the details here, this result tells us that the quantity

`$$Z = \frac{\bar x - \mu}{s/\sqrt{n}}$$`

approximately follows a standard normal distribution (when `$n$` is reasonably large)

* `$\bar x$` is our sample mean
  * `$\mu$` is the assumed value of the population mean under the null hypothesis `$H_0$`
  * `$s$` is the sample standard deviation
  * `$n$` is the sample size
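
A quick simulation can make this result more concrete. The sketch below is illustrative only: the Uniform(0, 1) population, the sample size of 30 and the seed are assumptions chosen for the demonstration, not part of the advert example.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n = 30         # size of each simulated sample
n_sims = 5000  # number of simulated samples
mu = 0.5       # true mean of a Uniform(0, 1) population

# For each simulated sample, compute Z = (x_bar - mu) / (s / sqrt(n))
z = replicate(n_sims, {
  x = runif(n)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})

# If the normal approximation is reasonable, roughly 5% of the
# simulated Z values should fall outside (-1.96, 1.96)
mean(abs(z) > 1.96)
```

Even though the population here is flat rather than bell-shaped, the simulated `z` values should be centred near 0 with a spread close to 1.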

---

# Hypothesis testing

If the null hypothesis is true, then `$\mu = 45$`, so

`$$Z = \frac{\bar x - \mu}{s/\sqrt{n}} = \frac{50.9 - 45}{18.2/\sqrt{20}} = 1.45$$`
How likely is it to have observed this value?
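
The arithmetic for `$Z$` can be reproduced from the raw data. This is a small sketch; `mu_0` is a name introduced here for the value of `$\mu$` under `$H_0$`.

```r
comic = c(34, 51, 30, 79, 54, 31, 57, 62, 59, 41,
          77, 55, 35, 3, 69, 46, 47, 66, 63, 59)

mu_0 = 45          # population mean assumed under H0
n = length(comic)  # sample size

# Test statistic: distance of the sample mean from mu_0,
# measured in standard errors
z = (mean(comic) - mu_0) / (sd(comic) / sqrt(n))
z
```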

---

# How likely is this value?

<img src="chapter5_files/figure-html/unnamed-chunk-1-1.svg" width="80%" style="display: block; margin: auto;" />
---

# How likely is this value?

* Since the normal distribution is symmetric, `$Z = 1.45$` is just as extreme as `$Z = -1.45$`
    
    * The shaded region in the previous diagram illustrates the `$p$`–value 
    
    * This is the probability of observing the data we have, or anything more extreme, if `$H_0$` is true

* The closer the area of the shaded region (the `$p$`–value) is to 0
  
    - The less plausible `$H_0$` is, and the more evidence we have to reject it

---

# How likely is this value?

So, we need to work out the area of the shaded region under the curve in the
diagram above, which can be done using R

```r
pnorm(1.45, lower.tail = FALSE) * 2
#> [1] 0.147
```
So the `$p$`-value is 0.15.

---

# How likely is this value?

Earlier, we said that the smaller this `$p$`–value is, the more evidence we have to reject `$H_0$`. 
The question now is:

> What constitutes a p–value small enough to reject `$H_0$`?

The convention (but by no means a hard–and–fast cut–off) is to reject `$H_0$` if the p–value is
smaller than 5%. Thus, here we would say:

* Our p–value is greater than 5% (in fact, it’s larger than 10% – a computer
can tell us that it’s approximately 14.7%)

* Thus, we do not reject `$H_0$`
  
  * There is insufficient evidence to suggest a real deviation from the previous value

---

# How likely is this value?

> Absence of evidence is not evidence of absence

---

# Example R

```r
comic = c(34, 51, 30, 79, 54, 31, 57, 62, 59, 41, 77, 55, 35, 3, 69, 46, 47, 66, 63, 59)
t.test(comic, mu = 45)
#> 
#> 	One Sample t-test
#> 
#> data:  comic
#> t = 1, df = 20, p-value = 0.2
#> alternative hypothesis: true mean is not equal to 45
#> 95 percent confidence interval:
#>  42.4 59.4
#> sample estimates:
#> mean of x 
#>      50.9
```
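
The printed summary rounds to very few significant digits (for 20 observations the degrees of freedom are `$n - 1 = 19$`). A sketch of how the unrounded values could be pulled out of the object returned by `t.test()`; the name `res` is chosen here for illustration.

```r
comic = c(34, 51, 30, 79, 54, 31, 57, 62, 59, 41,
          77, 55, 35, 3, 69, 46, 47, 66, 63, 59)

res = t.test(comic, mu = 45)

res$statistic  # t statistic
res$parameter  # degrees of freedom, n - 1
res$p.value    # two-sided p-value
res$conf.int   # 95% confidence interval for the mean
```

The `$p$`-value here comes from the `$t$`-distribution, so it is a little larger than the one obtained from the normal distribution with `pnorm()`.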

---

# Example: OKCupid

* The OKCupid dataset provides the heights of its users
  
  * How consistent are the heights given by users with the average height across the USA?
  
  * From the [CDC](https://www.cdc.gov/nchs/data/series/sr_11/sr11_252.pdf) paper we discover the 
  average male height in the USA is 69.3 inches
  
---

# Example: OKCupid

```r
## Select Males
height = cupid$height[cupid$sex == "m"]

## Remove missing values
height = height[!is.na(height)]
mean(height)
#> [1] 70.4
```

---

# Example: OKCupid

```r
t.test(height, mu = 69.3)
#> 
#> 	One Sample t-test
#> 
#> data:  height
#> t = 70, df = 40000, p-value <2e-16
#> alternative hypothesis: true mean is not equal to 69.3
#> 95 percent confidence interval:
#>  70.4 70.5
#> sample estimates:
#> mean of x 
#>      70.4
```

---

# Exercise

* From the CDC, the average female height is 63.8 inches
  
  * Test whether females in the OKCupid sample are taller than this

---

# Errors

---

# Two sample z-test

* Suppose we want to test another improvement to our website
  *  We think that adding a [blink](https://en.wikipedia.org/wiki/Blink_element) tag would be a good way of
attracting customers.

* Monitoring the first twenty customers we get

```
21 32 46 19 29 31 37 28 50 29 34 40 26 20 48  7 39 30 40 34
```

How do we compare the website that uses the Comic Sans font to the blinking site? We use a two-sample z-test!

---

# Two sample z-test

The null hypothesis is that the mean time spent on the two pages is the same, i.e.

`$$H_0: \mu_1 = \mu_2$$`

The alternative hypothesis is that the two pages differ, i.e.

`$$H_1: \mu_1 \ne  \mu_2.$$`
--

The corresponding test statistic is

`$$Z = \frac{\bar x_1 - \bar x_2}{s \sqrt{1/n_1 + 1/n_2}},$$`

where `$s$` is the pooled standard deviation of the two samples.
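
As a rough sketch, the statistic can be computed by hand, taking `$s$` to be the pooled standard deviation (the choice that matches `t.test()` with `var.equal = TRUE`). The names `s_pooled`, `n1` and `n2` are introduced here for illustration.

```r
comic = c(34, 51, 30, 79, 54, 31, 57, 62, 59, 41,
          77, 55, 35, 3, 69, 46, 47, 66, 63, 59)
blink = c(21, 32, 46, 19, 29, 31, 37, 28, 50, 29,
          34, 40, 26, 20, 48, 7, 39, 30, 40, 34)

n1 = length(comic)
n2 = length(blink)

# Pooled standard deviation: a weighted average of the two sample variances
s_pooled = sqrt(((n1 - 1) * var(comic) + (n2 - 1) * var(blink)) / (n1 + n2 - 2))

# Two-sample test statistic
z = (mean(comic) - mean(blink)) / (s_pooled * sqrt(1 / n1 + 1 / n2))
z
```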

---

# Two sample z-test

```r
blink = c(21, 32, 46, 19, 29, 31, 37, 28, 50, 29, 34, 40, 26, 20, 48, 7, 39, 30, 40, 34)
t.test(comic, blink, var.equal = TRUE)
#> 
#> 	Two Sample t-test
#> 
#> data:  comic and blink
#> t = 4, df = 40, p-value = 3e-04
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>   9.37 28.43
#> sample estimates:
#> mean of x mean of y 
#>      50.9      32.0
```
In this example, since the p-value is relatively small, we can conclude that the two
web-designs do appear to be different.

---

# Confidence intervals

* When we get an answer, we don't just want a point estimate, i.e. a single number
    - We want a plausible range
    
* Confidence intervals provide an alternative to hypothesis tests for assessing questions about the
population mean (or population means in two sample problems)

* Recall that the sample mean `$\bar x$` is an estimate of the population mean `$\mu$`
  
  * The problem is that if we were to take many samples from the population, and so calculate many `$\bar x$`'s, they would likely all differ from each other. Which one would we trust the most?
  
--

Central to the idea of margin of error is the [_central limit theorem_](https://en.wikipedia.org/wiki/Central_limit_theorem).

---

# Construction

1. Find the mean in our sample, `$\bar x$`

1. Subtract some amount from `$\bar x$` to obtain the _lower bound_ of our confidence interval

1. Add the same amount as in step 2 to our sample mean `$\bar x$` to obtain the _upper bound_ of our confidence interval

---

# Formula

`$$\left(\bar{x}-z \times \frac{s}{\sqrt{n}}, \hspace{0.5cm} \bar{x}+z\times \frac{s}{\sqrt{n}}\right),$$`

often condensed to just

`$$\bar{x} \pm z \times \frac{s}{\sqrt{n}}$$`

where `$z$` is a critical value from the standard normal distribution.

For the standard 95% confidence interval, the `$z$` value is 1.96, often rounded to 2. So the interval becomes

`$$\bar{x} \pm \frac{2 s}{\sqrt{n}}$$`

If we wanted a 90% interval, we would use `$z = 1.645$`. For a 99% interval, we would use
`$z = 2.576$`
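
These critical values come from the quantiles of the standard normal distribution. A small sketch using `qnorm()`; the helper `z_crit()` is a hypothetical convenience function written for this slide.

```r
# Critical value for a two-sided interval at a given confidence level:
# leave (1 - level) / 2 probability in each tail
z_crit = function(level) {
  qnorm(1 - (1 - level) / 2)
}

z_crit(0.95)  # 95% interval
z_crit(0.90)  # 90% interval
z_crit(0.99)  # 99% interval
```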

---

# Example: Comic Sans

Let's return to our Comic Sans example. The average time spent on the site was `$\bar x = 50.9$` with
a standard deviation of `$s = 18.2$`. This gives a 95% confidence interval of

`$$50.9 \pm 1.96 \frac{18.2}{\sqrt{20}} = (42.92, 58.88)$$`
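
The same interval can be built step by step in R, following the construction recipe. This is a sketch; `margin` and `ci` are names chosen here.

```r
comic = c(34, 51, 30, 79, 54, 31, 57, 62, 59, 41,
          77, 55, 35, 3, 69, 46, 47, 66, 63, 59)

x_bar = mean(comic)
s = sd(comic)
n = length(comic)

# Margin of error: critical value times the standard error of the mean
margin = qnorm(0.975) * s / sqrt(n)

ci = c(lower = x_bar - margin, upper = x_bar + margin)
ci
```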

---

# Example: Comic Sans

Alternatively, we could use R and extract the confidence interval from

```r
t.test(comic)
#> 
#> 	One Sample t-test
#> 
#> data:  comic
#> t = 10, df = 20, p-value = 1e-10
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#>  42.4 59.4
#> sample estimates:
#> mean of x 
#>      50.9
```
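to get the interval `$(42.38,59.42)$`. Notice this interval is slightly wider than the one based on `$z = 1.96$`, since it uses the
`$t$`-distribution with `$n - 1 = 19$` degrees of freedom, which has heavier tails than the normal.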
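
To see where the wider interval comes from, the normal critical value can be swapped for one from the `$t$`-distribution. A minimal sketch, with `t_crit` and `ci_manual` as illustrative names:

```r
comic = c(34, 51, 30, 79, 54, 31, 57, 62, 59, 41,
          77, 55, 35, 3, 69, 46, 47, 66, 63, 59)

n = length(comic)

# Critical value from the t-distribution with n - 1 degrees of freedom;
# this is larger than qnorm(0.975), hence the wider interval
t_crit = qt(0.975, df = n - 1)

margin = t_crit * sd(comic) / sqrt(n)
ci_manual = c(mean(comic) - margin, mean(comic) + margin)

ci_manual
t.test(comic)$conf.int  # should agree with ci_manual
```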

---

# Summary

The region
`$$\bar x \pm 2 s$$`
contains approximately 95% of the data

The region
`$$\bar x \pm 2 s/\sqrt{n}$$`
is an approximate 95% confidence interval for the mean

---

# Summary

```r
s = sd(cupid$age)
n = length(cupid$age)
m = mean(cupid$age)
```

```r
# Approx 95% of the data
c(m - 2 * s, m + 2 * s) 
#> [1] 13.4 51.2
```

```r
# A confidence interval around the mean
c(m - 2 * s/sqrt(n), m + 2 * s/sqrt(n)) 
#> [1] 32.3 32.4
```

---
layout: true
background-image: url(assets/white_logo.png)

# Break time