class: center, middle, inverse, title-slide

# Capturing relationships with linear regression
## Chapter 6
### Colin Gillespie (@csgillespie)

---
layout: true
background-image: url(assets/white_logo.png)

<div class="jr-header-inverse">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer-inverse"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div></div></div>

---
class: inverse, center, middle

### One of the first things taught in introductory statistics textbooks is that correlation is not causation. It is also one of the first things forgotten.

### __Thomas Sowell__

---
layout: true

<div class="jr-header">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div></div></div>

---
# Introduction

* Use information about some variables to help predict the outcome of another variable
    - Use a person's financial history to predict the likelihood of them defaulting on a mortgage
* The idea of using past data to predict future events is central to data science and statistics

--

* The simplest relationship between two variables is linear

---
# Capturing linear relationships

<img src="chapter6_files/figure-html/6-1-1.svg" width="80%" style="display: block; margin: auto;" />

---
# Correlation

* This is a measure of the _linear_ association between two variables
* The sample correlation coefficient is defined as

`$$r=\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{{\sqrt {\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}{\sqrt {\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}}}$$`

where

* `\(n\)` is the sample size
* `\(x_{i}, y_{i}\)` are the individual observations, indexed by `\(i\)`
* `\(\bar {x} = \frac {1}{n} \sum _{i=1}^{n} x_{i}\)` is the sample mean (and similarly for `\(\bar{y}\)`)

---
# Example: Starbucks calorie content

* The Starbucks data set contains the nutritional values of 113 menu items
* For each item on the menu we have the number
of calories, and the carbohydrate, fat, fiber and protein amounts

We can quickly get an overview in R

```r
head(starbucks)
#>                                  Product Calories Fat Carb Fiber Protein
#> 1                           Chonga Bagel      300   5   50     3      12
#> 2                           8-Grain Roll      380   6   70     7      10
#> 3                       Almond Croissant      410  22   45     3      10
#> 4                          Apple Fritter      460  23   56     2       7
#> 5                       Banana Nut Bread      420  22   52     2       6
#> 6 Blueberry Muffin with Yogurt and Honey      380  16   53     1       6
```

---
# Example: Starbucks calorie content

<img src="chapter6_files/figure-html/6-2-1.svg" width="100%" style="display: block; margin: auto;" />

---
# Example: Starbucks calorie content

```r
## Drop the first column since it's the food names
cor(starbucks[, -1])
#>          Calories   Fat  Carb Fiber Protein
#> Calories    1.000 0.829 0.708 0.471   0.619
#> Fat         0.829 1.000 0.281 0.276   0.423
#> Carb        0.708 0.281 1.000 0.408   0.204
#> Fiber       0.471 0.276 0.408 1.000   0.472
#> Protein     0.619 0.423 0.204 0.472   1.000
```

--

* The diagonal is all 1s, since the correlation of a variable with itself is 1

--

* The matrix is _symmetric_, since the correlation between `\(X\)` and `\(Y\)` is the same as the correlation between `\(Y\)` and `\(X\)`

--

Of the four nutritional components, `Fat` is the most highly correlated with `Calories` (0.829).
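The correlation formula can be checked by hand. A minimal sketch with two toy vectors (not the Starbucks data; any equal-length numeric vectors would do), comparing the manual calculation against R's built-in `cor()`:

```r
# Sample correlation computed directly from its definition
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 6)
r_manual = sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
r_manual
cor(x, y)  # agrees with the manual calculation
```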
---
# Exercise

Calculate the correlation coefficients for the beauty dataset

```r
# Hint: to remove the gender column, use
# beauty[, -5]
```

Also check out [http://www.tylervigen.com/spurious-correlations](http://www.tylervigen.com/spurious-correlations)

---
# Linear Regression

* The next step is to use information about one variable to inform you about another

--

* Equation of a straight line

`$$Y = \beta_0 + \beta_1 x$$`

where

* `\(\beta_0\)` is the `\(y\)`-intercept (in the UK, we used `\(c\)` instead of `\(\beta_0\)`)
* `\(\beta_1\)` is the gradient (in the UK, we used `\(m\)` instead of `\(\beta_1\)`)

---
# Linear Regression

`$$Y = \beta_0 + \beta_1 x$$`

* `\(Y\)` is the response variable (the thing we want to predict)
* `\(x\)` is the predictor (or covariate)

--

* The aim of the model is to estimate the values of `\(\beta_0\)` and `\(\beta_1\)`
* Since we only have a sample, these estimates come with uncertainty

---
# Linear Regression

```r
# This is an R formula
# Read as: Calories is modeled by Fat
(m = lm(Calories ~ Fat, data = starbucks))
#> 
#> Call:
#> lm(formula = Calories ~ Fat, data = starbucks)
#> 
#> Coefficients:
#> (Intercept)          Fat  
#>       148.0         12.8
```

The output from R gives estimates of `\(\beta_0 = 148.0\)` and `\(\beta_1 = 12.8\)`.

---
# Linear Regression

```r
summary(m)
#> 
#> Call:
#> lm(formula = Calories ~ Fat, data = starbucks)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -187.64  -44.88   -8.67   44.09  155.47 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  147.983     14.972    9.88   <2e-16
#> Fat           12.759      0.817   15.61   <2e-16
#> 
#> Residual standard error: 71.8 on 111 degrees of freedom
#> Multiple R-squared: 0.687,	Adjusted R-squared: 0.684 
#> F-statistic: 244 on 1 and 111 DF,  p-value: <2e-16
```

---
# Prediction and Interpretation

The estimated model,

`$$\text{Calories} = 148 + 12.8 \times \text{Fat}$$`

allows us to predict the calorie content based on the fat content.
* If the fat content was 10, then the estimated calorie content would be `\(148 + 12.8 \times 10 = 276\)`

--

* But if `\(\text{Fat} = 0\)`, then our model would estimate the calorie content as `\(148\)`
* This seems a bit high for a glass of water!

---
# How do we estimate the model coefficients?

> "Minimising the sum of squared residuals"

<img src="chapter6_files/figure-html/6-3-1.svg" width="90%" style="display: block; margin: auto;" />

---
# Classical statistics interpretation

`$$Y = \beta_0 + \beta_1 x + \epsilon$$`

where `\(\epsilon\)` is normally distributed

* If we assume that the errors follow a normal distribution
* Then to estimate the parameter values we minimise the sum of squared residuals

---
# Machine learning interpretation

* We have a cost function that we wish to minimise
* It just so happens that this cost function corresponds to assuming normality

--

> One approach isn't better than the other.

---
# Multiple linear regression models

A multiple linear regression model is the natural extension of the simple linear regression model. If we have two predictors, e.g.

`$$Y = \beta_0 + \beta_1 \text{Fat} + \beta_2 \text{Carb}$$`

This is equivalent to fitting a plane (a sheet of paper) through the points

---
# Multiple linear regression models

<img src="chapter6_files/figure-html/6-4-1.svg" width="100%" style="display: block; margin: auto;" />

---
# Fitting the model

Fitting the model in R is a simple extension

```r
(m = lm(Calories ~ Fat + Carb, data = starbucks))
#> 
#> Call:
#> lm(formula = Calories ~ Fat + Carb, data = starbucks)
#> 
#> Coefficients:
#> (Intercept)          Fat         Carb  
#>       11.36        10.52         4.17
```

Notice that the coefficient for `Fat` has decreased from 12.8 to 10.52, due to the influence of the carbohydrate component.
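As with the straight line, the fitted plane can be used for prediction. A minimal sketch using the rounded coefficients reported above, for a hypothetical menu item with `Fat = 10` and `Carb = 50` (illustrative values, not taken from the data set):

```r
# Predict calories from the fitted plane, using the rounded
# coefficient estimates shown above
beta0 = 11.36; beta_fat = 10.52; beta_carb = 4.17
beta0 + beta_fat * 10 + beta_carb * 50
#> [1] 325.06
```

If the fitted model object `m` is available, `predict(m, newdata = data.frame(Fat = 10, Carb = 50))` gives the same prediction from the unrounded coefficients.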
---
layout: true
background-image: url(assets/white_logo.png)

<div class="jr-header-inverse">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer-inverse"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div></div></div>

---
class: inverse, center, middle

### Thanks for listening

### Good morning/night/bye