class: center, middle, inverse, title-slide

# Capturing relationships with linear regression
## Chapter 6
### Colin Gillespie (@csgillespie)

---
layout: true
background-image: url(assets/white_logo.png)

<div class="jr-header-inverse">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer-inverse"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div></div></div>

---
class: inverse, center, middle

### One of the first things taught in introductory statistics textbooks is that correlation is not causation. It is also one of the first things forgotten.

### __Thomas Sowell__

---
layout: true

<div class="jr-header">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div></div></div>

---
# Introduction

* Use information about some variables to help predict the outcome of another variable
    - Use a person's financial history to predict the likelihood of them defaulting on a mortgage
* The idea of using past data to predict future events is central to data science and statistics

--

* The simplest relationship between two variables is linear

---
# Capturing linear relationships

<img src="chapter6_files/figure-html/6-1-1.svg" width="80%" style="display: block; margin: auto;" />

---
# Correlation

* This is a measure of the _linear_ association between two variables
* The sample correlation coefficient is defined as

`$$r=\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{{\sqrt {\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}{\sqrt {\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}}}$$`

where

* `\(n\)` is the sample size
* `\(x_{i}, y_{i}\)` are the individual observations, indexed by `\(i\)`
* `\(\bar {x} = \frac {1}{n} \sum _{i=1}^{n} x_{i}\)` is the sample mean (and similarly for `\(\bar{y}\)`)

---
# Example: Starbucks calorie content

* The Starbucks data set contains the nutritional values of 113 menu items
* For each item on the menu we have the number
of calories, and the carbohydrate, fat, fiber and protein amounts

We can quickly get an overview in R

```r
head(starbucks)
#>                                  Product Calories Fat Carb Fiber Protein
#> 1                           Chonga Bagel      300   5   50     3      12
#> 2                           8-Grain Roll      380   6   70     7      10
#> 3                       Almond Croissant      410  22   45     3      10
#> 4                          Apple Fritter      460  23   56     2       7
#> 5                       Banana Nut Bread      420  22   52     2       6
#> 6 Blueberry Muffin with Yogurt and Honey      380  16   53     1       6
```

---
# Example: Starbucks calorie content

<img src="chapter6_files/figure-html/6-2-1.svg" width="100%" style="display: block; margin: auto;" />

---
# Example: Starbucks calorie content

```r
## Drop the first column since it's the food names
cor(starbucks[, -1])
#>          Calories   Fat  Carb Fiber Protein
#> Calories    1.000 0.829 0.708 0.471   0.619
#> Fat         0.829 1.000 0.281 0.276   0.423
#> Carb        0.708 0.281 1.000 0.408   0.204
#> Fiber       0.471 0.276 0.408 1.000   0.472
#> Protein     0.619 0.423 0.204 0.472   1.000
```

--

* The diagonal is all 1s, since the correlation of a variable with itself is 1

--

* The matrix is _symmetric_, since the correlation between `\(X\)` and `\(Y\)` is the same as the correlation between `\(Y\)` and `\(X\)`

--

Of the four nutritional components, `Fat` is the most highly correlated with `Calories` (0.829).
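The correlation formula can be checked by hand. A minimal sketch with two toy vectors (not the Starbucks data; any equal-length numeric vectors would do), comparing the manual calculation against R's built-in `cor()`:

```r
# Sample correlation computed directly from its definition
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 6)
r_manual = sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
r_manual
cor(x, y)  # agrees with the manual calculation
```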
---
# Exercise

Calculate the correlation coefficients for the beauty dataset

```r
# Hint: to remove the gender column, use
# beauty[, -5]
```

Also check out [http://www.tylervigen.com/spurious-correlations](http://www.tylervigen.com/spurious-correlations)

---
# Linear Regression

* The next step is to use information about one variable to inform you about another

--

* Equation of a straight line

`$$Y = \beta_0 + \beta_1 x$$`

where

* `\(\beta_0\)` is the `\(y\)`-intercept (in the UK, we used `\(c\)` instead of `\(\beta_0\)`)
* `\(\beta_1\)` is the gradient (in the UK, we used `\(m\)` instead of `\(\beta_1\)`)

---
# Linear Regression

`$$Y = \beta_0 + \beta_1 x$$`

* `\(Y\)` is the response variable (the thing we want to predict)
* `\(x\)` is the predictor (or covariate)

--

* The aim of the model is to estimate the values of `\(\beta_0\)` and `\(\beta_1\)`
* Since we only have a sample, these estimates come with uncertainty

---
# Linear Regression

```r
# This is an R formula
# Read as: Calories is modeled by Fat
(m = lm(Calories ~ Fat, data = starbucks))
#> 
#> Call:
#> lm(formula = Calories ~ Fat, data = starbucks)
#> 
#> Coefficients:
#> (Intercept)          Fat  
#>       148.0         12.8
```

The output from R gives estimates of `\(\beta_0 = 148.0\)` and `\(\beta_1 = 12.8\)`.

---
# Linear Regression

```r
summary(m)
#> 
#> Call:
#> lm(formula = Calories ~ Fat, data = starbucks)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -187.64  -44.88   -8.67   44.09  155.47 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  147.983     14.972    9.88   <2e-16
#> Fat           12.759      0.817   15.61   <2e-16
#> 
#> Residual standard error: 71.8 on 111 degrees of freedom
#> Multiple R-squared: 0.687,	Adjusted R-squared: 0.684 
#> F-statistic: 244 on 1 and 111 DF,  p-value: <2e-16
```

---
# Prediction and Interpretation

The estimated model,

`$$\text{Calories} = 148 + 12.8 \times \text{Fat}$$`

allows us to predict the calorie content based on the fat content.
* If the fat content was 10, then the estimated calorie content would be `\(148 + 12.8 \times 10 = 276\)`

--

* But if `\(\text{Fat} = 0\)`, then our model would estimate the calorie content as `\(148\)`
* This seems a bit high for a glass of water!

---
# How do we estimate the model coefficients?

> "Minimising the sum of squared residuals"

<img src="chapter6_files/figure-html/6-3-1.svg" width="90%" style="display: block; margin: auto;" />

---
# Classical statistics interpretation

`$$Y = \beta_0 + \beta_1 x + \epsilon$$`

where `\(\epsilon\)` is normally distributed

* If we assume that the errors follow a normal distribution
* Then to estimate the parameter values we minimise the sum of squared residuals

---
# Machine learning interpretation

* We have a cost function that we wish to minimise
* It just so happens that this cost function corresponds to assuming normality

--

> One approach isn't better than the other.

---
# Multiple linear regression models

A multiple linear regression model is the natural extension of the simple linear regression model. If we have two predictors, e.g.

`$$Y = \beta_0 + \beta_1 \text{Fat} + \beta_2 \text{Carb}$$`

This is equivalent to fitting a plane (a sheet of paper) through the points

---
# Multiple linear regression models

<img src="chapter6_files/figure-html/6-4-1.svg" width="100%" style="display: block; margin: auto;" />

---
# Fitting the model

Fitting the model in R is a simple extension

```r
(m = lm(Calories ~ Fat + Carb, data = starbucks))
#> 
#> Call:
#> lm(formula = Calories ~ Fat + Carb, data = starbucks)
#> 
#> Coefficients:
#> (Intercept)          Fat         Carb  
#>       11.36        10.52         4.17
```

Notice that the coefficient for `Fat` has decreased from 12.8 to 10.52, due to the influence of the carbohydrate component.
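As with the straight line, the fitted plane can be used for prediction. A minimal sketch using the rounded coefficients reported above, for a hypothetical menu item with `Fat = 10` and `Carb = 50` (illustrative values, not taken from the data set):

```r
# Predict calories from the fitted plane, using the rounded
# coefficient estimates shown above
beta0 = 11.36; beta_fat = 10.52; beta_carb = 4.17
beta0 + beta_fat * 10 + beta_carb * 50
#> [1] 325.06
```

If the fitted model object `m` is available, `predict(m, newdata = data.frame(Fat = 10, Carb = 50))` gives the same prediction from the unrounded coefficients.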
---
layout: true
background-image: url(assets/white_logo.png)

<div class="jr-header-inverse">
<img class="logo" src="assets/white_logo_full.png"/>
<span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
</div>
<div class="jr-footer-inverse"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div></div></div>

---
class: inverse, center, middle

### Thanks for listening

### Good morning/night/bye