
# Foundational data science
## Chapter 1
### Colin Gillespie (<a href="https://twitter.com/csgillespie">@csgillespie</a>)

---

<div class="jr-header-inverse">
        <img class="logo" src="assets/white_logo_full.png"/>
        <span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
      </div>
      
<div class="jr-footer-inverse"><span>&copy; 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter1.html</div></div>

---
class: inverse, center, middle

## It is the mark of a truly intelligent person to be moved by statistics __George Bernard Shaw__

---
layout:true

---

# About
.pull-right[
* Academic:

* Senior [Statistics](http://www.mas.ncl.ac.uk/~ncsg3/) lecturer, [Newcastle University](https://en.wikipedia.org/wiki/Newcastle_University), UK

* Consultant at [Jumping Rivers](https://jumpingrivers.com)

* Data science training & consultancy

* R, [Stan](https://www.jumpingrivers.com/courses/13_introductions-to-bayesian-inference-using-rstan), Scala

* [Efficient R programming](http://shop.oreilly.com/product/0636920047995.do), O'Reilly

* [LinkedIn](https://www.linkedin.com/in/colin-gillespie-25028332/)
]

---

# Shock news

* Statistics lectures can get boring 
  
  * Oddly, combining it with programming doesn't help (much)
  
    - Lots of small tasks
    
  * Regular, small breaks

* Slides are a subset of the [lecture notes](https://github.com/jumpingrivers/foundational-data-science)
  
    - These are notes, not a polished book

---
layout: true
---

# Book/Course/Lecture overview

* Data science is a vague term
  
  * Everyone can agree what a vet or a plumber does
  
  * But what does a data scientist do?
  
    - No Venn diagrams are allowed!

---
layout: true

<div class="jr-header">
        <img class="logo" src="assets/white_logo_full.png"/>
        <span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span>
      </div>
      
<div class="jr-footer"><span>&copy; 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter1.html</div></div>
---

# Four minute exercise

- In the Q & A channel: what's the difference between
  
    - Statistics
    
    - Data Science
    
    - Machine learning
    
  - No fights!

---

# Data Science

*  My background is in statistics 
  
    - I see data science as the logical extension of applying statistical methods to large-scale data sets
    
  * Machine learning, by contrast, doesn't focus quite so much on understanding what's happening

---

# Course aim

* Introduction to data science from a statistical point of view
  
  * Small, interesting data sets
  
  * Small enough to be informative, but still have real world interest
  
--

* Introduce some mathematical notation
  
  * We'll cover means (the average). Why do we use:
    
      * `\(\mu\)` and `\(\bar x\)` for the same(?) thing?

* Simple cases to give insight into more complex examples
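
One way to see the difference between the two symbols: `\(\mu\)` is the mean of the whole population, while `\(\bar x\)` is the mean of a sample drawn from it. A minimal Python sketch, assuming a made-up population (the values, sizes and seed are illustrative assumptions, not course data):

```python
import random
import statistics

random.seed(1)  # assumed seed, only so the sketch is repeatable

# Hypothetical population: 10,000 made-up values
population = [random.gauss(50, 10) for _ in range(10_000)]

# mu: the mean of the whole population (usually unknown in practice)
mu = statistics.mean(population)

# x_bar: the mean of a sample of 100 values drawn from that population
sample = random.sample(population, 100)
x_bar = statistics.mean(sample)

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}")
```

The two numbers should be close but not identical: `x_bar` is an estimate of `mu` and changes with every new sample.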
--

* Allegedly interesting examples

---
layout:true
---

# The trouble with ~~tibbles~~ data

> The trouble with interesting data is that there are so many potential problems

---
layout: true

---

# Population

> In general we never observe the whole population

---

# Population

* In practice it is difficult to observe whole populations
  
    - Unless we are interested in a very limited population
    
  * In reality, we usually observe a subset of the population
  
    - A _sample_

---

# Sample size

* Larger samples will generally give more precise information about the population

- Quality matters!
    
    - Facebook users aren't representative of the population
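
The link between sample size and precision can be checked with a small simulation. The sketch below assumes a hypothetical population that is standard normal and repeatedly draws samples of two sizes; the function name `spread_of_sample_means` and all the numbers are illustrative choices.

```python
import random
import statistics

random.seed(1)  # assumed seed, only so the sketch is repeatable

def spread_of_sample_means(n, repeats=500):
    """Standard deviation of the sample mean over repeated samples of size n."""
    means = []
    for _ in range(repeats):
        # Draw n values from a hypothetical standard normal population
        sample = [random.gauss(0, 1) for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.stdev(means)

spread_small = spread_of_sample_means(10)
spread_large = spread_of_sample_means(1000)

print(f"n = 10:   spread of sample means = {spread_small:.3f}")
print(f"n = 1000: spread of sample means = {spread_large:.3f}")
```

The spread of the sample means shrinks as the sample grows. This only addresses precision: a large sample that is not representative of the population is still biased.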

---

# Designing the experiment

* Designing an experiment is hard!

--
  
  * Let's suppose the two adverts were displayed on Facebook.
    - Display advert 1 followed by advert 2
    - Leads to [_confounding_](https://en.wikipedia.org/wiki/Confounding)

* Is traffic on Facebook the same on each day, e.g. is Friday different from Saturday?
 
 * Is the week before Christmas the same as the week after Christmas?
 
 * Was your advert on the same page as a more popular advert?

---

# Randomisation

* Display the advert with a probability
  
  * The probability could be equal
  
    - Or if you were trying a new and untested approach, you might have probability `\(0.95\)` standard advert, `\(0.05\)` new advert
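
A minimal sketch of this kind of random allocation in Python. The function name `choose_advert`, the labels and the number of visitors are assumptions made for illustration; only the `\(0.95\)` / `\(0.05\)` split comes from the example above.

```python
import random

def choose_advert(rng, p_new=0.05):
    """Return "new" with probability p_new, otherwise "standard"."""
    return "new" if rng.random() < p_new else "standard"

rng = random.Random(1)  # assumed seed, only so the sketch is repeatable

# Decide independently for each of 10,000 hypothetical visitors
shown = [choose_advert(rng) for _ in range(10_000)]
share_new = shown.count("new") / len(shown)

print(f"Share of visitors shown the new advert: {share_new:.3f}")
```

Because each visitor is allocated independently of the day, the time, or the page, those factors should be balanced across the two adverts on average.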
    
--

* Whenever possible, use simple random sampling, as it makes many of these problems disappear
  
--

* Also interesting is the art of asking an embarrassing question
  
  
---

# Even the best get it wrong

Google

* Search for "image analysis racist"
  
  * Or [Soap dispenser](https://www.youtube.com/watch?v=YJjv_OeiHmo)
  
Just because you are doing "AI", the problem doesn't go away