class: center, middle, inverse, title-slide # Foundational data science ## Chapter 1 ### Colin Gillespie (
@csgillespie
) --- layout: true background-image: url(assets/white_logo.png) <div class="jr-header-inverse"> <img class="logo" src="assets/white_logo_full.png"/> <span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span> </div> <div class="jr-footer-inverse"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter1.html</div></div> --- class: inverse, center, middle ## It is the mark of a truly intelligent person to be moved by statistics __George Bernard Shaw__ --- layout:true --- background-image: url(graphics/doll.jpg) background-size: cover # About .pull-right[ * Academic: * Senior [Statistics](http://www.mas.ncl.ac.uk/~ncsg3/) lecturer, [Newcastle University](https://en.wikipedia.org/wiki/Newcastle_University), UK * Consultant at [Jumping Rivers](https://jumpingrivers.com) * Data science training & consultancy * R, [Stan](https://www.jumpingrivers.com/courses/13_introductions-to-bayesian-inference-using-rstan), Scala * [Efficient R programming](http://shop.oreilly.com/product/0636920047995.do), O'Reilly * [LinkedIn](https://www.linkedin.com/in/colin-gillespie-25028332/) ] --- background-image: url(graphics/scream.jpg) background-size: cover # Shock news * Statistics lectures can get boring * Oddly, combining it with programming doesn't help (much) - Lots of small tasks * Regular, small breaks -- * Slides are a subset of the [lecture notes](https://github.com/jumpingrivers/foundational-data-science) - These are notes, not a polished book --- layout: true --- background-image: url(graphics/venn.jpg) background-size: cover # Book/Course/Lecture overview * Data science is a vague term * Everyone can agree what a vet or a plumber does * But what does a data scientist do - No Venn diagrams are allowed! --- layout: true <div class="jr-header"> <img class="logo" src="assets/white_logo_full.png"/> <span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span> </div> <div class="jr-footer"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter1.html</div></div> --- # Four minute exercise - In the Q & A channel, what's the difference between a - Statistics - Data Science - Machine learning - No fights! --- # Data Science * My background is in statistics - I see data science as the logical extension of applying statistical methods to large scale data sets * Machine learning by contrast doesn't focus quite so much on understanding what's happening --- # Course aim * Introduction to data science from a statistical point of view * Small, interesting data sets * Small enough to be informative, but still have real world interest -- * Introduce some mathematical notation * We'll cover means (the average). Why do we use: * `\(\mu\)` and `\(\bar x\)` for the same(?) thing? * Simple cases to give insight in more complex examples -- * Allegedly interesting examples --- layout:true --- background-image: url(graphics/tibbles.jpg) background-size: cover # The trouble with ~~tibbles~~ data > The trouble with interesting data, is that there are so many potential problems --- layout: true <div class="jr-header"> <img class="logo" src="assets/white_logo_full.png"/> <span class="social"><table><tr><td><img src="assets/twitter.gif"/></td><td> @jumping_uk</td></tr></table></span> </div> <div class="jr-footer"><span>© 2019 Jumping Rivers (jumpingrivers.com)</span><div>jumpingrivers.com/t/foundational-data-science/chapter1.html</div></div> <!-- --- --> <!-- # Example: Missing data --> <!-- * Most real world data sets have missing data --> <!-- * There are different types of missing data! --> <!-- --- --> <!-- # Missing completely at random (MCAR) --> <!-- * This is the easiest to handle --> <!-- * The data are missing independently of both observed and unobserved data --> <!-- -- --> <!-- * If I ask you your age, and you flip a coin --> <!-- - If heads, you tell me the answer --> <!-- - If tails, you refuse --> <!-- -- --> <!-- * Easy to fill in the blanks --> <!-- --- --> <!-- # Missing at random (MAR) --> <!-- * Given the observed data, data are missing independently of unobserved data --> <!-- -- --> <!-- * People from New York are more likely to refuse to tell us their voting preferences --> <!-- -- --> <!-- * But it does not depend on __who__ they are voting for --> <!-- -- --> <!-- * National opinion polls often rely on samples in the region of just `\(1,000\)` --> <!-- - Carefully chosen sample --> <!-- - Need to account for the [shy tory](https://en.wikipedia.org/wiki/Shy_Tory_Factor) or --> <!-- the [Bradley effect](https://en.wikipedia.org/wiki/Bradley_effect) --> <!-- --- --> <!-- # Missing not at random (MNAR) --> <!-- * Missing observations related to values of unobserved data --> <!-- -- --> <!-- * As a concrete example, a good predictor of whether someone survived the titanic disaster was whether or not they --> <!-- had their age recorded --> <!-- --- --> <!-- # Talking about data --> <!-- * Quantities measured in a study are called _random variables_ --> <!-- * A particular outcome is called an _observation_ --> <!-- -- --> <!-- * A collection of observations is the _data_ (or _sample_) --> <!-- * The collection of all possible outcomes is the _population_ --> <!-- --- --> <!-- # Example --> <!-- * A customer spends on our website --> <!-- -- --> <!-- * The time spent on a website is our random variable --> <!-- -- --> <!-- * A particular customer's time would be an observation --> <!-- -- --> <!-- * If we recorded the time for every person during the day, those times would be our data --> <!-- - A _sample_ from the population of all customers --> <!-- -- --> <!-- * We often use a capital letter (say `\(X\)`) to represent our random variable --> <!-- -- --> <!-- * A lower case, with subscripts (e.g. `\(x_1\)`, `\(x_2\)`, `\(\ldots\)`, ), to represent observations _in a general sense_. --> --- # Population > In general we never observe the whole population --- # Population * In practice it is difficult to observe whole populations - Unless we are interested in a very limited population * We reality we usually observe a subset of the population - A _sample_ <!-- --- --> <!-- # Data types --> <!-- * _quantitative_ (numeric) --> <!-- - Discrete: data moves in steps, e.g. shoe sizes --> <!-- - Continuous: take _any_ value, but accuracy may make it look discrete --> <!-- -- --> <!-- * _qualitative_ (non-numeric), aka ordinal or categorical --> <!-- - e.g. colours, survival --> <!-- --- --> <!-- # Exercise --> <!-- Identify the type of data described in each of the following examples. --> <!-- * An opinion poll was taken asking people which party they would vote for in a general election --> <!-- * In a steel production process the temperature of the molten steel is measured and recorded every 60 seconds --> <!-- * A market researcher stops you on the street and asks you to rate between 1 (disagree strongly) and 5 (agree strongly) --> <!-- your response to opinions presented to you --> <!-- * The hourly number of units produced by a beer bottling plant is recorded --> <!-- Also consider potential sources of error in the four scenarios --> <!-- Qualitative → Categorical --> <!-- Quantitative → Continuous --> <!-- Quantitative → Discrete → Ordinal --> <!-- Quantitative → Discrete --> --- # Sample size * Larger samples will generally give more precise information about the population - Quality matters! - Facebook users aren't representation of the population --- # Designing the experiment * Designing an experiment is hard! -- * Let's suppose the two adverts were displayed on facebook. - Display advert 1 followed by advert 2 - Leads to [_confounding_](https://en.wikipedia.org/wiki/Confounding) -- * Is traffic on facebook the same on each day, e.g. is Friday different from Saturday? * Is the week before Christmas the same as the week after Christmas? * Was your advert on the same page as a more popular advert? --- # Randomisation * Display the advert with a probability * The probability could be equal - Or if you were trying a new and untested approach, you might have probability `\(0.95\)` standard advert, `\(0.05\)` new advert -- * Whenever possible, use simple random sampling as it makes problems disappear -- * Also interesting is the art of asking an embarrassing question --- # Even the best get it wrong Google * google "image analysis racist" * Or [Soap dispenser](https://www.youtube.com/watch?v=YJjv_OeiHmo) Just because you are doing "AI", the problem doesn't go away <!-- --- --> <!-- # Exercise --> <!-- Rashid, Ellen and Hamish all work as business analysts for a leading clothing manufacturer. The design team want to know how successful a new pair of jeans will be, and so they ask the analysts to conduct some research. --> <!-- * Hamish is lazy. He goes home that night and asks his four flatmates to give the new jeans a score from 1--5, and then reports his findings back to the design team the next day. _What sort of sampling scheme has Hamish adopted, and why might it be flawed?_ --> <!-- * Ellen thinks she will try to collect a simple random sample of people who might wear the jeans; she will then try to elicit their opinions. _Why might a simple random sample be difficult to obtain here?_ --> <!-- * Rashid is more thoughtful than both Hamish and Ellen. He thinks about the target population for the new jeans: females in the age group 21--30. He then spends a full day walking up and down Oxford Street, stopping people he thinks fulfil this criteria and asking them to decide whether or not they would buy the new jeans. _What sort of sampling scheme has Rashid adopted?_ --> <!-- --- --> <!-- # Solutions --> <!-- * Hamish is lazy. He goes home that night and asks his four flatmates to give the new jeans a score from 1--5, and then reports his findings back to the design team the next day. _What sort of sampling scheme has Hamish adopted, and why might it be flawed?_ --> <!-- > Accessibility sampling. Could be flawed as it is non–random and is prone to bias. --> <!-- --- --> <!-- # Solutions --> <!-- * Ellen thinks she will try to collect a simple random sample of people who might wear the jeans; she will then try to elicit their opinions. _Why might a simple random sample be difficult to obtain here?_ --> <!-- > Difficult – if not impossible – to get a complete list of everyone in the target population. --> <!-- --- --> <!-- # Solutions --> <!-- * Rashid is more thoughtful than both Hamish and Ellen. He thinks about the target population for the new jeans: females in the age group 21--30. He then spends a full day walking up and down Oxford Street, stopping people he thinks fulfil this criteria and asking them to decide whether or not they would buy the new jeans. _What sort of sampling scheme has Rashid adopted?_ --> <!-- > Judgemental sampling - potentially bias -->