Introductory Data Science using R Lecture 1: The Data Science Framework
Structure
Lecture 1: The data science framework
Lecture 2: Using R
Lecture 3: Data science with R
Lecture 4: Exploratory analysis of Kiva.org data
The scientific method is an empirical method of acquiring knowledge that has characterized the development of science since at least the 17th century. It involves careful observation and rigorous skepticism about what is observed, since cognitive assumptions can distort how one interprets an observation. It involves formulating hypotheses, via induction, based on such observations; experimental and measurement-based testing of deductions drawn from the hypotheses; and refinement (or elimination) of the hypotheses based on the experimental findings. These are principles of the scientific method, as distinguished from a definitive series of steps applicable to all scientific enterprises.

Scientific inquiry: data + model → understanding
Not new; arises in many fields:
Natural sciences
Econometrics
Psychology
Sociology
etc.
From a data science perspective, we are interested in the numerical aspects (quantitative, as opposed to qualitative). It really is not new. Examples? Giuseppe Piazzi’s observations in the Monatliche Correspondenz, September 1801.
Early 19th century: Piazzi collected data on the position of a celestial object. The data and model showed that the object did not behave as it was supposed to. It was announced as a comet, but it was really a planet.

Design of experiments; randomised controlled trials. Sir Ronald Fisher (1890–1962) is credited with the methods to analyse these types of data sets, e.g. ANOVA. Note the deliberate intent of collecting data for this specific purpose; cf. surveys. Nowadays data is often available by happenstance, and not just collected by design.
Big Data The more we measure, the more we don’t understand
Breadth vs depth paradox; big p, small n; the curse of dimensionality
“Data first” paradigm
Ethics; privacy
Previously, the data collected was manageable and intended, e.g. surveys. With modern computing power we can quantify the actions of individuals to a far greater degree, yet we are less able to characterize society.
Data comes after the question; we often do not have the luxury of tailoring what data is collected. Fundamental statistical issues surrounding data are thrown out the window: precision and accuracy, and bias in the data.

Define: Data Science
The “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyze actual phenomena” with data.
Multi-disciplinary field. Goal: extract knowledge and insights from structured and unstructured data. In essence, we need a systematic way of dealing with data, combining knowledge from various fields. While each field worked in silos, it specialised in its own thing; data science unites statistics/mathematics and computer science to make data actionable.

Examples of Data Science problems: real-world problems from the Alan Turing Institute
Real-time jammer detection, identification and localization in 3G and 4G networks
Automated matching of businesses to government contract opportunities
Using real-world data to advance air traffic control
Personalised lung cancer treatment modelling using electronic health records and genomics
The ATI is the national institute for data science and artificial intelligence. Interesting to ponder: why was it named after Alan Turing, the computing pioneer?

Examples of Data Science problems: real-world problems from the Alan Turing Institute
Identify potential drivers of engaging in extremism
News feed analysis to help understand global instability
Improved strength training using smart gym equipment data

Scope: Exploratory
Focus on transform and visualise. Modelling requires a specific skill set (statistics or ML). GOAL: generate many promising leads that you can later explore in more depth.

Machine Learning vs Statistics
Statistics aims to turn humans into robots.
Concept of “statistical proof”; often the interest is inference.
Machine learning aims to turn robots into humans.
Make sense of patterns from big data; often the interest is prediction. Statistics aims to remove the bias of humans when perceiving patterns in data sets: learn not to be conned; when someone claims something is so, demand proof. Stats asks: how big is big, and is it enough? Measuring effects. Important question: causality? ML or AI, on the other hand, aims to equip computers with human skills: image understanding, speech recognition, natural language processing, etc. This is a kind of “reverse engineering” of world processes based on observed data: generate large labelled data sets from humans, then train models. Interesting note: your choice of programming language also speaks to your background, R for stats, Python for ML.

Data Quality and Readiness
There’s a sea of data, but most of it is undrinkable
Data neglect: data cleaning is tedious and complex
80-20 rule of Data Science: most time is spent cleaning up data, affectionately called data “wrangling”. [TBA] Data readiness levels (Bands A, B and C). So much for the world’s sexiest job of the 21st century (according to the Harvard Business Review, 2012)! Companies hire ML and software engineers, but not data cleaners. The importance of data is hard to overstate.

Types of data
Structured data: data is in a nicely organised repository, e.g. tables, matrices, etc.
Unstructured data: information does not have a predefined data model, e.g. images, colours, text, sound, etc.

Types of data
Continuous data: measurements are taken on a continuous scale, e.g. height, weight, temperature, GDP, distance, etc. Usually arises from physical experiments.
Discrete data: measurements which can only take certain values, e.g. sex, survey responses (Likert scales), occupation, ratings, ranks, etc. Usually arises in the social sciences.

Types of data
Treatment    | Continuous data                                                | Categorical data
Import class | numeric                                                        | factor, ordinal (ordered factor)
Visualise    | Histograms, density plots, scatter plots, box & whisker plots  | Pie charts, bar plots
Summarise    | 5-point summaries                                              | Frequency tables
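The import classes above map directly onto base R types. A minimal sketch with made-up values:

```r
# Continuous measurements are stored as class "numeric";
# categorical measurements as "factor" (ordered factors for ordinal data).
height <- c(1.72, 1.65, 1.80)            # continuous
sex    <- factor(c("F", "M", "F"))       # discrete, unordered
likert <- factor(c("Agree", "Neutral", "Agree"),
                 levels  = c("Disagree", "Neutral", "Agree"),
                 ordered = TRUE)         # discrete, ordinal

class(height)       # "numeric"
class(sex)          # "factor"
is.ordered(likert)  # TRUE

# Matching summaries: five-number summary vs frequency table
summary(height)
table(sex)
```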
Exploratory Data Analysis Generate questions about your data.
Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
More on this later…
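The generate-questions / visualise / refine cycle can be sketched in base R. The built-in mtcars data set is a stand-in of my choosing for whatever data you are handed:

```r
# A first exploratory pass over a data set.
str(mtcars)       # what variables and classes do we have?
summary(mtcars)   # five-number summaries for each column

# Visualise: does fuel efficiency vary with weight?
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Transform: a derived variable often sharpens the question
mtcars$kpl <- mtcars$mpg * 0.4251   # miles per gallon -> km per litre
hist(mtcars$kpl, main = "Fuel efficiency (km/l)")
```

Each plot typically raises a new question (why are the outliers there? is the relationship linear?), which feeds back into step 1 of the cycle.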
Modelling $$y_i = \alpha + \beta x_i + \epsilon_i$$
$$\epsilon_i \sim \text{N}(0,\sigma^2)$$
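This is simple linear regression, fitted in R with `lm()`. A minimal sketch using simulated data (the true parameter values below are arbitrary choices):

```r
set.seed(1)
n     <- 100
alpha <- 2; beta <- 0.5; sigma <- 1       # "true" values, chosen arbitrarily
x <- runif(n, 0, 10)
y <- alpha + beta * x + rnorm(n, sd = sigma)  # y_i = alpha + beta*x_i + eps_i

fit <- lm(y ~ x)
coef(fit)      # point estimates of alpha (intercept) and beta (slope)
confint(fit)   # interval estimates: the "inference" side
predict(fit, newdata = data.frame(x = 5))  # the "prediction" side
```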
EDA does not provide statistical proof, nor does it give predictions. For those, engage statistical or ML models. There are many types of models, depending on what question you want answered.

The R programming language
R is a language and environment for statistical computing and graphics. https://www.r-project.org/about.html
It is free and open source
Runs everywhere
Supports extensions
Engaging community
Links to other languages

Exercise
What exploratory analyses would you conduct on this data set?
What other data do you need to supplement your analyses?
What questions do you aim to answer?

End of Lecture 1. Questions?
Inference vs Prediction Source: datascienceblog.com
Inference: use the model to learn about the data-generation process.
Prediction: use the model to predict the outcomes for new data points.

Model interpretability
Model interpretability is necessary for inference. In a nutshell, a model is interpretable if we can “see” how the model generates its estimates; cf. black boxes. Interpretable models often use simplifying assumptions.

Model complexity
A complex model is often better at prediction tasks (“more parameters to tune”), but model interpretability suffers.

Bias-Variance tradeoff
$$
\mathrm{E}\big[(y - \hat f(x))^2\big] = \text{Bias}^2[\hat f(x)] + \text{Var}[\hat f(x)] + \sigma^2,
$$
where $y = f(x) + \epsilon$ is a new observation.
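The bias and variance terms can be estimated by simulation: repeatedly draw training sets, refit the model, and look at how the predictions at a fixed point behave. A Monte Carlo sketch (the true function, sample size, and noise level are arbitrary assumptions):

```r
# Estimate bias^2 and variance of a polynomial fit at one point x0.
set.seed(42)
f  <- function(x) sin(2 * pi * x)   # assumed "true" regression function
x0 <- 0.5; sigma <- 0.3

one_fit <- function(degree) {
  x <- runif(50)
  y <- f(x) + rnorm(50, sd = sigma)
  fit <- lm(y ~ poly(x, degree))
  predict(fit, newdata = data.frame(x = x0))
}

preds <- replicate(500, one_fit(degree = 3))
bias2 <- (mean(preds) - f(x0))^2    # squared bias at x0
vari  <- var(preds)                 # variance of fhat(x0) across data sets
c(bias2 = bias2, variance = vari)   # adding sigma^2 gives expected squared error
```

Raising the polynomial degree typically shrinks `bias2` while inflating `vari`: the tradeoff in action.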
Bias: how close to the truth. Variance: how varied the predictions will be under a new data set.

Linear regression
Economic freedom = 2.6 + 0.6 × Trade
Trade: tariffs, regulatory trade barriers, black market, control of the movement of capital and people, trade.

Survey Methodology
Source: Groves et al. (2009)
Three populations. Sampling design for the BSA survey:
Target population: adults aged 18 or over in GB
Survey population: private households south of the Caledonian Canal
Frame: addresses in the Postcode Address File
Multistage design:
Stratify by postcode sectors
Simple random sampling of addresses
Simple random sampling of individuals
From a population of about 60 million people, 3,297 respondents were obtained in the final sample.
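A toy sketch of the multistage idea in base R. The sector names, counts, and household sizes are all made up, and the real BSA design is considerably more involved:

```r
# Stage 0: a miniature sampling frame of addresses grouped into sectors.
set.seed(2024)
frame <- data.frame(
  sector  = rep(c("AB1", "AB2", "CD1", "CD2"), each = 250),
  address = 1:1000
)

# Stages 1-2: stratify by sector, then take a simple random
# sample of 20 addresses within each sector stratum.
sampled <- do.call(rbind, lapply(split(frame, frame$sector),
                                 function(s) s[sample(nrow(s), 20), ]))

# Stage 3: select one adult at random within each sampled address
# (household sizes vary, so selection probabilities vary too).
sampled$adults   <- sample(1:4, nrow(sampled), replace = TRUE)
sampled$selected <- sapply(sampled$adults, function(k) sample(k, 1))

table(sampled$sector)   # 20 addresses per stratum
```

Note that everyone in the frame has a positive, computable probability of selection, which is what makes the design "random" in the sense below.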
What is random? Not predetermined; everyone should be able to be sampled with positive probability; unbiased.

Data Readiness
Band C: Hearsay data. Is it really available? Has it actually been recorded? Format: PDF, log books, etc.
Band B: Ready for exploratory analysis and visualisations. Missing values, anomalies, …
Band A: Ready for ML/statistical models.