Lecture 1: The Data Science Framework

- Content available at https://haziqj.ml/teaching
- 4 x 2hr lectures
- 10min break on the hour
- Ask questions as we go along

- Lecture 1: The data science framework
- Lecture 2: Using `R`
- Lecture 3: Data science with `R`
- Lecture 4: Exploratory analysis of Kiva.org data

data + model → understand

- Not new, arises in many fields
  - Natural sciences
  - Econometrics
  - Psychology
  - Sociology
  - etc.

Giuseppe Piazzi’s observations in the Monatliche Correspondenz, September 1801.

- Design of experiments; randomised controlled trials.
- Sir Ronald Fisher (1890–1962).

Data is now available by happenstance, and not just collected by design.

The more we measure, the more we don’t understand

- Breadth vs depth paradox; Big p Small n; The curse of dimensionality
- “Data first” paradigm
- Ethics; privacy
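
The "Big p Small n" problem can be seen in a few lines of R. This is a minimal sketch with simulated noise (all values are made up for illustration): when there are more predictors than observations, least squares reproduces the response exactly even though the predictors carry no information about it.

```r
set.seed(1)                          # reproducible simulation

n <- 10                              # small n: number of observations
p <- 20                              # big p: number of predictors
X <- matrix(rnorm(n * p), nrow = n)  # predictors are pure noise
y <- rnorm(n)                        # response is unrelated to X

fit <- lm(y ~ X)                     # rank-deficient fit; surplus coefficients are NA

# R-squared computed by hand: the share of variation "explained" by the fit
r2 <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
r2                                   # numerically 1, despite X being noise
```

A perfect in-sample fit here says nothing about the data; it only reflects that the model has as many free parameters as there are observations.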

`define: Data Science`

*The “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyze actual phenomena” with data.*

- Multi-disciplinary field
- Goal: extract knowledge and insights from structured and unstructured data

Real-world problems from the Alan Turing Institute

- Real-time jammer detection, identification and localization in 3G and 4G networks
- Automated matching of businesses to government contract opportunities
- Using real-world data to advance air traffic control
- Personalised lung cancer treatment modelling using electronic health records and genomics

- Identify potential drivers of engaging in extremism
- News feed analysis to help understand global instability
- Improved strength training using smart gym equipment data

- Focus on *transform* and *visualise*
- Modelling requires a specific skill set (Stats or ML)
- GOAL: Generate many promising leads that you can later explore in more depth

**Statistics** aims to turn humans into robots.

- Concept of “statistical proof”
- Often the interest is *inference*

**Machine learning** aims to turn robots into humans.

- Make sense of patterns from big data
- Often the interest is *prediction*

There’s a sea of data, but most of it is undrinkable

Data neglect: data cleaning is tedious and complex

- Most time is spent cleaning up data
- Affectionately called data “wrangling”
- [TBA] Data Readiness levels (Bands A, B and C)

- Structured data
  - Data is in a nicely organised repository
  - E.g. tables, matrices, etc.

- Unstructured data
  - Information does not have a predefined data model
  - E.g. images, colours, text, sound, etc.

- Continuous data
  - Measurements are taken on a continuous scale, e.g. height, weight, temperature, GDP, distance, etc.
  - Usually arises from physical experiments

- Discrete data
  - Measurements which can only take certain values, e.g. sex, survey responses (Likert scales), occupation, ratings, ranks, etc.
  - Usually arises in social sciences

Treatment | Continuous Data | Categorical Data |
---|---|---|
Import class | `numeric` | `factor`, `ordered` |
Visualise | Histograms, density plots, scatter plots, box-and-whisker plots, pie charts | Bar plots |
Summarise | 5-point summaries | Frequency tables |
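
The treatments for the two data types can be sketched in base R. The values and the variable names (`height`, `rating`) below are made up for illustration; ordinal categories are stored as an ordered factor.

```r
# Made-up illustrative data (not from the lecture)
height <- c(150, 160, 165, 170, 180)                # continuous
rating <- c("low", "high", "medium", "low", "low")  # categorical

# Import class: continuous data is numeric; categories become a factor.
# ordered = TRUE records that the levels have a natural ranking.
rating <- factor(rating, levels = c("low", "medium", "high"), ordered = TRUE)
class(height)   # "numeric"
class(rating)   # "ordered" "factor"

# Summarise: five-number summary (min, lower hinge, median, upper hinge, max)
five <- fivenum(height)

# Summarise: frequency table of the categories
freq <- table(rating)

# Visualise: histogram for continuous data, bar plot for categorical data
hist(height)
barplot(freq)
```

Declaring the class at import matters because R chooses summaries and plots based on it: `summary()` on a numeric vector reports quantiles, while on a factor it reports counts.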

Generate questions about your data.

Search for answers by visualising, transforming, and modelling your data.

Use what you learn to refine your questions and/or generate new questions.

$$y_i = \alpha + \beta x_i + \epsilon_i$$ $$\epsilon_i \sim \text{N}(0,\sigma^2)$$
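
The linear model above can be tried out on simulated data. This is a minimal sketch in base R: the "true" values `alpha_true`, `beta_true` and `sigma_true` are assumptions chosen only for illustration, and `lm()` is used to recover them.

```r
set.seed(1)                      # reproducible simulation

# Assumed "true" parameter values, chosen only for illustration
alpha_true <- 2
beta_true  <- 0.5
sigma_true <- 1

n <- 200
x <- runif(n, min = 0, max = 10)
eps <- rnorm(n, mean = 0, sd = sigma_true)   # eps_i ~ N(0, sigma^2)
y <- alpha_true + beta_true * x + eps        # y_i = alpha + beta * x_i + eps_i

fit <- lm(y ~ x)                 # least-squares estimates of alpha and beta
coef(fit)                        # estimates should land near 2 and 0.5
summary(fit)$sigma               # residual standard error, an estimate of sigma
```

Running `summary(fit)` additionally reports standard errors and p-values for each coefficient, which is the inference side of the model.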

- EDA neither provides statistical proof nor gives predictions
- For these, turn to statistical or ML models
- Many types of models, depending on what question you want answered

`R` is a language and environment for statistical computing and graphics (https://www.r-project.org/about.html)

- It is free and open source
- Runs everywhere
- Supports extensions
- Engaging community
- Links to other languages

`ggplot2` in R

https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding#kiva_loans.csv

- What exploratory analyses would you conduct on this data set?
- What other data do you need to supplement your analyses?
- What questions do you aim to answer?

Questions?

Source: datascienceblog.com

- Model interpretability is necessary for inference
- In a nutshell, a model is interpretable if we can “see” how the model generates its estimates
- cf. black boxes
- Interpretable models often use simplifying assumptions

- A complex model is often better at prediction tasks
- “More parameters to tune”
- However, model interpretability suffers

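$$ E\left[(y - \hat f(x))^2\right] = \text{Bias}^2[\hat f(x)] + \text{Var}[\hat f(x)] + \sigma^2 $$

where $y = f(x) + \epsilon$ is the observed response and $\text{Var}(\epsilon) = \sigma^2$.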
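
A short derivation sketch of this decomposition, under assumptions stated here rather than on the slide: the observed response is $y = f(x) + \epsilon$ with $E[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2$, and $\epsilon$ is independent of $\hat f(x)$. Writing $\bar f(x) = E[\hat f(x)]$ (notation introduced only for this sketch), the prediction error splits into three parts:

$$
\begin{aligned}
y - \hat f(x) &= \big(f(x) - \bar f(x)\big) + \big(\bar f(x) - \hat f(x)\big) + \epsilon \\
E\left[(y - \hat f(x))^2\right] &= \big(f(x) - \bar f(x)\big)^2 + E\left[\big(\hat f(x) - \bar f(x)\big)^2\right] + E[\epsilon^2] \\
&= \text{Bias}^2[\hat f(x)] + \text{Var}[\hat f(x)] + \sigma^2
\end{aligned}
$$

The cross terms vanish because $E[\bar f(x) - \hat f(x)] = 0$, and because $\epsilon$ has mean zero and is independent of $\hat f(x)$. The $\sigma^2$ term is the irreducible noise: no choice of model removes it.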

Economic freedom = 2.6 + 0.6 Trade
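
The fitted line above can be read off directly, which is what makes it interpretable. A minimal sketch in R follows; the function name `predict_freedom` and the input values are hypothetical, and the measurement scale of Trade is not stated here.

```r
# Fitted line from the slide: Economic freedom = 2.6 + 0.6 * Trade
predict_freedom <- function(trade) {
  2.6 + 0.6 * trade
}

# Prediction at a hypothetical Trade score of 5
at_five <- predict_freedom(5)

# Effect of a one-unit increase in Trade: equals the slope, 0.6
one_unit_effect <- predict_freedom(6) - predict_freedom(5)
```

Every prediction can be traced back to two numbers, the intercept and the slope, in contrast to a black-box model where no such reading is available.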

Source: towardsdatascience.com

Source: Groves et al. (2009)

- Target: Adults aged 18 or over in GB
- Survey: Private households south of the Caledonian Canal
- Frame: Addresses in the Postcode address file

Multistage design:

- Stratify by postcode sectors
- Simple random sampling of addresses
- Simple random sampling of individuals

From 60 million people, the final sample comprised 3,297 respondents.
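
The three stages of the design can be sketched in base R on a toy sampling frame. Everything in the frame below (sector labels, addresses, number of adults) is made up for illustration, and the sample sizes per stage are assumptions, not those of the actual survey.

```r
set.seed(42)                     # reproducible sampling

# Hypothetical sampling frame: 3 postcode sectors, 4 addresses each,
# with 1 to 3 adults living at each address
frame <- data.frame(
  sector  = rep(c("S1", "S2", "S3"), each = 4),
  address = 1:12,
  adults  = c(1, 2, 3, 2, 1, 1, 2, 3, 2, 2, 1, 3)
)

# Stratify by postcode sector, then take a simple random sample
# of 2 addresses within each stratum
by_sector <- split(frame, frame$sector)
sampled <- do.call(rbind, lapply(by_sector, function(s) {
  s[sample(nrow(s), 2), ]
}))

# Simple random sample of one adult at each sampled address
sampled$chosen_adult <- vapply(
  sampled$adults,
  function(k) sample.int(k, 1),
  integer(1)
)
sampled
```

Stratifying first guarantees that every sector is represented in the sample, which a single simple random sample of addresses would not.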

- Band C: Hearsay data. Is it really available? Has it actually been recorded? Format: PDF, log books, etc.
- Band B: Ready for exploratory analysis, visualisations. Missing values, anomalies, …
- Band A: Ready for ML/Stats models.
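
The Band B concerns (missing values, anomalies) can be checked with a few base R calls. This is a minimal sketch on a made-up data frame; the name `loans` and its columns are hypothetical and are not taken from the Kiva data.

```r
# Hypothetical loan records with gaps and one suspicious value
loans <- data.frame(
  amount = c(250, NA, 1200, 75, 1e6),
  sector = c("Food", "Retail", NA, "Food", "Food")
)

# Missing values: count the NAs in each column
n_missing <- colSums(is.na(loans))
n_missing

# Anomalies: a numeric summary makes extreme values stand out
summary(loans$amount)

# Keep only rows with no missing values, one simple (and lossy) option
complete <- loans[complete.cases(loans), ]
```

Dropping incomplete rows is only one option; whether it is appropriate depends on why the values are missing.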