R for Data Science

AITI Talk

Haziq Jamil, PhD

Assistant Professor in Statistics, Universiti Brunei Darussalam
Visiting Fellow, London School of Economics and Political Science

https://haziqj.ml/aiti-talk/

May 21, 2025

Who we are

Instructions and material available at https://haziqj.ml/aiti-talk/

Plan for today

Time	Activity
0830 – 0900	Introduction & Getting Started with R
0900 – 1000	Lecture 1: Basic Statistics
1000 – 1030	Break
1030 – 1130	Lecture 2: Advanced R stuff
1130 – 1200	Networking

slido.com code: 3244786

Let’s start

Introduction

Datasaurus supports you!

Automate End-to-End Survey Processing
Write one script that pulls raw survey responses, cleans and validates fields and outputs ready-to-analyse datasets.
Standardise Analysis & Quality Checks
Embed your business rules into reusable code so every round adheres to the same quality standards.
Generate Dynamic Reports in Seconds
Turn your cleaned data into up-to-date charts, tables and written summaries automatically.
Quickly Prototype “What-If” Scenarios
Simulate alternative weighting schemes, forecast adoption trends or run sensitivity analyses on key ICT indicators to guide policy adjustments.

The main game

Why choose R?

R is an interpreted programming language for statistical computing and data visualisation. It has been adopted in many fields, especially quantitative fields like data science.

R and RStudio

You can run R in the terminal, the R GUI, or other apps like RStudio.
RStudio is an IDE (integrated development environment) for R.
Alternatives include VSCode, Emacs, and <insert favourite IDE>.

my_string <- "Hello, World!"
print(my_string)

[1] "Hello, World!"

# Create a vector, manipulate it
x <- c(1, 2, 3, 4, 5)
sum(x) / length(x)

[1] 3

for (i in x) print(i)

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Lay of the land

Project folder

Download material from https://github.com/haziqj/aiti-talk
Choose a location to save the project
Go to the R/ folder, and open the aiti.RProj file
This will open a new RStudio project

Project folder

R needs to know where your project’s “home” directory is. When you open the RProj file, RStudio sets the working directory to the project folder.
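A quick way to check this, sketched in base R: `getwd()` reports the current working directory, and relative paths are resolved against it. The file name is the one imported later in these slides; that it sits at the top of the project folder is an assumption.

```r
# Where does R think "home" is right now?
wd <- getwd()
print(wd)

# Relative paths are resolved against the working directory
# (assumption: the survey file sits at the top of the project folder)
csv_path <- file.path(wd, "fake_survey.csv")
file.exists(csv_path)  # TRUE only if the file is really there
```

If this returns FALSE, the project was probably not opened through the RProj file.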

Lay of the land

RStudio

Preamble

library(tidyverse)  # data wrangling tools
library(tinyplot)   # for quick plotting
library(tidytext)   # bigrams
library(tm)         # text mining
library(wordcloud)  # word clouds
library(gtsummary)  # pretty summary tables
library(bruneimap)  # for mapping

theme_set(theme_bw())  # ggplot2
tinytheme("clean2")    # tinyplot

Installing packages

In RStudio, if a package is not installed, a yellow ribbon will appear prompting you to install it. You can also manually install packages by running:

install.packages("<package name>")  # only need to install once
library("<package name>")           # but load the package every time!

Or browse the ‘Packages’ pane in RStudio.
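A minimal sketch for checking which of the packages in the preamble are still missing, using only base R. The install line is left commented out on purpose; whether every package is available from CRAN is not stated in these slides, so treat that as an assumption.

```r
# Packages used in this talk (taken from the preamble)
pkgs <- c("tidyverse", "tinyplot", "tidytext", "tm",
          "wordcloud", "gtsummary", "bruneimap")

# requireNamespace() returns FALSE (without an error) if a package is absent
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]
print(missing_pkgs)

# Uncomment to install whatever is missing.
# Packages that are not on CRAN would need a different installation route.
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```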

Importing data

dat <- read_csv("fake_survey.csv")
glimpse(dat)

Rows: 2,000
Columns: 13
$ id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ kampong    <chr> "Kg. Lorong Tiga Selatan", "Kg. Anggerek Desa", "Kg. Bebati…
$ mukim      <chr> "Mukim Seria", "Mukim Berakas A", "Mukim Pangkalan Batu", "…
$ district   <chr> "Belait", "Brunei-Muara", "Brunei-Muara", "Belait", "Brunei…
$ gender     <chr> "Female", "Male", "Female", "Male", "Male", "Female", "Male…
$ age        <dbl> 47, 38, 42, 47, 50, 33, 54, 38, 33, 38, 18, 39, 33, 25, 53,…
$ education  <chr> "O Level", "O Level", "O Level", "Higher National Diploma",…
$ q_fbspeed  <dbl> 54, 58, 711, 56, 187, 58, 55, 88, 888, 146, 20, 37, 53, 192…
$ q_fbqual   <chr> "Fair", "Very Good", "Poor", "Very Good", "Good", "Fair", "…
$ q_mbqual   <chr> "Poor", "Good", "Good", "Good", "Good", "Good", "Good", "Fa…
$ q_fbexpend <dbl> 782.00000, 78.29368, 745.31846, 78.29368, 640.10037, 622.72…
$ q_fbusage  <dbl> 620, 260, 750, 320, 290, 410, 120, 450, 310, 120, 390, 190,…
$ q_limiting <chr> "I'll almost never download software that costs hundreds of…

Demographic vs study questions

A survey usually contains two types of questions: 1) demographic questions, and 2) study questions.

Data types

graph TD
    A[**Data Type**]
    
    A --> B["**Logical**<br><br>e.g. TRUE, FALSE"]
    A --> C[**Numeric**]
    A --> D["**Complex**<br><br>e.g. 1+2i, 3+4i"]
    A --> E["**Character**<br><br>e.g. 'cat', 'blue'"]
    
    C --> CA["**Integer**<br><br>e.g. 1L, 314L"]
    C --> CB["**Double**<br><br>e.g. 1.23, 3.141"]
    
    E --> EA["**Factor**<br><br>e.g. 'MOE', 'MTIC', 'MOH'"]
    E --> EB["**Ordered**<br><br>e.g. 'Disagree', 'Neutral',<br>'Agree'"]
    
    %% Assign nodes to classes
    class EA pink
    class EB pink

    %% Define styles for the classes
    classDef pink fill:#f9c,color:#fff,stroke:#333,stroke-width:1px

Know your data types

We must know the data types of our variables to perform the correct operations on them. For example, if we want to calculate the mean of a variable, it must be numeric. If it is a factor or an ordered factor, we need to convert it to numeric first.
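A small base R sketch of that conversion. The responses are made up for illustration; the levels mirror the Likert scale used in the survey. It also shows a common pitfall with factors whose labels look like numbers.

```r
# An ordered factor with hypothetical responses (levels as in the survey)
quality <- ordered(
  c("Poor", "Good", "Fair", "Good"),
  levels = c("Very Poor", "Poor", "Fair", "Good", "Very Good", "Excellent")
)
class(quality)                 # "ordered" "factor"

# as.numeric() gives the position of each response within `levels`
scores <- as.numeric(quality)
mean(scores)

# Pitfall: for a factor whose labels look like numbers,
# as.numeric() returns the level codes, not the labels
f <- factor(c("10", "20", "10"))
as.numeric(f)                  # codes: 1 2 1
as.numeric(as.character(f))    # values: 10 20 10
```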

Transforming data

dat <-
  dat |>
  mutate(
    gender = factor(gender, levels = c("Male", "Female")),
    # Convert education to factors
    education = factor(education, levels = c(
      "Primary School", "Lower Secondary", "O Level", "A Level", 
      "National Certificate", "Diploma", "National Diploma", 
      "Higher National Diploma", "Bachelor Degree", "Master Degree", "PhD"
    )),
    # Convert Likert scale to ordered factors
    across(c(q_mbqual, q_fbqual), function(x) ordered(x, levels = c(
      "Very Poor", "Poor", "Fair", "Good", "Very Good", "Excellent"
    )))
  )

glimpse(dat)

Transforming data

Rows: 2,000
Columns: 13
$ id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ kampong    <chr> "Kg. Lorong Tiga Selatan", "Kg. Anggerek Desa", "Kg. Bebati…
$ mukim      <chr> "Mukim Seria", "Mukim Berakas A", "Mukim Pangkalan Batu", "…
$ district   <chr> "Belait", "Brunei-Muara", "Brunei-Muara", "Belait", "Brunei…
$ gender     <fct> Female, Male, Female, Male, Male, Female, Male, Male, Male,…
$ age        <dbl> 47, 38, 42, 47, 50, 33, 54, 38, 33, 38, 18, 39, 33, 25, 53,…
$ education  <fct> O Level, O Level, O Level, Higher National Diploma, Bachelo…
$ q_fbspeed  <dbl> 54, 58, 711, 56, 187, 58, 55, 88, 888, 146, 20, 37, 53, 192…
$ q_fbqual   <ord> Fair, Very Good, Poor, Very Good, Good, Fair, Very Good, Ve…
$ q_mbqual   <ord> Poor, Good, Good, Good, Good, Good, Good, Fair, Very Poor, …
$ q_fbexpend <dbl> 782.00000, 78.29368, 745.31846, 78.29368, 640.10037, 622.72…
$ q_fbusage  <dbl> 620, 260, 750, 320, 290, 410, 120, 450, 310, 120, 390, 190,…
$ q_limiting <chr> "I'll almost never download software that costs hundreds of…

head(as.numeric(dat$q_mbqual), 15)

 [1] 2 4 4 4 4 4 4 3 1 5 4 5 6 4 4

head(as.numeric(dat$kampong), 15)

 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Tidy data

Tidy data (cont.)

All happy families are alike; each unhappy family is unhappy in its own way. —Leo Tolstoy

Mindset shift & expectations

Emphasis on writing reproducible R code
Appreciate that there’s a leaRning curve
Goal of the talk is to show what R is capable of

Basics

Variability

At its core, statistics is the science of understanding variability.

No variability = no insight.
Segmentation, patterns or predictors of dissatisfaction.
Inform policy decisions.

Summary statistics

Continuous data

x <- dat$age
head(x)

[1] 47 38 42 47 50 33

# Mean and standard deviation
mean(x)

[1] 38.537

sd(x)

[1] 11.96885

# Quick summary of the data
summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.00   29.00   39.00   38.54   47.00   69.00
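To see what `mean()` and `sd()` compute, here is a sketch that rebuilds both from their definitions. It uses only the first six ages printed above, so the numbers will differ from the full-sample results.

```r
x <- c(47, 38, 42, 47, 50, 33)  # first six ages shown above
n <- length(x)

# Mean: sum of the values divided by how many there are
mean_manual <- sum(x) / n

# Sample standard deviation: note the n - 1 in the denominator
sd_manual <- sqrt(sum((x - mean_manual)^2) / (n - 1))

all.equal(mean_manual, mean(x))
all.equal(sd_manual, sd(x))
```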

Boxplots

boxplot(x, horizontal = TRUE, main = "Boxplot of Age", xlab = "Age",
        col = "lightblue")

Histograms

hist(x, main = "Histogram of Age", xlab = "Age", ylab = "Frequency", 
     col = "lightblue", breaks = 10)

Histograms and density plots

hist(x, main = "Histogram of Age with density overlaid", xlab = "Age", 
     ylab = "Density", col = "lightblue", breaks = 10, prob = TRUE)
lines(density(x), lwd = 3, col = "red3")

Summary statistics

Nominal (discrete) data

x <- dat$gender
head(x)

[1] Female Male   Female Male   Male   Female
Levels: Male Female

# No such thing as the 'mean' of a factor (categorical variable)!
mean(x)

[1] NA

# Instead, do this:
table(x)

x
  Male Female 
   990   1010

prop.table(table(x))

x
  Male Female 
 0.495  0.505

# If you're fancy:
chisq.test(table(x))


    Chi-squared test for given probabilities

data:  table(x)
X-squared = 0.2, df = 1, p-value = 0.6547
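The test above asks whether the two genders are equally represented. As a sketch, the same statistic can be rebuilt by hand from the counts in the table, assuming equal expected proportions (the default in `chisq.test()` for a one-way table).

```r
observed <- c(Male = 990, Female = 1010)       # counts from table(x)
expected <- sum(observed) * c(0.5, 0.5)        # equal split under the null

# Chi-squared statistic: sum of (observed - expected)^2 / expected
x2 <- sum((observed - expected)^2 / expected)

# Upper-tail probability with (number of categories - 1) degrees of freedom
p_value <- pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)

round(c(X2 = x2, p = p_value), 4)
```

The large p-value means the data are consistent with an even gender split.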

Bar plots

x <- dat$education
barplot(table(x), las = 2, cex.names = 0.8, main = "Barplot of Education", 
        ylab = "Frequency", col = "lightblue")

Co-variability

Continuous vs continuous

Continuous vs nominal

Nominal vs nominal

Scatter plots

plot(q_fbexpend ~ q_fbusage, data = dat,
     main = "Monthly expenditure vs data usage", 
     xlab = "Data usage (GB)", ylab = "Monthly expenditure (BND)")

Strength of linear relationships

cor(dat$q_fbexpend, dat$q_fbusage)

[1] 0.5741981

\[ \rho = \frac{\text{Cov(X,Y)}}{\text{SD(X)}\times\text{SD(Y)}} \in [-1,1] \]
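A short sketch showing that `cor()` matches the formula. The two vectors are hypothetical values chosen for illustration, not survey data.

```r
usage  <- c(120, 260, 310, 450, 620)  # hypothetical usage values
expend <- c(60, 78, 95, 140, 150)     # hypothetical expenditure values

# Covariance divided by the product of the standard deviations
r_manual <- cov(usage, expend) / (sd(usage) * sd(expend))

all.equal(r_manual, cor(usage, expend))
```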

(Simple) Linear regression model

fit <- lm(q_fbexpend ~ q_fbusage, data = dat)
summary(fit)

\[ \begin{gathered} y = \beta_0 + \beta_1 x + \epsilon \\ \epsilon \sim N(0, \sigma^2) \end{gathered} \]


Call:
lm(formula = q_fbexpend ~ q_fbusage, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-150.05  -25.56   -6.91   15.91  593.04 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 55.601997   2.223077   25.01   <2e-16 ***
q_fbusage    0.215099   0.006861   31.35   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.07 on 1998 degrees of freedom
Multiple R-squared:  0.3297,    Adjusted R-squared:  0.3294 
F-statistic: 982.8 on 1 and 1998 DF,  p-value: < 2.2e-16
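One way to read the output, sketched with the estimates copied from the coefficient table. The usage value of 300 is an arbitrary choice for illustration. In a simple linear regression, R-squared is also the square of the correlation reported earlier.

```r
b0 <- 55.601997  # intercept estimate from the output above
b1 <- 0.215099   # slope estimate: change in expenditure per unit of usage

# Predicted monthly expenditure for a hypothetical usage of 300
usage <- 300
predicted <- b0 + b1 * usage
predicted

# R-squared equals the squared correlation in simple linear regression
r <- 0.5741981
round(r^2, 4)
```

With the fitted object available, `predict(fit, newdata = data.frame(q_fbusage = 300))` gives the same prediction without copying numbers by hand.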

Scatter plots (with linear trend line)

# continuing previous plot
abline(fit, col = "red3", lwd = 2)

Five-number summary by group

by(dat$q_fbexpend, dat$gender, summary)

dat$gender: Male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12.65   79.26  101.46  114.92  133.07  640.10 
------------------------------------------------------------ 
dat$gender: Female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00   76.36   99.53  113.27  130.42  782.00

boxplot(q_fbexpend ~ gender, dat, range = 5, col = "lightblue", horizontal = TRUE,
        ylab = NULL, xlab = NULL, main = "Monthly expenditure (BND)")

Contingency tables

tab1 <- table(dat$gender, dat$q_fbqual)
print(tab1)

        
         Very Poor Poor Fair Good Very Good Excellent
  Male          26   68  204  333       271        88
  Female        16   62  236  333       254       109

tab2 <- prop.table(tab1, margin = 1)  # row proportions
round(tab2, 2)

        
         Very Poor Poor Fair Good Very Good Excellent
  Male        0.03 0.07 0.21 0.34      0.27      0.09
  Female      0.02 0.06 0.23 0.33      0.25      0.11

chisq.test(tab1)


    Pearson's Chi-squared test

data:  tab1
X-squared = 7.575, df = 5, p-value = 0.1813
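As a sketch of what the test does, the counts printed above can be retyped into a matrix and the statistic rebuilt by hand. Under independence, each expected count is row total times column total divided by the grand total.

```r
# Counts copied from the contingency table above
tab <- matrix(
  c(26, 68, 204, 333, 271,  88,
    16, 62, 236, 333, 254, 109),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender  = c("Male", "Female"),
    quality = c("Very Poor", "Poor", "Fair", "Good", "Very Good", "Excellent")
  )
)

# Expected counts if gender and rating were independent
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

x2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

round(c(X2 = x2, df = df, p = p_value), 4)
```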

Mosaic plots

Mosaic plots (Titanic data set)

Advanced

More co-variability

Grammar of graphics

A statistical graphic is a…

mapping of data
which may be statistically transformed (summarized, log-transformed, etc.)
to aesthetic attributes (color, size, xy-position, etc.)
using geometric objects (points, lines, bars, etc.)
and mapped onto a specific facet and coordinate system

Breaking down the `ggplot()` call

dat |>
  mutate(
    # Categorise age
    age = cut(age, breaks = c(0, 18, 40, 60, Inf),
              labels = paste0("Age: ", c("< 18", "18-40", "40-60", "60+"))),
    # Collapse education levels into three groups
    education = fct_collapse(
      education,
      `Secondary\nor lower` = c("Primary School", "Lower Secondary", "O Level", "A Level"),
      `Post-\nsecondary` = c("National Certificate", "Diploma", "National Diploma", "Higher National Diploma"),
      Tertiary = c("Bachelor Degree", "Master Degree", "PhD")
    )
  ) |>
  ggplot(aes(x = q_fbusage, y = q_fbexpend, col = gender)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE, linewidth = 0.8) +
  facet_grid(education ~ age) +
  labs(x = "Data usage (GB)", y = "Monthly expenditure (BND)", col = "Gender")

Fun part: Themes!

Spatial data

Points: any \((x,y)\) coordinate data, e.g. locations of

Masjids in Brunei
Schools in Brunei
Shops
etc.

Lines: a collection of points connected by lines, e.g.

Roads
Rivers
Train lines
etc.

Polygons: a closed two-dimensional area formed by connecting a finite number of line segments, e.g.

Kampongs
Mukims
Districts
etc.
Spatial patterns

Everything is related to everything else, but near things are more related than distant things. —Waldo Tobler, on the ‘First Law of Geography’

spend_mkm_df <-
  dat |>
  summarise(spend = mean(q_fbexpend), .by = mukim)
head(spend_mkm_df, 8)

# A tibble: 8 × 2
  mukim                spend
  <chr>                <dbl>
1 Mukim Seria           116.
2 Mukim Berakas A       115.
3 Mukim Pangkalan Batu  121.
4 Mukim Berakas B       113.
5 Mukim Kota Batu       137.
6 Mukim Tanjong Maya    106.
7 Mukim Pekan Tutong    131.
8 Mukim Gadong A        120.

Understanding spatial ICT spending or usage patterns reveals digital inequality and empowers targeted investments for a more connected, inclusive Brunei.

Spatial patterns (cont.)

Spatial analysis requires joining

spatial (GIS) data, with
geo-coded study data

left_join(
    mkm_sf,  # spatial
    spend_mkm_df  # study
  ) |>
  ggplot() +
  geom_sf(aes(fill = spend))
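The mechanics of the join can be sketched with two plain data frames. This uses base R `merge()` with `all.x = TRUE` as a stand-in for `left_join()`, and a made-up `area_id` column in place of the geometry; the spend values are the rounded means printed earlier.

```r
# Stand-in for the spatial table: one row per mukim
areas <- data.frame(
  mukim   = c("Mukim Seria", "Mukim Berakas A", "Mukim Kota Batu"),
  area_id = 1:3  # hypothetical placeholder for the geometry column
)

# Study data: no row for Mukim Berakas A in this toy example
spend <- data.frame(
  mukim = c("Mukim Seria", "Mukim Kota Batu"),
  spend = c(116, 137)
)

# Keep every row of `areas`; unmatched mukims get NA for `spend`
joined <- merge(areas, spend, by = "mukim", all.x = TRUE)
print(joined)
```

Every area is kept, so the map stays complete, and a mukim without survey responses simply has no value to colour.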

Quantitative text analysis

# (Comment section) Describe the top reason that limits your internet access.
head(dat$q_limiting, 5)

[1] "I'll almost never download software that costs hundreds of dollars upfront – it's just not worth breaking the bank every five minutes to keep basic services running fine."
[2] "I'd rather not get internet just to pay an extra $50 for installation that'll likely wear off within a year anyway."                                                       
[3] "I don't use Video Calls enough because poor calls keep dropping mid-conversation."                                                                                         
[4] "I wish I had reliable internet at home, but it's so easy for me to just use my neighbor's place when I need it."                                                           
[5] "I'd love to stream more videos and games if it weren't for how expensive it costs me every month right now."

Word clouds and bigrams
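A minimal base R sketch of the counting behind word clouds and bigrams, using two of the responses shown above. The talk loads tidytext and tm for this; the code below is a simplified stand-in, not the method used for the slides, and it skips stop-word removal.

```r
responses <- c(
  "I don't use Video Calls enough because poor calls keep dropping mid-conversation.",
  "I'd love to stream more videos and games if it weren't for how expensive it costs me every month right now."
)

# Lowercase, then replace everything except letters, apostrophes and spaces
clean <- gsub("[^a-z' ]", " ", tolower(responses))

# Split each response into words
tokens <- strsplit(trimws(clean), " +")

# Word frequencies across all responses (the input to a word cloud)
word_freq <- sort(table(unlist(tokens)), decreasing = TRUE)
head(word_freq, 5)

# Bigrams: each word pasted to the word that follows it, within a response
bigrams <- unlist(lapply(tokens, function(w) paste(head(w, -1), tail(w, -1))))
head(bigrams, 5)
```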

Reproducible reports

Quarto is an open-source scientific and technical publishing system. It enables you to create dynamic documents, reports, presentations, and websites using R code and Markdown language.

DEMO See report.qmd file

Ending

Where to go from here?

Learning more R and statistics
- Faculty modules, UBD C3L, UBD ILIA
- Brunei R User Group
Hiring policy update?
Workflow update?

Thanks!

Questions?
