[1] "Hello, World!"
[1] 3
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
AITI Talk
Assistant Professor in Statistics, Universiti Brunei Darussalam
Visiting Fellow, London School of Economics and Political Science
May 21, 2025
Instructions and material available at https://haziqj.ml/aiti-talk/
Time | Activity |
---|---|
0830 – 0900 | Introduction & Getting Started with R |
0900 – 1000 | Lecture 1: Basic Statistics |
1000 – 1030 | Break |
1030 – 1130 | Lecture 2: Advanced R stuff |
1130 – 1200 | Networking |
slido.com code: 3244786
Datasaurus supports you!
Automate End-to-End Survey Processing
Write one script that pulls raw survey responses, cleans and validates fields and outputs ready-to-analyse datasets.
Standardise Analysis & Quality Checks
Embed your business rules into reusable code so every round adheres to the same quality standards.
Generate Dynamic Reports in Seconds
Turn your cleaned data into up-to-date charts, tables and written summaries automatically.
Quickly Prototype “What-If” Scenarios
Simulate alternative weighting schemes, forecast adoption trends or run sensitivity analyses on key ICT indicators to guide policy adjustments.
R is an interpreted programming language for statistical computing and data visualisation. It has been adopted in many fields, especially quantitative fields like data science.
<insert favourite IDE>
.aiti.RProj file
Project folder
R needs to know where your project’s “home” directory is. When you open the project by clicking on the RProj file, RStudio sets the working directory to the project folder.
library(tidyverse) # data wrangling tools
library(tinyplot) # for quick plotting
library(tidytext) # bigrams
library(tm) # text mining
library(wordcloud) # word clouds
library(gtsummary) # pretty summary tables
library(bruneimap) # for mapping
theme_set(theme_bw()) # ggplot2
tinytheme("clean2") # tinyplot
Installing packages
In RStudio, if a package is not installed, a yellow ribbon will appear prompting you to install it. You can also manually install packages by running:
install.packages("<package name>") # only need to install once
library("<package name>") # but load the package every time!
Or browse the ‘Packages’ pane in RStudio.
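A minimal sketch of checking which packages still need installing before a session. The helper name `missing_packages()` is hypothetical (not part of the talk material), and the `install.packages()` call is left commented out so nothing is installed by accident.

```r
# Hypothetical helper: return the names of packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this talk
needed <- c("tidyverse", "tinyplot", "tidytext", "tm",
            "wordcloud", "gtsummary", "bruneimap")

to_install <- missing_packages(needed)
to_install

# Uncomment to install whatever is missing (only needed once)
# if (length(to_install) > 0) install.packages(to_install)
```

Some packages may not be on CRAN; in that case follow the installation notes on the package's own page.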
Rows: 2,000
Columns: 13
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ kampong <chr> "Kg. Lorong Tiga Selatan", "Kg. Anggerek Desa", "Kg. Bebati…
$ mukim <chr> "Mukim Seria", "Mukim Berakas A", "Mukim Pangkalan Batu", "…
$ district <chr> "Belait", "Brunei-Muara", "Brunei-Muara", "Belait", "Brunei…
$ gender <chr> "Female", "Male", "Female", "Male", "Male", "Female", "Male…
$ age <dbl> 47, 38, 42, 47, 50, 33, 54, 38, 33, 38, 18, 39, 33, 25, 53,…
$ education <chr> "O Level", "O Level", "O Level", "Higher National Diploma",…
$ q_fbspeed <dbl> 54, 58, 711, 56, 187, 58, 55, 88, 888, 146, 20, 37, 53, 192…
$ q_fbqual <chr> "Fair", "Very Good", "Poor", "Very Good", "Good", "Fair", "…
$ q_mbqual <chr> "Poor", "Good", "Good", "Good", "Good", "Good", "Good", "Fa…
$ q_fbexpend <dbl> 782.00000, 78.29368, 745.31846, 78.29368, 640.10037, 622.72…
$ q_fbusage <dbl> 620, 260, 750, 320, 290, 410, 120, 450, 310, 120, 390, 190,…
$ q_limiting <chr> "I'll almost never download software that costs hundreds of…
Demographic vs study questions
Usually, a survey contains two types of questions: (1) demographic questions and (2) study questions.
graph TD
  A[**Data Type**]
  A --> B["**Logical**<br><br>e.g. TRUE, FALSE"]
  A --> C[**Numeric**]
  A --> D["**Complex**<br><br>e.g. 1+2i, 3+4i"]
  A --> E["**Character**<br><br>e.g. 'cat', 'blue'"]
  C --> CA["**Integer**<br><br>e.g. 1L, 314L"]
  C --> CB["**Double**<br><br>e.g. 1.23, 3.141"]
  E --> EA["**Factor**<br><br>e.g. 'MOE', 'MTIC', 'MOH'"]
  E --> EB["**Ordered**<br><br>e.g. 'Disagree', 'Neutral',<br>'Agree'"]
  %% Assign nodes to classes
  class EA pink
  class EB pink
  %% Define styles for the classes
  classDef pink fill:#f9c,color:#fff,stroke:#333,stroke-width:1px
Know your data types
We must know the data types of our variables to perform the correct operations on them. For example, to calculate the mean of a variable, it must be numeric. If it is a factor or an ordered factor, we need to convert it to numeric first.
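A quick way to inspect data types is with `typeof()` and `class()`. The sketch below uses only toy values taken from the diagram; the variable names `ministry` and `opinion` are made up for illustration. Note that R stores factors internally as integer codes with character labels attached.

```r
typeof(TRUE)      # "logical"
typeof(1L)        # "integer"
typeof(1.23)      # "double"
typeof(1 + 2i)    # "complex"
typeof("cat")     # "character"

# Factor: categories with no natural order
ministry <- factor(c("MOE", "MTIC", "MOH"))
class(ministry)   # "factor"
typeof(ministry)  # "integer" (codes that point to the labels)

# Ordered factor: categories with a natural order
opinion <- ordered(c("Agree", "Neutral"),
                   levels = c("Disagree", "Neutral", "Agree"))
class(opinion)           # "ordered" "factor"
opinion[1] > opinion[2]  # TRUE, because "Agree" ranks above "Neutral"
```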
dat <-
  dat |>
  mutate(
    gender = factor(gender, levels = c("Male", "Female")),
    # Convert education to factors
    education = factor(education, levels = c(
      "Primary School", "Lower Secondary", "O Level", "A Level",
      "National Certificate", "Diploma", "National Diploma",
      "Higher National Diploma", "Bachelor Degree", "Master Degree", "PhD"
    )),
    # Convert Likert scale to ordered factors
    across(c(q_mbqual, q_fbqual), function(x) ordered(x, levels = c(
      "Very Poor", "Poor", "Fair", "Good", "Very Good", "Excellent"
    )))
  )

glimpse(dat)
Rows: 2,000
Columns: 13
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ kampong <chr> "Kg. Lorong Tiga Selatan", "Kg. Anggerek Desa", "Kg. Bebati…
$ mukim <chr> "Mukim Seria", "Mukim Berakas A", "Mukim Pangkalan Batu", "…
$ district <chr> "Belait", "Brunei-Muara", "Brunei-Muara", "Belait", "Brunei…
$ gender <fct> Female, Male, Female, Male, Male, Female, Male, Male, Male,…
$ age <dbl> 47, 38, 42, 47, 50, 33, 54, 38, 33, 38, 18, 39, 33, 25, 53,…
$ education <fct> O Level, O Level, O Level, Higher National Diploma, Bachelo…
$ q_fbspeed <dbl> 54, 58, 711, 56, 187, 58, 55, 88, 888, 146, 20, 37, 53, 192…
$ q_fbqual <ord> Fair, Very Good, Poor, Very Good, Good, Fair, Very Good, Ve…
$ q_mbqual <ord> Poor, Good, Good, Good, Good, Good, Good, Fair, Very Poor, …
$ q_fbexpend <dbl> 782.00000, 78.29368, 745.31846, 78.29368, 640.10037, 622.72…
$ q_fbusage <dbl> 620, 260, 750, 320, 290, 410, 120, 450, 310, 120, 390, 190,…
$ q_limiting <chr> "I'll almost never download software that costs hundreds of…
[1] 2 4 4 4 4 4 4 3 1 5 4 5 6 4 4
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
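The two outputs above contrast converting an ordered factor to numbers with converting raw text to numbers. A small sketch with a toy vector (the values in `ratings` are made up; the levels are the Likert levels used earlier):

```r
lvls <- c("Very Poor", "Poor", "Fair", "Good", "Very Good", "Excellent")
ratings <- c("Poor", "Good", "Fair")   # toy responses, stored as character

# Ordered factor: as.numeric() returns the position of each level
ratings_ord <- ordered(ratings, levels = lvls)
as.numeric(ratings_ord)                # 2 4 3

# Plain character: R cannot turn words into numbers, so we get NA
# (suppressWarnings() hides the "NAs introduced by coercion" warning)
suppressWarnings(as.numeric(ratings))  # NA NA NA

# Once converted, numeric summaries work
mean(as.numeric(ratings_ord))          # 3
```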
All happy families are alike; each unhappy family is unhappy in its own way. —Leo Tolstoy
Emphasis on writing reproducible R code
Appreciate that there’s a leaRning curve
Goal of the talk is to show what R is capable of
At its core, statistics is the science of understanding variability.
[1] Female Male Female Male Male Female
Levels: Male Female
[1] NA
x
Male Female
990 1010
x
Male Female
0.495 0.505
Chi-squared test for given probabilities
data: table(x)
X-squared = 0.2, df = 1, p-value = 0.6547
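For reference, a sketch of how output like this could be produced. The vector `x` is rebuilt here from the counts shown (990 Male, 1010 Female) rather than read from the survey data, so it stands in for `dat$gender`.

```r
# Rebuild a gender vector with the same counts as in the output
x <- factor(rep(c("Male", "Female"), times = c(990, 1010)),
            levels = c("Male", "Female"))

counts <- table(x)     # frequency table
counts
prop.table(counts)     # proportions

# Test whether the two categories are equally likely (the default null)
res <- chisq.test(counts)
res$statistic          # X-squared = 0.2
res$p.value            # about 0.6547
```

A large p-value like this one gives no evidence that the gender split differs from 50:50.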
\[ \rho = \frac{\text{Cov(X,Y)}}{\text{SD(X)}\times\text{SD(Y)}} \in [-1,1] \]
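To see the formula at work, the sketch below computes the correlation by hand and compares it with R's built-in `cor()`. The numbers in `x` and `y` are made up for illustration.

```r
# Toy data (hypothetical values)
x <- c(120, 260, 310, 450, 620)
y <- c(70, 95, 130, 150, 200)

# Correlation from the definition: Cov(X,Y) / (SD(X) * SD(Y))
rho_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in Pearson correlation
rho_builtin <- cor(x, y)

all.equal(rho_manual, rho_builtin)  # TRUE
```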
\[ \begin{gathered} y = \beta_0 + \beta_1 x + \epsilon \\ \epsilon \sim N(0, \sigma^2) \end{gathered} \]
Call:
lm(formula = q_fbexpend ~ q_fbusage, data = dat)
Residuals:
Min 1Q Median 3Q Max
-150.05 -25.56 -6.91 15.91 593.04
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.601997 2.223077 25.01 <2e-16 ***
q_fbusage 0.215099 0.006861 31.35 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 54.07 on 1998 degrees of freedom
Multiple R-squared: 0.3297, Adjusted R-squared: 0.3294
F-statistic: 982.8 on 1 and 1998 DF, p-value: < 2.2e-16
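One way to read the fitted line is to plug usage values into it. This sketch copies the two coefficients from the summary above; the usage values are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the lm() summary
b0 <- 55.601997   # intercept
b1 <- 0.215099    # slope for q_fbusage

# Hypothetical usage values
usage <- c(100, 300, 500)

# Predicted expenditure: y-hat = b0 + b1 * x
predicted <- b0 + b1 * usage
round(predicted, 2)
```

With the fitted model object at hand, `predict()` with a `newdata` data frame containing a `q_fbusage` column gives the same numbers.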
dat$gender: Male
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.65 79.26 101.46 114.92 133.07 640.10
------------------------------------------------------------
dat$gender: Female
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 76.36 99.53 113.27 130.42 782.00
boxplot(q_fbexpend ~ gender, dat, range = 5, col = "lightblue", horizontal = TRUE,
ylab = NULL, xlab = NULL, main = "Monthly expenditure (BND)")
Very Poor Poor Fair Good Very Good Excellent
Male 26 68 204 333 271 88
Female 16 62 236 333 254 109
Very Poor Poor Fair Good Very Good Excellent
Male 0.03 0.07 0.21 0.34 0.27 0.09
Female 0.02 0.06 0.23 0.33 0.25 0.11
Pearson's Chi-squared test
data: tab1
X-squared = 7.575, df = 5, p-value = 0.1813
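A sketch of how the test above can be reproduced from the counts alone. The table is typed in from the output shown; the dimension names `gender` and `rating` are labels chosen here, and `tab1` in the talk would be built from the survey data instead.

```r
lvls <- c("Very Poor", "Poor", "Fair", "Good", "Very Good", "Excellent")

# Counts copied from the cross-tabulation
tab <- matrix(
  c(26, 68, 204, 333, 271,  88,
    16, 62, 236, 333, 254, 109),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Male", "Female"), rating = lvls)
)

# Row proportions: distribution of ratings within each gender
round(prop.table(tab, margin = 1), 2)

# Pearson's chi-squared test of independence
res <- chisq.test(tab)
res$statistic   # about 7.575
res$parameter   # df = 5
res$p.value     # about 0.1813
```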
A statistical graphic is a…
ggplot() call
dat |>
  mutate(
    # Categorise age
    age = cut(age, breaks = c(0, 18, 40, 60, Inf),
              labels = paste0("Age: ", c("< 18", "18-40", "40-60", "60+"))),
    # Collapse education levels into three groups
    education = fct_collapse(
      education,
      `Secondary\nor lower` = c("Primary School", "Lower Secondary", "O Level", "A Level"),
      `Post-\nsecondary` = c("National Certificate", "Diploma", "National Diploma", "Higher National Diploma"),
      Tertiary = c("Bachelor Degree", "Master Degree", "PhD")
    )
  ) |>
  ggplot(aes(x = q_fbusage, y = q_fbexpend, col = gender)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE, linewidth = 0.8) +
  facet_grid(education ~ age) +
  labs(x = "Data usage (GB)", y = "Monthly expenditure (BND)", col = "Gender")
Any \((x,y)\) coordinate data, e.g. locations of
A collection of points connected by lines, e.g.
A closed two-dimensional area formed by connecting a finite number of line segments, e.g.
Everything is related to everything else, but near things are more related than distant things. —Waldo Tobler, on the ‘First Law of Geography’
# A tibble: 8 × 2
mukim spend
<chr> <dbl>
1 Mukim Seria 116.
2 Mukim Berakas A 115.
3 Mukim Pangkalan Batu 121.
4 Mukim Berakas B 113.
5 Mukim Kota Batu 137.
6 Mukim Tanjong Maya 106.
7 Mukim Pekan Tutong 131.
8 Mukim Gadong A 120.
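A table like this is a group-wise average. The sketch below shows the idea in base R on a toy data frame; the numbers are made up, and the talk's own code may differ (with the tidyverse loaded, `group_by(mukim)` followed by `summarise(spend = mean(q_fbexpend))` is the usual equivalent).

```r
# Toy data (hypothetical values)
toy <- data.frame(
  mukim = c("Mukim Seria", "Mukim Seria", "Mukim Gadong A", "Mukim Gadong A"),
  q_fbexpend = c(100, 120, 90, 150)
)

# Mean expenditure per mukim
spend_by_mukim <- aggregate(q_fbexpend ~ mukim, data = toy, FUN = mean)
names(spend_by_mukim)[2] <- "spend"
spend_by_mukim
```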
Understanding spatial ICT spending or usage patterns reveals digital inequality and empowers targeted investments for a more connected, inclusive Brunei.
# (Comment section) Describe the top reason that limits your internet access.
head(dat$q_limiting, 5)
[1] "I'll almost never download software that costs hundreds of dollars upfront – it's just not worth breaking the bank every five minutes to keep basic services running fine."
[2] "I'd rather not get internet just to pay an extra $50 for installation that'll likely wear off within a year anyway."
[3] "I don't use Video Calls enough because poor calls keep dropping mid-conversation."
[4] "I wish I had reliable internet at home, but it's so easy for me to just use my neighbor's place when I need it."
[5] "I'd love to stream more videos and games if it weren't for how expensive it costs me every month right now."
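Free-text answers like these can be summarised by counting words. The sketch below does this in base R on three made-up comments with a tiny hand-picked stop-word list; packages such as tidytext and tm offer fuller tools for real data.

```r
# Toy comments (hypothetical, not from the survey)
comments <- c(
  "Internet is too expensive every month",
  "Calls keep dropping, internet is slow",
  "Too expensive to install"
)

# Lower-case, strip punctuation, split into words
clean <- tolower(gsub("[[:punct:]]", "", comments))
words <- unlist(strsplit(clean, "\\s+"))

# Drop common filler words (illustrative list only)
stopwords <- c("is", "too", "to", "every", "keep")
words <- words[!words %in% stopwords]

# Word frequencies, most common first
freq <- sort(table(words), decreasing = TRUE)
freq
```

In this toy example, "internet" and "expensive" each appear twice, hinting at cost as a recurring theme.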
Quarto is an open-source scientific and technical publishing system. It enables you to create dynamic documents, reports, presentations, and websites using R code and Markdown language.
DEMO See report.qmd file
Questions?