Useful data science functions in R
As usual, before starting, load all the packages you need.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Data management
Importing data sets
Data can be imported by going to $\text{File}\rightarrow\text{Import Dataset}$. Alternatively, the code is
# code for importing data
Converting to tibbles
as_tibble(iris)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 140 more rows
Subsetting
To extract columns, use the $
symbol.
iris$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
## [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
## [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
## [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
## [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
## [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
Reshaping
reshape2::melt(table(diamonds$cut, diamonds$color),
var = c("cut", "color"))
## cut color value
## 1 Fair D 163
## 2 Good D 662
## 3 Very Good D 1513
## 4 Premium D 1603
## 5 Ideal D 2834
## 6 Fair E 224
## 7 Good E 933
## 8 Very Good E 2400
## 9 Premium E 2337
## 10 Ideal E 3903
## 11 Fair F 312
## 12 Good F 909
## 13 Very Good F 2164
## 14 Premium F 2331
## 15 Ideal F 3826
## 16 Fair G 314
## 17 Good G 871
## 18 Very Good G 2299
## 19 Premium G 2924
## 20 Ideal G 4884
## 21 Fair H 303
## 22 Good H 702
## 23 Very Good H 1824
## 24 Premium H 2360
## 25 Ideal H 3115
## 26 Fair I 175
## 27 Good I 522
## 28 Very Good I 1204
## 29 Premium I 1428
## 30 Ideal I 2093
## 31 Fair J 119
## 32 Good J 307
## 33 Very Good J 678
## 34 Premium J 808
## 35 Ideal J 896
Variation
Continuous variables
5-point summaries
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
ggplot(reshape2::melt(iris), aes(x = variable, y = value)) +
geom_boxplot()
## Using Species as id variables
Note: in the above code, we used the melt()
function in reshape2
package to aggregate the data.
Explore what melt()
does by running it in the console.
Histograms
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Density plots
ggplot(iris, aes(x = Sepal.Length)) +
geom_density(fill = "blue")
Discrete variables
Frequency tables
table(mpg$class)
##
## 2seater compact midsize minivan pickup subcompact suv
## 5 47 41 11 33 35 62
prop.table(table(mpg$class))
##
## 2seater compact midsize minivan pickup subcompact suv
## 0.02136752 0.20085470 0.17521368 0.04700855 0.14102564 0.14957265 0.26495726
Barplots
ggplot(mpg, aes(x = reorder(class, class, FUN = length))) +
geom_bar() +
labs(x = "Class")
Note: The reorder()
function sorts the bars… the syntax is a bit tricky to understand, so take it as is for now.
Covariation
Continuous variables
Scatter plots
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
Smooth lines
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Binning
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_boxplot(aes(group = cut_width(displ, 0.5)))
Discrete variables
Contingency tables
table(diamonds$cut, diamonds$color)
##
## D E F G H I J
## Fair 163 224 312 314 303 175 119
## Good 662 933 909 871 702 522 307
## Very Good 1513 2400 2164 2299 1824 1204 678
## Premium 1603 2337 2331 2924 2360 1428 808
## Ideal 2834 3903 3826 4884 3115 2093 896
t(table(diamonds$cut, diamonds$color))
##
## Fair Good Very Good Premium Ideal
## D 163 662 1513 1603 2834
## E 224 933 2400 2337 3903
## F 312 909 2164 2331 3826
## G 314 871 2299 2924 4884
## H 303 702 1824 2360 3115
## I 175 522 1204 1428 2093
## J 119 307 678 808 896
Count plots
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
Heat maps
dat <- reshape2::melt(table(diamonds$cut, diamonds$color),
var = c("cut", "color"))
ggplot(dat, aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = value))
Continuous x discrete
Faceting
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ class, nrow = 2)
Adding additional aesthetics
ggplot(mpg, aes(x = displ, y = hwy, col = class)) +
geom_point()