Introduction to R
Motivation
R is a featureful and sometimes magical language for doing statistical analysis, and it's free. This workshop won't cover much of the statistics you can do with R, but it will help you get your data in order.
Prerequisites: programming/scripting experience in any other language (e.g. you've used a for loop before, used functions/libraries)
What you'll learn: writing R scripts (using the RStudio IDE), installing and using packages, reading/sorting/merging/filtering tabular data (e.g. CSV files), some basic statistics (incl. linear regression), visualising data/models
Things to read:
- SWC: Programming with R
- Quick-R
- Cookbook for R
- ggplot2 plotting library docs
- R Cheat Sheet
- An Introduction to Statistical Learning -- with applications in R
Getting started
Workshop backchannel questions: https://etherpad.mozilla.org/YMdULxOqGA
In this workshop you'll need the following software installed on your computer:
- R programming system
- RStudio development "environment". It's a fancy tool that makes programming in R much better.
Once installed, launch RStudio and follow along.
Some things to note about RStudio:
- Editor pane
- Command pane
- Environment pane
- Help/Display pane
Download the data files we'll be using in this workshop:
Open the data file in the RStudio file editor (File->Open file...). Note how unlovely a CSV file is when you look at it as just plain text.
Before we go any further, open a new R script file: File->New File->R Script. We'll use this to put our R code in as we go. Save the blank file somewhere sensible (e.g. your Desktop, your Documents folder). Call it workshop.R.
The very basics
1+5
weight <- 5 # Notice what happens in the Environment tab
weight
weight = 5 # <- and = are mostly equivalent for assignment
Highlight a command in the script editor and run it using ctrl-enter.
Loading data
Loading CSV data in R is easy using the read.csv() function:
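data <- read.csv('data.csv') # you may need to give a full path instead, e.g.
# /Users/jp/Downloads/data.csv
# or, on Windows (use forward slashes or doubled backslashes inside R strings),
# C:/Downloads/data.csv
# Later steps treat text columns such as sex as factors;
# in R >= 4.0 that needs read.csv('data.csv', stringsAsFactors = TRUE)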
Notice:
- The data variable appears in the environment. Also, check ls().
- Click data to explore the variable.
- <- and = are mostly equivalent for assignment.
- Use ?read.csv for help
- Use ??read.csv to search help
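To try read.csv() without the workshop files, here is a minimal sketch that writes a tiny CSV to a temporary file and reads it back. The table contents and the names toy, csv_path and loaded are made up for illustration; they are not the workshop data.

```r
# Hypothetical toy table, written to a temporary CSV file
toy <- data.frame(SubjectID = c("s01", "s02", "s03"),
                  sex = c("F", "M", "F"),
                  Age = c(34, 51, 27))
csv_path <- file.path(tempdir(), "toy.csv")   # portable way to build a path
write.csv(toy, csv_path, row.names = FALSE)

# stringsAsFactors = TRUE reads text columns as factors
# (since R 4.0 the default is FALSE, which leaves them as character)
loaded <- read.csv(csv_path, stringsAsFactors = TRUE)
str(loaded)   # column types and a preview of the values
ls()          # lists the variables now in the environment
```

Building the path with file.path() avoids having to think about forward slashes versus backslashes.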
Data types
R has several data types. You should at least know about:
- Vectors: a = c(1,2,3,4). Reference using brackets, [], e.g. first element: a[1]
- Lists: a = list(1,2,3,4). Reference using double brackets, [[]], e.g. first element: a[[1]]. Lists can also be "named lists", e.g. heights = list(alice=180,bob=160,charlie=165), and reference: heights[["alice"]], or heights$alice
- Factors: a nominal vector. gender = factor(c("M","F")), levels(gender)
- Data frames (data.frame): the de facto table-like data structure with named columns and rows. Somewhat like a list of lists/vectors/factors.
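The four types can be tried side by side in the console. This is a small sketch; the variable names b and df and the values in them are only for illustration.

```r
# Vector: every element has the same type
a <- c(1, 2, 3, 4)
a[1]                 # first element (R indexes from 1)

# List: elements may have different types
b <- list(1, "two", TRUE)
b[[2]]               # the element itself
b[2]                 # a list of length one containing that element

# Named list
heights <- list(alice = 180, bob = 160, charlie = 165)
heights[["alice"]]
heights$alice

# Factor: levels are sorted alphabetically by default
gender <- factor(c("M", "F"))
levels(gender)

# Data frame: a list of equal-length columns
df <- data.frame(name = c("alice", "bob"), height = c(180, 160))
df$height            # a column is an ordinary vector
is.list(df)
```

Note the difference between b[[2]] and b[2]: double brackets pull the element out, single brackets return a shorter list.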
More on dataframes
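head(data) # first six rows
str(data) # column types and a preview of the values
summary(data) # per-column summaries
nrow(data)
ncol(data)
dim(data) # rows, then columns
length(data) # number of columns (a data.frame is a list of columns)
names(data) # column names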
# Getting data from a data.frame
data$dx
data$sex
summary(data$sex)
table(data$sex)
str(data$sex)
str(data$SubjectID)
levels(data$sex)
nlevels(data$sex)
mean(data$Age)
sd(data$Age)
max(data$Age)
# Converting a column to a factor
data$DX = factor(data$dx)
# recoding factors
data$DX = factor(data$dx,levels=c("0","1"),labels=c("ctrl","case"))
levels(data$DX) = c("control","case") # warning: this renames levels by position, so keep the order the same
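The recoding step is easy to check on a toy vector before applying it to real data. A minimal sketch, assuming a hypothetical vector of 0/1 codes:

```r
dx <- c(0, 1, 1, 0, 1)   # hypothetical 0/1 diagnosis codes

# levels= says which values to expect; labels= names them in the same order
DX <- factor(dx, levels = c(0, 1), labels = c("ctrl", "case"))
table(DX)

# Assigning to levels() renames by position, so the order must match
levels(DX) <- c("control", "case")
levels(DX)

# Cross-tabulate against the original codes to confirm the mapping
table(dx, DX)
```

The final table() call is a cheap sanity check: every 0 should land in one column and every 1 in the other.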
Merge and filter
Grabbing parts of a data.frame by indices:
# each of these creates a new data.frame object
data[1:10,] # rows 1-10
data[,1:10] # columns 1-10
data[c(1,4,7),] # rows 1,4 and 7
# data from just one column
data$SubjectID[1:10] # first 10 subject IDs
# 1:10 is a "range", shorthand for
seq(1,10)
Filter/subset:
females = subset(data, sex = "F")
# ooooops: a single = is not a comparison, so no rows are filtered out
females = subset(data, sex == "F") # use ==
# merging row-wise
eth1 = subset(data, ethnicity == 1)
eth2 = subset(data, ethnicity == 2)
eth1_2 = rbind(eth1,eth2)
# more complex subsets
females_eth1 = subset(females, ethnicity == 1)
females_eth1 = subset(data, sex == "F" & ethnicity == 1)
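To see what these filters do without the workshop file, here is a sketch on a made-up table. The people data frame and its values are hypothetical; the column names mirror the ones used above.

```r
# Hypothetical toy data
people <- data.frame(id = 1:6,
                     sex = c("F", "M", "F", "F", "M", "F"),
                     ethnicity = c(1, 1, 2, 1, 2, 2))

# subset() with a combined condition
f1 <- subset(people, sex == "F" & ethnicity == 1)
f1

# The same filter written with a logical vector inside [ , ]
keep <- people$sex == "F" & people$ethnicity == 1
f1_brackets <- people[keep, ]
f1_brackets

# A single = is not a comparison: the argument is ignored
# (recent versions of R may warn), so every row comes back
nrow(subset(people, sex = "F"))   # same as nrow(people)
```

subset() is convenient interactively because column names can be used directly; the bracket form does the same job and makes the logical vector explicit.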
Merging data.frames:
gi = read.csv('data_gi.csv')
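merged = merge(data, gi) # joins on the columns both tables share; keeps only matching rows
merged_all_x = merge(data, gi, all.x = TRUE) # keeps every row of data, with NA where gi has no match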
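The difference between the two merges is easiest to see on small tables. A sketch with hypothetical data (the subjects and scores tables and the score column are made up for illustration):

```r
# Hypothetical tables sharing a SubjectID column
subjects <- data.frame(SubjectID = c("s01", "s02", "s03"),
                       Age = c(34, 51, 27))
scores <- data.frame(SubjectID = c("s01", "s03"),
                     score = c(0.8, 0.5))

# Default: keep only subjects present in both tables
inner <- merge(subjects, scores, by = "SubjectID")

# all.x = TRUE: keep every row of the first table, filling gaps with NA
left <- merge(subjects, scores, by = "SubjectID", all.x = TRUE)

nrow(inner)
nrow(left)
left[is.na(left$score), ]   # subjects with no match in scores
```

Naming the key with by= is optional when the tables share exactly one column name, but it guards against accidentally joining on a column that happens to appear in both.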
Some statistics
fit = lm(cerebral_vol_l ~ Age, data = data)
summary(fit)
# extract the residuals and check them against a normal distribution
res = resid(fit)
qqnorm(res)
qqline(res)
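The same steps can be tried on cars, a small data set that ships with R, if the workshop data isn't to hand. This is only a sketch of the workflow; fit_cars and new_speeds are illustrative names.

```r
# cars is a small data set that ships with R (columns: speed, dist)
fit_cars <- lm(dist ~ speed, data = cars)

coef(fit_cars)                   # intercept and slope
summary(fit_cars)$r.squared      # proportion of variance explained

# Predict stopping distance for new speeds
new_speeds <- data.frame(speed = c(10, 20))
predict(fit_cars, newdata = new_speeds)

# Residuals sum to (numerically) zero when the model has an intercept
sum(resid(fit_cars))
```

The formula reads "dist modelled by speed"; the names in it are looked up as columns of the data argument.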
Plotting
plot(cerebral_vol_l ~ Age, data = data)
abline(fit)
# Now, using ggplot2
install.packages("ggplot2") # the package name must be quoted; only needed once
library(ggplot2)
ggplot(data = data, aes(x=Age, y=cerebral_vol_l)) + geom_point()
ggplot(data = data, aes(x=Age, y=cerebral_vol_l)) + geom_point() + geom_smooth(method='lm')
# histogram
hist(data$cerebral_vol_l)
ggplot(data, aes(x=cerebral_vol_l)) + geom_histogram()
# PDF (can also manually use export in RStudio)
pdf("histogram.pdf")
hist(data$cerebral_vol_l)
dev.off()
?barplot
?boxplot
?plot.default
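A PDF device stays open until dev.off() is called, so several plots can go into one file. A minimal sketch using the built-in cars data set; the file name plots.pdf and the temporary directory are arbitrary choices for illustration.

```r
pdf_path <- file.path(tempdir(), "plots.pdf")
pdf(pdf_path, width = 7, height = 5)    # page size in inches

hist(cars$dist, main = "Stopping distance", xlab = "dist")
plot(dist ~ speed, data = cars)         # each new plot starts a new page

dev.off()                               # closing the device finishes the file
file.exists(pdf_path)
```

Forgetting dev.off() is a common reason for a PDF that won't open.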