Introduction to R

Motivation: R is a featureful and sometimes magical language for statistical analysis, and it's free. This workshop won't cover much of the statistics you can do with R, but it will help you get your data in order.

Prerequisites: programming/scripting experience in any other language (e.g. you've used a for loop before, used functions/libraries)

What you'll learn: writing R scripts (using the RStudio IDE), installing and using packages, reading/sorting/merging/filtering tabular data (e.g. CSV files), some basic statistics (incl. linear regression), visualising data/models

Things to read:

Getting started

Workshop backchannel questions: https://etherpad.mozilla.org/YMdULxOqGA

In this workshop you'll need the following software installed on your computer:

R programming system
RStudio development environment. It's an IDE that makes programming in R much more pleasant.

Once both are installed, launch RStudio and follow along.

RStudio

Some things to note about RStudio:

Editor pane
Command pane
Environment pane
Help/Display pane

Download the data files we'll be using in this workshop:

Open the data file in the RStudio file editor: File->Open File.... Note how unlovely a CSV file is when you look at it as plain text.

Before we go any further, open a new R script file: File->New File->R Script. We'll put our R code in it as we go. Save the blank file somewhere sensible (e.g. your Desktop or your Documents folder). Call it workshop.R.

The very basics

1+5
weight <- 5     # Notice what happens in the Environment tab
weight
weight = 5      # <- and = are mostly equivalent

Highlight a command in the script editor and run it with Ctrl+Enter (Cmd+Enter on macOS).

Loading data

Loading CSV data in R is easy using the read.csv() function:

data <- read.csv('data.csv')   # you may need a fuller path, e.g.
                               #     /Users/jp/Downloads/data.csv
                               # or, on Windows (use / or \\, not a single \),
                               #     C:/Downloads/data.csv

Notice:

data variable appears in the environment. Also, check ls().
Click data to explore the variable.
<- and = are mostly equivalent for assignment.
Use ?read.csv for help
Use ??read.csv to search help
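
If you don't have the workshop file to hand, the points above can be tried on a throwaway file. This is a minimal sketch: the toy table and its column names are made up for illustration and are not the workshop data.

```r
# Hypothetical toy table; the values are invented for this sketch
toy <- data.frame(SubjectID = c("s01", "s02", "s03"),
                  Age = c(23, 35, 41))

# Write it to a temporary CSV file, then load it back with read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

loaded <- read.csv(csv_path)
str(loaded)      # column names and the types R inferred from the file
ls()             # the new variables are now listed in the environment
```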

Data types

R has several data types. You should at least know about:

Vectors: a = c(1,2,3,4). Reference using brackets, [], e.g. first element: a[1]
Lists: a = list(1,2,3,4). Reference using double brackets, [[]], e.g. first element: a[[1]]. Lists can also be "named lists", e.g. heights = list(alice=180,bob=160,charlie=165), and reference: heights[["alice"]], or heights$alice
Factors: a categorical (nominal) vector, e.g. gender = factor(c("M","F")). List its categories with levels(gender)
Data frames (data.frame): the de facto table-like data structure with named columns and rows. Somewhat like a list of lists/vectors/factors.
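
The four types above can be compared side by side in a short sketch. The people table is a made-up example, not part of the workshop data.

```r
# Vector: every element has the same type; index with [ ]
a <- c(1, 2, 3, 4)
a[1]                    # first element (R counts from 1)

# Named list: elements can be of any type; index with [[ ]] or $
heights <- list(alice = 180, bob = 160, charlie = 165)
heights[["alice"]]
heights$bob

# Factor: a categorical vector with a fixed set of levels
gender <- factor(c("M", "F", "F"))
levels(gender)

# Data frame: a table whose columns are equal-length vectors or factors
# (hypothetical example data)
people <- data.frame(name = c("alice", "bob", "charlie"),
                     height = c(180, 160, 165),
                     gender = factor(c("F", "M", "M")))
people$height           # a single column is just a vector
is.list(people)         # a data frame is stored as a list of columns
```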

More on dataframes

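head(data)       # first six rows
str(data)        # structure: column names, types and a preview of values
summary(data)    # per-column summary statistics

nrow(data)       # number of rows
ncol(data)       # number of columns
dim(data)        # both: rows, then columns
length(data)     # number of columns (a data.frame is a list of columns)
names(data)      # column names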

# Getting data from a data.frame
data$dx
data$sex

summary(data$sex)
table(data$sex)

str(data$sex)
str(data$SubjectID)

levels(data$sex)
nlevels(data$sex)

mean(data$Age)
sd(data$Age)
max(data$Age)

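# Converting a column to a factor
data$DX = factor(data$dx)

# Recoding factors: levels are the raw values, labels are the names to show
data$DX = factor(data$dx,levels=c("0","1"),labels=c("ctrl","case"))
levels(data$DX) = c("control","case")   # careful: this renames levels by position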
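
The recoding step is easier to follow on a tiny vector. This sketch uses a hypothetical vector of 0/1 codes standing in for data$dx, and reads the "warning" above as a caution that assigning to levels() renames by position.

```r
# Hypothetical 0/1 diagnosis codes, standing in for data$dx
dx <- c(0, 1, 1, 0, 1)

# levels = the raw values to expect, labels = the names to show for them
DX <- factor(dx, levels = c(0, 1), labels = c("ctrl", "case"))
table(DX)

# Assigning to levels() renames by position: the first level becomes
# "control" and the second stays "case". Writing the names in the wrong
# order would silently swap the meaning of every observation.
levels(DX) <- c("control", "case")
levels(DX)
as.character(DX)
```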

Merge and filter

Grabbing parts of a data.frame by indices:

# each of these creates a new data.frame object
data[1:10,]            # rows 1-10
data[,1:10]            # columns 1-10
data[c(1,4,7),]        # rows 1,4 and 7

# data from just one column
data$SubjectID[1:10]   # first 10 subject IDs 

# 1:10 is a "range", shorthand for the sequence below
seq(1,10)
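
The indexing forms above can be checked on a small table. A minimal sketch, assuming a made-up data frame called toy; note that picking a single column with [, 1] gives back a plain vector rather than a data.frame.

```r
# Hypothetical toy table
toy <- data.frame(id = 1:5, score = c(10, 20, 30, 40, 50))

first_two <- toy[1:2, ]       # rows 1-2, all columns: still a data.frame
picked    <- toy[c(1, 4), ]   # rows 1 and 4
one_col   <- toy[, 1]         # a single column comes back as a vector

is.data.frame(first_two)
is.data.frame(one_col)

# The colon operator and seq() produce the same sequence
all(1:10 == seq(1, 10))
```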

Filter/subset:

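females = subset(data, sex = "F")     # no error, but nothing is filtered

# ooooops: a single = does not compare, so every row comes back
females = subset(data, sex == "F")    # use == to test for equality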


# merging row-wise
eth1 = subset(data, ethnicity == 1)
eth2 = subset(data, ethnicity == 2)
eth1_2 = rbind(eth1,eth2)

# more complex subsets
females_eth1 = subset(females, ethnicity == 1) 
females_eth1 = subset(data, sex == "F" & ethnicity == 1)
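
The = versus == slip is easy to miss because R raises no error. The sketch below shows the difference on a made-up table whose column names mirror the workshop file; the last line is an equivalent filter written with plain logical indexing, offered as an alternative rather than as the workshop's method.

```r
# Hypothetical toy data; column names mirror the workshop file
toy <- data.frame(sex = c("F", "M", "F", "M"),
                  ethnicity = c(1, 1, 2, 2))

wrong <- subset(toy, sex = "F")     # single =, so no condition is applied
right <- subset(toy, sex == "F")    # == compares each row against "F"

nrow(wrong)    # same as nrow(toy): nothing was filtered
nrow(right)

# The combined filter, written with logical indexing instead of subset()
toy[toy$sex == "F" & toy$ethnicity == 1, ]
```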

Merging data.frames:

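gi = read.csv('data_gi.csv')

# joins on the columns both tables share; keeps only rows with a match
merged = merge(data, gi)

# also keeps rows of data that have no match in gi (filled with NA)
merged_all_x = merge(data, gi, all.x = TRUE)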
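
To see what all.x changes, it helps to merge two tiny tables where one subject is missing from the second. The tables and their columns here are hypothetical, chosen only to illustrate the two calls above.

```r
# Hypothetical tables sharing a SubjectID column
subjects <- data.frame(SubjectID = c("s01", "s02", "s03"),
                       Age = c(23, 35, 41))
scores <- data.frame(SubjectID = c("s01", "s03"),
                     score = c(0.7, 0.4))

inner <- merge(subjects, scores)                 # only subjects found in both
left  <- merge(subjects, scores, all.x = TRUE)   # every subject; NA where no score

inner
left
```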

Some statistics

fit = lm(cerebral_vol_l ~ Age, data = data)
summary(fit)

# extract the residuals and check them for normality with a Q-Q plot
res = resid(fit)
qqnorm(res)
qqline(res)
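
The same fit-then-inspect steps can be tried without the workshop file. This sketch simulates data with a known slope, so the variable names (sim, fit_sim, res_sim) and the numbers are assumptions for illustration only.

```r
# Simulated data: y depends linearly on x, plus random noise
set.seed(1)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)
sim <- data.frame(x = x, y = y)

fit_sim <- lm(y ~ x, data = sim)
coef(fit_sim)                 # estimated intercept and slope
summary(fit_sim)$r.squared    # proportion of variance explained

# With an intercept in the model, the residuals sum to (numerically) zero
res_sim <- resid(fit_sim)
sum(res_sim)
```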

Plotting

plot(cerebral_vol_l ~ Age, data = data)
abline(fit)

# Now, using ggplot2
install.packages("ggplot2")   # only needed once; the package name must be quoted
library(ggplot2)
ggplot(data = data, aes(x=Age, y=cerebral_vol_l)) + geom_point()
ggplot(data = data, aes(x=Age, y=cerebral_vol_l)) + geom_point() + geom_smooth(method='lm')


# histogram 
hist(data$cerebral_vol_l)
ggplot(data, aes(x=cerebral_vol_l)) + geom_histogram()

# PDF (can also manually use export in RStudio)
pdf("histogram.pdf")
hist(data$cerebral_vol_l)
dev.off()

?barplot
?boxplot
?plot.default
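
The PDF export above follows an open, draw, close pattern: pdf() opens a file device, plotting calls draw into it, and dev.off() closes it. A minimal sketch, using simulated values in place of data$cerebral_vol_l and a temporary folder as an assumed output location:

```r
# Simulated values standing in for data$cerebral_vol_l
set.seed(1)
values <- rnorm(100)

out_path <- file.path(tempdir(), "histogram.pdf")
pdf(out_path)      # open a PDF device: plots now go to the file
hist(values)
dev.off()          # close the device, which finishes writing the file

file.exists(out_path)
```

If the file looks empty or won't open, the usual cause is a missing dev.off().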