Data Wrangling
Using R and the dplyr package to get your data in order
Workshop date: 2018-02-23
Author: Jon Pipitone, 2018
Big Ideas
Here are the big takeaways I’d like you to leave with from this workshop:
-
Automate. Don’t manipulate data by hand.
dplyrgives you functions (“verbs”) to express how to manipulate your data in code. This is repeatable, and precise. -
Pipes: functions that transform data. Seeing functions a machines that take input, transform, and produce output is an important paradigm in computing that will come up again and again.
Learning objectives
- Review RStudio, and parts of the IDE: console, editor, heap
- Review R variables and calling functions
- Install and load libraries using
install.packages(),library() - Learn about
tidyverse, and whatdplyris for - Load data using
read_csv() - Add columns to a tibble using
mutate() - Extract columns using
select() - Write “pipelines” using the
%>%operator - Subset rows using
filter() - Aggregate a dataset using
summarise() - Aggregate by groups using
group_by() - Merge tables using
inner_join()and friends
Tasks
Set up
- Install RStudio and R
- Install the
tidyversepackage using theinstall.packagesfunction, and load it using thelibrary()function:install.packages('tidyverse') # this may take some time library(tidyverse) - Install the
hflightspackage and load it
In this workshop, we will be redoing some of the analysis we did in the spreadsheets workshop. You can download CSV versions of the sheets here:
You can also download this data directly from R using the following code:
download.file("https://pipitone.github.io/qmed-computes/assets/workshop-data/data-wrangling/vitals.csv",
destfile="vitals.csv")
download.file("https://pipitone.github.io/qmed-computes/assets/workshop-data/data-wrangling/investigations.csv",
destfile="investigations.csv")
In this workshop, we will follow the DataCamp lesson on dplyr.
Introduction to dplyr and the tidyverse
- Review some relevant R
a = 1means the variableagets the value1- Variable assignment may also be written
a = 1 - Functions look like they do in excel, e.g.
max(4,6,20,1) - You can assign the results of a function to a variable, e.g.
a = max(4,6,20,1) - Get help on any function by using the
help()function, e.g.help(max) - There are several useful functions for exploring data tables. Have a look
at the
hflightsdata:
library(hflights) library(tidyverse) colnames(hflights) nrows(hflights) head(hflights) - Watch the intro to dplyr commands, called “verbs”
Questions?
Selecting columns
-
Do the DataCamp exercise on using
select - Load the vitals.csv data using the
read_csv()functionvitals_csv = read_csv('vitals.csv') # read_csv will spit out information about how it guessed the column formats -
Take a look at the columns (hint:
colnames()), and the data itself (head()) - Use
select()to extract the columns starting withparticipant_idthrough todiastolicinto a new variablevitals:vitals = select(vitals_csv, participant_id, age, sex, HR, oxygen_sat, RR, temp, systolic, diastolic) # here's a shortcut vitals = select(vitals_csv, participant_id:diastolic) - Explore the table stored in
vitalsandvitals_csvby just typing the variable names out on the console to display them, and also by usinghead(),colnames(), or the RStudio IDE
Adding columns
-
Watch the DataCamp video on
mutate(). -
Do the two exercises that follow the video only.
-
Use
mutate()to add aMAPcolumn computed using the following formula, and store the resulting data table in a new variable calledvitals_MAP:MAP = 1/3 systolic + 2/3 diastolic -
Bonus: use the function
ifelse()to set the mean arterial pressure based on heart rate:If HR > 100, MAP = 1/2 systolic + 1/2 diastolic otherwise, MAP = 1/3 systolic + 2/3 diastolichint: your call to mutate will look something like:
mutate(vitals, MAP = ifelse(...))
Filtering data
-
Watch the DataCamp video on
filter()and do the three exercises that follow (everything up toarrange()). -
Practice using
filter()to grab parts of the vitals dataset, e.g:- Find all subjects older than 65
- Find all febrile subjects
- Find all subjects that are tachycardic but afebrile
- Find all subjects with any concerning vital
Arrange
arrange() sorts your data tables. It can be useful, but I recommend learning
about this some other time.
Summarizing a dataset
-
Watch the DataCamp video on summarise()
-
Do the exercise the follows the video. You can do the other exercises, but they require you to know about NA values and how to filter them using the
is.na()function, which we haven’t covered.
Syntactic suger: Pipes
-
Watch the DataCamp video on chaining functions. Fair warning: this video might be a tad confusing at first.
-
Skip the exercise that immediate follows the video, and do the two “Drive or fly” exercises instead.
-
Write your code that selects, mutates, and filters the vitals data from the previous exercises using pipes. My answer is below.
Summarising by groups
-
Watch the DataCamp video on
group_by -
Do the exercise that follows.
-
Now, let’s figure out the average age for participants who are tachycardic, and those who aren’t. First, group by the vitals dataset by HR > 100, then summarise by computing the
avg_ageas the mean age.My answer is below.
Merging tables
-
Load up the investigations.csv data, and explore it to remind yourself of the columns and the data. Note that we don’t have investigations for all of the subjects.
-
Also, note that the vitals table and the investigations tables share the
participant_idcolumn. If we wanted to merge the tables, we have to use this column to match rows. We can do this with a “join” function. There are several flavours depending on how exactly you want to do the merge. Runhelp(join)to find out more. -
If we want to merge all of the investigations data into the vitals table, and include empty values when the investigations are missing, we can use the
left_joinfunction:vitals = read_csv('vitals.csv') investigations = read_csv('investigations.csv') merged = left_join(vitals, investigations, by = 'participant_id') -
What does
right_joindo differently? -
What does
inner_joindo?
Putting it all together
Compute the average MAP for participants with uremia (urea > 7) by sex.
Answers
Pipes
read_csv('vitals.csv') %>%
select(participant_id:diastolic) %>%
mutate(MAP = 1/3 * systolic + 2/3 * diastolic) %>%
filter(temp > 38.0)
Summarising by groups
vitals %>%
group_by(HR > 100) %>%
summarise(avg_age = mean(age))
Putting it all together
investigations = read_csv('investigations.csv')
read_csv('vitals.csv') %>%
select(participant_id:diastolic) %>% # this isn't really necessary
filter(urea > 7) %>%
mutate(MAP = 1/3 * systolic + 2/3 * diastolic) %>%
left_join(investigations) %>% # tables joined on participant_id automatically!
group_by(sex) %>%
summarise(avg_MAP = mean(MAP))
Resources
-
DataCamp’s Introduction to the tidyverse course.
The first chapter goes through the important parts of the
dplyrverbs with a different dataset. -
Data Carpentry’s lesson on Manipulating and analysing data with dplyr.
This is another great walkthrough of dplyr that includes some other more interesting features.
-
This online book is a great and practical, analysis-focused resource for learning dplyr and the rest of tidyverse, written by the person who spearheaded the tidyverse itself.
-
tidyverse.org. The website for the tidyverse project has lots of good links out to resources to learn.