Introduction to the Linux Shell
Motivation: The shell is one of the most basic ways you can interact with a computer and it allows you to automate your analysis. It is often the only way to interact with scientific computing environments (e.g. SCC, SciNet), so you'll have to learn it eventually.
Prerequisites: Minimal familiarity with the shell (e.g. you should know
what the commands cd, ls, mv, cp, rm and man do).
What you will learn: Running commands on many files (globbing, looping, if statements), reading and writing to files, sorting/filtering data in files
Things to read:
Getting started
# connect to the SCC
ssh test_user@192.168.214.10
# make a folder for yourself, and cd into it
mkdir given-name_family-name # e.g. jon_pipitone
cd jon_pipitone
# Run a script to generate the data for this workshop
bash ~jpipitone/make-data.sh
The make-data.sh script can be found here:
https://raw.githubusercontent.com/pipitone/computing-skills/7128668508a8495d302bed5f396c7bb0b732961c/bin/make-data.sh
Exploring the data
The script you ran generates a phoney dataset of imaging and genetics data for
a number of subjects in a folder called data. Have a look around.
# for example:
ls data
cd data
ls S000
cd ..
Q: How do you see what is in the demographics file?
$ cat data/S000/demographics.csv
$ less data/S000/demographics.csv # alternatively
Q: How do you find out the size of the files in a subject folder?
$ ls -l data/S000
total 16
-rw-rw-r-- 1 jp jp   35 Mar  9 16:05 demographics.csv
-rw-rw-r-- 1 jp jp 5050 Mar  9 16:05 genome.dat
-rw-rw-r-- 1 jp jp 2471 Mar  9 16:05 T1.nii
Q: How many lines are there in the genome.dat file?
For this you are going to need to know a new command, wc, which stands for
"word count". If you run wc file it will print three things, in this order:
the number of lines in the file, the number of words, and the number of
characters (bytes). Try it.
$ wc data/S000/genome.dat
  50   50 5050 data/S000/genome.dat
You can also use the -l option to only show the number of lines:
$ wc -l data/S000/genome.dat
50 data/S000/genome.dat
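To see how the three counts relate, here is a small sketch on a throwaway file. The file and its contents are made up for illustration and are not part of the workshop data. It also uses the $(...) notation to capture a command's output, which is covered in the bonus section.

```shell
# Create a scratch file with three lines of two words each (hypothetical content).
tmpfile=$(mktemp)
printf 'a b\nc d\ne f\n' > "$tmpfile"

# Ask wc for one count at a time. Reading from standard input (<)
# keeps the file name out of the output.
lines=$(wc -l < "$tmpfile")   # -l counts lines
words=$(wc -w < "$tmpfile")   # -w counts words
bytes=$(wc -c < "$tmpfile")   # -c counts bytes

echo "lines=$lines words=$words bytes=$bytes"
rm -f "$tmpfile"
```

Each line here is four bytes long (two letters, a space, and the newline), which is why the byte count is larger than the other two.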
Q: How many subjects are there?
First, make a list of the subject folders using ls:
$ ls data
Then, pipe that list into the wc command:
$ ls data | wc -l
101
Anything you pipe into wc gets counted. Try it with the history command,
for instance.
Q: How many subjects have a T1.nii image?
You can use ls with a wildcard to match all of the subject folders:
$ ls data/*/T1.nii # lists all of the T1.nii files in subfolders of data
And then you can pipe into wc -l to count that list.
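If you want to convince yourself that the wildcard count is right, you can try it on a tiny stand-in folder where you already know the answer. This is only a sketch: the scratch folder and the three subject names below are invented for the example.

```shell
# Build a tiny stand-in for the data folder (names are hypothetical).
scratch=$(mktemp -d)
mkdir -p "$scratch/data/S000" "$scratch/data/S001" "$scratch/data/S002"
touch "$scratch/data/S000/T1.nii" "$scratch/data/S002/T1.nii"   # S001 has no image

# The wildcard expands to one path per matching file. When its output is
# piped, ls prints one path per line, so wc -l counts the matches.
n_t1=$(ls "$scratch"/data/*/T1.nii | wc -l)
echo "subjects with a T1.nii: $n_t1"

rm -rf "$scratch"
```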
Q: How many subjects have a T1.nii.gz image?
# your turn
Q: How many subjects have either a T1.nii or a T1.nii.gz image?
# your turn
Q: Which subjects don't have a T1.nii file?
# This is trickier. We'll cover this one next time. :-)
Merging and filtering text
Since each subject has a demographics.csv file, it would be nice to collect
all of that data into a single CSV so that we can analyse it.
Use the cat command to concatenate all of the demographic data (hint: use a wildcard).
$ cat data/*/demographics.csv
We can use the program grep to filter lines that match a certain pattern.
For instance, to show only the males ('M'), we could pipe the output to
grep like so:
$ cat data/*/demographics.csv | grep M
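One thing to watch for: grep M matches an M anywhere on the line, not just in the sex column. The sketch below shows the difference on some made-up rows. It assumes sex is the first column (as in the workshop data); the other two columns are invented for the example.

```shell
# Hypothetical rows: sex first, then two invented columns.
rows=$(printf 'M,34,Toronto\nF,29,Montreal\nM,51,Ottawa\nF,40,Toronto\n')

# A bare pattern matches anywhere on the line, so "Montreal" counts too.
loose=$(printf '%s\n' "$rows" | grep M | wc -l)

# ^ anchors the pattern to the start of the line, and the comma ends the
# first field, so only rows whose first column is exactly M match.
strict=$(printf '%s\n' "$rows" | grep '^M,' | wc -l)

echo "loose=$loose strict=$strict"
```

If your counts of males and females look too high, an unanchored pattern like this is a likely culprit.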
Q: How do you search for only the female subjects?
Q: How many male subjects are there? (hint: use wc)
$ cat data/*/demographics.csv | grep M | wc -l
Q: How many female subjects and total # of subjects are there? (hint: use wc)
Hmm.. looks like there are some subjects we don't know the sex of. We can use
the sort command to order text by line. Since sex is the first column, it will
sort the lines by sex.
Q: Sort the demographic data by piping it into the sort command.
$ cat data/*/demographics.csv | sort
Q: Pipe the results into head to show the top part of the sorted list.
BONUS: sort can sort on a specific column only, but you have to tell it how to find the columns in your data. If you're feeling bold, check the man page or google.
# to sort the second column of CSV data
$ cat data/*/demographics.csv | sort --field-separator=, --key=2,2
# By default, sort sorts everything as text. To tell it to sort
# numerically, pass the -n option.
$ cat data/*/demographics.csv | sort --field-separator=, --key=2,2 -n
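The difference between text and numeric sorting is easiest to see on numbers with different lengths. Here is a sketch on made-up rows (the second column is invented for the example). It uses -t and -k, which are the short forms of --field-separator and --key.

```shell
# Hypothetical rows: sex, then an invented numeric column.
rows=$(printf 'M,100\nF,9\nM,25\n')

# Text sort on column 2: values are compared character by character,
# so "100" comes before "25", which comes before "9".
as_text=$(printf '%s\n' "$rows" | sort -t, -k2,2)

# Numeric sort on column 2: 9, then 25, then 100.
as_number=$(printf '%s\n' "$rows" | sort -t, -k2,2 -n)

echo "$as_text"
echo "---"
echo "$as_number"
```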
Next, we'll save the concatenated data into a file by redirecting the output
of the command into a file using the > operator.
cat *files* > master.csv
The > operator takes anything the command before it prints out and prints
it to the named file instead of displaying it on the terminal.
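Be a little careful with >: it replaces whatever was already in the file. The shell also has a >> operator, which appends to the end of the file instead. A minimal sketch on a scratch file (the file is created just for this example):

```shell
out=$(mktemp)

echo "first"  > "$out"    # creates the file (or replaces its contents)
echo "second" > "$out"    # replaces it again: "first" is gone
echo "third" >> "$out"    # >> appends instead of replacing

n_lines=$(wc -l < "$out")
contents=$(cat "$out")
echo "$contents"

rm -f "$out"
```

The file ends up with two lines, "second" and "third".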
Q: How would you save just the male demographic data to a file?
Q: How would you use grep to filter your data in a file?
# You can pipe to grep
$ cat master.csv | grep M
# Or, more simply, you can tell grep to search through the file
$ grep M master.csv
Organizing your data
In order to do some analysis you now want to collect all of your data types into separate folders by type (i.e. put all of your genome data together in one folder, all of your imaging data together in another, etc).
Make a folder for your genome data called genomes
$ mkdir genomes
Copy a few subjects' genome.dat files to the genomes folder.
$ cp data/S000/genome.dat genomes/
$ cp data/S001/genome.dat genomes/
# oops, name conflict..
# okay, we can use copy to rename as we go
$ cp data/S000/genome.dat genomes/S000_genome.dat
$ cp data/S001/genome.dat genomes/S001_genome.dat
...
We need a way to rename our files automatically, one by one. And this is where a loop comes in handy.
For example, here is a loop that prints the numbers 1-5:
$ for i in 1 2 3 4 5
do
echo ${i} # gets run for each number
done
A few new things here:
i is called the "loop variable"; each time through the loop, i takes on a different value from the list.
The list of things i takes turns getting set to is everything after the in.
All lines between do and done get run once for each value of i.
${i} is shell-speak for "the value of i". Get it? $ equals "value"... :-)
Q: How would you print out the letters a through e?
$ for j in a b c d e
do
echo ${j}
done
The list in a for-loop can also be made up of a wildcard that matches files. So, for instance you can loop through all of the files and folders in your current working directory like so:
$ for path in *
do
echo Found: ${path}
ls -l ${path}
done
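One habit worth picking up early: put double quotes around ${path} when you use it. The workshop data has no spaces in its file names, so the loop above works as written, but a name with a space in it would otherwise be split into two separate words. A sketch on two made-up files:

```shell
# Scratch folder with one ordinary name and one name containing a space.
scratch=$(mktemp -d)
touch "$scratch/plain.txt" "$scratch/two words.txt"

count=0
for path in "$scratch"/*
do
    # The double quotes keep "two words.txt" together as one argument.
    ls -l "${path}" > /dev/null
    count=$((count + 1))   # $((...)) does integer arithmetic
done
echo "looped over $count files"

rm -rf "$scratch"
```

The wildcard itself handles the space correctly; it is only the unquoted use of the variable inside the loop that would go wrong.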
Q: Using a for-loop, print out all of the folders in the data/ directory.
$ for folder in data/*
do
echo ${folder}
done
Q: Using a for-loop, print out the number of lines for each of the genome.dat files.
$ for i in data/*/genome.dat
do
wc -l ${i}
done
You could have also done this like so:
$ for i in data/*
do
wc -l ${i}/genome.dat
done
Okay, we're ready to move and rename some files. We'll first do it in a bit of a cumbersome way, and then show you how to do it more easily.
- First cd into your data/ folder.
- From your data/ folder, how would you copy subject S045's genome.dat file into the genomes folder?
$ cp S045/genome.dat ../genomes
- Write a for-loop that prints the names of all of your subject folders. (hint: you just did this a few moments ago :-)
$ for i in *; do
echo ${i}
done
- Edit the for-loop to echo a command to cp the genome.dat file from each subject folder to the genomes folder.
$ for i in *; do
echo cp ${i}/genome.dat ../genomes
done
- The last thing we need to do is give our file a new name once it is in ../genomes. Since ${i} is the name of our subject, we can use that in our name. Edit the for loop to copy each subject's genome.dat file to S045_genome.dat, etc.
$ for i in *; do
echo cp ${i}/genome.dat ../genomes/${i}_genome.dat
done
- Okay, remove the echo and run your for-loop for real!
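Echoing a command before running it for real is a handy "dry run" trick, and you can rehearse the whole thing somewhere harmless first. The sketch below does that in a scratch folder with two made-up subjects (the folder and the file contents are invented for the example), so it never touches your real data.

```shell
# Build a miniature project: two subjects and an empty genomes folder.
start_dir=$PWD
scratch=$(mktemp -d)
mkdir -p "$scratch/data/S000" "$scratch/data/S001" "$scratch/genomes"
echo "AAAA" > "$scratch/data/S000/genome.dat"
echo "CCCC" > "$scratch/data/S001/genome.dat"

cd "$scratch/data"

# Dry run: echo prints each cp command instead of running it.
for i in *; do
    echo cp "${i}/genome.dat" "../genomes/${i}_genome.dat"
done

# Real run: the same loop with the echo removed.
for i in *; do
    cp "${i}/genome.dat" "../genomes/${i}_genome.dat"
done

copied=$(ls ../genomes | wc -l)
echo "copied $copied files"

# Go back to where we started and tidy up.
cd "$start_dir"
rm -rf "$scratch"
```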
Scripts
If you put shell commands in a text file (a so-called "shell script"), you can easily re-run those commands:
$ bash commands.sh # this runs everything in commands.sh
The convention is to name our shell scripts with a .sh ending.
Q: Using nano, make a script file called organize_genomes.sh in your project folder that does two things: a) makes the genomes folder, and b) copies your genome data into that folder.
$ cat organize_genomes.sh
mkdir -p genomes # -p tells mkdir to be quiet if the folder exists
cd data/
for i in *; do
cp ${i}/genome.dat ../genomes/${i}_genome.dat
done
Q: Remove your genomes folder, and run the organize_genomes.sh script.
$ rm -rf genomes/
$ bash organize_genomes.sh
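Because of the -p option, a script like this can be re-run without mkdir complaining that the folder is already there. Here is a sketch that checks this with a tiny stand-in script. The script name make_folder.sh is made up, and the script is written with a "here-document" (the <<'EOF' ... EOF part) only so that the example is self-contained; in the workshop you would type it in nano instead.

```shell
scratch=$(mktemp -d)

# Write a two-line script into the scratch folder.
cat > "$scratch/make_folder.sh" <<'EOF'
mkdir -p genomes   # no complaint if genomes/ already exists
echo "genomes folder is ready"
EOF

# Run the script twice from inside the scratch folder. The commands inside
# $(...) run in a subshell, so our own working directory does not change.
output=$(cd "$scratch" && bash make_folder.sh && bash make_folder.sh)
status=$?
echo "$output"
echo "exit status: $status"

# Record whether the folder was made, then tidy up.
made=no
if [ -d "$scratch/genomes" ]; then made=yes; fi
rm -rf "$scratch"
```

An exit status of 0 means both runs succeeded.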
Bonus section
The last thing we'll do is a bit more advanced, but will make your script
friendlier since it won't need to cd into your data/ folder in order to
work.
There is another command called basename which, given a path, returns the
last part of the path, either a filename or deepest folder in the path. For
example,
$ basename data/S000/genome.dat
genome.dat
$ basename data/S000
S000
One other trick. At any point in a shell script you can call another command
and get its output by wrapping that command in $(...). For example,
$ echo The time is $(date) right now
The time is Tue Mar 10 08:34:38 EDT 2015 right now
$ echo The basename is $(basename data/S000/genome.dat)
The basename is genome.dat
$ echo cp data/S000/genome.dat $(basename data/S000)
cp data/S000/genome.dat S000
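You can also store the result of $(...) in a variable and use it later. The sketch below does that with basename. Note that basename only works on the text of the path, so these paths don't need to exist for the example to run.

```shell
# Last part of the path.
name=$(basename data/S000/genome.dat)

# An optional second argument strips that suffix from the result.
stem=$(basename data/S000/genome.dat .dat)

# A trailing slash is ignored.
folder=$(basename data/S000/)

echo "name=$name stem=$stem folder=$folder"
```

Storing the subject name in a variable like this tends to make a long cp command easier to read.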
We can re-write our loop to use basename to give the proper names to our
files. When you're doing this exercise, rather than type things out on the
command line you can edit your commands.sh script and run it. That way you have
a record of what you've done.
Start with a for-loop that loops over all the subject folders in your data/ folder.
$ for i in data/*; do
echo $i
done
Each time through the loop, echo a cp command that copies the genome.dat file to genomes/. Don't worry about giving it a name yet.
$ for i in data/*; do
echo cp $i/genome.dat genomes/
done
Now use basename to get the subject folder name (i.e. turn data/S015 into S015) using the $(...) notation so that we can use that in the name of the target file we're copying to.
$ for i in data/*; do
cp $i/genome.dat genomes/$(basename $i)_genome.dat
done
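Putting the pieces together, here is one way the finished loop could look, run against a scratch copy of the folder layout so you can try it safely. The two subjects and their file contents are made up for the example, and the subject variable is just a convenience to keep the cp line short.

```shell
# Miniature project: two subjects and an empty genomes folder.
scratch=$(mktemp -d)
mkdir -p "$scratch/data/S000" "$scratch/data/S001" "$scratch/genomes"
echo "AAAA" > "$scratch/data/S000/genome.dat"
echo "CCCC" > "$scratch/data/S001/genome.dat"

# No cd needed: loop over the full paths and let basename pull out
# the subject name (e.g. .../data/S000 becomes S000).
for i in "$scratch"/data/*; do
    subject=$(basename "$i")
    cp "$i/genome.dat" "$scratch/genomes/${subject}_genome.dat"
done

result=$(ls "$scratch/genomes")
echo "$result"

rm -rf "$scratch"
```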