Introduction to the Linux Shell

Motivation: The shell is one of the most basic ways you can interact with a computer and it allows you to automate your analysis. It is often the only way to interact with scientific computing environments (e.g. SCC, SciNet), so you'll have to learn it eventually.

Prerequisites: Minimal familiarity with the shell (e.g. you should know what the commands cd, ls, mv, cp, rm and man do).

What you will learn: Running commands on many files (globbing, looping, if statements), reading and writing to files, sorting/filtering data in files

Things to read:


Getting started

# connect to the SCC
ssh test_user@192.168.214.10

# make a folder for yourself, and cd into it
mkdir given-name_family-name          # e.g. jon_pipitone
cd jon_pipitone

# Run a script to generate the data for this workshop
bash ~jpipitone/make-data.sh

The make-data.sh script can be found here: https://raw.githubusercontent.com/pipitone/computing-skills/7128668508a8495d302bed5f396c7bb0b732961c/bin/make-data.sh

Exploring the data

The script you ran generates a phoney dataset of imaging and genetics data for a number of subjects in a folder called data. Have a look around.

# for example: 
ls data
cd data
ls S000
cd ..

Q: How do you see what is in the demographics file?

$ cat data/S000/demographics.csv
$ less data/S000/demographics.csv     # alternatively

Q: How do you find out the size of the files in a subject folder?

$ ls -l data/S000
total 16
-rw-rw-r-- 1 jp jp   35 Mar  9 16:05 demographics.csv
-rw-rw-r-- 1 jp jp 5050 Mar  9 16:05 genome.dat
-rw-rw-r-- 1 jp jp 2471 Mar  9 16:05 T1.nii

Q: How many lines are there in the genome.dat file?

For this you are going to need to know a new command, wc, which stands for "word count". If you run wc file it will print three things: the number of characters in the file, the number of words, and number of lines. Try it.

$ wc data/S000/genome.dat
 50   50 5050 data/S000/genome.dat

You can also use the -l option to only show the number of lines:

$ wc -l data/S000/genome.dat
 50 data/S000/genome.dat

Q: How many subjects are there?

First, make a list of the subject folders using ls:

$ ls data

Then, pipe that list into the wc command:

$ ls data | wc -l
101

Anything you pipe into wc gets counted. Try it with the history command, for instance.

Q: How many subjects have a T1.nii image?

You can use ls with a wildcard to match all of the subject folders:

$ ls data/*/TI.nii      # lists all of the T1.nii files in subfolders of data

And then you can pipe into wc -l to count that list.

Q: How many subjects have a T1.nii.gz image?

# your turn

Q: How many subjects have either aT1.nii or a T1.nii.gz image?

# your turn

Q: Which subjects don't have a T1.nii file?

# This is trickier. We'll cover this one next time. :-)

Merging and filtering text

Since each subject had a demographics.csv file, it would be nice to collect all of it into a single CSV so that we could analyse it.

We can use the program grep to filter lines that match a certain pattern. For instance, to show only the males ('M'), we could pipe the output to grep like so:

$ cat data/*/demographics.csv | grep M

Hmm.. looks like there are some subjects we don't know the sex of. We can use the sort to order text by line. Since sex is the first column, it will sort the lines by sex.

Next, we'll save the concatenated data into a file by redirecting the output of the command into a file using the > operator.

cat *files* > master.csv

The > operator takes anything the command before it prints out and prints it to the named file instead of displaying it on the terminal.

Organizing your data

In order to do some analysis you now want to collect all of your data types into separate folders by type (i.e. put all of your genome data together in one folder, all of your imaging data together in another, etc).

We need a way to rename our files automatically, one by one. And this is where a loop comes in handy.

For example, here is a list that prints numbers 1-5:

$ for i in 1 2 3 4 5 
  do
    echo ${i}        # gets run for each number
  done

A few new things here:

The list in a for-loop can also be made up of a wildcard that matches files. So, for instance you can loop through all of the files and folders in your current working directory like so:

$ for path in * 
  do
    echo Found: ${path}
    ls -l ${path}
  done

Okay, we're ready to move and rename some files. We'll first do it in a bit of a cumbersome way, and then show you how to do it more easily.

  1. First cd into your data/ folder.
  2. From your data/ folder, how would you copy subject S045's genome.dat file into the genomes folder?

    $ cp S045/genome.dat ../genomes
    
  3. Write a for-loop that prints the names of all of your subject folders. (hint: you just did this a few moments ago :-)

    $ for i in *; do 
        echo ${i}
      done
    
  4. Edit the for-loop to echo a command to cp the genome.dat file from each subject folder to the genomes folder.

    $ for i in *; do 
        echo cp ${i}/genome.dat ../genomes
      done
    
  5. The last thing we need to do is give our file a new name once it is in ../genomes. Since${i} is the name of our subject, we can use that in our name. Edit the for loop to copy each subject's genome.dat file to S045_genome.dat, etc.

    $ for i in *; do 
        echo cp ${i}/genome.dat ../genomes/${i}_genome.dat
      done
    
  6. Okay, remove the echo and run your for-loop for real!

Scripts

If you put shell commands in a text file (a so-called "shell script"), you can easily re-run those commands:

$ bash commands.sh    # this runs everything in commands.sh

The convention is to name our shell scripts with a .sh ending.

Bonus section

The last thing we'll do is a bit more advanced, but will make your script friendlier since it won't need to cd into your data/ folder in order to work.

There is another command called basename which, given a path, returns the last part of the path, either a filename or deepest folder in the path. For example,

$ basename data/S000/genome.dat
genome.dat
$ basename data/S000
S000

One other trick. At any point in a shell script you can call another command and get its value by putting that command inside parentheses,$(...). For example,

$ echo The time is $(date) right now
The time is Tue Mar 10 08:34:38 EDT 2015 right now

$ echo The basename is $(basename data/S000/genome.dat)
The basename is genome.dat

$ echo cp data/S000/genome.dat $(basename data/S000)
cp data/S000/genome.dat S000

We can re-write our loop to use basename to give the proper names out our files. When you're doing this exercise, rather than type things out on the command line you can edit your commands.sh script and run it. That way you have a record of what you've done.

  1. Start with a for-loop that loops over all the subject folders in your data/ folder.

    $ for i in data/*; do 
        echo $i
      done
    
  2. Each time through the loop, echo a cp command that copies the genome.dat file to genomes/. Don't worry about giving it a name yet.

    $ for i in data/*; do 
        cp $i/genome.dat genomes/
      done
    
  3. Now use basename to get the subject folder name (i.e. turn data/S015 into S015) using the `$(...) notation so that we can use that in the name of the target file we're copying to.

    $ for i in data/*; do 
        cp $i/genome.dat genomes/$(i)_genome.dat
      done