Over the last couple of months I've been asked for help analyzing several different sets of high-throughput sequencing data. This is not really my area, but I can see why I'm being approached: it's obvious that I use computers and big computers are required to extract information from these data. So, as a note to my future self and as a pointer for others I decided to write three posts describing some of the computing workflows I've been using. I've been pleasantly surprised at how straightforward it is. Please add any criticism to this post on Google+. I'd love to learn how to do better and update these posts.
Raw sequencing data
One project I'm involved with is comparing the resequenced genotypes of several S. cerevisiae populations. Cell samples were prepared in the lab and sent off to an external sequencing service provider, who carried out paired-end sequencing on an Illumina machine. About a month later, they sent an email informing us that the data were ready and pointed us to a password protected webserver where we could download them. Data in this case consisted of 24 .fastq.gz files (2 per sample) approximately 1.2Gb each. You can see some examples of the contents of .fastq files, and some tips for how to read them in Darren Wilkinson's post on processing sequencing data. Below is a snippet of uncompressed raw data from a .fastq file:
@HISEQ:178:C23RCACXX:5:1101:1913:2210 1:N:0:CGATGT CGTGGATCTTAGCAGAATGGGCTATAATTGGGAAGGTAACGAATTTTCATATCACATGCTGCTTGTATCTCAATACATGTTGAATGCAGGATTACTCCAAA + 1=1=ABDDFBECECBECBDAF3FEE?E@4A+:CBED?DGICFGDCE@?999?/?*9CBBAEE3@CFF@;CE7@EE?E:?CABB3?@EAA@((5@@###### @HISEQ:178:C23RCACXX:5:1101:2634:2072 1:N:0:CGATGT ACTCCCATCGCTCTACATTATCCGATCCAAAGCAGATGTTGAGAAAAGCATCCACTGGTAGCTCAAACGATCCAAGCGCTATGACTCCTTTTTCTTCAGTA + E@@DDDDDFFDAFGIII@HBHFCGCFHEB9?@GHGEFFHBFCHGGHIIFCGI)B=C@GIGG:DHGIGE/=B;;36;??;@;?ACCCCC:>CCC>CC>CCE: @HISEQ:178:C23RCACXX:5:1101:2675:2124 1:N:0:CGATGT AATTTCACGTACTTTTTCACTCTCTTTTCAAAGTTCTTTTCATCTTTCCATCACTGTACTTGTTCGCTATCGGTCTCTCGCCAATATTTAGCTTTAGATGG + @@@FFDFFHDFFFIIIJ;CEIFGHHGIJHIJJJGHEGHIJEHIHIIIGEG>F>GGHHJHGIHIIIIIGHIJIBEAAEEEFDDDDCDDEDCDDDCDCCDD@C @HISEQ:178:C23RCACXX:5:1101:2824:2166 1:N:0:CGATGT TCATTAATGTACATGTAATAGAGATAGAACCACGTGTGTAGAATAGATTCAAGCACAACAATTCTGGAAATCCATTTGTGGAGAGGTAACAATGCTAAGTA + ?@@=DDDDF????GHBGGDH@HG@C@EDH?CEDAFG?EF4?BFAEEBE@4?BGBDGCH@@FHGBCFHGIEHFG@DGEHHFE:?DEE766@ECCCCA@CC5;
At this point it's probably worth thinking about the type of computing facilities required to handle and analyse these data comfortably.
Data transfer rate/bandwidth
The first hurdle to overcome is downloading the data. My computers are part of the Newcastle University network. Newcastle University is part of the JANET network and so has very fast internet indeed. This tool quotes my internet download speed as 95Mbps. For comparison, a fast connection from a public ISP will give an actual download speed of about 50Mbps. It took me 1 hour to download all 28.3Gb of data.
Hard Disk space
The files supplied are compressed (thus the .gz file extension). If you want to store the uncompressed versions (note, there are ways around that), they come to about 88Gb. Also, many of the output files from the analysis are of similar size. Overall, I would say that having about 350Gb of free disk-space to work with a dataset of this kind of size would be more than adequate: you would never have to think twice about whether to uncompress a file or whether it is worth keeping. Many of the files generated could be deleted before archiving.
Probably the most CPU intensive task in this workflow is read alignment (see below). The software I use for alignment is capable of taking advantage of multiple CPUs at once to speed things up. Using 12 physical 3.05Mhz CPUs I can align the reads from a pair of files in 10 minutes, aligning reads for all samples in 2 hours. Since this is not an every day task, this number of CPUs is overkill. I would expect 4 to be more than adequate.
The tools I use for analysis are not that memory hungry. I haven't paid close attention, but I don't think I used more than 3Gb of RAM at any one time.
Analysing high-throughput sequencing data involves generating several types of intermediate files. It's probably best to try to be organised and create a working directory with several sub-directories to store them all. Below is the directory structure I use. I put raw data in the 'data' directory and scripts for analysis in the high-level working directory. I downloaded a reference sequence, against which to align reads, from the Sacchromyces Genome Database website, uncompressed it and placed the files in the 'references_SGD' directory. The remaining directories are automatically populated with output files by running workflow scripts (see below).
bowtie2 is a very impressive, fast, open-source, command-line tool to align short sequence reads to a reference sequence (Langmead & Salzberg (2012)). It's easy to install on Linux and Windows.
To install bowtie2 under Linux, first download the source code package (e.g. bowtie2-2.1.0-source.zip) from the files tab on the Bowtie sourceforge page. Then unzip to a target directory, navigate to the target directory and execute "make" to compile for your machine. You will probably want to export the target directory to your path before use.
If you'd prefer to install bowtie2 under Windows, binary files (including 64-bit binaries) are also available from the Bowtie sourceforge page. Unzip the files to a target directory and add that directory (e.g. the directory containing the file "bowtie2-align.exe") to your path, as described here.
As an example of using bowtie2 from the terminal (cmd.exe under windows, shell terminal under linux), using the directory structure outlined above, if you navigate to the upper-level directory and type the following line of text, followed by enter, this will read in the reference sequence and create an 'index' file in the indices. Such indices are required for bowtie2 to quickly search the genome.
bowtie2-build reference_SGD/S288C_reference_genome_R64-1-1_20110203/S288C_reference_sequence_R64-1-1_20110203.fsa indices/SGD_S288C
To actually perform paired-end alignment, using 12 CPUs, write a command following this example (adjusting filenames to suit your case):
bowtie2-align -x indices/SGD_S288C -1 data/SampleA_1_sequence.fq -2 data/SampleA_2_sequence.fq -S alignments/SampleA.sam -p 12 --end-to-end
bowtie2 outputs alignments as uncompressed .sam files. Many other tools for further analysis need alignments to be in the alternative, compressed, binary .bam format. Fortunately samtools is another amazing, open-source, command-line tool which can do that (and many other useful things) for us (Li et al. (2009)).
To install samtools under Linux, check out the most recent source code from github. If you don't currently have git installed on your system, you will need to do that first. In Ubuntu, for example, you can do this by executing the following command in a terminal:
sudo apt-get install git
Then, check out a copy of the source code by executing the following (you probably want to execute this in your home directory):
git clone git://github.com/samtools/samtools.git
The last command makes a directory called samtools. Navigate into that directory and execute "make" to compile for your machine. You will probably want to export the following three directories to your path before use: samtools samtools/bcftools samtools/misc
To install samtools under Windows, you can download a .zip file containing pre-compiled binaries here. As for bowtie2, unzip the files to a target directory and add that directory to your path, as described here. Note that for generating consensus .fastq files (in a later post), you will additionally need to have a working perl environment, and a copy of the vcfutils.pl script, which you can download from the samtools github page.
Scripting to automate workflows
A useful feature of code-based computational work (scripting and programming) is that the code that you write (computer instructions written as plain text) is both the means by which work is carried out and a perfect record of past actions. This is unlike manual recording of work (e.g. writing records into experimental lab notebooks) where an extra layer of human interpretation can occasionally lead to omissions or errors. Code-based work is also unlike GUI-based computation, which is based on user interaction with mouse-clicks and windows and typically leaves no trace of how analysis was carried out. GUI-based computation (e.g. producing plots using Microsoft Excel) can typically only be replicated by manual execution of the same series of tasks, much like manual work. On the other hand, very complicated code-based workflows, once created, can be repeated trivially by executing a single command, or by clicking on a single icon, for example. Similarly, code describing a complicated workflow can be changed easily by editing the underlying text file. Since code is just plain text, it can easily be edited, saved, copied, distributed and archived. Given the advantages of code-based work, the fact that bowtie2 and samtools are command-line tools should be seen as a strong feature. Even if you are not familiar with this way of working, it is worth spending a little time and effort coming to grips with command-line tools, it will likely save you a great deal of time, even in the short-term.
Linux shell script for alignment
Using the bash shell scripting language it is possible to write a short script to automate analysis of these files. The script below uses bowtie2 to create an index of a reference genome and then aligns each of a set of samples to that reference genome. It then uses samtools to convert the .sam alignment files to the more useful .bam format, generating sorted versions and indices for these files.
Scripting and programming languages have a particular vocabulary and syntax. Similarly to human languages, they are not intuitive or obvious to people who are unfamilar with them. Fortunately, the vocabulary and syntax are much simpler than human languages and there are many excellent resources online to help with the process of learning, translating or writing in these languages. Lines preceded by a '#' in the code below are not used to carry out any actions, but are just human-readable descriptions of the purpose of the surrounding code.
Windows batch script for alignment
The same automated workflow can be achieved under Windows by writing a batch file. You can see a batch file which carries out the same tasks as the shell script above here. Note that in this batch file, human-readable descriptions of the purpose of the code are preceded by a '::'. Batch files are written in a different scripting language which is not as flexible as the Linux equivalent, but are still suitable for this analysis. However, one disadvantage of using batch scripts is that the list of sample IDs to iterate over must be read from a separate file (you can download an example file here).
Alignment file output
After all that effort, the data that comes out at the end (the uncompressed .sam files at least) look like the example snippet below. My next post will demonstrate how to get information from alignment files (particularly .bam files).
@HD VN:1.0 SO:unsorted @SQ SN:ref|NC_001133| LN:230218 @SQ SN:ref|NC_001134| LN:813184 @SQ SN:ref|NC_001135| LN:316620 @SQ SN:ref|NC_001136| LN:1531933 @SQ SN:ref|NC_001137| LN:576874 @SQ SN:ref|NC_001138| LN:270161 @SQ SN:ref|NC_001139| LN:1090940 @SQ SN:ref|NC_001140| LN:562643 @SQ SN:ref|NC_001141| LN:439888 @SQ SN:ref|NC_001142| LN:745751 @SQ SN:ref|NC_001143| LN:666816 @SQ SN:ref|NC_001144| LN:1078177 @SQ SN:ref|NC_001145| LN:924431 @SQ SN:ref|NC_001146| LN:784333 @SQ SN:ref|NC_001147| LN:1091291 @SQ SN:ref|NC_001148| LN:948066 @SQ SN:ref|NC_001224| LN:85779 @PG ID:bowtie2 PN:bowtie2 VN:2.1.0 HISEQ:178:C23RCACXX:5:1101:2675:2124 99 ref|NC_001144| 463908 1 101M = 463960 153 AATTTCACGTACTTTTTCACTCTCTTTTCAAAGTTCTTTTCATCTTTCCATCACTGTACTTGTTCGCTATCGGTCTCTCGCCAATATTTAGCTTTAGATGG @@@FFDFFHDFFFIIIJ;CEIFGHHGIJHIJJJGHEGHIJEHIHIIIGEG>F>GGHHJHGIHIIIIIGHIJIBEAAEEEFDDDDCDDEDCDDDCDCCDD@C AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:-2 YT:Z:CP HISEQ:178:C23RCACXX:5:1101:2675:2124 147 ref|NC_001144| 463960 1 101M = 463908 -153 CCTGTACTTGTTCGCTATCGGTCTCTCGCCAATATTTAGCTTTAGATGGAATTTACCACCCACTTAGAGCTGCATTCCCAAACAACTCGACTCTTCGAAGG #CC@C@CCDDBDDCDDDDDDDDDBBDDFFDDFHHHHEEBIIEIIJJJIGGFIFB7FD7IIHDG@GIIGGHDEIIHGIIJIGHHFFHEADEHGHFFFFFB@@ AS:i:-2 XS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0A100 YS:i:0 YT:Z:CP HISEQ:178:C23RCACXX:5:1101:3469:2199 83 ref|NC_001136| 889469 42 101M = 889282 -288 TTGTCGAAATCTATATATGCGCCGCCCTGGCAGGATCAGCAAATACAAGTTACTTGGAAAGTCGCGGCATAGAAAGTAGACACTTCTTTTGTCAAAGATTG EBDBDDCDDDCADDCE59DB@@B?DACCBBDCCDHGHHHCIHEGDGG@EGEHHIHEIGGG@GDEIIIIEIIGHHF@FFF:FCEEIGIIHHHHHFFFFFC@@ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP HISEQ:178:C23RCACXX:5:1101:3469:2199 163 ref|NC_001136| 889282 42 101M = 889469 288 TAAGCCTGTTTCTAAACATGATCCCCAGGGTGAATATCTCCCATAGTGGCTGTGTTATTCTGTCTAAATGTGGCAGCTAAATAAGGTTTTTTCCAAGCTAT B@@FFFFFFHHHHIIJEGGIGIJIIIGJGE1CBEIIIJJJJJFHIJBGGHIF?FBFCHIGIJACHIHGGIDHGIHHHD;?DDEECC.;ACBD?AD@58:AC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP HISEQ:178:C23RCACXX:5:1101:3694:2092 99 ref|NC_001144| 467018 1 101M = 467079 162 CTGCTGCCGGAAATGCTCTCTGTTCAAAAAGCTTTTACACTCTTGACCAGCGCACTCCGTCACCATACCATAGCACTCTTTGAGTTTCCTCTAATCAGGTT BB@FDFFFHHFHHGHIHGHIIGIIFFHIJJJJIHIJFGHIHIHIGBFHEHGHHIJIHHHHEDFFCACCACCCDDDDCCDDDCCBDDDECCDDEACCACDDC AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:-2 YT:Z:CP HISEQ:178:C23RCACXX:5:1101:3694:2092 147 ref|NC_001144| 467079 1 101M = 467018 -162 CCCATACCATAGCACTCTTTGAGTTTCCTCTAATCAGGTTCCACCAAACAGATACCCCGGTGTTTCACGGAATGGTACGTTTGATATCGCTGATTTGAGAG #CDCCCDDDDEDDCCCDDCCDCDDBBDCCCADC?DEDCA;FFDFHHHHHHHDB@JJJIJJJIHFDJIHGDHGIJJJJIJIHFHHGGHJHHFDFFDDDDBBB AS:i:-2 XS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0A100 YS:i:0 YT:Z:CP HISEQ:178:C23RCACXX:5:1101:1913:2210 99 ref|NC_001146| 451329 42 101M = 451531 303 CGTGGATCTTAGCAGAATGGGCTATAATTGGGAAGGTAACGAATTTTCATATCACATGCTGCTTGTATCTCAATACATGTTGAATGCAGGATTACTCCAAA 1=1=ABDDFBECECBECBDAF3FEE?E@4A+:CBED?DGICFGDCE@?999?/?*9CBBAEE3@CFF@;CE7@EE?E:?CABB3?@EAA@((5@@###### AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:30T70 YS:i:-7 YT:Z:CP HISEQ:178:C23RCACXX:5:1101:1913:2210 147 ref|NC_001146| 451531 42 101M = 451329 -303 TCAAATAAAGTTATGGGGGCCATGTCGATATGTGCTGGCTCGTTTGTTGATTTTTATGGTCGTAATCAAGGCACAAAAATCTCGTCACATTATACATTTTC #############@5895(9DDBDB@AAE:CA@=@B9=:;/2@GFEDC=;E9/3???6))F?1*BED9?4DCEE?IGGGFECEHEE:DE,,FD?B;=D?;8 AS:i:-7 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:10G7T33G48 YS:i:-3 YT:Z:CP HISEQ:178:C23RCACXX:5:1101:3259:2171 99 ref|NC_001141| 127803 42 101M = 127987 285 AACCGCTATTTTTGGTTTTATCTTCGTTTCTTTCTCCTGAACGACATTCGTCACGAAAATTGCGGCGGAAAATTTCCTGATGCGGACACTTTTTCCCGATC E?@+B@DDHFHFDHIFE;@EFFGGF?:??CGHGDHDGHIIIGHGFGHG8C8@CGEGHEEBDDECABBBEE;34ECE@CCCACEE9505((:@AC3ECB@E@ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP HISEQ:178:C23RCACXX:5:1101:3259:2171 147 ref|NC_001141| 127987 42 101M = 127803 -285 TGCTTTCAGCATAGCACAGCATAGCAGCTGTGTATATCTTAAATAAGATGTAGACTGGTTTGCATTTGGAAAGGTTTTGTGTAAGAAAAGCAATACTTGAG EDCCC@EECEFDFFC@CEHFEHHIGGFIIGHGJJHDJIGGHG@IGBEFIIHIGDCHEEIIIG@IIIIGEHGHGHGJIJIHGGHGGHGBFEFGHDDDBD8@? AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP HISEQ:178:C23RCACXX:5:1101:3160:2249 83 ref|NC_001147| 90204 42 101M = 89998 -307 CACTTAATAAAAGTGCGAGCATGAAAATCGTACCTGCTGCGGGTGCCAGATTTCTTGATAATAATTCATGAATAGTAAACAAATCTGCCCCATCTGCTTCT CDCCDCDCDDDDDB?BCADDEDEDBDDCEBC@FFHHIHHFGIIJHJIGGJIGHIJIJJJJJIJJIIJJJIJJIJIJJIHJJIHEFJIIGGHHHFFFFD@@B AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP