Building consensus sequences

Introduction

In my last two posts I described aligning short reads (stored in .fastq format) produced by a sequencing machine against a reference sequence and visualising read-depth coverage along the genome. In the final post in this series of three, I will briefly describe the next step in the resequencing workflow: generating a consensus genomic sequences from read-alignment (.bam files, such as those generated in my last post).

Software & scripting to generate consensus

In the first of these three posts, I described installing the samtools software, along with using scripting tools like Bash shell scripts and Windows batch files to automate analysis. Here I will demonstrate the use samtools again, this time to call SNPs and (short) indels and to output consensus sequences. Again, I will use scripting to automate this process for each sample to be analyzed:

The script above runs the samtools mpileup function, which generates output in the Variant Call Format (VCF) by default. In order to skip that stage and move directly to the consensus sequence in .fastq format (sequence and quality measure for each of 17 chromosome "reads") the script above pipes the output from samtools mpileup directly to two other tools which are part of the samtools software: bcftools and the perl script vcfutils. In order to get the last step to work under windows you will need a working perl environment and a copy of the vcfutils.pl perl script which you can get from the samtools github page. In any case there is an example snippet of the output for the first few bases from one chromosome from one yeast sample below:

@ref|NC_001133|
mcmcaccacacccACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC

Automatically comparing gene sequences

Accessing and comparing subsets of these consensus sequences (e.g. checking if the sequence at an open-reading frame associated with your favourite gene is the same in your sample as in some other reference genome) can be done programatically using the BioPython library for Python (another excellent open-source programming language). Below is an example Python script which demonstrates reading in a reference genome (e.g. the S288C reference genome from SGD) and comparing that with each of the sample sequences generated with samtools (see script above). In particular, it only compares the sequences within the Open Reading Frames for specified genes. I have assumed the same directory structure I presented in the first post. Note that lines preceded by '#' are comments, or notes to help humans understand the motivation for commands given to the computer.

Finally, here is some example output from the script above. I have specifically chosen the genes examined not to be particularly informative, but you can see that there is no difference between the BRR1 reference sequence and sample sequence in this case, whereas a large section of the HIS3 gene is missing (functionally deleted). Note that, depending on how your browser handles word-wrapping, you will have to scroll far to the right to see the full gene sequences on this page.

BRR1

Reference BRR1 sequence:
ATGAAAAGAGGAGAAAGCCAGGCGCCTGATGCCATCTTTGGTCAATCTCGTGCATTTGCACTATCTGATTCTTCAGTGAATCCTGATGTCATTGAATATCTTAAAAGTGTTAGGCAAGAAGCACTAAGAACCAACGCCATATCGATTAAGAATCATATGAATCTGCAAAAGAGGACGCGTCACAAATCAAGCATGTACGATGATGAGGATGAGGGGGCCCTTAAAAGGCACGCCATCTCGCCATCCTTGATTAGGCTTCAAAGGAATGTAGAAATATGGGTAAGGTGGTTTAACTCTGTGAAGGCGACGGTGTTGACTAATGCCTACGAGTTTACCGGTTATGAGGACGAAACACTGGATCTTTTATTGCTTTTCTTAAAGAATTATCTCGAAGATATGCCCAGCAAATGTACTACAGTTGAAAAAATTATAAGTGTCCTAAATCAGCATTCCTTTCCTGAGAAGGCAGAAGAGAAGGAAGAAAATCTTCAGATTGATGAGGAATGGGCGAAGAATATTTTGGTACGGCTGGAAAAAACCAAGATTGACAGTGTTGAGGATGTAAAAAAAGTAATCACTGAAGGAGATAAACATGAACTAGTTGGATACAACCAGTGGTTCCAATACCTTATAAACAATGAACCGCAGCATACTACTTTTCATGAAAAGATTACCTCTAAGCAACTTTGGGTTCTGATCAAGTATATGTCGAATACATGGATAAAAGAAATACACAAGAAAGGAAGGCATTATCGTCGGCTGCAAGATTGGCTATTCTACATACTGGTACATACACCCGAAAGAGTCACGGCAGAATATACAAGCATCTTGAGAGATCTTGGAAAGAAATGCCTTGAACTGATTCAAAAGAAGCCAGTTGAAGCACATGAGAATAAAATAACACTCCCGAAGGAGATGGCGGAATTGAATGTTGAGATACCAGCCGCGGTGGAGAATATGACGATAACTGAGCTGACAGTGTCTGTTATAGCAGTGAACTATGGTCAAAAAGACTTGATAGAATAA

SN7640178_10172_8183 Sample BRR1 sequence:
ATGAAAAGAGGAGAAAGCCAGGCGCCTGATGCCATCTTTGGTCAATCTCGTGCATTTGCACTATCTGATTCTTCAGTGAATCCTGATGTCATTGAATATCTTAAAAGTGTTAGGCAAGAAGCACTAAGAACCAACGCCATATCGATTAAGAATCATATGAATCTGCAAAAGAGGACGCGTCACAAATCAAGCATGTACGATGATGAGGATGAGGGGGCCCTTAAAAGGCACGCCATCTCGCCATCCTTGATTAGGCTTCAAAGGAATGTAGAAATATGGGTAAGGTGGTTTAACTCTGTGAAGGCGACGGTGTTGACTAATGCCTACGAGTTTACCGGTTATGAGGACGAAACACTGGATCTTTTATTGCTTTTCTTAAAGAATTATCTCGAAGATATGCCCAGCAAATGTACTACAGTTGAAAAAATTATAAGTGTCCTAAATCAGCATTCCTTTCCTGAGAAGGCAGAAGAGAAGGAAGAAAATCTTCAGATTGATGAGGAATGGGCGAAGAATATTTTGGTACGGCTGGAAAAAACCAAGATTGACAGTGTTGAGGATGTAAAAAAAGTAATCACTGAAGGAGATAAACATGAACTAGTTGGATACAACCAGTGGTTCCAATACCTTATAAACAATGAACCGCAGCATACTACTTTTCATGAAAAGATTACCTCTAAGCAACTTTGGGTTCTGATCAAGTATATGTCGAATACATGGATAAAAGAAATACACAAGAAAGGAAGGCATTATCGTCGGCTGCAAGATTGGCTATTCTACATACTGGTACATACACCCGAAAGAGTCACGGCAGAATATACAAGCATCTTGAGAGATCTTGGAAAGAAATGCCTTGAACTGATTCAAAAGAAGCCAGTTGAAGCACATGAGAATAAAATAACACTCCCGAAGGAGATGGCGGAATTGAATGTTGAGATACCAGCCGCGGTGGAGAATATGACGATAACTGAGCTGACAGTGTCTGTTATAGCAGTGAACTATGGTCAAAAAGACTTGATAGAATAA
Sample: SN7640178_10172_8183 Gene: BRR1
Number/percentage of differing nucleotides: 0 0.0
HIS3

Reference HIS3 sequence:
ATGACAGAGCAGAAAGCCCTAGTAAAGCGTATTACAAATGAAACCAAGATTCAGATTGCGATCTCTTTAAAGGGTGGTCCCCTAGCGATAGAGCACTCGATCTTCCCAGAAAAAGAGGCAGAAGCAGTAGCAGAACAGGCCACACAATCGCAAGTGATTAACGTCCACACAGGTATAGGGTTTCTGGACCATATGATACATGCTCTGGCCAAGCATTCCGGCTGGTCGCTAATCGTTGAGTGCATTGGTGACTTACACATAGACGACCATCACACCACTGAAGACTGCGGGATTGCTCTCGGTCAAGCTTTTAAAGAGGCCCTAGGGGCCGTGCGTGGAGTAAAAAGGTTTGGATCAGGATTTGCGCCTTTGGATGAGGCACTTTCCAGAGCGGTGGTAGATCTTTCGAACAGGCCGTACGCAGTTGTCGAACTTGGTTTGCAAAGGGAGAAAGTAGGAGATCTCTCTTGCGAGATGATCCCGCATTTTCTTGAAAGCTTTGCAGAGGCTAGCAGAATTACCCTCCACGTTGATTGTCTGCGAGGCAAGAATGATCATCACCGTAGTGAGAGTGCGTTCAAGGCTCTTGCGGTTGCCATAAGAGAAGCCACCTCGCCCAATGGTACCAACGATGTTCCCTCCACCAAAGGTGTTCTTATGTAG

SN7640178_10172_8183 Sample HIS3 sequence:
ATGACAGAGCAGAAAGCCCTAGTAAAGCGTATTACAAATGAAACCAAGATTCAGATTGCGATCTCTTTAAAGGGTGGTCCCCTAGCGATAGAGCACTCGATCTTCCCAGAAAAAGAGGCAGAAGCAGTAGCAGAACAGGCCACACAATCGCAAGTGATTAACGTCCACACAGGTATAGGGTTTCTGGACCATATGATACATGCTCTGGCCAAGCATTCCGGCTGGTCGCTAATCGTTGAGTGCATTGGTGACTTACACATAGACGACCATCACACCACTGAAGACTGCGGGATTGCTCTCGGTcaagcttttaaagagGcccTAGggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnatcccGCATtttctcgaaagcttTGCAGAGGCTAGCAGAATTACCCTCCACGTTGATTGTCTGCGAGGCAAGAATGATCATCACCGTAGTGAGAGTGCGTTCAAGGCTCTTGCGGTTGCCATAAGAGAAGCCACCTCGCCCAATGGTACCAACGATGTTCCCTCCACCAAAGGTGTTCTTATGTAG
Sample: SN7640178_10172_8183 Gene: HIS3
Number/percentage of differing nucleotides: 189 28.5067873303