Bioinformatic analysis for biologists - part 2

In part 1, we described the steps involved in obtaining raw sequencing reads from ChIP-Seq libraries. We also explained how to choose an analysis pipeline that best suits your needs. In this post, we describe how to align the raw reads to a reference genome. Let's go!


Choose an aligner

The most important step in alignment is choosing the right 'aligner' tool for your data. We found that reading the user manuals and the theories behind each aligner did not help us much with choosing the right aligner. Each dataset is different, causing an aligner to behave differently. To be confident that we will be using an aligner that best suits our needs, we ran smaller experiments on individual datasets to enable fast inspection of the output. This Wikipedia page lists most of the aligners that we tried in our small-scale experiments. After comparing all the outputs, we decided to choose bowtie 2 (we do not have any affiliation to bowtie or its creators; use at your own risk). The reasons were:

  • Speed: bowtie aligned all of our datasets overnight.
  • Simplicity: The online manual described the purpose of each optional parameter.
  • Accuracy: Short repetitive regions, did not appear to cause a problem regarding introducing artifacts or noise.

We recommend you compare different aligners and, perhaps, run a trial with several. Pick the one that gives you the most robust output. For example, low variation among biological/technical replicates and good signal-to-noise ratios.

Regardless of the aligner that you choose, you will find that two important steps are required before you can start aligning the raw reads.

  1. Create a reference genome from publicly available genome sequencing projects
  2. Choose the proper parameters that tweak how the aligner treats your data

Below we describe these two steps in more detail. Because we used bowtie 2 to align our reads, we will use this aligner as an example. However, the basics should be the same across different aligners.

First, follow the 'getting started' instructions for your chosen aligner. The instructions for bowtie are simple and listed on their website. Briefly, you download the latest version for your operating system from here. To save time, we recommend you download the binary files (if available for the aligner) that can be used as soon as the download is finished. Save the downloaded files in a location that is easily accessible from the 'terminal' (e.g., /Applications/bowtie2).


The reference genome

Next, find and download the genome sequence in FASTA format. To get the genome assemblies for bacteria, yeast, worms, and flies, you can visit

Usually, there is a 'download data' link on the page for the appropriate organism. Clicking on this link will open an FTP connection to the server hosting these files (the hosts permit guest access). Download the appropriate files that cover the entire genome, or parts of the genome that you are interested in and place these files in the same folder as the aligner tool.


website for downloading yeast reference genome
This is the page for downloading the reference genome of S. pombe. Click on 'download DNA sequence' to access the FASTA files. You can access the files as a guest user.


After placing the genome sequence files inside the bowtie folder, we ran the following command to create a reference genome index that can be used by the aligner. The generated index is unique to the aligner, so if you switch to a different aligner tool at a later time, you will need to create the reference genome index again using the method provided by the new aligner. To generate the index with bowtie, we executed the following command in terminal:

bowtie2-build reference_genome_sequence.fa name_of_index

reference_genome_sequence.fa was placed inside the bowtie folder, and the above command was executed from inside the same folder. If the files are in different locations, don't forget to add the relative paths. Replace 'reference_genome_sequence' with the actual name of the FASTA file that was downloaded. Replace 'name_of_index' with a name of your choice (e.g. my_human_reference_v1).

You can use the bowtie2-build function to create an index for a set of FASTA files obtained from sites such as UCSC, NCBI, and Ensembl.

Before creating the index, you can customize the reference genome in any way you want. For example, you may need to add or remove a custom contig at a particular region. Caution: Do not use a text editor for editing the reference genome files (see part 1).



After the index was created, we used the following 'for-loop' to sequentially perform alignment of each of our sequencing files.  

for sample in *.fastq.trimmed.gz; 
echo $sample; 
describer=$(echo ${sample} | sed 's/.fastq.trimmed.gz//'); 
echo $describer.sam; 
/Applications/bowtie2/bowtie2 -a -x pombe ./${describer}.fastq.trimmed.gz -S ./alighned/${describer}.sam; 

The first line initiates the for-loop for all the files that end with 'fasta.trimmed.gz.' We arbitrarily call each file sample. The * is used to create a wild-card name match. Here, it means process all the file with a name that starts with any string, as long as it ends with fasta.trimmed.gz. Using this approach, we can safely have a bunch of other files (e.g., readme.txt) in the same folder along with the sequence files. However, the for-loop will only process the files that match the given wildcard. 'echo' just prints out the file name that is being processed (stored in the $sample variable). The command on line 4, extracts the first part of the file name (the unique part of the file name that is before fasta.trimmed.gz). We use this to create a new file name with the .sam extension (see below).

If you want to align only one file, you can use line 5 (the bowtie2 alignment function) from the above for-loop. In our case, the bowtie2 program was placed inside the Applications folder, and we used the full path to call its function (/Applications/bowtie2/bowtie2). The -a flag is optional and is described below. The -x flag is required and indicates the name of the index file that we created earlier (pombe). The path to the raw sequence files is placed after the index name (./example.fastq.trimmed.gz). In our case, it was the current directory/folder (./). As mentioned in part 1, we are performing alignment on reads that were trimmed at the 5'. Since we were looping over all the sequencing files, we used the ${describer} variable that changed with each iteration of the for-loop. You can replace this variable with the actual name of a file that needs to be aligned. The -S parameters is optional and is used to ask bowtie to generate the output aligned files in the SAM format. Before running the alignment, we created a directory named 'aligned,' and inserted the path to this folder in front of the -S parameter. This caused the aligner to save the output files inside the 'aligned' folder.

Read more about the SAM format:

To create a directory using the terminal, you can run the make directory command (mkdir):

mkdir aligned

After the alignment process is finished, you can visualize the SAM files using a visualization tool such as the Integrative Genomics Viewer (IGV) from the Broad Institute.


Sort and normalize aligned reads

You can get a lot of information from the aligned SAM files, and these may be sufficient for you. Alternatively, you may need to perform more complex analysis such as WT vs. mutant comparisons. For this purpose, it is necessary to perform some normalization to account for noise and technical variations that are introduced during ChIP-Seq procedure. This is an evolving field, and many alternative approaches are being used for normalizing ChIP-Seq data. We will go into more detail about the normalization methods in a later post, but, here, we give one example.

An important step to enable normalization or visualization of the SAM files is to sort the aligned reads based on genomic location. To do this, we used SAM tools:

After installing sam tools (see above link),  we ran the following for-loop on the aligned sam files.

for sample in *.sam; 
echo $sample; 
describer=$(echo ${sample} | sed 's/.sam//'); 
/anaconda/bin/samtools view -bS ${describer}.sam | /anaconda/bin/samtools sort -o ${describer}.sorted.bam; 
/anaconda/bin/samtools index ${describer}.sorted.bam; 

 The for-loop is similar to the previously described loops. Here we use the sort function of samtools to generate sorted bam files (BAM format). These bam files make it super easy to perform normalization or any other mathematical processing on the aligned reads. You can inspect the content of the SAM and BAM files using the head or tail function, as described in part 1.

After sorting the alignments, to perform normalization, we created a python script that uses the bedtools genomecov function:

This python script was generated because we needed to perform additional calculations before using the genomecov tool (see comments inside the python script file). You can directly use the genomecov tool if no additional preprocessing of the reads is required. Visit our GitHub page to view the python script file:

To use the script, we performed the following command in terminal:

/anaconda/bin/ ./lab/scripts/ --bam sampleA_TCCGCGAA.sorted.bam --bw sampleA.bedgraph

The output of the normalization step is a file with the bedgraph format. 

The bedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data. This track type is similar to the wiggle (WIG) format, but unlike the wiggle format, data exported in the bedGraph format are preserved in their original state.

In part 3, we will discuss how to use R to plot the bedgraph files that were generated by the bedrolls genomecov function.


Free sample scripts available on our GitHub page:

For sample genomic data used in this turtorial click here.

We do not have any affiliation to any of the tools mentioned in this blog post.