The Bioinformatics repository on my GitHub account contains a script I use to "build" the human genome: it creates the genomic data structures I need to run a DNA sequencing analysis. These data structures are the Burrows-Wheeler indices that genomic aligners (Bowtie2) need to do their job.

The script is a simple bash script that downloads the latest human genome build from the Ensembl repository and then processes it. The genome is downloaded as 25 separate chromosome files (22 autosomes, 2 sex chromosomes, and the mitochondrial chromosome), which are concatenated into one large file. That file is then passed to the index-building utility (bowtie2-build).
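The download, concatenate, and index steps can be sketched roughly as follows. This is a minimal illustration, not the actual contents of the script: the base URL, the `GRCh38` file prefix, and the function names are assumptions, and the Ensembl file names change whenever the assembly does.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed values (not taken from the original script); override via environment.
BASE_URL="${BASE_URL:-https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna}"
PREFIX="${PREFIX:-Homo_sapiens.GRCh38.dna.chromosome}"

# Print the 25 chromosome names: 22 autosomes, X, Y, and MT.
chromosomes() {
    seq 1 22
    printf '%s\n' X Y MT
}

# Download each chromosome file and append it, decompressed, to one FASTA file.
build_genome() {
    local out="$1" chr
    : > "$out"                       # start from an empty output file
    for chr in $(chromosomes); do
        curl -fsSL "${BASE_URL}/${PREFIX}.${chr}.fa.gz" | gunzip -c >> "$out"
    done
}

# Run the pipeline only when an output base name is given, e.g. ./sketch.sh genome
if [[ $# -ge 1 ]]; then
    build_genome "$1.fa"
    bowtie2-build "$1.fa" "$1"       # writes the index files as $1.*.bt2
fi
```

Streaming each file through `gunzip -c` straight into the combined FASTA avoids keeping 25 intermediate files on disk, at the cost of having to re-download everything if one transfer fails.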

In addition to the genomic DNA, the script also downloads the gene annotations (GTF), cDNA sequences, and non-coding RNAs (ncRNAs).
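As a rough sketch of how those extra downloads might look, the snippet below builds the three URLs and fetches them into a directory. The release number (`110`), the assembly name, the directory layout, and the `annotation_urls` helper are all assumptions for illustration and should be checked against the Ensembl FTP site before use.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed values (not taken from the original script); override via environment.
ENSEMBL="${ENSEMBL:-https://ftp.ensembl.org/pub}"
RELEASE="${RELEASE:-110}"            # hypothetical example release
ASSEMBLY="${ASSEMBLY:-GRCh38}"

# Print one URL per line: gene annotations (GTF), cDNA, and ncRNA.
annotation_urls() {
    local sp="homo_sapiens" name="Homo_sapiens.${ASSEMBLY}"
    printf '%s\n' \
        "${ENSEMBL}/release-${RELEASE}/gtf/${sp}/${name}.${RELEASE}.gtf.gz" \
        "${ENSEMBL}/release-${RELEASE}/fasta/${sp}/cdna/${name}.cdna.all.fa.gz" \
        "${ENSEMBL}/release-${RELEASE}/fasta/${sp}/ncrna/${name}.ncrna.fa.gz"
}

# Download only when a target directory is given, e.g. ./sketch.sh annotations
if [[ $# -ge 1 ]]; then
    mkdir -p "$1"
    annotation_urls | while read -r url; do
        curl -fsSL -o "$1/$(basename "$url")" "$url"
    done
fi
```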

I plan to update the script with further stages, such as building a transcriptome index with TopHat for splice mapping and indexing the cDNAs with kallisto for transcriptome quantification.

[build-ensembl-genome.sh]

Genome Building

Camilo Valdes