Paper: Large Scale Microbiome Profiling in the Cloud

The paper for Flint just got published! You can view the publication at Oxford Bioinformatics:

Large Scale Microbiome Profiling in the Cloud

From the manuscript:

Flint is a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43,552 bacterial genomes from Ensembl.

Flint runs on Amazon’s Elastic MapReduce service, and is able to profile 1 million Illumina paired-end reads against over 40K genomes on 64 machines in 67 seconds – an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments.

The source code is available in GitHub, under the MIT license.