
Scalable Metagenomics Analyses

Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large reference collection of bacterial genomes. Our implementation relies on distributing the alignment of millions of sequencing reads against that collection. The genome collection is partitioned so that it can be distributed across worker machines, which allows the use of large collections of reference genomes. We use the Bowtie2 aligner under the hood on the worker nodes, and are able to maintain fast alignment rates without loss of accuracy.
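As an illustration of the partitioning idea, here is a minimal sketch that splits a genome collection into shards of roughly equal total size. It is not Flint's actual partitioning code: the function name `partition_genomes`, the greedy largest-first strategy, and the use of genome length as the balancing criterion are all assumptions made for the example.

```python
import heapq


def partition_genomes(genome_lengths, num_shards):
    """Greedily assign genomes to shards so total bases per shard stay balanced.

    genome_lengths: dict mapping a genome name to its length in bases.
    Hypothetical helper for illustration; not part of Flint.
    """
    # Min-heap of (total_bases, shard_index): the lightest shard pops first.
    heap = [(0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]

    # Place the largest genomes first; ties are broken by name for determinism.
    ordered = sorted(genome_lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, length in ordered:
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + length, idx))
    return shards
```

Each resulting shard could then be indexed separately and shipped to one worker, so that no single machine needs to hold the whole reference collection.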

Our computational framework is primarily implemented using Spark’s MapReduce model, and is deployed in a cluster launched using the Elastic MapReduce (EMR) service offered by Amazon Web Services (AWS). The initial cluster configuration (as of Spring 2019) consists of multiple commodity worker machines (computational nodes). In the configuration that we currently use, each worker machine has 15 GB of RAM, 8 vCPUs (each vCPU is a hyperthread of a single Intel Xeon core), and 100 GB of EBS disk storage. The worker nodes work in parallel, each aligning the input DNA sequencing reads to its partitioned shard of the reference database.
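The map and reduce steps described above can be sketched in plain Python, without Spark, to show the data flow. This is a toy model under stated assumptions: `align_to_shard` is a hypothetical stand-in that uses exact substring matching instead of Bowtie2, `profile` is an illustrative name, and the loop over shards runs serially where a cluster would run one shard per worker.

```python
from collections import Counter


def align_to_shard(reads, shard):
    """Toy stand-in for an aligner (assumption: exact substring match only).

    reads: list of (read_id, sequence) tuples.
    shard: dict mapping a genome name to its reference sequence.
    Yields (read_id, genome_name) for every hit.
    """
    for read_id, seq in reads:
        for genome, reference in shard.items():
            if seq in reference:
                yield read_id, genome


def profile(reads, shards):
    """Count aligned reads per genome across all shards."""
    # "Map" step: every shard sees every read (serial here, parallel on a cluster).
    hits = [hit for shard in shards for hit in align_to_shard(reads, shard)]
    # "Reduce" step: combine per-shard hits into per-genome read counts.
    return Counter(genome for _, genome in hits)
```

For example, with two single-genome shards and three reads, one of which matches nothing:

```python
shards = [{"g1": "ACGTACGT"}, {"g2": "TTTTGGGG"}]
reads = [("r1", "ACGT"), ("r2", "TTGG"), ("r3", "CCCC")]
counts = profile(reads, shards)  # one read counted for g1, one for g2
```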

You can read the Flint publication to learn more.


Releases

What's New in Release Candidate (RC1)

Flint RC 1 is the initial public release of the Flint pipeline. RC 1 is a focused release that contains refinements and enhancements to the existing features of Beta 2. As part of this release, we are also making available the indices needed to deploy Flint in your cluster.

Changes, Additions, and Fixes

  • Initial Release

Requirements

  • EMR 5.22.0

  • Hadoop 2.8.5

  • Spark 2.0+

  • Python 2.7

  • Boto3

  • Pandas

  • For a full set of requirements, see the full documentation.

Flint is a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast profiling of metagenomic samples against a large collection of reference genomes.



Documentation

  • Overview

  • Sample Code

  • Partitioned Genomes

  • Learn More