
Getting Started

Flint is a metagenomics profiling pipeline built on the Apache Spark framework and designed for fast profiling of metagenomic samples against a large collection of reference genomes. Flint uses Spark's built-in parallelism and streaming engine to quickly map reads against a large reference collection of bacterial genomes.

Requirements

The Flint software is designed to run on AWS, and requires a few assets to be in place before you can run it. You will need the following tools and items to set up and run a Flint cluster:

  1. An AWS Account.

  2. A copy of the partitioned genomes collection.

  3. A copy of the Flint source code.

  4. A terminal emulator for your computer (Terminal.app, iTerm2, PuTTY, etc.).

  5. An SFTP client (Panic's Transmit, FileZilla, etc.).


Deployment, Configuration, and Installation

Flint runs on Spark, and the current version is tuned for Amazon's EMR service. While the Amazon-specific code is minimal, we do not yet support Spark clusters outside of EMR. A Flint project consists of multiple pieces (outlined below), and you will need to stage the genome reference assets and machine configurations before launching a cluster. The pieces are:

  1. Asset Staging: Upload the bacterial reference genomes to an accessible data bucket in Amazon's S3 storage, along with two configuration scripts for the cluster.

  2. 🎛 EMR Cluster Launch: Launch an EMR cluster configured with the bootstrap actions and steps that you staged in the previous step.

  3. 📡 Accessing the Cluster: Connect to the cluster that you just created. You can connect through SSH or through an SFTP client.

  4. ⚙️ Source Code Deployment: Upload the main Flint Python script, along with utilities for copying the reference shards into each cluster node and a template of the spark-submit resource file, to the cluster's master node.

  5. 🛑 Terminating a Cluster: Any cloud provider will charge you for the time you use, so it's critical that you terminate your clusters once you are done using them.
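
As a rough illustration of the first and last pieces above (asset staging and cluster termination), here is a minimal Python sketch that assumes you are scripting against AWS with the boto3 library. The helper names `stage_assets` and `terminate_cluster`, the `flint` key prefix, the bucket name, the local directory, and the cluster ID are all hypothetical placeholders and are not part of Flint; the same steps can be done through the AWS console.

```python
from pathlib import Path


def stage_assets(s3_client, bucket, local_dir, prefix="flint"):
    """Upload every file under local_dir to the bucket, keeping relative paths.

    s3_client is expected to be a boto3 S3 client. The "flint" prefix is an
    assumption made for this sketch, not a location that Flint requires.
    Returns the list of object keys that were uploaded.
    """
    root = Path(local_dir)
    uploaded = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Build the object key from the path relative to the staging root.
            key = f"{prefix}/{path.relative_to(root).as_posix()}"
            s3_client.upload_file(str(path), bucket, key)
            uploaded.append(key)
    return uploaded


def terminate_cluster(emr_client, cluster_id):
    """Ask EMR to shut down one cluster so that it stops accruing charges.

    emr_client is expected to be a boto3 EMR client.
    """
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])


# Example usage (requires boto3 and configured AWS credentials; the bucket,
# directory, and cluster ID below are placeholders):
#   import boto3
#   stage_assets(boto3.client("s3"), "my-flint-bucket", "./flint-assets")
#   terminate_cluster(boto3.client("emr"), "j-XXXXXXXXXXXXX")
```

Passing the clients in as arguments keeps the helpers free of credentials and region settings, which stay in your own AWS configuration.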

The following sections contain details on each of the above. If you have any questions or comments, please get in touch.




