The paper for Flint just got published! You can view the publication at Oxford Bioinformatics. Flint is a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes.
We’ve been testing some Spark code that will eventually be moved to AWS. For now, to save costs, we’ve created an 8-node Spark cluster that runs on a set of virtual machines running Ubuntu on VirtualBox. We’ve developed some Bash scripts to make starting (and shutting down) the VMs easy.
The Bioinformatics repository at my GitHub account contains a script I use to "build" the Human Genome: it creates the necessary genomic data structures that I need to run a DNA sequencing analysis. The data structures are Burrows-Wheeler indices that the genomic aligners (Bowtie2) need to get their job done.
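For readers who have not met the Burrows-Wheeler transform before, here is a toy Python sketch of the idea behind those indices. It is only an illustration on a short string, not what the build script or Bowtie2 actually runs: real index builders use far more memory-efficient constructions, and the function names and the `$` sentinel here are my own choices for the example.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel  # the sentinel marks the end of the original string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Rebuild the original text by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    # The row that ends with the sentinel is the original text plus the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
```

The transform is reversible, which is what lets an aligner search the compressed form and still recover positions in the original sequence. This naive version costs quadratic memory in the text length, so it is only suitable for tiny inputs.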
Recently I had to upgrade my R installation because I needed to install a library that required a newer version of R than the one I had. I used to live life on the edge and upgrade R as soon as a new version was available, but as my collection of third-party libs grew, I upgraded R less and less often.
I needed to create a series of diagnostic plots for a recent Data Mining project. I created the plots by hand using R — I say "by hand" to mean that I wrote a script to generate them, rather than using a tool such as Tableau. The reason is that the data for the plots came from the UCI Machine Learning Repository, and it just so happened that the particular datasets come bundled with the R standard library. :)
A recent assignment in a machine learning class called for drawing the k-nearest-neighbor decision boundary for some given values of k, starting with k=1. The task involved using standard Euclidean distance between the starting points to determine the class of the nearest neighbors, and at the same time to draw (by hand) the resulting figure.
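The same exercise can be checked programmatically. Below is a minimal Python sketch of a k-nearest-neighbor classifier with Euclidean distance, plus a helper that labels every point on a grid so the boundary between classes can be read off. The training points, labels, and function names are made up for illustration and are not the ones from the assignment.

```python
import math
from collections import Counter


def knn_predict(train, query, k=1):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs. Distances are Euclidean.
    If the vote is tied, the label seen first among the nearest neighbors wins.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def decision_grid(train, k, xs, ys):
    """Label every (x, y) grid point; the boundary lies where adjacent labels differ."""
    return [[knn_predict(train, (x, y), k) for x in xs] for y in ys]


if __name__ == "__main__":
    # Hypothetical training data: two points per class.
    train = [
        ((1.0, 1.0), "A"),
        ((2.0, 1.5), "A"),
        ((4.0, 4.0), "B"),
        ((5.0, 3.5), "B"),
    ]
    xs = [0.5 * i for i in range(13)]  # x from 0.0 to 6.0
    ys = [0.5 * i for i in range(11)]  # y from 0.0 to 5.0
    # Print with the largest y on top so the output reads like a plot.
    for row in reversed(decision_grid(train, 1, xs, ys)):
        print("".join(row))
```

With k=1 the regions follow the Voronoi cells of the training points, so the printed grid should match a hand-drawn boundary made of perpendicular bisectors between points of different classes.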