The Bioinformatics repository in my GitHub account contains a script I use to "build" the Human Genome: it creates the genomic data structures I need to run a DNA sequencing analysis. These data structures are the Burrows-Wheeler indices that genomic aligners such as Bowtie2 need to do their job.
Recently I had to upgrade my R installation because I needed a library that required a newer version of R than the one I had installed. I used to live life on the edge and upgrade R as soon as a new version came out, but as my collection of third-party libraries grew, I found myself upgrading less and less often.
I needed to create a series of diagnostic plots for a recent Data Mining project. I created the plots "by hand" using R — by which I mean that I wrote a script to generate them rather than using a tool such as Tableau. The reason is that the data for the plots came from the UCI Machine Learning Repository, and it just so happened that the particular datasets come bundled with R's standard library. :)
A recent assignment in a machine learning class called for drawing the k-nearest-neighbor decision boundary for some given values of k, starting with k=1. The task involved using the standard Euclidean distance between points to determine each point's class from its nearest neighbors, while drawing (by hand) the resulting figure.
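Since the assignment's figure was drawn by hand, there is no code in the post itself; as an illustration, here is a minimal Python sketch (with a made-up toy dataset) of the k=1 rule described above — each query point takes the class of its single nearest training point under Euclidean distance:

```python
import numpy as np

def nearest_neighbor_class(point, train_X, train_y):
    """Classify `point` by the label of its nearest training point (k=1)."""
    dists = np.linalg.norm(train_X - point, axis=1)  # Euclidean distances
    return train_y[np.argmin(dists)]

# Hypothetical toy training set: two classes in 2-D
train_X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
train_y = np.array([0, 0, 1, 1])

# Evaluating the classifier over a grid traces out the decision boundary:
# the label flips where a point becomes closer to the other class.
xs = np.linspace(0.0, 4.0, 9)
grid = np.array([[x, y] for x in xs for y in xs])
labels = np.array([nearest_neighbor_class(p, train_X, train_y) for p in grid])
```

For larger k one would take the majority label among the k smallest distances instead of the single minimum.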
Spark is great, and the more I work with it on my PhD thesis the more changes I make to my local installation on my rMBP. One modification I came across the other day is how to dial down the logging messages in one of the Spark shells — specifically, how to dial down the messages in PySpark when programming in Python.
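One common way to do this (assuming a Spark installation that ships a `conf/log4j.properties.template` file) is to copy that template to `conf/log4j.properties` and lower the root logger's level, for example:

```
# conf/log4j.properties (copied from conf/log4j.properties.template)
# Change the root logger from INFO to WARN to quiet the PySpark shell
log4j.rootCategory=WARN, console
```

Alternatively, from inside a running shell you can call `sc.setLogLevel("WARN")` on the SparkContext for the same effect in the current session.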
I recently needed to run some R code while inspecting a dataset in Tableau, and this post contains some notes & observations on the process of setting it up in OS X. A lot of these observations are also found in the excellent r-bloggers post linked here.
I've been keeping my eye on Spark for a while now, and decided to take the plunge recently after having to do some brief R analyses that were not that complicated and were perfect for Spark. I use TextMate as my R IDE, and I wanted to run my scripts from TextMate straight into Spark. The following are a couple of tips & tricks I found on how to set everything up so that you can start Spark with a Command-R (⌘-R) shortcut.
Weka is a great resource for data mining and machine learning. You can get a lot done with the standalone GUI workbench, but sometimes you need to use it as part of a script in a custom R analysis pipeline. Yes, you could create a shell script that makes use of the Weka command-line tools, and invoke said script from R using a 'system' call, but that could get out of hand really quickly.
I have a love/hate relationship with Circos. I love the figures and plots that you can create with it, but I hate having to install it. We have machines that are never upgraded because someone somehow got it working on them, and we are afraid of having to go through the whole reinstallation process. It's awful. It's great.