## Perl5 References 2010

I find Perl to be a great language to use for both small-scale one-liner type tasks and as well a for large-scale multi-file applications. However, I need reminders of its usage details every now and again. While I could read the relevant manpages via perldoc, they are somewhat dry and technical (they are manpages after all!). I found these recent tutorials/references much more easier reading.

### Modern Perl Object Orientation

Moose is the Object Oriented foundation for Modern Perl. I found the following Moose presentation by the eminent Perl hacker RJBS very informative. It seems to nicely describe all the latest functionality of Moose (as of August 2010).

## A Taste of Reservoir Algorithms

Imagine that you are the Chief Chocolate Inspector of a chocolate factory. Your job is to ensure that the days production of chocolate by the factory is of good quality. Only upon your recommendation will the factory be shipping out its days worth of chocolate production to its distribution points.

The chocolate factory consists of three adjacent rooms connected together by a conveyor belt/assembly line.

The first room is where the chocolate pieces are produced and initially placed onto the conveyor belt. The factory produces a variety of decadent chocolate types simultaneously and randomly places each type onto the conveyor belt.

The second adjoining room is where each of the individual chocolate pieces get wrapped up. Afterwards, you are given a choice whether to choose that particular piece for taste inspection, or ignore it and wait for the next one.

The final adjoining room is where the chocolate pieces get packaged together in boxes and loaded onto the delivery trucks. The trucks only move upon your decision at the end of the day.

Your goal, as the Chief Inspector, is to notify the truck drivers if the packages are suitable for shipping in as efficient manner as possible. Armed with a special selection bin to hold $n$ pieces of chocolate, you decide to choose $n$ uniformly random pieces of chocolate out the $N$ total pieces of chocolate produced by the factory throughout the day, where $n << N$. Afterwards, you base your taste decision on this random subset.

Unfortunately, you have no idea as to how many chocolate entities will be passing by on any given day. Some days, the amount of chocolate produced by the factory is very high, on others it is pretty low. In other words, you do not know what $N$ is beforehand.

While you could wait to make the sampling after the days production is done; you don’t want to hold up the assembly line making these quality control choices. “I Love Lucy” has shown what kind of hijinks can happen if this process goes awry.

## What to do?

One solution to this problem might be to roll a die as each chocolate piece passes you. If the die rolls even, hold the chocolate for inspection in the selection bin. Otherwise, let the chocolate pass through. If the selection bin is full when choosing a chocolate piece, simply replace the oldest chocolate piece in the bin with the new one.

In this naïve approach, one is biased against keeping the earliest chocolate samples with the ones selected later in the day. To resolve this issue, you can use a loaded die or get more “Dungeons & Dragons”-esque by using more sophisticated dice and rules as the day progresses.

Then with some luck, you could get a uniform sampling of the days production of chocolate.

But there is a better way…

## A Smarter Way

Suppose you could assign a uniform random (i.i.d of course!) number from $[0, 1)$ to each piece of chocolate you encounter on the assembly line.

Now, to obtain a sampling of the days production of chocolate, simply choose the chocolates corresponding to the $n$ lowest random numbers linked with them. Each chocolate piece has an equal chance of being in the lowest $n$ choices, so the sampling is uniform.

With the ubiquitous computing power available these days, you could substitute using dice with a iPhone or Android app to help out. Each time you see a new chocolate piece, generate a random number with your app. Hold onto the chocolate pieces that correspond to the $n$ lowest random numbers. If you encounter a new chocolate piece that is linked with a random number that is lower than the largest random number associated chocolate already in your selection bin, simply replace the old one with the new one.

By the end of the process you now have $n$ uniform samples of the entire days chocolate production, and you never needed to hold onto more than those $n$ items. Pretty clever, eh!

This approach could be optimized even further. For example, you could preemptively generate a large list of random numbers; identify the lowest ones; and afterwards directly choose those chocolate pieces as they appear off the production line and bypass the rest.

Amazingly, you could sample the entire days production of chocolates, $N$, in less than $O(N)$ time. According to this fellow, you can make the process $O(n(1+log(N/n)))$ efficient.

These techniques of sampling a set of unknown size through one pass are called reservoir algorithms.

## The “Real” World

While most of us are not chocolate inspectors, we are all nowadays swamped in a “sea of data”, from science and business to personal tracking. Often times these raw data sets come in large chunks. One data-mining technique to get a better handle of the data is sampling it. Sometimes less is more. Now chocolate pieces become record sets and the assembly line becomes a large file.

Reservoir algorithms can be a useful tool in one’s data science & engineering endeavors.

Posted in algorithms, software | Tagged | 1 Comment

## Pacific Biosciences Buzz

Press release image of the upcoming Pacific Biosciences DNA sequencer, the PacBio RS

This was an interesting video about Pacific Biosciences (PacBio), a third-generation DNA sequencing company, that I recently came across in the Wall Street Journal.  It profiles their upcoming sequencing device, the PacBio RS, from a business perspective.

## sff2fastq

The basic premise of genetic sequencing involves preparing a DNA sample into a form suitable for use on a DNA sequencer.  Afterwards, the sequencer ascertains the sequences of bases on the preapred sample and stores these results into a digital file.  These file formats are related to the sequencing methodology taken by the sequencer.

In 454 sequencing, the SFF format is the native currency of storing the sequence data; ABI-Sanger it is the AB1 or SCF chromatogram file format; Illumina/Solexa it is the QSEQ or Illumina FASTQ format; and in ABI-SOLiD it is the colorspace CSFASTA format.

Most scientists/biologists are more interested in the final sequence data produced rather than the particular vendor technology itself.

During the course of the biological investigation, one often is confronted with data from various sequencing platforms.  A format is needed that is common across platforms.  In the era of next-generation sequencing, it appears that the Sanger FASTQ format is the popular lingua franca of sequence file formats.  It holds both the sequence and quality data generated by the sequencer.  Many of the currently popular (and open-source) aligner and assemblers such as maq, bwa, bowtie, SSAHA2 and velvet accept Sanger FASTQ files as their inputs.

In the world of 454 sequencing, Roche 454 has their own set of tools to work with the data.  Unfortunately, they are not freely available.  While the 454 tools from Roche provide a way to convert their data into a FASTA file format, another device independent sequence file format; there is not a direct SFF to FASTQ conversion utility.

To that end, and for curiosities sake, I decided to write a program to do so, called sff2fastq.  The idea is by no means unique.  There are other similar tools such as flower (haskell-based) and sff_extract (python-based), and other alternative approaches as discussed on seqanswers.  As they say, variety is the spice of life.

Posted in biology, dna sequencing, software | 6 Comments

## Bayes, Bayes, & Bayes

I have been going through a bit of a mathematical refresher lately. It has been a while in directly dealing with things of a probabilistic and statistical nature. In particular, I was reading up on Bayes’ Theorem and Bayesian inference. Bayes’ rule is a subtle mathematical statement that has deep interpretations and implications. Below were three introductory write ups I found upon the subject, each with a slightly different perspective of presentation.

## Writing Tips

I am a perpetual apprentice to the oft neglected and under appreciated art of writing well.  Words, either written or spoken, are the most powerful means of communication we have.  While images, animations, sounds, and music certainly have the ability to capture our attention and engage us, words are needed to persuade, instruct, and argue.

As Roy Jacobsen wrote on his blog, Writing, Clear and Simple:

The words you use, either written or spoken, can have powerful effects on your audience‚—if you use them carefully and skillfully. Whether your goal is to inform, to persuade, to call for action, or to entertain, your words and your stories can be powerful. They can be powerful, because language is software for the mind

For the past few years, aside from corporate email speak, I have spent the majority of my time writing software for computers.  This has placed my writing, or software for the mind skills, a bit in the backseat.  Obviously, being well read, and practice writing are the keys to improvement; however, there are times when direct discussion about the topic itself proves helpful.  Below are a few resources that I have found in the past few months that maybe useful.

Happy Writing!