This, That & the Other

Perl5 References 2010

Posted on August 24, 2010 by indraniel

I find Perl to be a great language for both small-scale, one-liner tasks and large-scale, multi-file applications. However, I need reminders of its usage details every now and again. While I could read the relevant manpages via perldoc, they are somewhat dry and technical (they are manpages after all!). I found these recent tutorials/references much easier reading.

One-liner Tutorials

Essential Perl One-Liners (PDF) – a talk presented by Walter C. Mankowski at YAPC::NA 2010
The top 10 tricks of Perl one-liners – a blog entry from ksplice.com that goes through some practical Perl one-liner shorthands.

Modern Perl Object Orientation

Moose is the Object Oriented foundation for Modern Perl. I found the following Moose presentation by the eminent Perl hacker RJBS very informative. It seems to nicely describe all the latest functionality of Moose (as of August 2010).

Moose: A Guide to the New Revolution

Posted in perl, software

A Taste of Reservoir Algorithms

Posted on August 12, 2010 by indraniel

Flickr: Moonstruck Chocolates by eszter

Imagine that you are the Chief Chocolate Inspector of a chocolate factory. Your job is to ensure that the day’s production of chocolate is of good quality. Only upon your recommendation will the factory ship its day’s worth of chocolate to its distribution points.

The chocolate factory consists of three adjacent rooms connected together by a conveyor belt/assembly line.

The first room is where the chocolate pieces are produced and initially placed onto the conveyor belt. The factory produces a variety of decadent chocolate types simultaneously and randomly places each type onto the conveyor belt.

The second adjoining room is where each individual chocolate piece gets wrapped. Afterwards, you may either choose that particular piece for taste inspection, or ignore it and wait for the next one.

The final adjoining room is where the chocolate pieces get packaged together in boxes and loaded onto the delivery trucks. The trucks only move upon your decision at the end of the day.

Your goal, as the Chief Inspector, is to notify the truck drivers whether the packages are suitable for shipping in as efficient a manner as possible. Armed with a special selection bin that holds $n$ pieces of chocolate, you decide to choose $n$ pieces uniformly at random out of the $N$ total pieces produced by the factory throughout the day, where $n \ll N$. Afterwards, you base your taste decision on this random subset.

Unfortunately, you have no idea how many chocolate pieces will be passing by on any given day. Some days, the amount of chocolate produced by the factory is very high; on others it is pretty low. In other words, you do not know what $N$ is beforehand.

While you could wait to do the sampling until after the day’s production is done, you don’t want to hold up the assembly line making these quality control choices. “I Love Lucy” has shown what kind of hijinks can happen if this process goes awry.

What to do?

One solution to this problem might be to roll a die as each chocolate piece passes you. If the die rolls even, hold the chocolate for inspection in the selection bin. Otherwise, let the chocolate pass through. If the selection bin is full when choosing a chocolate piece, simply replace the oldest chocolate piece in the bin with the new one.

In this naïve approach, the sample is biased toward chocolates selected later in the day, at the expense of the earliest ones. To resolve this issue, you could use a loaded die, or get more “Dungeons & Dragons”-esque by using more sophisticated dice and rules as the day progresses.

Then, with some luck, you could get a uniform sampling of the day’s production of chocolate.

But there is a better way…

A Smarter Way

Suppose you could assign a uniform random (i.i.d., of course!) number from $[0, 1)$ to each piece of chocolate you encounter on the assembly line.

Now, to obtain a sampling of the day’s production of chocolate, simply choose the chocolates with the $n$ lowest random numbers. Each chocolate piece has an equal chance of being among the lowest $n$, so the sampling is uniform.

With the ubiquitous computing power available these days, you could swap the dice for an iPhone or Android app. Each time you see a new chocolate piece, generate a random number with your app. Hold onto the chocolate pieces that correspond to the $n$ lowest random numbers. If you encounter a new chocolate piece whose random number is lower than the largest random number among the chocolates already in your selection bin, simply replace that chocolate with the new one.

By the end of the process you have $n$ uniform samples of the entire day’s chocolate production, and you never needed to hold onto more than those $n$ items. Pretty clever, eh?
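The bookkeeping described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function name `sample_lowest_tags` and its parameters are made up for this post, and a heap is just one convenient way to find the chocolate with the largest random number in the bin.

```python
import heapq
import random


def sample_lowest_tags(stream, n, rng=random):
    """Keep the n items whose random tags are lowest, in a single pass.

    `stream` is any iterable of unknown length; `rng` is anything with a
    .random() method (the random module or a random.Random instance).
    """
    if n <= 0:
        return []
    # heapq is a min-heap, so tags are stored negated: the root is then
    # the entry with the LARGEST tag, i.e. the next one to be evicted.
    selection_bin = []
    for index, item in enumerate(stream):
        tag = rng.random()  # uniform in [0, 1)
        # index breaks ties so the items themselves are never compared
        entry = (-tag, index, item)
        if len(selection_bin) < n:
            heapq.heappush(selection_bin, entry)
        elif tag < -selection_bin[0][0]:
            # new tag beats the largest tag in the bin: swap them
            heapq.heapreplace(selection_bin, entry)
    return [item for _, _, item in selection_bin]


# Usage: pick 5 "chocolates" out of a stream whose length is not used.
chocolates = (f"chocolate #{i}" for i in range(10_000))
print(sample_lowest_tags(chocolates, 5))
```

The bin never holds more than $n$ entries, and each new piece costs at most one heap operation.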

This approach could be optimized even further. For example, you could preemptively generate a large list of random numbers, identify the lowest ones, and afterwards directly choose those chocolate pieces as they appear off the production line, bypassing the rest.

Amazingly, you could sample the entire day’s production of chocolates, $N$, in less than $O(N)$ time. According to this fellow, you can make the process run in $O(n(1+\log(N/n)))$ time.

These techniques of sampling a set of unknown size through one pass are called reservoir algorithms.

The “Real” World

While most of us are not chocolate inspectors, we are all nowadays swamped in a “sea of data”, from science and business to personal tracking. Oftentimes these raw data sets come in large chunks. One data-mining technique for getting a better handle on the data is to sample it. Sometimes less is more. Now chocolate pieces become record sets and the assembly line becomes a large file.
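To make the file analogy concrete, here is a rough sketch that samples lines from a file-like object in one pass. Note that it uses the classic "Algorithm R" flavor of reservoir sampling rather than the random-number-tagging scheme from the previous section; the function name `reservoir_sample_lines` and the fake in-memory file are assumptions for illustration only.

```python
import io
import random


def reservoir_sample_lines(lines, n, rng=random):
    """Return n lines chosen uniformly at random from an iterable of lines.

    Classic Algorithm R: the first n lines fill the reservoir; the k-th
    line after that (counting from 1) replaces a random slot with
    probability n / k.
    """
    reservoir = []
    for seen, line in enumerate(lines, start=1):
        if len(reservoir) < n:
            reservoir.append(line)
        else:
            slot = rng.randrange(seen)  # uniform integer in 0 .. seen-1
            if slot < n:
                reservoir[slot] = line
    return reservoir


# Usage: an in-memory stand-in for a large file of records.
fake_file = io.StringIO("".join(f"record {i}\n" for i in range(100_000)))
for line in reservoir_sample_lines(fake_file, 3):
    print(line.rstrip())
```

With a real file, passing the open file handle in place of `fake_file` should behave the same way, since Python iterates over files line by line.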

Reservoir algorithms can be a useful tool in one’s data science & engineering endeavors.

Posted in algorithms, software | Tagged data-mining

Pacific Biosciences Buzz

Posted on May 3, 2010 by indraniel

Press release image of the upcoming Pacific Biosciences DNA sequencer, the PacBio RS

This was an interesting video about Pacific Biosciences (PacBio), a third-generation DNA sequencing company, that I recently came across in the Wall Street Journal. It profiles their upcoming sequencing device, the PacBio RS, from a business perspective.

Posted in biology, dna sequencing, science

sff2fastq

Posted on April 23, 2010 by indraniel

The basic premise of genetic sequencing involves preparing a DNA sample into a form suitable for use on a DNA sequencer. Afterwards, the sequencer ascertains the sequence of bases in the prepared sample and stores the results in a digital file. The file format depends on the sequencing methodology used by the sequencer.

In 454 sequencing, the SFF format is the native currency for storing the sequence data; in ABI-Sanger sequencing it is the AB1 or SCF chromatogram file format; in Illumina/Solexa it is the QSEQ or Illumina FASTQ format; and in ABI-SOLiD it is the colorspace CSFASTA format.

Most scientists/biologists are more interested in the final sequence data produced rather than the particular vendor technology itself.

During the course of a biological investigation, one is often confronted with data from various sequencing platforms. A format is needed that is common across platforms. In the era of next-generation sequencing, it appears that the Sanger FASTQ format is the popular lingua franca of sequence file formats. It holds both the sequence and quality data generated by the sequencer. Many of the currently popular (and open-source) aligners and assemblers, such as maq, bwa, bowtie, SSAHA2 and velvet, accept Sanger FASTQ files as their inputs.
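To give a feel for what a Sanger FASTQ record holds, here is a small Python sketch that splits one record into its parts and decodes the quality string. It assumes the common four-line layout (header, sequence, separator, qualities) with no line wrapping, and the standard Sanger encoding in which each quality character is the Phred score plus 33 in ASCII. The read name, the sequence, and the helper `parse_fastq_record` are invented for illustration; this is not how sff2fastq itself is written.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq_record(text):
    """Parse a single four-line Sanger FASTQ record (assumed unwrapped)."""
    header, sequence, separator, quality = text.strip().splitlines()
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Sanger FASTQ: Phred score = ASCII code of the character minus 33
    scores = [ord(ch) - 33 for ch in quality]
    return FastqRecord(header[1:], sequence, scores)


# A made-up record: 7 bases, 7 quality characters.
example = "@read_1\nGATTACA\n+\nII5+!~#\n"
record = parse_fastq_record(example)
print(record.name, record.sequence, record.qualities)
```

Each base in the sequence line is paired with one character in the quality line, which is what makes the format convenient for downstream aligners and assemblers.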

In the world of 454 sequencing, Roche 454 has its own set of tools to work with the data. Unfortunately, they are not freely available. While the 454 tools from Roche provide a way to convert their data into the FASTA file format, another device-independent sequence file format, there is no direct SFF to FASTQ conversion utility.

To that end, and for curiosity’s sake, I decided to write a program to do so, called sff2fastq. The idea is by no means unique. There are other similar tools such as flower (Haskell-based) and sff_extract (Python-based), and other alternative approaches as discussed on seqanswers. As they say, variety is the spice of life.

Posted in biology, dna sequencing, software

Bayes, Bayes, & Bayes

Posted on March 16, 2010 by indraniel

I have been going through a bit of a mathematical refresher lately. It has been a while since I dealt directly with things of a probabilistic and statistical nature. In particular, I was reading up on Bayes’ Theorem and Bayesian inference. Bayes’ rule is a subtle mathematical statement that has deep interpretations and implications. Below are three introductory write-ups I found on the subject, each with a slightly different perspective of presentation.
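As a quick refresher before diving into those write-ups, the rule itself fits on one line. For two events $A$ and $B$ with $P(B) > 0$, it reads as follows; the second form just expands $P(B)$ using the law of total probability. The numbers in the worked line are made up purely for illustration: a condition with a 1% prior, a test that flags it 99% of the time when present, and 5% of the time when absent.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
            = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \neg A)\, P(\neg A)}

% Illustrative numbers only: P(A) = 0.01, P(B | A) = 0.99, P(B | not A) = 0.05
P(A \mid B) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.05 \times 0.99}
            = \frac{0.0099}{0.0594} = \frac{1}{6} \approx 0.167
```

Even with a fairly accurate test, the low prior keeps the posterior at roughly one in six, which is the kind of subtlety that makes the rule worth revisiting.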

Posted in science | Tagged math, science

Writing Tips

Posted on January 12, 2010 by indraniel

Flickr: Fountain Pen on Notebook

I am a perpetual apprentice to the oft-neglected and underappreciated art of writing well. Words, either written or spoken, are the most powerful means of communication we have. While images, animations, sounds, and music certainly have the ability to capture our attention and engage us, words are needed to persuade, instruct, and argue.

As Roy Jacobsen wrote on his blog, Writing, Clear and Simple:

The words you use, either written or spoken, can have powerful effects on your audience – if you use them carefully and skillfully. Whether your goal is to inform, to persuade, to call for action, or to entertain, your words and your stories can be powerful. They can be powerful, because language is software for the mind.

For the past few years, aside from corporate email speak, I have spent the majority of my time writing software for computers. This has put my writing, or “software for the mind”, skills a bit in the backseat. Obviously, reading widely and practicing writing are the keys to improvement; however, there are times when direct discussion of the topic itself proves helpful. Below are a few resources that I have found in the past few months that may be useful.

The Elements of Style – This 92-year-old book is perennially popular with writers. Its tips on the fundamentals are always worth perusing.
The Economist Style Guide – This is the style guide given to journalists at the eminent news magazine The Economist. It has a somewhat British English bent to it, but accounts for Americanisms.
The New York Times: Grammar and Usage Section – There is a plethora of interesting articles and resources about writing here. The prestigious newspaper also critiques itself in the After Deadline section. Even the professionals make mistakes, and still work to refine their craft.
Chicago Manual of Style (CMS) – This is a 104-year-old style guide (currently in its 15th edition) for American English. The CMS deals with aspects of editorial practice, from grammar and usage to document preparation.

Happy Writing!

Posted in writing

President Obama’s Nobel Prize Speech

Posted on December 13, 2009 by indraniel


President Obama giving his Nobel Prize Speech at Oslo City Hall (source whitehouse.gov)

On December 10, 2009, President Obama gave his Nobel Prize acceptance speech at the Oslo City Hall. He was the first sitting president to receive the prize since Woodrow Wilson. The speech was interesting in how he juxtaposed the philosophical need for war, in light of recent events in Afghanistan and Iraq, with the acceptance of a peace prize. Clearly, it was an eloquent talk intended to be experienced again after some aging, much like a fine wine. I wonder how this speech will feel when sampled many years in the future.

Posted in news

A Gentle Overview of miRNA and Epigenetics

Posted on December 9, 2009 by indraniel

I recently came across some nifty overviews of topics that are of recent interest in genetics research. The cool thing was that they were in video form. Being a TV addict, this was all too good to be true.

microRNAs

microRNAs (miRNAs) are short, single-stranded RNA molecules that are thought to regulate gene expression in the eukaryotic cell. I’m sure things are a bit oversimplified in the following YouTube video, but nevertheless, it gives a high-level overview of what is going on.

Epigenetics

Back in July 2007, NOVA’s ScienceNOW program with Neil deGrasse Tyson had an episode dealing with epigenetics. Epigenetics, as stated by Wikipedia, refers to changes in phenotype (appearance) or gene expression caused by mechanisms other than changes in the underlying DNA sequence. Unfortunately, I could not find a YouTube-like link of that episode to place directly in the blog post. Please go here to view the show.

Posted in biology, science

The financial crisis: visualized

Posted on March 23, 2009 by indraniel

I thought these were nice overviews of the financial crisis. Too bad there are no similarly nice overviews of a quick solution to the problem.

Hard to believe this first one below was a graduate student project.

The Crisis of Credit Visualized from Jonathan Jarvis

This one is a web supplement to the public radio show Marketplace.

Toxic assets from Marketplace on Vimeo.

Posted in finance, news

Welcome again!

Posted on March 21, 2009 by indraniel

Welcome! I decided to migrate from my old blog site to here after noticing all the extra features provided by the more competitive WordPress.com and Blogger. Many of these features were not available (or at least observable) to me at LiveJournal. Also, recent news does not seem to bode well for new features arriving from LiveJournal. Let’s see how the blog evolves from here.

Posted in Uncategorized