My Dominant Hemisphere

The Official Weblog of 'The Basilic Insula'

Archive for the ‘Technology’ Category

New Beginning: Going Anonymous!

with 2 comments

via NguyenDai @ Flickr (CC BY-NC-SA License)

Howdy readers!

Just a quick update. For reasons far too many to list out here, I’ve decided to pursue an anonymous blog in addition to this one. This blog fills a niche and I’d like to keep that intact as I continue to post things of interest here. Furthermore, there are many topics that I frequently ruminate about and that I’d be more comfortable writing about and discussing anonymously. I’ve come to understand that this blog, a tool that’s meant to caress my intellect as much as it does yours (come on! admit it! :-P), is unsuitable to fulfill this important role in its entirety.

If you’re a close friend or a blogger who knows me personally, then you know how to find me and would probably recognize my anonymous presence when you see it. To you I make just one earnest plea: try not to blow my cover 😛 ! My friend Jaffer of Maniaravings shared a pertinent example of how the privacy of bloggers can be adversely affected by the slapdash behavior of people known to them, sometimes unintentionally. Always keep in mind some relevant guidelines for bloggers set forth by the Electronic Frontier Foundation (a privacy group) here.

Alrighty then! Until we meet again, cheerio!


Copyright Firas MR. All Rights Reserved.

“A mote of dust, suspended in a sunbeam.”



Written by Firas MR

March 22, 2011 at 4:33 pm

Posted in Technology


How To [Windows/Linux]: OCR On PDFs Using Tesseract and Imagemagick

leave a comment »

OCR

via OCReactive@Flickr (CC BY-NC-SA License)

Howdy readers!

Many moons ago, we met and talked about some of the basics of computer programming. Today I’m going to share with you a BASH shell script that I created using publicly available content as I was trying to OCR a couple of PDFs lying on my hard drive.

OCR is short for “Optical Character Recognition”. OCR software contains algorithms that analyze photographs/scanned images of books, articles, etc. (i.e. text matter) and convert them into plain text such that it can be copy/pasted or manipulated in various forms. For more on what OCR does, see here.

PDFs are ubiquitous these days. And although the file format has been open-sourced and standardized, what hasn’t is the way people create PDFs. This gives rise to a plethora of unexpected differences, such that two people could create a PDF file from the same input and yet come out with totally different-looking PDFs. A lot of this has to do with differences in the way the metadata, layout information, text layer, embedded fonts, reflow properties, etc. have been stored in the PDF file. For across-the-board accessibility (by people using mobile phones, eReaders, etc.), getting all of these right is absolutely essential.

Sadly, many PDFs of eBooks available online (such as at Archive.org) lack these properties and thus can be a pain to read on small screens. One of the most frequent problems is that these PDFs are often merely a collection of scanned images of books and articles, and so aren’t amenable to note taking, highlighting text, copy/pasting text, etc. This is where OCR comes into play. Using OCR software one ends up with a file containing text that can then be manipulated to one’s liking. OCR software will obviously omit any pictures or illustrations in its output.

This how-to has been tested on Windows Vista Basic and uses free and open-source software. The script will also work on a Linux system.

  1. Download and install Cygwin from here. Cygwin provides a Linux-like environment on the Windows platform. The default shell that it comes with is BASH. As compared to DOS on Windows, BASH provides a saner way to create tiny programs that can automate tasks. The commands are easier to read and understand.
  2. Run Cygwin and check the output of:
    echo $TERM

    If it says "dumb", then you’re faced with a well-known bug in the installation that doesn’t allow Cygwin to behave properly. To remedy this:

    1. Exit Cygwin.
    2. Click on the Start Menu.
    3. In the field that says “Start Search”, type “Run” and then hit ENTER.
    4. Type sysdm.cpl in the dialogue box that opens.
    5. You are now in the System Properties window. Click on the tab that says “Advanced”. Then click on “Environment Variables”. Under “System Variables” scroll down to and click on the entry that says “TERM” and click on the “Edit” button at the bottom.
    6. In the box that opens, delete whatever is under “Variable Value” and type cygwin.
    7. Click OK and close the box. Then Click OK and close the “System Properties” box.
    8. Open Cygwin again and check that the output of echo $TERM now gives you cygwin as the answer.
  3. We’ll need to install a few packages on Cygwin. Install the nano package. Nano is an easy-to-use text editor and is more reliable than lame-old Notepad. Notepad can sometimes misbehave and insert invisible control characters (such as carriage returns or end-of-file marks) that Linux systems WILL NOT ignore.
  4. Install the tesseract-ocr, tesseract-ocr-eng, imagemagick and ghostscript packages. Tesseract is the OCR software we shall be using. It works best with English text and supposedly has a reputation for being more accurate than other open-source tools out there. Imagemagick is a set of software tools that allow image manipulation using the command-line. Ghostscript is software that Imagemagick will require in order to work with PDFs.
  5. Open Cygwin. Right-click on the title bar of the window and go to Properties. Check (tick-mark) the boxes that say “QuickEdit Mode” and “Insert Mode”. Hit OK. Ignore any error messages that pop up.
  6. Using nano we will create a BASH script called ocr.sh . This will need to be placed or copied to the directory that contains the PDF file that needs to be OCR’d. Type the following text out manually (exactly as it is) or just copy paste it into nano. After copying text from here, when you right-click inside Cygwin, the text will be pasted inside the window. To save the file hit Ctrl-O. Then hit ENTER. Then exit nano by hitting Ctrl-X.

    Using nano to create a file on Cygwin

    Inside nano

    #!/bin/bash
    
    # Created by Firas MR.
    # Website: https://mydominanthemisphere.wordpress.com
    
    # define variables
    SCRIPT_NAME=`basename "$0" .sh`
    TMP_DIR=${SCRIPT_NAME}-tmp
    OUTPUT_FILE=${SCRIPT_NAME}-output.txt
    
    # make a temporary directory
    
    mkdir $TMP_DIR
    
    # copy PDF to temporary directory
    
    cp "$@" $TMP_DIR
    
    # change current working directory to temporary directory
    
    cd $TMP_DIR
    
    # use Imagemagick tool to read PDF pages at a pixel density of
    # 150 ppi in greyscale mode and output TIFF files at a pixel
    # depth of 8. Tesseract will misbehave with pixel depth > 8
    # or with color images.
    
    convert -density 150 -depth 8 -colorspace gray -verbose * p%02d.tif
    
    # For every TIFF file listed in numerical order in the temporary
    # directory (contd)
    
    for i in `ls *.tif | sort -tp -k2n`;
    
    do
    
    # strip away full path to file and file extension
    
     BASE=`basename "$i" .tif`;
    
    # run Tesseract using the English language on each TIFF file
    
     tesseract "${BASE}.tif" "${BASE}" -l eng;
    
    # append output of each resulting TXT file into an output file with
    # pagebreak marks at the end of each page
    
     cat ${BASE}.txt | tee -a $OUTPUT_FILE;
     echo "[pagebreak]" | tee -a $OUTPUT_FILE;
    
    # remove all TIFF and TXT files
    
     rm ${BASE}.*;
    
    done
    
    # move output file to parent directory
    
    mv $OUTPUT_FILE ..
    
    # remove any remaining files (eg. PDF, etc.)
    
    rm *
    
    # change to parent directory
    
    cd ..
    
    # remove temporary directory
    
    rmdir $TMP_DIR
    
  7. Next we’ll need to make the file executable by all users. To do this type
    chmod a+x ocr.sh

    and hit ENTER.

  8. Change directories to where the PDF file is located. Eg: in order to change directories to the C: drive in Cygwin you need to do:
    cd /cygdrive/c/

    List contents by typing

    ls -al

    Copy ocr.sh to the directory that contains your PDF. Do this by typing

    cp ~/ocr.sh .

    (That dot is not a typo!). Rename the PDF to a simple name without hyphens or weird characters. Make it something like bookforocr.pdf . You can do this by typing

    mv <name of PDF file> bookforocr.pdf
  9. Type ./ocr.sh bookforocr.pdf and observe as your computer chugs away 🙂 ! You’ll end up with a file called ocr-output.txt containing the OCR’d data from the book! Imagemagick will use up quite a bit of RAM as it works on the PDF. Expect some sluggishness in your computer while it does this.
  10. You can convert the txt file into anything you like. For example, an EPUB file using Calibre that can then be uploaded to an eReader such as the B&N NOOK :-). (A command-line sketch of this follows below.)
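
If you have Calibre installed, its command-line converter can do this without even opening the GUI. This is just a minimal sketch, assuming the ocr-output.txt file produced by the script above and a made-up title; tweak the metadata flags to taste:

    # convert the plain-text OCR output into an EPUB, setting some basic metadata
    ebook-convert ocr-output.txt bookforocr.epub --title "My OCR'd Book" --authors "Unknown"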

One could modify the script to crop, set white-points, etc. for anything fancier. For Windows users who like a GUI, a good open-source cropping tool for PDFs is BRISS. It is a great boon for easily cropping multi-column text matter. Another great tool for the same purpose is Papercrop (although, since it rasterizes its output you notice a significant decrease in quality).

A Linux Journal article describes how to find out position co-ordinates for cropping using GIMP.

Another way that I discovered to OCR a PDF is to use OCRopus. It claims to have automatic and intelligent layout analysis for dealing with stuff like multiple columns, etc.

Alrighty then. See you next time! Feel the OCR power on your PDFs 🙂 !

# Footnotes:

Ubuntuforums Howto on OCR
Circle.ch: How to OCR multipage PDF files
The Kizz Notes: cygwin: WARNING: terminal is not fully functional


Copyright Firas MR. All Rights Reserved.

“A mote of dust, suspended in a sunbeam.”



Written by Firas MR

March 20, 2011 at 8:58 am

Posted in Technology, Unix


What’s New: Blog’s FriendFeed Alter Ego & Intrasite Tag Search Goodness

with 4 comments

My Dominant Hemisphere now has a FriendFeed! Follow along!

Hello everyone!

Just a couple of quick updates about the blog:

  1. Having just read some of Lorelle’s excellent advice on the use of Categories and Tags, I’ve decided to implement an intrasite tag search at the bottom of every post. Clicking on any of these tags will automatically return items from the blog that are tagged with these words.

    I’m using the following re-hashed bookmarklet (thanks to Lorelle and Rakesh) in order to put them in my posts: 

    javascript: ( function() { /* Technorati Tag Book Marklet 0.3 Created First By: Lorrell <http://lorelle.wordpress.com> Later Modified By: Rakesh <http://rakeshkumar.wordpress.com> Last Modified by: Firas MR <https://mydominanthemisphere.wordpress.com/about/> */ var a=''; var t=prompt('Enter Tags separated by commas',''); if(!t) return; var tr=t.split(','); a+='---<br /> <img src='+unescape('%22')+'https://mydominanthemisphere.files.wordpress.com/2010/11/gravatar2.png'+unescape('%22')+' align='+unescape('%22')+'left'+unescape('%22')+' />Copyright <a href='+unescape('%22')+'https://mydominanthemisphere.wordpress.com/about/'+unescape('%22')+' title='+unescape('%22')+'Copyright Firas MR. All Rights Reserved.'+unescape('%22')+'>Firas MR</a>. All Rights Reserved.<br /> <br /> <em>"A mote of dust, suspended in a sunbeam."</em><p><br /> <br /> <br /> <br /> <br /> <hr /></p><p><code><font size="-1"><strong>Search Blog For Tags: </strong>'; for(var i=0;i<tr.length;i++) { tr[i]=tr[i].replace(/^\s+/,""); tr[i]=tr[i].replace(/\s+$/,""); var tag_text=tr[i]; tr[i]=tr[i].replace(/\s+/g,"-"); if(i > 0){ a+=', '; } a+='<a href='+unescape('%22')+'https://mydominanthemisphere.wordpress.com/tag/'+tr[i]+unescape('%22')+' rel='+unescape('%22')+'tag'+unescape('%22')+'>'+tag_text+'</a>'; } a+='</font></code></p>'; prompt('Copy this html code, Press OK, Then Paste into your blog entry:',a); } )()
    
  2. I’ve cleaned up and organized the Post Categories into a hierarchy for easier navigation.
  3. The blog/website now has a detailed About page that’s worth checking out!
  4. I’ve also added a new favicon for the website.
  5. Also new is a Subscribe by Email link, the option to receive RSS via Feedburner, and a FriendFeed microblogging site with an accompanying widget that goes into the sidebar for shorter updates.


Copyright Firas MR. All Rights Reserved.

“A mote of dust, suspended in a sunbeam.”

 


 


Written by Firas MR

November 11, 2010 at 7:14 am

What Makes FreeBSD Interesting

with 4 comments


A Narrative History of BSD, by Dr. Kirk McKusick (Courtesy: bsdconferences channel @ Youtube)

Oh Lord, won’t you buy me a 4BSD?
My friends all got sources, so why can’t I see?
Come all you moby hackers, come sing it out with me:
To hell with the lawyers from AT&T!

— a random, hilarious fortune cookie touching on the origins of the FreeBSD project

Howdy all!

Another quick post about tech stuff today. Someday I’ll delve into FreeBSD in a lot more detail. But for now, a brief rundown of why I personally think FreeBSD is one of the best toys around to play with today:

  1. Great documentation! Aside from the FreeBSD Handbook, there are two other books that I think do a phenomenal job in teaching not just the way things are done in the BSD world, but also UNIX philosophy in general. Michael Lucas’s, ‘Absolute FreeBSD‘ and Greg Lehey’s, ‘The Complete FreeBSD‘. My personal all time favorite tech book is currently, ‘The Complete FreeBSD‘. Note the emphasis on ‘all time’. That kind of thing doesn’t come easily from a person who’s not a professional techie. Although Greg ‘Groggy’ Lehey (as he’s popularly known) hasn’t covered the latest version of FreeBSD, a lot of the knowledge you gain from reading his book is pretty transferable. This book also teaches you how computing all began. From the origins of the word ‘Terminal’, to the Hayes command set (he even teaches you some basic commands to talk directly to your modem!), to how the Internet came to be shaped with TCP/IP and BIND and so on. Go check it out for free here and listen to Lehey and Lucas as they are interviewed by BSDTalk here and here. If you’ve ever dabbled in the Linux world, you’ll soon come to realize that FreeBSD’s approach in consolidating, streamlining and simplifying documentation is like a breath of fresh air! Oh and by the way, Dru Lavigne, another famous personality in the BSD world has a great talk on the similarities and differences between BSD and Linux here.
  2. Another incredible boon is their hardware compatibility list (a.k.a. the ‘Hardware Notes‘, that come with every release). It’s jaw-droppingly amazing that you are presented with a list of all known chips/circuit boards and the drivers that you’ll need to use to get them working all organized in such a neat manner right on their main website! Again, something that will definitely blow you away if you’re coming from the Linux world. In fact, when anybody asks me what hardware I recommend for good open-source support (i.e. cross-compatibility across major Operating Systems), I usually turn to this excellent list. It’s a great shopper’s guide! 🙂
  3. From my experience, it’s a lot easier to grasp fundamental concepts about the way computers work by reading about FreeBSD than by looking at books about Linux. In fact Arch Linux, which is a great Linux distribution that I recommend if you want to explore how Linux works, borrows a lot from the manner in which FreeBSD functions (its /etc/rc.conf file for example) as part of its KISS (Keep It Simple, Stupid) philosophy.

More on FreeBSD later! That does it for today! Cheers! 🙂

Copyright © Firas MR. All rights reserved.


Powered by ScribeFire.

Written by Firas MR

October 25, 2010 at 7:04 pm

Posted in Technology, Unix


Beginning Programming In Plain English

with 3 comments

Part 1 of an introductory series on programming using the Python language, via SciPy @ Archive.org (see the Special Thanks note at the end of this post)

Before I begin today’s discussion (since it concerns another book), a quick plug for Steve McCurry, whose photography I deeply admire and whose recent photo-essays on the subject of reading are especially inspirational and worth checking out. I quote:

Fusion: The Synergy of Images and Words Part III « Steve McCurry’s Blog

“Reading is a means of thinking with another person’s mind; it forces you to stretch your own.” — Charles Scribner

Susan Sontag said: “The camera makes everyone a tourist in other people’s reality.” The same can be said for reading books.

Every once in a while, I receive feedback from readers as to how much they appreciate some of my writing on non-clinical/non-medical subjects. Sometimes, the subject matter concerns books or web resources that I’ve recently read. Occasionally, I also like taking notes as I happen to read this material. And often, friends, family and colleagues ask me questions on topics that I’ve either read a book about or have made notes on. Note-taking is a good habit as you grow your comprehension of things. In my opinion, it also helps you skeletonize reading material – sort of like building a quick ‘Table Of Contents’ – that you can utilize to build your knowledge base as you assimilate more and more.

If you’ve ever visited a college bookstore in India, you’ll find dozens and dozens of what are popularly referred to as “guides” or “guidebooks”. These contain summaries and notes on all kinds of subjects – from medicine to engineering and beyond. They help students:

  1. Get verbosity in their main coursebooks (often written in English that is more befitting the Middle Ages) out of the way to focus on skeletonizing material
  2. Cram before exams

I tend to think of my notes and summaries of recently-read books as guidebooks. Anchor points that I (and often family or friends) can come back to later on, sometimes when I’ve long forgotten a lot of the material!

I write this summary in this spirit. So with all of that behind us, let’s begin.

I stumbled upon an enticing little book recently, called “Learning the BASH shell“, by Cameron Newham & Bill Rosenblatt. Being the technophile that I am, I just couldn’t resist taking a peek.

I’ve always been fascinated by the innards of computers – from how they’re made and assembled to how they are programmed and used. My first real foray into them began with learning some of the fundamentals of DOS and BASIC on an old 286 (I think) as a 7th grader. Those were the days of pizza-box styled CPU-case form factors, monochrome monitors that had a switch that would turn text green, hard disks that were in the MB range, RAM that was measured in KB and when people thought 3.5 inch floppies were cool. Oh boy, I still do remember the way people used to go gaga over double-sided, high-density, pre-formatted and stuff! As I witnessed the emergence of CDs and then later DVDs and now SSDs and portable HDs, I got my hands dirty on the 386, the 486, the Pentium 1, the Pentium 3, the Pentium 4 (still working!) and my current main workstation which is a Core 2 Duo. Boy, have I come a long way! Over the years I’ve read a number of books on computer hardware (this one and this one recently – more on them for a future post) and software applications and Operating Systems (such as this one on GIMP, this one on GPG, this one, this one and this one on Linux and this one and this one on FreeBSD – again, more on them later!). But there was always one cranny that seemed far too daunting to approach. Yup, programming. Utterly jargoned, the world of modern programming has seemed really quite esoteric & complicated to me from the old days, when BASIC and dBASE could get your plate full. When you’ve lost >95% of your memory on BASIC, it doesn’t help either. Ever since reading about computational biology or bioinformatics (see my summary of a book on the topic here), I’ve been convinced that getting at least a superficial handle on computer programming concepts can mean a lot in terms of having a competitive edge if you ever contemplate being in the research world. This interplay between technology and biology and the level to which our research has evolved over the past few decades was further reinforced by something I read recently from an interview of Kary Mullis, the inventor of PCR. He eventually won the Nobel Prize for his work:

Edge: Eat Me Before I Eat You! A New Foe For The Bad Bugs, A Talk with Kary Mullis

[…]

What I do personally is the research, which I can do from home because of the Internet, which pleases me immensely. I don’t need to go to a library; I don’t need to even talk to people face to face.

[…]

There are now whole books and articles geared towards programming and biology. I recommend the great introductory essay, Why Biologists Want to Program Computers, by author James Tisdall.

“Learning the BASH shell” is a fascinating newbie-friendly introduction to the world of programming and assumes extremely rudimentary familiarity with how computers work or computer programming in general. It certainly helps if you have a working understanding of Linux or any one of the Unix operating system flavors, but if you’re on Windows you can get by using Cygwin. I’ve been using Linux for the last couple of years (originally beginning with Ubuntu 6.06, then Arch Linux and Debian, Debian being my current favorite), so this background certainly helped me grasp some of the core concepts much faster.

Defining Programming

So what exactly is programming anyway? Well, think of programming as a means to talk to your computer to carry out tasks. Deep down, computers understand nothing but the binary number system (eg: copy this file from here to there translates into gibberish like .…010001100001111000100110…). Not something that most humans would find even remotely appealing (apparently some geeks’ favorite pastime is reverse-engineering human-friendly language from binary!). Now most of us are familiar with using a mouse to point-and-click our way to getting tasks done. But sometimes it becomes necessary to speak to our computers in more direct terms. This ultimately comes down to entering a ‘programming environment’, typing words in a special syntax (depending on what programming language you use) using this environment, saving these words in a file and then translating the file and the words it contains into language the computer can understand (binary language). The computer then executes tasks according to the words you typed. Most languages can broadly be divided into:

  1. Compiler-based: Words in the programming language need to be converted into binary using a program called a ‘compiler’. The binary file can then be run independently. (eg. the C programming language)
  2. Interpreter-based: Words in the programming language are translated on-the-fly into binary. This on-the-fly conversion occurs by means of an intermediary program called an ‘interpreter’. Because of the additional resources required to run the interpreter program, it can sometimes take a while before your computer understands what exactly it needs to do. (eg. the Perl or Python programming languages)

If you think about it, a lot of the stuff we take for granted is actually similar to programming languages. HTML (the stuff of which most web-pages are made) and LATEX (used to make properly typeset professional-quality documents) are called Text Mark-up Languages. By placing the typed words in your document between various tags (i.e. by ‘marking’ text), you tell your web-browser’s HTML-rendering-engine or your LATEX program’s LATEX-rendering-engine to interpret the document’s layout, etc. in a specific way. It’s all actually similar to interpreter-based programming languages. Javascript, the language that’s used to ask your browser to open up a pop-up, etc. is also pretty similar.

What is BASH?

BASH is first and foremost a ‘shell’. If you’ve ever opened up a Command-Prompt or CLI (Command Line Interface) on Windows (Start Menu > Accessories > Command Prompt), then you’ve seen what a shell looks like. Something that provides a text interface to communicate with the innards of your operating system. We’re used to doing stuff the GUI way (Graphical User Interface), using attractive buttons, windows and graphics. Think of the shell as just an alternative means to talk to your computer. Phone-line vs. paper-mail, if that metaphor helps.

Alright, so we get that BASH provides us with an interface. But what else does it do? Well, BASH is also an interpreted programming language! That is amazing because what this allows you to do is use your shell to create programs for repetitive or complicated multi-step tasks. A little segue into Unix philosophy bears merit here. Unix-derivative operating systems, unlike others, basically stress breaking complicated tasks into tiny bits. Each bit is to be worked on by a program that specializes in that given component of a task. sort is a Unix program that sorts text. cut snips off a chunk of text from a larger whole. grep is used to find text. sed is used to replace text. The find program is used to find files and directories. And so on. If you need to find a given file, then look for certain text in it, yank out a portion of it, replace part of this chunk, then sort it in descending order, all you do is combine find, grep, sed, cut and sort using the proper syntax. But what if you didn’t really want to replace text? Then all you do is omit sed from the workflow. See, that’s the power of Unix-based OSs like Linux or FreeBSD. Flexibility.
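
As a purely made-up illustration of that combine-small-tools idea (the directory, file pattern and search terms here are hypothetical), a single pipeline might look like this:

    # find all .txt files under the current directory, keep the lines that mention "eyeless",
    # yank out the 2nd space-separated column, replace "fly" with "drosophila",
    # and finally sort the result in descending order
    find . -name "*.txt" -exec grep -h "eyeless" {} + | cut -d' ' -f2 | sed 's/fly/drosophila/' | sort -r

Drop or add a stage and the pipeline still works, which is exactly the flexibility being described above.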

The BASH programming language takes simple text files as its input. An interpreter called bash then translates the words (commands, etc.) into something the machine can act on. It’s really as simple as that. Because BASH stresses the Unix philosophy, it assumes you’ll need to use the various Unix-type programs to get stuff done. So at the end of the day, a BASH program looks a lot like:

execute the Unix program date
assign the output of date to variable x
if x = 8 AM
then execute these Unix programs in this order (find, grep, sed, cut, sort, etc.)
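
In real BASH syntax, that pseudocode would come out roughly like the sketch below. The directory, the file pattern and the 8 AM condition are all made up purely for illustration:

    #!/bin/bash

    # run the Unix program date and assign its output (just the hour) to a variable
    x=`date +%H`

    # if it's 8 AM, run a few Unix programs in order
    if [ "$x" = "08" ]; then
        find ~/notes -name "*.txt" | sort
    fi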

Basic Elements of Programming

In general, programming consists of breaking down complicated tasks into bits using unambiguous language in a standard syntax.

The fundamental idea (using BASH as an example) is to:

  1. Construct variables.
  2. Manipulate variables. Add, subtract, change their text content, etc.
  3. Use Conditions such as if/then (referred to in technobabble as “Flow Control”)
  4. Execute Unix programs based on said Conditions

All it takes to get going is learning the syntax of framing your thoughts. And for some languages this can get hairy.

This explains why some of the most popular programming languages out there try to emulate human language as much as possible in their syntax. And why a popular language such as Perl was in fact developed by a linguist!

This was just a brief and extremely high-level introduction to basic concepts in programming. Do grab yourself a copy and dive into “Learning the BASH shell” with the aforementioned framework in mind. Before you know it, you’ll start putting two and two together and be on your way to developing your own nifty program!

I’m going to end for today with some of the additional excellent learning resources that I’m currently exploring to take my quest further:

  1. Steve Parker’s BASH tutorial (extremely easy to follow along)
  2. Greg’s BASH Guide (another one recommended for absolute noobs)
  3. Learning to Program Using Python – A Tutorial for Hobbyists, Self-Starters, and All Who Want to Learn the Art of Computer Programming by Alan Gauld
  4. How to think like a Computer Scientist – Learning with Python by Jeffrey Elkner, Allen B. Downey, and Chris Meyers

UPDATE 1: If you’re looking for a programming language to begin with and have come down to either Perl or Python, but are finding it difficult to choose one over the other, then I think you’ll find the following article by the famous Open Source Software advocate, Eric S. Raymond, a resourceful read: Why Python?

UPDATE 2: A number of resourceful, science-minded people at SciPy conduct workshops aimed at introducing Python and its applications in science. They have a great collection of introductory videos on Python programming concepts & syntax here. Another group, called FOSSEE, has a number of workshop videos introducing Python programming here. They also have a screencast series on the subject here.

UPDATE 3: AcademicEarth.org has quite a number of useful lecture series and Open Courseware material on learning programming and basic Computer Science concepts. Check out the MIT lecture, “Introduction to Computer Science and Programming” which is specifically designed for students with little to no programming experience. The lecture focuses on Python.

Copyright Firas MR. All rights reserved.


# Player used is Stream Player licensed under the GPL. Special thanks to Panos for helping me get the embedded video to work! Steps I followed to get it working:

  • Download the Stream Player plugin as a zip. Extract it locally. Rename the player.swf file to player-swf.jpg
  • Upload player-swf.jpg to your WordPress.com Media Library. Don’t worry, WordPress.com will not complain since it thinks it’s being given a JPG file!
  • Next insert the gigya shortcode as explained at Panos’ website. I inserted the following between square brackets, [ ] :
  • gigya  src="https://mydominanthemisphere.files.wordpress.com/2010/11/player-swf.jpg"  width="512" wmode="transparent" allowFullScreen="true" quality="high"  flashvars="file=http://ia311014.us.archive.org/1/items/scipy09_introTutorialDay1_1/scipy09_introTutorialDay1_1_512kb.mp4&image=http://ia311014.us.archive.org/1/items/scipy09_introTutorialDay1_1/scipy09_introTutorialDay1_1.thumbs/scipy09_introTutorialDay1_1_000180.jpg&provider=http"

  • Parameters to flashvars are separated by ampersands like flashvars="file=MOVIE URL HERE&image=IMAGE URL HERE". The provider="http" parameter to flashvars states that we would like to enable skipping within the video stream.

Here It Is: My First Blog Post in the Urdu Language

with 3 comments


“Urdu is its name, and we alone know, Daagh, that it is our tongue whose fame resounds throughout the whole world.” ~ Daagh

(An important note: to see this post rendered correctly, you will need to download this font and install it on your system. It is a font designed specifically for easy reading on computer screens.)

Greetings, friends,

I hope you don’t have too many complaints about not having heard from me for quite a while. The truth is that, as always, studies and other academic commitments have kept me rather busy.

I had always wished that some day I would also write on this blog in Urdu. It is, after all, my mother tongue, and yet somewhere along the way, without my quite knowing when or how, my connection with this beautiful language had begun to fray. Perhaps the blame lies with my scientific world, which these days places all its emphasis on English. And as far as newspapers and news go, I never felt that the Urdu world had anything particularly unique to offer. I now realize how naive that assumption was. Over the past few weeks I have come across several articles that are extremely interesting and that you would be hard-pressed to find in the English-language world. You could say that I am only now discovering the pleasure of knowing this language, and I feel quite grateful for it.

I don’t have any particular topic to write about today. I just want to point out that there are plenty of helpful sites for writing in Urdu on the Internet, whether they relate to Linux, BSD and FOSS, or to Windows. Some of the ones I found best are these:

     

  • If you feel your Urdu vocabulary is weak, this site will help you: http://www.urduenglishdictionary.org
  • If you are on Windows, definitely use the Google Transliteration IME Keyboard. At present it is only available for Windows: http://www.google.com/ime/transliteration
  • Download Urdu fonts and use them in Openoffice, Firefox, etc. Some fonts are built specifically for Windows programs and will not work on Linux, BSD, etc. The best fonts for Windows are available here: http://www.crulp.org . If you are on a Linux flavor like Debian, use apt-get. Third-party fonts, such as those from CRULP, can be installed on your system using this method: http://wiki.archlinux.org/index.php/Fonts . Note that just as different English fonts suit different purposes, Urdu too has different fonts written in different calligraphic styles, such as Nastaliq, Naskh and so on, and where one kind of font is appropriate, another may not be. There are some excellent articles on this here: http://salpat.uchicago.edu/index.php/salpat/article/view/33/29 , http://en.wikipedia.org/wiki/Islamic_calligraphy
  • On Linux, BSD, etc. you will find input facilities such as SCIM and IBus. Through these you can use transliteration keyboards: http://wiki.debian.org/I18n/ibus , http://beeznest.wordpress.com/2005/12/16/howto-install-japanese-input-on-debian-sarge-using-scim/ . To write in Urdu you will have to install the m17n packages. And don’t forget that you will also need to install the Urdu locales on your system, particularly the UTF-8 ones.
  • To install the Urdu dictionary for Firefox, first install the Nightly Tester Tools addon and then install the Urdu Dictionary addon.
  • On Debian and the like, Firefox displays Urdu words correctly only after some additional configuration. By default, Firefox on Debian has the Pango font rendering engine turned off, which is why Urdu words do not appear properly. The method for bringing Pango back is here: http://ubuntu.sabza.org/2006/08/18/firefox-for-linux-urdu-font-rendering
  • With Firefox and Debian I also ran into this problem, for which I have not yet found a solution.
  • The Urdu dictionary for Openoffice can be found here: http://extensions.services.openoffice.org/en/project/dict-ur . After installing it you will need to go to Tools>Options>Language Settings and tick “Enabled for complex text layout”. Urdu is not in the list of default languages, so leave Hindi selected there. What happens is that when you begin typing in Urdu, Openoffice automatically recognizes that the document’s language is Urdu and indicates this in the bottom toolbar. In my experience this does not happen on Debian: you first have to type a few words in Urdu and then set the language via the bottom toolbar. Also, since Hindi is the default CTL language, a Hindi font such as Mangal gets selected automatically when you start typing Urdu. So keep that in mind and, while typing Urdu, don’t forget to change your font to a Naskh, Nastaliq, etc. typeface.

So that’s all for today. I hope to meet you readers again. Until then, farewell!


Copyright Firas MR. All Rights Reserved.


Powered by ScribeFire.

Written by Firas MR

October 10, 2010 at 8:25 am

On Literature Search Tools And Translational Medicine

with 2 comments

Courtesy danmachold@flickr (by-nc-sa license)

Howdy all!

Apologies for the lack of recent blogular activity. As usual, I’ve been swamped with academia.

A couple of interesting pieces on literature search strategies & tools that caught my eye recently, some of which were quite new to me. Do check them out:

  • Matos, S., Arrais, J., Maia-Rodrigues, J., & Oliveira, J. (2010). Concept-based query expansion for retrieving gene related publications from MEDLINE. BMC Bioinformatics, 11(1), 212. doi:10.1186/1471-2105-11-212

[…]

The most popular biomedical information retrieval system, PubMed, gives researchers access to over 17 million citations from a broad collection of scientific journals, indexed by the MEDLINE literature database. PubMed facilitates access to the biomedical literature by combining the Medical Subject Headings (MeSH) based indexing from MEDLINE, with Boolean and vector space models for document retrieval, offering a single interface from which these journals can be searched [5]. However, and despite these strong points, there are some limitations in using PubMed or other similar tools. A first limitation comes from the fact that keyword-based searches usually lead to underspecified queries, which is a main problem in any information retrieval (IR) system [6]. This usually means that users will have to perform various iterations and modifications to their queries in order to satisfy their information needs. This process is well described in [7] in the context of information-seeking behaviour patterns in biomedical information retrieval. Another drawback is that PubMed does not sort the retrieved documents in terms of how relevant they are for the user query. Instead, the documents satisfying the query are retrieved and presented in reverse date order. This approach is suitable for such cases in which the user is familiar with a particular field and wants to find the most recent publications. However, if the user is looking for articles associated with several query terms and possibly describing relations between those terms, the most relevant documents may appear too far down the result list to be easily retrieved by the user.

To address the issues mentioned above, several tools have been developed in the past years that combine information extraction, text mining and natural language processing techniques to help retrieve relevant articles from the biomedical literature [8]. Most of these tools are based on the MEDLINE literature database and take advantage of the domain knowledge available in databases and resources like the Entrez Gene, UniProt, GO or UMLS to process the titles and abstracts of texts and present the extracted information in different forms: relevant sentences describing a biological process or linking two or more biological entities, networks of interrelations, or in terms of co-occurrence statistics between domain terms. One such example is the GoPubMed tool [9], which retrieves MEDLINE abstracts and categorizes them according to the Gene Ontology (GO) and MeSH terms. Another tool, iHOP [10], uses genes and proteins as links between sentences, allowing the navigation through sentences and abstracts. The AliBaba system [11] uses pattern matching and co-occurrence statistics to find associations between biological entities such as genes, proteins or diseases identified in MEDLINE abstracts, and presents the search results in the form of a graph. EBIMed [12] finds protein/gene names, GO annotations, drugs and species in PubMed abstracts showing the results in a table with links to the sentences and abstracts that support the corresponding associations. FACTA [13] retrieves abstracts from PubMed and identifies biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) co-occurring with the terms in the user’s query. The concepts are presented to the user in a tabular format and are ranked based on the co-occurrence statistics or on pointwise mutual information. More recently, there has been some focus on applying more detailed linguistic processing in order to improve information retrieval and extraction. Chilibot [14] retrieves sentences from MEDLINE abstracts relating to a pair (or a list) of proteins, genes, or keywords, and applies shallow parsing to classify these sentences as interactive, non-interactive or simple abstract co-occurrence. The identified relationships between entities or keywords are then displayed as a graph. Another tool, MEDIE [15], uses a deep-parser and a term recognizer to index abstracts based on pre-computed semantic annotations, allowing for real-time retrieval of sentences containing biological concepts that are related to the user query terms.

Despite the availability of several specific tools, such as the ones presented above, we feel that the demand for finding references relevant for a large set of genes is still not fully addressed. This constitutes an important query type, as it is a typical outcome of many experimental techniques. An example is a gene expression study, in which, after measuring the relative mRNA expression levels of thousands of genes, one usually obtains a subset of differentially expressed genes that are then considered for further analysis [16,17]. The ability to rapidly identify the literature describing relations between these differentially expressed genes is crucial for the success of data analysis. In such cases, the problem of obtaining the documents which are more relevant for the user becomes even more critical because of the large number of genes being studied, the high degree of synonymy and term variability, and the ambiguity in gene names.

While it is possible to perform a composite query in PubMed, or use a list of genes as input to some of the IR tools described above, these systems do not offer a retrieval and ranking strategy which ensures that the obtained results are sorted according to the relevance for the entire input list. A tool more oriented to analysing a set of genes is microGENIE [18], which accepts a set of genes as input and combines information from the UniGene and SwissProt databases to create an expanded query string that is submitted to PubMed. A more recently proposed tool, GeneE [19], follows a similar approach. In this tool, gene names in the user input are expanded to include known synonyms, which are obtained from four reference databases and filtered to eliminate ambiguous terms. The expanded query can then be submitted to different search engines, including PubMed. In this paper, we propose QuExT (Query Expansion Tool), a document indexing and retrieval application that obtains, from the MEDLINE database, a ranked list of publications that are most significant to a particular set of genes. Document retrieval and ranking are based on a concept-based methodology that broadens the resulting set of documents to include documents focusing on these gene-related concepts. Each gene in the input list is expanded to its various synonyms and to a network of biologically associated terms, namely proteins, metabolic pathways and diseases. Furthermore, the retrieved documents are ranked according to user-defined weights for each of these concept classes. By simply changing these weights, users can alter the order of the documents, allowing them to obtain for example, documents that are more focused on the metabolic pathways in which the initial genes are involved.

[…]

(Creative Commons Attribution License: http://creativecommons.org/licenses/by/2.0)

  • Kim, J., & Rebholz-Schuhmann, D. (2008). Categorization of services for seeking information in biomedical literature: a typology for improvement of practice. Brief Bioinform, 9(6), 452-465. doi:10.1093/bib/bbn032
  • Weeber, M., Kors, J. A., & Mons, B. (2005). Online tools to support literature-based discovery in the life sciences. Brief Bioinform, 6(3), 277-286. doi:10.1093/bib/6.3.277

I’m sure there are many other nice ones out there. Don’t forget to also check out the NCBI Handbook. Another great resource …

————————————————————————————————————

On a separate note, a couple of NIH-affiliated authors have written some thought-provoking stuff about Translational Medicine:

  • Nussenblatt, R., Marincola, F., & Schechter, A. (2010). Translational Medicine – doing it backwards. Journal of Translational Medicine, 8(1), 12. doi:10.1186/1479-5876-8-12

[…]

The present paradigm of hypothesis-driven research poorly suits the needs of biomedical research unless efforts are spent in identifying clinically relevant hypotheses. The dominant funding system favors hypotheses born from model systems and not humans, bypassing the Baconian principle of relevant observations and experimentation before hypotheses. Here, we argue that that this attitude has born two unfortunate results: lack of sufficient rigor in selecting hypotheses relevant to human disease and limitations of most clinical studies to certain outcome parameters rather than expanding knowledge of human pathophysiology; an illogical approach to translational medicine.

[…]

A recent candidate for a post-doctoral fellowship position came to the laboratory for an interview and spoke of the wish to leave in vitro work and enter into meaningful in vivo work. He spoke of an in vitro observation with mouse cells and said that it could be readily applied to treating human disease. Indeed his present mentor had told him that was the rationale for doing the studies. When asked if he knew whether the mechanisms he outlined in the mouse existed in humans, he said that he was unaware of such information and upon reflection wasn’t sure in any event how his approach could be used with patients. This is a scenario that is repeated again and again in the halls of great institutions dedicated to medical research. Any self respecting investigator (and those they mentor) knows that one of the most important new key words today is “translational”. However, in reality this clarion call for medical research, often termed “Bench to Bedside” is far more often ignored than followed. Indeed the paucity of real translational work can make one argue that we are not meeting our collective responsibility as stewards of advancing the health of the public. We see this failure in all areas of biomedical research, but as a community we do not wish to acknowledge it, perhaps in part because the system, as it is, supports superb science. Looking this from another perspective, Young et al [2] suggest that the peer-review of journal articles is one subtle way this concept is perpetuated. Their article suggests that the incentive structure built around impact and citations favors reiteration of popular work, i.e., more and more detailed mouse experiments, and that it can be difficult and dangerous for a career to move into a new arena, especially when human study is expensive of time and money.

[…]

(Creative Commons Attribution License: http://creativecommons.org/licenses/by/2.0)

Well, I guess that does it for now. Hope those articles pique your interest as much as they did mine. Until we meet again, adios 🙂 !

Copyright © Firas MR. All rights reserved.

Written by Firas MR

June 29, 2010 at 4:33 pm

A Brief Tour Of The Field Of Bioinformatics

with 10 comments

This is an example of a full genome sequencing machine. It is the ABI PRISM 3100 Genetic Analyzer. Sequencers like it completely automate the process of sequencing the entire genome. Yes, even yours! [Courtesy: Wikipedia]

Some Background Before The Tour

Ahoy readers! I’ve had the opportunity to read a number of books recently. Among them, is “Developing Bioinformatics Computer Skills” by Cynthia Gibas and Per Jambeck. I dived into the book straight away, having no basic knowledge at all of what comprises the field of bioinformatics. Actually, it was quite like the first time I started medical college. On our first day, we were handed a tiny handbook on human anatomy, called “Handbook Of General Anatomy” by B D Chaurasia. Until actually opening that book, absolutely no one in the class had any idea of what Medicine truly was. All we had with us were impressions of charismatic white-coats who could, as if by magic, diagnose all kinds of weird things by the mere touch of a hand. Not to mention, legendary tales from the likes of Discovery Channel. Oh yes, our expectations were of epic proportions 😛 . As we flipped through the pages of that little book, we were flabbergasted by the sheer volume of information that one had to rote. It had soon become clear to us, what medicine was all about – Physiology is the study of normal body functions akin to physics, Anatomy is the study of the structural organization of the human body a la geography … – and this set us on the path to learning to endure an avalanche of learn-by-rote information for the rest of our lives.

Bioinformatics is shrouded in mystery for most medics, because so many of these ideas are completely new. The technologies are new. The data available are new. Before the human genome was sequenced, there was virtually no point in using computers to understand genes and alleles. Most of what needed to be sorted out could be done by hand. But now that we have huge volumes of data, and data that are growing at an exponential rate at that, it makes sense to use computers to connect the dots and frame hypotheses. I guess bioinformatics is a conundrum to most other people too – whether you are coming from a math background, a computer science background or a biology background – we all have something missing from our repertoire of knowledge and skills.

What is the rationale behind using computation to understand genes? In yore times, all we had were a couple of known genes. We had the tools of Mendelian genetics and linkage analysis to solve most of the genetic mysteries. The human genome project changed that. We are suddenly flooded with sequences that we don’t know anything about, and faced with the gigantic hurdle of finding relationships between them. To give you a sense of the magnitude of numbers we’re talking about here: we could simplify DNA’s 3-D structure and represent the entire genetic code contained in a single polynucleotide strand of the human genome as a string of letters A, C, G or T, each representing a given nucleotide base in a long sequence (like so …..ATCGTTACGTAAAA…..). Since it has been found that this strand is approximately 3 billion bases long, its entire length comes to 3 billion bytes. That’s because each letter A, T, C or G could be thought of as being represented by a single ASCII character. And we all know that an ASCII character is equal to 1 byte of data. Since we are talking about two complementary strands within a molecule of DNA, the amount of information within the genome is 6 billion bytes§. But human cells are diploid! So the amount of DNA information in the nucleus of a single human cell is 12 billion bytes! That’s about 12 gigabytes of data neatly packed into the DNA sequence of every cell – we haven’t even begun to talk about the 3-D structure of DNA or the sequence and 3-D structure of RNA and proteins yet!

§ Special thanks to Martijn for bringing this up in the comments: If you really think about it for a moment, bioinformaticians don’t need to store the sequences of both the DNA strands of a genome in a computer, because the sequence of one strand can be derived from the other – they are complementary by definition. If you store 3 billion bytes from one strand, you can easily derive the complementary 3 billion bytes of information on the other strand, provided that the two strands are truly complementary and there aren’t any blips of mismatch mutations between them. Using this concept, you can get away with storing 3 billion bytes and not 6 billion bytes to capture the information in the human genome.

Special thanks also to Dr. Atul Butte ¥ of Stanford University who dropped by to say that a programmer really doesn’t need a full byte to store a nucleic acid base. A base can be represented by 2 bits (eg. 00 for A, 11 for C, 01 for G and 10 for T). Since 1 byte contains 8 bits, a byte can actually hold 4 bases. Without compression. So 3 billion bases can be held within 750,000,000 bytes. That’s 715 megabytes (1 megabyte = 1048576 bytes), which can easily fit on to an extended-length CD-ROM (not even a DVD). So the entire genetic code from a single polynucleotide strand of the human genome can easily fit on to a single CD-ROM. Since human cells are diploid, with two CD-ROMs – one CD-ROM for each set of chromosomes – you can capture this information for both sets of chromosomes. [go back]
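
If you’d like to sanity-check that arithmetic yourself, the Unix calculator bc (installable on Cygwin, standard on most Linux/BSD systems) will do it in one line each; this is just a quick back-of-the-envelope check, nothing more:

    echo "3000000000 / 4" | bc                  # 4 bases per byte -> 750000000 bytes
    echo "scale=2; 750000000 / 1048576" | bc    # -> roughly 715.25 megabytes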

To compound the issue, we don’t have a taxonomy system in place to describe the sequences we have. When Linnaeus invented his taxonomy system for living things, he used basic morphologic criteria to classify organisms. If it walked like a duck and talked like a duck, it was a duck! But how do you apply this reasoning to genes? You might think, why not classify them by organism? But there’s a more subtle issue here too. Some of these genetic sequences can be classified in to various categories – is this gene a promoter, exon, intron or could it be a sequence that plays a role in growth, death, inflammatory response, and so on. Not only that, many sequences could be found in more than one organism. So how do you solve the problem of classification? Man’s answer to this problem is simple – you don’t!

Here’s how we can get away with that. Simply create a relational database using MySQL, PostgreSQL or what have you and create appropriate links between sequence entries, their functions, etc. Run queries to find relationships and voila, there you have it! This was our first step in developing bioinformatics as a field. Building databases. You can do this with a genetic sequence (a string of letters A for ‘adenine‘, C for ‘cytosine‘, G for ‘guanine‘ and T for ‘thymine‘ …represented like so ATGGCTCCTATGCGGTTAAAATTT….) or with an RNA sequence (a string of letters A for ‘adenine’, C for ‘cytosine, G for ‘guanine’ and U for ‘Uracil‘ like so …AUGGCACCCU…) or even a protein sequence (a string of 20 letters each letter representing one amino acid). By breaking down and simplifying a 3-D structure this way, you can suddenly enhance data storage, retrieval and more importantly, analysis between:

  1. Two or more sequences of DNA
  2. Two or more sequences of RNA
  3. Two or more sequences of Protein

You can even find relationships between:

  1. A DNA sequence and an RNA sequence
  2. An RNA sequence and a Protein sequence
  3. A DNA sequence and a Protein sequence

If you can represent the spatial coordinates of the molecules within a protein 3-D structure as cartesian coordinates (x, y, z), you can even analyze structure not only within a given protein, but also try to predict the best possible 3-D structure for a protein that is hypothetically synthesized by a given DNA or RNA sequence. In fact that is the Holy Grail of bioinformatics today. How to predict protein structure from a DNA sequence? And consequentially, how to manipulate protein structure to suit your needs.

The Tour Begins

Let’s take a tour of what bioinformatics holds for us.

The Ability To Build Relational Databases

We have already discussed this above.

Local Sequence Comparison

An example of sequence alignment. Alignment of 27 avian influenza hemagglutinin protein sequences colored by residue conservation (top) and residue properties (bottom) [Courtesy: Wikipedia]

Before we delve into the idea of sequence comparisons further, let’s take an example from the bioinformatics book I mentioned to understand how sequence comparisons help in the real world. It speaks of a gene-knockout experiment that targets a specific sequence in the fruit fly’s (Drosophila melanogaster) genome. Knocking this sequence out results in the flies’ progeny being born without eyes. By knocking this gene – called eyeless – out, you learn that it somehow plays an important role in eye development in the fruit fly. There’s a similar (but not quite the same) condition in humans called aniridia, in which eyes develop in the usual manner, except for the lack of an iris. Researchers were able to identify the particular gene that causes aniridia and called it aniridia. By inserting the aniridia gene into an eyeless-knockout Drosophila’s genome, they observed that suddenly its offspring bore eyes! Remarkable, isn’t it? Somehow there’s a connection between two genes separated not only by different species, but also by genera and phyla. To discern how each of these genes functions, you proceed by asking whether the two sequences could be the same. How similar might they be, exactly? To answer this question you could do an alignment of the two sequences. This is the most basic kind of stuff we do in sequence analysis.

Instead of doing it by hand (which could be possible if the sequences being compared were small), you could find the best alignment between these two long sequences using a program such as BLAST. There are a number of ways BLAST can work. Because the two sequences may have only certain regions that fit nicely, with other regions that don’t – called gaps – you can have multiple ways of aligning them side by side. But what you are interested in, is to find the best fit that maximizes how much they overlap with each other (and minimize gaps). Here’s where computer science comes in to play. In order to maximize overlap, you use the concept of ‘dynamic programming‘. It is helpful to understand dynamic programming as an algorithm rather than a program per se (it’s not like you’ll be sitting in front of a computer and programming code if you want to compare eyeless and aniridia; the BLAST program will do the dirty work for you. It uses dynamic programming code that’s built in to it). Amazingly enough, dynamic programming is not something as hi-fi as you might think. It is apparently the same strategy used in many computer spell-checkers! Little did the bioinformaticians who first developed dynamic programming techniques in genetics know, that the concept of dynamic programming was discovered far earlier than them. There are apparently many such cases in bioinformatics where scientists keep reinventing the wheel, purely because it is such an interdisciplinary field! One of the most common algorithms that is a subset of dynamic programming and that is used for aligning specific sequences within a genome is called the Smith-Waterman algorithm. Like dynamic programming, another useful algorithm in bioinformatics is what is called a greedy algorithm. In a greedy algorithm, you are interested in maximizing overlap in each baby-step as you construct the alignment procedure, without consideration to the final overlap. In other words, it doesn’t matter to you how the sequences overlap in the end as long as each step of the way during the alignment process, you maximize overlap. Other concepts in alignment include, using a (substitution) matrix of possible scores when two letters – each in a sequence – overlap and trying to maximize scores using dynamic programming. Common matrices for this purpose are BLOSUM-62, BLOSUM-45 and PAM (Point Accepted Mutation).
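
For the mathematically curious, the core of the Smith-Waterman algorithm is a short recurrence. Assuming a substitution score s(a_i, b_j) between letters a_i and b_j (taken from a matrix such as BLOSUM-62) and a simple linear gap penalty g, each cell of the dynamic programming table H is filled in as:

    H_{i,j} = \max\left\{\; 0,\;\; H_{i-1,\,j-1} + s(a_i, b_j),\;\; H_{i-1,\,j} - g,\;\; H_{i,\,j-1} - g \;\right\}

The zero is what makes the alignment local: a stretch of bad matches never drags the score below zero, so the best-scoring patch of similarity can start and end anywhere within the two sequences.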

So now that we know the basic idea behind sequence alignment, here’s what you can actually do in sequence analysis:

  1. Using alignment, find a sequence from a database (e.g. GenBank from the NCBI) that maximizes overlap between it and a sequence that isn’t yet in the database. This way, if you discover some new sequence, you can find relationships between it and known sequences. If the sequence in the database is associated with a given protein, you might be able to look for that protein in your specimen. This is called pairwise alignment.
  2. Just as you can compare two sequences and find out if there is a statistically significant association between them or not, you can also compare multiple sequences at once. This is called multiple sequence alignment.
  3. If certain regions of two sequences are the same, it can be inferred that they are conserved across species or organisms despite environmental stresses and evolution. A sequence encoding development of the eye is very likely to remain unchanged across multiple species for which sight is an essential function to survive. Here comes another interesting concept – phylogenetic relationships between organisms at a genetic level. Using alignment it is possible to develop phylogenetic trees and phylogenetic networks that link two or more gene sequences and as a consequence find related proteins.
  4. Similar to finding evolutionary homology between sequences as above, one could also look for homology between protein structures – motifs – and then conclude that the regions of DNA encoding these proteins have a certain degree of homology.
  5. There are tools in sequence analysis that look at features characteristic of known functioning regions of DNA and check whether the same features exist in a random sequence. This process is called gene finding. You’re trying to discover functionality in hitherto unknown stretches of DNA. This is important, as the vast majority of genetic code is, as far as we know, non-functional random junk. Could there be some region in this vast ocean of randomness that might, just might, have an interesting function? Gene finding uses software that looks for tRNA-encoding regions, promoter sites, open reading frames, exon-intron splicing regions, … – in short, the whole gamut of what we know is characteristic of functional code – in random junk (a toy sketch of the open-reading-frame part of this idea follows this list). Once a statistically significant result is obtained, you’re ready to test it in a lab!
  6. A special situation in sequence alignment is whole genome alignment – finding the best end-to-end fit (a global alignment) between the entire genomes of different organisms! Despite how arduous this sounds, the underlying ideas are pretty similar to local sequence alignment. One of the most common dynamic programming algorithms used for global alignment is the Needleman–Wunsch algorithm.
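
Here is that toy open-reading-frame (ORF) scan, sketched in Python. It only looks for ATG…stop stretches on the forward strand of a made-up sequence; real gene finders combine many more signals (promoters, splice sites, codon bias and so on), so treat this purely as an illustration of the idea.

```python
# A toy ORF scan on the forward strand of an invented sequence.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=10):
    """Yield (start, end) coordinates of ATG..stop ORFs in the three forward frames."""
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield (start, i + 3)
                start = None

toy = "CCATGAAATTTGGGCCCAAATTTGGGCCCTAACC"   # invented sequence
print(list(find_orfs(toy, min_codons=3)))     # prints [(2, 32)]
```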

Many of the things discussed for sequence analysis of DNA, have equal counterparts for RNA and proteins.

Protein Structure Property Analysis

Say that you have an amino acid sequence for a protein, and there’s nothing in the databases that matches it. In order to build a 3-D model of this protein, you’ll need to predict the best possible shape given the constraints of bond angles, electrostatic forces between constituent atoms, etc. There’s a specific technique that warrants mentioning here – the Ramachandran plot – which takes steric hindrance into account and plots the backbone dihedral (phi/psi) angles of an amino acid sequence, showing which combinations are sterically allowed. With a 3-D model, you could try to predict this protein’s chemical properties (such as pKa, etc.). You could also look for active sites on this protein – the crucial regions that bind to substrates – based on known structures of active sites from other proteins… and so on.
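
As a flavour of what the tooling looks like, here is a minimal sketch that pulls the phi/psi backbone angles (the raw data behind a Ramachandran plot) out of a structure file using Biopython, assuming Biopython is installed; “protein.pdb” is a placeholder file name.

```python
# Minimal sketch: extract backbone phi/psi angles with Biopython.
# "protein.pdb" is a placeholder file name.
import math
from Bio.PDB import PDBParser, PPBuilder

parser = PDBParser(QUIET=True)
structure = parser.get_structure("prot", "protein.pdb")

for peptide in PPBuilder().build_peptides(structure):
    for phi, psi in peptide.get_phi_psi_list():
        if phi is not None and psi is not None:   # chain termini have no angle
            print(f"phi = {math.degrees(phi):7.1f}  psi = {math.degrees(psi):7.1f}")
```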

This figure depicts an unrooted phylogenetic tree for myosin, a superfamily of proteins. [Courtesy: Wikipedia]

Protein Structure Alignment

This is when you try to find the best fit between two protein structures. The idea is very similar to sequence alignment, only this time the algorithms are a bit different. In most cases they are computationally intensive and rely on heuristics and iterative refinement rather than exact solutions. You could build phylogenetic trees based on structural evolutionary homology too.
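
To give a feel for one small piece of this, here is a minimal rigid-body superposition sketch using Biopython’s Superimposer (assuming Biopython is installed). It naively pairs up CA atoms in order – real structure-alignment tools also have to decide which residues correspond – and “model1.pdb”/“model2.pdb” are placeholder file names.

```python
# Minimal rigid-body superposition sketch with Biopython's Superimposer.
from Bio.PDB import PDBParser, Superimposer

parser = PDBParser(QUIET=True)
s1 = parser.get_structure("a", "model1.pdb")
s2 = parser.get_structure("b", "model2.pdb")

ca1 = [res["CA"] for res in s1.get_residues() if "CA" in res]
ca2 = [res["CA"] for res in s2.get_residues() if "CA" in res]
n = min(len(ca1), len(ca2))          # naive 1:1 pairing of the first n residues

sup = Superimposer()
sup.set_atoms(ca1[:n], ca2[:n])      # fixed atoms, moving atoms
print("RMSD after superposition:", round(sup.rms, 2), "angstroms")
```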

Protein Fingerprint Analysis

This is basically using computational tools to identify relationships between two or more proteins by analyzing their break-down products – their peptide fingerprints. Using protein fragments, it is possible to compare entire cocktails of different proteins. How does the protein mixture from a human retinal cell compare to the protein mixture from the retinal cell of a mouse? This kind of work is called proteomics, because you’re comparing the entire protein complement of one organism to that of another. You could also analyze protein fragments from different cells within the same organism to see how they might have evolved or developed.
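
Here is a toy sketch of the in-silico side of this: digest two made-up sequences with a simplified trypsin rule (cut after K or R, but not before P) and compare the resulting peptide sets. Real workflows compare measured peptide masses from mass spectrometry rather than raw strings.

```python
# Toy peptide-fingerprint comparison using a simplified trypsin digest rule.

def tryptic_peptides(seq):
    peptides, current = [], ""
    for i, aa in enumerate(seq):
        current += aa
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":   # cut after K/R unless followed by P
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return set(peptides)

a = tryptic_peptides("MKWVTFISLLFLFSSAYSRGVFRR")   # invented sequence A
b = tryptic_peptides("MKWVTFISLLLLFSSAYSRGVFRR")   # invented sequence B
shared = a & b
print(f"{len(shared)} shared peptides out of {len(a | b)} distinct peptides")
```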

DNA Micro-array Analysis

A DNA microarray is a slide with hundreds (or thousands) of tiny dots on it, each carrying a known DNA probe. You extract RNA from a population of cells, tag it with a fluorescent marker and wash it over the slide. A dot glows under UV (or another form of) light if the cells are expressing the matching gene – that is, transcribing it into RNA, which in turn may be translated into protein. By hybridizing material from the same population of cells to the array and measuring the amount of light coming from each dot, you can develop a gene expression profile for these cells. You could then study the expression profiles of these cells under different environmental conditions to see how they behave and change.

You could also hybridize material from different cell populations to separate arrays and study how their expression profiles differ. Example: normal gastric epithelium vs cancerous gastric epithelium.

Of course, you could try looking at all these light-emitting dots with your own eyes and counting manually. If you want to take a shot at it, you might even be able to tell the difference between the brightness levels of different dots! But why not use computers to do the job for you? There are software tools out there that can measure these expression profiles quantitatively for you.
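
Here is a minimal sketch of what that quantitative comparison might look like, with invented spot intensities for two conditions; real analyses also normalise the arrays and test for statistical significance.

```python
# Compare invented spot intensities between two conditions via log2 fold change.
import numpy as np

genes = ["geneA", "geneB", "geneC", "geneD"]         # hypothetical spots
normal = np.array([120.0, 450.0, 80.0, 300.0])       # intensities, condition 1
tumour = np.array([480.0, 440.0, 20.0, 310.0])       # intensities, condition 2

log2_fold_change = np.log2(tumour / normal)
for gene, lfc in zip(genes, log2_fold_change):
    call = "up" if lfc > 1 else "down" if lfc < -1 else "unchanged"
    print(f"{gene}: log2 fold change = {lfc:+.2f} ({call})")
```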

Primer Design

There are many experiments, and indeed diagnostic tests, that use artificially synthesized DNA sequences as anchors that flank a specific region of interest in the DNA of a cell, so that this region can be amplified. By amplify, we mean make multiple copies. These flanking sequences are called primers. Applications include, for example, amplifying the DNA material of HIV to better detect the presence or absence of the virus in a patient’s blood. This kind of test or experiment is called the polymerase chain reaction (PCR). There are a number of other applications of primers, such as gene cloning, genetic hybridization, etc. Primers ought to be constructed in specific ways that prevent them from forming loops or binding to non-specific sites on cell DNA. How do you find the best candidate for a primer? Of course, computation!
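
Here is a minimal primer sanity-check sketch: GC content plus the classic Wallace rule of thumb for melting temperature (roughly 2 °C per A/T and 4 °C per G/C, valid only for short oligos). The candidate primer is made up, and real tools such as Primer3 also check hairpins, self-dimers and binding specificity.

```python
# Minimal primer checks: GC content and Wallace-rule melting temperature.

def gc_content(primer):
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc            # rule of thumb for short oligos only

candidate = "AGCGTTACGCTAGGCATTAC"    # hypothetical 20-mer
print(f"GC content: {gc_content(candidate):.1f}%   Tm (Wallace rule): {wallace_tm(candidate)} C")
```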

Metabolomics

A fancy word for modeling metabolic pathways and their relationships using computational analyses. How does the glycolytic pathway relate to some random metabolic pathway found in the neurons of the brain? Computational tools help identify potential relationships between all of these different pathways and help you map them. In fact, there are metabolic pathway maps out there on the web that continually get updated to reflect this fascinating area of ongoing research.
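
One simple way to get a feel for this is to treat metabolites and pathways as nodes in a graph and ask how they connect. The sketch below uses the networkx library (assuming it is installed); the nodes and edges are simplified placeholders, not a faithful pathway map.

```python
# Toy pathway graph: ask how two pathways connect via shared metabolites.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("glucose", "glycolysis"),
    ("glycolysis", "pyruvate"),
    ("pyruvate", "TCA cycle"),
    ("pyruvate", "lactate"),
    ("TCA cycle", "glutamate"),
    ("glutamate", "neurotransmitter synthesis"),
])

path = nx.shortest_path(G, "glycolysis", "neurotransmitter synthesis")
print(" -> ".join(path))   # glycolysis -> pyruvate -> TCA cycle -> glutamate -> ...
```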

I guess that covers a whole lot of what bioinformatics is all about. When it comes to definitions, some people say that bioinformatics is the application part whereas computational biology is the part that mainly deals with the development of algorithms.

Neologisms Galore!

As you can see, some fancy new words have come into existence as a result of all this frenzied activity:

  • Genomics: Strictly speaking, the study of entire genomes of organisms/cells. In bioinformatics, this term is applied to any studies on DNA.
  • Transcriptomics: Strictly speaking, the study of entire transcriptomes (the RNA complement of DNA) of organisms/cells. In bioinformatics, this term is applied to any studies on RNA.
  • Proteomics: Strictly speaking, the study of the entire set of proteins made by organisms/cells. In bioinformatics, this term is applied to any studies on proteins. Structural biology is a special branch of proteomics that explores the 3-D structure of proteins.
  • Metabolomics: The study of entire metabolic pathways in organisms/cells. In bioinformatics, this term is applied to any studies on metabolic pathways and their inter-relationships.

Real World Impact

So what can all of this theoretical ‘data-dredging’ give us anyway? Short answer – hypotheses. Once you have a theoretical hypothesis for something, you can test it in the lab. Without forming intelligent hypotheses, humanity might very well take centuries to experiment with every possible permutation or combination of the data that has been amassed so far – data that, mind you, continues to grow as we speak!

Thanks to bioinformatics, we are now discovering genetic relationships between diseases that were hitherto considered completely unrelated – such as diabetes mellitus and rheumatoid arthritis! Scientists like Dr. Atul Butte and his team are trying to reclassify all known diseases using the data that we’ve been able to gather from Genomics. Soon, the days of the traditional International Classification of Diseases (ICD) might be gone. We might some day have a genetic ICD!

Sequencing of individual human genomes (technology for this already exists and many commercial entities out there will happily sequence your genome for a fee) could help in detecting or predicting disease susceptibility.

Proteins could be substituted between organisms (à la pig and human insulin) and, better yet, completely engineered to suit an objective – such as drug delivery or effectiveness. Knowing a DNA sequence would give you enough information to predict protein structure and function, giving you yet another tool in diagnosis.

And the list of possibilities is endless!

Bioinformatics is thus man’s attempt at making biology and medicine a predictive science 🙂 .

Further Reading

I haven’t had the chance to read any other books on bioinformatics, what with exams just a couple of months away. Having read “Developing Bioinformatics Computer Skills” and found it a little too dense, especially in the last couple of chapters, I would only recommend it as an introductory text to someone who already has some knowledge of computer algorithms. Because different algorithms have different caveats and statistical gotchas, it makes sense to have a sound understanding of what each of them does. Although the authors have done a pretty decent job of describing the essentials, the explanations of the algorithms and how they really function are a bit complicated for the average biologist. It’s difficult for me to recommend a book I haven’t read, but here are two I consider worth exploring in the future:

Understanding Bioinformatics
Understanding Bioinformatics by Marketa Zvelebil and Jeremy Baum

Introduction to Bioinformatics: A Theoretical and Practical Approach
Introduction to Bioinformatics: A Theoretical And Practical Approach by Stephen Krawetz and David Womble

As books to refresh my knowledge of molecular biology and genetics I’m considering the following:

Molecular Biology of the Cell
Molecular Biology Of The Cell by Bruce Alberts et al


Molecular Biology Of The Gene by none other than James D Watson himself et al (Of ‘Watson & Crick‘ model of DNA fame)

Let me know if you have any other suggested readings in the comments1.

There are also a number of excellent OpenCourseWare lectures on bioinformatics out on the web (for example, at AcademicEarth.org). For beginners though, I suggest Dr. Daniel Lopresti’s (Lehigh University) fantastic high-level introduction to the field here. Also don’t forget to check out “A Short Course On Synthetic Genomics” by George Church and Craig Venter on Edge.org for a fascinating overview of what might lie ahead in the future! In the race to sequence the human genome, Craig Venter headed the main private company that posed competition to the NIH’s project. His group of researchers ultimately developed a much faster approach than had previously been imagined – whole-genome shotgun sequencing.

Hope you’ve enjoyed this high level tour. Do send in your thoughts, suggestions and corrections!

UPDATE 1: Check out Dr. Eric Lander‘s (one of the stalwarts behind the Human Genome Project) excellent lecture at The Royal Society from 2005, called Beyond the Human Genome Project – Medicine in the 21st Century, which gives you the big picture on this topic.

UPDATE 2: Also check out NEJM’s special review on Genomics called Genomics — An Updated Primer.

Copyright © Firas MR. All rights reserved.

Your feedback counts:

1. Dr. Atul Butte suggests checking out some of the excellent material at NCBI’s Bookshelf. [go back]

Readability grades for this post:

Flesch reading ease score: 57.4
Automated readability index: 10.8
Flesch-Kincaid grade level: 9.7
Coleman-Liau index: 11.5
Gunning fog index: 13.4
SMOG index: 12.2

Powered by ScribeFire.

Evolutionary Computing

with 5 comments

Howdy people! I apologize for the lack of recent activity on my blog. I’ve been swamped with a heavy academic workload lately and am finding it hard to devote time to writing. Let’s talk about some cross-disciplinary fusion stuff today.

This thought just occurred to me: what if computers and software could evolve on their own? If I hypothetically had an operating system that could continually introduce random optimizations, there could be occasions when some random bit of code proves a better fit for my hardware and takes over. There’s an interesting page on Wikipedia here. Take a look and send in your thoughts, and hopefully we can get an interesting conversation started! The idea is radical, no doubt (a toy sketch of the underlying mechanism follows the questions below). A couple of starter questions:

  1. How do you build such a thing? Feasibility. How far along the line do you think such technology would come about?
  2. What is the current status of artificial intelligence in desktop computers?
  3. What benefits could you think of?
  4. Any potential side-effects of the phenomenon?
  5. Lastly, is this likely to affect how man and machine interact with each other and if so how?
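
Here is that toy sketch: a minimal genetic algorithm in Python, evolving a population of random bit-strings toward an arbitrary target through selection, crossover and mutation. It is only meant to illustrate the bare mechanism of evolutionary computing; the target, population size and rates are invented.

```python
# Toy genetic algorithm: evolve bit-strings toward an arbitrary target.
import random

TARGET = [1] * 20                                  # the "ideal" configuration

def fitness(genome):
    return sum(g == t for g, t in zip(genome, TARGET))

def mutate(genome, rate=0.05):
    return [1 - g if random.random() < rate else g for g in genome]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

population = [[random.randint(0, 1) for _ in TARGET] for _ in range(30)]
for generation in range(100):
    population.sort(key=fitness, reverse=True)     # fittest first
    if fitness(population[0]) == len(TARGET):
        break
    parents = population[:10]                      # keep the fittest ten
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(len(population) - len(parents))]
    population = parents + children

print(f"Best genome after {generation + 1} generations:", population[0])
```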

So that’s it for today folks. See ya!

EDIT: Due to sloppy editing, the comments were turned off for a couple of hours. Everything’s back to normal, people, so I’m waiting to hear your comments!

Copyright © Firas MR. All rights reserved.

Written by Firas MR

July 13, 2008 at 3:02 pm

Posted in Technology

Tagged with

Tech bytes: Konqueror and Java – Opera on Kubuntu 8.04

with one comment

Source, Author and License

Today’s tech bytes:

Some very nice people over at Kubuntu’s tech-support IRC channel brought to my attention the fact that Kubuntu 8.04 doesn’t have an LTS version. Apparently, neither the KDE 3.5.9 nor the KDE4 version has been given that status, as KDE development has been in flux lately. So people, all those “Powered by Kubuntu 8.04 LTS” post-scripts in my previous posts stand corrected as “…Kubuntu 8.04 (KDE 3.5.9)”.

Ever noticed that on a fresh install of Kubuntu 8.04, Konqueror’s Java behavior is a tad odd? No matter what you do, when you enable Tools>HTML Settings>Java, the setting never sticks. It stays enabled on the website you’re currently visiting, but that’s it. As soon as you go to some other website, the Java setting resets back to disabled. And when you restart Konqueror, it’s the same deal all over again.

One nice person over at Kubuntu’s IRC channel was kind enough to share his solution:

  1. Go to Settings>Configure Konqueror>Java & Javascript>Java Runtime Settings
  2. Uncheck/disable the option ‘Use Security Manager’ and click ‘Apply’>’OK’
  3. Enable Java under Tools>HTML Settings>Java
  4. Restart Konqueror. Yay! It sticks!
  5. Now go back to Settings>Configure Konqueror>Java & Javascript>Java Runtime Settings, re-check/enable ‘Use Security Manager’ and click ‘Apply’>’OK’

It’s a little weird, but doing so doesn’t bring the funny Java behavior back, and having any sort of security on a web browser is a good thing 🙂 .

Tried the latest Opera 9.5 Beta 2/weekly snapshot on Kubuntu/Ubuntu? If you live outside of the US, there’s a good chance that your system locale is set to something other than English (US) by default. It so happens that this causes Opera 9.5b2 to crash with a segmentation fault. In order to enjoy Opera 9.5b2, make sure you have Sun Java set as the default Java version (use this howto) and set your locale to English US (en_US). On Kubuntu 8.04 (KDE 3.5.9), do the following as discussed here:

  1. Go to System Settings>Regional & Language>Country/Region & Language
  2. Click on the ‘Locale’ tab
  3. Click on ‘Select System Language’ and choose ‘English US’
  4. Click ‘Apply’
  5. Restart KDE (log out and then log in) for settings to take effect

I have found Flash support to be a little flaky, at least with Opera 9.5b2. Opera, for me, often suffers from the grey-box phenomenon: one moment a Flash video works perfectly, and at other times I get grey boxes with audio but no video. This happens particularly when I have two or more tabs with Flash video open and keep switching between them.

Quick user tip: To set your middle-click options, press the Shift key and then middle-click.

Is it just me or does Firefox 3 RC1 seem faster on Windows XP than on Ubuntu/Linux? For me, FF3RC1 on Kubuntu 8.04 still seems to take a lot more memory than on Windows. I guess their Linux development is slow or something.


Google announced their Google Health service recently. Privacy concerns abound.

That’s it for today folks. See ya!

Copyright © 2006 – 2008 Firas MR. All rights reserved.

Written by Firas MR

May 25, 2008 at 1:21 pm