A Brief Tour Of The Field Of Bioinformatics
This is an example of a full genome sequencing machine. It is the ABI PRISM 3100 Genetic Analyzer. Sequencers like it completely automate the process of sequencing the entire genome. Yes, even yours! [Courtesy: Wikipedia]
Some Background Before The Tour
Ahoy readers! I’ve had the opportunity to read a number of books recently. Among them is “Developing Bioinformatics Computer Skills” by Cynthia Gibas and Per Jambeck. I dived into the book straight away, having no basic knowledge at all of what comprises the field of bioinformatics. Actually, it was quite like the first time I started medical college. On our first day, we were handed a tiny handbook on human anatomy, called “Handbook Of General Anatomy” by B D Chaurasia. Until actually opening that book, absolutely no one in the class had any idea of what Medicine truly was. All we had with us were impressions of charismatic white-coats who could, as if by magic, diagnose all kinds of weird things by the mere touch of a hand. Not to mention legendary tales from the likes of Discovery Channel. Oh yes, our expectations were of epic proportions 😛 . As we flipped through the pages of that little book, we were flabbergasted by the sheer volume of information that one had to learn by rote. It soon became clear to us what medicine was all about – Physiology is the study of normal body functions akin to physics, Anatomy is the study of the structural organization of the human body a la geography … – and this set us on the path to learning to endure an avalanche of learn-by-rote information for the rest of our lives.
Bioinformatics is shrouded in mystery for most medics, because so many of these ideas are completely new. The technologies are new. The data available are new. Before the human genome was sequenced, there was virtually no point in using computers to understand genes and alleles. Most of what needed to be sorted out could be done by hand. But now that we have huge volumes of data, and data that are growing at an exponential rate at that, it makes sense to use computers to connect the dots and frame hypotheses. I guess bioinformatics is a conundrum to most other people too. Whether you are coming from a math background, a computer science background or a biology background, we all have something missing from our repertoire of knowledge and skills.
What is the rationale behind using computation to understand genes? In days of yore, all we had were a couple of known genes. We had the tools of Mendelian genetics and linkage analysis to solve most of the genetic mysteries. The human genome project changed that. We are suddenly faced not only with a flood of sequences that we don’t know anything about, but also with the gigantic hurdle of finding relationships between them. To give you a sense of the magnitude of numbers we’re talking about here: we could simplify DNA’s 3-D structure and represent the entire genetic code contained in a single polynucleotide strand of the human genome as a string of letters A, C, G or T, each representing a given nucleotide (base) in a long sequence (like so …..ATCGTTACGTAAAA…..). Since it has been found that this strand is approximately 3 billion bases long, storing it takes about 3 billion bytes. That’s because each letter A, T, C or G can be represented by a single ASCII character, and an ASCII character is conventionally stored in 1 byte of data. Since we are talking about two complementary strands within a molecule of DNA, the amount of information within the genome is 6 billion bytes§. But human cells are diploid! So the amount of DNA information in the nucleus of a single human cell is 12 billion bytes! That’s roughly 12 gigabytes of data neatly packed in to the DNA sequence of every cell – we haven’t even begun to talk about the 3-D structure of DNA or the sequence and 3-D structure of RNA and proteins yet!
§ Special thanks to Martijn for bringing this up in the comments: If you really think about it for a moment, bioinformaticians don’t need to store the sequences of both the DNA strands of a genome in a computer, because the sequence of one strand can be derived from the other – they are complementary by definition. If you store 3 billion bytes from one strand, you can easily derive the complementary 3 billion bytes of information on the other strand, provided that the two strands are truly complementary and there aren’t any blips of mismatch mutations between them. Using this concept, you can get away with storing 3 billion bytes and not 6 billion bytes to capture the information in the human genome.
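To make Martijn's point concrete, here is a minimal Python sketch of deriving one strand from the other. The function name is my own, and it assumes perfect base pairing (A with T, C with G). Because the two strands run in opposite directions, the complement is also reversed so that it reads in the conventional direction.

```python
# Map each base to its pairing partner: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own conventional direction."""
    return strand.translate(COMPLEMENT)[::-1]

example = "ATCGTTACGTAAAA"
other_strand = reverse_complement(example)
```

Applying the function twice gets you back to the strand you started with, which is exactly why only one strand needs to be stored.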
Special thanks also to Dr. Atul Butte of Stanford University who dropped by to say that a programmer really doesn’t need a full byte to store a nucleic acid base. A base can be represented by 2 bits (e.g. 00 for A, 11 for C, 01 for G and 10 for T). Since 1 byte contains 8 bits, a byte can actually hold 4 bases. Without compression. So 3 billion bases can be held within 750,000,000 bytes. That’s about 715 megabytes (1 megabyte = 1048576 bytes), which can easily fit on to an extended-length CD-ROM (not even a DVD). So the entire genetic code from a single polynucleotide strand of the human genome can easily fit on to a single CD-ROM. Since human cells are diploid, with two CD-ROMs, one for each set of chromosomes, you can capture this information for both sets of chromosomes.
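Here is a rough Python sketch of that 2-bit idea, using the same codes as above (00 for A, 11 for C, 01 for G, 10 for T). The function names are mine, and the sketch assumes the sequence contains only the four letters; real genome files also have to cope with unknown bases, which this ignores.

```python
# 2-bit codes taken from the text: A=00, C=11, G=01, T=10.
CODES = {"A": 0b00, "C": 0b11, "G": 0b01, "T": 0b10}
BASES = {code: base for base, code in CODES.items()}

def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking always reads high bits first.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])

def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed for n_bases at 2 bits per base (rounded up)."""
    return (n_bases + 3) // 4

seq = "ATCGTTACGTAAAA"
packed = pack(seq)
genome_bytes = packed_size_bytes(3_000_000_000)
genome_megabytes = genome_bytes / 1048576
```

The original length has to be stored alongside the packed bytes, because the last byte may be only partly filled.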
To compound the issue, we don’t have a taxonomy system in place to describe the sequences we have. When Linnaeus invented his taxonomy system for living things, he used basic morphologic criteria to classify organisms. If it walked like a duck and talked like a duck, it was a duck! But how do you apply this reasoning to genes? You might think, why not classify them by organism? But there’s a more subtle issue here too. Some of these genetic sequences can be classified in to various categories: is this sequence a promoter, an exon, an intron, or could it be a sequence that plays a role in growth, death, inflammatory response, and so on? Not only that, many sequences could be found in more than one organism. So how do you solve the problem of classification? Man’s answer to this problem is simple – you don’t!
Here’s how we can get away with that. Simply create a relational database using MySQL, PostgreSQL or what have you and create appropriate links between sequence entries, their functions, etc. Run queries to find relationships and voila, there you have it! This was our first step in developing bioinformatics as a field. Building databases. You can do this with a DNA sequence (a string of letters A for ‘adenine’, C for ‘cytosine’, G for ‘guanine’ and T for ‘thymine’, represented like so …ATGGCTCCTATGCGGTTAAAATTT…) or with an RNA sequence (a string of letters A for ‘adenine’, C for ‘cytosine’, G for ‘guanine’ and U for ‘uracil’, like so …AUGGCACCCU…) or even a protein sequence (a string drawn from 20 letters, each letter representing one amino acid). By breaking down and simplifying a 3-D structure this way, you can suddenly enhance data storage, retrieval and more importantly, analysis between:
- Two or more sequences of DNA
- Two or more sequences of RNA
- Two or more sequences of Protein
You can even find relationships between:
- A DNA sequence and an RNA sequence
- An RNA sequence and a Protein sequence
- A DNA sequence and a Protein sequence
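As a small illustration of how these sequence types relate to one another, here is a Python sketch that turns a DNA string into its RNA and protein counterparts. It assumes the standard genetic code and that the DNA string is the coding strand read from its first letter; the function names are my own.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna: str) -> str:
    """Coding-strand DNA to messenger RNA: swap T for U."""
    return dna.replace("T", "U")

def translate(dna: str) -> str:
    """Read the DNA three letters at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGGCTCCTATGCGGTTAAAATTT"
rna = transcribe(dna)
protein = translate(dna)
```

For the example DNA string used earlier in this post, this gives the RNA string AUGGCUCCUAUGCGGUUAAAAUUU and the eight-letter protein string MAPMRLKF.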
If you can represent the positions of the atoms within a protein’s 3-D structure as cartesian coordinates (x, y, z), you can not only analyze structure within a given protein, but also try to predict the best possible 3-D structure for a protein that is hypothetically synthesized from a given DNA or RNA sequence. In fact that is the Holy Grail of bioinformatics today: how to predict protein structure from a DNA sequence, and consequently, how to manipulate protein structure to suit your needs.
The Tour Begins
Let’s take a tour of what bioinformatics holds for us.
The Ability To Build Relational Databases
We have already discussed this above.
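For the curious, here is a toy version of that idea in Python. I am using the built-in sqlite3 module purely so the sketch runs without a database server (the same idea applies to MySQL or PostgreSQL), and the table names, labels and sequences are all made up for illustration.

```python
import sqlite3

# An in-memory database: one table of sequences, one table linking
# each sequence to a functional role. All entries are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sequence (
        id INTEGER PRIMARY KEY,
        label TEXT,
        molecule TEXT,
        residues TEXT
    );
    CREATE TABLE annotation (
        sequence_id INTEGER REFERENCES sequence(id),
        role TEXT
    );
""")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?, ?)",
    [
        (1, "toy_dna_1", "DNA", "ATGGCTCCTATGCGG"),
        (2, "toy_rna_1", "RNA", "AUGGCACCCU"),
        (3, "toy_dna_2", "DNA", "TTAAAATTT"),
    ],
)
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?)",
    [(1, "eye development"), (2, "eye development"), (3, "promoter")],
)

def sequences_with_role(role: str):
    """Return (label, molecule) for every sequence linked to a role."""
    rows = conn.execute(
        """SELECT s.label, s.molecule
           FROM sequence AS s
           JOIN annotation AS a ON a.sequence_id = s.id
           WHERE a.role = ?
           ORDER BY s.label""",
        (role,),
    )
    return rows.fetchall()

eye_sequences = sequences_with_role("eye development")
```

The point is that nothing here is classified in a rigid hierarchy. A sequence can carry as many annotations as you like, and the relationships fall out of the queries.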
Local Sequence Comparison
Before we delve in to the idea of sequence comparisons further, let’s take an example from the bioinformatics book I mentioned to understand how sequence comparisons help in the real world. It speaks of a gene-knockout experiment that targets a specific sequence in the fruit fly’s (Drosophila melanogaster) genome. Knocking this sequence out results in the flies’ progeny being born without eyes. By knocking this gene (called eyeless) out, you learn that it somehow plays an important role in eye development in the fruit fly. There’s a similar (but not quite the same) condition in humans called aniridia, in which eyes develop in the usual manner, except for the lack of an iris. Researchers were able to identify the particular gene that causes aniridia and called it aniridia. By inserting the aniridia gene in to an eyeless-knockout Drosophila’s genome, they observed that suddenly its offspring bore eyes! Remarkable, isn’t it? Somehow there’s a connection between two genes separated not only by different species, but also by genera and phyla. To discern how each of these genes functions, you proceed by asking whether the two sequences could be the same. How similar might they be exactly? To answer this question you could do an alignment of the two sequences. This is the most basic kind of thing we do in sequence analysis.
Instead of doing it by hand (which could be possible if the sequences being compared were small), you could find the best alignment between these two long sequences using a program such as BLAST. There are a number of ways BLAST can work. Because the two sequences may have only certain regions that fit nicely, with other regions that don’t (where one sequence has to be padded with gaps to keep the rest lined up), you can have multiple ways of aligning them side by side. But what you are interested in is finding the best fit that maximizes how much they overlap with each other (and minimizes gaps). Here’s where computer science comes in to play. In order to maximize overlap, you use the concept of ‘dynamic programming’. It is helpful to understand dynamic programming as an algorithmic technique rather than a program per se (it’s not like you’ll be sitting in front of a computer and writing code if you want to compare eyeless and aniridia; the BLAST program will do the dirty work for you, using fast heuristics that approximate what a full dynamic programming alignment would find). Amazingly enough, dynamic programming is not something as hi-fi as you might think. It is apparently the same strategy used in many computer spell-checkers! Little did the bioinformaticians who first applied dynamic programming techniques in genetics know that the concept had been discovered far earlier. There are apparently many such cases in bioinformatics where scientists keep reinventing the wheel, purely because it is such an interdisciplinary field!
One of the most common dynamic programming algorithms, used for finding the best-matching regions within two sequences (local alignment), is the Smith-Waterman algorithm. Alongside dynamic programming, another useful approach in bioinformatics is what is called a greedy algorithm. In a greedy algorithm, you are interested in maximizing overlap in each baby-step as you construct the alignment, without consideration to the final overlap. In other words, it doesn’t matter to you how the sequences overlap in the end as long as you maximize overlap each step of the way during the alignment process. Other concepts in alignment include using a (substitution) matrix of possible scores for when two letters, one from each sequence, line up, and trying to maximize the total score using dynamic programming. Common matrices for this purpose are BLOSUM-62, BLOSUM-45 and PAM (Point Accepted Mutation).
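To give a feel for what dynamic programming looks like in practice, here is a bare-bones Python sketch of Smith-Waterman local alignment. It is a teaching toy under simplifying assumptions: a flat score of +2 for a match, -1 for a mismatch and -1 per gap (my own arbitrary choices) instead of a real substitution matrix like BLOSUM-62, and it reports only one best-scoring region.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a fresh alignment here
                score[i - 1][j - 1] + s,  # line up a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Walk back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

result = smith_waterman("AAACGTGG", "TTCGTCC")
```

The table of partial scores is the whole trick: each cell is computed once from its three neighbours, so you never re-examine the same sub-alignment twice.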
So now that we know the basic idea behind sequence alignment, here’s what you can actually do in sequence analysis:
- Using alignment, find a sequence from a database (e.g. GenBank from the NCBI) that maximizes overlap between it and a sequence that isn’t yet in the database. This way, if you discover some new sequence, you can find relationships between it and known sequences. If the sequence in the database is associated with a given protein, you might be able to look for it in your specimen. Each such comparison of two sequences is called a pairwise alignment.
- Just as you can compare two sequences and find out if there is a statistically significant association between them or not, you can also compare multiple sequences at once. This is called multiple sequence alignment.
- If certain regions of two sequences are the same, it can be inferred that they are conserved across species or organisms despite environmental stresses and evolution. A sequence encoding development of the eye is very likely to remain unchanged across multiple species for which sight is an essential function to survive. Here comes another interesting concept – phylogenetic relationships between organisms at a genetic level. Using alignment it is possible to develop phylogenetic trees and phylogenetic networks that link two or more gene sequences and as a consequence find related proteins.
- Similar to finding evolutionary homology between sequences as above, one could also look for homology between protein structures – motifs – and then conclude that the regions of DNA encoding these proteins have a certain degree of homology.
- There are tools in sequence analysis that look at features characteristic of known functioning regions of DNA and see if the same features exist in an uncharacterized sequence. This process is called gene finding. You’re trying to discover functionality in hitherto unknown sequences of DNA. This is important, as the vast majority of the genome has, as far as we know, no identified function (the so-called junk). Could there be some region in this vast ocean that might, just might have an interesting function? Gene finding uses software that looks for tRNA encoding regions, promoter sites, open reading frames, exon-intron splicing regions, … – in short, the whole gamut of what we know is characteristic of functional code. Once a statistically significant result is obtained, you’re ready to test this in a lab!
- A special situation in sequence alignment is whole genome alignment. That is, finding the best fit between entire genomes of different organisms! Despite how arduous this sounds, the underlying ideas are pretty similar to local sequence alignment. A related idea is global alignment, where two sequences are aligned along their entire length rather than just over their best-matching regions. The classic dynamic programming algorithm for global alignment is the Needleman–Wunsch algorithm.
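One of the gene finding features mentioned in the list above, the open reading frame, is simple enough to sketch in Python. This toy version only scans the forward strand, treats ATG as the only start signal and uses a made-up minimum length, so take it as an illustration rather than how real gene finders work.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop stretches.

    Scans the three forward reading frames only. `min_codons` counts the
    codons from ATG up to, but not including, the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

dna = "CCATGAAATTTTAGGG"
orfs = find_orfs(dna)
```

Here the single hit is the stretch ATGAAATTTTAG, found in the third reading frame.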
Many of the things discussed for sequence analysis of DNA have equal counterparts for RNA and proteins.
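For comparison with local alignment, here is a minimal Python sketch of the Needleman–Wunsch scoring step, which aligns two sequences end to end. As before, the flat scores (+1 match, -1 mismatch, -1 per gap) are my own assumption, and the sketch returns only the score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of a against all of b."""
    # prev[j] = best score for the part of a seen so far against b[:j].
    # Aligning nothing against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] against nothing costs i gaps
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + s,    # line up a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            ))
        prev = curr
    return prev[-1]

score = needleman_wunsch_score("ACGT", "AGT")
```

Unlike the local version, scores here are never reset to zero, so every letter of both sequences has to be accounted for, gaps included.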
Protein Structure Property Analysis
Say that you have an amino acid sequence for a protein. There’s nothing in the databases that has your sequence. In order to build a 3-D model of this protein, you’ll need to predict what could be the best possible shape given the constraints of bond angles, electrostatic forces between constituent atoms, etc. There’s a specific technique that warrants mentioning here, the Ramachandran Plot. It plots the two backbone rotation angles (called phi and psi) of each amino acid against each other and, using information on steric hindrance, shows which combinations of angles are physically plausible. With a 3-D model, you could try to predict this protein’s chemical properties (such as pKa, etc.). You could also look for active sites on this protein, the crucial regions that bind to substrates, based on known structures of active sites from other proteins… and so on.
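The angles on a Ramachandran plot are dihedral (torsion) angles, each defined by four consecutive backbone atoms. Here is a small Python sketch of computing one such angle from (x, y, z) coordinates. The helper names are my own and the coordinates in the example are made-up points, not real atoms.

```python
import math

def _sub(p, q):
    return tuple(pi - qi for pi, qi in zip(p, q))

def _dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Rotation angle about the p1-p2 bond, in degrees (-180 to 180)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / length for c in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Four made-up points: the outer two sit on opposite sides of the bond.
trans = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

For a real protein, phi would be computed from the atoms C (previous residue), N, C-alpha and C, and psi from N, C-alpha, C and N (next residue).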
Protein Structure Alignment
This is when you try to find the best fit between two protein structures. The idea is very similar to sequence alignment, only this time the algorithms are a bit different. In most cases, the algorithms for this process are computationally intensive and rely on trial and error. You could build phylogenetic trees based on structural evolutionary homology too.
Protein Fingerprint Analysis
This is basically using computational tools to identify relationships between two or more proteins by analyzing their break-down products, their peptide fingerprints. Using protein fragments, it is possible to compare entire cocktails of different proteins. How does the protein mixture from a human retinal cell compare to a protein mixture from the retinal cell of a mouse? This kind of stuff is called Proteomics, because you’re comparing the entire set of proteins from one organism or cell to that of another. You could also analyze protein fragments from different cells within the same organism to see how they might have evolved or developed.
DNA Micro-array Analysis
A DNA microarray is a slide with thousands of tiny dots on it. Each dot holds many copies of a known DNA sequence (a probe). RNA extracted from a population of cells is copied into DNA, tagged with a fluorescent marker and washed over the slide, where it sticks (hybridizes) to the dots carrying matching sequences. A dot glows under laser (or another form of) light if the corresponding genetic sequence is being expressed, that is, transcribed into RNA, which in turn is usually translated in to protein. By measuring the amount of light coming from these dots, you could develop a gene expression profile for these cells. You could then study the expression profiles of these cells under different environmental conditions to see how they behave and change.
Of course you could try looking at all these light emitting dots with your eyes and count manually. If you want to take a shot at it, you might even be able to tell the difference between the different levels of brightness between dots! But why not use computers to do the job for you? There are software tools out there that can quantitatively measure these expression profiles for you.
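One common way of putting numbers on such a comparison is the log2 ratio of brightness between two conditions. Here is a tiny Python sketch. The gene names and intensity values are invented, and real analysis software also handles background correction and normalization, which this leaves out entirely.

```python
import math

def log2_fold_change(treated, control):
    """Per-gene log2(treated / control): +1 means doubled, -1 means halved."""
    return {
        gene: math.log2(treated[gene] / control[gene])
        for gene in control
    }

# Made-up spot intensities for three hypothetical genes.
control = {"gene_a": 100.0, "gene_b": 100.0, "gene_c": 80.0}
treated = {"gene_a": 200.0, "gene_b": 50.0, "gene_c": 80.0}

changes = log2_fold_change(treated, control)
```

Using a log scale means that a doubling and a halving come out as equal and opposite numbers, which makes up- and down-regulation easy to compare.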
There are many experiments and indeed diagnostic tests that use artificially synthesized DNA sequences as anchors that flank a specific region of interest in the DNA of a cell, in order to amplify this region. By amplify, we mean make multiple copies. These flanking sequences are called primers. Applications include, for example, amplifying genetic material from HIV to better detect the presence or absence of the virus in the blood of a patient. This kind of test or experiment is called the polymerase chain reaction. There are a number of other applications of primers such as gene cloning, genetic hybridization, etc. Primers ought to be constructed in specific ways that prevent them from forming loops or binding to non-specific sites on cell DNA. How do you find the best candidate for a primer? Of course, computation!
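As a taste of what primer-design software checks, here is a Python sketch of two quick properties of a candidate primer: its GC content and a rough melting temperature from the simple Wallace rule (2 degrees for each A or T, 4 degrees for each G or C), which is only a rule of thumb for short primers. The function names and the example primer are mine, and a real tool would also test for loops and off-target binding.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of bases in the primer that are G or C."""
    return sum(base in "GC" for base in primer) / len(primer)

def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    at = sum(base in "AT" for base in primer)
    gc = sum(base in "GC" for base in primer)
    return 2 * at + 4 * gc

# An 18-letter example taken from the DNA string used earlier in this post.
primer = "ATGGCTCCTATGCGGTTA"
gc = gc_fraction(primer)
tm = wallace_tm(primer)
```

For this example the GC content is 50% and the estimated melting temperature is 54 degrees C.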
Then there is the modeling of metabolic pathways and their relationships using computational analyses. How does the glycolytic pathway relate to some random metabolic pathway found in the neurons of the brain? Computational tools help identify potential relationships between all of these different pathways and help you map them. In fact, there are metabolic pathway maps out there on the web that continually get updated to reflect this fascinating area of ongoing research.
I guess that covers a whole lot of what bioinformatics is all about. When it comes to definitions, some people say that bioinformatics is the application part whereas computational biology is the part that mainly deals with the development of algorithms.
As you can see, some fancy new words have come into existence as a result of all this frenzied activity:
- Genomics: Strictly speaking, the study of entire genomes of organisms/cells. In bioinformatics, this term is applied to any studies on DNA.
- Transcriptomics: Strictly speaking, the study of entire transcriptomes (the RNA complement of DNA) of organisms/cells. In bioinformatics, this term is applied to any studies on RNA.
- Proteomics: Strictly speaking, the study of the entire set of proteins (the proteome) made by organisms/cells. In bioinformatics, this term is applied to any studies on proteins. Structural biology, which explores the 3-D structure of proteins, is closely related.
- Metabolomics: The study of entire metabolic pathways in organisms/cells. In bioinformatics, this term is applied to any studies on metabolic pathways and their inter-relationships.
Real World Impact
So what can all of this theoretical ‘data-dredging’ give us anyway? Short answer – hypotheses. Once you have a theoretical hypothesis for something you can test it in the lab. Without forming intelligent hypotheses, humanity might very well take centuries to experiment with every possible permutation or combination of data that has been amassed so far and mind you, which continues to grow as we speak!
Thanks to bioinformatics, we are now discovering genetic relationships between different diseases that were hitherto considered completely unrelated – such as diabetes mellitus and rheumatoid arthritis! Scientists like Dr. Atul Butte and his team are trying to reclassify all known diseases using all of the data that we’ve been able to gather from Genomics. Soon, the days of the traditional International Classification of Diseases (ICD) might be gone. We might some day have a genetic ICD!
Sequencing of individual human genomes (technology for this already exists and many commercial entities out there will happily sequence your genome for a fee) could help in detecting or predicting disease susceptibility.
Proteins could be substituted between organisms (a la pig and human insulin) and better yet, completely manipulated to suit an objective – such as drug delivery or effectiveness. Knowing a DNA sequence could one day give you enough information to predict protein structure and function, giving you yet another tool in diagnosis.
And the list of possibilities is endless!
Bioinformatics is thus man’s attempt at making biology and medicine a predictive science 🙂 .
I haven’t had the chance to read any other books on bioinformatics, what with exams just a couple of months away. Having read “Developing Bioinformatics Computer Skills” and found it a little too dense, especially in the last couple of chapters, I would only recommend it as an introductory text to someone who already has some knowledge of computer algorithms. Because different algorithms have different caveats and statistical gotchas, it makes sense to have a sound understanding of what each of these algorithms does. Although the authors have done a pretty decent job in describing the essentials, the explanations of the algorithms and how they really function are a bit complicated for the average biologist. It’s difficult for me to recommend a book that I might not have read, but here are two I’m considering worth exploring in the future:
As books to refresh my knowledge of molecular biology and genetics I’m considering the following:
Molecular Biology Of The Gene by none other than James D Watson himself et al (Of ‘Watson & Crick‘ model of DNA fame)
Let me know if you have any other suggested readings in the comments.
There are also a number of excellent Opencourseware lectures on bioinformatics out on the web (example: at AcademicEarth.org). For beginners though, I suggest Dr. Daniel Lopresti’s (Lehigh University) fantastic high level introduction to the field. Also don’t forget to check out “A Short Course On Synthetic Genomics” by George Church and Craig Venter on Edge.org for a fascinating overview of what might lie ahead in the future! In the race to sequence the human genome, Craig Venter headed the main private company that posed competition to the NIH’s project. His group of researchers ultimately developed a much faster way to sequence the genome than had previously been imagined – the shotgun sequencing method.
Hope you’ve enjoyed this high level tour. Do send in your thoughts, suggestions and corrections!
UPDATE 1: Check out Dr. Eric Lander’s (one of the stalwarts behind the Human Genome Project) excellent lecture at The Royal Society from 2005 called Beyond the Human Genome Project – Medicine in the 21st Century that tries to give you the big picture on this topic.
UPDATE 2: Also check out NEJM’s special review on Genomics called Genomics — An Updated Primer.
Copyright © Firas MR. All rights reserved.
Written by Firas MR
August 12, 2009 at 7:19 pm