Sequencing the Human Genome: Transcript Part 2
The Birth of Celera:
Celera started under sort of strange terms. I got a couple of calls from senior management of what was then the Perkin-Elmer Corporation, saying that Mike Hunkapiller had developed a new sequencing machine, and by the way, they were thinking of putting up 300 million dollars to sequence the genome...was I interested? I thought they were crank calls and I hung up on them. Finally, out of frustration, Mike called me himself and said it was real, that the new instrument was going to be really fantastic, that they were thinking of funding the formation of a company, and that he thought this would work perfectly with our whole genome shotgun strategy.
We decided very quickly after seeing this device that we would form Celera to sequence the human genome. There were three key components. One was the new automated sequencer, and very few people understand why this was such a significant breakthrough. There were two key parts to it...most of the sequencing up until then had been on slab gels, and we would try to cram as many samples [as possible] into these slab gels, but as they ran down the gels, quite often they would get mixed. Now for a small genome, where we only need 20 to 30 thousand sequences, we could live with that problem. But with a human genome, where we need on the order of tens of millions of sequences, that made it impossible to sequence the human genome accurately with those methods.
The other key part of this was the switch from slab gels to capillaries where each DNA was in a small glass tube, so it totally eliminated that problem.
The other major change is the fully automated machine. We have six people that run 300 of these, and they run 24 hours a day, 7 days a week. This is in contrast to the usual situation of having 2-3 people running one machine. So it changed the cost factors--the manpower factors--very dramatically. This was combined with the TIGR sequencing strategy...and we had to build a supercomputer as well.
This is a picture of our major facility. It is the size of a football field. It has a large number of these sequencing machines, but very few people. It is a million dollar a year electric bill, mostly for the air-conditioning to cool off all the lasers. But the other key aspect [is that] these are just hooked up to a network where the data goes out to a computer.
We teamed up with Compaq to build a supercomputer. It is a roughly 1 1/2 teraflop computer, which means it can do 1.5 trillion calculations per second. Even with that power it took 20,000 CPU hours to assemble the human genome. We now have over 100 terabytes of spinning disc storage and we have had to create a large amount of backup storage as well.
And we have teamed up with the U.S. Government, through the Sandia [National] Labs, and Compaq to try to build a 100 teraflop computer. The physicists at the Department of Energy have now thrown in the towel, saying they understand that the big computing challenges are not in simulating nuclear weapons blasts but in trying to understand the human genetic code and how that turns into biology, and so the Department of Energy is trying to turn its computing power to something else.
Now when we started this program, the DNA sequencer was a breadboard device. I never actually saw it work. We knew that the algorithms we developed at TIGR for the microbial genomes would not scale to the level of the human genome, but we knew that it was possible to get there. We didn't have any of the lab techniques for making hundreds of thousands of clones a day very effectively.
So with all that, we thought it would be reasonable to do a test project, although the test project we chose was the largest genome anybody had attempted to date, and that was the fruit fly genome. People at this institution know what a wonderful history it has in terms of biology. Most of the discoveries about human genetics have stemmed from methods that were developed in Drosophila.
I met Gerry Rubin at a Cold Spring Harbor meeting where we were first introducing our plan to the human genetics community, and I pulled him out in the hallway and I asked if he would be interested in collaborating with us on the fruit fly genome. I said Harold Varmus was pushing us to do another worm, but I didn't really want to do a worm. As a neurobiologist, I wanted to do Drosophila...was he interested? It took him roughly 5 seconds to make up his mind. He said he would collaborate with us and I asked him why he decided so quickly. He said the Drosophila community would kill him if we sequenced a worm after he turned us down. But it turned out to be one of the best collaborations in science that I have ever participated in.
Strategies and Challenges for Gene Sequencing:
I won't go through the algorithms, but I will tell you a little bit about the differences and the different strategies for DNA sequencing. The problem is that all the sequencing technology gives us only five to six hundred letters of genetic code. So it is a very simple engineering challenge. How do you get the genetic code of something billions of letters long when you can only get five to six hundred letters at a time?
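The scale of that challenge can be put in numbers with a little arithmetic. The sketch below is illustrative only: the 3 billion letter genome size and the 5x oversampling factor are assumptions added here, not figures from the talk, and `reads_needed` is a hypothetical helper name.

```python
import math

def reads_needed(genome_length, read_length, coverage=1.0):
    """Number of reads needed to sample a genome at a given average coverage."""
    return math.ceil(genome_length * coverage / read_length)

# Assumed for illustration: a human-sized genome of 3 billion letters.
HUMAN_GENOME = 3_000_000_000

for read_length in (500, 600):
    once = reads_needed(HUMAN_GENOME, read_length)
    # Assumed 5x oversampling, so that neighboring reads overlap.
    oversampled = reads_needed(HUMAN_GENOME, read_length, coverage=5)
    print(f"{read_length} letters per read: {once:,} reads at 1x, {oversampled:,} at 5x")
```

Under those assumptions the count lands in the tens of millions of reads, the same order of magnitude mentioned earlier in the talk.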
The other problem is that genomes, including the human genome, have lots of repeats: the same sequence repeated over and over again, which really confounds a lot of the methods.
So, there are three different basic strategies. The first one, which has been used by a lot of people, including us for that first gene we sequenced, was mapping and walking. You would sequence one piece, get six hundred letters, you would make a little primer that allowed you to read the next six hundred letters and you would just work sequentially down the clone with that. It was estimated it would take about a century to sequence the human genome with these methods.
The other approach--actually settled on by the public effort because of the limitations of the mathematics--was mapping each clone first, lining them up, and then sequencing the clones once they were lined up. So when you hear about mapping, it was first trying to get these smaller pieces of DNA in the right order before they were sequenced. With something like the E. coli genome it took three years just to get the lambda clones ordered around the genome, before Blattner's lab could start sequencing those.
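The walking strategy described above can be sketched as a loop in which every read depends on the one before it. This is a toy model, not a lab protocol: `primer_walk` and the 20-letter primer length are assumptions made for illustration.

```python
def primer_walk(clone, read_length=600, primer_length=20):
    """Read a clone left to right, one read per round.

    Each new round needs a primer designed from the end of the sequence
    already known, so the rounds cannot be run in parallel.
    """
    known = clone[:read_length]   # first read, started from the clone's edge
    primers = []
    rounds = 1
    while len(known) < len(clone):
        primers.append(known[-primer_length:])                 # design the next primer
        known += clone[len(known):len(known) + read_length]    # read the next stretch
        rounds += 1
    return known, rounds, primers

# A made-up 2,500-letter clone, just to count the rounds.
clone = "ACGT" * 625
sequence, rounds, primers = primer_walk(clone)
print(rounds, len(primers))
```

Because the number of strictly sequential rounds grows with the length of the clone, the approach does not speed up by adding more machines, which is one way to see why the estimates for a whole genome were so long.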
Whole Genome Shotgun Method:
With "whole genome shotgun" we don't do any cloning first. We just treat the problem as a whole. We take the entire set of chromosomes, break those down into little pieces and then use new mathematical algorithms to try and solve these massive jigsaw puzzles. The mathematics was very similar up until recently, whether you were sequencing a small clone or a big clone and it was just like when you work a jigsaw puzzle. You pick up a piece and you compare it to every other piece in the puzzle until you find a match.
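The jigsaw comparison can be made concrete with a toy all-against-all overlap search. This is a minimal sketch of a generic greedy overlap assembler, not the Celera or TIGR code; the function names, the `min_len` threshold, and the example reads are all made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        # Compare every piece against every other piece, like working a jigsaw puzzle.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps; leave the pieces separate
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["GATTAC", "TTACAG", "CAGGCT"]))
```

The nested loops are the point: the work grows with the square of the number of pieces, and a repeated sequence gives the same overlap in more than one place, so this kind of local matching alone cannot tell which neighbor is the right one.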
That is what the computers did with the early algorithms. They took a sequence and searched through all the other ones until they found a match, and then they could line them up. That is the assumption behind what the government program and other labs view as shotgun sequencing. And it is very limited, because you can only look at the next neighbors of these small sequences as you build the larger pieces. Whole genome shotgun sequencing is based on a totally different strategy of mate pairs, where we sequence both ends of clones of different sizes, so we have these ends and we know how far apart they are.
To date myself...I used to play with tinker toys. For those of you who remember what those are, they were long sticks and they had balls on the end with multiple holes so you could then plug other sticks into them and build very large structures very quickly. That is in contrast to Legos, which are like bricks and like the previous method: you have to build them up one at a time on top of each other. So, this is the strategy that builds very large structures very quickly. There is a very good reason large bridges and large skyscrapers are not built out of bricks. You need long structures for putting things together.
So we take all the chromosome material, shear it down to certain sizes, and this is all work that Ham Smith does himself in the lab. And that has been one of our secret weapons because handling large pieces of DNA effectively is a real art form that he has only been able to teach to 2 or 3 other people. Getting the size of these very accurately is important for knowing how far apart these ends are.
So we use four different sized pieces. We use pieces that were 2,000 letters long, pieces that were 10,000, pieces that were 50,000 and pieces that were 150,000. You can see as you get 500 to 600 letters from each end, you can get real close distances, intermediate ones and very long ones. And what happens is, we build very large scaffolds very quickly.
For example, if this is a 50 KB clone, and the one next to it matches one end, and the other end matches say a 2 KB clone, then you start to build these intermediate structures very quickly, which is very different from having one end at a time where all you can do mathematically is build small local structures. All the thinking in the field was confined to this "Lego" approach and that is why people were absolutely certain these new methods would not work.
What we do in fact, as we build smaller structures, is use the longer pieces to link them together. So mathematically, if you require two or more of these links, there is less than one chance in 10 to the 15th of making an error in doing this. We build very large scaffolds very quickly, with small holes.
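The rule of requiring two or more links can be sketched in a few lines. This is a simplified illustration and not the Celera assembler: it only groups contigs that share enough mate-pair links, and it ignores the clone sizes and read orientations that a real scaffolder would also check. The names `scaffold`, `mate_links`, and `min_links` are hypothetical.

```python
from collections import Counter

def scaffold(contigs, mate_links, min_links=2):
    """Group contigs into scaffolds using mate-pair links.

    mate_links holds one (contig_a, contig_b) pair per clone whose two end
    reads were placed in different contigs.
    """
    # Count how many independent clones support each contig-to-contig join.
    support = Counter(frozenset(pair) for pair in mate_links if pair[0] != pair[1])
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for pair, count in support.items():
        if count >= min_links:   # a single link is not trusted on its own
            a, b = pair
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())

links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"),
         ("c3", "c4"), ("c4", "c3"), ("c3", "c4")]
print(scaffold(["c1", "c2", "c3", "c4"], links))
```

In this made-up example the single link between c2 and c3 is not enough to join them, so the result is two scaffolds with a gap between them rather than one structure that might be wrong.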
We said in the beginning when we sequenced the genome, that we would have very large structures and there would be small holes where repeat sequences go and we thought we could resolve those in the end, and that seems to be working out. Because of the publicity surrounding this, much was made about the small holes and that is where the draft sequence came from, even though the government effort then set out to do one with even greater holes.
This is looking at an assembly, so you can see with all these links, all these bridges, you build a very accurate scaffold over these distances very quickly. This is what we got when we did this with the Drosophila genome--and it took 4 months to sequence the Drosophila genome. Each one of these colored bars is one of the chromosome arms of Drosophila. Here are 120 million letters of genetic code across here.
We chose Drosophila for several reasons as I indicated, but one is, it had the best, most accurate map of any genome. We were betting not only multiple careers but an awful lot of money on this process. We wanted to know if it worked. We figured Drosophila would tell us in a very clear "yes" or "no" answer. We compared it back to the mapping data and we found 16 sites out of over 2,000 did not agree with the sequence assembly. The Drosophila community has now shown that all 16 of these were mapping errors and basically they agree totally with the sequence. Gerry Rubin's lab has reported this is the most accurate sequence yet produced. It has less than one error in one million base pairs.
So we had this sequence very quickly and it was a big challenge. What do you do with 120 million base pairs of sequence? For the microbial genomes, Haemophilus was 1.8 million and it was a tremendous effort to annotate that. We formed what we called an "annotation jamboree". We brought top Drosophila scientists in from around the world and they camped out at Celera for several weeks. This was a very strange event. We tried to have some social interactions, some dinner parties. Nobody would hang around for them. They would rush back to the lab, rush back to their computers. It was like an ultimate boot camp or summer camp for nerds. But apparently they weren't working all the time because there was a young couple who met here and they got married afterwards. So there was a little bit more going on with this. But this was a very exciting event as people exchanged DNA, and sorted their way through the genome. This is Gerry Rubin, that is what this community is all about.
At the end of this process, there were roughly 13,000 genes that were annotated, but as with the microbial genomes, over half of them...we don't have the slightest clue what their biology is. Even for the ones in categories where we know what they are similar to, we still don't know the roles they play in biology. This number will change up or down.
The algorithms do not predict very small genes very well because there is so much background noise. So I am sure this is missing small genes. The hypothetical ones are where there is no evidence other than ______ predictions; a lot of these are starting to go away and people are finding some other ones that we missed. But it is roughly going to be around the 12,000 to 14,000 gene number. Less than one year from starting the project, from starting the conversation with Gerry Rubin, we published this paper last year in Science, and for the last year this has been the most quoted paper in Science. I got an e-mail yesterday from Gerry Rubin saying he just got the latest statistics, that it is outperforming everybody else 2:1, so he and everyone else are very proud of this study.
Well, five years apart. It was five years between the first genome and Drosophila...a big difference in size. For Haemophilus, we had to sequence 26,000 sequences and it took us 4 months, and in the same time we did roughly 3 million with Drosophila with the new methods.
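Those two projects can be compared with some quick arithmetic using the figures quoted in this talk (26,000 and 3 million sequences, genomes of 1.8 million and 120 million letters). The 550-letter read length is an assumption, taken as the midpoint of the five to six hundred letters per read mentioned earlier, so the coverage values are rough estimates only.

```python
def coverage(num_reads, read_length, genome_length):
    """Average number of times each letter of the genome is read."""
    return num_reads * read_length / genome_length

READ_LENGTH = 550  # assumed midpoint of 500 to 600 letters per read

# (number of sequences, genome length in letters), as quoted in the talk
projects = {
    "Haemophilus": (26_000, 1_800_000),
    "Drosophila": (3_000_000, 120_000_000),
}

for name, (num_reads, genome_length) in projects.items():
    depth = coverage(num_reads, READ_LENGTH, genome_length)
    print(f"{name}: roughly {depth:.1f}x coverage")

# Both projects took about four months, so this is also the gain in throughput.
throughput_ratio = projects["Drosophila"][0] / projects["Haemophilus"][0]
print(f"Sequences produced in the same time: about {throughput_ratio:.0f} times as many")
```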
The staff for sequencing didn't change dramatically. There was a huge difference in the algorithm team, which ultimately grew to over 30. There are 2 million lines of code in the Celera assembler, which is a series of 5 major steps. If we were to start the Haemophilus project or the E. coli project over today, it would take roughly 2 hours to do. E. coli took a total of 12 years, ultimately, to sequence. The yeast genome project, which took 10 years with over a thousand scientists, would take roughly 6-7 hours to do.
To show you the differences in what we can do computer-wise now, it took us 11 days on a Sun computer to assemble the Haemophilus genome. Now, with the new algorithms and the new computers, we have this down to under 5 minutes. But even with that difference, it still took 20,000 CPU hours to do [the] human [genome].