Sequencing the Human Genome: Transcript Part 1
Early DNA Sequencing Techniques:
Thank you very much David, for the very kind introduction. Thank you Gerry, for helping to invite me here. It is indeed a pleasure to be back here. It is good to see some old friends from the Neurology Institute days that I haven't seen since that time, when I left NIH in 1992 to form TIGR. I was actually up here for the first molecular evolution course back when I was a government scientist, and could only afford a 25-foot sailboat, and I lived on the boat in Eel Pond for a couple of weeks while I took courses here. So it was a wonderful experience, and if I could have extended it I would have.
In 1984, when Claire and our lab first moved to NIH to the Neurology Institute, we were working on trying to purify one protein, the beta adrenergic receptor, from heart and lungs. We were 8 to 9 years into it. We finally got a tiny bit of protein sequence that enabled us in the first year at NIH to clone one of the first neurotransmitter receptors. I think it was the first one from human brain, at the same time that Robert Lefkowitz and his team were doing it from other tissues.
It was a big change for somebody who trained as a biochemist. We shut down everything in the lab to teach ourselves molecular biology that year, and it took a year to sequence this one gene, even though it was only a cDNA clone, 1200 base pairs in length. It was a slow, painful year as we learned how to do it. It was the end of a 10-year period to get one gene.
This was the time--most of you remember the mid 80's--when the first serious discussions about sequencing the human genome took place. I think I was one of the few biochemists that actually got excited about this project, in part because we knew from these neurotransmitter receptors that they were part of very large multi-gene families. And having just taken a decade to get one, I figured the idea of sequencing the entire genome to get all of them, even with a 10-, 15- or 20-year time course, would be a wonderful bargain.
Also, in 1986 there was a key paper published, at least key to us. It was the paper that Lee Hood's group published describing how they were going to change DNA sequencing by attaching four fluorescent dyes to the DNA instead of using radioactivity and reading the x-ray films. This, in theory, allowed the DNA just to run down a single lane, to be activated by a laser and read into a computer.
In fact it was in February 1987 when my NIH lab at the Neurology Institute was the first test site for the first automated DNA sequencer. This led not only to rapidly sequencing some genes, but it led, when Jim Watson first came to NIH, to us doing the first test project on sequencing bits of human chromosomes to see if it would actually work. The late Ernst Freese was our scientific director. He was an ex-postdoc with Jim. He took me over to meet Watson to show him our data, and Watson was absolutely stunned by what he saw because he had just come from Lee Hood's lab and Lee's lab was still using manual sequencing. We were the only lab that seemed to be doing automated DNA sequencing. And Jim inquired what my background was and I explained how I trained with Nate Kaplan and I worked on protein purification. And he said, "Oh that explains it, you are a biochemist." I thought for the longest time that that was a compliment. It was only much later that I learned that, at the time, it was the worst cut he could make. Nevertheless, he switched some funds immediately for us to try and scale up.
But we ran into problems, not with sequencing the DNA, but in the interpretation of it once we had the first sequences from human chromosomes, and we found that we could not interpret the human genetic code. The algorithms were inadequate, the computers were inadequate. We didn't know the rules. We didn't know anything about it. And what we had to do was, each place where there was a possibility of there being a gene, we had to make PCR primers around it and go into cDNA libraries to see if we could amplify something. If we did, meaning that it might be expressed, we would sequence that and then compare it back to the genetic code.
It took two years to examine these first 200,000 base pair regions. Turns out they were very gene rich. We found eight genes. But it was eight genes in two years. It wasn't going to be the pace that was going to lead to things. In fact it was the notion...thinking about if we had to have cDNA's to interpret the human genetic code, that led to our development of the EST [expressed sequence tag] method.
With [the EST method], we simply take a cDNA library--cDNA's, for those of you who don't know, come from messenger RNA, which leads to the production of the proteins. We would isolate the messenger RNA from, for example, the human brain, and we would just randomly pick clones and sequence them. And it turns out that with every one we sequenced, we made a major discovery, for the most part. A year or two later, we wrote this up and it was published in Science in 1991, and was really the start of the change of lots of things. It led to some major antagonism between me and the then leaders of the genome project, because they thought this program would threaten the budget of the genome project. If the genes could be found cheaply and quickly, we might lose the 3 billion dollar funding. It seemed to be a common theme through the next decade of my life. This was not meant as a threat and, in fact, if people go back and read this paper, we talk about this being the ultimate method for annotating the genome once it was sequenced.
I am not going to talk about EST's further tonight. I think most people who work in molecular biology are familiar with them. They became the major standard and method for gene discovery at least until whole genome sequencing really started taking place. What this method led to though, was a dramatic change, not in molecular biology techniques but in mathematics, and those changes in mathematics changed what we could do scientifically in the lab.
When we sequenced these first sections of the human chromosomes back at NIH, you could only sequence what was called a cosmid clone. It was about 35,000 letters long, and to sequence that we would shotgun it. We would break it down into random pieces and then sequence those pieces, and the best algorithms at the time could only deal with roughly 1,000 sequences to reassemble them in the right way. That was one of the reasons that the NIH and the public project chose to map small clones first and then sequence them, because there were no methods that allowed even sequencing of large clones, let alone whole genomes.
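The shotgun idea described above (break the clone into random pieces, sequence the pieces, then reassemble them by their overlaps) can be illustrated with a toy sketch. This is not the TIGR assembler or any algorithm from the talk; it is a minimal greedy overlap-merge, and the function names (shotgun, overlap, greedy_assemble), the read length, and the minimum overlap are all hypothetical choices made for illustration. It assumes error-free reads and no repeated stretches, which real genomes do not offer.

```python
import random


def shotgun(sequence, read_len, n_reads, seed=0):
    """Sample n_reads random fixed-length fragments (illustrative only)."""
    rng = random.Random(seed)
    last_start = len(sequence) - read_len
    starts = [rng.randrange(0, last_start + 1) for _ in range(n_reads)]
    return [sequence[s:s + read_len] for s in starts]


def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    # Drop exact duplicates and reads fully contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            # No remaining pair overlaps enough; return what we have.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j) and c not in merged]
        contigs.append(merged)


if __name__ == "__main__":
    # Three hand-picked overlapping reads from the toy sequence ATGGCGTACAGGC.
    print(greedy_assemble(["ATGGCGT", "GCGTACA", "TACAGGC"]))
    # prints ['ATGGCGTACAGGC']
```

Every round of this sketch compares every fragment against every other fragment, so the work grows quickly with the number of reads. That gives a rough feel for why a limit of about 1,000 sequences mattered, although the actual assemblers of the period worked differently from this toy.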
We had hundreds of thousands of EST sequences and we wanted to assemble those together, so we hired some mathematicians at TIGR. The lead one was Granger Sutton, who designed a new algorithm that is now known as the TIGR assembler, which we first used for assembling hundreds of thousands of EST's together to compare them and try to guess how many genes there were.
We all guessed wrong from those numbers, but we knew we had a very powerful tool and Claire and I were sitting around with our good friend and colleague Ham Smith and said, "We have this wonderful tool, let's think of how we can go back and approach genomes with it."
Sequencing a Microbial Genome:
Ham is actually the one I credit with suggesting that we sequence a microbial genome. At first he suggested we sequence the E. coli [Escherichia coli] genome, but the E. coli genome project was in its 9th year of federal funding, and it was about halfway done. We figured we would just antagonize the community even more if we sequenced that genome quickly. So Ham suggested his laboratory pet, Haemophilus influenzae, from which he isolated the first restriction endonucleases, which led to his sharing of the Nobel Prize and to some of the key tools all of us use.
In 1994, Ham and I wrote a grant and submitted it to NIH, proposing to sequence this microbial genome in one year with this new method. We were somewhat skeptical that we would get funded so we dug into the TIGR endowment to fund the project, while we waited to hear from NIH. We had the genome almost completely sequenced and assembled and the paper was being written when we got our pink sheet from NIH telling us what we were doing was impossible and it couldn't possibly work.
I called Francis Collins and explained that it was working extremely well and that we were close to finishing the project and publishing a paper and he said, "No, the experts on the committee said it absolutely won't work", and he didn't think we could do it so he wasn't going to fund it.
A short while later we published this paper in Science--this was in July of 1995--and as David said, this was the first complete genome of a species that is a free living organism, other than a virus. There have been lots of viral genomes done including our sequencing of a smallpox virus early on.
There are a lot of exciting findings and we could spend hours just talking about what was found in Haemophilus. There is a lot of significance towards evolution in terms of having pre-programmed changes built into the genome, not just random changes. But there is one thing I wanted to mention, seeing a wonderful old friend, Monica Riley, in the audience. We used Monica's classification system that she developed for E. coli genes as the basis for annotating Haemophilus, and that has now been carried forward to essentially every genome that has been done, and she gets nowhere near sufficient credit for starting the entire system. So I am delighted to see her here.
TIGR really scaled up from that. The second genome was published also in 1995 and that is the smallest genome to date, the Mycoplasma genitalium genome. It led to some wonderful t-shirts after this was published. TIGR had a t-shirt that says, "I 'heart' my genitalium." A short while later we did, as David said, the first archaea, and it was a wonderful collaboration with Holger Jannasch here, and that was not only a wonderful introduction for me in terms of the broader range of biology, I think it was for much of the community.
Because from the first two genomes, even though one was a gram positive and one a gram negative bacterium, they were very similar in their gene content, and people started speculating that the gene universe was much smaller. That was until we published Methanococcus and found that over half the genes had never been seen before. It didn't have glycolysis. It didn't have a TCA cycle. It used the methanogenesis pathway for making its energy and fixing carbon. But you can see there has been an exponential growth since then and I think we are now approaching close to 100 microbial genomes that have been sequenced.
TIGR has done approximately half of those that have been sequenced and published to date, including some very key pathogens: the cholera genome, tuberculosis genome, strep pneumonia, the malaria genome. Some of the most exciting and interesting ones though, to me, have been the ones out of the environment, including Methanococcus jannaschii.
But I have to say my favorite one is Deinococcus radiodurans. This is an organism that was discovered in the 1950's when the government tried to irradiate meat for long term non-refrigerated storage. Regardless of the dosage of radiation, other than enough to cook the meat, a red-pigmented bacterium kept growing out of the meat. Even the military thought that was unappetizing. This bacterium was isolated and characterized and it was found that it was very resistant to radiation. It can take 3 million rads of radiation. Its chromosomes get blown apart with multiple double-stranded breaks, but over 12 to 24 hours it stitches its chromosomes back together exactly as they were before, and it starts replicating again. Now, I thought only Congress could do that!
Now early on, Francis Crick promoted the notion of panspermia for the origin of life on this planet and it was dismissed for a lot of reasons, but one of them [was that] there were no life forms that could survive outer space. But Deinococcus is just a representative, I think, of a broad array of organisms that can. It can be totally desiccated. It can absorb huge doses of ionizing radiation over a long period of time. You get it in an aqueous environment, it stitches its genome back together and starts replicating again. But don't get too excited when you hear NASA announce that they discovered this in outer space, because every time a shuttle goes up, every time they flush the commode on the space station, billions of copies of this and other microbes get launched into outer space. So if it was once sterile, it is not now. I doubt that it was ever sterile. NASA now has decided that they want to coat the outside of a space shuttle with Deinococcus to see if it can survive going out and coming back. They don't realize it is a little late for that experiment.
So I think we are just barely scratching the surface of what is out there in the environment. Each one of these genomes provides a vast array of new unknown genes and just from the work of TIGR alone, there are now over 100,000 new unknown genes in GenBank. We don't have the slightest idea what role in biology they play.
TIGR was also the lead institute on sequencing the first plant genome, Arabidopsis. In terms of insects, we sequenced the Drosophila genome at Celera. We are now sequencing the mosquito genome, the key vector for malaria and other diseases, and I will be telling you a little bit about human and mouse.