Sequencing the Human Genome: Transcript Part 3
The Birth of Celera:
As soon as we finished Drosophila we started human. We selected from 21 donors. We made sure that they had at least a complete set of chromosomes...we thought that was reasonable...and that they didn't have AIDS or hepatitis to infect the laboratory workers who were working with the samples.
We chose five people for sequencing. We chose three females and two males. We wanted some ethnic diversity so we chose people [who] were self-proclaimed African American, Hispanic, Chinese, and Caucasian. We started in September of 1999 and 9 months later we had the genome covered 39 times in these paired sequence scaffolds.
As people probably know, on June 26, 2000, we had a nice announcement, along with Francis Collins, at the White House, that we had finished the first mathematical assembly, and earlier this year it was published in Science. This was a fun event. There are a lot of interesting stories that go into this.
Actually this is me announcing my candidacy for public office. This is my out-of-work vice president. But this was exciting for everybody involved. It was exciting for more reasons than people know. The event was scheduled around when the assembly was supposed to finish. The public project didn't have anything to announce. We just agreed that whenever we were finished with the assembly we would make a joint announcement. The assembly had not finished the day before but the White House said they couldn't change the schedule, we had to go ahead anyway. So we kept getting nervous calls from the White House as to whether it was "soup yet". Finally late the day before, Gene Myers' team finished the first assembly.
I was writing my own speech. Everybody else had speech writers. I had a copy of the President's speech. I had a copy of Prime Minister Blair's speech, I had a copy of Francis Collins' speech. They didn't have a copy of mine. That made them even more nervous than not having the sequence finished because it was only a few months before, where statements driven by Francis Collins and the Wellcome Trust got Prime Minister Blair and President Clinton to make what they thought was a harmless statement that crashed the NASDAQ stock market and took 10 billion dollars of valuation away from our young company. So he thought I might try and get even at his House and he was very pleased that we didn't.
But this was an exciting event because I think we were told that this is the first scientific announcement that was made from the White House, and certainly the first one on live international television. And in February this was published in Science, and the public effort published their analysis in Nature. At the press conferences when I saw the cover for the first time and I asked Don Kennedy whether they put these...you can see we sequenced five people but there are six on the cover...and I asked him if he put the baby on the cover in honor of the nine months that it took to sequence the genome and now he is sticking to that story. But we were very pleased to see this cover.
Now if we do the same experiment with the human chromosome assemblies...now here is 2.9 billion letters of genetic code, across this algorithm. Here are the chromosome numbers down here. So this piece here is human chromosome one. We compared it to all the human STS map markers and we were actually pleasantly surprised how few didn't agree. About 5% of them didn't agree. For those of you who know the human genetics community, quite often they can't even agree which chromosome some of these markers go on because of all the repeats and other things, so we thought this was wonderful.
Annotating the Human Genome:
As we try to annotate the genes, the order and orientation is turning out to be a very key issue and lots of groups have papers pending on this...about the differences between the public project and the Celera project. Because we have these clones in the right order, we were able for the most part to define most of the proper structure of the genes. It is still a work in progress and probably will be for the next several decades.
The other thing we found out, and I will show you in a minute, we sequenced the mouse genome. When we did the first assembly, not only did we include all the data from Celera but we included all the data, all the human gene data, from GenBank, including from the public project. We were worried this might have an effect on the algorithm, but they were such large experiments at 20,000 CPU hours that we couldn't do it multiple times in the time frame we had. But when we went back and assembled the mouse genome with just the Celera data we found out how much better it worked, so we have recently gone back and reassembled the human genome with just the Celera paired end data, and that is this green line.
It is hard for people to read these graphs, but basically you can look at any number...for example, at 40 or 50% of the genome, what are the average sized pieces? So now with this new assembly, the average sized pieces are around 25 million letters of genetic code. With the public effort they are around 11,000, and in our first assembly they were on the order of 2 to 3 million. So huge differences. We are still learning the mathematics. We are still learning the assembly but the processes have great news for the future genomes that we are doing.
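[Editor's sketch: the graph described above appears to be read by picking a fraction of the genome and asking how large the assembled pieces are at that point. A common way to compute such a number is an N50-style statistic; the function name and the toy lengths below are assumptions for illustration and are not Celera's code or data.]

```python
def length_at_fraction(lengths, fraction=0.5):
    """Return the piece length L such that pieces of length >= L,
    taken from largest to smallest, cover at least `fraction`
    of all assembled bases (fraction=0.5 is the usual N50)."""
    if not lengths:
        raise ValueError("need at least one piece length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    target = fraction * sum(lengths)
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return min(lengths)  # only reached through floating-point rounding


# Hypothetical toy assembly: nine pieces totalling 54 letters.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
half_point = length_at_fraction(toy_lengths, 0.5)
```

Reading the statistic at 40% or 50%, as in the talk, only changes the `fraction` argument; a larger value at the same fraction means the assembly is made of longer pieces.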
As people have undoubtedly heard, we found far fewer [genes] than most people expected. Because the algorithms can come up with any number you want, we chose an evidence-based method. The reason they can come up with any number you want is that only 1.1% of those three billion letters in each of your haploid genomes codes for proteins. So in the other 98.9% it is very easy, with a four letter code, to have things in a random order that actually look like they might form a gene. So we took things with two or more lines of evidence. We found 26,000 of those and we described another 11,000 hypothetical genes just from ____ predictions. We think most of those 11,000 are not real. They have totally different chemical composition than the real ones, but we don't know what the final number is going to be. All we know is it is going to be far less than the 140,000 estimates that were coming out of a lot of places.
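[Editor's sketch: a rough way to see why random four-letter sequence can look gene-like. It assumes uniformly random, independent bases and the standard genetic code with 3 stop codons out of 64, which real genomes do not obey; the function names are hypothetical and this is not the prediction method used in the talk.]

```python
STOP_CODONS = 3     # TAA, TAG, TGA in the standard genetic code
TOTAL_CODONS = 64   # 4 letters ** 3 positions


def prob_no_stop(n_codons):
    """Chance that n_codons consecutive random codons contain no stop."""
    if n_codons < 0:
        raise ValueError("n_codons must be non-negative")
    return ((TOTAL_CODONS - STOP_CODONS) / TOTAL_CODONS) ** n_codons


def expected_open_windows(sequence_length, n_codons):
    """Expected number of start positions (one strand, all frames)
    followed by n_codons stop-free codons in random sequence."""
    starts = max(sequence_length - 3 * n_codons + 1, 0)
    return starts * prob_no_stop(n_codons)


# Under these assumptions a 100-codon stretch is stop-free a bit
# under 1% of the time.
p_100 = prob_no_stop(100)
```

Even a small per-position probability, multiplied across billions of letters, leaves a very large number of stretches that look open by chance, which is consistent with the speaker's decision to require two or more lines of evidence.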
One of the places we start to differ with some of the people in the public program and the things that have been put out on this-- is our genome, the genetic code...I don't think is the "Book of Life." It is not the "Blueprint of Humanity". I personally don't think it is the "Language of God", and it certainly is not a "Parts List of Humanity". If you want to use book analogies, at best it is an index or a table of contents to a vast encyclopedia of information.
And most biology takes place beyond the genetic code. Now I think one of the fun ironies of all this is the largest calculation that has ever been done in biology and medicine. We can only show the results by putting them on a large sheet of paper. That is why we had the foldout in Science. This is just dealing with a few inches of it...you can see the richness of details...but it shows some of the key features. If we tried to show the whole map, the genes would be smaller than single pixels, being only 1% of the genome. There are parts of our chromosomes that the gene density is so high that we had to annotate them and show them on both sides of the chromosomes. There are other areas...and here is one right here...that we call a desert...millions of letters of genetic code with few or no genes in it. I think one of the big surprises was how few areas there were of these really high densities and how many areas there were of these deserts.
I will give you some ideas in a minute how some of these very high density regions formed, but early on, in the early 90s, I published a paper with some colleagues saying [that] we estimated there were 50 to 80,000 human genes. That was based on the first 2 regions that we sequenced while we were at NIH and they happened to be regions with this kind of density. Even though we tried to extrapolate from those and subtract out knowing there were lower density regions, we clearly know now why our estimates were way too high.
One of the things that I am challenging scientists to do everywhere is to look at the protein or sets of genes that you have been working on. This is what Claire and I spent 10 years of our careers trying to isolate and purify. You can see how far we had to zoom in to find it.
This is the adrenaline receptor. It gets lost in all the noise when we look back at the other sheets. But in the 1970's when I was getting my PhD, most of the scientific literature was about how adrenaline and other receptors stimulated adenyl cyclase to form cyclic AMP. And this was the explanation of most regulatory biology and most diseases for over a decade. Tens of thousands of papers published ignoring all the rest of the genome because while we had ideas that it was there, nobody had seen this gene, or this gene. This was an interesting pathway that did in fact explain a lot, but nowhere near what people tried to have it do.
Now we have to go back and try to find one of those genes in the context of all this information. That is the challenge going forward for biology, whether it is a microbial genome, a viral genome, the human genome, the Drosophila genome, any of the marine species, it doesn't matter. We have to understand how this information works together because it doesn't work one gene at a time.
Now when we look at that set of genes, it is roughly 26,000 genes, that is twice as many as a fruit fly. I was told when we announced this that some people were very upset and they felt diminished by it. I got a lot of angry calls from the Drosophila community as well. There were a lot of estimations that the human genome would be 4 fruit fly equivalents. It ended up being 2, in terms of number. So a reasonable question to ask, do we have just two of everything that a fruit fly has, except wings? And the answer is, no.
Fruit Fly to Human:
So it gives us a wonderful chance to look at our own evolution and look at vertebrate evolution. When we look at Drosophila going forward to human, we are talking about maybe 600 million years, and we see 4 or 5 major gene groups that expanded during this period. For example, we have an immune system; fruit flies don't have much of an immune system. All the genes associated with our immunity expanded during this time period.
When you think of hemostasis...all the things associated with our vascular system. Most of those genes develop during this period. Here is signal transduction. Cell to cell communication...huge expansion. We would like to think we are smarter than fruit flies and for the most part we are, but I think the most interesting category is the nucleic acid binding proteins. Particularly the transcription factors. And you will see in a little bit, we basically have the same gene sets as each other. In fact we have virtually identical gene sets and spelling of those.
But we have virtually identical gene sets with mice and other vertebrates. So it can't just be the definition of genes. The key things that lead to the uniqueness of our biology are things like the transcription factors in gene regulation, that turn on different sets in response to environmental conditions. You can see even out of the 26,000 gene sets, about 42% of those are unknown genes. Many of these are unknown in terms of their function. We know it is a 7 trans-membrane receptor, we have no idea what role it actually plays.
The Mouse Genome and Human Genome:
We recently finished sequencing, assembling and annotating the mouse genome. We actually chose three different inbred strains so that we would not only be able to link back to the genetics that had been done with these strains, but have polymorphisms by comparisons. We scaled up a little bit more. We sequenced these 29 million sequences in six months. It took a short period of time to assemble the genome and as I said the first annotation has just been done.
Most people don't realize all the different mammals basically had the same large blocks of chromosome material that just have slight rearrangements from species to species. For example, if we look at human chromosome 21, the Down Syndrome region, we find all the genes in the same exact order as we find on the tip of mouse chromosome 16. Many people don't realize that people with Down Syndrome usually die by the age of 30 or 35, and the ones that live beyond that age, virtually 100% of them get Alzheimer's disease. And so people got very excited when they found the amyloid precursor in this region and now it has really enhanced the use of some mouse models to study Alzheimer's disease.
For the first time now, we can do a mathematical comparison of the mouse chromosomes and human chromosomes. With these mathematical comparisons, when you see identity, it shows up on the diagonal. You can see this is not a solid line. It is punctate and as we zoom way in, we find these regions of virtual exact sequence. This is filtered at a 70% match over a 50 base pair window. It turns out these are the exons for one particular gene. One thing I didn't tell you...the mouse genome is about 10 to 15% smaller than the human genome, and we find smaller introns and smaller ___ regions. Here is where the introns were, all the same size except this last one, but we find these blocks of identity that help us tremendously.
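[Editor's sketch: the diagonal comparison with a "70% match over a 50 base pair window" filter reads like a windowed dot plot. The brute-force function below is an illustration under that assumption, with a hypothetical name and toy sequences; it is far too slow for whole chromosomes and is not the software used for the slides.]

```python
def dot_plot_hits(seq_a, seq_b, window=50, min_identity=0.70):
    """Return (i, j) pairs where the window starting at position i in
    seq_a and the window starting at j in seq_b agree at no fewer than
    min_identity * window positions. Runs of hits with i - j constant
    form the diagonals seen on a dot plot."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if matches >= min_identity * window:
                hits.append((i, j))
    return hits


# Toy sequences (hypothetical): the second differs by one letter.
first = "ACGGTCA"
second = "ACGATCA"
loose = dot_plot_hits(first, second, window=4, min_identity=0.75)
strict = dot_plot_hits(first, second, window=4, min_identity=1.0)
```

With the looser threshold the main diagonal survives the single mismatch, while the exact-match setting finds nothing, which is why a partial-identity filter suits a comparison between species whose exons have diverged slightly.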
This is the latest data we have, and it is absolutely stunning to us when we look at it. The center here is mouse chromosome 16. So here is that region I just showed you on human chromosome 21. You can see basically all the genes in the same order...and this is looking at the other parts of other human chromosomes that make up the mouse chromosome. So, large blocks transferred basically completely together.
You can see here there are some flips and some rearrangements. Here is another large block here from human chromosome 16 to mouse 16. I think the most interesting things are these individual lines, and they are going to be a real challenge to explain but basically this is a map of...and we are now developing these for every mouse and human chromosome...of the period of the last 100 million years of evolution.
Here is a different way of looking at that same data in a more linear fashion. Just looking at the blocks from here is the mouse chromosome scaffold, here is the human chromosome scaffold. You can see there are regions in the mouse genome and the human genome where these do not line up. I am not saying that every mouse gene right now has a human counterpart or vice versa. We actually do not have good enough definition of either the mouse genes or the human genes to do that, but we are largely genetically identical.
Now this looks like a board game. We won't play this, but what this allows us to do is look at the gene number that goes from the human chromosome to the mouse chromosome in each block. We can look outside the gene regions and we find these other blocks of sequence between mouse and humans that in these we find specifically a greater chance of finding the regulatory regions. We do not have good enough computer algorithms to predict where regulatory regions are and we need these comparative genomic studies.
We get asked all the time, are we going to sequence the chimpanzee genome? And while it would be fun to do, I have argued that it is not going to be very useful at this stage of biology. Here is some data that Svante Pääbo at the Max Planck Institute generated using our human genome data. He randomly sequenced 10,000 chimpanzee genome sequences and compared them back to our data and found an average difference of only 1.27%. You can see it varies a little bit from chromosome to chromosome. On the X chromosome, which females have two of and males have only one of, there is only a 0.9% difference between chimp and human. That is across the 75% of the chromosome that is intergenic DNA, the 24% of the chromosome that is intron and the 1.1% that is exon. Those of you who looked at the genome map that we published in Science, down at the bottom there was a small, somewhat pathetic chromosome with very few genes on it. Apparently it can vary widely between humans and chimps with no difference in behavior whatsoever. But half of you already knew that.
Now if we look at one of those exons, here is the chimpanzee histamine H1 receptor. These red base pairs are the actual few differences between human and chimpanzee in a coding region. Only one of these actually changes the amino acid sequence. The rest are totally silent. Now I am not proposing that this gene is the key link between chimpanzees and biogenetics researchers, but it shows you how similar it is. In fact, the chimpanzee genome just looks like a human genome with a slightly higher polymorphism rate. So when we do comparative genomics and we have blocks of chimpanzee sequence, it is just essentially identical across the whole region. It doesn't help us interpret the human genetic code, whereas mouse, at 100 million years difference, [does]. We have sequenced with TIGR a major portion of that important species, the Standard Poodle genome, and we are working on others. We thought we would start at the top of the evolutionary tree and work down.
Now those of you who looked at the Science paper, I am certain that none of you read or understood this figure, even though I think it is the most important figure in the paper. Let me explain briefly how this was generated.
For each chromosome, because we had the genes in the right order and orientation, we pulled them out of each chromosome and put them together in the same order they were in on that chromosome and then asked the question, "Do we find those same genes in that same order anywhere else in the human genome?"
Every place we found three or more genes duplicated in that same order, we drew one of these colored lines. We found blocks of hundreds of genes. At the bottom...it is probably hard to see...this is half of chromosome 20 that got duplicated at some early stage and became all of chromosome 18 with just a few rearrangements. Four evolutionary events would explain the differences between that half of 20 and all of 18.
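[Editor's sketch: the search described above, for three or more genes appearing in the same order somewhere else, can be illustrated with a simple scan for shared runs. This version assumes each gene is reduced to a family label, only finds runs with no gaps or inversions, and compares two gene lists at a time; the function name and the gene labels are hypothetical and this is not the method used for the published figure.]

```python
def collinear_blocks(genes_a, genes_b, min_genes=3):
    """Find maximal runs of consecutive genes that occur in the same
    order in both lists. Returns (start_a, start_b, run_length) tuples
    for runs of at least min_genes genes."""
    blocks = []
    for i in range(len(genes_a)):
        for j in range(len(genes_b)):
            if genes_a[i] != genes_b[j]:
                continue
            # Skip positions that sit inside a run that started earlier.
            if i > 0 and j > 0 and genes_a[i - 1] == genes_b[j - 1]:
                continue
            run = 0
            while (i + run < len(genes_a) and j + run < len(genes_b)
                   and genes_a[i + run] == genes_b[j + run]):
                run += 1
            if run >= min_genes:
                blocks.append((i, j, run))
    return blocks


# Hypothetical gene-family orders on two chromosomes.
chrom_x = ["kinase", "receptor", "zincfinger", "transporter", "protease"]
chrom_y = ["histone", "receptor", "zincfinger", "transporter", "actin"]
shared = collinear_blocks(chrom_x, chrom_y)
```

Lowering `min_genes` to 2 makes the scan report many more short runs, which is in line with the remark that a two-gene threshold would turn the figure into a solid ball.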
This didn't happen during human existence. In fact, none of those lines occurred during our existence. They pre-dated us, but it is the same sets of genes with largely, as far as we can tell, the same functions duplicated on these other chromosomes. If one of these is associated with a disease, more often than not its counterpart on the other chromosome is associated with the disease.
And this explains why people have been so confounded thinking they have a disease gene, thinking they have a target for a pharmaceutical, that ended up not being correct. When we look at all the chromosomes put together around the circle, every place there is one of these lines is where sets of three or more genes were duplicated. If we chose two or more genes, this would be a solid ball. The highest gene density chromosome we have is chromosome 19. What is on chromosome 19? Why is it the highest density, and all these red lines emanating from it showed duplications in and out of 19?
It is loaded with neurotransmitter receptors, olfactory receptors and transcription factors. Some of the categories, I told you, had the greatest expansion over the last 600 million years. They had that expansion. You are seeing that expansion. This is our recorded history. In the future, we will be able to provide a date and in some cases a species, where this evolutionary event took place.
We are now generating these same maps from the mouse genome and we will be able to work out which ones of these took place in the last 100 million years. But these are not all recent events.
If we go to C. elegans [Caenorhabditis elegans], here are gene duplications, chromosome duplications in the C. elegans genome. If we look and track these same duplications forward to the human genome, we find them maintained and carried forward in our own chromosomes. For the first time we will have the chance to de-convolute our evolutionary history, and many of the evolutionary events that took place. To me that is one of the most exciting things out of the genome sequencing to date.
With the whole genome shotgun method, we got all the variants in the individuals that we sequenced--one primarily, that we did most of the sequencing from. Each of us has roughly 2 to 3 million letters of genetic code that are different from the person sitting next to you... different from mine. But one thing we realized after we annotated the genome [was] that most of these have probably no biological significance whatsoever, other than having forensic applications and tracking applications, and maybe being useful as mapping tools. Less than 1% of these occur in genes or regulatory regions. With the one individual that we did most of the coverage on, we looked in all the genes to see how many of these single letter changes actually changed the protein structure, and we found less than 2,000. That means at the biological level, at the gene level, we are all virtually identical twins. The fraction that is really different in a biologically meaningful way is such a tiny percentage it is stunning. I think that helps, along with a small number of genes and the similarity to mouse, hopefully put the nail in the coffin of the genetic determinists. Maybe literally in some cases.