An international team of scientists has published the first complete, gap-free sequence of the human genome. The new reference genome adds almost 200 million base pairs to earlier drafts, filling in crucial gaps in a way that should improve studies of disease and evolution.
The Human Genome Project has been in the works for decades, famously publishing its first draft in 2000, and a “complete” genome in 2003. But this only included the euchromatic regions, comprising around 92 percent of the total genome – the other regions, known as heterochromatic, reside in the tips (telomeres) and centers (centromeres) of the chromosomes, and were deemed too difficult and costly to sequence at the time.
Now, with an extra two decades of work and technological advancements, the entire human genome of around 3 billion bases has finally been sequenced with no gaps, thanks to an international team of scientists known as the Telomere-to-Telomere (T2T) Consortium. The new reference genome, designated T2T-CHM13, adds almost 200 million base pairs of previously unknown DNA sequence. Within that new sequence are 99 genes that seem to code for proteins and around 2,000 candidate genes that will need to be examined more closely. The genome also corrects thousands of structural errors present in earlier versions.
Ironically, the team says, the last 8 percent took twice as long to sequence as the first 92 percent. That’s because the heterochromatic regions are made up of long stretches of repeated sequence, which makes them difficult to piece together accurately. If a genome is a jigsaw puzzle, these sections are the plain-color background pieces that could fit together any number of ways.
To crack the code, the T2T Consortium used a few new tools to read longer sequences. After all, it’s much easier to complete the puzzle when working with fewer, larger pieces rather than a huge number of tiny ones. The researchers examined the genome using the Oxford Nanopore DNA sequencing method, which can read up to a million DNA letters in a single read with a modest degree of accuracy. Another method, known as PacBio HiFi, can read about 20,000 letters at once with almost perfect accuracy, so combining the two is an effective way to sequence the complete human genome.
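The jigsaw problem can be shown with a toy example. The sketch below is purely illustrative and is not the T2T Consortium's assembly method: the sequences are invented, the read lengths of 10 and 45 letters are small stand-ins for real short and long reads, and it ignores sequencing errors and read depth. It builds two made-up "genomes" that differ only in how many copies of a repeat they contain, then compares every possible error-free read from each.

```python
def reads(genome: str, length: int) -> set:
    """Return every distinct error-free read of the given length."""
    return {genome[i:i + length] for i in range(len(genome) - length + 1)}


def toy_genome(copies: int) -> str:
    """Hypothetical sequence: unique flanks around a tandem repeat."""
    return "ACGTTGCA" + "GATTACA" * copies + "TTCCGGAA"


five_copies = toy_genome(5)   # 51 letters
eight_copies = toy_genome(8)  # 72 letters

# Reads shorter than the repeat array look the same for both genomes,
# so they cannot reveal how many copies of the repeat are present.
short_same = reads(five_copies, 10) == reads(eight_copies, 10)

# Reads long enough to span the smaller array and touch both flanks
# differ between the two genomes, which resolves the ambiguity.
long_same = reads(five_copies, 45) == reads(eight_copies, 45)

print(short_same)  # True
print(long_same)   # False
```

With 10-letter reads the two genomes produce identical sets of pieces, like plain-color puzzle pieces that fit anywhere. Only a read that spans the whole repeat and reaches the unique sequence on both sides pins down the repeat's true length.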
Another part of the problem, the team says, was that the previous reference genome was stitched together from multiple individuals, creating seams in the model. The new version removes those, creating a model that’s more representative of what an individual person’s genome would look like.
The newly complete human genome will inform a wide range of future studies, including research into genetic markers of disease. Cancer in particular can arise from abnormalities in the centromere, the team says, so a better understanding of this region of the genome could lead to new types of therapies.
“Generating a truly complete human genome sequence represents an incredible scientific achievement, providing the first comprehensive view of our DNA blueprint,” said Eric Green, M.D., Ph.D., director of the National Human Genome Research Institute. “This foundational information will strengthen the many ongoing efforts to understand all the functional nuances of the human genome, which in turn will empower genetic studies of human disease.”
And this is just the first complete human genome – the next steps are to create a human pangenome reference of 350 individuals, to capture more of the natural variety from person to person.
The breakthrough was described in a series of six papers published in the journal Science, along with companion papers appearing in other journals.
Sources: National Human Genome Research Institute, Rockefeller University, UC Davis, UC Santa Cruz