Assoc. Public Health Prof. Aakrosh Ratan and Public Health Prof. Stephen Rich helped develop a new genome sequencing tool in collaboration with other researchers. The new tool, called Giraffe, is the first step toward making the original reference genome from the Human Genome Project much more robust.
A genome is the complete code of our DNA that specifies all of our traits as humans and as individuals. It encodes our visible traits, predisposes us to genetic diseases and in some cases causes disease. While the code is written with just four chemical compounds, the genome is approximately 3 billion base pairs long, stretched across 23 different chromosomes. With such a vast volume of data, scientists in the past struggled to map the complete sequence of the human genome.
Toward this end, Congress initiated the Human Genome Project in 1990. The project concluded over a decade later in April 2003 and cost $2.7 billion. With its completion, scientists created the first reference genome that was freely accessible to all.
The completion of this project was a massive scientific leap forward, but the original sequence had several limitations. The reference genome provided only a single sequence for comparison, so it did not account for genetic diversity between individuals.
Moreover, any analysis using this genome would be biased against genomes that did not follow the exact pattern of the original genome. This bias can make it hard to tell the differences that matter apart from the normal variability found among people of diverse backgrounds.
While genomes within a species such as humans have a conserved component, there is a highly variable portion of the genome as well. In order to capture this variability, researchers need to add more individuals to the reference genome.
In response to this issue, University researchers helped develop Giraffe, a tool that aligns DNA fragments against a pangenome, a reference built from the genomes of many individuals. The pangenome captures this variability and can serve as a point of comparison between unique genomes to reduce bias.
“The first step [in sequencing] is to figure out where the [DNA] fragment is coming from,” Ratan said. “And in order to do that, we use what's called an aligner. So we take the fragment and align it against that reference genome. So the reference genome was derived from only a few individuals. So you can imagine some of the biases that come into play as a consequence of that.”
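The alignment step Ratan describes can be illustrated with a toy sketch. It slides a short fragment along a reference and counts mismatches, then repeats the search against a second, made-up sequence that carries one variant. The sequences, names and scoring are all assumptions for illustration, and this is not how Giraffe itself is implemented.

```python
def best_alignment(read, reference):
    """Slide the read along the reference; return (position, mismatches) of the best fit."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best


def best_over_references(read, references):
    """Align against several sequences and keep the one with the fewest mismatches."""
    results = {name: best_alignment(read, seq) for name, seq in references.items()}
    name = min(results, key=lambda n: results[n][1])
    return name, results[name]


# Hypothetical sequences: "alternate" differs from "single" at one base (index 8).
single_reference = "ACGTTGCAAGCTTAGGCTA"
alternate = "ACGTTGCATGCTTAGGCTA"

# A fragment from a person who carries the alternate version.
read = "GCATGCTTAG"

print(best_alignment(read, single_reference))  # one mismatch against the lone reference
print(best_over_references(read, {"single": single_reference, "alternate": alternate}))
```

Against the lone reference, the fragment can only be placed with a mismatch, which an analysis might read as an error or an abnormality. Once the second sequence is available, the same fragment matches perfectly.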
Adding more data from more diverse individuals to the original genome also helps to reduce some of the bias found in the original European-Caucasian-centered genome. The original genome was sourced out of convenience from people who lived near the original research facilities. In fact, over 70 percent of the original genome is from a single individual.
Giraffe curtails the issues inherent to a one-dimensional reference genome. For example, if a section of the genome is flipped relative to the reference, you cannot say whether it is the reference or the analyzed genome that is abnormal. With more individuals, we can identify not just single changes in code but much larger structural variations across genomes.
These structural changes in genome sequence can take many forms, commonly inversions and translocations. Inversions happen when parts of the genome are reversed. Translocations happen when a portion of the genome is moved from one location to another. With only a single genome as a reference, these differences are difficult to identify.
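As a rough illustration of these two kinds of structural change, the sketch below applies each to a short, made-up DNA string. It treats an inversion as a segment that is reversed and complemented in place, and a translocation as a segment cut out and re-inserted elsewhere. The function names and the example sequence are assumptions, not part of any real tool.

```python
# Map each base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert(seq, start, end):
    """Return seq with seq[start:end] replaced by its reverse complement."""
    segment = seq[start:end]
    flipped = segment.translate(COMPLEMENT)[::-1]
    return seq[:start] + flipped + seq[end:]


def translocate(seq, start, end, dest):
    """Cut seq[start:end] out and re-insert it at index dest of what remains."""
    segment = seq[start:end]
    remainder = seq[:start] + seq[end:]
    return remainder[:dest] + segment + remainder[dest:]


genome = "AAACCGTTT"  # hypothetical sequence
print(invert(genome, 3, 6))          # the middle segment is flipped
print(translocate(genome, 3, 6, 6))  # the middle segment moves to the end
```

In both cases every base is still present, so a comparison against a single reference shows only that the two sequences disagree, not which arrangement is the common one.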
“We are correlating these differences to understand if some of them might be playing a role in a disease,” Ratan said. “And if those studies are biased, then you can understand how we would miss certain findings, or we would mischaracterize certain findings.”
A more global and deeper representation of genetic diversity helps to prevent such issues. One common concern when increasing the amount of data referenced is the increase in computing power and processing time that often goes along with it.
These researchers, led by Jouni Sirén from the University of California, Santa Cruz, figured out how to avoid such an issue. They were able to keep the operating costs down to only $1.50 per person while keeping processing times the same.
“[Giraffe] allows us to make the reference be derived from thousands of individuals, and we can still do things at the same speed we were doing,” Ratan said. “So now you can have a more inclusive reference, but that does not translate into a huge burden on time and computational power.”
At this cost, they genotyped 167,000 structural variations from 5,202 people. With this data, they estimated the frequency of these structural variations in subpopulations as well as in the human population as a whole. But this was just the first step toward a much larger goal.
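A frequency estimate of this kind can be sketched in a few lines. The example below assumes each person's genotype for one variant is recorded as 0, 1 or 2 copies, and divides the copies found by the two copies each person could carry. The group names and genotype values are invented for illustration and are not the study's data or method.

```python
def allele_frequency(genotypes):
    """Fraction of chromosomes carrying the variant, given 0/1/2 copies per person."""
    if not genotypes:
        raise ValueError("need at least one genotype")
    return sum(genotypes) / (2 * len(genotypes))


# Hypothetical genotypes for one structural variation in two subpopulations.
genotypes_by_group = {
    "group_a": [0, 1, 2, 0],
    "group_b": [0, 0, 1, 0],
}

per_group = {name: allele_frequency(g) for name, g in genotypes_by_group.items()}
everyone = [copies for g in genotypes_by_group.values() for copies in g]
overall = allele_frequency(everyone)

print(per_group)
print(overall)
```

Comparing the per-group values with the overall value shows how a variation that is rare in one subpopulation can be common in another.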
“Creating a pangenome reference is a bigger task than our work,” Sirén said in an email statement to The Cavalier Daily. “We participate in the Human Pangenome Reference Consortium that aims to create a human reference genome from hundreds of high-quality assemblies of individual genomes.”
The researchers are progressing toward the goal of multiple diverse reference genomes. But technology must be iterated on and improved over time. This new technology builds on past tools and provides distinct benefits, such as increased accuracy and reduced bias.
“Giraffe is a short read aligner intended primarily for aligning reads produced by Illumina sequencing machines,” Sirén said. “Long reads produced by PacBio / Oxford Nanopore machines give better access to the repetitive regions of the genome, but their high error rates have made aligning them more difficult.”
Other genome sequencing technologies produce long reads. Their high error rates have made alignment difficult, but they give better access to repetitive regions of the genome. Giraffe, by contrast, is a short-read aligner, so it works with reads that have a drastically lower error rate but cannot reach those repetitive regions as well.
The researchers are currently refining the tool to take advantage of the accessibility benefits of long reads. They hope to extend it to newer long-read techniques with lower error rates. Giraffe hopes to stretch its neck above other genome sequencing techniques and revolutionize our approach to genomics.