New software developed at Nationwide Children's Hospital in Ohio can take raw sequence data on a person's genome and search it for disease-causing variations in a matter of hours, a speed its creators claim makes it the fastest genome analysis software around. They believe this now makes it feasible to do large-scale analysis across entire populations.
Whereas it took 13 years and cost US$3 billion to sequence a human genome for the first time, senior author Peter White notes that now "even the smallest research groups can complete genomic sequencing in a matter of days." The chokepoint lies in the next step: calibrating and analyzing the billions of data points generated, in search of genetic variants that could lead to disease.
White and his team tackled the problem by automating the analytical process in a computational pipeline they called Churchill. Churchill spreads each analysis step across multiple computing instances – a process its creators call balanced regional parallelization – with special care taken to preserve data integrity so that results are "100 per cent reproducible."
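The article doesn't include Churchill's implementation, but the core idea – splitting the genome into regions, processing them in parallel, and merging results in a fixed order so output is reproducible – can be sketched in a few lines of Python. The region list, the call_variants stub, and the merge step below are illustrative placeholders, not the pipeline's actual code:

```python
from multiprocessing import Pool

# Hypothetical region list: Churchill's actual subdivision balances load
# across many more regions; these coordinates are placeholders.
REGIONS = [
    "chr1:1-50000000",
    "chr1:50000001-100000000",
    "chr2:1-50000000",
    "chr2:50000001-100000000",
]

def call_variants(region):
    """Stand-in for one per-region pipeline step (alignment refinement,
    variant calling, etc.); returns the region and its (dummy) result."""
    return region, f"variants for {region}"

if __name__ == "__main__":
    with Pool() as pool:
        # Workers may finish in any order...
        results = dict(pool.imap_unordered(call_variants, REGIONS))
    # ...but merging in the fixed, predefined region order keeps the
    # final output identical from run to run.
    merged = [results[r] for r in REGIONS]
    print(merged)
```

The merge step is what makes the parallelism deterministic: no matter how the operating system schedules the workers, the assembled result is always in the same region order.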
Tests showed that Churchill can analyze a whole genome sequence in as little as 90 minutes, from raw reads in the text-based FASTQ format through to high-confidence variant calls. An exome, which contains the bulk of disease-causing variants despite making up a mere one per cent of the whole genome, can be analyzed in less than an hour. Churchill's performance was validated against the National Institute of Standards and Technology's benchmarks, scoring 99.7 per cent on sensitivity, 99.99 per cent on accuracy, and 99.66 per cent on diagnostic effectiveness.
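The article doesn't spell out how those scores are computed, but variant-caller benchmarks of this kind conventionally compare the pipeline's calls against a trusted truth set of known variants and count agreements and disagreements. A minimal sketch of that comparison, using standard confusion-matrix definitions and made-up variants:

```python
def benchmark(calls, truth):
    """Score a set of variant calls against a truth set of known variants."""
    tp = len(calls & truth)   # true positives: called and in the truth set
    fp = len(calls - truth)   # false positives: called but not real
    fn = len(truth - calls)   # false negatives: real but missed
    sensitivity = tp / (tp + fn)   # share of true variants recovered
    precision = tp / (tp + fp)     # share of calls that are correct
    return sensitivity, precision

# Toy variants encoded as "chromosome:position:ref>alt" strings.
calls = {"chr1:12345:A>G", "chr2:555:G>A", "chr3:42:T>C"}
truth = {"chr1:12345:A>G", "chr2:555:G>A", "chr4:7:C>T"}
print(benchmark(calls, truth))  # (0.666..., 0.666...)
```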
While the goal of the research was to create an ultra-fast analysis pipeline, White and his team found an unexpected benefit. Churchill scales efficiently across many servers, which makes it possible to perform population-scale analysis.
They took the first phase of the raw data generated by the 1000 Genomes Project – an international research collaboration started in 2008 to establish an extensive public catalog of human genetic variation across the globe – and set Churchill to work on all 1,088 whole-genome samples across a cluster of computers in Amazon Web Services' Elastic Compute Cloud. Churchill averaged a mere nine minutes per genome over its week-long analysis, which the researchers note compares favorably to a similar analysis performed in 2013 on a Cray XE6 supercomputer.
The Cray supercomputer test analyzed 61 whole genomes in two days, at an average of 50 minutes per genome – around five times longer than Churchill required in its cloud test.
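Those figures are easy to check from the totals the article reports; a quick back-of-the-envelope calculation:

```python
# Cray XE6 run: 61 whole genomes in two days.
cray_min_per_genome = 2 * 24 * 60 / 61   # ≈ 47 min, "around 50" as quoted
# Churchill cloud run: reported average of nine minutes per genome.
churchill_min_per_genome = 9
print(cray_min_per_genome / churchill_min_per_genome)  # ≈ 5.2, i.e. ~5x
```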
"Given that several population-scale genomic studies are underway, we believe that Churchill may be an optimal approach to tackle the data analysis challenges these studies are presenting," White says.
The Churchill algorithm has been licensed to a company called GenomeNext, which adapted the technology for use in a commercial setting. People can get their genome sequenced in a local lab or clinic and then upload the raw data to the GenomeNext system for analysis.
A paper describing the Churchill algorithms and research was published in the journal Genome Biology. The Churchill software is also available, for research purposes only, via its project page.
Source: Nationwide Children's Hospital.