Adding new letters to DNA alphabet doubles density of data storage

A new breakthrough could make data storage in DNA much more feasible

As with most things, nature’s data storage system, DNA, far surpasses anything we’ve created. Now, researchers at the University of Illinois Urbana-Champaign have doubled its already incredible storage capacity by adding extra letters to its “alphabet,” and developed a new way to read it back.

DNA is naturally made up of combinations of four nucleobases: adenine, guanine, cytosine and thymine. Represented by the letters A, G, C and T, these bases group together in different sequences to form blueprints for every living organism. And this information storage system is incredibly dense, with a single gram of DNA capable of storing up to 215 petabytes (215 million GB) of data.

That of course makes it a very attractive potential storage solution for the huge amounts of data modern society produces daily – the entire contents of the internet could fit in a shoebox full of DNA. And as if that storage wasn’t dense enough, the researchers on the new study have found a way to double it.

Along with the usual A, G, C and T, the team effectively added an extra seven “letters” to the DNA alphabet. These take the form of chemically modified nucleotides, opening up more varied combinations that allow more information to be stored within the same amount of physical space.
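As a rough back-of-the-envelope check on what a larger alphabet buys, the sketch below computes the theoretical upper bound on information per letter. It assumes, for illustration only, that every one of the 11 letters can appear at any position with equal ease; the function name `bits_per_symbol` is hypothetical and not taken from the study.

```python
import math

def bits_per_symbol(alphabet_size: int) -> float:
    """Upper bound on bits carried by one letter, assuming all letters are equally usable."""
    return math.log2(alphabet_size)

natural = bits_per_symbol(4)    # A, G, C, T
expanded = bits_per_symbol(11)  # four natural letters plus seven modified ones
ratio = expanded / natural

print(f"{natural:.2f} bits vs {expanded:.2f} bits per letter, ratio {ratio:.2f}")
# prints: 2.00 bits vs 3.46 bits per letter, ratio 1.73
```

Under this idealized bound the gain is roughly 1.7 times per letter, which is in the neighborhood of the reported doubling. The exact figure in practice would depend on the encoding scheme the researchers used, which is not detailed here.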

“Imagine the English alphabet,” said Kasra Tabatabaei, co-author of the study. “If you only had four letters to use, you could only create so many words. If you had the full alphabet, you could produce limitless word combinations. That’s the same with DNA. Instead of converting zeroes and ones to A, G, C, and T, we can convert zeroes and ones to A, G, C, T, and the seven new letters in the storage alphabet.”
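To make the idea of converting zeroes and ones into letters concrete, here is a minimal sketch that writes the same bytes in a 4-letter and an 11-letter alphabet by plain base conversion. This is an illustration, not the researchers' encoding: the names `X1` to `X7` are hypothetical placeholders for the seven modified nucleotides, and real schemes would also need error correction and chemical constraints.

```python
NATURAL = ["A", "C", "G", "T"]
# Hypothetical placeholder names for the seven chemically modified letters.
EXPANDED = NATURAL + [f"X{i}" for i in range(1, 8)]

def encode(data: bytes, alphabet: list[str]) -> list[str]:
    """Write the bytes as one big integer in base len(alphabet)."""
    # Prefix a 0x01 byte so leading zero bytes survive the round trip.
    number = int.from_bytes(b"\x01" + data, "big")
    base = len(alphabet)
    symbols = []
    while number:
        number, digit = divmod(number, base)
        symbols.append(alphabet[digit])
    return symbols[::-1]  # most significant letter first

def decode(symbols: list[str], alphabet: list[str]) -> bytes:
    """Invert encode(): rebuild the integer, then strip the sentinel byte."""
    base = len(alphabet)
    index = {letter: i for i, letter in enumerate(alphabet)}
    number = 0
    for letter in symbols:
        number = number * base + index[letter]
    raw = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the 0x01 sentinel

message = b"DNA storage"
four_letter = encode(message, NATURAL)
eleven_letter = encode(message, EXPANDED)
assert decode(four_letter, NATURAL) == message
assert decode(eleven_letter, EXPANDED) == message
print(len(four_letter), "letters with 4 bases;", len(eleven_letter), "letters with 11")
```

The same message needs fewer letters in the larger alphabet, which is the sense in which extra letters raise storage density.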

Of course, adding extra nucleotides means that existing systems for reading data back won’t recognize them, so the team also developed a new system that can. The DNA strand passes through a nanopore in a specially designed protein, which can detect the individual units regardless of whether they’re natural or synthetic. Machine learning algorithms then decode the information stored within.

“We tried 77 different combinations of the 11 nucleotides, and our method was able to differentiate each of them perfectly,” said Chao Pan, co-author of the study. “The deep learning framework as part of our method to identify different nucleotides is universal, which enables the generalizability of our approach to many other applications.”

In addition to density, the new method also improves the writing speed of the data, which is normally a fairly sluggish process for DNA. This system roughly halved the amount of time it takes to write information to DNA.

This work could help make DNA a viable data storage system, although there’s still plenty more work left to be done.

The research was published in the journal Nano Letters.

Source: University of Illinois Urbana-Champaign

3 comments
joe46
"...our method to identify different nucleotides is universal, which enables the generalizability of our approach to many other applications." So, life on this planet is made from 4-base DNA. I wonder what a life form would look like with 11-base DNA? CRISPR, anyone?
Wombat56
Seven new letters? Aren't the nucleotides in DNA supposed to come in complementary pairs? If you think of DNA as a twisted ladder then each rung is comprised of two halves, with the deoxyribose polymer up the sides just providing support.
Ralf Biernacki
The nucleotides come in pairs: adenine bonds with thymine, while guanine bonds with cytosine, producing the sense and antisense strands that make up the DNA double helix. It is the base pair, not the lone nucleotide, that constitutes a unit of information. So why are there 7 new bases, and not 6 or 8? It makes no sense. Unless they are counting uracil, but uracil just substitutes for thymine (in RNA); it carries no extra information content at all.