Incredible database now includes almost every protein known to science
Last year, Alphabet’s DeepMind released an open-source database of the 3D structures of hundreds of thousands of proteins, including nearly all of the roughly 20,000 known proteins in the human body. Now, this AlphaFold Protein Structure Database has been expanded to more than 200 million structures, covering almost every protein known to science.
Proteins are the workhorses of living cells, performing untold numbers of biological processes vital for life. They’re made up of chains of amino acids that fold into intricate three-dimensional shapes, and those shapes dictate their function. Mapping out the structures of proteins is important for understanding what they do, how they work, and how things can go wrong. That knowledge is key to research into everything from new medicines and treatments to improving crops and animal conservation.
But it remains tricky to calculate the exact structure of a protein based on the amino acids that make it up, a challenge known as the "protein folding problem." Working out a single structure usually requires a huge amount of computing power and human work hours, so progress had been relatively slow over the decades.
That is, until DeepMind set its AlphaFold AI on the problem. Originally trained on 100,000 known protein structures, the system developed the ability to predict the structures of many millions of other proteins, with each prediction taking mere minutes or seconds rather than months or years.
In July 2021 the first AlphaFold Protein Structure Database was released to the public for scientists to study. It originally contained over 350,000 protein structures, including around 98.5 percent of human proteins as well as those found in fruit flies, mice, yeast and E. coli. It was later expanded to around a million protein structures from 10,000 species of animals, plants, bacteria, fungi and other organisms. In the year since, over 500,000 scientists from around the world have accessed the database to aid their research.
Now, DeepMind has released a massive update to the database, which includes around 214 million structures from a million species. That covers almost every protein currently known to science, offering a huge boon to research into disease treatments, vaccines, sustainability, antibiotic resistance, and even plastic pollution.
“AlphaFold has already accelerated and enabled massive discoveries, including cracking the structure of the nuclear pore complex,” said Eric Topol, director of the Scripps Research Translational Institute. “And with this new addition of structures illuminating nearly the entire protein universe, we can expect more biological mysteries to be solved each day.”
The entire database of protein structures, comprising over 25 terabytes of data, can be downloaded from Google Cloud Public Datasets.
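For a rough sense of scale, the figures quoted above can be turned into an average size per structure. The short Python sketch below is only a back-of-the-envelope estimate: it assumes a decimal terabyte (10^12 bytes), treats "over 25" and "around 214 million" as exact, and uses a hypothetical helper name chosen for illustration.

```python
# Figures quoted in the article; both are approximate.
TOTAL_BYTES = 25 * 10**12   # "over 25 terabytes", assuming decimal terabytes
STRUCTURES = 214 * 10**6    # "around 214 million structures"


def average_kilobytes_per_structure(total_bytes: int, structures: int) -> float:
    """Return the mean size of one database entry in kilobytes (1 KB = 1,000 bytes)."""
    return total_bytes / structures / 1000


avg_kb = average_kilobytes_per_structure(TOTAL_BYTES, STRUCTURES)
print(f"Roughly {avg_kb:.0f} KB per structure")
```

On those assumptions, each predicted structure averages a little over 100 KB, so the sheer number of entries, rather than the size of any one of them, is what makes the full download so large.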