AI predicts the shape of nearly every known protein


The structure of the protein vitellogenin – a precursor to egg yolk – as predicted by the AlphaFold tool.

Starting today, determining the 3D shape of almost any protein known to science will be as easy as typing in a Google search.

Researchers used AlphaFold – the groundbreaking artificial intelligence (AI) network – to predict the structures of some 200 million proteins from one million species, covering nearly every known protein on the planet.

The data dump will be freely available on a database set up by DeepMind, Google’s London-based AI company that developed AlphaFold, and the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), an intergovernmental organization near Cambridge, UK.

“Essentially, you can think of it as covering the whole protein universe,” Demis Hassabis, CEO of DeepMind, said in a press briefing. “We are at the beginning of a new era of digital biology.”

The 3D shape, or structure, of a protein is what determines its function in cells. Most drugs are designed using structural information, and accurate maps are often the first step to discovering how proteins work.

DeepMind developed the AlphaFold network using an AI technique called deep learning, and the AlphaFold database was launched a year ago with 350,000 structure predictions covering nearly every protein made by humans, mice and 19 other widely studied organisms. The catalog has since swelled to around 1 million entries.

“We are preparing for the release of this enormous treasure,” says Christine Orengo, a computational biologist at University College London, who has used the AlphaFold database to identify new families of proteins. “Having all the data predicted for us is just fantastic.”

High quality structures

The release of AlphaFold last year caused a stir in the life sciences community, which was quick to take advantage of the tool. The network produces highly accurate predictions of the shape or 3D structure of proteins. It also provides information about the accuracy of its predictions, so researchers know what to trust. Traditionally, scientists have used time-consuming and expensive experimental methods such as X-ray crystallography and cryo-electron microscopy to solve protein structures.

According to EMBL-EBI, about 35% of the more than 214 million predictions are rated as very accurate, meaning they are as good as experimentally determined structures. Another 45% are deemed confident enough to rely on for many applications.

Many AlphaFold structures are good enough to replace experimental structures for some applications. In other cases, researchers use AlphaFold predictions to validate and make sense of experimental data. Bad predictions are often obvious, and some of them are caused by an intrinsic disorder of the protein itself, which means that it has no definite shape, at least without the presence of other molecules.

The 200 million predictions published today are based on sequences from another database, called UniProt. Scientists are likely to already have an idea of the shape of some of these proteins, because they are covered in databases of experimental structures or resemble other proteins in such repositories, explains Eduard Porta Pardo, a computational biologist at the Josep Carreras Leukemia Research Institute (IJC) in Barcelona.

But these entries tend to be biased toward human, mouse and other mammalian proteins, Porta Pardo says, so the AlphaFold dump is likely to add important insights because it draws on a much more diverse range of organisms. “It’s going to be a tremendous resource. And I’ll probably download it as soon as it comes out,” Porta Pardo says.

Since the AlphaFold software has been available for a year, researchers already have the ability to predict the structure of any protein they want. But many say having predictions available in a single database will save researchers time, money and faff. “It’s another barrier to entry that you remove,” says Porta Pardo. “I have used a lot of AlphaFold models. I have never run AlphaFold myself.”

Jan Kosinski, a structural modeler at EMBL Hamburg in Germany, who has been running the AlphaFold network for the past year, can’t wait for the database to expand. His team spent 3 weeks predicting the proteome (all of an organism’s proteins) of a pathogen. “Now we can just download all the models,” he said during the briefing.

Dozens of terabytes

Having almost all known proteins in the database will also allow new types of studies. Orengo’s team used the AlphaFold database to identify new types of protein families, and now they will do so on a much larger scale. Her lab will also use the expanded database to understand the evolution of proteins with properties that are useful, such as the ability to consume plastic, or worrisome, such as those that can lead to cancer. Identifying distant relatives of these proteins in the database could help to pin down the basis of their properties.

Martin Steinegger, a computational biologist at Seoul National University who helped develop a cloud-based version of AlphaFold, is excited to see the database grow. But he says researchers will likely still have to run the network themselves. Increasingly, people are using AlphaFold to determine how proteins interact, and such predictions are not in the database. Nor are microbial proteins identified by sequencing genetic material from soil, seawater and other “metagenomic” sources.

Some sophisticated applications of the expanded AlphaFold database might also depend on downloading all of its 23 terabytes of content, which won’t be feasible for many teams, Steinegger says. Cloud-based storage could also prove costly. Steinegger co-developed a software tool called Foldseek that can quickly find structurally similar proteins and should be able to compress the AlphaFold data considerably.

Even with all known proteins included, the AlphaFold database will need to be updated as new organisms are discovered. AlphaFold’s predictions may also improve as new structural information becomes available. Hassabis says DeepMind is committed to supporting the database for the long term, and he could see updates happening every year.

He hopes the availability of the AlphaFold database will have a lasting impact on the life sciences. “It’s going to take quite a big shift in mentality.”
