Scientists Just Discovered Over 70,000 Bizarre New Viruses With AI

Viruses are everywhere. They’re in the air; in sewage, lakes, and oceans; in grasslands and decaying wood. Some thrive in extreme conditions, like hydrothermal vents, Antarctic ice, and potentially even outer space.

They’re also ancient. Some are likely as old as, if not even older than, the very first cells.

Although we have lived alongside viruses since the dawn of our species, the viral universe remains largely mysterious. For decades, scientists have painstakingly gathered samples from around the globe and sequenced their genetic material. But viruses mutate rapidly, and these efforts only scratch the surface of the virosphere.

Most viral genetic material is biological “dark matter,” Mang Shi at Sun Yat-sen University and colleagues wrote in a recent paper published in Cell.

With the help of AI, the team is shedding new light on the viral world. The AI, dubbed LucaProt, relies on a large language model to make sense of chunks of viral genetic material. Another algorithm further parses genetic data into more “digestible” bits to increase efficacy.

After analyzing nearly 10,500 samples—some from previous databases, others collected during the study—the AI detected 70,458 new RNA viruses from samples all over the globe.

“All of a sudden you can see things that you just weren’t seeing before,” Artem Babaian at the University of Toronto, who wasn’t involved in the study, told Nature.

Viruses have a bad reputation. The Covid-19 pandemic and annual flu season highlight their destructive side. But they can also be used to battle antibiotic-resistant bacteria, shuttle gene therapies into cells, or be developed into vaccines.

Charting the viral universe offers a bird’s-eye view of the evolution and mutation of viruses, with implications not just for biotechnology but potentially for battling the next pandemic too. In humans, DNA carries the genetic blueprint. DNA is transcribed into RNA, which is also made up of four genetic letters and carries the genetic information into a cellular factory to make proteins.
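To make the DNA-to-RNA-to-protein flow concrete, here is a minimal Python sketch. The function names and the 12-letter DNA string are made up for illustration, and the codon table is deliberately partial: it lists only the four codons the example needs, not the full genetic code.

```python
# Partial codon table: only the codons used in the toy example below.
CODON_TABLE = {
    "AUG": "M",  # methionine, the usual start codon
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "UAA": "*",  # stop signal
}

def transcribe(dna_coding_strand: str) -> str:
    """Return the mRNA matching a DNA coding strand (T becomes U)."""
    return dna_coding_strand.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read the mRNA three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical 12-letter gene, chosen only for illustration.
mrna = transcribe("ATGGCTAAATAA")   # "AUGGCUAAAUAA"
protein = translate(mrna)           # "MAK"
```

An RNA virus skips the first step: its genome is already written in RNA letters.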

Viruses are different. Some forgo DNA altogether, instead directly encoding their genetic blueprint in RNA. It sounds unusual, but you already know some of these viruses: SARS-CoV-2, which causes Covid-19, is an RNA virus. These viruses have proteins that science knows little about, and they could also offer new insight into biology.

For decades, scientists have tried to decode the virosphere by collecting samples. The sources range from the everyday—water from a local creek—to the extreme, such as Antarctic ice or deep seawater. RNA extracted from these samples is carefully sequenced and deposited into databases. This method, called metagenomics, captures snippets of all viral RNA from an environment.

Making sense of the genetic goldmine takes more work. Classic computational methods struggle to sift these large databases for meaningful insights.

Enter ESMFold. Developed by Meta, the program relies on large language models—the same technology powering OpenAI’s ChatGPT and Google’s Gemini—to predict protein structures based on their amino acid “letters.” Similar methods, including DeepMind’s AlphaFold and David Baker’s RoseTTAFold, recently won their developers the 2024 Nobel Prize in Chemistry.

ESMFold takes in amino acid sequences and predicts the 3D structures of proteins at the atomic level. For its first real-life task, scientists used the AI to decode the “dark matter” of proteins in microbes we know the least about. Last year, the AI predicted the structure of over 700 million proteins from microorganisms. Ten percent were unlike any protein previously discovered.

Taking note, Shi’s team asked if a similar strategy could work in the world of RNA viruses.

Scientists have previously used AI to fish out potential new RNA viruses from petabytes of genetic sequencing data—an amount roughly equivalent to 500 million high-resolution photos.

These studies focused on RNA-dependent RNA polymerase, or RdRP, a family of proteins encoded in most RNA virus genomes. That makes RdRP sequences a useful tag for spotting these viruses. An early analysis identified nearly 132,000 new RNA viruses based on their genetic data.

The problem? Viruses rapidly mutate. If the genetic letters encoding RdRPs change, AI trained on those sequences may not be able to recognize mutated viruses. The new study tackled the problem by marrying the previous approach with ESMFold in a two-channel AI.

The first channel uses a transformer-based model, similar to the one behind ChatGPT, to extract amino acid sequence “keywords” characteristic of viral RdRPs from a large database. After training with the desired sequences, and some that were randomly generated, the AI created a vocabulary of about 20,000 frequently occurring protein sequences encoding RdRPs.

Compared to previous methods, this step breaks genetic libraries into more digestible sections, making it easier for the AI to tackle longer genetic sequences and detect viral RdRP proteins.
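As a rough illustration of what a “vocabulary” of frequent fragments means, the Python sketch below counts short fragments across a few toy sequences, keeps the most common ones, and then splits a sequence into those fragments. This is a deliberate simplification, not the tokenizer LucaProt uses: the function names, the fragment lengths, and the toy sequences are all assumptions made for the example. The shared “GDD” stretch nods to a motif commonly found in RdRPs.

```python
from collections import Counter

def build_vocab(sequences, min_len=2, max_len=4, vocab_size=3):
    """Collect the most frequent multi-letter fragments across all sequences."""
    counts = Counter()
    for seq in sequences:
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return {fragment for fragment, _ in counts.most_common(vocab_size)}

def tokenize(seq, vocab, max_len=4):
    """Greedy longest-match split; unknown stretches fall back to single letters."""
    tokens, i = [], 0
    while i < len(seq):
        for k in range(min(max_len, len(seq) - i), 0, -1):
            piece = seq[i:i + k]
            if k == 1 or piece in vocab:
                tokens.append(piece)
                i += k
                break
    return tokens

# Toy sequences (hypothetical), each containing the shared fragment "GDD".
toy_sequences = ["SGDDT", "AGDDK", "CGDDL"]
vocab = build_vocab(toy_sequences, vocab_size=3)   # {"GD", "DD", "GDD"}
tokens = tokenize("SGDDT", vocab)                  # ['S', 'GDD', 'T']
```

Splitting a long sequence into a handful of recurring fragments, rather than hundreds of single letters, is what makes it more “digestible” for a language model.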

The second channel taps a version of ESMFold. This is the slow but careful reader. Rather than blazing through protein words, it “reads” every single letter and predicts how each structurally connects with others to form 3D protein shapes. This step grounds the AI, giving it an idea of how RdRPs should look in living viruses.
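How the two channels are combined is not spelled out here, so the Python sketch below is only one plausible arrangement, not the authors' design: each channel is assumed to produce a short list of numbers, the lists are joined, and a simple logistic classifier turns them into a score between 0 and 1. The function name, feature values, weights, and bias are all hypothetical.

```python
import math

def fuse_and_score(sequence_features, structure_features, weights, bias):
    """Concatenate both channels' features and apply a logistic classifier."""
    combined = list(sequence_features) + list(structure_features)
    if len(combined) != len(weights):
        raise ValueError("need one weight per combined feature")
    logit = bias + sum(w * x for w, x in zip(weights, combined))
    return 1.0 / (1.0 + math.exp(-logit))  # score strictly between 0 and 1

# Hypothetical numbers, chosen only to make the example run.
score = fuse_and_score(
    sequence_features=[0.5, -1.0],   # stand-in for the "keyword" channel
    structure_features=[2.0],        # stand-in for the structure-aware channel
    weights=[1.0, 0.5, 0.25],
    bias=-0.5,
)
```

A score near 1 would mean “probably an RdRP,” a score near 0 “probably not.” In a real model the weights are learned during training rather than written by hand.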

LucaProt was trained on nearly 6,000 sequences encoding RdRP proteins and over 229,500 sequences known to encode different proteins. Challenged with a test dataset, in which the researchers knew the answers, the AI was exceptionally accurate, returning false positives only 0.014 percent of the time.
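To put that false positive figure in perspective, here is a small Python sketch of the arithmetic. The 0.014 percent rate comes from the text above. The function names and the counts passed to them are illustrative assumptions, not the study's actual test-set numbers.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Fraction of genuinely non-RdRP sequences wrongly flagged as RdRP."""
    return false_positives / (false_positives + true_negatives)

def expected_false_positives(rate_percent: float, n_negatives: int) -> float:
    """How many false alarms to expect among n_negatives non-RdRP sequences."""
    return rate_percent / 100 * n_negatives

reported_rate_percent = 0.014  # figure quoted in the article

# Hypothetical scale: one million sequences that are not RdRPs.
per_million = expected_false_positives(reported_rate_percent, 1_000_000)

# Hypothetical counts that would produce the same rate: 14 false alarms
# among 100,000 non-RdRP sequences.
rate = false_positive_rate(false_positives=14, true_negatives=99_986)
```

At that rate, roughly 140 of every million non-RdRP sequences would be mislabeled. That is small, though it still adds up when the search space is vast.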

The AI found 70,458 potential new, unique viruses. One, isolated from dirt, had a surprisingly long genome, “one of the longest RNA viruses identified to date,” wrote the team. Others could thrive in hot springs and extremely salty lakes.

The expanded virosphere adds new viruses to known viral groups, for example Flaviviridae, the family that includes the viruses behind hepatitis C and yellow fever. LucaProt also identified 60 viral groups, each markedly different from all viruses known today.

That’s not to say they cause disease, but they “have largely been overlooked in previous RNA virus discovery projects,” wrote the team.

To Babaian, the study found “little pockets of RNA virus biodiversity that are really far off in the boonies of evolutionary space.”

Viruses require a living host to survive. The team is upgrading their AI to predict these hosts. Most RNA viruses infect eukaryotes, which include plants and animals, humans among them. Some viruses can also infect bacteria, and their cat-and-mouse game inspired the gene editor CRISPR-Cas9.

“The evolutionary history of RNA viruses is at least as long, if not longer, than that of the cellular organisms,” wrote the authors.

Often ignored is the third branch of life, archaea. Having evolved during the early stages of life on Earth, these lifeforms share similarities with both bacteria and eukaryotes, for example in how their genetic material replicates.

But archaea are a distinct branch of life that thrives in extreme environments, such as hydrothermal vents or extremely salty water. There are hints that RNA viruses could also infect archaea. If so, it could spur new insights into our tree of life—and as with CRISPR, potentially lead to new biotechnologies.