All life on Earth is written with four DNA “letters.” An AI just used those letters to dream up a completely new genome from scratch.
Called Evo, the AI was inspired by the large language models, or LLMs, underlying popular chatbots such as OpenAI’s ChatGPT and Anthropic’s Claude. These models have taken the world by storm for their prowess at generating human-like responses. From simple tasks, such as defining an obtuse word, to summarizing scientific papers or spewing verses fit for a rap battle, LLMs have entered our everyday lives.
If LLMs can master written languages, could they do the same for the language of life?
This month, a team from Stanford University and the Arc Institute put the theory to the test. Rather than training Evo on content scraped from the internet, they trained the AI on nearly three million genomes—amounting to billions of lines of genetic code—from various microbes and bacteria-infecting viruses.
Evo was better than previous AI models at predicting how mutations to genetic material—DNA and RNA—could alter function. The AI also got creative, dreaming up several new components for the gene editing tool, CRISPR. Even more impressively, the AI generated a genome more than a megabase long—roughly the size of some bacterial genomes.
“Overall, Evo represents a genomic foundation model,” wrote Christina Theodoris at the Gladstone Institute in San Francisco, who was not involved in the work.
Having learned the genomic vocabulary, algorithms like Evo could help scientists probe evolution, decipher our cells’ inner workings, tackle biological mysteries, and fast-track synthetic biology by designing complex new biomolecules.
Compared to the English alphabet’s 26 letters, DNA only has A, T, C, and G. These ‘letters’ are shorthand for the four molecules—adenine (A), thymine (T), cytosine (C), and guanine (G)— that, combined, spell out our genes. If LLMs can conquer languages and generate new prose, rewriting the genetic handbook with only four letters should be a piece of cake.
Not quite. Human language is organized into words, phrases, and punctuated into sentences to convey information. DNA, in contrast, is more continuous, and genetic components are complex. The same DNA letters carry “parallel threads of information,” wrote Theodoris.
The most familiar is DNA’s role as genetic carrier. A specific combination of three DNA letters, called a codon, encodes a protein building block. These are strung together into the proteins that make up our tissues, organs, and direct the inner workings of our cells.
But the same genetic sequence, depending on its structure, can also recruit the molecules needed to turn codons into proteins. And sometimes, the same DNA letters can turn one gene into different proteins depending on a cell’s health and environment or even turn the gene off.
In other words, DNA letters contain a wealth of information about the genome’s complexity. And any changes can jeopardize a protein’s function, resulting in genetic disease and other health problems. This makes it critical for AI to work at the resolution of single DNA letters.
But it’s hard for AI to capture multiple threads of information on a large scale by analyzing genetic letters alone, partially due to high computational costs. Like ancient Roman scripts, DNA is a continuum of letters without clear punctuation. So, it could be necessary to “read” whole strands to gain an overall picture of their structure and function—that is, to decipher meaning.
Previous attempts have “bundled” DNA letters into blocks—a bit like making artificial words. While easier to process, these methods disrupt the continuity of DNA, resulting in the retention of “ some threads of information at the expense of others,” wrote Theodoris.