The explosion in text-to-image AI models like OpenAI’s DALL-E 2—programs trained to generate pictures of almost anything you ask for—has sent ripples through the creative industries, from fashion to filmmaking, by providing weird and wonderful images on demand.
The same technology behind these programs is also making a splash in biotech labs, which are increasingly using this type of generative AI, known as a diffusion model, to conjure up designs for new types of protein never seen in nature.
Today, two labs separately announced programs that use diffusion models to generate designs for novel proteins with more precision than ever before. Generate Biomedicines, a Boston-based startup, revealed a program called Chroma, which the company describes as the “DALL-E 2 of biology.”
At the same time, a team at the University of Washington led by biologist David Baker has built a similar program called RoseTTAFold Diffusion. In a preprint paper posted online today, Baker and his colleagues show that their model can generate precise designs for novel proteins that can then be brought to life in the lab. “We’re generating proteins with really no similarity to existing ones,” says Brian Trippe, one of the co-developers of RoseTTAFold Diffusion.
These protein generators can be directed to produce designs for proteins with specific properties, such as shape or size or function. In effect, this makes it possible to come up with new proteins to do particular jobs on demand. Researchers hope that this will eventually lead to the development of new and more effective drugs. “We can discover in minutes what took evolution millions of years,” says Gevorg Grigoryan, CEO of Generate Biomedicines.
“What is notable about this work is the generation of proteins according to desired constraints,” says Ava Amini, a biophysicist at Microsoft Research in Cambridge, Massachusetts.
Proteins are the fundamental building blocks of living systems. In animals, they digest food, contract muscles, detect light, drive the immune system, and so much more. When people get sick, proteins play a part.
Proteins are thus prime targets for drugs. And many of today’s newest drugs are themselves protein-based. “Nature uses proteins for essentially everything,” says Grigoryan. “The promise that offers for therapeutic interventions is really immense.”
But drug designers currently have to draw on an ingredient list made up of natural proteins. The goal of protein generation is to extend that list with a nearly infinite pool of computer-designed ones.
Computational techniques for designing proteins are not new. But previous approaches have been slow and not great at designing large proteins or protein complexes—molecular machines made up of multiple proteins coupled together. And such proteins are often crucial for treating diseases.
The two programs announced today are also not the first use of diffusion models for protein generation. A handful of studies in the last few months from Amini and others have shown that diffusion models are a promising technique, but these were proof-of-concept prototypes. Chroma and RoseTTAFold Diffusion build on this work and are the first full-fledged programs that can produce precise designs for a wide variety of proteins.
Namrata Anand, who co-developed one of the first diffusion models for protein generation in May 2022, thinks the big significance of Chroma and RoseTTAFold Diffusion is that they have taken the technique and supersized it, training on more data and using more computers. “It may be fair to say that this is more like DALL-E because of how they’ve scaled things up,” she says.
Diffusion models are neural networks trained to remove “noise”—random perturbations added to data—from their input. Given a random mess of pixels, a diffusion model will try to turn it into a recognizable image.
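To make the idea of adding and removing noise concrete, here is a minimal Python sketch. It is not the code behind Chroma or RoseTTAFold Diffusion. It assumes a common textbook formulation in which clean data is blended with Gaussian noise according to a fixed schedule, and the function names (`alpha_bar`, `add_noise`, `remove_noise`) and schedule values are illustrative choices. A real diffusion model trains a neural network to estimate the noise. This toy stands in for that network by reusing the noise it added, which shows only that a good noise estimate is enough to recover the original data.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after t noising steps.

    Assumes a linear noise schedule; the start and end values are
    illustrative, not taken from either program in the article.
    """
    remaining = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        remaining *= 1.0 - beta
    return remaining


def add_noise(x0, t, num_steps, rng):
    """Forward process: blend clean data x0 with Gaussian noise."""
    a = alpha_bar(t, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def remove_noise(xt, eps_estimate, t, num_steps):
    """Undo the blend, given an estimate of the noise that was added."""
    a = alpha_bar(t, num_steps)
    return [
        (x - math.sqrt(1.0 - a) * e) / math.sqrt(a)
        for x, e in zip(xt, eps_estimate)
    ]


rng = random.Random(0)
clean = [0.0, 1.0, 0.5, -1.0]  # stand-in for pixels or atom coordinates
noisy, true_eps = add_noise(clean, t=500, num_steps=1000, rng=rng)

# A trained network would supply the noise estimate; here we cheat and
# pass the true noise, so the recovery is exact up to rounding error.
recovered = remove_noise(noisy, true_eps, t=500, num_steps=1000)
```

In practice the noise estimate is imperfect, so generation starts from pure noise and removes a little of it at a time over many steps rather than in one jump.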
In Chroma, noise is added by unraveling the amino acid chains that a protein is made from. Given a random clump of these chains, Chroma tries to put them together to form a protein. Guided by specified constraints on what the result should look like, Chroma can generate novel proteins with specific properties.
Baker’s team takes a different approach, though the end results are similar. Its diffusion model starts with an even more scrambled structure. Another key difference is that RoseTTAFold Diffusion uses information about how the pieces of a protein fit together provided by a separate neural network trained to predict protein structure (as DeepMind’s AlphaFold does). This guides the overall generative process.
Generate Biomedicines and Baker’s team both show off an impressive array of results. They are able to generate proteins with multiple degrees of symmetry, including proteins that are circular, triangular, or hexagonal. To illustrate the versatility of their program, Generate Biomedicines generated proteins shaped like the 26 letters of the Latin alphabet and the numerals 0 to 10. Both teams can also generate pieces of proteins, matching new parts to existing structures.
Most of these demonstrated structures would serve no purpose in practice. But because a protein’s function is determined by its shape, being able to generate different structures on demand is crucial.
Generating strange designs on a computer is one thing. But the goal is to turn these designs into real proteins. To test whether Chroma produced designs that could be made, Generate Biomedicines took the sequences for some of its designs—the amino acid strings that make up the protein—and ran them through another AI program. They found that 55% of them would be predicted to fold into the structure generated by Chroma, which suggests that these are designs for viable proteins.
Baker’s team ran a similar test. But Baker and his colleagues have gone a lot further than Generate Biomedicines in evaluating their model. They have created some of RoseTTAFold Diffusion’s designs in their lab. (Generate Biomedicines says that it is also doing lab tests but is not yet ready to share results.) “This is more than just proof of concept,” says Trippe. “We’re actually using this to make really great proteins.”
A protein structure generated by RoseTTAFold Diffusion that binds to the SARS-CoV-2 spike protein
For Baker, the headline result is the generation of a new protein that attaches to the parathyroid hormone, which controls calcium levels in the blood. “We basically gave the model the hormone and nothing else and told it to make a protein that binds to it,” he says. When they tested the novel protein in the lab, they found that it attached to the hormone more tightly than anything that could have been generated using other computational methods—and more tightly than existing drugs. “It came up with this protein design out of thin air,” says Baker.
Grigoryan acknowledges that inventing new proteins is just the first step of many. “We’re a drug company,” he says. “At the end of the day what matters is whether we can make medicines that work or not.” Protein-based drugs need to be manufactured in large numbers, then tested in the lab and finally in humans. This can take years. But he thinks that his company and others will find ways for AI to speed up those steps as well.
“The rate of scientific progress comes in fits and starts,” says Baker. “But right now we’re in the middle of what can only be called a technological revolution.”