MCB Senior Research Fellow Elena Rivas and colleagues have demonstrated that the fundamental rules governing RNA structure can be learned using a remarkably simple computational model, challenging the assumption that increasingly complex systems are required to uncover basic biological principles.
In a new study published in Communications Biology, the team shows that a model with just 21 parameters—trained on a small set of RNA sequences—can recover the core rules of RNA base pairing fully unsupervised, that is, without ever being given structural information.
The work arrives amid growing excitement around large-scale machine learning approaches to biology, particularly models trained on massive datasets. These models, of which ChatGPT is an example, typically rely on millions or even billions of parameters and vast training collections.
Rivas’s study takes a strikingly different approach. “We demonstrate that learning the fundamental biological rules of RNA base pairing can be achieved with very few parameters,” Rivas said. “A lot less than you might think.”
Learning Biology Like Language
The study builds on the analogy between biological sequences and human language. In language models, algorithms learn patterns by analyzing large collections of text—identifying grammar, structure, and meaning without explicit instruction.
Rivas applied this same idea to RNA. Instead of sentences, the model is given RNA sequences—strings of nucleotides (A, C, G, and U). Crucially, it is not told how these nucleotides interact or which pairs form the structural backbone of RNA molecules.
Yet from this raw input alone, the model is able to infer the fundamental pairing rules: A pairs with U, C pairs with G, and G can also pair with U. “You give it sequences, but you don’t tell it what to look for,” Rivas explained. “And what comes out is that it learns the basic pairing rules—just from the data itself.”
This mirrors how large language models learn to predict and generate human language. But unlike those systems, which require enormous computational resources, Rivas’s model achieves this with just 21 parameters and as few as 50 training sequences, even though it is built with the same programming language and optimization routines used by the large models.
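To make the idea concrete, here is a minimal sketch of how such a small unsupervised model could work. This is not the paper’s code: the grammar, the exact parameter split, and names like log_inside and train are illustrative assumptions. A toy stochastic context-free grammar with rules S → aS, S → aSbS, and S → ε has 16 pair-emission weights, 4 unpaired-emission weights, and 3 rule weights (close to, though not exactly, the paper’s 21 parameters), and can be trained by gradient descent to maximize the likelihood of raw sequences, with no structural labels ever supplied:

```python
import torch

ALPHABET = "ACGU"
IDX = {c: i for i, c in enumerate(ALPHABET)}

# Roughly 21 numbers in total; the paper's exact parameterization is not
# described in this article, so the 16 + 4 + 3 split is an assumption.
pair_logits = torch.zeros(4, 4, requires_grad=True)    # 16: pair emissions
single_logits = torch.zeros(4, requires_grad=True)     # 4: unpaired emissions
rule_logits = torch.zeros(3, requires_grad=True)       # 3: S->aS | aSbS | eps

def log_inside(seq: str) -> torch.Tensor:
    """Log-likelihood of seq under the toy grammar, summing over all
    nested secondary structures (inside algorithm, O(n^3))."""
    n = len(seq)
    x = [IDX[c] for c in seq]
    lp_pair = torch.log_softmax(pair_logits.reshape(-1), dim=0).reshape(4, 4)
    lp_single = torch.log_softmax(single_logits, dim=0)
    lp_rule = torch.log_softmax(rule_logits, dim=0)

    inside = [[None] * n for _ in range(n)]

    def I(i, j):  # empty spans are generated by the epsilon rule
        return inside[i][j] if i <= j else lp_rule[2]

    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            # Either x[i] is unpaired ...
            terms = [lp_rule[0] + lp_single[x[i]] + I(i + 1, j)]
            # ... or x[i] pairs with some downstream x[k].
            for k in range(i + 1, j + 1):
                terms.append(lp_rule[1] + lp_pair[x[i], x[k]]
                             + I(i + 1, k - 1) + I(k + 1, j))
            inside[i][j] = torch.logsumexp(torch.stack(terms), dim=0)
    return I(0, n - 1)

def train(seqs, steps=200, lr=0.1):
    """Unsupervised training: maximize sequence likelihood only.
    No base-pairing or structure information is ever provided."""
    opt = torch.optim.Adam([pair_logits, single_logits, rule_logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -sum(log_inside(s) for s in seqs) / len(seqs)
        loss.backward()
        opt.step()
```

Under these assumptions, training on a few dozen sequences from an RNA family and then inspecting torch.softmax(pair_logits.reshape(-1), 0).reshape(4, 4) should show most of the probability mass concentrating on the A-U, C-G, and G-U cells, which is precisely the unsupervised recovery of base-pairing rules the study describes.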
Rethinking Scale in Machine Learning
The findings challenge a prevailing assumption in computational biology: that more data and more complex models are inherently necessary to uncover meaningful biological insights.
Large RNA-focused language models often rely on millions of sequences and extensive parameter spaces. While powerful, these systems can obscure the underlying mechanisms they capture.
Rivas’s work suggests that, at least for fundamental biological rules, simplicity may be sufficient and even advantageous. “When people show these large models learning base pairing, it’s exciting,” she said. “But the point is you can learn that with a lot less.”
The study does not aim to compete with large models in predictive performance. A 21-parameter system is not expected to match state-of-the-art tools for detailed RNA structure prediction. Instead, the goal is more foundational: to understand what is required to learn the rules themselves. “The model is not trying to predict the structure of any given RNA perfectly,” Rivas noted. “What it’s learning are the fundamentals.”
Implications for Computational Biology
Beyond RNA biology, the study raises broader questions about how machine learning is applied across the life sciences. As models grow in size and complexity, it becomes increasingly difficult to disentangle what they are learning—and whether simpler explanations might suffice.
By showing that core biological rules can emerge from compact models trained on limited data, Rivas’s work points toward a complementary strategy: using minimal systems to probe the foundations of biological learning. Because these minimal systems are expressed in the same language and tooling as their larger counterparts, they can be integrated directly into a larger discovery framework.
The findings also underscore the importance of interpretability. Smaller models, with fewer parameters, are inherently easier to analyze, making them valuable tools for understanding—not just predicting—biological phenomena.
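In the toy sketch above, for instance, interpretation amounts to printing a table: the learned pairing preferences are just a 4×4 matrix that can be read directly (again assuming the hypothetical pair_logits from the earlier sketch):

```python
import torch

# Inspect the learned pairing table: rows and columns are A, C, G, U.
# After training, high-probability cells should correspond to A-U, C-G,
# and G-U if the pairing signal was present in the training data.
probs = torch.softmax(pair_logits.reshape(-1), dim=0).reshape(4, 4)
for base, row in zip("ACGU", probs):
    print(base, " ".join(f"{p:.3f}" for p in row))
```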
A Different Kind of Progress
In an era defined by ever-larger datasets and increasingly complex algorithms, Rivas’s study offers a reminder that progress in science does not always require scaling up.
Sometimes, it comes from stripping things down. “You don’t need millions of parameters and millions of sequences to learn the basics,” Rivas said. “You can get there with something much simpler.”