Is LLM Performance Predetermined by Their Genetic Code?



image by the author using AI

I'm fascinated by the idea that genetics is digital. A gene is a long sequence of coded letters, like computer information. Modern biology is becoming very much a branch of information technology. – Richard Dawkins

There are plenty of Large Language Models (LLMs) today, both closed and open source, and hundreds of new ones are published every day on the Hugging Face Hub alone. This demonstrates both the community's interest and the success of language models. On the other hand, despite this interest, most of these models are never benchmarked and are released with little detail about how they were built, a real lack of transparency.

A Requiem for the Transformer?

How transparent are large language models?

A variety of benchmarks are used to track the capabilities of these models, each focused on a different skill to be measured. This makes it complicated to monitor and compare models. In addition, many models are specialized for a particular domain, which further complicates rigorous comparison. Getting a full picture of a model's capabilities is a computationally expensive and time-consuming process, and even more so when multiple models must be evaluated.

Chang, 2023, here

The truth is that most models are not trained from scratch but are derivatives of previous models. Starting from an initial model (for example, Mistral 7B or LLaMA), often by fine-tuning or some other modification, we can obtain endless variations. Metaphorically, this process can be seen as evolution, so one could study the functionality of base models and their derivatives using concepts borrowed from genetics.

There are tools such as phylogenetic trees, which are designed to reconstruct evolutionary relationships between species but have also been applied in other fields. For example, these trees have been used to reconstruct relationships between popular tales, languages, and other human concepts or constructs (Atkinson, 2008; d'Huy, 2013).

phylogenetic tree of languages. Draganov, 2024, here

How can population genetics be applied to LLMs?

To build a phylogenetic tree, we estimate the distribution of genetic alleles (alternative DNA sequences for the same gene) in two populations. Then, considering a set of alleles across the populations, a similarity matrix is created, from which the similarity between two species is calculated.

Yax, 2024, here

The generated text can therefore be seen as a thread of DNA, comprised of tokens (alleles) sampled in contexts (genes) according to a probability distribution defined by the LLM. (source)

So you can take specific "genes" (which, for LLMs, are prompts) and estimate the probability that different species (our LLMs) share the same alleles (the tokens). In practice, we record which tokens each model produces in response to specific prompts and calculate the probability that two LLMs produce the same tokens. Repeating this similarity calculation for several LLMs, we can build our gene tree.

Yax, 2024, here
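To make this concrete, here is a minimal sketch of how such a similarity could be computed (my own simplified proxy, not the paper's exact metric): for each prompt ("gene"), we sample next tokens ("alleles") from each model, and the per-gene similarity is the probability that two models emit the same token.

```python
from collections import Counter

def allele_frequencies(samples):
    """Frequency of each token ('allele') among the samples for one prompt ('gene')."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def gene_similarity(samples_a, samples_b):
    """Probability that two models, sampled at the same prompt, emit the same
    token: the inner product of their empirical token distributions."""
    fa, fb = allele_frequencies(samples_a), allele_frequencies(samples_b)
    return sum(fa[tok] * fb.get(tok, 0.0) for tok in fa)

def model_similarity(genes_a, genes_b):
    """Average per-gene similarity over a shared set of prompts."""
    prompts = genes_a.keys() & genes_b.keys()
    return sum(gene_similarity(genes_a[p], genes_b[p]) for p in prompts) / len(prompts)

# Toy data: a fine-tune of 'base' agrees with it more often than an unrelated model.
base     = {"The capital of France is": ["Paris"] * 9 + ["Lyon"],
            "2 + 2 =": ["4"] * 10}
finetune = {"The capital of France is": ["Paris"] * 8 + ["Lyon"] * 2,
            "2 + 2 =": ["4"] * 10}
other    = {"The capital of France is": ["Paris"] * 5 + ["Marseille"] * 5,
            "2 + 2 =": ["four"] * 6 + ["4"] * 4}

print(round(model_similarity(base, finetune), 3))  # 0.87
print(round(model_similarity(base, other), 3))     # 0.425
```

In a real experiment the sampled tokens would come from calling each model on the same prompt set; here the samples are hard-coded to keep the sketch self-contained.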

Inspired by biology, it pays to choose genes that are neither too dissimilar across species nor too similar (in the former case all species look different; in the latter, all look the same). For LLMs, this means finding prompts for which there is variance in the completions produced by different models. Prompts from benchmarks are a good choice because, in theory, models should not have been trained on them (and thus should not have memorized the answers).

Yax, 2024, here

We should also sample more than just the first generated token because, in question answering, models tend to give uniform answers before diverging. For example, given "The president of France is", virtually all models will produce "Macron" as the first token.
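A tiny sketch of this selection criterion (my own illustration, not code from the paper): score each prompt by the entropy of the answers different models give it, and keep only prompts that are neither fully conserved (everyone agrees) nor fully divergent (everyone differs).

```python
from collections import Counter
import math

def gene_diversity(answers):
    """Shannon entropy (in bits) of the tokens different models produce for one
    prompt. 0 means every model agrees; log2(n_models) means all differ."""
    counts = Counter(answers.values())
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def select_genes(prompt_answers, n_models):
    """Keep prompts that are neither fully conserved nor fully divergent
    (with a small tolerance to avoid floating-point edge cases)."""
    max_h = math.log2(n_models)
    return [p for p, ans in prompt_answers.items()
            if 1e-9 < gene_diversity(ans) < max_h - 1e-9]

answers = {
    "The president of France is": {"A": "Macron", "B": "Macron", "C": "Macron"},   # conserved
    "My favourite colour is":     {"A": "blue",   "B": "red",    "C": "green"},    # fully divergent
    "The best programming language is": {"A": "Python", "B": "Python", "C": "Rust"},
}
print(select_genes(answers, 3))  # only the last prompt is informative
```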

An interesting point is that a phylogenetic tree generally represents speciation events: two species emerge from a common, now extinct, ancestor. For LLMs, the common ancestor is not extinct. For example, Mistral 7B is the ancestor of several models but is still in use. This causes problems for the phylogenetic representation. When the ancestry relationship is known, it is obviously easier to represent the tree's root and distances correctly.

Yax, 2024, here

For closed-source models, it becomes even more complex because we have less information about their relationships. Moreover, there may be hidden system prompts prepended to the user's prompt that change the completion. Since we cannot know what is in this hidden part of the prompt, a bias is introduced. In this study (Yax, 2024, here), the authors analyzed 111 open-access models (from 70M to 176B parameters) and 45 closed LLMs.

Yax, 2024, here

The results are intriguing: you can clearly see LLaMA clusters, separated from the other families. Similarly, one can identify clusters for other models that have been widely used by the community: Mistral, Qwen, and BLOOM. Falcon, OPT, Pythia, and GPT-3 appear mixed together, probably because they were trained on similar versions of the Common Crawl dataset.
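The dendrograms in the paper are built with proper phylogenetic algorithms (such as neighbor joining); as a toy illustration of how family clusters emerge from a distance matrix, here is a minimal average-linkage clustering, with made-up distances, that recovers the expected structure:

```python
def agglomerate(dist):
    """Greedy average-linkage clustering on a distance matrix given as
    {name: {name: distance}}. Repeatedly merges the closest pair of clusters
    and returns the final nested-tuple tree (innermost pairs merged first)."""
    d = {a: dict(row) for a, row in dist.items()}   # working copy
    tree = {a: a for a in d}
    while len(d) > 1:
        # find the closest pair of current clusters
        a, b = min(((x, y) for x in d for y in d[x] if x < y),
                   key=lambda p: d[p[0]][p[1]])
        name = f"({a},{b})"
        tree[name] = (tree.pop(a), tree.pop(b))
        # distance from the merged cluster to each remaining one: the average
        row = {c: (d[a][c] + d[b][c]) / 2 for c in d if c not in (a, b)}
        del d[a], d[b]
        for c in d:
            d[c].pop(a, None)
            d[c].pop(b, None)
            d[c][name] = row[c]
        d[name] = row
    return tree.popitem()[1]

# Hypothetical distances: the two LLaMA derivatives are close, GPT-3 is far.
dist = {
    "Llama-2": {"Vicuna": 0.1, "Mistral": 0.6, "GPT-3": 0.8},
    "Vicuna":  {"Llama-2": 0.1, "Mistral": 0.6, "GPT-3": 0.8},
    "Mistral": {"Llama-2": 0.6, "Vicuna": 0.6, "GPT-3": 0.7},
    "GPT-3":   {"Llama-2": 0.8, "Vicuna": 0.8, "Mistral": 0.7},
}
print(agglomerate(dist))  # ((('Llama-2', 'Vicuna'), 'Mistral'), 'GPT-3')
```

The LLaMA pair merges first, then Mistral joins them, and GPT-3 attaches last, mirroring how the families separate in the paper's figures.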

Another interesting element is that distance also indicates improvement. For example, PaLM and Gemini are on the same branch, but one is farther along than the other; in the context of LLMs, this distance can be read as an indication of how much a model has improved.

This difference between two models is also functional: in this study (Yax, 2024), the authors show that you can use this distance to predict a model's score on a series of benchmarks:

We then investigated whether the genetic distance metric can be used to predict the abilities of language models. As such we used the benchmark scores from the Huggingface open LLM leaderboard. The results indicate that the prediction correlates with the true score of the models – source

Yax, 2024, here

By leveraging the genetic distance matrix, it becomes feasible to robustly trace the relationships and evolution of models over time. This is particularly evident in the constructed dendrograms, where clear clusters align with distinct families of LLMs, offering a visual representation of their evolutionary trajectories or at least their training similarity. – source
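The paper trains a regression on the distance matrix; as a much simpler stand-in for the same intuition (close relatives score similarly), here is a distance-weighted estimate of a new model's benchmark score from already-evaluated models, with made-up distances and scores:

```python
def predict_score(distances, known_scores, eps=1e-9):
    """Distance-weighted average of the benchmark scores of evaluated models:
    closer relatives contribute more. A toy stand-in for the paper's learned
    regression on the genetic distance matrix."""
    weights = {m: 1.0 / (distances[m] + eps) for m in known_scores}
    total = sum(weights.values())
    return sum(weights[m] * known_scores[m] for m in known_scores) / total

# Hypothetical distances from a new fine-tune to three evaluated models,
# with hypothetical leaderboard scores.
distances    = {"Llama-2-7B": 0.05, "Mistral-7B": 0.40, "Falcon-7B": 0.60}
known_scores = {"Llama-2-7B": 54.0, "Mistral-7B": 62.0, "Falcon-7B": 47.0}

print(round(predict_score(distances, known_scores), 1))  # 54.3
```

Because the new model sits very close to Llama-2-7B, the prediction lands near Llama-2-7B's score: no benchmark needs to be run on the new model itself.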

The idea of being able to draw a phylogenetic tree of models is intriguing. In many surveys (Yang, 2023; Zhao, 2023), such trees have been drawn by hand because the derivation relationships are publicly known. Being able to generate a similarity matrix and reconstruct these trees automatically shows that the functional evolution of models remains tractable. Moreover, fine-tuning a model changes its performance but does not erase its lineage, and this is visible in the phylogenetic tree.

an example of an LLM tree in a survey. Yang, 2023, here

It is difficult to reconstruct the relationship between a model and its adapted versions, especially when there is no information about the training dataset or the method used. However, a trace remains, and we can still visualize their evolutionary trajectories, or at least their training similarity. This applies not only to open-source models but also to closed-source ones.

This allows the research community to be able to study the relationships between the various models. It can also open up interesting questions by studying how different methods and datasets alter genetic relationships between models. It might also be interesting to study whether, for example, techniques like knowledge distillation bring the trajectory of two models (teacher and student) closer together.

The observation that a logistic regression trained on the genetic distance matrix can accurately predict benchmark accuracy has the potential to accelerate the evaluation of new LLMs capabilities in a very computationally efficient manner. – source

The effect of a genetic mutation is a change in the phenotype (the physical appearance and behavior of a species). If we keep the metaphor, variations in the "DNA" of an LLM should lead to changes in behavior and performance. In a sense, mutations in an LLM go through an evolutionary process: by using gradient descent, we apply selective pressure on the parameters, a pressure directed to preserve those changes that increase the model's performance on a given task (we could see it as natural selection in an environment with limited resources). Certainly, these changes in the model have a functional significance and therefore a certain correlation with its performance on certain tasks. Although it is an interesting perspective, it is probably still early days to imagine predicting the performance of one model from its genetic distance to another.

Recently, several articles (Templeton, 2024, Gao, 2024) have come out on the interpretation of LLMs, discussing latent concepts that could be extracted. It would also be interesting to discuss in genetic terms these concepts, how they evolve with fine-tuning, and how this internal representation evolves and diverges from the original model.

Clear Waters: What an LLM Thinks Under the Surface

Finally, another interesting perspective is to imagine genetic engineering on models. Having tools that track the differences between models and connect them to functional properties could be a way to engineer models without fine-tuning. In genetic engineering, we cut and paste DNA sequences to change an organism's behavior; for LLMs, we could similarly take layers or adapters from one model and graft them into another model of the same family.

What are your thoughts on this? How do you think genetics might influence LLMs?


If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn, where I am open to collaborations and projects. Check this repository containing weekly updated ML & AI news. You can also subscribe for free to get notified when I publish a new story.

Get an email whenever Salvatore Raieli publishes.

Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, Artificial Intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

PlanRAG: Plan Your Way to Better Decisions

The Goldfish LLM: Swimming Through Data Without Memorizing It

Beyond AlphaFold: The Future Of LLM in Medicine

Are Long-Context LLMs Truly Revolutionary?

Reference

Here is the list of the principal references I consulted to write this article; only the first author of each article is cited.

  1. Yax, 2024, PhyloLM: Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks, link
  2. Chang, 2023, A Survey on Evaluation of Large Language Models, link
  3. Draganov, 2024, The Shape of Word Embeddings: Recognizing Language Phylogenies through Topological Data Analysis, link
  4. Templeton, 2024, Golden Gate Claude, link
  5. Gao, 2024, Extracting Concepts from GPT-4, link
  6. Zhao, 2023, A Survey of Large Language Models, link
  7. d'Huy, 2013, A phylogenetic reconstruction of a prehistoric tale, link
  8. Atkinson, 2008, Languages Evolve in Punctuational Bursts, link

Tags: Artificial Intelligence, Data Science, LLM, Machine Learning, Phylogenetics
