Making Language Models Similar to the Human Brain


NEUROSCIENCE | ARTIFICIAL INTELLIGENCE | NLP

image by the author using OpenAI DALL-E

There are thousands of vertebrate species on Earth, but only one is capable of transmitting an infinite range of concepts through language. Verbal transmission is fundamental to humans and has allowed us to shape history as we know it.

Why do language models still fail to match this human capacity?

A new study attempts to answer this question based on both language models and neuroscience:

Evidence of a predictive coding hierarchy in the human brain listening to speech – Nature Human…

How do artificial brains and natural brains speak?

In recent years, language models (LMs) have made great strides in tasks such as text generation, translation, and completion. All of this rests on a simple but effective idea: a word can be predicted from its context.

Simple as it seems, this idea is the basis of all LMs, from BERT to ChatGPT. Every transformer is built on embeddings and self-attention (which, after all, is what lets the model relate the words in a sentence).
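To make the idea concrete, here is a minimal sketch of next-word prediction with a pretrained GPT-2, using the Hugging Face transformers library (the prompt is just an illustrative example): given the context, the model returns a probability for every possible next token.

```python
# A minimal sketch of next-word prediction with a pretrained causal LM.
# The model name and prompt are illustrative, not tied to the study.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Once upon a"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # (batch, seq_len, vocab_size)

# Probability distribution over the next token, given the context
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(i):>10s}  {p.item():.3f}")
```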

Everything but everything you need to know about ChatGPT

Over the years, researchers have tried to determine whether there is a mapping between the activations of these models and the human brain's response to speech and text. Several studies have shown that this mapping is linear and depends on the model's ability to predict upcoming words in the sequence.

Recently, one study even showed that AI can visualize the images a person is looking at, reconstructing what they see from recordings of their brain activity.

Stable diffusion and the brain: how AI can read our minds

Have AI algorithms caught up with human capabilities?

No, and our understanding of human language and of the associated brain processes is not sufficient either.

Yet, a gap persists between humans and these algorithms: in spite of considerable training data, current language models are challenged by long story generation, summarization and coherent dialogue and information retrieval; they fail to capture several syntactic constructs and semantics properties and their linguistic understanding is superficial. (source)

"Even with substantial human context and the powerful GPT-2 Large language model, Beam Search (size 32) leads to degenerate repetition (highlighted in blue) while pure sampling leads to incoherent gibberish (highlighted in red)" source (here)

Such examples show that LMs still have trouble identifying the subject and its dependencies in nested phrases. In any case, the authors note that optimizing only for next-word prediction often leads to inconsistent and bland sequences (and sometimes to repetitive loops).

Predictive coding theory potentially explains why LMs still lag behind human language. What is it, exactly?

According to the predictive coding hypothesis, the architecture of the cortex implements a top-down prediction algorithm that constantly anticipates incoming sensory stimuli. Each cortical area houses an internal model of the environment, which is generated by compiling the statistical regularities that govern past inputs. (source)

In other words, the brain maintains (and constantly updates) a mental model of its surroundings, and this representation of space (or of a concept) is hierarchical. In fact, the cortex itself is organized hierarchically, from the simplest representations to the most complex.

Therefore, the human brain does not merely predict the next word in a sequence: it makes predictions on multiple timescales and at different levels of representation as one moves up the cortical hierarchy.

" Deep language algorithms are typically trained to predict words from their close contexts. Unlike these algorithms, the brain makes, according to predictive coding theory, (1) long-range and (2) hierarchical predictions." image source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

In the context of language, this theory entails testable hypotheses, such as that the brain should continuously predict a hierarchy of linguistic representations ranging from phonemes (what sounds are likely to occur) to words and even phrases (what meaning is likely to be conveyed). (source)

For example, previous studies have shown that when a participant hears the phrase "Once upon a …", the word "time" can be tracked in brain recordings even before it is pronounced.

Although we have an idea of the basic principle, the process itself is unclear. In fact, we do not know how multiple levels of predictions are implemented in the brain during speech.

Why do we care?

A better understanding of how this process works is the first step toward modifying large language models. Making LMs more similar to the human brain could reduce the gap between humans and LMs. And, as shown above, mapping the relationship between brains and models also helps us better understand the models themselves.


How to map a model to the brain

Schematic depiction of the naturalistic story-listening paradigm and data provenance. source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

The first question is: can we map the activations of a neural network onto the brain? Is there a relationship between the two?

The authors started from the Narratives dataset, in which 345 subjects listened to a variety of stories (27 different ones) for a total of about 4.6 hours of audio (more than 40,000 unique words).

The authors define:

  • w as a sequence of M words (the words in the story that have been listened to by the subjects)
  • Y as the fMRI recordings elicited by w
  • X as the activations of a deep language model given w as input (extracted from the 12th layer of GPT-2)

First, the authors decided to quantify the similarity between the fMRI recordings (Y) and the activations of a deep language model (X) when the model was given the same story as input. To quantify this, they created a so-called ‘brain score'.

image source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

The authors took the same sequence of words from the story that the participant heard (the dataset contains the transcript, aligned with the fMRI recording) and used it as input to the model. For each word, an activation vector was computed by feeding the network with the sequence up to that word (the model predicts the next word given a sequence).
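A minimal sketch of how such per-token activations could be extracted from one layer of GPT-2 (using the Hugging Face transformers library; the layer index and the text are illustrative, and the paper's alignment of words to fMRI samples is not reproduced here):

```python
# Extract one activation vector per token from a chosen GPT-2 layer.
# Layer index and text are illustrative placeholders.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

text = "Once upon a time, there was a story read aloud to the participants."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

layer = 12                                  # layer mentioned in the description above
X = outputs.hidden_states[layer][0]         # (n_tokens, hidden_size): one vector per token
print(X.shape)
```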

Once this was obtained, the authors tried to map the Y activations elicited by the audio story onto the X activations of the model:

To this end, we fitted a linear ridge regression W on a training set to predict the fMRI scans given the network's activations. Then, we evaluated this mapping by computing the Pearson correlation between predicted and actual fMRI scans on a held-out set (source)

In other words, they correlated X and Y after obtaining a linear projection of X.
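As a rough illustration (not the authors' exact pipeline), the brain score can be sketched as a ridge regression from activations X to fMRI responses Y, evaluated with a Pearson correlation on held-out data. The random arrays and shapes below are placeholders standing in for real activations and recordings:

```python
# A simplified sketch of the "brain score": fit a ridge regression from model
# activations X to fMRI responses Y, then correlate predictions with held-out data.
# Shapes and data are placeholders, not the authors' exact pipeline.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

n_samples, n_features, n_voxels = 1000, 768, 200      # illustrative sizes
X = np.random.randn(n_samples, n_features)            # model activations (one row per fMRI sample)
Y = np.random.randn(n_samples, n_voxels)              # fMRI recordings

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, shuffle=False)

ridge = Ridge(alpha=1.0).fit(X_train, Y_train)
Y_pred = ridge.predict(X_test)

def pearson_per_voxel(a, b):
    # Pearson correlation between predicted and actual responses, voxel by voxel
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

brain_score = pearson_per_voxel(Y_pred, Y_test)
print("mean brain score:", brain_score.mean())
```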

image source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

The results show, in line with previous studies, that GPT-2 activations map accurately onto areas distributed over both hemispheres of the brain. The brain score peaks in the auditory cortex and in the anterior and superior temporal areas.

Moreover, this is not restricted to GPT-2: it also holds for the other transformer models that were analyzed. In other words, the mapping generalizes to other state-of-the-art LMs.

Overall, these results confirm that deep language models linearly map onto brain responses to spoken stories. (source)

image source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

How does the brain predict long-range words?

Photo by Jessica Yap on Unsplash

The transformer has replaced recurrent neural networks (RNNs) because it can model long-term dependencies (cases in which the output depends on an input far in the past). This is thanks to self-attention, which allows much longer sequences to be used as input.

As we have said, although LMs have made great strides there is a big gap between LMs and humans.

Normally, when reading a text, listening to a speech, or holding a conversation, there are many long-range dependencies. The brain can often predict the next word or phrase from context with ease. So how does it handle them?

Next, we tested whether enhancing the activations of language models with long-range predictions leads to higher brain scores. (source)

In other words, does adding forecast representations improve our ability to predict brain activity?

The authors defined a forecast window containing information about words up to a distance d in the future. The model itself stays the same; only its input is augmented with these forecast representations (the forecast window). For a distance d (in words), the forecast window is the concatenation of the network's activations for seven consecutive words at distance d from the current word.

Note that it is not the future words themselves that are concatenated, but the model's activations: what is added is their representation, not the words.

The "forecast score" is simply "the gain in brain score when concatenating the forecast windows to the present GPT-2 activations." Or in simple words having a representation of the next words as much as helps us predict brain activity.

"c, To test whether adding representations of future words improves this correlation. d, Top, a flat forecast score across distances indicates that forecast representations do not make the algorithm more similar to the brain. Bottom, by contrast, a forecast score peaking at d > 1 would indicate that the model lacks brain-like forecast. The peak of Fd indicates how far off in the future the algorithm would need to forecast representations to be most similar to the brain." image source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

Our results show that F is maximal for a distance of d = 8 words and peaks in the areas typically associated with language processing (Fig. 2b–d). For comparison, there are 2.54 words per second on average in the stimuli. Thus, 8 words correspond to 3.15 s of audio (the time of two successive fMRI scans). (source)

In short, the authors noted several interesting things:

  • Every distance from zero up to ten words contributed to this forecasting effect.
  • The best window size is about eight words.
  • Random forecast representations did not help predict brain activity.
  • Even words generated by GPT-2 (instead of the true future words of the sequence) could be used; this gave a similar, though weaker, effect.
image source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

Together, these results reveal long-range forecast representations in the brain representing a 23% (±9% across individuals) improvement in brain scores. (source)

In other words, the authors argue that these data confirm that representations of the present and upcoming words account for a substantial proportion of the brain signals involved in language comprehension. Indeed, these signals map onto the regions previously defined as the language network.

Studies of cerebral anatomy have shown that the cortex is organized hierarchically. Inputs and information are processed in a hierarchical manner, so low-level acoustics, phonemes, and semantics are encoded by different structures in the brain according to a precise hierarchy.

example of how different regions are encoding different tasks in natural speech. source (here)

Thus, the authors asked: "Do the different levels of this cortical hierarchy predict the same time window?"

In other words, they tested how the forecast score varied along the cortical hierarchy, studying how it differed across regions and how it changed with the word distance d.

The results show that the prefrontal area forecast, on average, is further off in the future than temporal areas. (source)

Having shown that regions differ, several questions remain. How does this temporal difference relate to context? And is there a difference between syntactic and semantic content?

As has been shown, transformers encode language representations in a hierarchical way. For example, the various layers of BERT capture different representations: the lower layers capture phrase-level information, which is then progressively diluted. Linguistic information is learned hierarchically, with surface features in the lower layers, syntactic features in the middle layers, and semantic features in the higher layers.

image source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

For this reason, the authors computed the forecast score using different layers of GPT-2, and then mapped these per-layer forecast scores onto brain regions.

image source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

Together, these results suggest that the long-range predictions of frontoparietal cortices are more contextualized and of higher level than the short-term predictions of low-level brain regions. (source)

In other words, the representations of the model's different layers (which, as discussed above, grow in complexity with depth) map onto different parts of the brain. There is a correspondence between the depth of the forecast and the cortical hierarchy.

Low-level predictions are handled in different brain areas (the superior temporal sulcus and gyrus, to be precise) from those that deal with high-level predictions (the middle temporal, parietal, and frontal areas, which predict and integrate more complex information).

The authors then extracted separate syntactic and semantic forecast representations. For each word and its context, they generated ten possible futures with the same syntax as the original sentence: given the beginning of a sentence, they generated ten continuations with different words but the same syntactic properties (part of speech and dependency tree), and therefore different semantics (different meanings).

They then extracted GPT-2 activations (layer 8) and averaged them over the ten possible futures in order to isolate the syntactic component common to all of them. Subtracting this average (the syntactic representation) from the activation of the actual word sequence leaves the purely semantic representation. With these, they constructed separate syntactic and semantic forecast windows:

We built the syntactic and semantic forecast windows by concatenating the syntactic and semantic components of seven consecutive future words, respectively. (source)
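Schematically, and assuming we already have layer-8 activations for the true future words and for the ten generated futures, the decomposition could look like this (all arrays below are placeholders, not the authors' data):

```python
# A schematic sketch of the syntactic/semantic decomposition described above.
# Placeholder arrays: layer-8 GPT-2 activations of the actual future words and of
# k generated futures sharing the same syntax but not the same meaning.
import numpy as np

n_words, k_futures, dim = 500, 10, 768
true_future_acts = np.random.randn(n_words, dim)
generated_future_acts = np.random.randn(n_words, k_futures, dim)

# Averaging over the futures keeps what they all share: the syntactic component
syntactic = generated_future_acts.mean(axis=1)

# Subtracting it from the true activations leaves the (approximate) semantic component
semantic = true_future_acts - syntactic

# Each component can then be concatenated over seven consecutive future words to
# build separate syntactic and semantic forecast windows, as in the sketch above.
```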

image source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

This method decomposes the activations into two components, one syntactic and one semantic. Once this was done, the authors computed the forecast score as before, showing a difference in brain activity between the two.

The results show that semantic forecasts are long range (d = 8) and involve a distributed network peaking in the frontal and parietal lobes. By contrast, syntactic forecasts (Fig. 4b) are relatively short range (d = 5) and localized in the superior temporal and left frontal areas (Fig. 4c,d). (source)

These results, as the authors note, indicate that the brain makes predictions at multiple levels, with different areas taking on different tasks:

the superior temporal cortex predominantly predicts short-term, shallow and syntactic representations whereas the inferior-frontal and parietal areas predominantly predict long-term, contextual, high-level and semantic representations. (source)

image source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

Can we implement predictive coding inside an LM?

The authors noted, based on the previous results:

These results show that concatenating present and future word representations of GPT-2 leads to a better modelling of brain activity, especially in frontoparietal areas. (source)

Can these principles be translated into GPT-2 training?

In other words, can the model be taught, through fine-tuning, to predict longer-range, more contextual, and higher-level representations? And if so, does this improve the model's mapping onto the brain?

To test this, the authors fine-tuned GPT-2 on Wikipedia. Rather than relying only on the language-modelling objective (predicting the next word given the previous words), they modified the training objective: in addition to the classical one, they added a high-level, long-range objective under which the model must also predict high-level representations of words far ahead in the sequence.

In detail, the model must not only predict the next word but also the representation of upcoming words: specifically, the hidden state (layer 8) of a non-fine-tuned GPT-2 for the word at distance d = 8 in the sequence. In this way, the model also learns long-term, high-level, and more contextualized representations.
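A hedged sketch of what such a combined objective could look like: the usual next-word cross-entropy plus an auxiliary term that predicts the layer-8 hidden state of a frozen GPT-2 at distance d = 8. The linear prediction head, the loss weight, and the example sentence are assumptions for illustration, not the authors' exact setup.

```python
# Sketch of a combined objective: standard language modelling plus an auxiliary
# loss predicting the layer-8 hidden state of a frozen GPT-2 at distance d = 8.
# The linear head and the loss weight alpha are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

d, layer, alpha = 8, 8, 0.5                                     # distance, target layer, weight (assumed)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
student = GPT2LMHeadModel.from_pretrained("gpt2")               # model being fine-tuned
teacher = GPT2LMHeadModel.from_pretrained("gpt2").eval()        # frozen, non-fine-tuned GPT-2
for p in teacher.parameters():
    p.requires_grad = False

head = nn.Linear(student.config.n_embd, teacher.config.n_embd)  # predicts future hidden states

def combined_loss(input_ids):
    # 1) classic language-modelling loss (predict the next word)
    out = student(input_ids, labels=input_ids, output_hidden_states=True)
    lm_loss = out.loss

    # 2) long-range, high-level loss: from the state at position t, predict the
    #    teacher's layer-`layer` hidden state at position t + d
    with torch.no_grad():
        target = teacher(input_ids, output_hidden_states=True).hidden_states[layer]
    pred = head(out.hidden_states[-1][:, :-d])                   # states at positions t
    high_level_loss = nn.functional.mse_loss(pred, target[:, d:])

    return lm_loss + alpha * high_level_loss

ids = tokenizer("Once upon a time, the brain was already predicting words far into the future of the story.",
                return_tensors="pt").input_ids
loss = combined_loss(ids)
loss.backward()
```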

image source: [here](https://creativecommons.org/licenses/by/4.0/), license: here

The results show that GPT-2 fine-tuned with high-level and long-range modelling best accounts for frontoparietal responses (Fig. 5, >2% gain in the IFG and angular/supramarginal gyri on average, all P < 0.001). These results further strengthen the role of frontoparietal areas in predicting long-range, contextual and high-level representations of language. (source)

Simply put, providing this contextual representation makes the model better able to predict brain activity.


Parting thoughts

image by Anete Lūsiņa on Unsplash

This study opens up interesting perspectives, both for better understanding how the brain processes and responds to language, and for the links between neuroscience and machine learning.

By better understanding these mechanisms, we can design artificial intelligence models that are more similar to the human brain. Current LMs predict the next word given the previous words, but the human brain takes context and future possibilities into account. In fact, the brain predicts sensory inputs, compares its predictions with reality, and then updates its internal model.

So future models might take distant and abstract representations of future words into account during training. This study showed that it was not even necessary to change the model's architecture, although future models could also adjust their structure to be more effective.

Moreover, as demonstrated in other contexts, even though a future observation (e.g., a future image to be classified) remains indeterminate, its latent representation is much more stable; this is part of why methods such as contrastive learning have proven so effective. If this works elsewhere, why not build LM architectures and training procedures that use more contextual information and more information about the latent representation of the future?

The authors noted, however, that this study is preliminary:

Finally, the predictive coding architecture presently tested is rudimentary. A systematic generalization, scaling and evaluation of this approach on natural language processing benchmarks is necessary to demonstrate the effective utility of making models more similar to the brain. (source)

In any case, since the transformer was published, models have grown larger but the architecture has remained virtually the same. To overcome the current limitations of LMs, we need changes to both architecture and training. And what better source of inspiration than the human brain?

If you have found this interesting:

You can look for my other articles, you can also subscribe to get notified when I publish articles, and you can also connect or reach me on LinkedIn.

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to Machine Learning, artificial intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

Google Med-PaLM: The AI Clinician

META's LLaMA: A small language model beating giants

Stable diffusion to fill gaps in medical image data

Why Do We Have Huge Language Models and Small Vision Transformers?

