Using Vector Steering to Improve Model Guidance

Large language models are complex and do not always give perfect answers. To remedy this, people have tried many different techniques to guide the model's output: pre-training on larger datasets, pre-training models with more parameters, and using a vector database (or some other form of lookup) to add relevant context to the LLM's input. All of these bring some improvement, but no method today is fool-proof.
One intriguing way to guide the model is vector steering. A well-known example is Anthropic's Golden Gate Claude experiment: no matter what the user asks, Claude finds some clever way to bring up its favorite topic, the Golden Gate Bridge.

Today I'll go through the research on this topic and explain Anastasia Borovykh's excellent code implementation. If you're interested in learning more, I highly recommend checking out her video.
Let's dive in!
Theory
To build an intuition, it's important to understand how vector steering differs from prompt engineering at the data level. When we prompt engineer to change the model's behavior, we adjust the tokens we feed into the model in the hope that this will cause it to produce a different output.

Vector steering instead focuses on injecting information as vectors into specific layers of the model during a forward pass. Thus, the same input into the model will produce a different output if vector steering is used.
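To make the contrast concrete, here is a minimal sketch. The model, tokenizer, layer index, coefficient, and steering_vec are placeholders (building the vector and a full injection routine are covered below), and I assume a Phi-3/Llama-style model from transformers where the decoder layers live under model.model.layers:

# Prompt engineering: change the tokens, leave the model alone.
plain_ids = tokenizer("Tell me a story.", return_tensors="pt").input_ids
nudged_ids = tokenizer("Tell me a story about weddings.", return_tensors="pt").input_ids

# Vector steering: same tokens, but a hook edits a layer's hidden state mid-forward-pass.
def hook(module, inputs):
    # inputs[0] is the layer's hidden state, shape [batch, seq_len, hidden_size]
    inputs[0][:, :, :] = inputs[0] + coefficient * steering_vec

handle = model.model.layers[6].register_forward_pre_hook(hook)
steered_ids = model.generate(plain_ids, max_new_tokens=60)
handle.remove()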
Hidden State Subtraction
First, we have to figure out how to get a vector representation of our information. The algorithm below shows the exact operations needed to do this. We start by creating two prompts and running our model on them separately.

Rather than using the final state of the model to predict the next token (as we would typically do during inference), we focus on a specific layer and capture its hidden state vector (called activations above). We then take the difference between the two prompts' hidden state vectors to obtain our steering vector.
We can now inject this steering vector into any forward pass we want to adjust the model's behavior. Note that we place a coefficient in front of the steering vector; this gives us control over how much of an impact the vector should have.
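As a rough sketch of the recipe (assuming a Hugging Face-style causal LM; base_ids, target_ids, and the layer index are illustrative, and the real implementation appears in the Code section below):

layer = 6  # which layer to read from is a design choice (see the next section)

# hidden state of the last token at that layer, for each prompt
base_h = model(base_ids, output_hidden_states=True).hidden_states[layer][:, -1, :]
target_h = model(target_ids, output_hidden_states=True).hidden_states[layer][:, -1, :]

steering_vec = (target_h - base_h).mean(dim=0)  # [hidden_size]

# during a later forward pass, that same layer's hidden state becomes:
#   hidden = hidden + coefficient * steering_vec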
Choices
You likely noticed that I did not tell you exactly which layer we pull the vectors from, or where we add the steering vector back in. These are purposeful omissions, as both questions are still open for discussion.
The authors ran experiments to determine which layers were best for injection. To do so, they created a steering vector meant to make the LLM discuss weddings, injected it into different layers, and then measured the average number of wedding-related words as a fraction of all output tokens.

The above graph shows that earlier layers have an outsized impact on steering, with layer 6 having the largest effect. Note, however, that different prompts require different coefficients to get the best result; in the paper these coefficients vary from 1 to 100.
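A rough sketch of that measurement (the wedding-word list, the generate_with_steering helper, the prompt, and the coefficient are placeholders, not the authors' exact setup):

WEDDING_WORDS = {"wedding", "weddings", "bride", "groom", "marriage", "married", "honeymoon"}

def wedding_fraction(text):
    # fraction of output tokens that are wedding-related
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in WEDDING_WORDS for w in words) / max(len(words), 1)

# sweep the injection layer (the paper also varies the coefficient from 1 to 100)
scores = {}
for layer in range(num_layers):
    out_text = generate_with_steering(prompt, layer=layer, coefficient=coefficient)
    scores[layer] = wedding_fraction(out_text)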
Example Results from Paper
To see if this works, we want to determine both that it steers the model successfully and that it doesn't degrade performance on unrelated tasks.
To examine this, the authors created a dataset from OpenWebText. If a document contained wedding-related words, it was considered wedding-related. They sampled 300,000 documents, roughly half of which were wedding-related.
The authors broke these documents into sentences and then measured the model's perplexity, calculating the log loss on every token of a sentence and then averaging this value per document for each model. The graph below compares an unmodified model with our vector-injected model.

We can see that as the wedding-word frequency increases, the perplexity ratio gets lower. This is because the base model has higher perplexity than the steered model, signaling better comparative performance by the steered model. Interestingly, when no wedding words are involved, the two models have the same perplexity. This shows that the steered model is not impaired in unrelated situations and still sees an improvement when the content is relevant to the steering vector.
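To make the perplexity measurement concrete, here is a simplified sketch of the per-sentence piece (assuming a Hugging Face-style causal LM; the paper's exact evaluation pipeline differs in its details):

import torch

def sentence_perplexity(model, input_ids):
    # passing labels=input_ids returns the mean next-token log loss for the sentence
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

# per document: average the sentence values for the steered and unmodified models,
# then compare the two (the "perplexity ratio" in the graph above)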
Theories On Why This Works
To understand the research here, I need to go over some quick terminology. In machine learning, a feature can be thought of as a human-understandable property of the data. If we are creating a model to determine whether a baseball player is likely to hit the ball, a feature might be his batting average. Unlike in regression or tree models, it is difficult to say exactly how features are represented in a neural network. Consequently, while we see these models working, our ability to interpret them is quite limited.
Anthropic's interpretability team studied Claude 3 Sonnet's ability to be steered. They based their research on the linear representation hypothesis and the superposition hypothesis. In short, these theories posit that neural networks represent features as directions in their activation space. The image below gives a simplified visualization of how a feature is represented.

The image shows only two dimensions and no hidden layers, which limits how many features it could represent. Because LLMs have an enormous number of parameters, neurons, and hidden layers, they can represent an incredible number of features under this hypothesis.
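Under this hypothesis, asking "how strongly is a feature active in this hidden state?" reduces to a dot product with that feature's direction. A toy sketch with made-up vectors:

import torch

hidden_size = 8  # toy size; real models have thousands of dimensions
feature_dir = torch.randn(hidden_size)
feature_dir = feature_dir / feature_dir.norm()  # a feature is a direction, so normalize it

activation = torch.randn(hidden_size)  # a hidden state from some layer
feature_strength = activation @ feature_dir  # how much of the feature this activation expresses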
The Anthropic investigation found that the features within Claude can be extremely diverse, representing multilingual, multimodal, concrete, and abstract topics.

Code
For this section, I'll be explaining the code written by Anastasia Borovykh in part two of her video series. You can find her full code here.
At a high level, we'll use PyTorch to generate steering vectors and then inject them into the Phi-3 model.
Finding Steering Vectors
import torch
from tqdm import tqdm

def find_steering_vecs(model, base_toks, target_toks, batch_size=16):
    '''
    We want to find the steering vector from base_toks to target_toks (we do target_toks - base_toks)
    Inputs:
    :param model: the model to use
    :param base_toks: the base tokens [len, seq_len]
    :param target_toks: the target tokens [len, seq_len]
    Output:
    :return steering_vecs: the steering vectors [hidden_size]
    '''
    device = model.device
    num_its = len(range(0, base_toks.shape[0], batch_size))
    steering_vecs = {}
    for i in tqdm(range(0, base_toks.shape[0], batch_size)):
        # pass through the model, keeping every layer's hidden states
        # (tuple of length num_layers, each element [batch_size, seq_len, hidden_size])
        base_out = model(base_toks[i:i+batch_size].to(device), output_hidden_states=True).hidden_states
        target_out = model(target_toks[i:i+batch_size].to(device), output_hidden_states=True).hidden_states
        for layer in range(len(base_out)):
            # take the last token, average the difference over the batch,
            # and accumulate a running mean across batches
            if i == 0:
                steering_vecs[layer] = torch.mean(target_out[layer][:, -1, :].detach().cpu() - base_out[layer][:, -1, :].detach().cpu(), dim=0) / num_its  # [hidden_size]
            else:
                steering_vecs[layer] += torch.mean(target_out[layer][:, -1, :].detach().cpu() - base_out[layer][:, -1, :].detach().cpu(), dim=0) / num_its
    return steering_vecs
We start off by running our model over the two sets of prompts: base_toks are the regular tokens and target_toks represent the direction we want to steer toward. For each batch, we do the following.
We perform a forward pass on the model using the base_toks in the batch, then do the same for the target_toks, keeping the full tuple of hidden states from both runs. We then iterate through each layer, taking the hidden state of only the last token. For each layer, we subtract base_toks' hidden state from target_toks' (matching the docstring: target_toks - base_toks), average over the batch dimension (the first dimension), and divide by the number of batches so that accumulating across batches gives us an average, with shape [hidden_size].
All of this gives us back the average difference between the hidden states, for every layer of the model.
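A hypothetical call to this function, assuming Phi-3-mini loaded through the transformers library (the prompt pairs are illustrative, and I left-pad so that the last position read by find_steering_vecs is a real token):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

base_prompts = ["Tell me about your weekend.", "Describe your favorite city."]
target_prompts = ["Tell me about your wedding weekend.", "Describe your favorite wedding city."]

base_toks = tokenizer(base_prompts, return_tensors="pt", padding=True).input_ids
target_toks = tokenizer(target_prompts, return_tensors="pt", padding=True).input_ids

steering_vecs = find_steering_vecs(model, base_toks, target_toks)  # dict: layer -> [hidden_size]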
Using the Steering Vectors
from einops import einsum

def do_steering(model, test_toks, steering_vec, scale=1, normalise=True, layer=None, proj=True, batch_size=16):
    '''
    Input:
    :param model: the model to use
    :param test_toks: the test tokens [len, seq_len]
    :param steering_vec: the steering vector [hidden_size]
    :param scale: the scale to use
    :param normalise: whether to normalise the steering vector
    :param layer: the layer to modify; if None, we modify all layers
    :param proj: whether to project the activations onto the steering vector
    Output:
    :return output: the steered model output [len, generated_seq_len]
    '''
    # define a hook to modify the input into the layer
    if steering_vec is not None:
        def modify_activation():
            def hook(model, input):
                if normalise:
                    sv = steering_vec / steering_vec.norm()
                else:
                    sv = steering_vec
                if proj:
                    # project the hidden states onto the steering direction
                    sv = einsum(input[0], sv.view(-1, 1), 'b l h, h s -> b l s') * sv
                input[0][:, :, :] = input[0][:, :, :] - scale * sv
            return hook
        handles = []
        for i in range(len(model.model.layers)):
            if layer is None:  # attach the hook to every layer
                handles.append(model.model.layers[i].register_forward_pre_hook(modify_activation()))
            elif layer is not None and i == layer:  # or only to the chosen layer
                handles.append(model.model.layers[i].register_forward_pre_hook(modify_activation()))
    # pass through the model
    outs_all = []
    for i in tqdm(range(0, test_toks.shape[0], batch_size)):
        outs = model.generate(test_toks[i:i+batch_size], num_beams=4, do_sample=True, max_new_tokens=60)  # [num_samples, seq_len]
        outs_all.append(outs)
    outs_all = torch.cat(outs_all, dim=0)
    # remove all hooks
    if steering_vec is not None:
        for handle in handles:
            handle.remove()
    return outs_all
This is the function that modifies our model's forward pass. We first check whether a steering vector is present. If it isn't, we simply generate as we normally would and return the generated tokens.
If the steering vector is present, we use the hook we've defined to modify the model's forward passes. The hook has two options: normalise and proj. If normalise is true, we rescale the steering vector to unit norm so it doesn't have a disproportionate impact on the input. If proj is true, we project the input onto the steering vector and subtract that projection instead of the raw vector; sometimes we want the steering vector applied to the input directly, so the projection flag gives us some flexibility. Whichever value proj takes, we end up subtracting the (scaled) vector from the input, and the scale parameter determines how much of an impact the steering vector has.
Finally, we register the hook either on every layer of the model or only on the specific layer requested. Once we're set up, we run the forward pass over test_toks in batches of size batch_size. To clean up, we remove the hooks from the model and return the generated tokens.
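And a hypothetical end-to-end call, reusing the model, tokenizer, and steering_vecs from the earlier snippet (layer 6 and the scale are illustrative choices):

test_prompts = ["What should I do this weekend?"]
test_toks = tokenizer(test_prompts, return_tensors="pt", padding=True).input_ids.to(model.device)

layer = 6
steered = do_steering(model, test_toks, steering_vecs[layer].to(model.device), scale=4, layer=layer)
baseline = do_steering(model, test_toks, steering_vec=None)

print(tokenizer.batch_decode(steered, skip_special_tokens=True))
print(tokenizer.batch_decode(baseline, skip_special_tokens=True))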
Closing
In closing, we've seen both the theory behind steering vectors and how to implement them in your own models. Given the falling cost of token generation and the increasing compute spent on improving model performance, it will be interesting to see whether vector steering gets adopted as a way to improve model accuracy.
From a high-level view, vector steering works similarly to Direct Preference Optimization (DPO), so it makes sense that it would be an effective way to guide behavior.
If you'd like to see more examples of how steering vectors change output, I put together a Hugging Face Space. You can use it to build your intuition for how to create steering vectors and which layer to use. Note that the Space runs on CPU only, so inference takes a while.
It's an exciting time to be building!
[1] Borovykh, A., "Steering vectors: tailor LLMs without training. Part I: Theory (Interpretability Series)" (2024), YouTube
[2] Borovykh, A., "Steering vectors: tailor LLMs without training. Part II: Code (Interpretability Series)" (2024), YouTube
[3] Turner, A., et al., "Steering Language Models with Activation Engineering" (2024), arXiv
[4] Arditi, A., et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024), arXiv
[5] Templeton, A., et al., "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (2024), Transformer Circuits Thread
[6] Priyanka, "Perplexity of Language Models" (2022), Medium