Automated Interpretability

December 3, 2024 • David Atkinson, Sheridan Feucht
Is it possible to automate the process of interpreting units in a neural network? Maybe, if we turn LLMs back on themselves. The papers for today's class focus on how we can use LLMs to explain the inputs/outputs of units within a network.

Language Models Can Explain Neurons in Language Models

This paper was published in 2023 by the (departed but not forgotten) OpenAI Interpretability team.

Approach

The authors propose to use LLMs to produce explanations of neuron activations. Given a sequence of tokens and their activations from a "subject model", they ask a second LLM (the "explainer") to produce a short explanation of what a particular neuron fires on.

How do we evaluate these explanations, though? The authors introduce a third and final LLM, the "simulator", which uses the explanation to predict the activations of the neuron on the evaluation text. Typically, the subject model is small (they use the 1.5B parameter GPT-2 XL), while the explainer and simulator are large. (In this case, they use GPT-4, which is thought to have ~1.75T parameters).

Finally, the predicted activations are scored according to how similar they are to the true activations.
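
To make the pipeline concrete, here is a minimal sketch of the explain-and-simulate steps, with a hypothetical `query_llm` wrapper and toy prompts standing in for the paper's actual few-shot prompts:

```python
import numpy as np

def explain_neuron(tokens, activations, query_llm):
    """Ask the explainer LLM for a short description of what the neuron fires on.

    `query_llm` is a hypothetical wrapper around an LLM API call; the real
    prompts in Bills et al. are few-shot and use normalized activations.
    """
    examples = "\n".join(f"{tok}\t{act:.2f}" for tok, act in zip(tokens, activations))
    prompt = (
        "Here are tokens from a document and a neuron's activation on each token:\n"
        f"{examples}\n"
        "In a short phrase, what does this neuron fire on?"
    )
    return query_llm(prompt)

def simulate_activations(explanation, tokens, query_llm):
    """Ask the simulator LLM to predict the neuron's activation (0-10) on each token."""
    header = f"A neuron is described as: {explanation!r}.\n"
    preds = []
    for tok in tokens:
        prompt = header + f"On a scale of 0-10, how strongly does it activate on the token {tok!r}?"
        preds.append(float(query_llm(prompt)))
    return np.array(preds)
```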

Scoring

To calculate that final score, the paper uses two methods. The gold standard is interventional ("ablation scoring"): simply replace the neuron's activations with those predicted by the simulator. To quantify the quality of the explanation, the authors then compare two quantities: the JS divergence between the model's output distributions under the simulated activations and its true output distributions, and the JS divergence between its output distributions under mean ablation and its true output distributions. More precisely, the ablation score (sketched in code after the definitions below) is:

\[ 1 - \frac{\mathbb{E}_x\left[\operatorname{AvgJSD}\bigl(m(x, n{=}s(x)) \,\|\, m(x)\bigr)\right]}{\mathbb{E}_x\left[\operatorname{AvgJSD}\bigl(m(x, n{=}\mu) \,\|\, m(x)\bigr)\right]} \]

where:

  • \(m(x)\) denotes the per-token output distributions over text input \(x\)
  • \(m(x, n=s(x))\) denotes the same, when replacing the neuron \(n\)'s activations with those predicted by the simulator
  • \(m(x, n=\mu)\) denotes the same when replacing the neuron \(n\)'s activations with its mean activation over all tokens
  • \(\operatorname{AvgJSD}(\cdot \,\|\, \cdot)\) is the JS divergence, averaged over token positions
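
As a concrete illustration, here is a minimal numpy sketch of that formula, assuming we already have the per-token output distributions under the three conditions (true activations, simulator-substituted activations, and mean-ablated activations):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability distributions."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def avg_jsd(dists_a, dists_b):
    """JS divergence averaged over token positions (one distribution per position)."""
    return np.mean([js_divergence(a, b) for a, b in zip(dists_a, dists_b)])

def ablation_score(true_dists, simulated_dists, mean_ablated_dists):
    """1 - AvgJSD(simulated || true) / AvgJSD(mean-ablated || true), averaged over inputs.

    Each argument is a list (over inputs x) of per-token output distributions:
    m(x), m(x, n=s(x)), and m(x, n=mu) in the paper's notation.
    """
    num = np.mean([avg_jsd(s, t) for s, t in zip(simulated_dists, true_dists)])
    den = np.mean([avg_jsd(m, t) for m, t in zip(mean_ablated_dists, true_dists)])
    return 1.0 - num / den
```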

Unfortunately, ablation scoring is expensive to compute. Instead, the paper mostly uses the correlation coefficient between the simulated activations and the true activations, which they find to be a reasonable proxy for ablation scoring.

(One potential concern is "steganography": that the explanations successfully predict neuron behavior, but do so in uninterpretable ways. To test this, they also validate that humans do indeed prefer explanations that score well according to their metric.)
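
Correlation scoring itself is simple to sketch: pool the true and simulated activations over tokens and take the Pearson correlation (a minimal numpy version, not the paper's exact implementation):

```python
import numpy as np

def correlation_score(true_acts, simulated_acts):
    """Pearson correlation between true and simulated activations, pooled across tokens."""
    true_acts = np.concatenate([np.asarray(a).ravel() for a in true_acts])
    simulated_acts = np.concatenate([np.asarray(a).ravel() for a in simulated_acts])
    return np.corrcoef(true_acts, simulated_acts)[0, 1]
```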

Distributional concerns

Any approach that relies on activations is forced to choose a dataset of inputs from which to generate those activations.

When generating explanations, they found that providing highly-activating examples to the explainer was more effective than providing random ones. So in practice, they run the model through 50,000 examples, and use the top 20 inputs to generate explanations.

When scoring, they also find that running the explainer-simulator pair on random inputs can run into difficulties: most importantly, neurons tend to activate rarely, making it difficult to get a representative sample of activations. Additionally, polysemanticity means that simple explanations for any particular neuron may not actually exist. In such cases, it still seems worth validating the method under the assumption that polysemanticity is solved. To do so, the authors also report "top-and-random" scores, where they run the explainer-simulator pair on both top-activating sequences as well as a random sample of sequences.
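
A sketch of how a top-and-random evaluation set might be assembled; `activations` is a (sequences × tokens) array for one neuron, and the split sizes here are illustrative rather than the paper's exact numbers:

```python
import numpy as np

def top_and_random_indices(activations, n_top=5, n_random=5, seed=0):
    """Pick the top-activating sequences plus a random sample of the remainder."""
    max_per_seq = activations.max(axis=1)          # peak activation per sequence
    top = np.argsort(-max_per_seq)[:n_top]         # top-activating sequences
    rest = np.setdiff1d(np.arange(len(activations)), top)
    rng = np.random.default_rng(seed)
    random = rng.choice(rest, size=n_random, replace=False)
    return top, random
```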

Results

Generally, the scores are quite low, and get lower as the depth of the neuron increases. In the following plot, for example, the method is compared (in orange) to a simple lookup table, which stores the average activation of the neuron per token, and (in green) to GPT-4's explanation of the top examples in that lookup table.

Top-and-random scores are more encouraging, though.

Finding Directions

We know, however, that neurons are often polysemantic, making explanations difficult to find. In response, the authors propose an optimization procedure to find explainable directions. At each step, they:

  1. Generate a good explanation of the direction, using the same three-step process as before (directions are randomly initialized, to start)
  2. Optimize the direction, using the gradient of the score with respect to the direction

The key insight here is that the correlation between the direction's activation and the simulator's predicted activations is differentiable, if you hold those predicted activations constant.
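
A minimal PyTorch sketch of one such step, assuming `direction` was created with `requires_grad=True` and that the simulator's predicted activations are treated as constants:

```python
import torch

def direction_step(direction, hidden_states, simulated_acts, lr=1e-2):
    """One gradient-ascent step on the correlation between the direction's
    activations and the simulator's (frozen) predicted activations.

    hidden_states: (n_tokens, d_model); direction: (d_model,) with requires_grad=True;
    simulated_acts: (n_tokens,) tensor of simulator predictions, held fixed.
    """
    acts = hidden_states @ direction                        # direction's activations
    acts = (acts - acts.mean()) / (acts.std() + 1e-6)       # standardize
    sims = (simulated_acts - simulated_acts.mean()) / (simulated_acts.std() + 1e-6)
    corr = (acts * sims).mean()                             # Pearson correlation
    corr.backward()
    with torch.no_grad():
        direction += lr * direction.grad                    # gradient ascent
        direction.grad.zero_()
    return corr.item()
```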

On a chosen comparison layer (10), this procedure resulted in an average top-and-random score of 0.718; for comparison, the average score for neurons in layer 10 was 0.147. This score of 0.718 is better than the scores of roughly 98.5% of layer-10 neurons under the original method.

Scaling Trends

Much of the rest of the paper investigates what happens when one of the three models is scaled up or down. As a quick summary, the authors find that:
  • Performance improves, slowly, as the explainer model is scaled up.
  • Simulator quality improves with model size, but has still not reached human performance, as measured by human evaluation.
  • Subject model size has a significant effect: as you scale the subject model up, the explanation scores plummet.

MAIA: A Multimodal Automated Interpretability Agent

MAIA takes the example of Bills et al. and extends it in two ways. First, it is multimodal and modular, which enables the explainer model to use a rich set of tools such as image synthesis and editing. Perhaps more importantly, it is iterative: given a neuron, MAIA can propose inputs, examine the resulting activations, and propose new experiments based on those results.

The paper was co-led by Tamar Rott Shaham (at MIT) and Sarah Schwettmann (formerly of MIT, now at interpretability startup Transluce).

MAIA starts with a query, such as "what does neuron 42 activate for?" The MAIA harness then passes that query, along with a description of the MAIA API, to GPT-4V, which can then make use of the API in generating an experiment or explanation. Experiments can leverage a variety of tools (sketched in code after the list), namely:

  • dataset_exemplars: return the set of images that maximally activate the component of interest
  • text2image: use Stable Diffusion to generate an image according to a given text prompt
  • edit_images: use Instruct-Pix2Pix to construct a modified image according to a prompt
  • describe_images: use a new GPT-4V instance to describe an image
  • log_experiment: record the previous experiment, causing its results to be included in future prompts
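
A hypothetical Python rendering of that tool interface; the names roughly follow the bullets above, but the real MAIA API and its wrapped models differ in their details:

```python
from dataclasses import dataclass, field

@dataclass
class MAIATools:
    """Sketch of the tool interface exposed to the GPT-4V agent (hypothetical)."""
    system: object                      # wraps the component under study plus the generative models
    log: list = field(default_factory=list)

    def dataset_exemplars(self):
        """Return images that maximally activate the component of interest."""
        return self.system.top_activating_images()

    def text2image(self, prompts):
        """Generate images from text prompts (Stable Diffusion in the paper)."""
        return [self.system.diffusion(p) for p in prompts]

    def edit_images(self, images, edit_prompt):
        """Edit images according to a prompt (Instruct-Pix2Pix in the paper)."""
        return [self.system.pix2pix(img, edit_prompt) for img in images]

    def describe_images(self, images):
        """Describe images with a fresh GPT-4V instance."""
        return [self.system.vlm_describe(img) for img in images]

    def log_experiment(self, description, results):
        """Record an experiment so its results appear in future prompts."""
        self.log.append((description, results))
```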

Evaluation

To evaluate the quality of MAIA-generated explanations, the authors use two baselines:

  • MILAN (Hernandez et al., 2022), a single-step method of generating explanations that maximizes PMI between the explanation and the portions of the image the neuron activates for.
  • Human descriptions, where humans are given the same prompt and API access, and asked to play the role of the experimenter.

The primary evaluation process proceeds as follows (a rough code sketch follows the steps):

  1. For each method, give the proposed explanation to a fresh GPT-4 instance, and ask it to produce fourteen image generation prompts. Seven of the prompts should produce images that activate the component highly, and seven should produce generic images with no relationship to the explanation.
  2. Take all 42 prompts, and for each explanation, ask a different GPT-4 instance for the seven prompts which would most highly activate the component (positive prompts), as well as the seven prompts which would least activate the component (negative prompts).
  3. Record the average activations of the component over the positive and negative prompts. If the explanation is a good one, then the positive prompts should have high activations, and the negative prompts should have low activations.
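
A minimal sketch of the scoring in step 3, where `text2image` and `activation_of` are hypothetical callables for image generation and for running the component on an image:

```python
import numpy as np

def predictive_eval(positive_prompts, negative_prompts, text2image, activation_of):
    """Average component activation on positive vs. negative prompts.

    A good explanation should yield high activations on the positive prompts
    and low activations on the negative ones.
    """
    pos = [activation_of(text2image(p)) for p in positive_prompts]
    neg = [activation_of(text2image(p)) for p in negative_prompts]
    return np.mean(pos), np.mean(neg)
```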

Finally, the authors conduct a trio of further experiments:

  1. They construct a set of synthetic neurons—with known ground-truth labels—and find that human judges clearly prefer MAIA explanations over MILAN explanations, and marginally prefer them to human explanations.
  2. They apply MAIA to a model trained on the Spawrious dataset (Lynch et al., 2023), and find that MAIA can identify a set of neurons that predict a single dog breed independently of the environment. They then train a logistic regression model on the selected neurons, and find that it outperforms a model trained on all neurons when evaluated on the test set.
  3. They use MAIA to detect bias in an ImageNet classifier. In this setting, MAIA is asked to find output dimensions of the model that are biased towards a particular subset of images within the corresponding class.

Outlook

Considering MAIA as a response to Bills et al., MAIA has a number of strengths:

  • It's less opinionated about the form of the explainer model, and so is able to incorporate new advances and tools.
  • The prompting interface enables more use-cases than the three-step process used by Bills et al.
  • Its iterative and causal abilities allow it to find parts of the explanation space that might otherwise have been missed.

Conversely, Bills et al. features—or at least proposes—a more principled evaluation procedure, and its poor performance lets us see how much work there is left to do. Bills et al. also provides some insight into scaling trends, which will ultimately determine how successful this kind of approach can be.

Both, though, are subject to two critiques highlighted by Huang et al. (2023):

  1. Evaluating just the ability of an explanation to predict activations is not enough. We need to perform interventional evaluations, such as the ablation scoring proposed, but largely unused, by Bills et al.
  2. It's unclear what form explanations should actually take. Many of the best-performing explanations are extremely vague. As Huang et al. say: "There may be a way to define a fragment of natural language that is less prone to these interpretative issues, and then we could seek to have explainer models generate such language. However, if we do take these steps, we are conceding that model explanations actually require specialized training to interpret. In light of this, it may be better to choose an existing, rigorously interpreted formalism (e.g., a programming language) as the medium of explanation."

I think both papers are clarifying, in that they show that the hard parts of interpretability are primarily conceptual. We don't know what units should be interpreted, and we don't know what kinds of interpretations should be produced.

Explaining Black Box Text Modules in Natural Language With Language Models

This paper provides another approach to describing the selectivity of units within language models, which also works for fMRI data! The first authors of this paper are Chandan Singh, a senior researcher at Microsoft Research, and Aliyah R. Hsu, a fifth-year PhD student at UC Berkeley. It seems that this project was primarily completed at Microsoft Research, but they also collaborate with Alexander Huth, a professor of neuroscience at UT Austin, and his PhD student, Richard Antonello (who is now a postdoc at Columbia).

This paper introduces a method for analyzing text modules, which they define as any function that maps text to a continuous scalar value. This could be a neuron within an LLM or a voxel in an fMRI scan (when responding to language stimuli). To analyze these modules, they introduce a method called Summarize And SCore (SASC), which takes some text module f and generates a natural language description of the module, as well as a score for how good the explanation is. SASC is a two-step process (sketched in code after the two steps):

  1. Summarization. Based on a reference corpus, ngrams that activate f the most are sampled. A pre-trained "helper LLM" is then used to generate a summary of those ngrams, which acts as a description of what f responds to.
  2. Synthetic Scoring. To evaluate an explanation of f, the helper LLM is used to generate synthetic data conditioned on that explanation. Then, the mean difference between the text module evaluated on synthetic text \( f(Text^+)\) and the text module evaluated on unrelated synthetic text \( f(Text^-) \) is calculated. Their score is measured in units of standard deviations; for example, a SASC score of \( 1\sigma_f \) indicates that synthetic data based on that explanation increased activations of f by one standard deviation from the mean.
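
A minimal sketch of both steps, with a hypothetical `helper_llm` call, a toy n-gram sampler, and the corpus standing in for unrelated text (the paper's implementation and its \( \sigma_f \) normalization differ in details):

```python
import numpy as np

def top_ngrams(corpus_tokens, f, n=3, k=30):
    """Return the n-grams from the reference corpus that most activate the module f."""
    ngrams = [" ".join(corpus_tokens[i:i + n]) for i in range(len(corpus_tokens) - n + 1)]
    scores = [f(g) for g in ngrams]
    order = np.argsort(scores)[::-1]
    return [ngrams[i] for i in order[:k]]

def sasc(f, corpus_tokens, helper_llm, n_synthetic=20, seed=0):
    """Summarize-and-score: explain f, then score the explanation in units of sigma_f (approx.)."""
    # Step 1: summarization of the top-activating n-grams.
    explanation = helper_llm("Summarize what these phrases have in common:\n"
                             + "\n".join(top_ngrams(corpus_tokens, f)))
    # Step 2: synthetic scoring. Text+ is generated from the explanation;
    # Text- is drawn from the generic corpus, as in the paper's practice.
    text_plus = [helper_llm(f"Write a sentence about: {explanation}") for _ in range(n_synthetic)]
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, max(1, len(corpus_tokens) - 10), size=n_synthetic)
    text_minus = [" ".join(corpus_tokens[s:s + 10]) for s in starts]
    plus = np.array([f(t) for t in text_plus])
    minus = np.array([f(t) for t in text_minus])
    score_sigma = (plus.mean() - minus.mean()) / (minus.std() + 1e-12)  # rough sigma_f estimate
    return explanation, score_sigma
```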

Their Figure 1 shows an example of this process for a module that responds to ngrams like "wow I never". As they mention, the efficacy of this method depends a lot on the length of ngrams fed through the model; however, longer ngrams require more computation time. Another thing to note is that in practice, they use a large generic corpus to calculate \( f(Text^-) \), instead of synthetically generating "neutral" text.

Synthetic Module Evaluation

First, the authors see how well SASC works when trying to recover descriptions of synthetic text modules. They use a dataset from Zhong et al. (2021) consisting of keyphrase descriptions of examples in a dataset (e.g., "related to math", "contains sarcasm"), and then use a text embedding model to embed input examples and output the negative Euclidean distance between the input and the keyphrase description. This gives us text modules whose "ground truth" explanations are known.
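
A sketch of how one such ground-truth module can be built, assuming an `embed` function from any sentence-embedding model:

```python
import numpy as np

def make_synthetic_module(keyphrase, embed):
    """Return a text module f that scores text by closeness to a keyphrase description.

    `embed` is any text-embedding function (hypothetical here); f returns the
    negative Euclidean distance, so texts matching the keyphrase score highest.
    """
    target = embed(keyphrase)
    def f(text):
        return -float(np.linalg.norm(embed(text) - target))
    return f
```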

They find that SASC successfully identifies 88% of the ground-truth explanations. If the reference corpus is restricted, or if a lot of noise is added to f, SASC is still successful about 67% of the time. However, they do use examples from the Zhong et al. (2021) dataset as their reference corpus, which seems like it might inflate the efficacy of this method, even in the restricted setting.

BERT Evaluation

Instead of analyzing individual neurons in BERT, the authors analyze transformer factors from Yun et al. (2021). These are features found via sparse over-complete dictionary learning, in a paper that was a precursor to Anthropic's SAE investigations. They find that, using their scoring method with GPT-3, SASC explanations score higher than human explanations. However, scores become worse in later layers.

To further evaluate these explanations, they fit linear classifiers to the factor coefficients in order to perform specific tasks like emotion classification, news topic classification, and movie review sentiment classification. When the top 25 regression coefficients are examined qualitatively, they find that, e.g., the feature labeled "professional sports teams" contributes heavily to classification of news articles as sports-related.

fMRI Comparison

Here are two interesting highlights from their fMRI analysis (read the full paper for details). One is that explanation scores for fMRI voxels are much lower than they are for early layers in BERT (but similar to middle BERT layers). The other is that if you fit a topic model to all of the explanations found by SASC, fMRI explanations have a much higher proportion of explanations related to the topic "action, movement, relationships...". This is apparently consistent with prior findings showing that the largest axis of variation in fMRI voxels is between social and physical concepts.

Code Resources

Bills et al. provide a collection of notebooks to explore here. Also make sure to check out the interactive explorer for the neuron explanations their model found here. Similarly, the MAIA paper provides an experiment browser and a demo notebook.