Circuit Discovery
October 17, 2024 • Philip Yao, Nikhil Prakash
As interpretability researchers, we are often interested in understanding the underlying mechanisms by which deep neural networks perform various kinds of tasks. However, mechanistic interpretability work before 2022 primarily focused on understanding specific features or neurons in the model using techniques like probing, causal probing, and activation patching (or causal tracing), rather than investigating the entire end-to-end circuit responsible for performing a task (Bau et al., Elazar et al., Dar et al., inter alia). The landscape started to change in 2022, when a couple of works traced out the entire circuit for synthetic tasks. Here, we will explore three of these relevant works; there are also several other papers that discover the underlying circuit for other tasks, such as Prakash et al., Hanna et al., Stefan et al., and Mathew et al.
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
This paper was published at
ICLR 2023.
Kevin Wang, Freshman at Harvard (work done at Redwood Research)
Alexandre Variengien, AI Safety Researcher at EffiSciences (work done at Redwood Research)
Arthur Conmy, Research Engineer at Google DeepMind (work done at Redwood Research)
Buck Shlegeris, CTO of Redwood Research
Jacob Steinhardt, Assistant Professor at UC Berkeley
What is a circuit?

If we think of a model as a computational graph M where nodes are terms in its forward pass (neurons, attention heads, embeddings, etc.) and edges are the interactions between those terms (residual connections, attention, projections, etc.), a circuit C is a subgraph of M responsible for some behavior (such as completing the IOI task).
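As a toy illustration of this definition (not taken from the paper), one can represent model components as nodes of a directed graph and treat a circuit as a subgraph of it; the node names below are arbitrary placeholders.

```python
import networkx as nx

# Toy illustration: the model as a graph of components, and a circuit as a
# subgraph of it. The node names here are arbitrary placeholders.
M = nx.DiGraph()
M.add_edges_from([
    ("embed", "head_0.0"), ("head_0.0", "head_9.9"),
    ("embed", "head_5.5"), ("head_5.5", "head_9.9"),
    ("head_9.9", "unembed"),
])

circuit_nodes = {"embed", "head_5.5", "head_9.9", "unembed"}
C = M.subgraph(circuit_nodes)          # the circuit: a subgraph of the model
print(list(C.edges()))                 # interactions retained inside the circuit
```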
Experimental Setup
- The indirect object identification (IOI) task sentence contains two clauses: 1) an initial dependent clause, e.g. When Mary and John went to the store, and 2) a main clause, e.g. John gave a bottle of milk to. The model should generate Mary as the next token. In this example, Mary is the indirect object (IO) and John is the subject (S).
- GPT-2 small, a decoder-only transformer model, was used for the analysis.
- They primarily used the logit difference (the logit of the IO token minus the logit of the S token) to quantify the performance of the model with and without interventions.
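As a concrete illustration, here is a minimal sketch of this metric using a Hugging Face GPT-2 checkpoint; it is not the authors' code, and the prompt is just the example from the task description above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Minimal sketch of the logit-difference metric: logit(IO) - logit(S) at the
# final position of the prompt.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "When Mary and John went to the store, John gave a bottle of milk to"
io_id = tokenizer.encode(" Mary")[0]   # indirect object token
s_id = tokenizer.encode(" John")[0]    # subject token

with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits

last = logits[0, -1]                    # next-token logits
print(float(last[io_id] - last[s_id]))  # positive => model prefers the IO
```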
Discovered Circuit for IOI task in GPT2-small

- Duplicate Token Heads: These heads occur at the S2 token and primarily attend to the S1 token. They write into the residual stream at S2 the information that this token has already occurred earlier in the context.
- S-Inhibition Heads: These heads occur at the END token and attend to the S2 token. The information that these heads write into the END-token residual stream informs the query vectors of the Name Mover Heads not to attend to the S1 token.
- Name Mover Heads: These heads are also present at the END token and attend to the IO token in the main clause. They copy the name information from the IO token residual stream and dump it into the END token residual stream which gets generated as the next token.
- Previous Token Heads: These heads copy information about the S1 token to the S1+1 token, the token immediately after S1.
- Induction Heads: These heads are present at the S2 token and attend to the S1+1 token. They perform the same function as the Duplicate Token Heads.
- Backup Name Mover Heads: These are an interesting set of heads. They become active only when the Name Mover Heads are knocked out via ablation.
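As a rough sketch of the kind of knockout experiment that reveals the Backup Name Mover Heads, the snippet below zero-ablates the main Name Mover Heads reported in the paper (9.6, 9.9, 10.0) with the TransformerLens library and re-measures the logit difference. This is only an illustrative approximation: the paper's knockouts replace activations with means over a reference distribution rather than zeroing them.

```python
import torch
from transformer_lens import HookedTransformer

# Sketch: zero-ablate the Name Mover Heads and re-measure the logit difference.
# In the paper, Backup Name Mover Heads then partially restore the behaviour.
torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Mary and John went to the store, John gave a bottle of milk to"
tokens = model.to_tokens(prompt)
io_id, s_id = model.to_single_token(" Mary"), model.to_single_token(" John")

name_movers = [(9, 6), (9, 9), (10, 0)]        # (layer, head)

def ablate_name_movers(z, hook):
    # z: [batch, pos, head_index, d_head]; zero out this layer's Name Mover Heads
    for layer, head in name_movers:
        if hook.layer() == layer:
            z[:, :, head, :] = 0.0
    return z

hook_names = {f"blocks.{layer}.attn.hook_z" for layer, _ in name_movers}
fwd_hooks = [(name, ablate_name_movers) for name in sorted(hook_names)]

clean = model(tokens)[0, -1]
ablated = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)[0, -1]
print("clean logit diff:  ", float(clean[io_id] - clean[s_id]))
print("ablated logit diff:", float(ablated[io_id] - ablated[s_id]))
```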
Circuit Evaluation
The path patching algorithm is used to extract the IOI circuit present in the GPT-2 small model. However, how do we know that the discovered circuit is indeed the correct one? To address this, the paper evaluates the identified circuit using three metrics: Faithfulness, Completeness, and Minimality.
Faithfulness measures how much of the model's performance can be recovered by the circuit itself. Formally, it is computed as \(F(M) - F(C)\), where \(F\) is the logit difference measure. They found that the identified circuit has a faithfulness score of \(0.46\), which is about \(13\%\) of \(F(M)=3.56\), indicating that the circuit recovers roughly \(87\%\) of the model's performance.
Completeness measures whether the circuit contains all the model components that are involved in the computation, or whether it is missing some important ones. Mathematically, completeness is assessed via \(F(C \backslash K) - F(M \backslash K)\) for every subset \(K \subset C\). If a circuit is complete, then this value should be small for every \(K\). However, calculating the metric for every possible \(K\) is computationally intractable, so the authors use sampling techniques to approximate it over a manageable number of subsets \(K\).
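The subset-sampling itself is straightforward; below is a rough sketch of how such an approximation might be organized. The helper `logit_diff_without` is hypothetical: it stands in for whatever ablation machinery computes \(F\) with a given set of components knocked out.

```python
import random

def incompleteness_score(circuit, logit_diff_without, n_samples=100, seed=0):
    r"""Approximate the completeness check by sampling subsets K of the circuit.

    `circuit` is a collection of components (e.g. (layer, head) pairs).
    `logit_diff_without(K, circuit_only)` is a hypothetical helper: it returns
    the logit difference F when the components in K are knocked out, either
    from the full model (circuit_only=False, i.e. F(M \ K)) or from the
    circuit in isolation (circuit_only=True, i.e. F(C \ K)).
    """
    rng = random.Random(seed)
    components = list(circuit)
    worst = 0.0
    for _ in range(n_samples):
        K = set(rng.sample(components, rng.randint(0, len(components))))
        gap = abs(logit_diff_without(K, circuit_only=True)
                  - logit_diff_without(K, circuit_only=False))
        worst = max(worst, gap)   # a complete circuit keeps this gap small
    return worst
```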
To check whether the discovered circuit contains redundant components, the authors defined the Minimality metric. Formally, for every node \(v \in C\) there should exist a subset \(K \subseteq C \backslash \{v\}\) with a high minimality score \(|F(C \backslash (K \cup \{v\})) - F(C \backslash K)|\), i.e., removing \(v\) on top of \(K\) should noticeably change the circuit's performance.
A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task
This paper was published in
ACL 2024.
Jannik Brinkmann, work done at the University of Mannheim as a PhD student
Abhay Sheshadri, work done at Georgia Tech as an undergraduate
Victor Levoso, Independent Researcher
Paul Swoboda, work done at the University of Düsseldorf as a Professor
Christian Bartelt, work done at the University of Mannheim as a Managing Director
The purpose of this paper is to determine how a transformer language model solves a reasoning task. The authors use a binary tree traversal task to analyze the model's behavior and find that the model uses a backward chaining algorithm to solve it: a deduction head copies the source node to the current position, a parallel form of backward chaining is used when the goal node is more than one node away, and a rank-one update merges the resulting subpaths together. They make these discoveries using existing techniques: linear probes, activation patching, and causal scrubbing. In both activation patching and causal scrubbing, the authors intervene during the model's inference by replacing the activations of a component of interest with the activations of that same component on a different input, and observe how this changes the model's loss and logits. In causal scrubbing, the loss is then used to form the performance metric \(L_{CS} = (L_{scrubbed} - L_{random})/(L_{model} - L_{random})\), where \(L_{model}\) is the test loss of the trained model, \(L_{random}\) is the loss of a model that outputs uniform logits, and \(L_{scrubbed}\) is the loss of the trained model with the chosen activations resampled.
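To make the interventions concrete, here is a minimal, framework-agnostic sketch of activation patching with PyTorch forward hooks, together with the normalized causal-scrubbing metric above. `model` and `module` are placeholders for any network and the component being patched; this is not the authors' implementation.

```python
import torch

def activation_patch(model, module, clean_inputs, alt_inputs):
    """Sketch of activation patching: run `model` on `clean_inputs`, but
    overwrite `module`'s output with its activation from `alt_inputs`.
    Assumes `module` returns a single tensor."""
    cache = {}

    def save_hook(mod, inputs, output):
        cache["act"] = output.detach()

    def patch_hook(mod, inputs, output):
        return cache["act"]  # returning a value replaces the module's output

    with torch.no_grad():
        handle = module.register_forward_hook(save_hook)
        model(**alt_inputs)                      # cache the component's activation
        handle.remove()

        handle = module.register_forward_hook(patch_hook)
        patched_logits = model(**clean_inputs)   # clean run with patched component
        handle.remove()
    return patched_logits

def causal_scrubbing_metric(L_scrubbed, L_model, L_random):
    # Normalized loss recovered, matching the formula in the text above.
    return (L_scrubbed - L_random) / (L_model - L_random)
```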
The Reasoning Task
Each training example is a binary tree \(T=(V,E)\). The model is given an edge list, a root node, and a goal node, and must predict the path from the root node to the goal node; note that this path is unique. See the two figures below: the first shows the task format, and the second shows the tree structure.
Figure 1. A -> B is an edge from node A to node B. Edges are separated by a
comma. The goal node and root node are also included.
Figure 2. The red node is the root node and the blue node is the goal node.
The green nodes represent the intermediate path.
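To make the input format concrete, here is a small illustrative example of how such an instance could be serialized; the exact vocabulary and delimiters used in the paper may differ.

```python
# Illustrative serialization of one tree-traversal instance; the exact
# vocabulary and delimiters in the paper may differ.
edges = [(1, 2), (1, 3), (3, 4), (3, 5), (5, 6)]   # (parent, child) pairs
root, goal = 1, 6

prompt = ",".join(f"{a}->{b}" for a, b in edges) + f"|goal:{goal}|root:{root}|path:"
target = "1,3,5,6"   # the unique path from the root to the goal

print(prompt)   # 1->2,1->3,3->4,3->5,5->6|goal:6|root:1|path:
print(target)
```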
Model Specifications
The authors trained a decoder-only transformer with 6 layers, where each layer has 1 attention head and an MLP sub-block. The model has 1.2 million parameters and was trained on 150,000 training examples. It achieved an accuracy of 99.7% on 15,000 unseen examples under an exact sequence match metric.
The Backward Chaining Algorithm
Backward chaining is a term from symbolic AI to which the model's learned algorithm bears a strong resemblance. The model's algorithm is as follows:
- In the first transformer layer, for each edge [A]->[B] it copies the information from [A] into the residual stream at position [B].
- The model copies the target node [G] into the final token position.
- The authors term the heads that perform the following operation in each layer "deduction heads": at the current position, the head finds the source/parent node and copies it to the current position.
- This is repeated in each layer, allowing the model to traverse up the tree one edge per layer.
However, this mechanism is limited by the number of layers present in the model, since one layer is needed to traverse a single edge of the tree.
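The mechanism can be summarized as a plain-Python sketch of the backward-chaining idea (an abstraction of the learned behaviour, not the model's actual weights):

```python
# Plain-Python sketch of the backward-chaining idea the model implements:
# each "layer" moves one edge up the tree from the goal toward the root,
# mirroring one deduction-head step per transformer layer.
def backward_chain(edges, root, goal, n_layers):
    parent = {child: par for par, child in edges}  # what layer 1 makes look-up-able
    path = [goal]
    for _ in range(n_layers):                      # one step per layer
        if path[-1] == root:
            break
        path.append(parent[path[-1]])              # "deduction head": copy the parent
    return list(reversed(path))                    # root -> ... -> goal

edges = [(1, 2), (1, 3), (3, 4), (3, 5), (5, 6)]
print(backward_chain(edges, root=1, goal=6, n_layers=6))  # [1, 3, 5, 6]
```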
Path Merging
The authors found that there are register tokens that do not contain any valuable semantic information themselves; instead, the model uses these tokens as working memory to store information about other subpaths in the tree. The subpaths stored on register tokens are later merged at the final token when one of them is found to overlap with the subpath containing the goal node.
One-Step Lookahead
Finally, the authors found another mechanism that identifies the child nodes of the current position and increases the prediction probabilities of those children that are not leaf nodes of the tree. Attention heads in the last two layers are primarily responsible for this mechanism.
Nikhil's Opinion: Although the authors have performed causal experiments for the individual sub-mechanisms, the results would be more convincing if they also conducted a causal evaluation of the full mechanism responsible for performing the task.
How they interpret the computations
The authors are able to use linear probes to extract both the source and target tokens [A][B] from the residual stream activations at the positions of the target tokens after the first transformer layer.
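A minimal sketch of such a linear probe is shown below; the cached activations and labels are assumed to have been collected beforehand, and this is not the authors' code.

```python
import torch
import torch.nn as nn

# Minimal linear-probe sketch. X is assumed to be a cached tensor of
# residual-stream activations of shape [n_examples, d_model] taken at the
# target-token positions after layer 1, and y a tensor of shape [n_examples]
# holding the source-node token ids to be predicted.
def train_probe(X, y, n_classes, epochs=200, lr=1e-2):
    probe = nn.Linear(X.shape[1], n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(X), y).backward()
        opt.step()
    with torch.no_grad():
        accuracy = (probe(X).argmax(dim=-1) == y).float().mean()
    return probe, float(accuracy)
```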
They hypothesize that the "attention head of layer l is responsible for
writing the node that is l - 1 edges above the goal into the final token
position in the residual stream. This implies that the attention head in
layer l should be consistent across trees that share the same node l - 1
edges above the goal." They use causal scrubbing to verify this.
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
This paper is unpublished.
Tom Lieberum Works at DeepMind
Matthew Rahtz Works at DeepMind
János Kramár Works at DeepMind
Neel Nanda Works at DeepMind
Geoffrey Irving Works at DeepMind
Rohin Shah Works at DeepMind
Vladimir Mikulik Works at DeepMind
Motivation
This paper argues that circuit analysis has several weaknesses, two of which are: 1) the models studied in the existing literature are relatively small, and 2) most existing work focuses on identifying model components that are causally linked to performing a particular task, but neglects the semantic information flowing through those components. To overcome these shortcomings, this paper investigates the 70B-parameter Chinchilla model on a multiple-choice question answering task, specifically MMLU. Further, the authors propose a technique for understanding the semantics of model components by compressing the query, key, and value subspaces and using them to analyse the components' semantics across different counterfactual examples.
Experimental Setup

- This work studies multiple-choice question answering using a subset of the MMLU benchmark.
- Chinchilla 70B, a decoder-only transformer with 80 layers and 64 attention heads per layer, is investigated.
- They use the logits of the option labels as the metric for analysis.
MCQ Answering Circuit in 70B Chinchilla
First, the authors utilize Direct Logit Attribution (DLA) to identify the attention heads that have the highest direct effect on the predicted token. They then select the top 45 attention heads, since these were able to explain 80% of the option-token logits, and show that these heads can recover most of the model's performance and loss. To further analyse the attention heads with the highest direct effect, they visualize their value-weighted attention patterns. Based on each head's pattern, they categorize the heads into four groups (a sketch of the DLA computation follows the list below):
- Correct Letter Heads: These heads attend from the final position to the correct option label.
- Uniform Heads: These heads roughly attend to all letters.
- Single Letter Heads: These heads mostly attend to a single fixed letter.
- Amplification Heads: These heads are hypothesized to amplify the information already present in the residual stream.
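As referenced above, here is a rough sketch of head-level direct logit attribution. It is illustrated on GPT-2 small with the TransformerLens library rather than on Chinchilla, the prompt and "correct" label are placeholders, and the layer-norm handling is approximate; it is a sketch of the technique, not the authors' code.

```python
import torch
from transformer_lens import HookedTransformer

# Sketch of head-level Direct Logit Attribution: project each attention head's
# write into the final residual-stream position through the unembedding and
# read off the logit of the correct option label.
torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2")
prompt = "Question: ...\n(A) ...\n(B) ...\n(C) ...\n(D) ...\nAnswer: ("
tokens = model.to_tokens(prompt)
label_id = model.to_single_token("B")          # pretend B is the correct label

_, cache = model.run_with_cache(tokens)
scale = cache["ln_final.hook_scale"][0, -1]    # final layer-norm scale, last position

dla = []
for layer in range(model.cfg.n_layers):
    z = cache[f"blocks.{layer}.attn.hook_z"][0, -1]               # [n_heads, d_head]
    head_write = torch.einsum("hd,hdm->hm", z, model.W_O[layer])  # [n_heads, d_model]
    dla.append((head_write / scale) @ model.W_U[:, label_id])     # effect on label logit
dla = torch.stack(dla)                                            # [n_layers, n_heads]

values, flat_indices = torch.topk(dla.flatten(), k=5)
for value, idx in zip(values.tolist(), flat_indices.tolist()):
    layer, head = divmod(idx, model.cfg.n_heads)
    print(f"L{layer}H{head}: {value:.3f}")
```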
Although these results are somewhat preliminary, they still shed light on an interesting mechanism: the model does not pass the label information to the option tokens so that the final token can simply fetch it from there; instead, the model first decides on the correct option and only then fetches the corresponding label information.
Semantics of Correct Letter Heads
To understand the semantics of the Correct Letter heads, the authors applied Singular Value Decomposition (SVD) to the residuals of the key and query vectors cached across 1024 examples. They found that the top-3 singular vectors were sufficient to capture most of the variance, and that a low-rank approximation of the key and query information performed about as well as the full-rank version. They then projected the query and key residuals onto this 3-dimensional subspace for visualization. A 3-D version can be accessed here.
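A minimal sketch of how such a rank-3 approximation could be computed with SVD is shown below; it assumes the query (or key) vectors for a Correct Letter head have already been cached into a single tensor, and it is not the authors' code.

```python
import torch

# Sketch: rank-3 approximation of cached query (or key) vectors for a
# Correct Letter head. Q is assumed to be a [n_examples, d_head] tensor of
# query vectors collected across the evaluation prompts.
def low_rank_projection(Q, k=3):
    mean = Q.mean(dim=0, keepdim=True)
    U, S, Vh = torch.linalg.svd(Q - mean, full_matrices=False)
    var_explained = float((S[:k] ** 2).sum() / (S ** 2).sum())
    Q_approx = mean + U[:, :k] @ torch.diag(S[:k]) @ Vh[:k]  # low-rank queries
    coords = (Q - mean) @ Vh[:k].T                           # k-D coordinates to plot
    return Q_approx, coords, var_explained
```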
Finally, they construct specific mutations of the original examples to determine which pieces of information are most critically represented in the QK-subspace. They found that the subspace primarily encodes "n-th item in an enumeration" information, but also some information that is specific to the option tokens.
By building on these investigations into the semantics of the Correct Letter heads, the authors were able to distill the functionality of these heads into pseudocode.
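A rough paraphrase of that pseudocode's gist, based on the description above, might look like the following; this is our reading, not the authors' exact pseudocode.

```python
# Rough paraphrase of the Correct Letter heads' behaviour (our reading, not
# the authors' exact pseudocode): keys at each label position encode "this is
# the n-th item in the enumeration", the query at the final position encodes
# "the answer is the n-th item", and the value copies the matching label so
# that its logit is increased.
def correct_letter_head(option_labels, correct_index):
    keys = {label: n for n, label in enumerate(option_labels)}   # "n-th item" keys
    attended = [label for label, n in keys.items() if n == correct_index]
    return attended[0]   # value/output: boost the logit of this label token

print(correct_letter_head(["A", "B", "C", "D"], correct_index=2))  # prints C
```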
Code Resources
Try this Colab notebook to investigate the IOI circuit in GPT-2 small yourself. You can use it to perform direct logit attribution, activation patching, and attention pattern visualization to understand the flow of information. This notebook was inspired by Neel Nanda's exploratory analysis demo.