SAEs

October 3, 2024 • Rahul Chowdhury, Satya Venkata Anudeep Ragata

Who are the paper authors?

Josh Engels is a PhD student at MIT who studies interpretability of large language models, dictionary learning, and the linearity of feature manifolds in language models. Trenton Bricken is a member of technical staff at Anthropic who researches polysemanticity and monosemanticity in neuron activations. Samuel Marks is a postdoc at Northeastern University who studies interpretability and causality and authored the Sparse Feature Circuits paper.

What are SAEs?

Sparse autoencoders (SAEs) are a class of autoencoders that enforce sparsity in the encoding space and are used to investigate entangled concepts in neurons and their activations. An SAE has a single encoding layer with a non-linear activation that outputs a set of sparse coefficients, and a decoding layer that multiplies these coefficients by the transpose of the encoder's linear weight matrix to reconstruct the input [1].

Sparse Autoencoder
This figure shows the structure of a sparse autoencoder [1]
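
As a concrete sketch, here is a minimal tied-weight sparse autoencoder in PyTorch, loosely in the style of [1]; the dimensions, initialization, and L1 coefficient below are illustrative choices, not the exact setup from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal tied-weight SAE sketch: ReLU encoder, decoder reuses W^T."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)  # dictionary rows
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        # Sparse, non-negative coefficients over the (overcomplete) dictionary
        return F.relu((x - self.b_dec) @ self.W.T + self.b_enc)

    def decode(self, f):
        # Tied decoder: multiply coefficients by the transpose of the encoder weights
        return f @ self.W + self.b_dec

    def forward(self, x):
        f = self.encode(x)
        return self.decode(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the coefficients
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()
```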

Sparse autoencoders are often used to estimate the properties encoded in neurons and their activations. They are used under the assumption that a single neuron's output does not carry the full representation of a property: properties may be encoded in a neural network as a superposition of multiple neural activations. To separate these concepts, the original basis of the feature space needs to be expanded, so a sparse autoencoder first expands the neural activations into a higher-dimensional basis [1]. To learn which concepts are active for each input token, we enforce sparsity on the encoded features, since one token is unlikely to involve all concepts at once. This makes the representation interpretable: we can focus on the dictionary columns that could be responsible for a particular property. Sparse autoencoders can be used both for discovery and for pruning, through techniques that identify which dictionary columns light up most for a particular direction of change of interest. After identifying the bases that correspond to a particular feature, that feature can be kept if the concept should be preserved, or pruned to suppress it from being represented in the network's output [3].
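
To make the discover-then-prune idea concrete, here is a hedged sketch (reusing the SparseAutoencoder class above) of finding the dictionary columns that respond most strongly to a concept and suppressing them during reconstruction; the helper names and the mean-activation heuristic are illustrative assumptions, not a method from the papers.

```python
import torch

@torch.no_grad()
def top_latents_for(sae, activations, k=5):
    # Which dictionary columns light up most, on average, for these activations?
    f = sae.encode(activations)            # (n_tokens, d_sae)
    return f.mean(dim=0).topk(k).indices   # indices of candidate concept features

@torch.no_grad()
def ablate_latents(sae, activations, latent_ids):
    # Reconstruct the activations with the selected features suppressed
    f = sae.encode(activations)
    f[:, latent_ids] = 0.0
    return sae.decode(f)

# Usage sketch: `concept_acts` would be activations collected on tokens that
# express the concept we want to suppress (hypothetical tensor).
# ids = top_latents_for(sae, concept_acts)
# edited_acts = ablate_latents(sae, model_acts, ids)
```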

Sparse Feature Circuits

Feature disentanglement with sparse autoencoders

The authors follow Bricken et al. (2023) and train SAEs on the attention outputs, MLP outputs, and residual stream activations of each layer of Pythia-70M. Each activation \(x\) is decomposed into a sparse sum of feature directions \(v_i\) with activations \(f_i(x)\), a bias \(b\), and an SAE reconstruction error \(\epsilon(x)\): \[x = \hat{x} + \epsilon(x) = \sum^{d_{SAE}}_{i=1} f_i(x)v_i + b + \epsilon(x)\]
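
A minimal sketch of this decomposition, assuming the SparseAutoencoder class from the earlier section; the error term is computed explicitly because the authors keep it as its own node in the circuit.

```python
import torch

@torch.no_grad()
def decompose(sae, x):
    # x = x_hat + eps(x) = sum_i f_i(x) v_i + b + eps(x)
    f = sae.encode(x)        # sparse feature activations f_i(x)
    x_hat = sae.decode(f)    # sum_i f_i(x) v_i + b
    eps = x - x_hat          # SAE reconstruction error, tracked as its own node
    return f, x_hat, eps
```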

Causal effects: a subject-verb agreement example

The authors are interested in the causal effect of model components on the language model's decision to output "is" vs. "are" for a pair of inputs. They quantify the importance of an activation \(\mathbf{a}\) on a pair of inputs \((x_{clean}, x_{patch})\) via the indirect effect (IE; Pearl, 2001): \[IE(m, \mathbf{a}; x_{clean}, x_{patch}) = m\left(x_{clean} \mid do(\mathbf{a} = \mathbf{a}_{patch})\right) - m(x_{clean})\] Here \(\mathbf{a}_{patch}\) is the value that \(\mathbf{a}\) takes when computing \(m(x_{patch})\), and \(m(x_{clean} \mid do(\mathbf{a} = \mathbf{a}_{patch}))\) is the value of \(m\) on \(x_{clean}\) when the computation is intervened on by setting \(\mathbf{a}\) to \(\mathbf{a}_{patch}\); \(\mathbf{a}\) represents a node in the model's computation graph. The authors give an example where \(x_{clean} = \text{'The teacher'}\), \(x_{patch} = \text{'The teachers'}\), and the metric is the log-probability difference \(m(x) = \log P(\text{'are'}\mid x) - \log P(\text{'is'}\mid x)\). If \(\mathbf{a}\) is the activation of a particular neuron, a large value of \(IE(m, \mathbf{a}; x_{clean}, x_{patch})\) indicates that the neuron is highly influential on the model's decision to output "is" vs. "are" on this pair of inputs. Because computing the exact IE for a large number of model components is expensive, the authors linearly approximate it. They use two approximations: attribution patching (Nanda, 2022; Syed et al., 2023; Kramár et al., 2024), which employs a first-order Taylor expansion, and a more computationally expensive but more accurate approximation based on integrated gradients (Sundararajan et al., 2017).
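
As an illustration of the first-order (attribution patching) approximation, \(\hat{IE} \approx \nabla_{\mathbf{a}} m(x_{clean}) \cdot (\mathbf{a}_{patch} - \mathbf{a}_{clean})\). The sketch below assumes a hypothetical `model_metric` callable that recomputes the metric as a differentiable function of the activation; it is not the authors' implementation.

```python
import torch

def attribution_patching_ie(model_metric, a_clean, a_patch):
    """First-order approximation of the indirect effect:
    IE_hat = grad_a m(x_clean) . (a_patch - a_clean).
    `model_metric` is a hypothetical callable that runs the model on x_clean
    and returns the scalar metric m as a function of the activation a."""
    a = a_clean.clone().requires_grad_(True)
    m = model_metric(a)                  # e.g. log P("are"|x) - log P("is"|x)
    (grad,) = torch.autograd.grad(m, a)  # gradient of the metric w.r.t. a
    return torch.sum(grad * (a_patch - a_clean))
```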

Sparse Feature Circuit Discovery

Sparse Feature Circuit Overview
This figure shows an overview of the sparse feature circuit discovery procedure [3]

The activations from the clean input and the patching input are stored, followed by computing gradients for each node. Step 3 computes the IE for each node and filters out nodes whose IE falls below a node threshold \(T_N\). The edges are filtered analogously with an edge threshold \(T_E\).
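
A toy sketch of the filtering step, assuming IE estimates have already been computed per node and per edge; the dictionary-based bookkeeping here is illustrative, not the authors' implementation.

```python
def filter_nodes(ie_by_node, t_n):
    # Keep only nodes whose absolute indirect effect clears the node threshold
    return {node for node, ie in ie_by_node.items() if abs(ie) > t_n}

def filter_edges(ie_by_edge, kept_nodes, t_e):
    # Keep edges between surviving nodes whose IE clears the edge threshold
    return {(u, v) for (u, v), ie in ie_by_edge.items()
            if u in kept_nodes and v in kept_nodes and abs(ie) > t_e}
```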

SAEs for removing unintended signals

The authors look at two subsamples of the Bias in Bios dataset. Initially, the linear classifier trained on the ambiguous set performs poorly and relies on the unintended signal (gender) to make classifications. Applying SHIFT, which ablates the features in the classifier's circuit that a human judges irrelevant to the intended task, almost completely removes the dependence on the unintended signal, and retraining the network with the pruned circuit leads to even better performance.
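
A minimal sketch of this kind of edit, reusing the SparseAutoencoder sketch from earlier: `spurious_ids` stands in for the gender-related features a human flags as task-irrelevant, and the SAE error term is preserved so that only the selected features change. This is an illustration, not the authors' implementation.

```python
import torch

@torch.no_grad()
def shift_edit(sae, acts, spurious_ids):
    # Zero only the human-flagged spurious features (e.g. gender-related
    # latents) and add back the reconstruction error, leaving everything
    # else about the activation untouched.
    f = sae.encode(acts)
    err = acts - sae.decode(f)     # reconstruction error, preserved as-is
    f[:, spurious_ids] = 0.0
    return sae.decode(f) + err

# The edited activations are then fed to the downstream profession classifier.
```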

Linearity And Non-Linearity in Feature Space

f(king) - f(man) + f(woman) ≈ f(queen) [2] is a classic example of linearity emerging in the feature space of neural networks. Here the vector that shifts the representation from one gender to the other is added to "king", a title with gender information entangled in it, and the output shifts to the corresponding title of the other gender. Two things need to be understood: first, the direction of change, and second, the entanglement of information in an input token. Not every input token will have another property entangled in it, and not every shift in state will be linear. A clock gives an intuition for the second point. If we want to shift from 1 to 2, there is a change of +1 in magnitude. Now consider the vector that carried us from 1 to 2: can we reuse the same vector to increment from 2 to 3 in clock space? No; this time the vector needed for the same +1 increment is different. This illustrates why we might need a non-linear intervention to produce the same change in a different context. A small numeric sketch follows.
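
A tiny numeric sketch of the clock intuition (illustrative, not taken from any of the papers): on a circular feature, the displacement vector for "+1" depends on where you are, while a rotation captures the operation uniformly.

```python
import numpy as np

# Represent hour h on a clock as a point on a circle (a 2-D, non-linear feature)
def hour_vec(h, n=12):
    theta = 2 * np.pi * h / n
    return np.array([np.cos(theta), np.sin(theta)])

delta_1_to_2 = hour_vec(2) - hour_vec(1)   # the vector that moved us from 1 to 2
print(hour_vec(2) + delta_1_to_2)          # NOT hour_vec(3): reusing it fails
print(hour_vec(3))

# The "+1 hour" operation is a rotation, not a fixed translation:
def plus_one(v, n=12):
    theta = 2 * np.pi / n
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ v

print(plus_one(hour_vec(2)))               # matches hour_vec(3)
```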

Evidence of Circular Representation

Sparse autoencoders can be used to analyze the activation space and its circular representations [3]. The dictionary columns of the decoder can serve as a similarity index, since conceptually similar features light up together when reconstructing the input representation [3]. First, the dictionary elements are clustered, and the activations are passed through the sparse autoencoder using only the columns that belong to a particular cluster [4]. Then, PCA projections are used to visualize their non-linear representation in a lower-dimensional subspace [3]. We can observe that the different days of the week and the different months of the year lie on a circle. This means that incrementing the current state by the same amount cannot be achieved by a single linear scaling of a basis direction; the assumption that information is transformed along one linear dimension is only partially correct. A sketch of this pipeline is given after the figure below.

Circular Representation

This figure shows the PCA projections of the representations of days of the week and months of the year [4]
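
A hedged sketch of the clustering-and-projection pipeline using scikit-learn, loosely following the procedure described in [4]; the clustering method, cluster count, and the assumption of unit-norm decoder directions are illustrative choices rather than the authors' exact setup.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.decomposition import PCA

def cluster_and_project(decoder_dirs, token_acts, sae_encode,
                        n_clusters=20, cluster_id=0):
    """decoder_dirs: (d_sae, d_model) dictionary directions (assumed unit-norm);
    token_acts: (n_tokens, d_model) activations; sae_encode: callable returning
    the (n_tokens, d_sae) sparse coefficients as a NumPy array."""
    # 1. Cluster dictionary elements by the similarity of their directions
    sim = np.abs(decoder_dirs @ decoder_dirs.T)
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(sim)
    # 2. Keep only the SAE features in one cluster and reconstruct with them
    keep = np.where(labels == cluster_id)[0]
    f = sae_encode(token_acts)
    partial_recon = f[:, keep] @ decoder_dirs[keep]
    # 3. Visualize the cluster's reconstructions in a low-dimensional PCA basis
    return PCA(n_components=2).fit_transform(partial_recon)
```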

Discussion and Opinion

Sparse feature circuits offer a stronger causal account than SAEs alone: the circuit is built from measured causal effects, whereas with a plain SAE we merely assign meaning to the non-zero features after the fact. And although the SAE applies a non-linear transformation, that non-linearity (a single ReLU) is weak, so parts of the analysis could plausibly generalize to linear methods such as PCA.

Code Resources

Try the Colab notebook to train a sparse autoencoder (based on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning): Sparse Autoencoder Colab.

References

  1. Cunningham, Hoagy, et al. "Sparse autoencoders find highly interpretable features in language models." arXiv preprint arXiv:2309.08600 (2023).
  2. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems 26 (2013).
  3. Marks, Samuel, et al. "Sparse feature circuits: Discovering and editing interpretable causal graphs in language models." arXiv preprint arXiv:2403.19647 (2024).
  4. Engels, Joshua, et al. "Not All Language Model Features Are Linear." arXiv preprint arXiv:2405.14860 (2024).