Probing
September 19, 2024 • Rahul Chowdhury, Ritik Bompilwar
Who are the paper authors?
The authors of the papers of today's discussion are mainly Kenneth Li, PhD student at Harvard University, and Dr. Yanai Elazar is postdoctoral researcher on the AllenNLP team at AI2. Kenneth Li is working on LLM dialogues and interpretability for alignment of LLMs. Dr. Yanai Elazar works on interpretability of generative model and has authored a great paper "Null it out: Guarding protected attributes by iterative nullspace projection" that showed how mathematically feature space could be made unresponsive by simple linear algebra.
What is Probing?
Probing is an attempt by computer scientists to understand the workings of neural networks. The most popular way of probing is by learning to make sense of a representation of a neural network by keeping the information in its purest form as much as possible. The information under scrutiny is usually a human interpretable property of known data that is believed could be decoded in the easiest way if the information is present in the representation and also in the data that is passed through the neural network.
There is also a second line of work that is more concerned if a particular property that is learned is used in the training objective. This line of work is unbothered by the presence of a particular feature which could be learned as a by-product of a training objective. This is investigated by switching off the features that are believed to be representing a property and observing the reaction of the neural network's predictability to that intervention.
Linear Probing
Linear Probing is a learning technique to assess the information content in the representation layer of a neural network. This is done to answer questions like what property of the data in training did this representation layer learn that will be used in the subsequent layers to make a prediction. The other question to ask is how we can test the presence of such property without adulterating the purity of information. Computer scientists believed that associating a representation with a said property with parametric learning paradigm could be done with less adulteration by a linear classifier since a complex model with non-linear activations might memorize the data and force the representation to project to the labels associated with the data [1]. One way to test if the probing learnable function is memorizing the label along with the data instead of drawing a valid decision line is by checking the selectivity of the probing.
In this test, a probe for a feature under scrutiny is trained with random labels and the same probing structure is trained with original labels, and the probe trained with random labels serves as a control and their performances are recorded [1]. If the probe achieves high accuracy on the random labels and the difference between the accuracy of the probe trained with original and the noisy labels is not significant, it is said to be low on selectivity and is memorizing the probing dataset. Non-linear probes have been alleged to have this property, and that is why a linear probe is entrusted with this task. Finally, good probing performance would hint at the presence of the said property, which has the potential of being used in making final decisions to choose a label in the farthest layer of the neural network.
Amnesic Probing
Probing usually deals with discovery and localization of concepts in a trained neural network, and investigates what the neural network has learned and how they could be responsible for the outcome of training. There is another line of research that has an orthogonal approach to understanding the dependencies of a neural network on said features when it takes a decision [1]. This line of research wants to study the reaction of the neural network if that feature that might be a representation of a said property is suppressed via intervention. Intervention is usually done in the deep layers rather than input since intervention in the input involves querying a lot of neurons. The goal of this intervention is to test if the features learning identified properties are at all used during making a decision. This could also be a test to observe if the learned representation is actually the feature of the probed label or is it just a mere correlation without causation.
Intervention Mechanism
Iterative Null Space Projection algorithm was used to transform features that could be used to identify particular property [1]. This is usually done through repeated projection of weight matrix used to project features into the distribution of labels to null space, followed by learning a new matrix to linearly transform the null projected features into the distribution of labels. This operation is repeated because null space projection can nullify the contribution by the number of dimensions of the labels. In order to ensure most of the dimensions that could have information about the property, multiple linear projections are required to minimize the rank of the matrix to a point where the information that could be transformed linearly to utilize the said property is diminished [1]. This is repeated until the linear projection can no longer classify the features into the target labels better than random projections [1]. This repeated linear projection gets rid of the chance of the embedding being processed linearly to utilize information that could represent the said property [1].
How Does Intervention Help?
After performing linear intervention, we check if that intervention causes a drop in performance. This brings forth questions: Is the absence of the said property causing the drop? Or was it a mere correlation that could have caused the drop, and were there other features entangled in the embedding that could have been the reason for the original performance? A great way to test if our hypothesis that the said property that the feature is representing is by simultaneously training another set of matrices in the same Iterative Null Space Projection algorithm but with random labels. If the linear intervention finds the neutralized feature caused a drop in performance better than the random linear intervention [1], it could be concluded that the said property might be representing the property and the information in that embedding was essential in determining the classes of an input data.
Emergent World Representations
Language models have shown an amazing range of capabilities, but the question of what they learn from data has been a subject of debate. Do the models memorize a collection of surface statistics, or do they form certain interpretable representations of the process that generates the sequences they see?
Li et al. (2023) [2] proposed the existence of world representations in language models. The focus of the paper is to investigate the form of internal representations in the language model performing a specific task in confined settings. To investigate this world representation, the authors chose the popular game of Othello (try it here). If we think of the board as the "world," then the game provides an appealing experimental testbed to explore world representations.
The authors trained an 8-layer GPT model with an 8-head attention mechanism and a 512-dimensional hidden space on two sets of training data. The first set called "championship" is smaller in size (140K games) but consists of strategic moves by expert human players. The second dataset called "synthetic" is larger in size (2.4M games) consisting of legal but otherwise random moves. The goal was to study how much Othello-GPT can learn from pure sequence information with minimal inductive bias. The model was evaluated on the validation set with error rate as the metric. Othello-GPT had 0.01% error rate on synthetic data, 5.17% error rate on championship data while the untrained model had 93.29% error rate.
Is the Model Memorizing?
To test whether the model was simply memorizing game sequences, the authors created a skewed dataset by removing all games that started with one of the four possible opening moves (specifically, they removed games starting with the move C5). This eliminated 25% of all possible game openings. Despite never seeing these sequences during training, Othello-GPT still performed with a low error rate of 0.02% on them. This indicated that the model is not just memorizing but has learned to generalize from the patterns in the data.
Intervention Technique for Probing Internal Representations
Authors trained probes that predict the board state from the network's internal activations after a given sequence of moves. They found that linear probess had higher error rates than non-linear probes (2 layer MLP), indicating that the probe may be recovering a nontrivial representation of board state in the network's activations.
 
To evaluate whether Othello-GPT’s internal representations of the board state directly influence its move predictions, the authors introduced targeted interventions by altering activations across multiple layers, starting from an initial layer Lₛ. Utilizing gradient descent, they adjusted specific activations to transition the board state from B to B′ by modifying certain state elements. This precise alteration ensures that only a segment of the board state changes, thereby affecting the set of legal moves. The authors then assessed the success of these interventions by probing the modified activations and verifying if the model’s move predictions corresponded with the updated board state.
 
To systematically test the causal impact of internal board representations on Othello-GPT’s predictions, the authors developed a benchmark set comprising 2,000 intervention cases, divided into natural (reachable) and unnatural (unreachable) board states. For each test case, they modified the model’s internal activations to alter the board state and monitored the subsequent move predictions. By comparing these predictions with the expected set of legal moves, they calculated error rates. The results showed a significant reduction in errors from baseline levels, even in the unnatural subset, which underscores that the emergent internal representations play a causal role in shaping the model’s decision-making processes.
Latent Saliency Maps
 
The latent saliency maps are generated by assessing how changes to each tile’s internal representation affect the model’s prediction for a specific move p. For each tile s on the board B, the authors apply the intervention technique to alter the representation of s and observe the resulting change in the prediction probability of p. They calculate the saliency Ss as the difference between the original prediction probability p0 and the new probability ps after intervention (Ss = p0 - ps). This saliency value indicates how much tile s influences the prediction of move p. By computing Ss for all tiles and visualizing them, they create a latent saliency map that highlights the most influential tiles affecting the model’s decision.
Discussion and Opinion
Linear probing and non-linear probing are great ways to identify if certain properties are linearly separable in feature space, and they are good indicators that these information could be used for future token prediction. But large models are much more than that. There are so many heads in language models that even if the information is linearly seperable they might not be used for prediction due to the algorithm that large models might have created using multiple heads. This points out the importance of amnesic probing which would indicate by creating an absence if that information is present and is being used.
Code Resources
The "Emergent World Representations" paper has provided the code here: Othello World. Aditionally, Neel Nanda has created a TransformerLens version of the Othello-GPT,which allows for inspecting each MLP neuron in the model, boosting the mechanistic interpretability. Try the Colab Notebook here: Othello GPT Colab.
The "Amnesic Probing" paper has provided the code here: Amnesic Probing
.References
- Elazar, Yanai, et al. "Amnesic probing: Behavioral explanation with amnesic counterfactuals." Transactions of the Association for Computational Linguistics 9 (2021): 160-175.
- Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., & Wattenberg, M. (2022). Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382.
- Belinkov, Yonatan. (2021). Probing Classifiers: Promises, Shortcomings, and Advances. arXiv preprint arXiv:2102.12452.