Grokking
Introduction
Grokking is a fascinating phenomenon where a model, after a period of apparent stagnation, suddenly experiences a rapid and significant improvement in performance. This abrupt transition is like an epiphany moment, where the model gains a deep understanding of its task, similar to a human's moment of clarity after grappling with a complex concept. The term "grok" originates from Robert Heinlein's science fiction and means to understand something fully and deeply. Grokking challenges conventional expectations of gradual learning, suggesting an alternative dynamic where models may initially show little progress before suddenly achieving generalization.
Three papers are explored on this page. The first, Nanda et al. 2023, Progress measures for grokking via mechanistic interpretability, demonstrates the grokking phenomenon in small transformers trained on a modular addition task. The second, Murty et al. 2023, Grokking of Hierarchical Structure in Vanilla Transformers, explores hierarchical generalization. The third, Hu et al. 2024, Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics, explores how randomness in data ordering and initialization impacts model training dynamics and outcomes.
Authors
Grokking mech-interp
Neel Nanda runs the Google DeepMind mechanistic interpretability team.
Lawrence Chan is a PhD student at UC Berkeley, advised by Anca Dragan and Stuart Russell.
Tom Lieberum is a Research Engineer at DeepMind.
Jacob Steinhardt is an Assistant Professor in the Department of Statistics at UC Berkeley.
Glossary
Norm: a measure of the magnitude of a vector or matrix; in the context of neural networks, typically the L2 norm of the weights.
Frequency: in the mech-interp grokking paper, this refers to the angular frequency wk = 2*pi*k/P, where k is an integer indexing the k-th harmonic (multiple) of the fundamental frequency 2*pi/P.
Sparse parities: a combinatorial problem where only a small subset of the input bits is relevant to the task. For example, an input may have 40 bits, but only 3 of them are needed to compute the output.
Vanishing gradients: a phenomenon where the model's gradients become negligibly small, so the weights can barely be updated during backpropagation.
Exploding gradients: the accumulated gradients become extremely large, so instead of converging to an optimum, training diverges.
Emergent behaviours
Emergent behaviors in machine learning models often arise unexpectedly when models are scaled up, leading to new capabilities such as in-context learning and chain-of-thought prompting. However, these behaviors also present risks, including overfitting and unintended consequences in real-world applications. For instance, Pan et al. discuss the risks associated with recommender systems: in the case of YouTube, engineers could not directly measure subjective well-being (SWB), so they optimized proxy metrics such as click-through rate and watch time. Because these proxies are poor estimates of SWB, YouTube ended up overemphasizing watch time, harming user satisfaction and recommending extreme or controversial political content to users.
The emergence of these behaviors is surprising to researchers because they appear suddenly and are not easily predictable based on traditional metrics. Ganguli et al. highlight the paradox that while scaling laws predict overall performance improvements, the specific new capabilities that emerge are unpredictable. In fact, there could still be unknown capabilities that have not yet been triggered or discovered. Barak et al. further note that sudden phase changes can occur even without changes in data size, emphasizing the need for metrics that can detect these transitions before they happen. Understanding these emergent behaviors requires novel approaches beyond conventional statistical methods, as they can have significant implications for both model performance and societal impact.
AlphaZero learns many human chess concepts between 10k and 30k training steps and reinvents opening theory between 25k and 60k steps.
Modular addition experiment

In this experiment, the authors study modular addition, where a model takes inputs a, b ∈ {0, ..., P − 1} for some prime P and predicts their sum c mod P. Small transformers trained with weight decay on this task consistently exhibit grokking. By reverse-engineering the weights of these transformers, they find that the networks solve the task by mapping the inputs onto a circle and performing addition on the circle. Specifically, the embedding matrix maps the inputs a and b to sines and cosines at a sparse set of key frequencies wk; the attention and MLP layers combine these using trigonometric identities to compute the sine and cosine of wk(a + b); and the output matrices shift and combine these frequencies to read off the answer.
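To make this concrete, here is a minimal NumPy sketch of the Fourier multiplication algorithm described above. It is my own illustration rather than the paper's trained weights: the modulus P = 113 matches the paper, but the particular key frequencies are placeholders, since the trained network selects its own small set.

```python
import numpy as np

# Minimal sketch of the Fourier multiplication algorithm (illustration only).
P = 113                              # modulus used in the paper
ks = np.array([1, 5, 17, 32, 47])    # placeholder "key frequencies"; the real model learns its own ~5
w = 2 * np.pi * ks / P               # angular frequencies w_k = 2*pi*k/P

def modular_add_logits(a, b):
    # Embedding: represent a and b as sines/cosines at the key frequencies.
    cos_a, sin_a = np.cos(w * a), np.sin(w * a)
    cos_b, sin_b = np.cos(w * b), np.sin(w * b)
    # Attention/MLP: trig identities yield cos and sin of w_k * (a + b).
    cos_ab = cos_a * cos_b - sin_a * sin_b
    sin_ab = sin_a * cos_b + cos_a * sin_b
    # Output matrices: the logit for candidate c is sum_k cos(w_k * (a + b - c)),
    # which is maximized exactly at c = (a + b) mod P.
    c = np.arange(P)
    return cos_ab @ np.cos(np.outer(w, c)) + sin_ab @ np.sin(np.outer(w, c))

a, b = 40, 99
assert np.argmax(modular_add_logits(a, b)) == (a + b) % P
```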
They present four lines of evidence that the trained network indeed implements this Fourier multiplication algorithm.
1. Network weights exhibit a periodic structure. When a Fourier transform is applied, many components are sparse and supported on only a few key frequencies.
2. The neuron-logit map WL, the final learnable parameter matrix that transforms hidden activations into logits, can be well approximated by sinusoidal functions of the key frequencies. WL has a rank of approximately 10, with each direction corresponding to the cosine or sine of one of only 5 key frequencies. Projecting the MLP activations onto WL produces only multiples of cos(wk(a + b)) and sin(wk(a + b)), where a and b are the inputs; the trigonometric identities thus emerge from the neurons, and the sum itself is not explicitly computed inside the MLP.
3. The attention heads and most MLP neurons can be well approximated by 2nd-degree polynomials of the sines and cosines of a single frequency, and the corresponding direction in WL contains that same frequency. Hence the model's computation is localized, with each frequency handled by its own set of components.
4. Ablating the key frequencies reduces model performance, whereas ablating the other 95% of frequencies improves it. Ablating various model components and replacing them with the idealized Fourier multiplication algorithm does not harm performance and sometimes improves it, which shows that the interpretation is faithful.
The paper introduces two progress measures that improve before and during the grokking transition (a rough sketch of both follows this list):
- Restricted loss: the loss after ablating every non-key frequency.
- Excluded loss: the loss after ablating the key frequencies.
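As a rough illustration of these two measures, the sketch below is an assumption-laden simplification rather than the paper's exact implementation: it takes a hypothetical logit tensor over all input pairs, projects its dependence on (a + b) onto the key-frequency sines and cosines, and computes the loss of either the kept part (restricted) or the remainder (excluded).

```python
import numpy as np

P = 113
key_ks = [1, 5, 17, 32, 47]        # placeholder key frequencies
a = np.arange(P)[:, None]          # shape (P, 1)
b = np.arange(P)[None, :]          # shape (1, P)

def split_key_components(logits):
    """Split logits of shape (P, P, P) into (key-frequency part, remainder)."""
    key_part = np.zeros_like(logits)
    for k in key_ks:
        w = 2 * np.pi * k / P
        for basis in (np.cos(w * (a + b)), np.sin(w * (a + b))):
            # Project each output slice onto this basis function of (a + b).
            coeff = (logits * basis[:, :, None]).mean((0, 1)) / (basis ** 2).mean()
            key_part += coeff[None, None, :] * basis[:, :, None]
    return key_part, logits - key_part

def loss(logits):
    targets = (a + b) % P
    shifted = logits - logits.max(-1, keepdims=True)   # numerical stability
    logp = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
    return -np.take_along_axis(logp, targets[:, :, None], axis=-1).mean()

# key, rest = split_key_components(model_logits)   # model_logits: (P, P, P) array
# restricted_loss = loss(key)    # keep only key frequencies
# excluded_loss = loss(rest)     # ablate key frequencies
```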
Phase changes

The experiment shows three phases in training: memorization of the training data; circuit formation, where the network learns a mechanism that generalizes; and cleanup, where weight decay removes the memorization components. Surprisingly, the sudden transition to perfect test accuracy in grokking occurs during cleanup, after the generalizing mechanism is learned.
Memorization (Epochs 0k–1.4k). Both excluded and train loss decline, while test and restricted loss remain high and the Gini coefficient stays relatively flat. In other words, the model memorizes the data, and the key frequencies wk used by the final model are not yet in use.
Circuit formation (Epochs 1.4k–9.4k). In this phase, excluded loss rises, sum of squared weights falls, restricted loss starts to fall, and test and train loss stay flat. This suggests that the model’s behavior on the train set transitions smoothly from the memorizing solution to the Fourier multiplication algorithm. The fall in the sum of squared weights suggests that circuit formation likely happens due to weight decay. Notably, the circuit is formed well before grokking occurs.
Cleanup (Epochs 9.4k–14k). In this phase, excluded loss plateaus, restricted loss continues to drop, test loss suddenly drops, and the sum of squared weights sharply drops. As the completed Fourier multiplication circuit both solves the task well and has lower weight than the memorization circuit, weight decay encourages the network to shed the memorized solution in favor of the Fourier multiplication circuit. This is most cleanly shown in the sharp increase in the Gini coefficient for the matrices WE and WL, which shows that the network is becoming sparser in the Fourier basis.
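For reference, the Gini coefficient used above as a sparsity measure can be computed as in the short sketch below (my own implementation of the standard definition). Applied to the norms of a matrix's Fourier components, a value near 1 means a few key frequencies dominate.

```python
import numpy as np

def gini(x):
    # Standard Gini coefficient of a non-negative vector: 0 = perfectly even,
    # values near 1 = mass concentrated on a few entries (i.e., sparse).
    x = np.sort(np.abs(x))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

dense = np.ones(113)                        # every frequency used equally
sparse = np.zeros(113); sparse[:5] = 1.0    # only 5 key frequencies used
print(gini(dense), gini(sparse))            # ~0.0 vs ~0.96
```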
Interactive grokking mech-interp
Structural Grokking
The problem addressed in this paper is understanding whether and how vanilla transformer models (standard transformers without architectural modifications) can learn and generalize the hierarchical structures inherent in human language. This question arises because hierarchical structure (the way smaller units in sentences form larger, nested constituents) is critical for human language comprehension and for generalization to new sentences.
Historically, research has suggested that transformers, due to their sequential and attention-based design, might lack inductive biases for hierarchical structure, struggling to generalize beyond the training data to structurally novel sentences. Some studies have argued that transformers only learn shallow patterns in the data without truly capturing these deeper, hierarchical relationships.
Background Structural Grokking
Some of the early work on hierarchical generalization (Muller et al.) shows evidence that generalization does not occur, or reports poor accuracy, on certain datasets. Murty et al. 2023 argue that this is due to stopping training too early. They also claim that, simply by training for longer, mean accuracy across random seeds reaches 80%, and several seeds achieve near-perfect generalization performance.
Experimentation Structural grokking

They selected datasets that require hierarchical generalization, where training data can be explained by both hierarchical and non-hierarchical rules:
Dyck Language Modeling (Dyck-LM): the language of well-nested brackets with 20 bracket types and a maximum nesting depth of 10. The model must predict valid closing brackets based on the nested structure (a toy sampler for this language is sketched after this list).
Question Formation: Convert English sentences into questions. The task involves transforming English declarative sentences into questions by moving auxiliary verbs appropriately, which requires hierarchical manipulation.
Tense Inflection: Map from sentences and tense markers to appropriately re-inflected sentences. This task involves inflecting verbs in sentences according to tense markers, necessitating hierarchical understanding to place verb inflections accurately. Input: "She is reading a book" with a target tense of "past". Expected Output: "She was reading a book".
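For a concrete feel for the Dyck-(20, 10) data, here is a toy sampler. It is my own illustration, not the authors' data pipeline; the bracket vocabulary and length limits are assumptions.

```python
import random

# Toy generator for Dyck-(20, 10): well-nested sequences over 20 bracket types
# with nesting depth at most 10. The LM must predict valid closing brackets.
OPEN = [f"({i}" for i in range(20)]
CLOSE = [f"){i}" for i in range(20)]

def sample_dyck(max_len=30, max_depth=10, p_open=0.5):
    seq, stack = [], []
    while len(seq) < max_len:
        if stack and (len(stack) >= max_depth or random.random() > p_open):
            seq.append(CLOSE[stack.pop()])   # close the most recently opened bracket
        else:
            t = random.randrange(20)
            stack.append(t)
            seq.append(OPEN[t])
    while stack:                             # close any brackets still open
        seq.append(CLOSE[stack.pop()])
    return " ".join(seq)

print(sample_dyck())  # e.g. "(3 (17 )17 )3 (5 ..."
```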
The models used in the experiment are transformer LMs with {2, 4, 6, 8, 10} layers. For each depth, they train models with 10 random seeds for 300k steps (400k for Dyck). Simple greedy decoding is used during testing.
Results and conclusion Structural Grokking
Weight norms alone are not effective predictors of hierarchical generalization (or structural grokking) in transformer models. Although weight norms do grow during training, this growth is not a reliable indicator of whether a model will generalize hierarchically on out of distribution data.
Attention sparsity does not reliably predict structural grokking or hierarchical generalization in transformers. Although attention sparsity increases with model depth, this increase is not linked to a model's ability to generalize hierarchically.
Tree-structuredness is a key predictor of structural grokking and hierarchical generalization in transformers. Unlike other internal model properties (like weight norms and attention sparsity), tree-structuredness uniquely correlates with the model's ability to learn and generalize hierarchical rules to out of distribution data.
The paper concludes that transformers can learn hierarchical language structures through structural grokking, where hierarchical generalization emerges with extended training, challenging the view that transformers lack hierarchical biases. This phenomenon follows an inverted U-shaped relationship with model depth, where intermediate depths are optimal. Among internal properties, tree-structuredness (how well the model's computations resemble hierarchical trees) best predicts successful generalization. Extended training is essential for structural grokking, suggesting prior studies may have underestimated transformers' capabilities. These findings reveal transformers' potential for capturing syntactic complexity, with implications for their application to hierarchical language tasks.
Latent state Training dynamics
Intro
The paper addresses the challenge of understanding how randomness in data ordering and model initialization affects neural network training outcomes. Despite using identical architectures and hyperparameters, models can take different training paths due to randomness, impacting convergence speed and final accuracy. This variability complicates reproducibility and efficiency, as certain training paths lead to slower or less effective learning. The authors aim to interpret these paths by modeling training dynamics as transitions across latent states using hidden Markov models (HMMs). This approach captures phase transitions and highlights detour states: inefficient training stages that delay convergence.
Methodology


1. Data collection: Metrics such as the L2 norm and mean of the model weights are recorded at each checkpoint to capture the training dynamics.
2. Hidden Markov Model (HMM): An HMM is fitted to these metrics, treating training as a sequence of transitions between latent states. Each latent state represents a distinct phase in the training process (a minimal fitting sketch follows this list).
3. Training Map Construction: The HMM is used to build a training map that visualizes the paths taken through latent states during training. This map helps identify detour states, i.e., latent states that cause slower convergence.
4. Interpreting Detour States: To connect detour states to training outcomes, the authors use regression models predicting convergence time based on time spent in each latent state, identifying states that slow down convergence.
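The snippet below is a minimal sketch of steps 1-3 using hmmlearn's GaussianHMM, assuming the per-checkpoint metrics have already been logged. The number of latent states, the choice of metrics, and the random placeholder data are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Placeholder metric logs: one array per training run,
# rows = checkpoints, columns = metrics (e.g. L2 norm, mean of weights, loss).
runs = [np.random.randn(200, 3) for _ in range(5)]

X = np.concatenate(runs)            # stack all runs into one observation matrix
lengths = [len(r) for r in runs]    # run boundaries, so the HMM treats them as separate sequences

# Fit an HMM whose hidden states play the role of latent training phases.
hmm = GaussianHMM(n_components=6, covariance_type="diag", n_iter=100)
hmm.fit(X, lengths)

# Decode each run into its latent-state path. The observed state transitions
# form the edges of the "training map"; states whose occupancy a regression
# later links to slower convergence are the detour states.
paths = [hmm.predict(r) for r in runs]
```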
Experimentation with Latent states
The paper shows experimentation on three different tasks:
1. Grokking Tasks: Modular additions and sparse parities, where models learn simple rules and exhibit clear phase transitions between memorization and generalization.
2. Image Classification: Tasks on MNIST and CIFAR-100 to observe standard training dynamics in neural networks.
3. Masked Language Modeling: Using BERT-like models to assess training dynamics under diverse random seeds.
Results and conclusion Latent States
In grokking tasks, the training dynamics show clear phase transitions from memorization to generalization. Paths through the training map differed noticeably: some models followed short, efficient paths, while others took longer detour routes. Certain random seeds led to prolonged convergence times because training entered detour states with high L2 norm that slowed learning. These results show that phase transitions in training can be sensitive to random seeds, and that adjusting some hyperparameters or making architectural changes such as adding normalization mitigates this sensitivity.
Image classification on CIFAR-100 and MNIST shows very stable training, with metrics that progress roughly linearly towards convergence. However, removing batch normalization and residual connections on CIFAR-100 destabilized training for some random seeds. The destabilization is attributed to exploding or vanishing gradients.
Masked language modeling showed variation in weight averages across random seeds, with some runs converging faster than others. While all of them converged to the pre-training loss benchmark, these weight-average variations could affect fine-tuning and other downstream tasks. Perhaps loss and accuracy alone are not sufficient metrics for evaluating whether these language models have converged to optimal training dynamics.
Compare, contrast and opinion
The mech-interp grokking paper combines extensive experimentation with comprehensive reverse engineering, making an airtight argument about the phase transitions and how transformer models achieve generalization. The structural grokking paper shows evidence of hierarchical generalization, disproving older works, but it does not explore the internal mechanics of how the model achieves this hierarchical generalization, unlike the mech-interp paper, which clearly shows how the model's computations can be approximated with trigonometric identities. While the structural grokking paper portrays structural grokking as an entirely new phenomenon, I (Sri Harsha) reached the conclusion that structural grokking and mech-interp grokking are the same phenomenon. The only major difference I see between the two papers' experiments is that they train models to solve different tasks.
Code and additional resources
1. Mech-interp grokking code https://github.com/mechanistic-interpretability-grokking/progress-measures-paper
2. Structural grokking code https://github.com/MurtyShikhar/structural-grokking
3. Latent states code https://github.com/michahu/modeling-training