The Structure and Interpretation of Deep Networks
A collaborative class handbook on current research methods in mechanistic interpretability.
Table of Contents
Introduction and History
Understanding Representation
Understanding Computation
Understanding Learning
Understanding the World
Class Projects
- Interpreting GPT-2 For Time Series (Rahul)
- Bypassing Large Language Models’ Refusal: an Empirical Study on Understanding Jailbreak
- Unlearning Personally Identifiable Information from LLMs (Anudeep)
- Motion Lens: Interpreting Text Encoder’s role in Text-to-Motion Generative Models
- Evolution of LLM Stages: A Study across Model Sizes
- LLMs know more than they show (Sri Harsha, Poornima Final Project)
- Exploring Automated Weight Space Interpretability (David A)