Unlearning Personally Identifiable Information from OLMo
December 10, 2024 • Anudeep Ragata
Code
Link to the GitHub repository
Introduction
Large language models risk memorizing sensitive information about individuals. This project explores the problem of unlearning personally identifiable information (PII) from LLMs. I adapt the REVS method (Ashuach et al., 2024), which identifies and modifies a small subset of neurons relevant to each piece of PII, and apply it to OLMo.

Related Work
Machine unlearning, a field dedicated to the removal of specific data from trained machine learning models, has gained significant attention due to increasing concerns over data privacy. Traditional machine unlearning approaches often focus on retraining the model entirely or applying specific updates to erase the influence of undesired data. Techniques like weight rewinding, selective fine-tuning, and gradient inversion aim to minimize the model's dependency on private data without compromising its overall performance. However, these methods face challenges in scalability, as retraining or fine-tuning large-scale models, such as language models, can be computationally expensive. Recent advancements leverage model distillation and pruning techniques to efficiently remove sensitive data while retaining the original model's knowledge to a significant extent.
In the context of language models, the unlearning of Personally Identifiable Information (PII) presents unique challenges. Unlike structured datasets, language models trained on massive text corpora may unintentionally encode PII in ways that are difficult to pinpoint or extract. Existing research on differential privacy and post-hoc removal mechanisms has explored strategies to reduce the inclusion of sensitive data during training or eliminate it afterward. For instance, privacy-preserving mechanisms like noise injection during training and deletion-aware fine-tuning have demonstrated promise in mitigating PII leakage. However, these approaches often struggle with balancing privacy and utility, particularly in generative models, where unintended memorization of training data can lead to accidental generation of PII. Addressing these challenges requires continued innovation in methods for targeted data removal and comprehensive evaluation metrics that account for privacy preservation without degrading the model's functional integrity.
Methodology
To unlearn specific sensitive information, REVS identifies the model parameters most responsible for generating it. It then adjusts selected neuron representations to lower the prominence of the target tokens in each layer's residual stream, reducing their rank. This keeps the impact of the edits on the rest of the model small, suppressing the sensitive information while preserving the model's general capabilities. The core process of REVS involves the following steps (a rough code sketch follows the list):
1. To unlearn a sensitive sequence \(S\), select a subset of target tokens \(T = \{t_1, t_2, \dots, t_k\}\).
2. For each target sensitive token \(t_i \in T\), identify the layers where the rank of \(t_i\) in the residual hidden state vector \(h\) is above a desired threshold rank \(r_h\).
3. Within these identified layers, select which neurons to edit.
4. Iteratively edit the selected neurons to reduce the rank of \(t_i\) below the neuron threshold rank \(r_n\).
5. Update the model with the edited neurons.
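The sketch below illustrates steps 2 and 4 in a minimal form; it is not the authors' implementation. Assumptions not taken from this post: a Hugging Face causal LM checkpoint (`allenai/OLMo-1B-hf` is used as a placeholder), the "logit lens" projection of residual hidden states through the unembedding matrix, and a simplified demotion rule that subtracts the target token's unembedding direction. REVS edits individual MLP neurons; here a whole hidden state is demoted only to show the rank-reduction idea.

```python
# Minimal sketch of rank identification (step 2) and demotion (step 4).
# All names, thresholds, and the checkpoint are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-1B-hf"  # assumed checkpoint; any HF causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def token_rank(vec: torch.Tensor, unembed: torch.Tensor, token_id: int) -> int:
    """Rank (0 = most likely) of token_id in the vocabulary projection of vec.
    The final layer norm is omitted for brevity."""
    logits = unembed @ vec
    return int((logits > logits[token_id]).sum())


def layers_to_edit(prompt: str, token_id: int, r_h: int = 100) -> list[int]:
    """Step 2: layers whose last-position residual state places the target
    token among its r_h most likely tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    unembed = model.get_output_embeddings().weight  # (vocab, hidden)
    return [
        i
        for i, h in enumerate(out.hidden_states)
        if token_rank(h[0, -1], unembed, token_id) < r_h
    ]


def demote_token(vec: torch.Tensor, unembed: torch.Tensor, token_id: int,
                 r_n: int = 1000, step: float = 0.05, max_iter: int = 200) -> torch.Tensor:
    """Step 4 (simplified): nudge vec until the target token is no longer
    among its r_n most likely tokens, by subtracting a scaled copy of the
    token's unembedding direction."""
    direction = unembed[token_id] / unembed[token_id].norm()
    vec = vec.clone()
    while token_rank(vec, unembed, token_id) < r_n and max_iter > 0:
        vec -= step * vec.norm() * direction
        max_iter -= 1
    return vec
```

In the full method, the same rank-reduction objective is applied to selected neuron vectors rather than whole hidden states, and the edited neurons are then written back into the model (step 5).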

Scores
1. Rank Score: the rank of a target token when a hidden representation (or neuron) is projected into the vocabulary space; this is the building block for the metrics below.
2. Efficacy: how effectively the edited model stops generating the targeted sensitive tokens (higher is better).
3. Specificity: how well the model preserves sensitive sequences that were not targeted for unlearning, i.e. how little the edits spill over to other instances (higher is better).
4. Perplexity: the model's general language-modeling quality; values close to the unedited model indicate that its capabilities are preserved.
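As a concrete example of the last metric, the sketch below computes perplexity as the exponentiated mean token-level cross-entropy of a causal LM on held-out text. The `model` and `tokenizer` objects are assumed to be the Hugging Face ones loaded above, and the example text is illustrative, not the evaluation set behind the results below.

```python
import math
import torch


def perplexity(model, tokenizer, text: str) -> float:
    """Exponentiated mean cross-entropy of `text` under a HF causal LM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean next-token loss.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())


# Illustrative usage:
# ppl = perplexity(model, tokenizer, "The quick brown fox jumps over the lazy dog.")
```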
Results
| Dataset | Method | Efficacy (%) | Specificity (%) | Perplexity |
|---|---|---|---|---|
| SSNs | Unedited | 0 | 100 | 12.142 |
| SSNs | REVS | 99.95 | 71.53 | 12.172 |
| Emails | Unedited | 0 | 100 | 8.324 |
| Emails | REVS | 96.89 | 70.77 | 8.609 |
| Phone Numbers | Unedited | 0 | 100 | 12.027 |
| Phone Numbers | REVS | 99.94 | 71.89 | 12.112 |
Takeaways from the results
As shown in the table, REVS achieves strong results in terms of both unlearning effectiveness and model integrity. The difference in specificity between datasets can be attributed to the nature of the target token sequences: while the Emails targets consist of relatively unique tokens with little overlap, the SSN targets comprise only digits, leading to high overlap with other instances in the specificity calculation. The high generalization score of REVS indicates that it comprehensively unlearns generation of the target tokens across different prompts, even when the edit is applied to a single prompt-target pair.
Conclusion
The REVS method effectively unlearns sensitive information from LLMs while maintaining model performance. It achieves high efficacy and specificity scores, indicating that it successfully suppresses sensitive information. The method also maintains low perplexity scores, demonstrating that it does not significantly impact the model's general capabilities.
References
Tomer Ashuach, Martin Tutek, and Yonatan Belinkov. "REVS: Rank Editing in the Vocabulary Space for Unlearning Sensitive Information in Large Language Models." arXiv preprint arXiv:2406.09325, 2024.