By Rohit Gandikota and Jiachen Zhao
Published on December 1, 2024
The discussion around AI safety has become increasingly polarized, with two distinct camps emerging:
Some researchers and experts, including figures like Eliezer Yudkowsky and Stuart Russell, warn about the potentially catastrophic risks of advanced AI systems. They argue that as AI systems become more capable, they could pose existential threats to humanity through various mechanisms, from unaligned goals to misuse in creating weapons of mass destruction.
On the other side, researchers like Emily M. Bender and Timnit Gebru argue that current AI systems are merely "stochastic parrots": sophisticated pattern-matching systems that predict the next token based on training data, without true understanding or agency. From this perspective, concerns about existential risk are overblown and divert attention from more immediate issues like bias and environmental impact.
First Authors:
Senior Author:
The work brought together researchers from 23 institutions including MIT, Harvard, Stanford, and Scale AI (one of the larger collaborations in safety research).
The WMDP benchmark consists of 3,668 multiple-choice questions carefully crafted by subject-matter experts across three domains: biosecurity, cybersecurity, and chemical security. Each question was developed with a specific threat model in mind.
Importantly, the benchmark underwent rigorous filtering to remove sensitive or export-controlled information, ensuring it can't serve as a direct guide for malicious actors. The entire dataset cost over $200,000 to develop and involved extensive consultation with technical experts in each domain.
Figure 1: Overview of the WMDP Benchmark's three main components: biosecurity, cybersecurity, and chemical security evaluation
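Because the benchmark is plain multiple-choice, evaluating a model on it reduces to scoring each answer option and checking whether the highest-scoring one matches the key. Below is a minimal sketch of such an evaluation; the Hugging Face dataset name (cais/wmdp) and the field names (question, choices, answer) are assumptions about the released format rather than details confirmed by the paper.

# Sketch: zero-shot multiple-choice evaluation on WMDP-style questions
# (dataset repository and field names are assumptions, not taken from the paper)
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def mcq_accuracy(model_name, dataset_name="cais/wmdp", config="wmdp-bio"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    data = load_dataset(dataset_name, config, split="test")
    letters = ["A", "B", "C", "D"]
    correct = 0
    for item in data:
        prompt = (
            item["question"] + "\n"
            + "\n".join(f"{l}. {c}" for l, c in zip(letters, item["choices"]))
            + "\nAnswer:"
        )
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]
        # Pick the answer letter whose token receives the highest logit
        letter_ids = [tokenizer(" " + l, add_special_tokens=False).input_ids[-1] for l in letters]
        pred = max(range(len(letters)), key=lambda i: logits[letter_ids[i]].item())
        correct += int(pred == item["answer"])
    return correct / len(data)

The same routine can be pointed at a general-knowledge benchmark to check that an intervention such as unlearning has not damaged benign capabilities.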
The WMDP benchmark offers a pragmatic middle ground in this debate. Instead of focusing on abstract risks or dismissing concerns entirely, it addresses specific, concrete risks that could arise from AI systems being misused for developing weapons of mass destruction.
Creating benchmarks for hazardous capabilities could inadvertently provide a roadmap for malicious actors. However, the authors argue that their careful filtering process mitigates this risk.
Figure 2: Architecture of the Representation Misdirection for Unlearning (RMU) method
The paper introduces RMU (Representation Misdirection for Unlearning), a novel technique for removing specific knowledge from AI models while preserving general capabilities. This addresses a key challenge in AI safety: how to make models safer without significantly degrading their useful capabilities.
Figure 3: Performance comparison before and after applying RMU
Important Results:
The paper's results show some degradation in related benign knowledge (like basic virology) when removing hazardous information. This raises questions about the true separability of harmful and beneficial knowledge.
WMDP represents a new approach to AI safety that bridges theoretical concerns with practical solutions, demonstrating that concrete, measurable progress on specific misuse risks is possible.
The path forward in AI safety likely involves combining multiple approaches: technical solutions like WMDP, policy frameworks, and ethical guidelines.
While the debate about AI safety continues, WMDP shows that concrete progress is possible. By focusing on specific, measurable risks while acknowledging the complexity of the challenge, we can work towards safer AI systems without falling into either excessive alarm or complacency.
The paper demonstrates that after unlearning, models output gibberish when asked about hazardous topics. While this prevents coherent harmful responses, it raises several important questions.
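One way to see this behavior directly is to compare generations from an unlearned checkpoint on a benign prompt and on a hazardous one. The sketch below is illustrative only: "path/to/unlearned-model" is a placeholder rather than an official released checkpoint, and the hazardous prompt is deliberately left unspecified.

# Sketch: probing an unlearned model on benign vs. hazardous prompts
# ("path/to/unlearned-model" is a placeholder, not an official checkpoint name)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/unlearned-model")
model = AutoModelForCausalLM.from_pretrained("path/to/unlearned-model")

prompts = {
    "benign": "Explain the basic function of a cell membrane.",
    "hazardous": "<a WMDP-style biosecurity question>",  # intentionally left as a placeholder
}
for label, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    print(label, "->", tokenizer.decode(output[0], skip_special_tokens=True))

On the benign prompt the model should answer normally; on the hazardous one, an RMU-unlearned model tends to produce incoherent text, which is exactly the behavior described above.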
The path to truly safe AI systems likely requires a combination of approaches, with unlearning being just one piece of a much larger puzzle.
# Example implementation of RMU unlearning (simplified sketch)
import torch

def rmu_unlearning(model, model_orig, forget_data, retain_data, c=6.5, alpha=1200):
    # Random unit vector defining the direction that activations on hazardous
    # data are steered toward (get_random_unit_vector is a placeholder helper)
    u = get_random_unit_vector()

    # Forget loss: push the model's activations on hazardous data toward the
    # scaled random direction c * u, scrambling the relevant representations
    forget_loss = 0
    for x in forget_data:
        activations = model.get_layer_activations(x)
        forget_loss += torch.norm(activations - c * u) ** 2

    # Retain loss: keep activations on benign data close to those of the frozen
    # original model (model_orig), preserving general capabilities
    retain_loss = 0
    for x in retain_data:
        activations = model.get_layer_activations(x)
        frozen_activations = model_orig.get_layer_activations(x).detach()
        retain_loss += torch.norm(activations - frozen_activations) ** 2

    return forget_loss + alpha * retain_loss
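In practice, this loss would be minimized with gradient descent, typically while updating only a small subset of the model's layers. The loop below is a minimal sketch of how the function might be used; the optimizer settings, logging interval, data loaders (forget_loader, retain_loader), and the frozen copy model_orig are assumptions, not the paper's exact configuration.

# Sketch: minimizing the RMU loss in a simple fine-tuning loop
# (learning rate, logging interval, and data loaders are illustrative assumptions)
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
for step, (forget_batch, retain_batch) in enumerate(zip(forget_loader, retain_loader)):
    loss = rmu_unlearning(model, model_orig, forget_batch, retain_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: unlearning loss = {loss.item():.3f}")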
Lead Authors:
Senior Author:
Even with rigorous testing and debugging, developers cannot reliably prevent AI systems from exhibiting dangerous, unintended behaviors. The attack surface is vast enough that identifying every input capable of triggering such behavior is impractical. Red-teaming and adversarial training (AT) are commonly used to improve robustness, but they often fail against failure modes that differ from the attacks seen during training.
Across its latents, a model gradually develops more compressed, abstract, and structured representations of the concepts it uses to process information. This makes it possible for latent-space attacks to activate the neural circuitry behind a failure without needing to find inputs that trigger it.
Figure 4: The motivation for LAT is based on how models develop more compressed, abstract, and structured representations across their latents. We hypothesize that many failures that are difficult to elicit from the input space may be easier to elicit from the latent space.
Figure 5: Evaluation procedures.
Figure 6: Evaluation results for image classification.
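Concretely, latent adversarial training perturbs a model's hidden activations rather than its inputs: an inner loop searches for a latent perturbation that maximizes the loss, and an outer step updates the model to behave correctly despite it. Below is a minimal sketch for the image-classification setting of Figure 6, assuming the network is split into an encoder and a classification head; the split, epsilon, and step sizes are illustrative assumptions, not the paper's setup.

# Sketch: one step of latent adversarial training for an image classifier
# (the encoder/head split, epsilon, and step sizes are illustrative assumptions)
import torch
import torch.nn.functional as F

def lat_step(encoder, head, optimizer, images, labels, epsilon=0.5, inner_steps=5, inner_lr=0.1):
    # Latent representation the adversary will perturb (detached: the inner
    # loop optimizes only the perturbation, not the model)
    latents = encoder(images).detach()

    # Inner loop: projected gradient ascent in latent space to find a
    # perturbation that maximizes the classification loss
    delta = torch.zeros_like(latents, requires_grad=True)
    for _ in range(inner_steps):
        adv_loss = F.cross_entropy(head(latents + delta), labels)
        (grad,) = torch.autograd.grad(adv_loss, delta)
        with torch.no_grad():
            delta += inner_lr * grad.sign()
            delta.clamp_(-epsilon, epsilon)

    # Outer step: update the model so it classifies correctly under the perturbation
    optimizer.zero_grad()
    robust_loss = F.cross_entropy(head(encoder(images) + delta.detach()), labels)
    robust_loss.backward()
    optimizer.step()
    return robust_loss.item()

For language models, the same recipe applies with the perturbation added to hidden activations at an intermediate layer instead of image embeddings.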
| Aspect | WMDP/RMU | Latent Adversarial Training |
|---|---|---|
| Primary Goal | Remove specific harmful capabilities while preserving general abilities | Make models robust against unforeseen failure modes |
| Technical Approach | Targeted activation steering with forget/retain loss | Adversarial perturbations in latent space |
| Evaluation Method | Benchmark-based evaluation (WMDP) | Robustness against novel attacks and trojans |
These approaches could potentially be combined.
Key Insight: While both papers work with model internals, they represent different philosophies in AI safety: WMDP takes a targeted approach to removing specific capabilities, while LAT aims for broader robustness through adversarial training in latent space. This difference highlights the complementary nature of current safety research.