AI Safety: Opinions and Progress

By Rohit Gandikota and Jiachen Zhao

Published on December 1, 2024

The Great AI Safety Debate

The discussion around AI safety has become increasingly polarized, with two distinct camps emerging:

The Existential Risk Perspective

Some researchers and experts, including figures like Eliezer Yudkowsky and Stuart Russell, warn about the potential catastrophic risks of advanced AI systems. They argue that as AI systems become more capable, they could pose existential threats to humanity through various mechanisms, from unaligned goals to potential misuse in creating weapons of mass destruction.

The Skeptics' View

On the other side, researchers like Emily M. Bender and Timnit Gebru argue that current AI systems are merely "stochastic parrots" - sophisticated pattern matching systems that predict the next token based on training data, without true understanding or agency. From this perspective, concerns about existential risk are overblown and divert attention from more immediate issues like bias and environmental impact.

WMDP: Measuring and Reducing Malicious Use With Unlearning

First Authors:

Senior Author:

The work brought together researchers from 23 institutions including MIT, Harvard, Stanford, and Scale AI (one of the larger collaborations in safety research).

Inside the WMDP Benchmark

The WMDP benchmark consists of 3,668 multiple-choice questions carefully crafted by subject matter experts across three domains. Each question was developed with specific threat models in mind.

Importantly, the benchmark underwent rigorous filtering to remove sensitive or export-controlled information, ensuring it can't serve as a direct guide for malicious actors. The entire dataset cost over $200,000 to develop and involved extensive consultation with technical experts in each domain.
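To make the evaluation concrete, here is a rough sketch of how a model could be scored on WMDP-style multiple-choice questions. It assumes the benchmark is available on Hugging Face as cais/wmdp with question, choices, and answer fields; treat the dataset name, field names, and the zero-shot letter-scoring setup as assumptions for illustration rather than the authors' exact evaluation pipeline.

# Hedged sketch: zero-shot multiple-choice accuracy on WMDP-style questions.
# Dataset and field names (cais/wmdp, "question", "choices", "answer") are assumed.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def wmdp_accuracy(model_name, subset="wmdp-bio"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    model.eval()
    data = load_dataset("cais/wmdp", subset, split="test")

    letters = ["A", "B", "C", "D"]
    letter_ids = [tok(f" {l}", add_special_tokens=False).input_ids[-1] for l in letters]

    correct = 0
    for item in data:
        options = "\n".join(f"{l}. {c}" for l, c in zip(letters, item["choices"]))
        prompt = f"{item['question']}\n{options}\nAnswer:"
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            next_token_logits = model(**inputs).logits[0, -1]
        # Pick the answer letter whose next-token logit is highest
        pred = next_token_logits[letter_ids].argmax().item()
        correct += int(pred == item["answer"])
    return correct / len(data)

The same scoring loop can be run before and after unlearning to measure how much hazardous capability has been removed.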

Figure 1: Overview of the WMDP Benchmark's three main components: biosecurity, cybersecurity, and chemical security evaluation

The WMDP benchmark offers a pragmatic middle ground in this debate. Instead of focusing on abstract risks or dismissing concerns entirely, it addresses specific, concrete risks that could arise from AI systems being misused for developing weapons of mass destruction.

Counter Perspective:

Creating benchmarks for hazardous capabilities could inadvertently provide a roadmap for malicious actors. However, the authors argue that their careful filtering process mitigates this risk.

Understanding the Unlearning Approach: RMU

Figure 2: Architecture of the Representation Misdirection for Unlearning (RMU) method

The paper introduces RMU (Representation Misdirection for Unlearning), a novel technique for removing specific knowledge from AI models while preserving general capabilities. This addresses a key challenge in AI safety: how to make models safer without significantly degrading their useful capabilities.

Figure 3: Performance comparison before and after applying RMU

Key Findings and Implications

Important Results:

Critical Considerations:

The paper's results show some degradation in related benign knowledge (like basic virology) when removing hazardous information. This raises questions about the true separability of harmful and beneficial knowledge.

Broader Implications for AI Safety

WMDP represents a new approach to AI safety that bridges theoretical concerns with practical solutions. It demonstrates that:

  1. Specific safety concerns can be measured and addressed systematically
  2. Safety improvements don't necessarily require sacrificing all related capabilities
  3. A middle ground exists between extreme positions in the AI safety debate

The path forward in AI safety likely involves combining multiple approaches: technical solutions like WMDP, policy frameworks, and ethical guidelines.

Conclusion

While the debate about AI safety continues, WMDP shows that concrete progress is possible. By focusing on specific, measurable risks while acknowledging the complexity of the challenge, we can work towards safer AI systems without falling into either excessive alarm or complacency.

Critical Questions About WMDP and Unlearning

Is Gibberish Really Safety?

The paper demonstrates that after unlearning, models output gibberish when asked about hazardous topics. While this prevents coherent harmful responses, it raises several important questions.

The path to truly safe AI systems likely requires a combination of approaches, with unlearning being just one piece of a much larger puzzle.

Here is a small snippet of RMU code for easy understanding:

# Example implementation of RMU unlearning (simplified; model_orig is a frozen
# copy of the original model, and get_layer_activations is a placeholder for
# extracting hidden states at the chosen unlearning layer)
import torch

def rmu_unlearning(model, model_orig, forget_data, retain_data, c=6.5, alpha=1200):
    # Random unit vector defining the steering direction (fixed for the run)
    u = torch.rand(model.hidden_size)
    u = u / u.norm()

    # Forget loss: push the model's activations on hazardous data toward the
    # scaled random vector c * u, scrambling its internal representations
    forget_loss = 0.0
    for x in forget_data:
        activations = model.get_layer_activations(x)
        forget_loss += torch.norm(activations - c * u) ** 2

    # Retain loss: keep activations on benign data close to the frozen
    # original model, preserving general capabilities
    retain_loss = 0.0
    for x in retain_data:
        activations = model.get_layer_activations(x)
        with torch.no_grad():
            orig_activations = model_orig.get_layer_activations(x)
        retain_loss += torch.norm(activations - orig_activations) ** 2

    # Combined objective: unlearn hazardous knowledge, retain everything else
    return forget_loss + alpha * retain_loss
        
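A rough sketch of how this loss might be used in an unlearning loop is below. It assumes the same placeholder model / model_orig objects as the snippet above and standard PyTorch optimization; it is illustrative rather than the authors' actual training code.

# Hypothetical unlearning loop built around the loss above (illustrative only)
import torch

def run_unlearning(model, model_orig, forget_batches, retain_batches, lr=5e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for forget_data, retain_data in zip(forget_batches, retain_batches):
        loss = rmu_unlearning(model, model_orig, forget_data, retain_data)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model

In practice, restricting the update to parameters near the chosen unlearning layer keeps the edit localized; the sketch updates all parameters for simplicity.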

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Lead Authors:

Senior Author:

Motivation

Even with rigorous testing and careful debugging, developers cannot always prevent AI systems from exhibiting dangerous, unintended behaviors. The attack surface is simply too large to enumerate every input that might trigger such behavior. Red-teaming and adversarial training (AT) are commonly used to improve robustness, but they often fail against failure modes that differ from the attacks seen during training.

Latent Adversarial Training

Adversarial training without examples that elicit failure

Across its latents, a model gradually develops more compressed, abstract, and structured representations of the concepts it uses to process information. This makes it possible for latent-space attacks to activate neural circuitry that elicits failures without requiring inputs that trigger them.

Figure 4: The motivation for LAT is based on how models develop more compressed, abstract, and structured representations across their latents. We hypothesize that many failures that are difficult to elicit from the input space may be easier to elicit from the latent space.

Method

LAT requires selecting which layer to attack, i.e., where to apply the perturbation during training. The best choice varies from case to case and needs to be tuned as a hyperparameter; a minimal sketch of the core training step is shown below.
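As a rough illustration, the sketch below implements a latent adversarial training step for a classifier, assuming the model has been split into a lower part (up to the attacked layer) and an upper part (the rest). The split point, the L2 budget eps, and the number of attack steps are illustrative choices, not the paper's exact configuration.

# Minimal sketch of a latent adversarial training (LAT) step for a classifier
# (illustrative; not the paper's exact setup). "lower" maps inputs to the
# attacked layer's activations, "upper" maps those activations to logits.
import torch
import torch.nn.functional as F

def lat_step(lower, upper, x, y, optimizer, eps=1.0, attack_steps=5, attack_lr=0.1):
    # Clean forward pass up to the attacked layer
    latents = lower(x).detach()

    # Inner loop: find an L2-bounded latent perturbation that maximizes the
    # task loss (projected gradient ascent on delta)
    delta = torch.zeros_like(latents, requires_grad=True)
    for _ in range(attack_steps):
        attack_loss = F.cross_entropy(upper(latents + delta), y)
        grad, = torch.autograd.grad(attack_loss, delta)
        with torch.no_grad():
            delta += attack_lr * grad
            norm = delta.norm()
            if norm > eps:
                delta *= eps / norm  # project back into the L2 ball
    delta = delta.detach()

    # Outer step: update the model so it behaves correctly under the attack
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(upper(lower(x) + delta), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()

The optimizer would cover the parameters of both lower and upper, and the layer at which the model is split is exactly the hyperparameter discussed above.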

Experimental Results

Figure 5: Evaluation procedures.

The experiments compare methods on their robustness against novel attacks and trojans, evaluated under the procedures shown in Figure 5.

Figure 6: Evaluation results for image classification.

Comparative Analysis of Safety Approaches

Methodological Differences

| Aspect | WMDP/RMU | Latent Adversarial Training |
| --- | --- | --- |
| Primary Goal | Remove specific harmful capabilities while preserving general abilities | Make models robust against unforeseen failure modes |
| Technical Approach | Targeted activation steering with forget/retain loss | Adversarial perturbations in latent space |
| Evaluation Method | Benchmark-based evaluation (WMDP) | Robustness against novel attacks and trojans |

Key Trade-offs

Complementary Strengths

These approaches could potentially be combined.

Key Insight: While both papers work with model internals, they represent different philosophies in AI safety: WMDP takes a targeted approach to removing specific capabilities, while LAT aims for broader robustness through adversarial training in latent space. This difference highlights the complementary nature of current safety research.

Code Resources and Implementation Details

Official Repositories

Interactive Notebooks

WMDP Evaluation

Usage Notes