By Rohit Gandikota and Jiachen Zhao
Published on December 1, 2024
The discussion around AI safety has become increasingly polarized, with two distinct camps emerging:
Some researchers and experts, including figures like Eliezer Yudkowsky and Stuart Russell, warn about the potentially catastrophic risks of advanced AI systems. They argue that as AI systems become more capable, they could pose existential threats to humanity through various mechanisms, from unaligned goals to misuse in creating weapons of mass destruction.
On the other side, researchers like Emily M. Bender and Timnit Gebru argue that current AI systems are merely "stochastic parrots": sophisticated pattern-matching systems that predict the next token based on training data, without true understanding or agency. From this perspective, concerns about existential risk are overblown and divert attention from more immediate issues like bias and environmental impact.
First Authors:
Senior Author:
The work brought together researchers from 23 institutions including MIT, Harvard, Stanford, and Scale AI (one of the larger collaborations in safety research).
The WMDP benchmark consists of 3,668 multiple-choice questions carefully crafted by subject-matter experts across three domains: biosecurity, cybersecurity, and chemical security. Each question was developed with a specific threat model in mind.
Importantly, the benchmark underwent rigorous filtering to remove sensitive or export-controlled information, ensuring it can't serve as a direct guide for malicious actors. The entire dataset cost over $200,000 to develop and involved extensive consultation with technical experts in each domain.
Figure 1: Overview of the WMDP Benchmark's three main components: biosecurity, cybersecurity, and chemical security evaluation
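Because the benchmark is plain multiple-choice, evaluating a model on it reduces to scoring each answer option and checking whether the highest-scoring one matches the key. Below is a minimal sketch of such an evaluation; the Hugging Face dataset name (cais/wmdp) and the field names (question, choices, answer) are assumptions about the released format rather than details confirmed by the paper.

# Sketch: zero-shot multiple-choice evaluation on WMDP-style questions
# (dataset repository and field names are assumptions, not taken from the paper)
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def mcq_accuracy(model_name, dataset_name="cais/wmdp", config="wmdp-bio"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    data = load_dataset(dataset_name, config, split="test")
    letters = ["A", "B", "C", "D"]
    correct = 0
    for item in data:
        prompt = (
            item["question"] + "\n"
            + "\n".join(f"{l}. {c}" for l, c in zip(letters, item["choices"]))
            + "\nAnswer:"
        )
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]
        # Pick the answer letter whose token receives the highest logit
        letter_ids = [tokenizer(" " + l, add_special_tokens=False).input_ids[-1] for l in letters]
        pred = max(range(len(letters)), key=lambda i: logits[letter_ids[i]].item())
        correct += int(pred == item["answer"])
    return correct / len(data)

The same routine can be pointed at a general-knowledge benchmark to check that an intervention such as unlearning has not damaged benign capabilities.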
The WMDP benchmark offers a pragmatic middle ground in this debate. Instead of focusing on abstract risks or dismissing concerns entirely, it addresses specific, concrete risks that could arise from AI systems being misused for developing weapons of mass destruction.
Creating benchmarks for hazardous capabilities could inadvertently provide a roadmap for malicious actors. However, the authors argue that their careful filtering process mitigates this risk.
Figure 2: Architecture of the Representation Misdirection for Unlearning (RMU) method
The paper introduces RMU (Representation Misdirection for Unlearning), a novel technique for removing specific knowledge from AI models while preserving general capabilities. This addresses a key challenge in AI safety: how to make models safer without significantly degrading their useful capabilities.
Figure 3: Performance comparison before and after applying RMU
Important Results:
The paper's results show some degradation in related benign knowledge (like basic virology) when removing hazardous information. This raises questions about the true separability of harmful and beneficial knowledge.
WMDP represents a new approach to AI safety that bridges theoretical concerns with practical solutions, demonstrating that concrete, measurable progress on specific misuse risks is possible.
The path forward in AI safety likely involves combining multiple approaches: technical solutions like WMDP, policy frameworks, and ethical guidelines.
While the debate about AI safety continues, WMDP shows that concrete progress is possible. By focusing on specific, measurable risks while acknowledging the complexity of the challenge, we can work towards safer AI systems without falling into either excessive alarm or complacency.
The paper demonstrates that after unlearning, models output gibberish when asked about hazardous topics. While this prevents coherent harmful responses, it raises several important questions.
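One way to see this behavior directly is to compare generations from an unlearned checkpoint on a benign prompt and on a hazardous one. The sketch below is illustrative only: "path/to/unlearned-model" is a placeholder rather than an official released checkpoint, and the hazardous prompt is deliberately left unspecified.

# Sketch: probing an unlearned model on benign vs. hazardous prompts
# ("path/to/unlearned-model" is a placeholder, not an official checkpoint name)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/unlearned-model")
model = AutoModelForCausalLM.from_pretrained("path/to/unlearned-model")

prompts = {
    "benign": "Explain the basic function of a cell membrane.",
    "hazardous": "<a WMDP-style biosecurity question>",  # intentionally left as a placeholder
}
for label, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    print(label, "->", tokenizer.decode(output[0], skip_special_tokens=True))

On the benign prompt the model should answer normally; on the hazardous one, an RMU-unlearned model tends to produce incoherent text, which is exactly the behavior described above.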
The path to truly safe AI systems likely requires a combination of approaches, with unlearning being just one piece of a much larger puzzle.
# Example implementation of RMU unlearning (simplified sketch)
import torch

def rmu_unlearning(model, model_orig, forget_data, retain_data, c=6.5, alpha=1200):
    # Random unit vector defining the direction that activations on hazardous
    # data are steered toward (get_random_unit_vector is a placeholder helper)
    u = get_random_unit_vector()

    # Forget loss: push the model's activations on hazardous data toward the
    # scaled random direction c * u, scrambling the relevant representations
    forget_loss = 0
    for x in forget_data:
        activations = model.get_layer_activations(x)
        forget_loss += torch.norm(activations - c * u) ** 2

    # Retain loss: keep activations on benign data close to those of the frozen
    # original model (model_orig), preserving general capabilities
    retain_loss = 0
    for x in retain_data:
        activations = model.get_layer_activations(x)
        frozen_activations = model_orig.get_layer_activations(x).detach()
        retain_loss += torch.norm(activations - frozen_activations) ** 2

    return forget_loss + alpha * retain_loss
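In practice, this loss would be minimized with gradient descent, typically while updating only a small subset of the model's layers. The loop below is a minimal sketch of how the function might be used; the optimizer settings, logging interval, data loaders (forget_loader, retain_loader), and the frozen copy model_orig are assumptions, not the paper's exact configuration.

# Sketch: minimizing the RMU loss in a simple fine-tuning loop
# (learning rate, logging interval, and data loaders are illustrative assumptions)
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
for step, (forget_batch, retain_batch) in enumerate(zip(forget_loader, retain_loader)):
    loss = rmu_unlearning(model, model_orig, forget_batch, retain_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: unlearning loss = {loss.item():.3f}")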
Lead Authors:
Senior Author:
Even with rigorous testing and debugging, developers cannot reliably prevent AI systems from exhibiting dangerous, unintended behaviors. The attack surface is vast enough that identifying every input capable of triggering such behavior is impractical. Red-teaming and adversarial training (AT) are commonly used to improve robustness, but they often fail against failure modes that differ from the attacks seen during training.
Across its latents, a model gradually develops more compressed, abstract, and structured representations of the concepts it uses to process information. This makes it possible for latent-space attacks to activate the neural circuitry behind a failure without needing to find inputs that trigger it.
Figure 4: The motivation for LAT is based on how models develop more compressed, abstract, and structured representations across their latents. We hypothesize that many failures that are difficult to elicit from the input space may be easier to elicit from the latent space.
Figure 5: Evaluation procedures.
Figure 6: Evaluation results for image classification.
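Concretely, latent adversarial training perturbs a model's hidden activations rather than its inputs: an inner loop searches for a latent perturbation that maximizes the loss, and an outer step updates the model to behave correctly despite it. Below is a minimal sketch for the image-classification setting of Figure 6, assuming the network is split into an encoder and a classification head; the split, epsilon, and step sizes are illustrative assumptions, not the paper's setup.

# Sketch: one step of latent adversarial training for an image classifier
# (the encoder/head split, epsilon, and step sizes are illustrative assumptions)
import torch
import torch.nn.functional as F

def lat_step(encoder, head, optimizer, images, labels, epsilon=0.5, inner_steps=5, inner_lr=0.1):
    # Latent representation the adversary will perturb (detached: the inner
    # loop optimizes only the perturbation, not the model)
    latents = encoder(images).detach()

    # Inner loop: projected gradient ascent in latent space to find a
    # perturbation that maximizes the classification loss
    delta = torch.zeros_like(latents, requires_grad=True)
    for _ in range(inner_steps):
        adv_loss = F.cross_entropy(head(latents + delta), labels)
        (grad,) = torch.autograd.grad(adv_loss, delta)
        with torch.no_grad():
            delta += inner_lr * grad.sign()
            delta.clamp_(-epsilon, epsilon)

    # Outer step: update the model so it classifies correctly under the perturbation
    optimizer.zero_grad()
    robust_loss = F.cross_entropy(head(encoder(images) + delta.detach()), labels)
    robust_loss.backward()
    optimizer.step()
    return robust_loss.item()

For language models, the same recipe applies with the perturbation added to hidden activations at an intermediate layer instead of image embeddings.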
| Aspect | WMDP/RMU | Latent Adversarial Training |
|---|---|---|
| Primary Goal | Remove specific harmful capabilities while preserving general abilities | Make models robust against unforeseen failure modes |
| Technical Approach | Targeted activation steering with forget/retain loss | Adversarial perturbations in latent space |
| Evaluation Method | Benchmark-based evaluation (WMDP) | Robustness against novel attacks and trojans |
These approaches could potentially be combined.
Key Insight: While both papers work with model internals, they represent different philosophies in AI safety: WMDP takes a targeted approach to removing specific capabilities, while LAT aims for broader robustness through adversarial training in latent space. This difference highlights the complementary nature of current safety research.