SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
Our SafetyAnalyst framework redefines how AI safety decisions should be made: by explicitly analyzing the harms and benefits entailed by an AI behavior and trading them off according to the principles of cost-benefit analysis.
Given a prompt, SafetyAnalyst creates a structured "harm-benefit tree" that identifies the actions that could be taken if a compliant response were provided, the harmful and beneficial effects of those actions, and the stakeholders who would be impacted by those effects, as well as the likelihood, extent or severity, and immediacy of each effect.
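As a rough illustration of this structure, the sketch below renders a harm-benefit tree as nested Python records. The class and field names are hypothetical simplifications for exposition and do not necessarily match the taxonomy used in the released system.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Hypothetical, simplified rendering of a harm-benefit tree; names are
# illustrative and may differ from the released SafetyAnalyst taxonomy.

@dataclass
class Effect:
    description: str                     # e.g., "financial loss from a phishing scam"
    valence: Literal["harm", "benefit"]  # harmful or beneficial effect
    stakeholder: str                     # who is impacted, e.g., "email recipients"
    likelihood: float                    # how likely the effect is to occur
    severity: float                      # extent or severity of the effect
    immediacy: float                     # how soon the effect would materialize

@dataclass
class Action:
    description: str                     # an action enabled by a compliant response
    effects: List[Effect] = field(default_factory=list)

@dataclass
class HarmBenefitTree:
    prompt: str
    actions: List[Action] = field(default_factory=list)
```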
We implemented the SafetyAnalyst framework as an open-source system for harmful-prompt classification and released it along with the synthetic training data. Using chain-of-thought prompting, we generated 18.5 million harm-benefit features with frontier LLMs (GPT-4o, Gemini-1.5-Pro, Llama-3.1-70B-Instruct, Llama-3.1-405B-Turbo, and Claude-3.5-Sonnet) on 19k user prompts. Each prompt was embedded in a hypothetical AI language model usage scenario, and the LLMs were instructed to enumerate all harm-benefit features, which were then used to train two specialist models, one that generates harms and one that generates benefits (HarmReporter and BenefitReporter), through symbolic knowledge distillation via supervised fine-tuning of Llama-3.1-8B-Instruct. Given any prompt, SafetyReporter efficiently generates an interpretable harm-benefit tree. The harms and benefits are then weighted and traded off by an independent, fully interpretable aggregation model to compute a harmfulness score, which can be translated directly into content safety labels or refusal decisions. Steerability is achieved by aligning the weights in the aggregation model to a user's or community's preferences or to principled safety standards.
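To make the trade-off concrete, here is a minimal sketch of one way an interpretable aggregation over the tree structure sketched above could work: each effect contributes an expected impact (likelihood times severity, discounted when it is less immediate), harms add to and benefits subtract from a running score, and the weights and threshold are the steerable knobs. The particular weights and discount scheme below are placeholder assumptions, not the fitted parameters from the paper.

```python
def harmfulness_score(tree: HarmBenefitTree,
                      harm_weight: float = 1.0,
                      benefit_weight: float = 0.5,
                      immediacy_discount: float = 0.2) -> float:
    """Aggregate a harm-benefit tree into a single harmfulness score.

    The weights here are placeholders; in SafetyAnalyst the aggregation
    weights are the steerable parameters that can be aligned to a user's,
    community's, or policy's preferences.
    """
    score = 0.0
    for action in tree.actions:
        for e in action.effects:
            # Expected impact of one effect, discounted when less immediate.
            impact = e.likelihood * e.severity * (1 - immediacy_discount * (1 - e.immediacy))
            if e.valence == "harm":
                score += harm_weight * impact
            else:
                score -= benefit_weight * impact
    return score


def refuse(tree: HarmBenefitTree, threshold: float = 0.5) -> bool:
    # Scores above the threshold map to an "unsafe" label / refusal decision.
    return harmfulness_score(tree) > threshold
```

Because every term in the sum corresponds to a named effect on a named stakeholder, the final decision can be inspected and re-weighted, which is where the interpretability and steerability described above come from.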
We applied SafetyAnalyst to LLM prompt safety classification on a comprehensive set of prompt safety benchmarks. On average across these benchmarks, SafetyAnalyst outperformed existing LLM safety moderation systems, while offering the interpretability and steerability that other systems lack.
| Model | SimpSTests | HarmBench | WildGuardTest (Vanilla) | WildGuardTest (Adversarial) | AIR-Bench | SORRY-Bench | Average |
|---|---|---|---|---|---|---|---|
| OpenAI Mod. API | 63.0 | 47.9 | 16.3 | 6.8 | 46.5 | 42.9 | 41.1 |
| Llama-Guard | 93.0 | 85.6 | 70.5 | 32.6 | 44.7 | - | - |
| Llama-Guard-2 | 95.8 | 91.8 | 85.6 | 46.1 | 74.9 | 53.9 | 62.9 |
| Llama-Guard-3 | 99.5 | 98.4 | 86.7 | 61.6 | 68.8 | 59.1 | 64.6 |
| Aegis-Guard-D | 100 | 93.6 | 82.0 | 74.5 | 83.4 | - | - |
| Aegis-Guard-P | 99.0 | 87.6 | 77.9 | 62.9 | 62.5 | - | - |
| ShieldGemma-2B | 99.5 | 100 | 62.2 | 59.2 | 28.6 | 18.5 | 27.4 |
| ShieldGemma-9B | 83.7 | 77.2 | 61.3 | 35.8 | 28.6 | 39.0 | 37.3 |
| ShieldGemma-27B | 85.7 | 74.8 | 62.4 | 43.0 | 32.0 | 42.3 | 40.6 |
| WildGuard | 99.5 | 99.7 | 91.7 | 85.5 | 87.6 | 58.2 | 71.7 |
| GPT-4 | 100 | 100 | 93.4 | 81.6 | 84.5 | 78.2 | 81.6 |
| SafetyAnalyst | 88.8 | 96.1 | 90.9 | 79.6 | 88.9 | 75.4 | 81.2 |
@article{li2024safetyanalyst,
title={SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation},
author={Li, Jing-Jing and Pyatkin, Valentina and Kleiman-Weiner, Max and Jiang, Liwei and Dziri, Nouha and Collins, Anne GE and Borg, Jana Schaich and Sap, Maarten and Choi, Yejin and Levine, Sydney},
journal={arXiv preprint arXiv:2410.16665},
year={2024}
}