SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior

Our SafetyAnalyst framework redefines how AI safety decisions should be made: by explicitly analyzing the harms and benefits entailed in the AI behavior and trading them off based on the principles of cost-benefit analysis.
Given a prompt, SafetyAnalyst creates a structured "harm-benefit tree" that identifies the actions that could be taken if a compliant response were provided, the harmful and beneficial effects of those actions, and the stakeholders that would be impacted by those effects, as well as the likelihood, extent or severity, and immediacy of each effect.
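To make this structure concrete, the sketch below shows one way a harm-benefit tree could be represented in code. The class and field names (Effect, Action, HarmBenefitTree, likelihood, severity, immediacy) and the example values are illustrative assumptions, not the exact schema produced by SafetyAnalyst.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Illustrative schema for a harm-benefit tree; names and value scales are
# assumptions for exposition, not the exact format used by SafetyAnalyst.

@dataclass
class Effect:
    description: str                      # e.g., "property loss"
    stakeholder: str                      # who is impacted by the effect
    valence: Literal["harm", "benefit"]   # harmful or beneficial effect
    likelihood: float                     # chance the effect occurs if the action is taken
    severity: float                       # extent/severity of a harm, or magnitude of a benefit
    immediacy: float                      # how soon the effect would materialize

@dataclass
class Action:
    description: str                      # action enabled by a compliant response
    effects: List[Effect] = field(default_factory=list)

@dataclass
class HarmBenefitTree:
    prompt: str
    actions: List[Action] = field(default_factory=list)

# Toy example (values are made up)
tree = HarmBenefitTree(
    prompt="How do I pick a lock?",
    actions=[
        Action(
            description="Break into someone else's property",
            effects=[Effect("property loss", "the property owner", "harm",
                            likelihood=0.3, severity=0.7, immediacy=0.8)],
        ),
        Action(
            description="Open one's own lock after losing the key",
            effects=[Effect("regained access to belongings", "the user", "benefit",
                            likelihood=0.8, severity=0.5, immediacy=0.9)],
        ),
    ],
)
```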
We implemented the SafetyAnalyst framework as an open-source system for harmful prompt classification and released it along with the synthetic training data. Using chain-of-thought prompting, we generated 1.8 million harm-benefit features with frontier LLMs (GPT-4o, Gemini-1.5-Pro, Llama-3.1-70B-Instruct, Llama-3.1-405B-Turbo, and Claude-3.5-Sonnet) on 19k user prompts. Each prompt was embedded in a hypothetical AI language model usage scenario, and the LLMs were instructed to enumerate all harm-benefit features. These features were then used to train two specialist models, one that generates harms and one that generates benefits (HarmReporter and BenefitReporter), through symbolic knowledge distillation via supervised fine-tuning of Llama-3.1-8B-Instruct. Given any prompt, SafetyReporter (the two specialists combined) efficiently generates an interpretable harm-benefit tree. The harms and benefits are weighted and traded off by an independent, fully interpretable aggregation model to calculate a harmfulness score, which can be directly translated into content safety labels or refusal decisions. Steerability can be achieved by aligning the weights of the aggregation model to a user's or community's preferences or to principled safety standards.
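As a rough illustration of how such an interpretable, steerable aggregation could work, the sketch below trades off harm and benefit features with a simple weighted linear rule. The specific weights, the immediacy discount, and the decision threshold are assumptions for exposition, not the parameterization used in the paper.

```python
from typing import Dict, List

# Illustrative aggregation over harm-benefit features; the linear form,
# weights, and threshold are assumptions, not the paper's exact model.

def harmfulness_score(effects: List[Dict], weights: Dict[str, float]) -> float:
    """Weighted trade-off of harmful vs. beneficial effects.

    Each effect carries the features from the harm-benefit tree:
    valence ("harm"/"benefit"), likelihood, severity, and immediacy in [0, 1].
    """
    score = 0.0
    for e in effects:
        # Expected impact of this effect, discounted the further away it is.
        impact = (e["likelihood"] * e["severity"]
                  * weights["immediacy_discount"] ** (1.0 - e["immediacy"]))
        if e["valence"] == "harm":
            score += weights["harm"] * impact
        else:
            score -= weights["benefit"] * impact
    return score

def classify(effects: List[Dict], weights: Dict[str, float], threshold: float = 0.0) -> str:
    """Translate the harmfulness score into a content safety label."""
    return "harmful" if harmfulness_score(effects, weights) > threshold else "benign"

# Steerability: a cautious community can up-weight harms; a permissive one,
# benefits. Only the interpretable weights change, not the reporter models.
cautious   = {"harm": 2.0, "benefit": 0.5, "immediacy_discount": 0.9}
permissive = {"harm": 1.0, "benefit": 1.0, "immediacy_discount": 0.9}

effects = [
    {"valence": "harm",    "likelihood": 0.3, "severity": 0.7, "immediacy": 0.8},
    {"valence": "benefit", "likelihood": 0.8, "severity": 0.5, "immediacy": 0.9},
]
print(classify(effects, cautious))    # -> "harmful" under these illustrative weights
print(classify(effects, permissive))  # -> "benign"
```

Because only the weights differ between the two settings, the trade-off can be audited or adjusted to a community's preferences without retraining the reporter models.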
We applied SafetyAnalyst to classify LLM prompt safety on a comprehensive set of prompt safety benchmarks. On average across these benchmarks, SafetyAnalyst outperformed existing LLM safety moderation systems, while offering the interpretability and steerability that those systems lack.
F1 scores (×100) for prompt harmfulness classification. Vani. and Adv. denote the vanilla and adversarial subsets of WildGuardTest.

| Model | SimpSTests | HarmBench | WildGuardTest (Vani.) | WildGuardTest (Adv.) | AIR-Bench | SORRY-Bench | Average |
|---|---|---|---|---|---|---|---|
| OpenAI Mod. API | 63.0 | 47.9 | 16.3 | 6.8 | 46.5 | 42.9 | 41.1 |
| Llama-Guard | 93.0 | 85.6 | 70.5 | 32.6 | 44.7 | - | - |
| Llama-Guard-2 | 95.8 | 91.8 | 85.6 | 46.1 | 74.9 | 53.9 | 62.9 |
| Llama-Guard-3 | 99.5 | 98.4 | 86.7 | 61.6 | 68.8 | 59.1 | 64.6 |
| Aegis-Guard-D | 100 | 93.6 | 82.0 | 74.5 | 83.4 | - | - |
| Aegis-Guard-P | 99.0 | 87.6 | 77.9 | 62.9 | 62.5 | - | - |
| ShieldGemma-2B | 99.5 | 100 | 62.2 | 59.2 | 28.6 | 18.5 | 27.4 |
| ShieldGemma-9B | 83.7 | 77.2 | 61.3 | 35.8 | 28.6 | 39.0 | 37.3 |
| ShieldGemma-27B | 85.7 | 74.8 | 62.4 | 43.0 | 32.0 | 42.3 | 40.6 |
| WildGuard | 99.5 | 99.7 | 91.7 | 85.5 | 87.6 | 58.2 | 71.7 |
| GPT-4 | 100 | 100 | 93.4 | 81.6 | 84.5 | 78.2 | 81.6 |
| SafetyAnalyst | 88.8 | 96.1 | 90.9 | 79.6 | 88.9 | 75.4 | 81.2 |
@misc{li2024safetyanalystinterpretabletransparentsteerable,
title={SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior},
author={Jing-Jing Li and Valentina Pyatkin and Max Kleiman-Weiner and Liwei Jiang and Nouha Dziri and Anne G. E. Collins and Jana Schaich Borg and Maarten Sap and Yejin Choi and Sydney Levine},
year={2024},
eprint={2410.16665},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.16665}
}