SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior

Our SafetyAnalyst framework redefines how AI safety decisions should be made: by explicitly analyzing the harms and benefits entailed in the AI behavior and trading them off based on the principles of cost-benefit analysis.
Given a prompt, SafetyAnalyst creates a structured "harm-benefit tree" that identifies the actions that could be taken if a compliant response were provided, the harmful and beneficial effects of those actions, and the stakeholders that would be impacted by those effects, as well as the likelihood, extent or severity, and immediacy of each effect.
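To make this structure concrete, the sketch below shows one way a harm-benefit tree could be represented in code. The class and field names (Effect, Action, HarmBenefitTree, likelihood, severity, immediacy) and the example values are illustrative assumptions, not the exact schema produced by SafetyAnalyst.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Illustrative schema for a harm-benefit tree; names and value scales are
# assumptions for exposition, not the exact format used by SafetyAnalyst.

@dataclass
class Effect:
    description: str                      # e.g., "property loss"
    stakeholder: str                      # who is impacted by the effect
    valence: Literal["harm", "benefit"]   # harmful or beneficial effect
    likelihood: float                     # chance the effect occurs if the action is taken
    severity: float                       # extent/severity of a harm, or magnitude of a benefit
    immediacy: float                      # how soon the effect would materialize

@dataclass
class Action:
    description: str                      # action enabled by a compliant response
    effects: List[Effect] = field(default_factory=list)

@dataclass
class HarmBenefitTree:
    prompt: str
    actions: List[Action] = field(default_factory=list)

# Toy example (values are made up)
tree = HarmBenefitTree(
    prompt="How do I pick a lock?",
    actions=[
        Action(
            description="Break into someone else's property",
            effects=[Effect("property loss", "the property owner", "harm",
                            likelihood=0.3, severity=0.7, immediacy=0.8)],
        ),
        Action(
            description="Open one's own lock after losing the key",
            effects=[Effect("regained access to belongings", "the user", "benefit",
                            likelihood=0.8, severity=0.5, immediacy=0.9)],
        ),
    ],
)
```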
We implemented the SafetyAnalyst framework as an open-source system for harmful prompt classification and released it along with the synthetic training data. Using chain-of-thought prompting, we generated 1.8 million harm-benefit features with frontier LLMs (GPT-4o, Gemini-1.5-Pro, Llama-3.1-70B-Instruct, Llama-3.1-405B-Turbo, and Claude-3.5-Sonnet) on 19k user prompts. Each prompt was embedded in a hypothetical AI language model usage scenario, and the LLMs were instructed to enumerate all harm-benefit features. These features were then used to train two specialist models, one that generates harms and one that generates benefits (HarmReporter and BenefitReporter), through symbolic knowledge distillation via supervised fine-tuning of Llama-3.1-8B-Instruct. Given any prompt, SafetyReporter (the two specialists combined) efficiently generates an interpretable harm-benefit tree. The harms and benefits are weighted and traded off by an independent, fully interpretable aggregation model to calculate a harmfulness score, which can be directly translated into content safety labels or refusal decisions. Steerability can be achieved by aligning the weights of the aggregation model to a user's or community's preferences or to principled safety standards.
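As a rough illustration of how such an interpretable, steerable aggregation could work, the sketch below trades off harm and benefit features with a simple weighted linear rule. The specific weights, the immediacy discount, and the decision threshold are assumptions for exposition, not the parameterization used in the paper.

```python
from typing import Dict, List

# Illustrative aggregation over harm-benefit features; the linear form,
# weights, and threshold are assumptions, not the paper's exact model.

def harmfulness_score(effects: List[Dict], weights: Dict[str, float]) -> float:
    """Weighted trade-off of harmful vs. beneficial effects.

    Each effect carries the features from the harm-benefit tree:
    valence ("harm"/"benefit"), likelihood, severity, and immediacy in [0, 1].
    """
    score = 0.0
    for e in effects:
        # Expected impact of this effect, discounted the further away it is.
        impact = (e["likelihood"] * e["severity"]
                  * weights["immediacy_discount"] ** (1.0 - e["immediacy"]))
        if e["valence"] == "harm":
            score += weights["harm"] * impact
        else:
            score -= weights["benefit"] * impact
    return score

def classify(effects: List[Dict], weights: Dict[str, float], threshold: float = 0.0) -> str:
    """Translate the harmfulness score into a content safety label."""
    return "harmful" if harmfulness_score(effects, weights) > threshold else "benign"

# Steerability: a cautious community can up-weight harms; a permissive one,
# benefits. Only the interpretable weights change, not the reporter models.
cautious   = {"harm": 2.0, "benefit": 0.5, "immediacy_discount": 0.9}
permissive = {"harm": 1.0, "benefit": 1.0, "immediacy_discount": 0.9}

effects = [
    {"valence": "harm",    "likelihood": 0.3, "severity": 0.7, "immediacy": 0.8},
    {"valence": "benefit", "likelihood": 0.8, "severity": 0.5, "immediacy": 0.9},
]
print(classify(effects, cautious))    # -> "harmful" under these illustrative weights
print(classify(effects, permissive))  # -> "benign"
```

Because only the weights differ between the two settings, the trade-off can be audited or adjusted to a community's preferences without retraining the reporter models.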
We applied SafetyAnalyst to classify LLM prompt safety on a comprehensive set of prompt safety benchmarks. On average across these benchmarks, SafetyAnalyst outperformed existing LLM safety moderation systems, while offering the interpretability and steerability that those systems lack.
F1 scores (×100) for prompt harmfulness classification. Vani. and Adv. denote the vanilla and adversarial subsets of WildGuardTest.

| Model | SimpSTests | HarmBench | WildGuardTest (Vani.) | WildGuardTest (Adv.) | AIR-Bench | SORRY-Bench | Average |
|---|---|---|---|---|---|---|---|
| OpenAI Mod. API | 63.0 | 47.9 | 16.3 | 6.8 | 46.5 | 42.9 | 41.1 |
| Llama-Guard | 93.0 | 85.6 | 70.5 | 32.6 | 44.7 | - | - |
| Llama-Guard-2 | 95.8 | 91.8 | 85.6 | 46.1 | 74.9 | 53.9 | 62.9 |
| Llama-Guard-3 | 99.5 | 98.4 | 86.7 | 61.6 | 68.8 | 59.1 | 64.6 |
| Aegis-Guard-D | 100 | 93.6 | 82.0 | 74.5 | 83.4 | - | - |
| Aegis-Guard-P | 99.0 | 87.6 | 77.9 | 62.9 | 62.5 | - | - |
| ShieldGemma-2B | 99.5 | 100 | 62.2 | 59.2 | 28.6 | 18.5 | 27.4 |
| ShieldGemma-9B | 83.7 | 77.2 | 61.3 | 35.8 | 28.6 | 39.0 | 37.3 |
| ShieldGemma-27B | 85.7 | 74.8 | 62.4 | 43.0 | 32.0 | 42.3 | 40.6 |
| WildGuard | 99.5 | 99.7 | 91.7 | 85.5 | 87.6 | 58.2 | 71.7 |
| GPT-4 | 100 | 100 | 93.4 | 81.6 | 84.5 | 78.2 | 81.6 |
| SafetyAnalyst | 88.8 | 96.1 | 90.9 | 79.6 | 88.9 | 75.4 | 81.2 |
@misc{li2024safetyanalystinterpretabletransparentsteerable,
title={SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior},
author={Jing-Jing Li and Valentina Pyatkin and Max Kleiman-Weiner and Liwei Jiang and Nouha Dziri and Anne G. E. Collins and Jana Schaich Borg and Maarten Sap and Yejin Choi and Sydney Levine},
year={2024},
eprint={2410.16665},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.16665}
}