SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior

UC Berkeley · Allen Institute for AI · University of Washington · Duke University · CMU

Our SafetyAnalyst framework redefines how AI safety decisions should be made: by explicitly analyzing the harms and benefits entailed by an AI behavior and trading them off according to the principles of cost-benefit analysis. Given a prompt, SafetyAnalyst constructs a structured "harm-benefit tree" that identifies the actions that could be taken if a compliant response were provided, the harmful and beneficial effects of those actions, and the stakeholders who would be impacted by those effects, as well as the likelihood, extent or severity, and immediacy of each effect.
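To illustrate this structure, the sketch below encodes a harm-benefit tree as a few nested Python records. The class and field names (Effect, Action, HarmBenefitTree) are illustrative assumptions for exposition, not the exact schema used in the released system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Effect:
    """One harmful or beneficial effect of an action on a stakeholder."""
    stakeholder: str   # who is impacted (e.g., "prompt author", "third parties")
    description: str   # what the effect is
    kind: str          # "harm" or "benefit"
    likelihood: float  # probability the effect occurs, in [0, 1]
    severity: float    # extent or severity of the effect (0-1 scale here)
    immediacy: float   # how soon the effect materializes (1.0 = immediate)

@dataclass
class Action:
    """An action that could be taken if a compliant response were provided."""
    description: str
    effects: List[Effect] = field(default_factory=list)

@dataclass
class HarmBenefitTree:
    """Root node: the user prompt and the actions it could enable."""
    prompt: str
    actions: List[Action] = field(default_factory=list)
```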

[Figure: Example harm-benefit tree]

We implemented the SafetyAnalyst framework as an open-source system for harmful prompt classification and released it along with its synthetic training data. Using chain-of-thought prompting, we generated 1.8 million harm-benefit features with frontier LLMs (GPT-4o, Gemini-1.5-Pro, Llama-3.1-70B-Instruct, Llama-3.1-405B-Turbo, and Claude-3.5-Sonnet) on 19k user prompts. Each prompt was embedded in a hypothetical AI language model usage scenario, and the LLMs were instructed to enumerate all harm-benefit features. These features were then used to train two specialist models, HarmReporter (which generates harms) and BenefitReporter (which generates benefits), through symbolic knowledge distillation via supervised fine-tuning of Llama-3.1-8B-Instruct. Given any prompt, the combined SafetyReporter efficiently generates an interpretable harm-benefit tree. The harms and benefits are then weighted and traded off by an independent, fully interpretable aggregation model to produce a harmfulness score, which can be directly translated into content safety labels or refusal decisions. Steerability is achieved by aligning the weights of the aggregation model with a user's or community's preferences or with principled safety standards.
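To make the aggregation step concrete, here is a minimal sketch of a weighted cost-benefit aggregation, building on the HarmBenefitTree sketch above. It assumes a single harm weight and benefit weight, multiplicative discounting by likelihood and immediacy, and a fixed decision threshold; the actual aggregation model is more fine-grained (e.g., separate learned weights per harm category), so this only shows the shape of the computation.

```python
def harmfulness_score(tree: HarmBenefitTree,
                      harm_weight: float = 1.0,
                      benefit_weight: float = 1.0) -> float:
    """Aggregate a harm-benefit tree into a scalar harmfulness score.

    Each effect contributes severity * likelihood * immediacy, weighted by
    whether it is a harm or a benefit. Positive scores indicate that the
    expected harms outweigh the expected benefits.
    """
    score = 0.0
    for action in tree.actions:
        for e in action.effects:
            contribution = e.severity * e.likelihood * e.immediacy
            if e.kind == "harm":
                score += harm_weight * contribution
            else:
                score -= benefit_weight * contribution
    return score

def classify(tree: HarmBenefitTree, threshold: float = 0.0) -> str:
    """Translate the harmfulness score into a content safety label."""
    return "harmful" if harmfulness_score(tree) > threshold else "benign"
```

Steerability enters through the weights: increasing harm_weight (or using per-category weights) yields more conservative refusal decisions, and the weights can be fit to a user's or community's labeled preferences rather than being set by hand.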


We applied SafetyAnalyst to prompt harmfulness classification on a comprehensive set of prompt safety benchmarks. SafetyAnalyst outperformed existing LLM safety moderation systems on average (F1 ×100, reported below), while offering the interpretability and steerability that those systems lack.

| Model | SimpSTests | HarmBench | WildGuardTest Vani. | WildGuardTest Adv. | AIR-Bench | SORRY-Bench | Average |
|---|---|---|---|---|---|---|---|
| OpenAI Mod. API | 63.0 | 47.9 | 16.3 | 6.8 | 46.5 | 42.9 | 41.1 |
| Llama-Guard | 93.0 | 85.6 | 70.5 | 32.6 | 44.7 | - | - |
| Llama-Guard-2 | 95.8 | 91.8 | 85.6 | 46.1 | 74.9 | 53.9 | 62.9 |
| Llama-Guard-3 | 99.5 | 98.4 | 86.7 | 61.6 | 68.8 | 59.1 | 64.6 |
| Aegis-Guard-D | 100 | 93.6 | 82.0 | 74.5 | 83.4 | - | - |
| Aegis-Guard-P | 99.0 | 87.6 | 77.9 | 62.9 | 62.5 | - | - |
| ShieldGemma-2B | 99.5 | 100 | 62.2 | 59.2 | 28.6 | 18.5 | 27.4 |
| ShieldGemma-9B | 83.7 | 77.2 | 61.3 | 35.8 | 28.6 | 39.0 | 37.3 |
| ShieldGemma-27B | 85.7 | 74.8 | 62.4 | 43.0 | 32.0 | 42.3 | 40.6 |
| WildGuard | 99.5 | 99.7 | 91.7 | 85.5 | 87.6 | 58.2 | 71.7 |
| GPT-4 | 100 | 100 | 93.4 | 81.6 | 84.5 | 78.2 | 81.6 |
| SafetyAnalyst | 88.8 | 96.1 | 90.9 | 79.6 | 88.9 | 75.4 | 81.2 |

BibTeX

@misc{li2024safetyanalystinterpretabletransparentsteerable,
  title={SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior},
  author={Jing-Jing Li and Valentina Pyatkin and Max Kleiman-Weiner and Liwei Jiang and Nouha Dziri and Anne G. E. Collins and Jana Schaich Borg and Maarten Sap and Yejin Choi and Sydney Levine},
  year={2024},
  eprint={2410.16665},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.16665}
}