JING-JING LI

BUILDING SAFE INTELLIGENT SYSTEMS

PhD Student at UC Berkeley

AI Safety Researcher • Cognitive Scientist


RESEARCH

My research is dedicated to improving the safety of generative AI systems. My goal is to ensure that as AI models become more powerful, they remain robust, interpretable, and aligned with human values. I have applied this focus directly through research internships at AWS Agentic AI, where I investigated the adversarial robustness of AI agents, and at the Allen Institute for AI, where I worked on interpretable and transparent safety moderation.

My approach to AI safety is grounded in cognitive science. As a final-year PhD student at UC Berkeley advised by Professor Anne Collins, I investigate the computational principles behind how humans learn, reason, and make decisions. This background provides a unique lens for analyzing complex, human-like behaviors in AI and for constructing high-quality alignment data that reflects nuanced human judgment.

NEWS

Jul 13, 2025
Presenting my work on AI safety at ICML in Vancouver!
May 16, 2025
Started my internship at AWS Agentic AI in Seattle!
Apr 04, 2025
New work on exploration under uncertainty accepted as a spotlight at RLDM and a talk at CogSci!
Oct 04, 2024
Presenting my newly accepted Cognition paper at SfN24 with the TPDA award!
Aug 06, 2024
Two projects featured at the CCN conference in Boston.

SELECTED PUBLICATIONS

  1. SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior
     Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, and Sydney Levine
     ICML, 2025
  2. An algorithmic account for how humans efficiently learn, transfer, and compose hierarchically structured decision policies
     Jing-Jing Li and Anne G. E. Collins
     Cognition, 2025
  3. Dynamic noise estimation: A generalized method for modeling noise fluctuations in decision-making
     Jing-Jing Li, Chengchun Shi, Lexin Li, and Anne G. E. Collins
     Journal of Mathematical Psychology, 2024