AI Safety and Security

Research Statement

The rapid advancement of AI, especially large language models (LLMs), large multimodal models, and AI agents, offers transformative opportunities but also demands robust safety and security measures. Ensuring that these powerful systems are trustworthy, secure, and aligned with human values is critical for their responsible deployment. Advanced models face diverse threats: adversarial prompts can jailbreak aligned LLMs, prompt injection attacks can trigger unintended decisions and actions, small perturbations can mislead multimodal systems, and malicious agents can exploit vulnerabilities within cooperative ecosystems. Addressing these risks requires rigorous safety frameworks that keep AI reliable, trustworthy, secure, and aligned in real-world applications.

Our primary goal is to establish foundational principles and practical techniques for building safe and secure AI systems. We focus on understanding how the complex reasoning, multimodal understanding, and interactive capabilities of these systems may introduce novel vulnerabilities or offer unique defense opportunities. Our work is organized around these key areas:

Evaluation & Automated Red Teaming: Developing robust methodologies and automated tools—including benchmark creation and stress-testing—to proactively identify and address vulnerabilities.
Interpretability & Control: Enhancing transparency to reveal the decision-making processes of AI models, thereby identifying root causes of failures and bolstering trust in high-stakes applications.
Robust Defenses: Designing novel mechanisms to detect, mitigate, and defend against adversarial attacks and other security threats, ensuring reliable operation in complex environments.
Alignment & Post-Training: Investigating advanced techniques to align AI systems with human guidelines after the pretraining phase. This includes reinforcement learning from human feedback, alignment fine-tuning, and continuous post-deployment evaluation to guard against drift and misalignment over time.

Recognition & Awards

We are honored to have been recognized for our contributions to AI safety and security with awards including:

Best Research Paper Finalist at VLDB 2024.
Best Scientific Cybersecurity Paper Award 2024 by NSA.
Outstanding Paper Award at NeurIPS 2023.

Recent Publications

An Undetectable Watermark for Generative Image Models

Sam Gunn*, Xuandong Zhao*, Dawn Song

International Conference on Learning Representations (ICLR). April, 2025.

Multimodal Situational Safety

Kaiwen Zhou*, Chengzhi Liu*, Xuandong Zhao, Anderson Compalas, Dawn Song, Xin Eric Wang

International Conference on Learning Representations (ICLR). April, 2025.

Capturing the Temporal Dependence of Training Data Influence

Jiachen T. Wang, Dawn Song, James Zou, Prateek Mittal, Ruoxi Jia

International Conference on Learning Representations (ICLR). April, 2025.

Data Shapley in One Training Run

Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia

International Conference on Learning Representations (ICLR). April, 2025.

AIR-Bench 2024: A Safety Benchmark based on Regulation and Policies Specified Risk Categories

Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, Bo Li

International Conference on Learning Representations (ICLR). April, 2025.

CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification

Yuchen Tian*, Weixiang Yan*, Qian Yang, Xuandong Zhao, Qian Chen, Wen Wang, Ziyang Luo, Lei Ma, Dawn Song

Annual AAAI Conference on Artificial Intelligence (AAAI). February, 2025.

Boosting Alignment for Post-Unlearning Text-to-Image Generative Models

Myeongseob Ko*, Henry Li*, Zhun Wang, Jonathan Patsenker, Jiachen T. Wang, Qinbin Li, Ming Jin, Dawn Song, Ruoxi Jia