AI Safety and Security

[Research Statement] [Publications] [Members]

Research Statement

The rapid advancement of AI—especially large language models (LLMs), large multimodal models, and AI agents—offers transformative opportunities while also necessitating robust safety and security measures. Ensuring that these powerful systems are trustworthy, secure, and aligned with human values is critical for their responsible deployment. Advanced models face diverse threats, for example: adversarial prompts can jailbreak aligned LLMs, prompt injection attacks can lead to unintended decisions and actions, small perturbations can mislead multimodal systems, and malicious agents can exploit vulnerabilities within cooperative ecosystems. To address these risks requires rigorous safety frameworks to ensure that AI remains reliable, trustworthy, secure, and aligned in real-world applications.

Our primary goal is to establish foundational principles and practical techniques for building safe and secure AI systems. We focus on understanding how the complex reasoning, multimodal understanding, and interactive capabilities of these systems may introduce novel vulnerabilities or offer unique defense opportunities. Our work is organized around these key areas:

Recognition & Awards

We are also honored to have been recognized for our contributions to the field of AI Safety and Security with prestigious awards, including:


Recent Publications

An Undetectable Watermark for Generative Image Models

Sam Gunn*, Xuandong Zhao*, Dawn Song

International Conference on Learning Representations (ICLR). April, 2025.

Multimodal Situational Safety

Kaiwen Zhou*, Chengzhi Liu*, Xuandong Zhao, Anderson Compalas, Dawn Song, Xin Eric Wang

International Conference on Learning Representations (ICLR). April, 2025.

Capturing the Temporal Dependence of Training Data Influence

Jiachen T. Wang, Dawn Song, James Zou, Prateek Mittal, Ruoxi Jia

International Conference on Learning Representations (ICLR). April, 2025.

Data Shapley in One Training Run

Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia

International Conference on Learning Representations (ICLR). April, 2025.

AIR-Bench 2024: A Safety Benchmark based on Regulation and Policies Specified Risk Categories

Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, Bo Li

International Conference on Learning Representations (ICLR). April, 2025.

CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification

Yuchen Tian*, Weixiang Yan*, Qian Yang, Xuandong Zhao, Qian Chen, Wen Wang, Ziyang Luo, Lei Ma, Dawn Song

Annual AAAI Conference on Artificial Intelligence (AAAI). February, 2025.

Boosting Alignment for Post-Unlearning Text-to-Image Generative Models

Myeongseob Ko*, Henry Li*, Zhun Wang, Jonathan Patsenker, Jiachen T. Wang, Qinbin Li, Ming Jin, Dawn Song, Ruoxi Jia

Advances in Neural Information Processing Systems (NeurIPS). December, 2024.

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li

Advances in Neural Information Processing Systems (NeurIPS). December, 2024.

GREATS: Online Selection of High-Quality Data for LLM Training in Every Iteration

Jiachen T. Wang, Tong Wu, Dawn Song, Prateek Mittal, Ruoxi Jia

Advances in Neural Information Processing Systems (NeurIPS). December, 2024.

Data Free Backdoor Attacks

Bochuan Cao, Jinyuan Jia, Chuxuan Hu, Wenbo Guo, Zhen Xiang, Jinghui Chen, Bo Li, Dawn Song

Advances in Neural Information Processing Systems (NeurIPS). December, 2024.

RedCode: Risky Code Execution and Generation Benchmark for Code Agents

Chengquan Guo*, Xun Liu*, Chulin Xie*, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, Bo Li

Advances in Neural Information Processing Systems (NeurIPS). December, 2024.

LLM-PBE: Assessing Data Privacy in Large Language Models

Qinbin Li, Junyuan Hong, Chulin Xie, Jeffrey Tan, Rachel Xin, Junyi Hou, Xavier Yin, Zhun Wang, Dan Hendrycks, Zhangyang Wang, Bo Li, Bingsheng He, Dawn Song

International Conference on Very Large Data Bases (VLDB) Best Paper Award Finalist. August, 2024.

SHINE: Shielding Backdoors in Deep Reinforcement Learning

Zhuowen Yuan, Wenbo Guo, Jinyuan Jia, Bo Li, Dawn Song

The International Conference on Machine Learning (ICML). July, 2024.

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, Bo Li

The International Conference on Machine Learning (ICML). July, 2024.

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian Bartoldson, Ajay Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang “Atlas” Wang, Bo Li

The International Conference on Machine Learning (ICML). July, 2024.

C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models

Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li

The International Conference on Machine Learning (ICML). July, 2024.

The False Promise of Imitating Proprietary Language Models

Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song

International Conference on Learning Representations (ICLR). May, 2024.

TextGuard: Provable Defense against Backdoor Attacks on Text Classification

Hengzhi Pei, Jinyuan Jia, Wenbo Guo, Bo Li, Dawn Song

The Network and Distributed System Security Symposium (NDSS). February, 2024.

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li

Advances in Neural Information Processing Systems (NeurIPS) Outstanding Paper Award. December, 2023.

Best Scientific Cybersecurity Paper Award of 2024 (by National Security Agency)

DiffAttack: Evasion Attacks Against Diffusion-Based Adversarial Purification

Mintong Kang, Dawn Song, Bo Li

Advances in Neural Information Processing Systems (NeurIPS). December, 2023.

BIRD: Generalizable Backdoor Detection and Removal for Deep Reinforcement Learning

Xuan Chen, Wenbo Guo, Guanhong Tao, Xiangyu Zhang, Dawn Song

Advances in Neural Information Processing Systems (NeurIPS). December, 2023.

PATROL: Provable Defense against Adversarial Policy in Two-player Games

Wenbo Guo, Xian Wu, Lun Wang, Xinyu Xing, Dawn Song

USENIX Security Symposium. August, 2023.

Extracting Training Data from Diffusion Models

Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace

USENIX Security Symposium. August, 2023.

Poisoning Instruction-Tuned Language Models

Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein

The International Conference on Machine Learning (ICML). July, 2023.

Trojdiff: Trojan attacks on diffusion models with diverse targets

Weixin Chen, Dawn Song, Bo Li

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2023.

Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses

Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Madry, Bo Li, Tom Goldstein

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). February, 2023.

Scaling Out-of-Distribution Detection for Real-World Settings

Dan Hendrycks, Steven Basart, Mantas Mazeika, Andy Zou, Joe Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song

The International Conference on Machine Learning (ICML). July, 2022.

Deduplicating Training Data Mitigates Privacy Risks in Language Models

Nikhil Kandpal, Eric Wallace, Colin Raffel

The International Conference on Machine Learning (ICML). July, 2022.

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, and Jacob Steinhardt

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2022.

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

Dan Hendrycks, Steven Basart*, Norman Mu*, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer.

International Conference on Computer Vision (ICCV). October, 2021.

Extracting Training Data from Large Language Models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel.

USENIX Security Symposium. August, 2021.

Towards Robustness of Text-to-SQL Models against Synonym Substitution

Yujian Gan, Xinyun Chen, Qiuping Huang, Matthew Purver, John R. Woodward, Jinxia Xie, Pengsheng Huang.

Annual Meeting of the Association for Computational Linguistics (ACL). August, 2021.

BACKDOORL: Backdoor Attack against Competitive Reinforcement Learning

Lun Wang, Zaynah Javed, Xian Wu, Wenbo Guo, Xinyu Xing, Dawn Song.

International Joint Conference on Artificial Intelligence (IJCAI). August, 2021.

Natural Adversarial Examples

Dan Hendrycks, Kevin Zhao*, Steven Basart*, Jacob Steinhardt, Dawn Song.

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2021.

REFIT: a Unified Watermark Removal Framework for Deep Learning Systems with Limited Data

Xinyun Chen*, Wenxiao Wang*, Chris Bender, Yiming Ding, Ruoxi Jia, Bo Li, Dawn Song.

ACM Asia Conference on Computer and Communications Security (AsiaCCS). June, 2021.

Understanding Robustness in Teacher-Student Setting: A New Perspective

Zhuolin Yang*, Zhaoxi Chen, Tiffany (Tianhui) Cai, Xinyun Chen, Bo Li, Yuandong Tian*.

International Conference on Artificial Intelligence and Statistics (AISTATS). April, 2021.

Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses

Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Madry, Bo Li, Tom Goldstein.

December, 2020.

Imitation Attacks and Defenses for Black-box Machine Translation Systems

Eric Wallace, Mitchell Stern, Dawn Song.

Conference on Empirical Methods in Natural Language Processing (EMNLP), November, 2020.

Blog

Towards Inspecting and Eliminating Trojan Backdoors in Deep Neural Networks

Wenbo Guo*, Lun Wang*, Yan Xu, Xinyu Xing, Min Du, Dawn Song.

IEEE International Conference on Data Mining (ICDM), November, 2020.

Pretrained Transformers Improve Out-of-Distribution Robustness

Dan Hendrycks*, Xiaoyuan Liu*, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song.

Annual Meeting of the Association for Computational Linguistics (ACL). July, 2020.

The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks

Yuheng Zhang*, Ruoxi Jia*, Hengzhi Pei, Wenxiao Wang, Bo Li, Dawn Song.

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2020.

Robust Anomaly Detection and Backdoor Attack Detection Via Differential Privacy

Min Du, Ruoxi Jia, Dawn Song.

International Conference on Learning Representations (ICLR). May, 2020.

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

Dan Hendrycks, Mantas Mazeika*, Saurav Kadavath*, Dawn Song.

Advances in Neural Information Processing Systems (NeurIPS). December, 2019.

AdvIT: Adversarial Frames Identifier Based on Temporal Consistency In Videos

Chaowei Xiao, Ruizhi Deng, Bo Li, Taesung Lee, Benjamin Edwards, Jinfeng Yi, Dawn Song, Mingyan Liu, Ian Molloy.

International Conference on Computer Vision (ICCV). October, 2019.

The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets

Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, Dawn Song.

USENIX Security. August, 2019.

Press: The Register | Schneier on Security

How You Act Tells a Lot: Privacy-Leakage Attack on Deep Reinforcement Learning

​Xinlei Pan, Weiyao Wang, Xiaoshuai Zhang, Bo Li, Jinfeng Yi, Dawn Song.

International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May, 2019

Characterizing Audio Adversarial Examples Using Temporal Dependency

Zhuolin Yang, Bo Li, Pin-Yu Chen, Dawn Song.

International Conference on Learning Representations (ICLR). May, 2019.

Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation

Chaowei Xiao, Ruizhi Deng, Bo Li, Fisher Yu, Mingyan Liu, Dawn Song.

European Conference on Computer Vision (ECCV). September, 2018.

Exploring the Space of Black-box Attacks on Deep Neural Networks

Arjun Nitin Bhagoji, Warren He, Bo Li, Dawn Song.

The European Conference on Computer Vision (ECCV). September, 2018.

Generating Adversarial Examples with Adversarial Networks

Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, Dawn Song.

The International Joint Conference on Artificial Intelligence (IJCAI). July, 2018.

Curriculum Adversarial Training

Qizhi Cai, (Min Du), Chang Liu, Dawn Song.

The International Joint Conference on Artificial Intelligence (IJCAI). July, 2018.

Fooling Vision and Language Models Despite Localization and Attention Mechanism

Xiaojun Xu, Xinyun Chen, Chang Liu, Anna Rohrbach, Trevor Darell, Dawn Song.

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2018.

Robust Physical-World Attacks on Deep Learning Visual Classification

Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, Chaowei Xiao, Dawn Song.

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2018.

Press: IEEE Spectrum | Yahoo News | Wired | Engagdet | Telegraph | Car and Driver | CNET | Digital Trends | SCMagazine | Schneier on Security | Ars Technica | Fortune | Science Magazine

Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality

Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, James Bailey.

International Conference on Learning Representations (ICLR). May, 2018.

Spatially Transformed Adversarial Examples

Chaowei Xiao*, Jun-Yan Zhu*, Bo Li, Mingyan Liu, Dawn Song.

International Conference on Learning Representations (ICLR). May, 2018.

Decision Boundary Analysis of Adversarial Examples

Warren He, Bo Li, Dawn Song.

International Conference on Learning Representations (ICLR). May, 2018.

Adversarial examples for generative models

Jernej Kos, Ian Fischer, Dawn Song.

IEEE S&P Workshop on Deep Learning and Security. May, 2018.

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, Dawn Song.

December, 2017.

Press: Motherboard | The Register

Adversarial Example Defenses: Ensembles of Weak Defenses are not Strong

Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song.

USENIX Workshop on Offensive Technologies (WOOT). August, 2017.

Delving into Transferable Adversarial Examples and Black-box Attacks

Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song.

International Conference on Learning Representations (ICLR). April, 2017.

Delving into adversarial attacks on deep policies

Jernej Kos and Dawn Song.

ICLR Workshop. April, 2017.


Members