AI Safety and Security

[Research Statement] [Publications] [Members]

Research Statement

As intelligent systems become pervasive, safeguarding their security and privacy is critical. However, recent research has demonstrated that machine learning systems, including state-of-the-art deep neural networks, can be easily fooled by an adversary. For example, it is easy to generate adversarial examples: inputs that are close to benign ones yet misclassified by the model. Moreover, in our recent work, we have shown that such attacks can succeed even without access to the model's internals, i.e., in a black-box setting. These attacks can have severe consequences: an adversary can, for example, mislead the perception system of an autonomous vehicle into misidentifying road signs, potentially causing catastrophic traffic accidents. Such security issues therefore hinder the adoption of machine learning in security-critical systems.
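
To make the first claim concrete, here is a minimal sketch of the fast gradient sign method (FGSM) of Goodfellow et al., one standard way to craft adversarial examples under a white-box threat model. The model, inputs, and perturbation budget below are illustrative placeholders, not artifacts from the papers listed on this page.

```python
# Minimal FGSM sketch; `model` (returns logits), `x` (a batch of images
# in [0, 1]), `y` (true labels), and `epsilon` are illustrative placeholders.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Fast gradient sign method: take one epsilon-sized step in the
    direction that maximizes the model's loss on the true label."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # The perturbation is visually negligible yet often flips the prediction.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```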

In AI security research, we aim to investigate the vulnerabilities of machine learning systems and, ultimately, to develop robust defense strategies against such sophisticated adversarial manipulations in real-world applications.
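
The black-box claim above can be made concrete as well: an attacker who can only query the model's outputs can estimate gradients from finite differences, as in NES-style attacks. The sketch below is a minimal illustration under that assumption; `query_loss` is a hypothetical loss oracle, not an interface from any of the papers listed here.

```python
# Hedged sketch of a query-only (black-box) attack via finite-difference
# gradient estimation; `query_loss(x, y)` is a hypothetical oracle that
# returns the model's scalar loss without exposing parameters or gradients.
import numpy as np

def estimate_gradient(query_loss, x, y, sigma=1e-3, n_samples=50):
    """NES-style antithetic sampling: estimate the loss gradient at x
    from paired queries at x + sigma*u and x - sigma*u."""
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = np.random.randn(*x.shape)
        grad += u * (query_loss(x + sigma * u, y) - query_loss(x - sigma * u, y))
    return grad / (2.0 * sigma * n_samples)

def black_box_step(query_loss, x, y, epsilon=0.03):
    """One signed step along the estimated gradient, mirroring white-box FGSM."""
    g = estimate_gradient(query_loss, x, y)
    return np.clip(x + epsilon * np.sign(g), 0.0, 1.0)
```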

Recent Publications

LLM-PBE: Assessing Data Privacy in Large Language Models

Qinbin Li, Junyuan Hong, Chulin Xie, Jeffrey Tan, Rachel Xin, Junyi Hou, Xavier Yin, Zhun Wang, Dan Hendrycks, Zhangyang Wang, Bo Li, Bingsheng He, Dawn Song

International Conference on Very Large Data Bases (VLDB) Best Paper Award Finalist. August, 2024.

 

SHINE: Shielding Backdoors in Deep Reinforcement Learning

Zhuowen Yuan, Wenbo Guo, Jinyuan Jia, Bo Li, Dawn Song

The International Conference on Machine Learning (ICML). July, 2024.

 

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, Bo Li

The International Conference on Machine Learning (ICML). July, 2024.

 

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian Bartoldson, Ajay Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang “Atlas” Wang, Bo Li

The International Conference on Machine Learning (ICML). July, 2024.

 

C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models

Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li

The International Conference on Machine Learning (ICML). July, 2024.

 

The False Promise of Imitating Proprietary Language Models

Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song

International Conference on Learning Representations (ICLR). May, 2024.

 

TextGuard: Provable Defense against Backdoor Attacks on Text Classification

Hengzhi Pei, Jinyuan Jia, Wenbo Guo, Bo Li, Dawn Song

The Network and Distributed System Security Symposium (NDSS). February, 2024.

 

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li

Advances in Neural Information Processing Systems (NeurIPS) Outstanding Paper Award. December, 2023.

 

DiffAttack: Evasion Attacks Against Diffusion-Based Adversarial Purification

Mintong Kang, Dawn Song, Bo Li

Advances in Neural Information Processing Systems (NeurIPS). December, 2023.

 

BIRD: Generalizable Backdoor Detection and Removal for Deep Reinforcement Learning

Xuan Chen, Wenbo Guo, Guanhong Tao, Xiangyu Zhang, Dawn Song

Advances in Neural Information Processing Systems (NeurIPS). December, 2023.

 

PATROL: Provable Defense against Adversarial Policy in Two-player Games

Wenbo Guo, Xian Wu, Lun Wang, Xinyu Xing, Dawn Song

USENIX Security Symposium. August, 2023.

 

Extracting Training Data from Diffusion Models

Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace

USENIX Security Symposium. August, 2023.

 

Poisoning Language Models During Instruction Tuning

Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein

The International Conference on Machine Learning (ICML). July, 2023.

 

TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets

Weixin Chen, Dawn Song, Bo Li

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2023.

 

Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses

Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Madry, Bo Li, Tom Goldstein

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). February, 2023.

 

Scaling Out-of-Distribution Detection for Real-World Settings

Dan Hendrycks, Steven Basart, Mantas Mazeika, Andy Zou, Joe Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song

The International Conference on Machine Learning (ICML). July, 2022.

 

Deduplicating Training Data Mitigates Privacy Risks in Language Models

Nikhil Kandpal, Eric Wallace, Colin Raffel

The International Conference on Machine Learning (ICML). July, 2022.

 

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, and Jacob Steinhardt

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2022.

 

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

Dan Hendrycks, Steven Basart*, Norman Mu*, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer.

International Conference on Computer Vision (ICCV). October, 2021.

 

Extracting Training Data from Large Language Models

Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Colin Raffel.

USENIX Security Symposium. August, 2021.

 

Towards Robustness of Text-to-SQL Models against Synonym Substitution

Yujian Gan, Xinyun Chen, Qiuping Huang, Matthew Purver, John R. Woodward, Jinxia Xie, Pengsheng Huang.

Annual Meeting of the Association for Computational Linguistics (ACL). August, 2021.

 

BACKDOORL: Backdoor Attack against Competitive Reinforcement Learning

Lun Wang, Zaynah Javed, Xian Wu, Wenbo Guo, Xinyu Xing, Dawn Song.

International Joint Conference on Artificial Intelligence (IJCAI). August, 2021.

 

Natural Adversarial Examples

Dan Hendrycks, Kevin Zhao*, Steven Basart*, Jacob Steinhardt, Dawn Song.

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2021.

 

REFIT: A Unified Watermark Removal Framework for Deep Learning Systems with Limited Data

Xinyun Chen*, Wenxiao Wang*, Chris Bender, Yiming Ding, Ruoxi Jia, Bo Li, Dawn Song.

ACM Asia Conference on Computer and Communications Security (AsiaCCS). June, 2021.

 

Understanding Robustness in Teacher-Student Setting: A New Perspective

Zhuolin Yang*, Zhaoxi Chen, Tiffany (Tianhui) Cai, Xinyun Chen, Bo Li, Yuandong Tian*.

International Conference on Artificial Intelligence and Statistics (AISTATS). April, 2021.

 


Imitation Attacks and Defenses for Black-box Machine Translation Systems

Eric Wallace, Mitchell Stern, Dawn Song.

Conference on Empirical Methods in Natural Language Processing (EMNLP), November, 2020.

 


Towards Inspecting and Eliminating Trojan Backdoors in Deep Neural Networks

Wenbo Guo*, Lun Wang*, Yan Xu, Xinyu Xing, Min Du, Dawn Song.

IEEE International Conference on Data Mining (ICDM), November, 2020.

 

Pretrained Transformers Improve Out-of-Distribution Robustness

Dan Hendrycks*, Xiaoyuan Liu*, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song.

Annual Meeting of the Association for Computational Linguistics (ACL). July, 2020.

 

The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks

Yuheng Zhang*, Ruoxi Jia*, Hengzhi Pei, Wenxiao Wang, Bo Li, Dawn Song.

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2020.

 

Robust Anomaly Detection and Backdoor Attack Detection Via Differential Privacy

Min Du, Ruoxi Jia, Dawn Song.

International Conference on Learning Representations (ICLR). May, 2020.

 

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

Dan Hendrycks, Mantas Mazeika*, Saurav Kadavath*, Dawn Song.

Advances in Neural Information Processing Systems (NeurIPS). December, 2019.

 

AdvIT: Adversarial Frames Identifier Based on Temporal Consistency in Videos

Chaowei Xiao, Ruizhi Deng, Bo Li, Taesung Lee, Benjamin Edwards, Jinfeng Yi, Dawn Song, Mingyan Liu, Ian Molloy.

International Conference on Computer Vision (ICCV). October, 2019.

 

The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks

Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, Dawn Song.

USENIX Security Symposium. August, 2019.

 

Press: The Register | Schneier on Security

How You Act Tells a Lot: Privacy-Leakage Attack on Deep Reinforcement Learning

Xinlei Pan, Weiyao Wang, Xiaoshuai Zhang, Bo Li, Jinfeng Yi, Dawn Song.

International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May, 2019.

 

Characterizing Audio Adversarial Examples Using Temporal Dependency

Zhuolin Yang, Bo Li, Pin-Yu Chen, Dawn Song.

International Conference on Learning Representations (ICLR). May, 2019.

 

Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation

Chaowei Xiao, Ruizhi Deng, Bo Li, Fisher Yu, Mingyan Liu, Dawn Song.

European Conference on Computer Vision (ECCV). September, 2018.

 

Exploring the Space of Black-box Attacks on Deep Neural Networks

Arjun Nitin Bhagoji, Warren He, Bo Li, Dawn Song.

The European Conference on Computer Vision (ECCV). September, 2018.

 

Generating Adversarial Examples with Adversarial Networks

Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, Dawn Song.

The International Joint Conference on Artificial Intelligence (IJCAI). July, 2018.

 

Curriculum Adversarial Training

Qi-Zhi Cai, Min Du, Chang Liu, Dawn Song.

The International Joint Conference on Artificial Intelligence (IJCAI). July, 2018.

 

Fooling Vision and Language Models Despite Localization and Attention Mechanism

Xiaojun Xu, Xinyun Chen, Chang Liu, Anna Rohrbach, Trevor Darrell, Dawn Song.

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2018.

 

Robust Physical-World Attacks on Deep Learning Visual Classification

Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, Chaowei Xiao, Dawn Song.

The Conference on Computer Vision and Pattern Recognition (CVPR). June, 2018.

 

Press: IEEE Spectrum | Yahoo News | Wired | Engadget | Telegraph | Car and Driver | CNET | Digital Trends | SCMagazine | Schneier on Security | Ars Technica | Fortune | Science Magazine

Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality

Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, James Bailey.

International Conference on Learning Representations (ICLR). May, 2018.

 

Spatially Transformed Adversarial Examples

Chaowei Xiao*, Jun-Yan Zhu*, Bo Li, Warren He, Mingyan Liu, Dawn Song.

International Conference on Learning Representations (ICLR). May, 2018.

 

Decision Boundary Analysis of Adversarial Examples

Warren He, Bo Li, Dawn Song.

International Conference on Learning Representations (ICLR). May, 2018.

 

Adversarial Examples for Generative Models

Jernej Kos, Ian Fischer, Dawn Song.

IEEE S&P Workshop on Deep Learning and Security. May, 2018.

 

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, Dawn Song.

December, 2017.

 

Press: Motherboard | The Register

Adversarial Example Defenses: Ensembles of Weak Defenses are not Strong

Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song.

USENIX Workshop on Offensive Technologies (WOOT). August, 2017.

 

Delving into Transferable Adversarial Examples and Black-box Attacks

Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song.

International Conference on Learning Representations (ICLR). April, 2017.

 

Delving into Adversarial Attacks on Deep Policies

Jernej Kos and Dawn Song.

ICLR Workshop. April, 2017.

 


Members