Unified Framework for Jailbreak Attack on Large Language Models via Prompt Optimization

doi:10.15888/j.cnki.csa.009980


CSTR: 32024.14.csa.009980




Abstract:

Jailbreak attacks are crucial for identifying and mitigating security vulnerabilities in large language models (LLMs). These attacks aim to bypass safety guardrails and induce a model to produce prohibited outputs. However, because the attacks are typically evaluated on different data samples and different models, they are difficult to compare directly and fairly. This paper introduces EasyJailbreak, a unified framework that simplifies the construction and evaluation of jailbreak attacks against LLMs. It builds each attack from four components: a selector, a mutator, a constraint, and an evaluator. This modular design lets researchers combine existing components or design new ones to construct a variety of attack methods. To demonstrate the framework's utility, the paper reports a large-scale empirical evaluation: 11 jailbreak methods have been implemented on the framework and validated against 10 different LLMs, using more than 750,000 inference queries. Across the attacks, the average breach probability is 60%. Notably, even advanced models such as GPT-3.5-turbo and GPT-4 show average attack success rates of 57% and 33%, respectively.
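The four-component decomposition named in the abstract can be pictured as a generic search loop. The sketch below is a minimal illustration under stated assumptions: every name in it (`run_attack`, `AttackResult`, the callable type aliases) is hypothetical and is not the EasyJailbreak API, and the target model, mutator, and evaluator are harmless string-manipulating stand-ins. It shows only how a selector, mutator, constraint, and evaluator might plug together and how a success rate could be tallied.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces for illustration only; not the EasyJailbreak API.
Selector = Callable[[List[str]], List[str]]  # picks which candidates to expand
Mutator = Callable[[str], List[str]]         # derives variants of one candidate
Constraint = Callable[[str], bool]           # True if a variant may be kept
Evaluator = Callable[[str, str], bool]       # judges a (prompt, response) pair


@dataclass
class AttackResult:
    queries: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        # Fraction of queries the evaluator judged successful.
        return self.successes / self.queries if self.queries else 0.0


def run_attack(
    seeds: List[str],
    target: Callable[[str], str],
    selector: Selector,
    mutator: Mutator,
    constraint: Constraint,
    evaluator: Evaluator,
    rounds: int = 3,
) -> AttackResult:
    """Select, mutate, filter, query, and evaluate for a fixed number of rounds."""
    pool = list(seeds)
    result = AttackResult()
    for _ in range(rounds):
        next_pool: List[str] = []
        for prompt in selector(pool):
            for variant in mutator(prompt):
                if not constraint(variant):
                    continue  # filtered out before the target is queried
                result.queries += 1
                if evaluator(variant, target(variant)):
                    result.successes += 1
                next_pool.append(variant)
        pool = next_pool or pool  # keep the old pool if nothing survived
    return result


# Toy run with benign stand-ins: the "model" just upper-cases its input.
toy = run_attack(
    seeds=["a", "bb"],
    target=lambda p: p.upper(),
    selector=lambda pool: pool[:2],
    mutator=lambda p: [p + "x", p + "y"],
    constraint=lambda p: len(p) <= 3,
    evaluator=lambda p, r: r.endswith("X"),
    rounds=2,
)
print(toy.queries, toy.successes, toy.success_rate)
```

Because each component is an independent callable, swapping one of them (for example, a different selector) changes the search strategy without touching the loop, which is the kind of recombination the abstract describes.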


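Cite this article

Xia Han, Wang Xiao, Zhou Weikang, Xiong Limao, Gu Yingshuang, Gui Tao. Unified Framework for Jailbreak Attack on Large Language Models via Prompt Optimization. Computer Systems & Applications (计算机系统应用), 2025, 34(11): 20-29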



History

Received: 2025-03-13
Last revised: 2025-04-15
Published online: 2025-09-30
