One Query is All You Need: A Dynamic Approach to Jailbreaking LLMs

Abstract

The rapid advancement of Large Language Models (LLMs) has introduced significant challenges in ensuring model safety, particularly in the face of adversarial attacks such as jailbreaks and prompt injections. While prior research has investigated jailbreak prompts to uncover vulnerabilities, most existing approaches depend on manually crafted inputs, limiting both scalability and adaptability. Moreover, these methods typically require an extensive number of queries to achieve a successful jailbreak, reducing their overall efficiency. In this paper, we propose a Dynamic Adversarial Prompting (DAP) method, inspired by Retrieval-Augmented Generation (RAG), to generate jailbreak prompts capable of exposing vulnerabilities in LLMs. DAP leverages a curated dataset from WildJailbreak, comprising both successful and failed jailbreak attempts, to construct a single adversarial query embedded with multiple illustrative examples, substantially increasing the likelihood of bypassing safety mechanisms. Our results demonstrate that DAP achieves an Attack Success Rate (ASR) of 91% on models like GPT-4o, with an overall efficiency of 99.97%, and remains effective even on models such as Gemini and Llama, outperforming state-of-the-art methods. This work contributes a scalable and effective method for probing LLM robustness, advancing efforts to understand and mitigate model vulnerabilities.
