Abstract:
The rapid advancement of Large Language Models (LLMs) has introduced significant challenges in ensuring model safety, particularly in the face of adversarial attacks such as jailbreaks and prompt injections. While prior research has investigated jailbreak prompts to uncover vulnerabilities, most existing approaches depend on manually crafted inputs, limiting both scalability and adaptability. Moreover, these methods typically require a large number of queries to achieve a successful jailbreak, reducing their overall efficiency. In this paper, we propose Dynamic Adversarial Prompting (DAP), a method inspired by Retrieval-Augmented Generation (RAG), to generate jailbreak prompts capable of exposing vulnerabilities in LLMs. DAP leverages a curated dataset from WildJailbreak, comprising both successful and failed jailbreak attempts, to construct a single adversarial query embedded with multiple illustrative examples, substantially increasing the likelihood of bypassing safety mechanisms. Our results demonstrate that DAP achieves an Attack Success Rate (ASR) of 91% on models such as GPT-4o, with an overall efficiency of 99.97%, and remains effective even on models such as Gemini and Llama, outperforming state-of-the-art methods. This work contributes a scalable and effective method for probing LLM robustness, advancing efforts to understand and mitigate model vulnerabilities.