Quality-Diversity Evolution for LLM Safety - ICLR 2026 AIWILD Poster

1 Motivation & Abstract

As LLMs are deployed in sensitive applications, adversarial testing becomes critical for safe deployment. Current approaches suffer from fundamental coverage gaps:

Manual red-teaming (Ganguli et al., 2022) provides high-quality examples but doesn't scale - human testers converge on similar patterns, leaving vulnerability spaces unexplored.
LLM-as-attacker methods (Perez et al., 2022; Chao et al., 2023) exhibit mode collapse - attacking models generate similar prompts, missing diverse failure modes.
Gradient-based approaches like GCG (Zou et al., 2023) require white-box access and produce uninterpretable token sequences with limited black-box transfer.

We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method). In experiments across GPT-4o-mini, Claude 3.5 Sonnet, and Gemini 2.0 Flash, we discover distinct vulnerability profiles: each model exhibits unique weaknesses to different strategy-encoding combinations, providing actionable insights for improving LLM safety.

Semantic Genome

Evolve strategies, not tokens

MAP-Elites

Maintain a diverse attack archive

Multi-Model

Cross-model vulnerability profiles

2 Method

Semantic Genome Representation

Attacks are represented as compositions of semantic elements rather than raw tokens:

g = (s_p, s_s, e, ρ) → to_prompt() → interpretable attack

Strategies 𝒮 - derived from a taxonomy of commonly reported jailbreak patterns, covering social engineering (roleplay, authority), cognitive reframing (hypothetical), and technical obfuscation (encoding):

DirectJailbreak

Roleplay ("DAN")

Authority

Hypothetical

MultiTurn

Encoding

Encodings ℰ: None · Base64 · ROT13 · Leetspeak · PigLatin · Unicode

In our experiments, Hypothetical and MultiTurn strategies were most frequently retained in the archive (31% and 28% of occupied cells), while Authority was least prevalent (9%), suggesting reframing-based strategies offer richer evolutionary potential.

Fitness Evaluation

Prompts sent to target LLMs; responses classified via heuristic judge:
f(g) = 1.0 compliance · 0.4 ambiguous · 0.0 refusal
(~12% misclassification rate via manual inspection of 50 samples; no false positives observed)

MAP-Elites Quality-Diversity

Rather than single-objective optimization, the archive is indexed by behavior descriptors b(g) = (b_strategy, b_encoding, b_length). Each cell stores the highest-fitness individual, ensuring diversity across strategy types and encodings. Compact semantic space: |𝒮| × |ℰ| = 36 combinations, enabling faster archive coverage than token-level search. Archive fill rate plateaued by generation ~20, suggesting convergence within budget.

Variation operators: mutate_strategy, mutate_encoding, mutate_structure, composite_mutation.

Experimental setup: Population 30, generations 30, mutation rate 0.3, MAP-Elites bins 6×6×6. Models tested at experiment time (January 2026); framework is model-agnostic.

3 Architecture

The framework operates as a closed evolutionary loop: a Semantic Genome encodes attack strategies as structured compositions (not raw tokens), which are converted to prompts via to_prompt() and evaluated against target LLMs. A heuristic fitness judge scores responses, and the MAP-Elites archive retains the highest-fitness individual per behavior cell, selecting parents for the next generation. This ensures both quality (high fitness) and diversity (broad coverage of the strategy-encoding space).

Figure 1: Semantic genome encodes strategies (not tokens); MAP-Elites maintains diversity; parallel evaluation tests multiple LLM targets.

4 Results

Setup: Population 30, Generations 30, Mutation rate 0.3, MAP-Elites bins 6×6×6. Models tested: GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, Devstral-small-2.

Model	Best Fit.	Direct	Role	Auth.	Hypo.	Multi	Enc.
Devstral-small-2	1.0	✓	✓	✓	✓	✓	✓
GPT-4o-mini	0.8	✗	✗	~	✓	✓	~
Claude 3.5 Sonnet	0.4	~	~	~	~	~	~
Gemini 2.0 Flash	0.8	✓	~	✗	~	✓	~

✓ = fitness ≥ 0.8 ~ = 0.4 (ambiguous) ✗ = refused | MAP-Elites fills 16.7% of behavior space (36/216 cells)

5 MAP-Elites Archive

The archive indexes individuals by behavior descriptor b(g) = (b_strategy, b_encoding, b_length). Each cell retains the highest-fitness individual, yielding a map of where in the semantic space attacks succeed - not just how well.

Figure 2: MAP-Elites archive coverage across the strategy × encoding behavior space. Darker cells indicate higher-fitness elites; empty cells mark unexplored or consistently-refused combinations. 36 of 216 cells (16.7%) are occupied after 30 generations.

6 Strategy Effectiveness

Figure 3: Strategy effectiveness across models. Green = success (≥0.8), yellow = ambiguous (0.4), red = refused. Models show distinct vulnerability profiles.

7 Key Findings

Encoding amplifies strategies: ROT13/Leetspeak combined with semantic framing achieves higher success than either alone. Character-level obfuscation disrupts keyword-based safety filters while semantic framing bypasses intent-level classifiers - dual-layer evasion.

Model-specific weaknesses: GPT-4o-mini is susceptible to cognitive reframing (Hypothetical+ROT13), while Gemini's filters are sensitive to semantic intent but weak against character-level obfuscation (Direct+ROT13).

Claude's "soft refusal" is most robust: Ambiguous responses (0.4) that acknowledge but redirect prove more resilient than hard refusals. No strategy achieves compliance.

8 Conclusion & Future Work

Semantic-level quality-diversity evolution produces interpretable, diverse attacks revealing model-specific weaknesses. MAP-Elites discovers 10+ unique strategy-encoding combinations per model across 36 occupied behavior-space cells. Claude 3.5 Sonnet's "soft refusal" strategy proves most robust - no attack achieves compliance.

Comparison with prior work: Unlike Rainbow Teaming (Samvelyan et al., 2024) which evolves prompt text directly, we operate at the semantic strategy level, producing more interpretable attacks. Unlike GCG (Zou et al., 2023) which requires white-box access, our approach is fully black-box. Unlike PAIR (Chao et al., 2023) which optimizes a single attack, MAP-Elites maintains a diverse archive revealing the full landscape of vulnerabilities.

Future directions: Multi-turn attack chains, agent-level safety testing for autonomous systems, attacker-defender co-evolution, integration of LLM-as-judge for refined fitness evaluation, and scaling to larger population sizes.

References

[1] Chao et al. "Jailbreaking black-box LLMs in twenty queries." arXiv:2310.08419, 2023.

[2] Ganguli et al. "Red teaming language models to reduce harms." arXiv:2209.07858, 2022.

[3] Lehman & Stanley. "Abandoning objectives: Evolution through novelty alone." Evol. Comp., 2011.

[4] Liu et al. "AutoDAN: Generating stealthy jailbreak prompts." arXiv:2310.04451, 2024.

[5] Mazeika et al. "HarmBench: Standardized evaluation for red teaming." arXiv:2402.04249, 2024.

[6] Mouret & Clune. "Illuminating search spaces by mapping elites." arXiv:1504.04909, 2015.

[7] Perez et al. "Red teaming LMs with LMs." arXiv:2202.03286, 2022.

[8] Samvelyan et al. "Rainbow teaming: Open-ended diverse adversarial prompts." arXiv:2402.16822, 2024.

[9] Shah et al. "Scalable black-box jailbreaks via persona modulation." arXiv:2311.03348, 2023.

[10] Yu et al. "GPTFuzzer: Red teaming LLMs with auto-generated prompts." arXiv:2309.10253, 2023.

[11] Zou et al. "Universal and transferable adversarial attacks on aligned LMs." arXiv:2307.15043, 2023.