As LLMs are deployed in sensitive applications, adversarial testing becomes critical for safe deployment. Current approaches suffer from fundamental coverage gaps:
Manual red-teaming (Ganguli et al., 2022) provides high-quality examples but doesn't scale - human testers converge on similar patterns, leaving vulnerability spaces unexplored.
LLM-as-attacker methods (Perez et al., 2022; Chao et al., 2023) exhibit mode collapse - attacking models generate similar prompts, missing diverse failure modes.
Gradient-based approaches like GCG (Zou et al., 2023) require white-box access and produce uninterpretable token sequences with limited black-box transfer.
We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method). In experiments across GPT-4o-mini, Claude 3.5 Sonnet, and Gemini 2.0 Flash, we discover distinct vulnerability profiles: each model exhibits unique weaknesses to different strategy-encoding combinations, providing actionable insights for improving LLM safety.
Semantic Genome Representation
Attacks are represented as compositions of semantic elements rather than raw tokens:
Strategies 𝒮 - derived from a taxonomy of commonly reported jailbreak patterns, covering social engineering (roleplay, authority), cognitive reframing (hypothetical), and technical obfuscation (encoding):
Encodings ℰ: None · Base64 · ROT13 · Leetspeak · PigLatin · Unicode
In our experiments, Hypothetical and MultiTurn strategies were most frequently retained in the archive (31% and 28% of occupied cells), while Authority was least prevalent (9%), suggesting reframing-based strategies offer richer evolutionary potential.
Fitness Evaluation
Prompts sent to target LLMs; responses classified via heuristic judge:
f(g) = 1.0 compliance · 0.4 ambiguous · 0.0 refusal
(~12% misclassification rate via manual inspection of 50 samples; no false positives observed)
MAP-Elites Quality-Diversity
Rather than single-objective optimization, the archive is indexed by behavior descriptors b(g) = (bstrategy, bencoding, blength). Each cell stores the highest-fitness individual, ensuring diversity across strategy types and encodings. Compact semantic space: |𝒮| × |ℰ| = 36 combinations, enabling faster archive coverage than token-level search. Archive fill rate plateaued by generation ~20, suggesting convergence within budget.
Variation operators: mutate_strategy, mutate_encoding, mutate_structure, composite_mutation.
Experimental setup: Population 30, generations 30, mutation rate 0.3, MAP-Elites bins 6×6×6. Models tested at experiment time (January 2026); framework is model-agnostic.
The framework operates as a closed evolutionary loop: a Semantic Genome encodes attack strategies as structured compositions (not raw tokens), which are converted to prompts via to_prompt() and evaluated against target LLMs. A heuristic fitness judge scores responses, and the MAP-Elites archive retains the highest-fitness individual per behavior cell, selecting parents for the next generation. This ensures both quality (high fitness) and diversity (broad coverage of the strategy-encoding space).
Setup: Population 30, Generations 30, Mutation rate 0.3, MAP-Elites bins 6×6×6. Models tested: GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, Devstral-small-2.
| Model | Best Fit. | Direct | Role | Auth. | Hypo. | Multi | Enc. |
|---|---|---|---|---|---|---|---|
| Devstral-small-2 | 1.0 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| GPT-4o-mini | 0.8 | ✗ | ✗ | ~ | ✓ | ✓ | ~ |
| Claude 3.5 Sonnet | 0.4 | ~ | ~ | ~ | ~ | ~ | ~ |
| Gemini 2.0 Flash | 0.8 | ✓ | ~ | ✗ | ~ | ✓ | ~ |
✓ = fitness ≥ 0.8 ~ = 0.4 (ambiguous) ✗ = refused | MAP-Elites fills 16.7% of behavior space (36/216 cells)
The archive indexes individuals by behavior descriptor b(g) = (bstrategy, bencoding, blength). Each cell retains the highest-fitness individual, yielding a map of where in the semantic space attacks succeed - not just how well.
Encoding amplifies strategies: ROT13/Leetspeak combined with semantic framing achieves higher success than either alone. Character-level obfuscation disrupts keyword-based safety filters while semantic framing bypasses intent-level classifiers - dual-layer evasion.
Model-specific weaknesses: GPT-4o-mini is susceptible to cognitive reframing (Hypothetical+ROT13), while Gemini's filters are sensitive to semantic intent but weak against character-level obfuscation (Direct+ROT13).
Claude's "soft refusal" is most robust: Ambiguous responses (0.4) that acknowledge but redirect prove more resilient than hard refusals. No strategy achieves compliance.
Semantic-level quality-diversity evolution produces interpretable, diverse attacks revealing model-specific weaknesses. MAP-Elites discovers 10+ unique strategy-encoding combinations per model across 36 occupied behavior-space cells. Claude 3.5 Sonnet's "soft refusal" strategy proves most robust - no attack achieves compliance.
Comparison with prior work: Unlike Rainbow Teaming (Samvelyan et al., 2024) which evolves prompt text directly, we operate at the semantic strategy level, producing more interpretable attacks. Unlike GCG (Zou et al., 2023) which requires white-box access, our approach is fully black-box. Unlike PAIR (Chao et al., 2023) which optimizes a single attack, MAP-Elites maintains a diverse archive revealing the full landscape of vulnerabilities.
Future directions: Multi-turn attack chains, agent-level safety testing for autonomous systems, attacker-defender co-evolution, integration of LLM-as-judge for refined fitness evaluation, and scaling to larger population sizes.