HauntAttack: When Attack Follows Reasoning as a Shadow
Jingyuan Ma, Rui Li, Zheng Li, Junfeng Liu, Lei Sha, Zhifang Sui · June 08, 2025
Summary
HAUNTATTACK is a black-box attack framework targeting large reasoning models. It embeds harmful instructions into otherwise benign reasoning questions by exploiting replaceable conditions: a swappable condition in the question is substituted with a harmful instruction, so the harmful intent is carried inside the model's reasoning chain rather than stated on the surface. Applied across diverse tasks, the attack outperforms prior jailbreak methods and exposes a significant trade-off between reasoning capability and safety. The studies show that current safety mechanisms fail to detect or mitigate the risk when harmful intent is hidden inside reasoning chains, and an evaluation of 11 models' safety and robustness underscores that jailbreak defenses must account for deeper reasoning structures, not just surface wording. Consistent with this, attacks that engage models in structured reasoning processes, such as DeepInception, are harder to detect and more effective at bypassing safety filters than surface-level manipulations. The work evaluates model vulnerability across harm categories including cyberattacks, fraud, and misinformation, and introduces a risk scoring system to assess the potential harm of model outputs.
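To make the replaceable-condition idea concrete, here is a minimal illustrative sketch. The function name, template format, and slot marker are assumptions for illustration only, not the authors' implementation; a real attack would substitute a harmful instruction into the slot, whereas the example below uses a benign placeholder.

```python
# Hypothetical sketch of condition replacement: a reasoning question
# contains a swappable condition slot, and the attacker substitutes
# attacker-chosen content while keeping the reasoning frame intact.
# All names here are illustrative assumptions.

def embed_instruction(question_template: str, condition_slot: str,
                      injected_condition: str) -> str:
    """Replace the swappable condition in a reasoning question template."""
    if condition_slot not in question_template:
        raise ValueError("template has no replaceable condition slot")
    return question_template.replace(condition_slot, injected_condition)

# A benign logic-puzzle template whose condition is the replaceable part:
template = ("Three compounds, {CONDITION}, are mixed in sequence. "
            "Reason step by step about which reacts first.")

# Benign substitution shown here; an attacker would inject a harmful
# instruction into the same slot, hiding intent inside the reasoning task.
question = embed_instruction(template, "{CONDITION}", "A, B, and C")
print(question)
```

The point of the sketch is that the surrounding question remains a plausible reasoning task, which is why surface-level safety filters struggle to flag the substituted content.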