🟠 High  |  Source: Schneier on Security


Anthropic’s Claude Fable 5 model, marketed as a safety-hardened version of the Mythos Preview with built-in guardrails against cyberattack generation, was jailbroken within days of release. Researchers bypassed the safety restrictions and got the model to produce content it was explicitly designed to block. The incident highlights how fragile AI safety controls remain and how difficult it is to enforce hard limits through a model’s built-in guardrails alone.

Security Architect’s Take: Do not treat AI model safety guardrails as a reliable security control; treat them as advisory at best. If you are integrating LLMs into pipelines or products, enforce content restrictions at the application layer with independent validation, output filtering, and rate limiting rather than relying on the model’s built-in refusals.
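
As a rough illustration of application-layer enforcement, here is a minimal Python sketch that wraps a model call with a per-caller rate limit and an independent output filter. Everything in it is an assumption made for illustration: the names `guarded_generate`, `SlidingWindowLimiter`, and `BLOCKED_PATTERNS`, the regex denylist, and the `model_call` callable that stands in for whatever LLM client you use. A production system would likely replace the regex list with a dedicated classifier or policy engine.

```python
import re
import time
from collections import deque
from typing import Callable, Deque, Dict, List

# Hypothetical denylist for illustration only; real deployments would
# typically use an independent classifier or policy engine instead.
BLOCKED_PATTERNS: List[re.Pattern] = [
    re.compile(r"reverse\s+shell", re.IGNORECASE),
    re.compile(r"ransomware\s+payload", re.IGNORECASE),
]


class SlidingWindowLimiter:
    """Allow at most max_calls per window_seconds for each caller."""

    def __init__(self, max_calls: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._calls: Dict[str, Deque[float]] = {}

    def allow(self, caller_id: str) -> bool:
        now = self.clock()
        calls = self._calls.setdefault(caller_id, deque())
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


def output_is_allowed(text: str) -> bool:
    """Independent check on model output; ignores the model's own refusals."""
    return not any(p.search(text) for p in BLOCKED_PATTERNS)


def guarded_generate(caller_id: str, prompt: str,
                     model_call: Callable[[str], str],
                     limiter: SlidingWindowLimiter) -> str:
    """Call the model only if the caller is under the rate limit,
    and return the response only if it passes the output filter."""
    if not limiter.allow(caller_id):
        return "[rate limit exceeded]"
    response = model_call(prompt)
    if not output_is_allowed(response):
        return "[response withheld by output filter]"
    return response
```

The point of the design is that the filter and the limiter run outside the model, so a successful jailbreak prompt still has to get its output past a control the attacker cannot talk to. In this sketch a withheld response still counts against the caller's rate limit, which also slows down repeated jailbreak attempts.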

Original advisory: Anthropic’s Fable 5 Model Jailbroken Within Days