Anthropic Admits to Concealed Safety Measures in Claude Fable 5 AI Model

Anthropic revealed it had embedded undisclosed guardrails in Claude Fable 5 that redirected sensitive queries, sparking concerns over transparency and AI benchmarking.

Editorial Team · Jun 11, 2026 · 2 min read · Source: thesheffieldpress.com

Anthropic has acknowledged that its latest AI model, Claude Fable 5, included hidden safety restrictions that rerouted certain cybersecurity- and biology-related requests to a different model, Claude Opus 4.8, without informing users. This unintended opacity has raised questions about transparency and fairness in how AI models are evaluated and used.

Claude Fable 5, introduced as Anthropic’s flagship system for complex knowledge work and programming, was promoted as highly capable and reliable on extensive tasks. Yet when it detected queries on specific sensitive topics, it silently diverted them to Opus 4.8, a model deemed slightly less advanced, without adjusting what users were billed for the rerouted interactions. Anthropic said these guardrails triggered in fewer than 5% of sessions and were calibrated conservatively, so they sometimes caught innocuous prompts.

The company is now changing course, pledging greater openness about when and why these restrictions apply. Anthropic indicated it would prefer the model to refuse sensitive prompts outright rather than redirect them secretly, prioritizing transparency over uninterrupted service.

Beyond this, Anthropic enforces a 30-day data retention policy for safety monitoring, tying its model deployment closely to its oversight procedures. Claude Fable 5 was priced at a base rate per input and output token, with discounts for prompt caching, and inference was initially limited to the U.S. Alongside Fable 5, Anthropic launched Claude Mythos 5, a variant built on the same core model but with fewer safeguards. Mythos 5 was deployed in collaboration with the U.S. government through Project Glasswing, with plans to expand trusted access later.

This episode highlights a critical challenge in the AI sector: when a model is marketed under specific performance and pricing conditions, undisclosed alterations can complicate how users and researchers assess its capabilities and safety. Anthropic’s publication of updated system and model cards aims to provide more detailed transparency around its evaluation and deployment policies, but the incident underscores ongoing tensions between safety, openness, and commercial claims in AI development.