Leading AI Models Vulnerable to Simple Language Manipulation

TrendAI today publishes new analyses showing how simple text manipulation, known as sockpuppeting, can cause prominent AI models such as GPT-4o, Claude 4 Sonnet and Gemini 2.5 Flash to bypass their own safety filters. By hiding malicious instructions inside a seemingly innocent prompt, an attacker can trick an assistant into violating its guidelines. All tested models that accept prefilled context via API, so-called prefill, proved vulnerable.

TrendAI tested the technique against eleven different models from four providers. The results show the issue is not limited to a single vendor but affects both open-source and privately hosted models. As long as a model accepts prefill it is at least partially exposed to the vulnerability. Only models that block prefill at the API level were fully protected.

Martin Fribrock, Country Manager för Sverige, Finland och Baltikum på TrendAI, Trend Micro | IT-Branschen — Martin Fribrock, Country Manager for Sweden, Finland and the Baltics at TrendAI – Published by IT-Branschen

“The vulnerability is particularly serious because it requires neither special tools nor advanced techniques,” says Martin Fribrock, Country Manager Sweden, Finland and Baltics at TrendAI. “This type of attack targets the very core of how AI operates. An attacker does not need to break into systems—it’s enough to craft the prompt in the right way.”

How the attack works

Most language models include built-in protections intended to prevent the generation of harmful or policy-violating content. In a sockpuppeting attack, a single short line of text is enough to manipulate a model’s context. That line can cause the model to ignore its safety mechanisms and respond to otherwise blocked requests, producing content that would normally be disallowed.

TrendAI’s analysis also shows that models that do not accept prefill prevent this kind of attack already at the API level. For other models, the degree of vulnerability varies, but all tested models were affected to some extent. This suggests a systemic risk across implementations rather than isolated flaws tied to individual vendors.

Recommendations for organizations

TrendAI urges organizations using AI to take the following steps to reduce risk:

Enforce strict control of message flows at the API level and consistently reject requests where the last message has the assistant role.
Regularly test how models handle prefilled context (prefill), and repeat these tests after updates or when switching providers.
Pay special attention when using open-weight or open-source models, which often lack built-in safeguards by default.
Perform broad security testing—different models can be vulnerable to different attack techniques, so comprehensive assessment is essential.

Implementing these measures will help organizations reduce the chances that simple prompt manipulations lead to serious policy violations, data leaks, or misuse of AI systems. Regular testing, hardened API controls, and careful vendor selection are practical steps to strengthen defenses against context-based manipulations.

TrendAI’s findings highlight the need for coordinated efforts across vendors, security teams and policy makers to raise the baseline of AI safety. As AI is increasingly integrated into business processes, operators must assume that attackers will attempt to exploit context manipulation and design defenses accordingly.

For organizations and managed service providers in the Nordics and beyond, the report signals both a risk and an opportunity: risk due to potential misuse and governance failures, and opportunity to build new services and expertise around AI security testing, prompt safety and API governance.