Cisco’s AI Threat Intelligence and Security Research division has published new findings about the vulnerabilities of vision-language models (VLMs), AI systems that interpret visual data. The study reveals that these models can be manipulated by attackers through imperceptible alterations to images.
Exploiting AI with Hidden Instructions
The research demonstrates that attackers can embed commands in images, such as webpage banners or document previews, that are undetectable to humans but that an AI can interpret and act on. These hidden commands can instruct a model to carry out harmful actions such as data exfiltration; to a human viewer, they appear as nothing more than visual noise.
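As a rough illustration of the idea only, and not code from the study, an attacker might render an instruction in tiny text colored almost identically to the background, so that a person sees faint speckle while a vision encoder may still recover the words. The file names, contrast step, and placeholder instruction in this sketch are all assumptions:

```python
# Toy typographic-injection sketch (illustration only, not the study's code):
# draw an instruction in small, near-background-color text onto a banner image.
from PIL import Image, ImageDraw, ImageFont

banner = Image.open("banner.png").convert("RGB")  # placeholder input image
draw = ImageDraw.Draw(banner)
font = ImageFont.load_default()  # small default font; real attacks vary size

# Sample the background and pick a color only slightly lighter than it, so the
# text is near-invisible to a human but may survive the model's preprocessing.
bg = banner.getpixel((10, 10))
faint = tuple(min(255, c + 6) for c in bg)  # 6/255 contrast step (assumption)
draw.text((8, 8), "PLACEHOLDER INJECTED INSTRUCTION", font=font, fill=faint)
banner.save("banner_injected.png")
```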
This investigation builds on earlier work that linked visual distortion in text-bearing images to how effectively those images attack VLMs: techniques such as small fonts and heavy blurring were found to decrease the likelihood of a successful attack.
Advancements in Attack Techniques
The second phase of Cisco’s research, released on Thursday, examines whether the mathematical distance between a distorted image and its readable form can be minimized. Researchers applied pixel-level changes to images whose attacks had initially failed, either because the embedded text was unreadable or because the AI’s safety mechanisms refused it.
These changes were refined using four publicly available AI models, including Qwen3-VL-Embedding and OpenAI CLIP ViT-L/14-336, before being tested on proprietary systems like GPT-4o and Claude. This approach revealed two primary failure modes: readability recovery and refusal reduction.
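Cisco has not published its optimization code, but the approach described, refining pixel-level changes against open surrogate models before transferring them to closed ones, resembles standard gradient-based transfer attacks. The following sketch, with an assumed model choice, step count, and perturbation budget, shows how such a loop might look: nudge a blurred image, within a small L-infinity budget, so that its CLIP embedding moves toward the embedding of the clean, readable original.

```python
# Hedged sketch of embedding-space optimization (not Cisco's actual code):
# PGD-style updates push a blurred image's CLIP embedding toward that of the
# readable original while keeping the pixel changes visually negligible.
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336").eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the perturbation needs gradients

# CLIP's standard normalization constants.
MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def embed(pixels: torch.Tensor) -> torch.Tensor:
    """Embed [0, 1] float images of shape (B, 3, 336, 336), L2-normalized."""
    feats = model.get_image_features(pixel_values=(pixels - MEAN) / STD)
    return feats / feats.norm(dim=-1, keepdim=True)

def optimize(blurred: torch.Tensor, readable: torch.Tensor,
             eps: float = 8 / 255, steps: int = 200, lr: float = 1 / 255):
    """Maximize cosine similarity to the readable image's embedding while
    keeping the perturbation inside an L-infinity ball of radius eps."""
    target = embed(readable).detach()
    delta = torch.zeros_like(blurred, requires_grad=True)
    for _ in range(steps):
        sim = (embed((blurred + delta).clamp(0, 1)) * target).sum()
        sim.backward()
        with torch.no_grad():
            delta += lr * delta.grad.sign()  # ascend on similarity
            delta.clamp_(-eps, eps)          # stay within the pixel budget
            delta.grad.zero_()
    return (blurred + delta).clamp(0, 1).detach()
```

In a transfer setting, a perturbation like this would be optimized jointly across several open models in the hope that it also shifts the internal representation of a closed model such as GPT-4o or Claude.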
Impact on AI Systems and Defenses
Readability recovery occurs when an image that was too blurred or too small for the AI to read becomes legible to the model, without becoming any clearer to a human observer. Refusal reduction describes cases where an AI that previously declined to follow the embedded instructions is nudged into compliance, again with no visible change to the image.
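One way to operationalize that distinction, a framing of ours rather than Cisco’s published methodology, is to probe a model separately for whether it can transcribe the embedded text and whether it complies with it, then classify the outcome:

```python
# Hedged sketch: classify an optimized image's outcome from three probe
# results (our framing of the two failure modes, not Cisco's harness).
def classify_failure_mode(read_before: bool, read_after: bool,
                          complies_after: bool) -> str:
    """read_before/read_after: could the model transcribe the embedded text
    before/after optimization? complies_after: does it follow it afterward?"""
    if complies_after and not read_before and read_after:
        return "readability recovery"  # text became legible to the model
    if complies_after and read_before:
        return "refusal reduction"     # model could already read it, now obeys
    return "no successful attack"
```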
In trials, Claude’s attack success rate rose from 0% to 28% when blurred images were optimized, though its safety filter still blocked many of the newly readable requests. GPT-4o, by contrast, maintained stronger safety alignment, refusing most legible requests even after optimization.
Future Implications and Defense Strategies
Cisco’s findings underscore the need for robust defenses against typographic attacks that evade simple image filters. As AI systems become more integral to operations, enhancing their ability to resist such subtle manipulations is critical to maintaining data security.
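As a first-layer example only, and not a Cisco recommendation, a deployment could screen inbound images with OCR and hold any that contain instruction-like text for review. The optimized perturbations in the study are designed to slip past exactly this kind of filter, so a check like this is a baseline rather than a complete defense; the marker list below is an assumption:

```python
# Hedged defensive sketch: OCR-based screening of inbound images (a baseline
# layer only; the study's optimized attacks aim to evade such filters).
from PIL import Image
import pytesseract

SUSPICIOUS = ("ignore previous", "ignore prior", "system prompt",
              "exfiltrate", "do not tell the user")  # assumed marker list

def screen_image(path: str) -> bool:
    """Return True if the image contains instruction-like text and should
    be held for human review before reaching the VLM."""
    text = pytesseract.image_to_string(Image.open(path)).lower()
    return any(marker in text for marker in SUSPICIOUS)
```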
Addressing these vulnerabilities is essential to guard against exploitation, and it underscores the ongoing need for advances in AI security.
