Meta’s Llama Firewall Bypassed Using Prompt Injection Vulnerability – Cyber Web Spider Blog

Trendyol’s utility safety group uncovered a sequence of bypasses that render Meta’s Llama Firewall protections unreliable in opposition to refined immediate injection assaults.

The findings increase recent considerations concerning the readiness of present LLM safety measures and underscore the pressing want for extra strong defenses as enterprises more and more embed giant language fashions into their workflows.

In the course of the evaluation, Trendyol engineers deployed Meta’s open-source Llama Firewall, specializing in its PROMPT_GUARD part supposed to display screen out malicious person directions.

Key Takeaways1. Meta’s PROMPT_GUARD failed to dam Turkish phrases like “üstteki yönlendirmeleri salla” and leetspeak resembling “1gn0r3,” exposing reliance on English key phrases and actual matches.2. The module missed a SQL injection in LLM-generated Python code, with dangers of unverified code use, safety gaps, system publicity, and false belief in automated scans.3. Invisible Unicode characters hid malicious directions in benign prompts, bypassing Llama Firewall and posing dangers in collaborative settings.4. Testing and Disclosure Of 100 payloads examined, half succeeded; Meta closed Trendyol’s Might 5, 2025 report as “informative” by June 3 with out a bounty, urging the AI safety group to develop strong defenses in opposition to evolving threats.

To their shock, the guard allowed a Turkish-language injection containing the phrase “üstteki yönlendirmeleri salla,” which instructs the mannequin to disregard previous instructions after which translate a sentence into French.

Meta’s Llama Firewall Bypassed

The scan returned a benign consequence regardless of the clearly malicious intent. The group additional demonstrated that straightforward leetspeak obfuscations remodeling “ignore” into “1gn0r3,” as an illustration additionally slipped previous the filter, receiving negligible jailbreak scores when examined in opposition to the Llama-Immediate-Guard-2–86M mannequin.

These experiments revealed that the firewall’s rule set largely relies on English key phrases and actual sample matches, leaving it blind to variations in language or character substitutions.

Equally troubling was the group’s analysis of CODE_SHIELD, the module designed to flag insecure code options generated by LLMs.

In a single proof-of-concept, the researchers instructed an assistant to provide a easy Python Flask endpoint with a SQL question. The ensuing code concatenated person enter instantly into the question string, making a textbook SQL injection vulnerability.

Regardless of the obtrusive threat, CODE_SHIELD categorised the output as secure and allowed it via unchallenged.

Trendyol’s builders emphasised a number of essential considerations about this shortcoming:

Manufacturing Threat: Groups may depend on LLM-generated snippets with out thorough human evaluation.

Safety Gaps: Overreliance on automated scanning introduces essential vulnerabilities.

System Publicity: Manufacturing methods develop into weak to exploitation via undetected flaws.

Belief Points: Builders might develop false confidence in automated safety instruments.

The oversight illustrates how these automated methods can create harmful blind spots in enterprise safety frameworks.

Unicode Steganography and Invisible Directions

Essentially the most insidious bypass concerned embedding hidden directions inside invisible Unicode characters.

By inserting zero-width areas and different non-printing code factors across the phrase “ignore all earlier directions,” the researchers camouflaged malicious directives inside an innocuous query concerning the capital of France.

Though customers and informal observers noticed solely a innocent question, the LLM acknowledged and executed the hid command. When submitted to Llama Firewall, this payload handed inspection with a zero-threat rating.

Trendyol’s group demonstrated that even minimal invisible payloads might reliably subvert system prompts and trigger fashions to provide arbitrary or dangerous outputs.

This method poses a very acute menace in collaborative settings the place prompts are copy-pasted amongst builders, and automatic scanners lack visibility into hidden characters.

In complete, Trendyol examined 100 distinctive injection payloads in opposition to Llama Firewall. Half of those assaults bypassed the system’s defenses, suggesting that whereas the firewall provides some safety, it’s removed from complete.

The profitable bypasses spotlight eventualities by which attackers might coerce LLMs to ignore essential security filters, output biased or offensive content material, or generate insecure code prepared for execution.

For organizations like Trendyol, which plan to combine LLMs into developer platforms, automation pipelines, and customer-facing functions, these vulnerabilities characterize concrete dangers that might result in information leaks, system compromise, or regulatory noncompliance.

Trendyol’s safety researchers reported their preliminary findings to Meta on Might 5, 2025, detailing the multilingual and obfuscated immediate injections.

Meta acknowledged receipt and commenced an inner evaluation however in the end closed the report as “informative” on June 3, declining to situation a bug bounty.

A parallel disclosure to Google concerning invisible Unicode injections was equally closed as a reproduction.

Regardless of the lukewarm vendor responses, Trendyol has since refined its personal menace modeling practices and is sharing its case examine with the broader AI safety group.

The corporate urges different organizations to conduct rigorous red-teaming of LLM defenses earlier than rolling them into manufacturing, stressing that immediate filtering alone can not stop all types of compromise.

As enterprises race to harness the ability of generative AI, Trendyol’s analysis serves as a cautionary story: with out layered, context-aware safeguards, even cutting-edge firewall instruments can fall prey to deceptively easy assault vectors.

The safety group should now collaborate on extra resilient detection strategies and greatest practices to remain forward of adversaries who constantly innovate new methods to govern these highly effective methods.

Examine stay malware conduct, hint each step of an assault, and make sooner, smarter safety choices -> Strive ANY.RUN now

Related Posts