Turning AI Safeguards Into Weapons with HITL Dialog Forging

Here’s an interaction with an AI code-assistant agent that flags a command injection vulnerability in a codebase and offers to fix it. It uses a Human-in-the-Loop (HITL) control; that is, it presents an HITL dialog that explains what it intends to do and asks for permission. Helpful, right?

Screen capture of an HITL dialog from Claude Code

Unfortunately for you, approving that seemingly benign dialog just executed arbitrary code, supplied by an attacker, on your machine.

Skeptical that an AI agent has such a remote code execution flaw? Watch this:

Animation of malicious code that only starts calc.exe; it could be worse…

What you’re seeing is a typical safety pattern for AI agents: an agent asks for user confirmation before performing potentially harmful operations. This Human-in-the-Loop (HITL) dialog acts as a safeguard, intended to ensure that agents don’t execute these potentially harmful actions without explicit approval.

The Lies-in-the-Loop (LITL) attack exploits the trust users place in these dialogs by forging their content. This technique is also known as HITL Dialog Forging.

When the LITL attack is not involved, this is how Claude Code’s HITL dialog looks when it runs in VS Code terminal:

Regular behavior: git init command displayed and HITL dialog asks permission

Get Checkmarx Zero in your Inbox

Background

This article provides a deeper technical analysis of the novel agentic AI attack vector: the LITL attack, which we recently developed and documented in Bypassing AI Agent Defenses With Lies-In-The-Loop. The LITL attack directly targets the HITL component, causing the agent to prompt the user with a seemingly benign HITL dialog that can deceive users into approving a remote code execution attack originating from indirect prompt injections.

Why is that an issue? Well, because HITL dialogs are one of the recommended mitigations by OWASP for two vulnerabilities from the OWASP LLM Top 10 list:

Generally speaking, the HITL dialog can be thought of as the last line of defense the agent has before executing a sensitive operation, whether maliciously or unintentionally introduced (e.g., due to an agent’s mistake).

This is Google’s security philosophy for HITL dialogs:

This human-in-the-loop (HITL) approach acts as a final safeguard against unauthorized or unintended actions resulting from a successful prompt injection attack.
Source: Google’s layered defense strategy for Gemini

However, the Lies-in-the-Loop (LITL) attack exploits the trust users place in these approval dialogs. By manipulating what the dialog displays, attackers turn the safeguard into a weapon — once the prompt looks safe, users approve it without question. The LITL attack is a particular concern for privileged agents, such as code assistants, which can perform very sensitive actions like running OS commands and, as a result, usually lack other safeguards recommended by OWASP.

Today, we would like to expand on additional risks and mitigations that both users and developers of agentic AI can take to reduce the risks.

Additional Risks

HITL Dialog Forging using Padding

The original post primarily discussed the fact that an attacker can tamper with the dialog by appending very long text to the malicious command, pushing the payload above the visible part in the terminal. However, in practice, nothing prevents an attacker from also prepending benign-looking text to the malicious payload, further concealing its malicious intent.

In other words, not only is the payload out of sight, but scrolling up to the start of the dialog will also lead the user to benign-looking text, further reducing the victim’s suspicions.

Metadata Tampering

Another interesting fact not mentioned in the first blog post is that sometimes metadata is attached to the dialog.

For example, in Claude Code, this is a one-line description that summarizes what the agent is trying to do. And yes, a remote attacker can tamper with it as well, via indirect prompt injection:

HITL dialog in Claude Code, edited to highlight descriptive line

Markdown Injection

This concept is not new and has already been discussed in the OWASP Prompt Injection Prevention Cheat Sheet.
But when it comes to the LITL attack, it is particularly interesting. Exploiting HITL dialog manipulations involves not only the content shown to the user, but also the interface’s design and presentation. Typically, the UI is built with Markdown (or HTML); the fact that attackers can theoretically break out of the Markdown syntax used for the HITL dialog, presenting the user with fake UI, can lead to much more sophisticated LITL attacks that can go practically undetected.

Let’s see that in practice. As it turns out, the Copilot Chat VS Code extension fails to properly sanitize Markdown:

Copilot Chat evaluating Markdown and rendering it

This finding deserves a closer look. Once Copilot Chat is confirmed to be vulnerable to Markdown injection, two primary attack vectors emerge:

Injecting page elements that could trigger XSS or expose sensitive data — a vector unrelated to the LITL attack (see Exploiting Markdown Injection in Microsoft Copilot Chat and Google Gemini).
The primary focus of this section – Manipulating the HITL dialog through Markdown.

Let’s see how Copilot Chat handles the HITL dialog when you combine it with Markdown. Below is a PoC showing Markdown injected into the HITL dialog using direct prompt injection (just for demonstration). As you can see, when Copilot Chat first displays the dialog, everything shows up properly inside a code block, making it clear what command is about to run — no Markdown injection yet. But right after the user approves or skips the command, those same elements suddenly render as Markdown.

Copilot Chat session rendering injected markdown only after approval

In this specific case, there’s no additional risk because the markdown is only rendered after the user responds to the dialog. However, if attackers can inject Markdown into the HITL dialog through indirect prompt injections that is rendered before the user’s response, they could execute a far more sophisticated LITL attack, for example:

Prematurely closing the code block containing the actual malicious command.
Inserting lengthy, innocuous explanatory text to push the malicious command out of view.
Opening a new code block displaying benign commands (like git status or bug fixes).

With the malicious command hidden, the user reviews only what appears to be safe commands, approves the execution, and unknowingly authorizes malicious code to run.
Our guess? An agent that does precisely this either already exists or will in the future.

Defensive Measures for Agentic AI Users

Awareness & Education

As we highlighted in the first blog post, this is a user-deception attack, and like other similar attacks (for example, phishing), the first step is education and awareness. That means spreading the word of this risk will help minimize loss. It’s also encouraged for organizations to educate their employees on this threat, ensuring everyone is aware of the risk.

Pay Attention to Details

HITL dialog components have distinct visual styling, though these differences may be subtle. Recognizing these indicators is crucial for identifying tampering attempts (as long as the agent isn’t susceptible to Markdown injection vulnerabilities).

It’s also worth considering that agents operating in feature-rich UI environments offer advantages over command-line terminals. For instance, VS Code extensions provide full Markdown rendering capabilities, whereas terminals typically display content using basic ASCII characters. Choosing agents with more sophisticated UIs can make it easier to identify deceptive behavior early.

Consider the following example: In Claude Code (which runs in the VS Code terminal in my setup), the HITL dialog is distinguished only by a thin 1-pixel border. It shows a lengthy shell command where most of the command consists of injected, benign-looking text. The terminal’s limited UI features make it remarkably easy to overlook or misinterpret:

Claude Code HITL dialog edited to highlight the thin border around dialog content

In contrast, here’s the presentation in the Copilot Chat VS Code extension. While this approach isn’t foolproof, the visual distinction is considerably more apparent:

Copilot Chat in VSCode with stronger visual distinction

Defensive Measures for Agent Developers

Agents use HITL dialogs to involve users in decision-making, which naturally places some responsibility on the user. That’s understandable and reasonable. However, agent developers can still make this process safer by helping their users distinguish between legitimate and potentially malicious dialogs. This section provides recommendations for doing so.

Dialog Clarity

UI matters significantly. Implementing a straightforward, well-designed user interface that enables users to easily distinguish between HITL dialog elements is essential. This visual distinction empowers users to detect suspicious behavior more effectively (as outlined in the “Pay Attention to Details” section above).

Classic AppSec Still Matters

Don’t forget the basics: Input should always be validated and sanitized. While the following vulnerabilities aren’t the only ones that should be addressed, they’re particularly worth highlighting:

OS Command Execution: When an agent executes OS commands based on indirect prompt injection, it’s crucial to use safe OS APIs that clearly separate the command from its arguments. This approach provides two layers of protection: it reduces the impact if an attacker successfully injects their payload into an argument (rather than the command itself), and in LITL attacks specifically, it constrains the payload in ways that make malicious commands more visible and easier to spot.
Markdown Injection: As noted above, any Markdown or HTML from users or external sources must be escaped or sanitized appropriately before being incorporated into the conversation. Otherwise, they can be weaponized in LITL attacks.

Metadata Tampering Prevention

First of all, we don’t really know how Claude or any other agent constructs their metadata. Still, we can generally suggest that the description of the HITL dialog (or any other metadata attached to it) will be derived from the dialog only after it has been entirely constructed. Thus, the metadata can also reflect the risk in the dialog, which also works pretty well with the following suggestion.

Guardrails for HITL Dialog Validation

Another potential mitigation is to implement guardrails between the agent and the user to validate HITL dialogs before they’re displayed. These guardrails could include content validation (checking commands against allowlists and detecting suspicious patterns), metadata consistency checks (ensuring descriptions accurately reflect the actual operations), and prompt injection detection (scanning for known attack patterns or obfuscation techniques). However, this approach faces its own challenges: false positives may block legitimate operations, attackers can evolve to bypass detection, and if the guardrail itself uses an LLM for validation, it is also subject to prompt injection attacks. While guardrails can add a valuable layer of defense, they shouldn’t be relied upon as the sole protection mechanism.

HITL Dialog Length Restriction

Restricting dialog length is not a substantial mitigation on its own. However, since part of the risk comes from malicious content being pushed out of sight in overly long HITL dialogs, and because extremely long dialogs rarely provide real benefit, imposing a reasonable length limit can help reduce this risk.

When long commands are genuinely required, they can be handled more securely by relying on safe OS APIs (as noted above), which allow commands to be separated into multiple smaller HITL dialogs. This separation not only makes it easier for users to identify commands and their arguments but also reduces the risk of hidden or confusing input.

Sandboxing

This could be the best solution, but we find it hard to believe that agents (especially code assistants and the like) can be sandboxed appropriately and still be helpful. Nevertheless, if sandboxing can be used without affecting the agent’s usability, it should be employed.

Final Notes on HITL Dialog Forging and LITL

There is no silver bullet against Lies-in-the-Loop attacks. As long as systems rely on human-in-the-loop safeguards, vulnerabilities from over-trust, complacency, and divided attention will persist. The fundamental issue is that humans can only respond to what the agent presents — and that presentation can be manipulated through indirect prompt injections. By poisoning the agent’s context, attackers trick users into believing they’re approving benign actions while actually authorizing malicious ones. Once the HITL dialog itself is compromised, the human safeguard becomes trivially easy to bypass.

However, developers adopting a defense-in-depth strategy with multiple protective layers, as listed above, can significantly reduce the risks for their users. At the same time, users can strengthen resilience through greater awareness, attentiveness, and a healthy degree of skepticism.

Ultimately, the goal is not to eliminate the risk but to manage and mitigate it effectively.

Disclosure Timeline

Anthropic (Claude Code)

27 August 2025: Reported arbitrary command injection via the Bash() utility.
28 August 2025: Reported the HITL dialog forging issue.

Both reports were acknowledged by Anthropic and classified as “Informative,” falling outside their current threat model.

Microsoft (Copilot Chat)

Report was submitted – 15 Oct 2025
Microsoft acknowledges the report – 15 Oct 2025
Microsoft notified us that the engineering team is still working on the issue – 28 Oct 2025
Microsoft marks the report as Completed without fixing the issue 04 Nov 2025

Final response (text version below)

Screenshot of response from MSRC; text follows

Dear Ori,

Thank you for your submission and for continuing to engage with MSRC.

After careful review, we’ve determined that the behavior demonstrated does not meet our classification for a security vulnerability. It requires multiple non-default user actions, does not reliably reproduce across environments, and includes warnings designed to mitigate risk.

Our assessment also considers the role of Workplace Trust, which assumes users operate in environments where they review and trust the code they choose to run. This principle is reflected in Microsoft’s AI Vulnerability Severity Classification, which evaluates both impact and exploitability.

That said, we agree this is a thoughtful observation. While not classified as a vulnerability, we’ve shared it with the engineering team to explore ways we can make this behavior more transparent to users.

We appreciate your efforts to highlight potential concerns and welcome future submissions that demonstrate broader impact or bypass existing safeguards.

Sincerely,
Justin

Microsoft Security Response Center

Tags:

AI Agent

AI Security

Claude Code

Copilot Chat

HITL

Human-in-the-Loop

Lies-in-the-Loop

LITL