
Unearned Confidence: AI Security Reviewers Don’t Really Get It

AI-based security reviewers can be great helpers. But the gap between the certainty they express in their findings and the reality of their current capabilities can lead to problems. Understanding their limits helps AppSec teams use these features wisely.

A robot holding a red stamp and a punk-looking woman holding a checklist stand in front of a stylized monitor. The woman is a security reviewer who is bringing "context", "defaults", and "defense in depth" items to challenge the robot's determinations

Have you ever had a co-worker who was generally good at their job, but would also give extremely confident answers that were entirely wrong? I certainly have. And that’s kind of what it’s like to work with the current crop of AI security reviewers.

The Checkmarx Zero research team has been constantly evaluating AI-based security review capabilities since AI tools first started to have security scope. While progress has been exciting in several ways, we keep finding ways the approach breaks down significantly. You can trick it into ignoring horribly insecure (even malicious) code, for example.

But more concerning, at least to me, is the unearned confidence of these AI systems. They’ll confidently give you answers that range from oversimplified to outright incorrect.

To be clear, this doesn’t mean there’s no value. These tools can be extremely useful in the hands of a skilled professional. It does mean, however, that it’s essential to understand their limits so that you can put them in the right place within your security program.


AI Is Good Now. And That’s The Problem.

Large language models (LLMs) have reached a point where they can perform surprisingly solid code reviews. They can trace control flow, explain complex refactors, identify injection risks, reason about type confusion, and compare patches against vulnerability descriptions.

In many cases, they surface relevant insights faster than a human scanning the same diff for the first time. For routine review tasks, they can be genuine productivity multipliers. Some models can even find zero-day vulnerabilities, with some important limitations.

That progress is real.

But alongside it comes a subtle risk. Because the output looks structured, confident, and technically articulate, it is tempting to treat it as authoritative, especially in security contexts. When a model references specific functions, describes control flow accurately, and delivers a decisive conclusion, it feels like expert analysis.

As we are about to see, even when an AI demonstrates strong code comprehension, it can still misjudge exploitability, misunderstand configuration defaults, and overstate weaknesses. It can produce an analysis that sounds exactly like a senior security engineer while missing critical contextual details that completely change the outcome, resulting in false positive results.

AI can now assist with serious code review. But can we trust it on security issues with our eyes closed?

Short Answer: No.

Recently, I decided to test Claude Opus 4.6’s “zero-day identification” capabilities by asking it to analyze whether the fix for CVE-2022-4506, an unrestricted file upload vulnerability, was sufficient. The target was OpenEMR, an open-source electronic health record (EHR) and medical practice management platform.

It confidently responded:

short answer: No, it’s insufficient… It determines the mimetype more accurately but never acts on it… The real guard (isWhiteFile) is conditional.

It further claimed that because certain checks were gated behind $GLOBALS['secure_upload'], the protection was effectively optional.

It sounded convincing. It referenced real functions. It identified real control flow. It made a clear, assertive claim.

There was just one problem: it was wrong, making it a serious false positive (FP) result.

What the Patch Actually Did

Looking at the actual commit, the changes in controllers/C_Document.class.php show a meaningful hardening of the upload logic. The patch removes reliance on the user-controlled $_FILES['file']['type'], introduces server-side MIME detection via mime_content_type(), performs explicit DICOM signature validation (‘DICM’), and skips the upload entirely if no MIME type can be determined.

In other words, the patch shifts trust from client-controlled metadata to server-side validation. That is not cosmetic. It is a fundamental security improvement. This is precisely what we want to see in a file upload fix: eliminate trust in user input and enforce server-side validation.
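To make the shape of that trust shift concrete, here is a minimal Python sketch (not OpenEMR’s actual PHP; the function names and the two detected types are illustrative assumptions): the client-supplied MIME type is ignored, the type is derived from the file content itself, and the upload is skipped when detection fails.

```python
def detect_mimetype(file_bytes: bytes):
    """Server-side detection from file content, standing in for mime_content_type()."""
    if file_bytes.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    # DICOM files carry the 'DICM' signature after a 128-byte preamble
    if len(file_bytes) >= 132 and file_bytes[128:132] == b"DICM":
        return "application/dicom"
    return None  # unknown type: the caller must refuse the upload


def handle_upload(client_mimetype: str, file_bytes: bytes) -> str:
    # Pre-patch behavior would trust client_mimetype ($_FILES['file']['type']).
    # Post-patch behavior derives the type server-side and ignores the claim.
    mimetype = detect_mimetype(file_bytes)
    if mimetype is None:
        return "skipped"  # mirrors the patch's log-and-continue path
    return "stored as " + mimetype
```

Note that `client_mimetype` is accepted but never consulted, so an attacker claiming `image/png` for a PHP payload gains nothing.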

The AI’s conclusion hinged on two contextual misunderstandings that formed the basis of this FP result.

Screenshot: a GitHub-style diff of controllers/C_Document.class.php (1 file changed, 10 additions, 3 deletions). A removed line sets $mimetype directly from the uploaded file’s type; newly added logic checks whether $mimetype is empty, tries to detect it with mime_content_type(...), and, if detection still fails, logs an error and skips the upload with continue;.
The fix: Code diff adding safer MIME type handling for file uploads.

The Forgotten Whitelist

First, the model treated isWhiteFile() as if it were a weak or secondary check, something incidental in the flow. But it missed what the function actually does. isWhiteFile() is not a cosmetic helper. It performs a whitelist verification of allowed file extensions and types. In other words, it enforces a positive security model where only explicitly permitted file types are accepted. That is a meaningful control, not a soft signal.

The isWhiteFile() function definition (screenshot, reproduced as text):
function isWhiteFile($file)
{
    global $white_list;
    if (is_null($white_list)) {
        $white_list = [];
        $lres = sqlStatement("SELECT option_id FROM list_options WHERE list_id = 'files_white_list' AND activity = 1");
        while ($lrow = sqlFetchArray($lres)) {
            $white_list[] = $lrow['option_id'];
        }
        /* ... */
    }
    if (in_array($mimetype, $white_list)) {
        $isAllowedFile = true;
    } else {
        /* ... */
    }
    return $isAllowedFile;
}

Second, the model argued that because certain checks were gated behind $GLOBALS['secure_upload'], the protection was effectively optional. However, in standard OpenEMR installations, secure_upload is enabled by default. That means these restrictions are active out of the box in real-world deployments. Treating a default-enabled safeguard as optional misrepresents the practical security posture of the system.

The globals.inc.php entry defining secure_upload (screenshot, reproduced as text):
'secure_upload' => [
    xl('Secure Upload Files with White List'),
    'bool',                           // data type
    '1',                              // default
    xl('Block all files types that are not found in the White List. Can find interface to edit the White List at Administration->Files.')
]

This is where the analysis failed. It reasoned about configurability in theory without considering default configurations in practice. Security is rarely about isolated lines of code. It is about how features behave in real deployments.
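The distinction between “configurable” and “optional” can be sketched in a few lines of Python (names are illustrative, not OpenEMR’s code): the guard runs unless an operator explicitly overrides the shipped default, so any analysis that assumes it is off is reasoning about a non-default deployment.

```python
# Illustrative sketch: a default-enabled guard is active out of the box.
DEFAULTS = {"secure_upload": True}  # mirrors the '1' default in globals.inc.php


def is_upload_allowed(mimetype, white_list, overrides=None):
    config = {**DEFAULTS, **(overrides or {})}
    if config["secure_upload"]:
        # Positive security model: only explicitly permitted types pass.
        return mimetype in white_list
    # Only reachable when an administrator explicitly disables the guard.
    return True
```

An assessment that treats the whitelist as “effectively optional” is describing the `overrides` branch, not the deployment users actually get.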

Now, can reasonable people disagree? Certainly! And that’s the point: AI doesn’t have “experience” the same way a security team does, so it can’t parse these nuances to arrive at a sensible recommendation for a specific organization. And it doesn’t seem like that ability will arise in the near future.

In short, whether you’re a developer validating the robustness of a fix or a security researcher attempting to bypass a CVE patch, using AI naively can introduce unnecessary overhead in both time and resources. Developers may waste valuable effort hardening code that is already secure, while researchers may pursue convincing but ultimately false leads. In both cases, the outcome is the opposite of what AI is meant to provide: increased efficiency and clarity.

Trivially Bypassable

I ran a similar experiment against the fix for CVE-2022-4733, an XSS vulnerability in OpenEMR. I asked the model to validate the relevant patch.

XSS fix, diff 1
XSS fix, diff 2

Its conclusion was again definitive:

“Is the fix sufficient? No, it’s trivially bypassable.”

It then proposed bypass techniques.

Bypass 1: Nested string
jajavascriptvascript:alert(1)

The model reasoned that after str_ireplace removes “javascript” from the middle, what remains is:

javascript:alert(1)

It argued that the function performs a single-pass replacement and does not loop until all matches are removed.
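The single-pass behavior the model described is easy to demonstrate. The sketch below uses Python’s re.sub as a stand-in for PHP’s str_ireplace; both substitute matches in one left-to-right pass and never rescan the stitched-together result.

```python
import re


def javascript_string_remove(text: str) -> str:
    # Case-insensitive, single-pass removal, mirroring
    # str_ireplace('javascript', '', $text).
    return re.sub("javascript", "", text, flags=re.IGNORECASE)


# Removing the inner occurrence stitches the outer halves back together:
print(javascript_string_remove("jajavascriptvascript:alert(1)"))
# -> javascript:alert(1)
```

A robust filter would loop until the input reaches a fixed point, or, better, validate the URL scheme against an allowlist instead of deleting substrings.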

Bypass 2: URL encoding and mixed techniques

It further suggested that URL-encoded variants of the javascript:alert(1) payload might bypass filtering, depending on how encoding interacts with str_ireplace and template attribute escaping.
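The encoding concern can be sketched the same way (again in Python as a stand-in for the PHP filter). The filter compares literal strings and never decodes its input, so a percent-encoded payload passes through untouched; whether anything downstream ever decodes it into an executable javascript: URL depends entirely on the rendering context, which is exactly the uncertainty that demands manual testing.

```python
import re
from urllib.parse import unquote


def javascript_string_remove(text: str) -> str:
    return re.sub("javascript", "", text, flags=re.IGNORECASE)


encoded = "java%73cript:alert(1)"             # %73 is 's'
filtered = javascript_string_remove(encoded)  # no literal match, so unchanged
print(filtered)            # -> java%73cript:alert(1)
print(unquote(filtered))   # -> javascript:alert(1), if a decode step ever runs
```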

The filter in question:

function javascriptStringRemove($text)
{
    return str_ireplace('javascript', '', $text ?? '');
}

Again, the reasoning sounded plausible. It demonstrated awareness of common filter bypass patterns. It referenced real string manipulation behavior. It framed the issue in a way that resembles classic XSS filter failures.

By this point, I was nearly convinced that the fix was truly bypassable. But, as I had discovered in the first section of this blog, AI can sound confident even when it is dead wrong, so you can never be too sure. The only thing left to do was to manually verify whether the suggested payloads would actually run, and one of them did! Claude was partially right, with 1 out of 2 suggested payloads working.

Inserting the suggested payload into the vulnerable field
Hovering over the user manual link shows that its URL is our payload, javascript:alert(1)
The alert pops after clicking the user manual link

Fool Me Once

Since Claude misled me while testing the CVE in the first section, I decided to push back and see how it would operate on shakier ground. Guess I was still a bit hurt.

So instead of celebrating that Claude had successfully bypassed the second CVE fix, I chose to fool it and claim that the payload didn’t work at all – a perfectly reasonable scenario, considering Claude may interact with either a non-security professional or a developer who felt challenged by its analysis and became protective of the code they wrote.

Claude’s response turned out to be yet another good example of why relying on AI can be dangerous:

“The rendered link is opened in a new tab using target="_blank". This key detail is critical in this case since modern browsers block navigation to javascript: URLs in many contexts, particularly when opened as a new browsing context”.

The model’s response claimed that browsers can sometimes block navigation to new tabs when the URL has a javascript: scheme, which would explain why the payload it suggested did not execute. Again, this sounds very convincing, right? But as we had already proven in the previous section, the payload worked just fine.

Someone without the knowledge of how to test and challenge the explanations given by the model has no other option but to rely on its answers blindly, with the belief that it knows best.

This is very risky since, as we know, humans make mistakes, and if we insist on our mistakes, then even AI models can’t help us out.

After Further Questioning

After the first run on each CVE, I refined my prompt in the hope of getting a more complete analysis. Taking CVE-2022-4506 as an example, I explicitly instructed the model to “deep dive into all relevant files and methods,” since it had not done so in the initial response. I also pointed it to the sanitizing function it had missed, isWhiteFile(), and to the fact that $GLOBALS['secure_upload'] is enabled by default. These were the two main reasons its original assessment was incorrect.

With additional prompting, the model eventually corrected itself. It acknowledged the broader safeguards. It adjusted its assessment.

That is important.

The issue is not that the model is incapable of understanding these nuances. It is that it does not reliably account for them on its own. Without guided skepticism from the user, it can confidently deliver an incomplete or misleading security verdict.

And that distinction matters.

The Real Cost of Confident Mistakes

The main point of this story is not that the model made a mistake and produced an FP result, but how authoritative it sounded while doing so.

When I first read the model’s response about the fix for the file upload CVE (CVE-2022-4506), it sounded very convincing. Based solely on what the model presented, it was easy to believe the fix could be bypassed and to spend precious time trying to exploit it.

In the case of the XSS vulnerability (CVE-2022-4733), when I told the model that its proposed payload did not work, it produced another detailed and persuasive explanation for why it supposedly failed to execute. If I had not had the tools and the relevant knowledge to verify that explanation myself, I might have accepted it as correct and assumed the fix was sufficient, even though the code could still have been vulnerable.

The responses were structured, decisive, and technically detailed. They referenced specific functions and control flow. If you were not deeply familiar with the codebase or browser behavior, you might accept the conclusions without double-checking.

In security work, that is risky. And it is not only risky: now that AI tools are an integral part of security reviews, especially in the bug bounty community, vendors are drowning in AI-produced reports that mostly turn out to be full of false positive results, which cost precious time to understand and analyze.

A flawed AI assessment can incorrectly flag a fix as insufficient, overlook default safeguards, misinterpret configuration-driven logic, ignore execution context, generate noise that wastes engineering time, and erode trust in legitimate security fixes. That’s a whole world of possible FPs waiting to happen.

In vulnerability research and code review, context is everything. Default settings matter. Surrounding validation layers matter. Deployment assumptions matter. Runtime behavior matters.

LLMs do not always reliably model those layers.

Defense in Depth vs. Single-Function Thinking

This example highlights a broader pattern in AI security reviews. Models are often good at local reasoning: analyzing a function, spotting a potentially unsafe pattern, comparing before-and-after diffs, and constructing hypothetical bypass strings. They are much weaker at system reasoning: understanding how configuration defaults, architectural guardrails, browser behavior, and layered defenses interact across a full execution path.

Security fixes are rarely about a single line. They are about defense in depth.

Hypothesis, Not Verdict

When an AI treats configurable safeguards as effectively disabled, ignores defaults, or assumes execution without validating runtime constraints, it can produce technically plausible but operationally incorrect conclusions.

AI can absolutely accelerate security reviews. It can summarize diffs, highlight suspicious patterns, explain unfamiliar code, and brainstorm potential attack paths. But it should not be treated as a final authority, especially when evaluating exploitability or patch sufficiency. And if not used wisely, it can cause the opposite effect, extending the time you spend verifying, revalidating, and analyzing solid fixes.

Every AI-generated claim about security should be treated as a hypothesis, not a verdict.

If anything, this experience reinforced an old lesson: context and careful code review still matter.

AI can help you think faster and accelerate your analysis. It cannot replace a deep understanding of how the system actually behaves.

And in security, “actually” is the only thing that counts.

Tags:

AI

AI Security

Claude Code Security

Open-Source Software