This document copyright Checkmarx, all rights reserved.

Can AI Really Find Zero-Day Vulnerabilities?

An AI model independently surfacing real, previously unknown security flaws in production software. A landmark moment, right? That's certainly the initial reaction to Anthropic's release of Claude Opus 4.6 and its demonstrated capability to find zero-day vulnerabilities. But is it as magical as it seems? Not exactly.

For one, LLM (Large Language Model)-assisted zero-day identification isn't new, and this isn't the first time someone has claimed to find zero-days with LLM-assisted tools. Google's Big Sleep project found a zero-day in SQLite back in 2024, and researchers have used o3 to uncover CVE-2025-37899. These results have been public for quite some time. But Opus 4.6 gave us another reason to take a hard look at where these capabilities actually stand, and a chance to look at the bigger picture: to share the patterns we observed, what works, what doesn't, and what it all means for the application security industry and zero-day identification at large.

AI For Vulnerability Hunting (Magic Not Included)

The AppSec paradigm, or at least the hype behind it, seeks to take advantage of modern AI capabilities, and that includes LLM-based security reviews. But there's still a long way to go. LLMs work remarkably well in some contexts and fail dramatically in others. The picture that emerges is a familiar one: LLMs are a powerful complementary tool, but not yet a replacement for the security tooling and processes we all rely on today (SAST, DAST, IaC scanners, and the security experts using them). Enforceable security controls depend on consistency, traceability, coverage guarantees, and compliance alignment: capabilities delivered by established AppSec tooling.
Security professionals turn these tool outputs into organizational action: they provide the critical judgment, validation, and accountability required to operationalize and govern these controls effectively. GenAI-based security tools are weak when it comes to providing the deterministic, auditable security guarantees that effective security programs rely on. Their strength lies in accelerating triage, enriching context, and improving developer productivity (e.g., synthesizing findings, explaining vulnerabilities, and recommending remediation steps). All of these are high-value augments for developers, security professionals, and others involved in delivering secure software.

Interestingly, even Anthropic themselves acknowledge this in their blog on LLM zero-day hunting:

"As the volume of findings grew, we brought in external (human) security researchers to help with validation and patch development. Our intent here was to meaningfully assist human maintainers in handling our reports, so the process optimized for reducing false positives." (emphasis ours) – Evaluating and mitigating the growing risk of LLM-discovered 0-days

Honestly, one of the reasons we like working with Anthropic is that they have a habit of this kind of transparency, which we deeply appreciate as researchers. But let's dig deeper into what that means by breaking down where Claude and other LLMs shine, and what pitfalls and caveats to be aware of before you go all-in on an LLM-based security reviewer.
Context Is Everything

Yeah, I know: that title doesn't really say anything new. We all know that when it comes to LLMs, context is king. Yet while everybody talks about zero-day identification, few mention context anymore, as if they have a magic solution for finding zero-days in arbitrarily long, complicated contexts (like real production codebases). Spoiler: there is no such solution.

If you know what you're doing, Claude (and other LLMs) can really save you time on the low-hanging fruit. (And let's be honest, "if" is doing a lot of lifting there.) Claude can review many lines of code in a matter of seconds, identify well-known patterns, and flag strong signals of insecure code, such as a call to eval(). But in practice, the larger the context, the worse the results tend to be.

We don't want you to blindly trust us on this, so we'll walk you through the exact journey the Checkmarx Zero research team took. Let's start by examining one of the zero-days Anthropic claims Claude Opus found in the CGif library during their research. Examine the overflow in the CGif library for yourself. Reviewing this repo, you'll see that the actual code resides in only 3 C files (~600 LoC [Lines of Code] each) and 2 additional header files (~100 LoC each), so the total is less than 2,000 lines. How representative of enterprise code and vulnerabilities is this? While this finding is really cool as an experiment, is it a realistic example? Does it cover YOUR use case? We're not quite sure.

In a similar vein, demos of Claude Code Security since the introduction of Opus 4.6 show its use for reducing false positives from established security tools.
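To make the "low-hanging fruit" point concrete, here is a minimal, hypothetical Python snippet (not taken from any of the projects discussed here) showing the kind of locally obvious eval() pattern that an LLM reviewer flags reliably even without broad context, alongside a safer alternative:

```python
import ast

def calculate(user_expression: str):
    # BAD: if user_expression is attacker-controlled, this is arbitrary
    # code execution. The source-to-sink path fits in two lines, which
    # is exactly why this class of finding is easy for an LLM to spot.
    return eval(user_expression)

def calculate_safely(user_expression: str):
    # ast.literal_eval only accepts Python literals (numbers, strings,
    # tuples, lists, dicts), so a payload like
    # "__import__('os').system('id')" raises ValueError instead of
    # executing.
    return ast.literal_eval(user_expression)
```

The signal here is strong precisely because no whole-program context is needed; the harder cases discussed below are the ones where sources, sanitizers, and sinks are spread across a large codebase.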
But we found that many straightforward, well-known vulnerability classes are missed when you take a large codebase and simply ask Claude to "find some vulns here." Worse, that approach doesn't just reduce true positives; it also significantly inflates the false-positive rate.

One of our tests was a full production-grade codebase scan: we tasked Claude with reviewing the n8n repository as a whole, with no context beyond asking it to search for vulnerabilities. From that single prompt, the analysis identified eight vulnerabilities while consuming approximately 90% of the available tokens. Of those eight findings, only two were true positives.

[Claude Code security reviewer analyzing n8n; specific details redacted as repair has not yet been completed]

Results breakdown (some info redacted for the sake of simplicity):
- Form Trigger XSS: false positive. The sink and input were correct, but the existing sanitizers (e.g., sanitization for the form description, customCss, and form HTML) weren't recognized.
- Unauthenticated debug mode endpoint: the endpoint is disabled by default.

To improve results, we tried tailoring the context by letting Claude review specific code along with some vulnerability-related context, such as past vulnerabilities. Here's some of what Claude found in open-source packages:

False 0-day, real problem: Null Pointer Dereference in FreeRDP. Claude reported a null-pointer dereference in FreeRDP's SDL2 file. The finding initially appeared worthy of upstream disclosure, so we disclosed it. What we then noticed, however, is that the issue had already been disclosed in the original advisory, and the maintainer had applied the fix to a similar file, SDL3, instead.
Claude merely rediscovered a known bug and presented it as a novel zero-day, without recognizing or disclosing the prior report. Furthermore, the SDL3 component was fixed intentionally instead of SDL2, because SDL2 has been deprecated for over a year. Nevertheless, the FreeRDP maintainers responded to the issue and provided a patch for it.

[GitHub comment showing that SDL2 has been deprecated in FreeRDP]

XSS via SVG in n8n. Claude Opus 4.6 successfully identified an SVG-based vulnerability in n8n. Notably, this wasn't exclusive to the top-tier model; OpenAI's models and other Anthropic models also caught it. However, Opus 4.6 also claimed that once SVG is addressed, the issue will be resolved. That simply isn't true: see the research on MIME type sniffing and BlackFan's content-type research for why. This highlights the importance of treating AI-generated recommendations with skepticism, and of having a qualified person review them before taking action.

To be clear, we are not raising these points to shame Anthropic. Quite the opposite: we have a lot of respect for them and their products, which is part of why we spend research time on them. They have built an impressive product that, only a few years ago, would have seemed purely aspirational. However, these important nuances are often not sufficiently highlighted in PR communications or by industry influencers; just look at the outsized reaction to the announcement of Claude Code Security.

As a security company, we know firsthand that defenders need much more than just "identification." Security teams need to know whether a result is a false positive, whether it's really exploitable, what the vulnerability's impact is, how to prioritize it, and so on. These aren't a wish list: they're real client needs that we hear every day. The current generation of AI tools simply can't provide those answers in a repeatable and reliable way.
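To make the MIME-sniffing caveat above concrete: blocking SVG uploads alone still leaves other renderable types and browser content sniffing in play. Here is a minimal, hypothetical hardening sketch (the helper name and dict-of-headers shape are our own illustration, assuming a generic web framework, not n8n's actual fix) that forces user uploads to download rather than render:

```python
def hardened_download_headers(filename: str) -> dict:
    """Build response headers for serving untrusted user uploads."""
    return {
        # Force download instead of inline rendering, so the browser
        # never executes markup (SVG, HTML, XML) from user uploads.
        "Content-Disposition": f'attachment; filename="{filename}"',
        # Tell the browser not to second-guess the declared type via
        # content sniffing.
        "X-Content-Type-Options": "nosniff",
        # Generic binary type: nothing "active" for the browser to run.
        "Content-Type": "application/octet-stream",
    }
```

Served this way, even an SVG containing a script payload is delivered as inert bytes rather than executed in the application's origin, which is the kind of defense-in-depth a "just fix SVG" recommendation misses.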
Without sufficient context, their capabilities are further limited, and supplying that context carries real costs for consumers. As we already said, providing adequate context isn't easy. And when you consider the cost of scanning a project like n8n with an LLM and adequate context, you realize you may exhaust your token budget before seeing meaningful results. Recall that the n8n review consumed a large share of our context budget in a single pass, and still produced false positives. What does that look like at enterprise scale?

What This Means for AppSec

The hype around AI-powered vulnerability discovery is warranted, but only partially. LLMs really can find zero-day vulnerabilities. They can catch issues that traditional tools miss, reason about complex code flows, and surface vulnerabilities that require an understanding of application logic rather than just pattern matching. And they fit neatly into developer and security workflows. There's a reason Checkmarx offers AI security tools too: there's genuine value here.

But AI security tools are not a silver bullet. They miss things. They hallucinate findings. And their effectiveness varies wildly based on how they're used and what context they're given.

The future of AppSec isn't "AI or traditional tools"; it's "AI and traditional tools," wielded by security professionals who understand the strengths and limitations of both. It's wisely chosen tool stacks deployed intelligently into security-aware development lifecycles, where they can accelerate security processes and lower both software risk and security program cost. The organizations that will benefit most are those that treat LLMs as a force multiplier for their existing security programs, not a substitute for them.

This context-dependency has significant implications. It means that getting value from LLM-based security review isn't as simple as pointing a model at your repo and waiting for results.
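What does "not just pointing a model at your repo" look like in practice? One minimal sketch, with hypothetical names and thresholds (this is our illustration, not Anthropic's or Checkmarx's actual tooling), is to decompose the codebase into bounded units and attach a narrow, sink-focused instruction to each:

```python
from pathlib import Path

# Keep each review unit well inside the model's context window; the
# 600-line bound here is an arbitrary illustrative choice.
MAX_UNIT_LINES = 600

def review_units(repo_root: str, suffixes=(".py", ".ts", ".c")):
    """Yield (path, first_line, snippet) units small enough to review."""
    for path in sorted(Path(repo_root).rglob("*")):
        if path.suffix not in suffixes or not path.is_file():
            continue
        lines = path.read_text(errors="ignore").splitlines()
        # Split large files into chunks so each prompt stays focused.
        for start in range(0, len(lines), MAX_UNIT_LINES):
            yield path, start + 1, "\n".join(lines[start:start + MAX_UNIT_LINES])

def build_prompt(path, first_line, snippet, focus="injection and XSS sinks"):
    # A narrow instruction plus a demand for a concrete source-to-sink
    # path is one way to push down the false-positive rate.
    return (
        f"Review {path} (starting at line {first_line}) strictly for "
        f"{focus}. Report only findings with a concrete source-to-sink "
        f"path; say 'no finding' otherwise.\n\n{snippet}"
    )
```

Each prompt stays small, auditable, and cheap to re-run, which is exactly the scoping and prompt-crafting expertise the next paragraph describes.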
It requires security expertise to direct the model effectively: knowledge of what to look for, how to decompose a codebase into reviewable units, and how to craft prompts that minimize noise while maximizing coverage. And it requires careful balancing of benefit against cost to keep security programs within challenging budget constraints. Whether Claude surfaces a vulnerability depends heavily on how you scope the review, what context you provide, and which techniques you use to guide the analysis. If that leaves you feeling that Claude complements your existing tools and team of security researchers rather than replacing them, that's exactly how we feel, too.

Bring Your Own Magic

LLMs are getting better at finding vulnerabilities; that much is clear. But the gap between "impressive demo" and "production-ready security tool" is still wide. Context matters more than anyone in the hype cycle wants to admit; findings need expert validation; and the real value comes from integrating these capabilities thoughtfully into an existing security program, alongside proven tools and qualified experts.

Tags: AI, AI Agent, AI Security, Claude Code, Claude Code Security, LLM, Opus