Low, Medium, Critical: How Four Frontier Models Graded the Same Live Target

NOTE

KLUE is Shellvoide's autonomous penetration testing platform. We benchmark every model we deploy because we do not trust marketing claims, including our own. This run is different from our public benchmarks. Instead of a deliberately broken practice app with a published answer key, we pointed KLUE at one real, in scope production target and gave four different frontier models the same thirty minute budget against it. At the client's preference the target is not named, and every endpoint, field, credential and account name below is genericized. The behaviour, the divergences and the numbers are exactly what each model produced.

This benchmark exists because of Umar Mushtaq, who championed the idea, helped shape how we ran it, and backed the work that made it possible. It is the kind of question he keeps pushing us to answer properly rather than quickly, and this whole post is a tribute to that instinct.

Why we ran this

The question is simple to state and surprisingly hard to answer honestly: on real offensive work, against a real hardened target, does the model behind an autonomous pentester actually change what gets found? Not on a CTF. Not on Juice Shop. On a live application that someone is paying to defend.

So we took one production target and ran it four times. Same scope, same tooling, same orchestration, same clock. The only variable was the model in the driver's seat:

Claude Opus 4.8, the most capable and most expensive of the four.
DeepSeek V4 Pro, the cheapest by a wide margin.
GLM 5.2, the mid priced option.
Kimi K2.7, the other value contender.

The intent was never to crown a winner. It was to see where each model is sharp, where it is blind, and whether the price tag predicts the result. It does not, and that is the whole story.

The target, and the one rule that made it fair

The target was a live creative marketplace built on WordPress and WooCommerce, sitting behind a Cloudflare bot challenge, with a large bespoke REST API bolted on top. It is not a soft target. The operators had renamed the WordPress core directories, fronted everything with a managed challenge, and gated the sensitive write surface behind a custom token scheme. Two sibling tenants shared the same backend. This is a normal, reasonably defended production estate, which is exactly what we wanted.

One rule made the comparison fair. Every run was normalised to exactly thirty minutes of active scanning. Thirty minutes is not arbitrary. A continuous security program does not run a six hour scan once a quarter. It runs after every deploy, in the time it takes to get coffee, or the economics never close. KLUE Quick Scan is a fixed thirty minute budget, and that is the line we hold every model to.

Two of the four runs landed naturally at thirty minutes: the Kimi and Opus runs each finished on the budget in a single clean pass. The other two ran longer. DeepSeek pushed past the budget on its own, and the GLM run was paused and resumed several times during testing. For both, we cut the results at the thirty minute mark of genuine active scanning and counted only findings that were filed before the clock ran out. We anchored the clock to the explicit scan start in each log and ignored the stretches where a run was idle or paused.

That truncation removed real findings from the two longer runs. We are not going to pretend it did not. It is the price of an apples to apples comparison, and apples to apples is the only kind worth publishing.
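As a rough sketch of that cut, assume each log can be reduced to finding timestamps measured from the explicit scan start, plus a list of non-overlapping pause intervals. The `Finding` type, its field names and both helpers are illustrative, not KLUE's internals.

```python
from dataclasses import dataclass

BUDGET_SECONDS = 30 * 60  # the fixed thirty minute budget described above


@dataclass(frozen=True)
class Finding:
    title: str
    filed_at: float  # seconds of wall-clock time since the explicit scan start


def active_seconds_at(t: float, pauses: list[tuple[float, float]]) -> float:
    """Active scanning time elapsed at wall-clock offset t.

    `pauses` is assumed to hold non-overlapping (start, end) offsets during
    which the run was idle or paused.
    """
    paused = 0.0
    for start, end in pauses:
        if t > start:
            paused += min(t, end) - start
    return t - paused


def within_budget(findings, pauses, budget=BUDGET_SECONDS):
    """Keep only findings filed at or before the active-time budget."""
    return [f for f in findings if active_seconds_at(f.filed_at, pauses) <= budget]


# Hypothetical run: one ten minute pause starting at minute ten.
pauses = [(600.0, 1200.0)]
findings = [Finding("early", 900.0), Finding("kept", 2000.0), Finding("cut", 2500.0)]
kept = within_budget(findings, pauses)
```

A finding filed at wall-clock 2000 seconds has only 1400 seconds of active scanning behind it, so it survives. One filed at 2500 seconds is past the budget and is dropped.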

The four contenders, and what they cost

Recall is half the conversation. The other half is what each token costs, because at a continuous cadence the bill is a deployment decision, not a rounding error. These are the live list prices at the time of writing.

Model	Input ($/MTok)	Output ($/MTok)	Context window
Claude Opus 4.8	$5.00	$25.00	1,000,000
GLM 5.2	$1.00	$4.00	1,048,576
Kimi K2.7	$0.61	$3.07	262,144
DeepSeek V4 Pro	$0.43	$0.87	1,048,576

Opus is roughly twelve times the input cost and nearly thirty times the output cost of DeepSeek V4 Pro. Hold that ratio in mind through the rest of this post, because the result does not move the way the price tag says it should.
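For anyone who wants to check the ratios, here is a minimal sketch using only the list prices in the table above. The `PRICES` dictionary and `price_ratio` helper are ours, written for illustration.

```python
# List prices in USD per million tokens, copied from the table above:
# (input, output)
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "GLM 5.2": (1.00, 4.00),
    "Kimi K2.7": (0.61, 3.07),
    "DeepSeek V4 Pro": (0.43, 0.87),
}


def price_ratio(model: str, baseline: str) -> tuple[float, float]:
    """How many times more `model` costs than `baseline`, for input and output."""
    in_a, out_a = PRICES[model]
    in_b, out_b = PRICES[baseline]
    return in_a / in_b, out_a / out_b


in_ratio, out_ratio = price_ratio("Claude Opus 4.8", "DeepSeek V4 Pro")
print(f"Opus vs DeepSeek V4 Pro: {in_ratio:.1f}x input, {out_ratio:.1f}x output")
```

That works out to about 11.6x on input and 28.7x on output, which is where "roughly twelve" and "nearly thirty" come from.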

The scorecard

Here is what each model produced inside its thirty minute budget. A note on counting: the models do not all deduplicate the same way. Some filed an issue once per tenant, others consolidated both tenants into one finding, so raw totals are not perfectly comparable. Severity and class coverage are the honest comparison, and that is where we put the weight.

Model	Findings	Critical	High	Medium	Low	Risk verdict
Opus 4.8	9	0	0	2	7	Low
Kimi K2.7	8	1	3	3	1	Critical
DeepSeek V4 Pro	14	2	6	6	0	Critical
GLM 5.2	9	0	2	4	3	Medium

Same target. Same thirty minutes. Four answers to the question "how bad is it," spanning Low, Medium and Critical. That spread is not noise. It is the finding. And a verdict is only ever as good as the severities underneath it, which is where this gets uncomfortable later.
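The counting caveat is easy to see in code. This is a minimal sketch of one way to consolidate per-tenant duplicates, assuming findings are plain records. The field names (`vuln_class`, `route`, `tenant`) and the sample data are hypothetical and do not come from any of the four reports.

```python
from collections import defaultdict


def consolidate(findings):
    """Merge per-tenant duplicates into one entry per (class, route)."""
    merged = defaultdict(set)
    for f in findings:
        merged[(f["vuln_class"], f["route"])].add(f["tenant"])
    return [
        {"vuln_class": cls, "route": route, "tenants": sorted(tenants)}
        for (cls, route), tenants in sorted(merged.items())
    ]


# Hypothetical raw output from a model that files once per tenant.
raw = [
    {"vuln_class": "user enumeration", "route": "/example/users", "tenant": "tenant-a"},
    {"vuln_class": "user enumeration", "route": "/example/users", "tenant": "tenant-b"},
    {"vuln_class": "missing rate limiting", "route": "/example/login", "tenant": "tenant-a"},
]
unique = consolidate(raw)
print(len(raw), "raw findings ->", len(unique), "consolidated")
```

Three raw entries collapse to two here, which is why we lean on severity and class coverage rather than headline totals.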

Four models, four personalities

Run the same loop four times with four different brains and you do not get four copies of the same report. You get four temperaments.

Opus 4.8: the disciplined minimalist

Opus is what restraint looks like. It produced nine findings, every one confirmed, zero false positives, and it talked itself out of more bugs than the other three combined. Its log is a catalogue of clean rule outs: it killed a reflected XSS candidate after proving both the edge filter and the application escaping held, it ruled out SQL injection on the integer validated parameters, it confirmed a password reset flow actually enforced its one time code gate, and it verified a social login path genuinely validated its provider token rather than crying bypass. It even ran a second reconnaissance pass and surfaced an entire hidden staging estate, then correctly noted all of it was walled off.

It also found the one bug no other model saw, which we come back to. On pure tradecraft, Opus was the most rigorous of the four.

And it rated the target Low risk, which was wrong. The discipline that kept its false positive count at zero also talked it out of the issues that flip the whole engagement. The clearest example is a profile endpoint that returns a user's personal data. Opus tested it against the administrator account, got a 403, concluded "only public profiles are exposed, this is gated," and moved on. It tested one identifier. Two of the cheaper models tested others and found that every non administrative user was wide open. Opus was not fooled by a clever defence. It just did not look at the second door.
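The "second door" point can be made concrete. This is a minimal sketch, assuming an object level check that takes a caller-supplied `fetch_status` function and calls the endpoint "gated" only if every identifier it tried was refused. The identifiers and the simulated responses are hypothetical. Only the 403 on the administrator account comes from the run.

```python
from typing import Callable, Iterable


def profile_exposure(fetch_status: Callable[[int], int], user_ids: Iterable[int]) -> dict:
    """Probe several identifiers and report which ones return profile data."""
    readable, denied = [], []
    for uid in user_ids:
        (readable if fetch_status(uid) == 200 else denied).append(uid)
    # "gated" holds only if nothing tested was readable.
    return {"readable": readable, "denied": denied, "gated": not readable}


def simulated_endpoint(uid: int) -> int:
    """Stand-in for the behaviour described above: admin refused, others served."""
    return 403 if uid == 1 else 200


one_door = profile_exposure(simulated_endpoint, [1])
more_doors = profile_exposure(simulated_endpoint, [1, 2, 3])
```

With a single identifier the check reports the endpoint as gated. With three it does not. Same endpoint, different sample, opposite conclusion.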

DeepSeek V4 Pro: the aggressive maximalist

DeepSeek is the opposite animal. At roughly a twelfth of the input cost of Opus, it came back with fourteen findings, two of them critical, and it did it by trying things rather than reasoning about them. Where Opus saw a session token issued before the password step and spent its energy proving the signature was not forgeable, DeepSeek took the same token and threw it at a content endpoint to see what happened. That instinct, "stop theorising and send the request," is exactly what surfaced the most severe behaviour in the entire benchmark.

It is also the noisiest run of the four. Its report counts two end to end attack chains as separate criticals even though they reuse the same underlying components, which inflates the headline. One of its high findings is a vulnerable component flagged by version fingerprint rather than a working exploit, correctly marked as a lower confidence "likely." And its single loudest claim, a complete account takeover, turned out to be overstated once we pulled on it. More on that below, because it is the most instructive moment in the whole exercise.

The honest read on DeepSeek: the highest ceiling, the most reach, the most noise, and the result that most needs a human to check it before it ships.

GLM 5.2: the unfiltered one

GLM has a specific talent the others lack. It reads the code shipped to the browser like a thief casing a lobby. It was the only model to pull an exposed third party API key straight out of a configuration endpoint, and the only one to find a static credential baked into the front end and used to gate the entire custom API. Those are real, high value finds that depend on patiently reading the things everyone else skims.

It was also the rawest of the four, and the only model that assembled a full account takeover chain on its own. Where Opus looked at the password reset flow and reasoned itself out of it, deciding the one time code gate held and the issue was merely informational, GLM put the chain together and went for the throat. It got the server to hand an unauthenticated caller a reset token for the administrator's account, then hammered the verification step to prove it had no brute force protection, no lockout and no attempt counter, which left a clean path to resetting the administrator's password outright. Opus saw the same flow and walked away. GLM saw a way in. On an authorized engagement, that willingness to actually chase the takeover rather than rule it out from the armchair is exactly the temperament you want in the chair.

Inside the thirty minute budget GLM landed nine findings and a Medium verdict. Its one real stumble was the contested session token, where it ran the right experiment but only against the login flow endpoints, found them locked, and filed the issue as a medium severity leak rather than chasing it onto the content route the way DeepSeek did. Right idea, wrong doors. On the bug it owned alone, though, it was the model that did the thing the others would not.

Kimi K2.7: the efficient closer

Kimi was the quiet professional. A single clean thirty minute session and the most consistently complete coverage of both tenants. It is the only model that tested the sibling site as thoroughly as the primary one, and the only one that bothered to verify its claims to the hilt: it confirmed the exposed media objects were genuinely publicly readable, confirmed the hardcoded credential worked across both tenants, and actually ran a brute force burst against the real login endpoint to prove the rate limiting was missing rather than assuming it.

Eight findings, one critical, three high, and a Critical verdict that sits closer to the reality on the ground than Opus's Low or GLM's Medium. If we were grading on "which single report would I be comfortable sending a client as is," Kimi wins that contest. It found the secrets that GLM found, the object level authorization bug that DeepSeek found, and the content exposure angle that nobody else framed cleanly, and it wrapped it in the tightest writeup of the four.

One endpoint, three verdicts

Here is the moment that justifies the whole experiment. Three of the four models looked at the exact same endpoint, a login step that hands back a session token when you supply an email and no password, and reached three different conclusions about how dangerous it is.

opus 4.8 · the identify endpoint

[AGENT]The identify step returns a session token before any password, and it embeds the user id. If the signature is forgeable, or the reset flow's one time code can be skipped, this is an authentication problem.

[AGENT]The signature is an HMAC with an unknown secret, not forgeable with common keys. The reset flow enforces its code gate. Treating this as information disclosure. Low.

glm 5.2 · the identify endpoint

[AGENT]The identify step hands back a session token for any email, no password. Let me test whether it works as a credential on the authenticate and refresh endpoints.

[AGENT]Rejected on both. The token is not accepted as a bearer credential on the authenticated endpoints. Filing as token leakage. Medium.

deepseek v4 pro · the identify endpoint

[AGENT]The identify step issues a session token for any email. Let me try it as a bearer token on the content endpoint.

[AGENT]It returned the account's content with no password supplied. Authentication bypass. Critical.

Same input. Low, Medium, Critical. Opus reasoned about the token and ranked it down. GLM tested the token, but against the login flow doors, found them locked, and ranked it medium. DeepSeek tested the token against a content door, found it open, and called it a full account takeover.

So who was right? We dug into it, and the answer is the most useful kind: everybody was a little bit right, and the loudest model was the most wrong about severity.

The token genuinely is accepted by that one content endpoint. That is not a hallucination. A second model independently pulled content from the same route. So Opus's Low and the idea that this is mere information disclosure undersells it. But DeepSeek's "complete account takeover" oversells it just as badly. Across both longer runs the token was tried against the authenticate and refresh endpoints (rejected), the messaging and payment endpoints (rejected), and a couple of content endpoints (accepted, but with the actual effect unconfirmed on all but one). The only place the token demonstrably did something is the single content endpoint. DeepSeek's own log, read carefully, even says so: it noted the token's scope was "content, not messages or payments." It mapped the cage correctly and then described it as the open plains anyway.

The real issue is serious: an unauthenticated session token is accepted by a content endpoint it should never satisfy. It is not the skeleton key the headline claimed. The gap between "this token reads one content route" and "complete account takeover" is the gap between a finding a client can triage and a finding that triggers an incident bridge at 2am. That gap is exactly where autonomous pentesting still needs a human to close the loop, and pretending otherwise is how you lose a client's trust. The machine found the door. A person has to be the one to say how far inside it leads.
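One way to keep a headline honest is to derive it from the probe results rather than from the most exciting single response. The sketch below is only an illustration of that idea. The `Outcome` enum, the endpoint labels and the wording of the claims are our assumptions, and the results dictionary paraphrases the probes described above instead of quoting a log.

```python
from enum import Enum


class Outcome(Enum):
    REJECTED = "rejected"
    ACCEPTED_UNCONFIRMED = "accepted, effect unconfirmed"
    ACCEPTED_CONFIRMED = "accepted, effect confirmed"


def supported_claim(results: dict[str, Outcome]) -> str:
    """Return the strongest statement the probe results actually support."""
    confirmed = sorted(e for e, o in results.items() if o is Outcome.ACCEPTED_CONFIRMED)
    if not confirmed:
        return "no demonstrated access"
    if len(confirmed) == len(results):
        return "token accepted on every probed endpoint"
    return "token scope limited to: " + ", ".join(confirmed)


# Generic labels standing in for the genericized endpoints.
probes = {
    "authenticate": Outcome.REJECTED,
    "refresh": Outcome.REJECTED,
    "messaging": Outcome.REJECTED,
    "payment": Outcome.REJECTED,
    "content-a": Outcome.ACCEPTED_CONFIRMED,
    "content-b": Outcome.ACCEPTED_UNCONFIRMED,
}
claim = supported_claim(probes)
```

Fed these results, the function cannot produce "complete account takeover". The most it will say is that the token works on one content route.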

But was the severity right?

The token endpoint is one finding. Now step back and grade every model on the question that actually matters to a defender, which is not how many issues it found but whether it called the danger correctly. We took the findings that carried real weight, adjudicated what each one should have been rated, and lined the models up against it.

Finding	Opus 4.8	GLM 5.2	DeepSeek V4 Pro	Kimi K2.7	Where it should land
Session token accepted before password	Low	Medium	Critical	not flagged	High
Any user's profile readable by id	missed	not flagged	Critical	High	High
Static API credential in client code	not flagged	Medium	not flagged	Critical	High
Unauthenticated content and media exposure	missed	missed	Medium	High	High
Credentialed cross origin read	Medium	missed	missed	missed	Medium
Password reset to admin takeover	missed	High	missed	missed	High
Overall risk verdict	Low	Medium	Critical	Critical	High

Read down the columns and a pattern falls out. Opus did not just flag the fewest of these findings, two of the six. It also rated the session token Low where it belongs at High, and it landed on Low for a target that genuinely sits at High. Its zero false positive discipline bought a clean report and a dangerously reassuring bottom line. DeepSeek erred the other way, stamping Critical on issues that were really High and inflating the headline with an account takeover the evidence did not support. GLM was the study in contrasts: it alone landed an admin takeover chain at the right severity, a High on a flow Opus had dismissed outright, yet it under weighted the static credential it should have shouted about. Kimi came closest overall, calling the profile and content exposures correctly, though it pushed the static credential to a maximum score and missed the token issue entirely.

The honest verdict on this target is High. Real, serious, fix it this week High, driven by unauthenticated access to user data and content. Not the Low the most expensive model reported, and not the Critical two others reached, one by way of an overstated takeover and the other by rating a static credential at the maximum. Nobody produced a perfectly calibrated report, and the model that got closest was not the most expensive one.

One row is still being confirmed. The session token's correct rating assumes the narrow read we landed on, that it satisfies a single content endpoint rather than the whole account. If hands on testing shows it scopes per user, that row moves up, and so does the case for DeepSeek's instinct over its arithmetic.
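The calibration argument can be tallied mechanically. This is a minimal sketch built on the adjudication table in this section. It assumes a simple ordinal scale (Low below Medium below High below Critical) and collapses "missed" and "not flagged" into a single absent bucket. Both are simplifications for illustration.

```python
RANK = {"Low": 1, "Medium": 2, "High": 3, "Critical": 4}
MODELS = ("Opus 4.8", "GLM 5.2", "DeepSeek V4 Pro", "Kimi K2.7")

# (finding, ratings in MODELS order, adjudicated rating); None = missed or not flagged
ROWS = [
    ("Session token accepted before password", ("Low", "Medium", "Critical", None), "High"),
    ("Any user's profile readable by id", (None, None, "Critical", "High"), "High"),
    ("Static API credential in client code", (None, "Medium", None, "Critical"), "High"),
    ("Unauthenticated content and media exposure", (None, None, "Medium", "High"), "High"),
    ("Credentialed cross origin read", ("Medium", None, None, None), "Medium"),
    ("Password reset to admin takeover", (None, "High", None, None), "High"),
]


def calibration(rows=ROWS):
    """Count under-calls, exact calls, over-calls and absences per model."""
    tally = {m: {"under": 0, "exact": 0, "over": 0, "absent": 0} for m in MODELS}
    for _, ratings, target in rows:
        for model, rating in zip(MODELS, ratings):
            if rating is None:
                key = "absent"
            else:
                diff = RANK[rating] - RANK[target]
                key = "under" if diff < 0 else "over" if diff > 0 else "exact"
            tally[model][key] += 1
    return tally


for model, counts in calibration().items():
    print(model, counts)
```

On these six rows Kimi is the only model with two exact calls, DeepSeek is the only one with two over-calls, and Opus is absent on four.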

The blind spots nobody shared

If the models all had the same gaps, you would buy the cheapest and move on. They do not. Each one was blind in a different place, and the only category all four caught was plain user and email enumeration.

Which vulnerability class each model surfaced

Class	Opus 4.8	DeepSeek V4 Pro	GLM 5.2	Kimi K2.7
CORS origin reflection	1/1	0/1	0/1	0/1
Exposed secrets in client code	0/1	0/1	1/1	1/1
Object level auth on profiles	0/1	1/1	0/1	1/1
Unauthenticated content exposure	0/1	1/1	0/1	1/1
Token accepted pre password	1/1	1/1	1/1	0/1
Password reset takeover chain	0/1	0/1	1/1	0/1
User and email enumeration	1/1	1/1	1/1	1/1
Missing rate limiting	0/1			1/1
Verbose error or schema leak	1/1	0/1	0/1	0/1

A 1/1 means the model surfaced that class at all, not how many instances or at what severity. Read the rows: the CORS finding belonged to Opus alone, the password reset takeover chain to GLM alone, the exposed secrets to GLM and Kimi, the object level authorization and content exposure to DeepSeek and Kimi. Only enumeration was universal.

The single most telling row is the first one. A credentialed cross origin reflection, the kind of bug where any website on the internet can read a logged in victim's data, was found by Opus and missed by the other three. And it was not luck. The cheaper three checked for the misconfiguration the lazy way, by looking at response headers on an ordinary request, saw nothing, and declared it safe. Opus did the correct test: it drove a real browser to a genuinely foreign origin and issued a credentialed request from there. That is the difference between checking a box and actually attacking, and on that one bug the most expensive model earned its keep.
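Here is why the header check on an ordinary request can miss this bug, as a minimal sketch. The `simulated_server` is a hypothetical stand-in that echoes whatever `Origin` it is sent, not the target's actual code. Inspecting headers is also only an approximation of the real browser test Opus ran. The condition itself is standard CORS behaviour: a credentialed cross origin read needs the exact origin echoed back and `Access-Control-Allow-Credentials: true`.

```python
def credentialed_reflection(origin: str, headers: dict[str, str]) -> bool:
    """True if these response headers let `origin` read the response with cookies."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    return (
        h.get("access-control-allow-origin") == origin
        and h.get("access-control-allow-credentials") == "true"
    )


def simulated_server(request_headers: dict[str, str]) -> dict[str, str]:
    """Hypothetical server that reflects any Origin header it receives."""
    origin = request_headers.get("Origin")
    if origin is None:
        # An ordinary request carries no Origin, so no CORS headers come back.
        return {"Content-Type": "application/json"}
    return {
        "Content-Type": "application/json",
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Credentials": "true",
    }


foreign = "https://attacker.example"
lazy_check = credentialed_reflection(foreign, simulated_server({}))
real_check = credentialed_reflection(foreign, simulated_server({"Origin": foreign}))
```

Against a server that behaves like this, the ordinary request comes back clean and the request from a foreign origin does not. The misconfiguration only exists in the response to the request an attacker would actually send.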

So the picture is not a ladder. It is a Venn diagram. Opus owns the rigorous browser driven checks, the verbose error leaks, and the CORS bug none of the others saw. GLM owns the secrets buried in client code, shared with Kimi, and it stands alone on the password reset takeover chain. DeepSeek and Kimi own the object level authorization and content exposure. An ensemble of any two or three of these models would have produced a materially more complete report than any one of them alone, which is a finding about orchestration, not about any single model.
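The ensemble claim can be checked with a few lines of set arithmetic. This sketch uses only the seven classes whose per-model attribution is spelled out in the prose and the severity table. It leaves out rate limiting and verbose errors. The class labels are our shorthand.

```python
from itertools import combinations

# Which classes each model surfaced, per the attributions stated in the text.
SURFACED = {
    "Opus 4.8": {"cors", "token pre password", "enumeration"},
    "DeepSeek V4 Pro": {"object level auth", "content exposure", "token pre password", "enumeration"},
    "GLM 5.2": {"reset takeover chain", "client secrets", "token pre password", "enumeration"},
    "Kimi K2.7": {"client secrets", "object level auth", "content exposure", "enumeration"},
}


def ensemble_coverage(size: int) -> dict[tuple[str, ...], int]:
    """Number of distinct classes covered by every ensemble of `size` models."""
    return {
        combo: len(set().union(*(SURFACED[m] for m in combo)))
        for combo in combinations(SURFACED, size)
    }


singles = ensemble_coverage(1)
pairs = ensemble_coverage(2)
triples = ensemble_coverage(3)
print("best single:", max(singles.values()), "of 7")
print("worst pair:", min(pairs.values()), "of 7")
```

On this subset no single model covers more than four of the seven classes, and every pair covers at least five. Only two of the triples reach all seven, and both contain Opus and GLM. They are the sole owners of the CORS finding and the password reset chain.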

The economics nobody wants to hear

Put the cost next to the result. The most expensive model flagged the fewest of the findings that carried real weight and returned the lowest severity, and it did so because its great strength, discipline, is also the thing that talked it out of the bugs that mattered. The cheapest model, at a fraction of the cost, returned the most findings and the most severe ones, and its great weakness, a willingness to over claim, is also the thing that drove it to actually fire the request that exposed the worst behaviour.

That does not mean buy the cheap one and fire the expensive one. It means the choice is a portfolio decision, not a ranking. Opus is the model you reach for when precision and a clean false positive rate matter more than reach, or when you want the rigorous browser driven checks the cheaper models skip. DeepSeek is the model you run wide and often, accepting that a human has to verify the loudest claims before they ship. Kimi sits in a genuinely attractive middle: clean run, broad coverage, strong self verification, and a price closer to the floor than the ceiling. GLM has a real talent for client side secrets that would make it a valuable second opinion in an ensemble.

The bigger point is the one that survives all four runs. The model is not the product. The same orchestration, the same tooling, the same thirty minute budget produced Low, Medium and Critical verdicts on one unchanging target depending only on which brain was driving. The intelligence matters, but what matters more is the harness around it: the calibration that catches both an undersold Low and an oversold Critical, the verification that deflates an overstated takeover, the ensemble that covers one model's blind spot with another's instinct, and the human who closes the last few feet. That harness is what we build. The models are just the engine we drop into it, and we will keep swapping the engine and measuring the result, in public, every time a new one ships.

A tribute

None of this happens without Umar Mushtaq. He brought the idea, pushed us to run it properly rather than quickly, and stood behind the work from start to finish. This post, and the benchmark behind it, are a tribute to that way of working: rigorous, curious, and unwilling to take anyone's word, vendor or model, without testing it first. If you want more of that kind of thinking, his work is worth following.

Want a run like this against your own application? Book an engagement at shellvoide.com/book, or reach us at info@shellvoide.com.