KLUE · Autonomous Reasoning Agent

KLUE in the field. AI offensive & defensive PTaaS agents.

Autonomous reasoning across whitebox, greybox, and blackbox engagements. From surface mapping to exploit validation, all in one agent. KLUE works in real browsers, on real targets, and ships proof of impact, not lists of suspicions to triage.

CVE · 6.5-Hour Run

5 CVEs disclosed

Coordinated across open source projects.

Critical · 48-Hour Engagement

57+ findings validated

Working exploits across whitebox, greybox, and blackbox modes.

Benchmark · SAST

100% precision

76.5% recall, 7–12 seconds to validate per finding.

Whitebox · Case 01

11 findings

Container escape on staging

Source-level audit before go-live. Eleven critical bugs caught and chained on the staging environment.

Pre-launch SaaS

Greybox · Case 02

37 min

Account takeover

Coordinated disclosure with NCERT. Seven chained findings across auth and session vectors.

Ministry of IT · NCERT

Blackbox · Case 03

61 min

Full database read

External attacker simulation. From first request to crown-jewel data, end to end.

Public sector

We ran the benchmark.
KLUE sat above the curve.

Rule-based engines cluster lower-left on the precision-recall plane. The structural trade-off between catching more and staying quiet bends them onto a single curve. A reasoning engine is not bound by it.

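[Figure: precision-recall plot, recall % against precision %, both axes 0 to 100. Labeled points: Scanner A · 9.5%, Scanner B · 61.1%, Scanner C · 72.7%, KLUE · 100% precision, 76.5% recall. Annotations: "Above the curve" and "Rule-based ceiling".]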

Precision

100%

Zero false positives across the public SAST benchmark.

Recall

76.5%

Above the empirical rule-based ceiling.

Ground truth

48 vulns

Hand-labeled before scoring. Zero ambiguity.

Validation time

7–12 sec

Per finding, including a working exploit proof.
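
For readers who want the two headline metrics spelled out, here is a minimal sketch of how precision and recall are computed from labeled results. The counts in the example are illustrative only and are not the benchmark's underlying numbers, which this page does not state.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of reported findings that are real vulnerabilities."""
    reported = true_positives + false_positives
    return true_positives / reported if reported else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of ground-truth vulnerabilities that were actually reported."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


# Illustrative counts (an assumption, not benchmark data):
# 3 real findings reported, 0 false alarms, 1 real bug missed.
p = precision(true_positives=3, false_positives=0)
r = recall(true_positives=3, false_negatives=1)
print(f"precision={p:.1%} recall={r:.1%}")  # precision=100.0% recall=75.0%
```

Zero false positives pins precision at 100% regardless of how many bugs are missed, which is why the two numbers have to be read together.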

Most security tools look for what they already know.

A rule. A signature. A playbook of attacks someone wrote down years ago. The work of finding new ways in has been quietly outsourced to a library of old ways in.

The Unspoken Contract

"We will find what is already known. The rest is your problem."

Every product in the category, restated honestly. Useful as a baseline. Inadequate as a security program.

Every scanner · Every signature · Every playbook

Every static analyzer, every vulnerability scanner, every breach and attack simulator on the market today shares one underlying engine. They look for patterns they were told to look for. They catch what their rules describe. Everything else slips past.

That model worked when threats were a slow-moving library. It does not work when the bug you need to catch was written by your team yesterday, with no rule yet to describe it. And it does not work when the attack surface itself is new. There is no signature for an LLM tool definition that becomes a database sink. There is no playbook entry for two safe operations performed in the wrong order. Those are judgments about intent. Intent is not something a pattern can encode.
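
To make the "tool definition that becomes a database sink" idea concrete, here is a small sketch under stated assumptions. The tool name `lookup_orders`, the table, and the schema are all hypothetical, and the in-memory SQLite database stands in for a real backend. The point is that the tool definition looks harmless on its own; the risk only appears when you follow the model-supplied argument into the query.

```python
import sqlite3

# Hypothetical tool definition, in the JSON-schema style many LLM tool APIs use.
LOOKUP_ORDERS_TOOL = {
    "name": "lookup_orders",
    "description": "Return order ids for a customer name.",
    "parameters": {
        "type": "object",
        "properties": {"customer": {"type": "string"}},
        "required": ["customer"],
    },
}


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    return conn


def lookup_orders_unsafe(conn: sqlite3.Connection, customer: str) -> list:
    # The model-supplied argument is concatenated into SQL: this is the sink.
    query = f"SELECT id FROM orders WHERE customer = '{customer}'"
    return conn.execute(query).fetchall()


def lookup_orders_safe(conn: sqlite3.Connection, customer: str) -> list:
    # Same tool, same schema, but the argument is bound as a parameter.
    return conn.execute(
        "SELECT id FROM orders WHERE customer = ?", (customer,)
    ).fetchall()


conn = make_db()
crafted = "x' OR '1'='1"  # an argument a manipulated model could emit
print(len(lookup_orders_unsafe(conn, crafted)))  # 2: every row comes back
print(len(lookup_orders_safe(conn, crafted)))    # 0: treated as a literal name
```

Both handlers satisfy the same tool definition, so the difference lives in the data flow behind the definition rather than in anything a signature on the definition could describe.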

Three engagements. Three case studies.

Real targets. Real chains. Each step confirmed with a working artifact. Nothing reported on suspicion alone.

Public Sector Web Application · Black Box

Eleven findings. Sixty-one minutes. One database, fully compromised.

Internet-facing citizen services portal. No prior endpoint inventory shared. The agent mapped the surface on its own, the way a human tester would, and walked out with the database.

11

Findings

61 min

Duration

Crown jewel

Full DB read

Blind SQLi chained to a superuser role.

The chain

  1. Reconnaissance.
  2. Probing.
  3. Exploitation.
  4. Report.

Outcome

Critical findings remediated within the same week. Permissive CORS and mass assignment patched first. Database role downgraded from superuser the same day the report shipped.
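
As a generic illustration of what a mass-assignment patch usually looks like, here is a minimal sketch that copies only allow-listed fields from a request payload. The field names and the record shape are hypothetical; the portal's actual fix is not described in this write-up.

```python
# Hypothetical allow-list: the only fields a user may change on their own profile.
ALLOWED_PROFILE_FIELDS = {"display_name", "email"}


def apply_profile_update(record: dict, payload: dict) -> dict:
    """Return a copy of record with only allow-listed fields taken from payload."""
    updated = dict(record)
    for field, value in payload.items():
        if field in ALLOWED_PROFILE_FIELDS:
            updated[field] = value
    return updated


user = {"display_name": "A. Citizen", "email": "a@example.org", "is_admin": False}
# A request that tries to smuggle a privileged field alongside a legitimate one.
attack = {"display_name": "New Name", "is_admin": True}

result = apply_profile_update(user, attack)
print(result["display_name"], result["is_admin"])  # New Name False
```

The vulnerable version of this pattern is the one-liner that merges the whole payload into the record, which is why it tends to pass review unnoticed.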

Ministry of IT Portal · Coordinated with NCERT

Seven findings, one chain, full account takeover.

A national portal hosted under the Ministry of Information Technology. The agent found a chain that reached every account on the system without ever supplying a password. Disclosure was coordinated with the National Computer Emergency Response Team (NCERT).

7

Findings

37 min

Duration

Crown jewel

Account takeover

Any official, by username. No password required.

The chain

  1. Unauthenticated IDOR.
  2. Enumerate every account.
  3. Recovery endpoint issued a real auth token.
  4. Switch account to Admin.

Outcome

Coordinated disclosure with NCERT. Authentication middleware shipped on the four critical endpoints within forty-eight hours.

CVSS 9.8 · OWASP A01 · A07
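
The first link in this chain, an unauthenticated IDOR, comes down to two missing checks: is the caller logged in, and does the requested object belong to them. A minimal sketch of both checks follows. The `Session` type, the `ACCOUNTS` store, and the function name are hypothetical and do not represent the portal's actual middleware.

```python
from dataclasses import dataclass
from typing import Optional


class Forbidden(Exception):
    """Raised when a request is unauthenticated or not authorized."""


@dataclass(frozen=True)
class Session:
    user_id: int
    is_admin: bool = False


# Hypothetical account store keyed by account id.
ACCOUNTS = {
    1: {"username": "official.one"},
    2: {"username": "official.two"},
}


def get_account(session: Optional[Session], account_id: int) -> dict:
    # Check 1: authentication. No session, no data.
    if session is None:
        raise Forbidden("authentication required")
    # Check 2: authorization. The id in the request must belong to the caller,
    # unless the caller holds an admin role.
    if session.user_id != account_id and not session.is_admin:
        raise Forbidden("account does not belong to caller")
    return ACCOUNTS[account_id]


print(get_account(Session(user_id=1), 1)["username"])  # official.one
```

Without the second check, incrementing `account_id` is all it takes to enumerate every account, which is the step the chain above relies on.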

Case Study II · Engagement record

Cybersecurity Training Platform · Pre-launch Whitebox Review

Eleven findings before go-live. Container escape caught on staging.

A live-fire training platform two weeks out from public launch. The agent was pointed at the repository on the dev branch and surfaced issues the conventional CI pipeline had not flagged, including a path to host-level access through the API container.

11

Findings

Whitebox

Duration

Crown jewel

Container escape

Docker socket mounted, container running as root.

The chain

  1. Docker socket mounted into API container.
  2. CORS allow-all with credentials.
  3. Hardcoded Django secret-key fallback.
  4. Shell command execution with shell=True.

Outcome

All four high-severity issues fixed on the dev branch before launch. The Docker socket mount was replaced with a scoped socket proxy. The container shipped as a non-root user. Launch went out on schedule with no findings outstanding.
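
Two of the findings in this chain have well-known safer shapes in Python, sketched below under stated assumptions. The environment variable name `DJANGO_SECRET_KEY` and both helper functions are hypothetical, not taken from the platform's code. The idea is to fail closed when the secret is missing instead of falling back to a hardcoded value, and to pass commands as an argument list so no shell ever parses user input.

```python
import os
import subprocess
import sys


def load_secret_key(environ=os.environ) -> str:
    """Read the secret from the environment and refuse to start without it."""
    key = environ.get("DJANGO_SECRET_KEY")  # variable name is an assumption
    if not key:
        # No hardcoded fallback: a missing secret is a deployment error.
        raise RuntimeError("DJANGO_SECRET_KEY is not set")
    return key


def echo_via_subprocess(user_text: str) -> str:
    """Run a child process with user input passed as one argument, no shell."""
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", user_text],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# With shell=True and string interpolation, the ';' would start a second command.
# As a list argument it is just text.
print(echo_via_subprocess("lab-01; echo injected"))  # lab-01; echo injected
```

The child process here is the Python interpreter itself so the sketch runs anywhere; in a real service the list would name the actual binary and its arguments.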

Case Study III · Engagement record

One agent. Four surfaces.

SAST, DAST, cloud audit, and full black-box pentest: same reasoning engine, same isolated runtime, same report format. Pick the surface; the agent does the rest.

Whitebox & Blackbox Engagements

Full-spectrum testing in one engine. External attacker simulation, source-level review, or greybox hybrid, all on the same reasoning runtime.

  • Blackbox external simulation
  • Whitebox source + infra
  • Greybox hybrid runs
  • Real-browser exploit validation

SAST & DAST

Static analysis that traces data flows, paired with dynamic testing against live web apps, APIs, and SPAs. Logic bugs scanners miss.

  • Static data-flow tracing
  • Live web + API + SPA testing
  • Business logic + auth bypass
  • Working proof-of-concept output

Cloud & M365 Engagements

Configuration posture for the hyperscalers and Microsoft 365 tenants. Identity sprawl, public exposure, and conditional-access gaps.

  • AWS / Azure / GCP posture
  • M365 tenant assessment
  • IAM / RBAC sprawl
  • CIS / NIST framework mapping

Code Analysis

Connect a repository. The agent traces injection sinks, spots custom logic bugs, and audits IaC for misconfiguration before deploy.

  • Custom logic flaws
  • Injection sink tracing
  • Terraform / IaC review
  • Supply chain & dependency risks

Thirty minutes. Not four hours. Not weeks.

Recall is half the story. Duration is the other half. The public benchmark for agentic offensive tools sits at four to six hours a run. That cadence forces quarterly testing economics. A thirty-minute run can sit inside a deploy pipeline.

Tool | Recall | Duration | Soft Positive Rate
KLUE (Reasoning agent, thirty-minute budget) | 82% | ~30 min | ~4%
Leading commercial agent (Multi-model orchestration) | 75.0% | 4 hr | 6.3%
Raw frontier model (Unorchestrated agent loop) | 70.0% | 10 min | 6.7%
Open-source agent A (Same underlying model) | 45.0% | 4 hr | 10.0%
Open-source agent B (Same underlying model) | 30.0% | 6 hr | 25.0%
Open-source agent C (Same underlying model) | 5.0% | 2 hr | 0.0%

Public field benchmark against a recognized target with twenty-two documented vulnerabilities. KLUE figures from real runs under a thirty-minute budget. Other figures from publicly reported results.

One

Orchestration is the story.

Three tools on the public board share the same underlying model and produce 45%, 30%, and 5% recall. That spread is an orchestration story. KLUE is the orchestration.

Two

Model choice becomes a cost decision.

On a separate benchmark, KLUE on an open-weights model tied a frontier closed model on critical-severity recall, at one-fifth the cost.

Three

The cadence is the unlock.

Four-hour scans force quarterly cycles. Thirty-minute scans run after every deploy. That is the line credible autonomous tools will be measured against.
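
To show what "runs after every deploy" could look like in practice, here is a minimal sketch of a pipeline gate that reads a JSON report and blocks the release when anything at or above a severity threshold is present. The report schema (a `findings` list with `id` and `severity` keys) and the severity names are assumptions made for illustration; KLUE's actual JSON format is not documented on this page.

```python
import json

# Assumed severity scale, lowest to highest.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def blocking_findings(report_json: str, threshold: str = "high") -> list:
    """Return the findings whose severity is at or above the threshold."""
    findings = json.loads(report_json)["findings"]
    floor = SEVERITY_ORDER.index(threshold)
    return [f for f in findings if SEVERITY_ORDER.index(f["severity"]) >= floor]


# Hypothetical report, standing in for the file a scan step would write.
sample_report = json.dumps(
    {
        "findings": [
            {"id": "F-1", "severity": "critical"},
            {"id": "F-2", "severity": "low"},
        ]
    }
)

blocked = blocking_findings(sample_report)
exit_code = 1 if blocked else 0  # a CI step would pass this to sys.exit()
print(exit_code, [f["id"] for f in blocked])  # 1 ['F-1']
```

A gate like this only makes sense when the scan fits inside the deploy window, which is the argument for the thirty-minute budget.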

Three categories. The category decides the ceiling.

Every product in this space falls into one of three groups, defined by what its engine actually does: matches patterns, replays known attacks, or reasons about the target. The group sets the cap.

Capability | KLUE | Vulnerability scanners | Breach simulators | Other AI agents | Human pentesters

Engine
Underlying model | Reasoning | Rules | Simulated playbooks | Mixed agents | Human reasoning
Discovers unknown vulnerabilities | Partial
Writes custom exploit code | Per target | Preset | Preset
Chains findings into attack paths | Multi-step | Limited | Some | Manual

Delivery
Time per engagement | ~30 min to 6 hr | Continuous CVE | Hours | Hours | 2 to 6 weeks
Output | PDF, JSON, PoC | CSV, dashboard | Dashboard | Dashboard, PDF | PDF (weeks later)
Free retest after fix | Not applicable | Re-run | Re-run | Extra cost

Operations
Continuous coverage in CI | CVE scan only | Scheduled | Scheduled | Impossible
Parallel scans | Unlimited | Unlimited, shallow | Scheduled | Scheduled | One per team
Vendor model | Proprietary, exclusive | Licensed software | Licensed software | Licensed software | Consulting hours

Most tools answer "what known weakness might you have?" KLUE answers "what would an attacker actually do here?" The category decides which question can be asked.

Closing

See what an hour finds.

Run KLUE against your own surface. One hour. Real exploits. Real remediation. No procurement cycle. No consulting engagement.