The Zero-Code Security Team: Shifting Left with Prompt-Native AI Agents

Nitish Agarwal

Here's a pattern that plays out at most engineering organizations running at scale. A developer writes a feature, opens a pull request, and somewhere between the CI run and the security certification process, they find out they've introduced vulnerabilities. Sometimes it's 12 issues. Sometimes it's 50. Either way, they're now context-switching back into code they mentally closed days ago, making fixes under time pressure, and occasionally introducing new problems while patching old ones.

This is the classic shift-left failure. We talk about catching problems early, but in practice, "early" still means post-commit for most security tooling. GitHub Actions, SAST pipelines, security reviews — they all fire after the code leaves the developer's machine.

I recently entered GoDaddy's "Compress the Cycle" hackathon, which focused on building solutions that increase developer productivity and reduce cycle time. Our team was selected as a runner-up for the solution we created.

The question we posed was: can AI agent teams move that feedback all the way to pre-commit, running locally, before a single line is pushed?

The answer is yes — and the architecture to do it is surprisingly lightweight. This blog post discusses the solution we built and the lessons we took away from the process.

Security review is a fan-out problem

Manual code security review has always been a parallelism problem in disguise. A thorough review requires multiple lenses simultaneously: static analysis for known vulnerability patterns, logic review for authentication bypass and injection risks, infrastructure configuration for IAM and network policy issues, architectural compliance for approved patterns and design decisions. A single reviewer (human or AI) context-switching between these domains serially produces worse results than specialists working in parallel.

This is exactly the problem that multi-agent orchestration solves. The architecture is a hub-and-spoke with one orchestrator and N specialized domain agents running in parallel:

Developer trigger (pre-commit or CI)
            │
            ▼
    ┌──────────────────┐
    │   ORCHESTRATOR   │
    │  • Reads diff    │
    │  • Spawns lanes  │
    │  • Aggregates    │
    └──┬───────────────┘
       │ fan-out (parallel)
  ┌────┼──────────────────────┐
  ▼    ▼           ▼          ▼
[SAST][Logic   ][IaaC    ][Policy
 Agent][Review  ][Agent   ][Agent ]
  ↕       ↕         ↕         ↕
[Valid.][Valid. ][Valid.  ][Valid.]
  └────┴──────────┴─────────┴──────┘
            │ fan-in
            ▼
     Structured findings
     (CRITICAL blocks commit)

Each domain agent is narrowly scoped — it owns one review domain and nothing else. The static analysis agent doesn't do logic review. The IaaC agent doesn't opine on application code. Strict scope boundaries prevent agents from stepping on each other's findings and producing conflicting, redundant output.
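For concreteness, the fan-out/fan-in shape above can be sketched in Python. This is purely illustrative — the real system expresses this control flow in the orchestrator's Markdown prompt, and `run_lane` with its canned finding is a hypothetical stand-in for "spawn a domain agent and its validator on this diff":

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative lane names; the real system defines these in Markdown prompts.
LANES = ["sast", "logic_review", "iaac", "policy"]

def run_lane(lane: str, diff: str) -> list[dict]:
    # Stand-in for spawning a domain agent + validator on the diff.
    # Returns a canned finding so the flow is runnable end to end.
    return [{"lane": lane, "severity": "HIGH", "status": "confirmed"}]

def orchestrate(diff: str) -> list[dict]:
    # Fan-out: every lane reviews the same diff in parallel.
    with ThreadPoolExecutor(max_workers=len(LANES)) as pool:
        results = pool.map(lambda lane: run_lane(lane, diff), LANES)
    # Fan-in: aggregate findings from all lanes into one report.
    return [finding for lane_findings in results for finding in lane_findings]
```

The point of the sketch is the topology: one trigger, parallel independent lanes, a single aggregation point.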

But the fan-out topology alone isn't what makes this work. The paired Validator per domain agent is.

The Devil's Advocate pattern

AI agents hallucinate. This is a known, well-documented problem. In most domains, a hallucinated answer is an inconvenience. In security code review, a hallucinated finding — a false positive surfaced with high confidence — erodes developer trust immediately. Once a developer dismisses three findings as noise, they start dismissing all findings as noise. The tool becomes useless faster than it became useful.

The standard approach to this problem is prompt engineering: ask the agent to be more careful, add a self-reflection step, tune the confidence threshold. These help at the margins but don't address the root issue — you're asking the same agent that produced the potentially wrong finding to also evaluate whether it was wrong. That's not a reliable check.

The more robust architectural answer is adversarial pairing: every domain agent runs alongside a dedicated Validator agent whose default assumption is that the domain agent got something wrong.

┌────────────────────────────────────────────────┐
│              Domain Agent (SAST)               │
│  • Scans code for vulnerability patterns       │
│  • Produces findings with severity + location  │
└──────────────────┬─────────────────────────────┘
                   │ findings
                   ▼
┌────────────────────────────────────────────────┐
│           Validator Agent (SAST)               │
│  • Default assumption: finding is WRONG        │
│  • Checks against false positive registry      │
│  • Verifies rule currency + policy scope       │
│  • Failure mode: NEEDS_HUMAN (not REJECT)      │
└──────────────────┬─────────────────────────────┘
                   │ confirmed | rejected | needs_human
                   ▼
              Aggregator

This separation of concerns — one agent for breadth, one for precision — is what gets false positive rates into the range where developers actually trust the output.

The following design rules are imperative:

  • The Validator's failure mode must be NEEDS_HUMAN, not REJECT. If the policy registry is unavailable and the Validator defaults to rejecting all findings it can't verify, you'll silently suppress real vulnerabilities on infrastructure failures. Unavailability is not evidence that a finding is wrong.
  • The Validator needs a different toolset than the domain agent. The domain agent needs code analysis tools. The Validator needs policy verification tools — access to false positive registries, active policy decisions, rule currency checks. Same finding, different evidence sources. This is why they're separate agents and not a single self-critique prompt.
  • Rejection reasons are more valuable than confirmations. A confirmation tells you the finding is real. A rejection tells you the domain agent has a systematic blind spot — it consistently misidentifies a pattern, cites a rule that no longer exists, or applies a policy to code outside its scope. Categorize rejections weekly and you have a direct improvement roadmap for your prompts, no model retraining required.
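The first rule — fail to NEEDS_HUMAN, not REJECT — can be sketched as follows, assuming a hypothetical `registry` dict standing in for the false positive registry and active policy store, with `None` modeling an infrastructure outage:

```python
def validate_finding(finding: dict, registry) -> str:
    """Return 'confirmed', 'rejected', or 'needs_human' for one finding."""
    if registry is None:
        # Unavailability is not evidence the finding is wrong:
        # escalate to a human instead of silently suppressing.
        return "needs_human"
    if finding["rule"] in registry.get("known_false_positives", []):
        return "rejected"
    if finding["rule"] not in registry.get("active_rules", []):
        # Deprecated rule citation: a systematic blind spot worth logging.
        return "rejected"
    return "confirmed"
```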

The architecture has almost no code in it

This is the part that surprises most engineers when they first encounter it: the orchestrator described above is not a Python application, not a LangGraph StateGraph, not a containerized microservice. It is a Markdown file.

Claude Code's Agent Teams feature executes natural language control flow natively. The orchestrator prompt reads like a do-while loop in plain English — analyze the diff, conditionally spawn domain agent tasks in parallel, collect inbox messages from each lane, aggregate findings. Claude Code reads this, understands the conditional logic, and executes it through its Task and Teammate primitives.

The entire deployable system we built consists of fewer than 10 Markdown files and zero lines of application code.

The practical implications of this architecture are significant:

Distribution is a file copy. Rolling out to a new team means adding these files to their repository. There's no service to deploy, no infrastructure to provision, no SDK to install beyond Claude Code itself.

Iteration is prompt editing. Improving the system means editing Markdown files. The feedback loop from "validator is rejecting too aggressively" to "fixed" is minutes, not a build/deploy cycle.

Domain experts can contribute without writing code. A security engineer who has never written a Python script can improve the SAST agent's detection logic by editing its system prompt. In fact, one of our team members was a technical writer. The barrier to contribution is writing clearly, not software engineering.

The same artifact runs locally and in CI. The CLAUDE.md, agent definitions, and MCP server connections are identical whether invoked by a developer pre-commit or by a GitHub Actions workflow on PR open. The trigger changes; nothing else does.

Handling policy context without drowning every agent in it

One of the non-obvious architectural challenges in this type of system is shared policy context. Organizations accumulate design decisions, approved patterns, and architectural constraints over time. This context is relevant to every domain agent — a static analysis agent needs to know about approved cryptography libraries, an IaaC agent needs to know about required resource tags, a logic review agent needs to know about approved authentication patterns.

The naive approach is to inject all of this into every agent's context window. This creates two problems: token cost scales badly, and agents start producing findings that conflict with policy decisions they received but misapplied.

A cleaner approach runs policy context in two modes:

Mode 1: Scoped injection at spawn time. Before the orchestrator fans out, it queries the policy store for decisions relevant to the changed files and injects a scoped summary into each agent's startup prompt. The IaaC agent gets IaaC-relevant policies. The logic review agent gets authentication and library policies. Agents start with the right knowledge, not all knowledge.

Mode 2: Standalone policy compliance lane. A dedicated policy agent runs in parallel specifically checking for architectural drift — cases where the implementation diverges from what was formally decided. Its scope is narrow: compliance checking only, not security vulnerability analysis. Its Validator specifically deduplicates against findings from other lanes before confirming.

The two modes together mean policy knowledge is always current (queried live, not embedded statically in CLAUDE.md), always scoped (agents don't wade through irrelevant decisions), and always checked once (the dedup step prevents the same policy violation being surfaced by three different lanes).
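Mode 1's scoping step might look like the following sketch. `SCOPE_RULES` is a hypothetical file-pattern-to-scope mapping used here for illustration; the real system queries the policy store live rather than hardcoding anything:

```python
# Hypothetical mapping from file suffixes to policy scopes.
SCOPE_RULES = {
    ".tf": ["iaac"],
    ".js": ["crypto", "auth"],
    ".py": ["crypto", "auth"],
}

def scoped_policies(changed_files, policy_store):
    """Select only the policy decisions relevant to the changed files."""
    scopes = set()
    for path in changed_files:
        for suffix, scope_list in SCOPE_RULES.items():
            if path.endswith(suffix):
                scopes.update(scope_list)
    # Each agent's startup prompt gets only this filtered subset.
    return [p for p in policy_store if p["scope"] in scopes]
```

A Terraform-only diff would pull in IaaC policies and nothing else; the logic review lane never sees resource-tagging decisions it can only misapply.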

What the output actually looks like

The goal is not comprehensive documentation of every potential issue. The goal is the minimum information a developer needs to unblock themselves, in priority order.

╔══ Security Review ═══════════════════════════════════════╗
║  2 CRITICAL  |  3 HIGH  |  1 MEDIUM  |  5 suppressed     ║
╠══ CRITICAL ══════════════════════════════════════════════╣
║  [Static Analysis + Logic Review — consensus finding]    ║
║  Hardcoded credential — src/config/database.js:14        ║
║  Rule: SAST-SEC-001 | Fix: use environment variable      ║
╠══ HIGH ══════════════════════════════════════════════════╣
║  [Logic Review]                                          ║
║  Admin route bypasses auth middleware — routes/admin:89  ║
║  Policy requires authentication on all /admin paths      ║
╚══════════════════════════════════════════════════════════╝

Two design choices worth calling out:

Consensus findings get flagged explicitly. When two or more domain agents independently confirm the same issue, that confirmation is surfaced in the output. It tells the developer this isn't a borderline call from one agent — multiple independent analyses reached the same conclusion.

Suppressed count is shown, not hidden. The developer knows five findings were filtered out by validators. This builds trust in the filtering mechanism because it's an auditable layer that can be queried rather than a black box.
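The consensus check amounts to grouping confirmed findings by (rule, file, line) across lanes and flagging any group reached by two or more independent lanes. The field names in this sketch are assumptions about the finding shape, not the system's actual schema:

```python
from collections import defaultdict

def mark_consensus(findings):
    """Flag findings independently confirmed by two or more lanes."""
    groups = defaultdict(list)
    for f in findings:
        groups[(f["rule"], f["file"], f["line"])].append(f["lane"])
    report = []
    for (rule, file, line), lanes in groups.items():
        report.append({
            "rule": rule, "file": file, "line": line,
            "lanes": sorted(set(lanes)),
            "consensus": len(set(lanes)) >= 2,  # 2+ lanes agree
        })
    return report
```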

CRITICAL findings exit with a non-zero code, blocking the commit. HIGH findings require acknowledgment. The enforcement is proportional to severity, not binary. A developer who hits a wall of unblockable warnings on every commit will find a way around the tool. Proportional enforcement keeps the tool in the path without becoming the obstacle.
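The proportional enforcement maps cleanly onto exit codes, since git aborts a commit whenever a pre-commit hook exits non-zero. A minimal sketch (the acknowledgment flow for HIGH findings is simplified to a printed notice here):

```python
def enforce(findings) -> int:
    """Exit code for the review: only CRITICAL blocks the commit."""
    severities = {f["severity"] for f in findings}
    if "CRITICAL" in severities:
        return 1  # non-zero exit: git aborts the commit
    if "HIGH" in severities:
        print("HIGH findings present: acknowledge before merging.")
    return 0      # MEDIUM and below never block
```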

The feedback loop that makes the system improve over time

Surprisingly, the rejection log is the most durable part of this architecture, not the agents.

Every time a Validator rejects a domain agent's finding, it records the reason: wrong file/line match, rule no longer active, known false positive, policy exemption applied, severity overclaimed. These reasons, aggregated over weeks of real PR reviews, produce a precise improvement backlog for the domain agent prompts.

Suppose the weekly log shows that the static analysis agent cited SAST-AUTH-003 seventeen times this week, and the validator rejected all seventeen because that rule was deprecated in the last ruleset update.

That's a one-line fix to the agent prompt. No model retraining, no infrastructure change — edit the Markdown file to reference the current rule ID.
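The weekly categorization step is just counting rejection reasons per rule. A sketch with illustrative log entries (the reason labels and entry shape are assumptions, not the system's actual log format):

```python
from collections import Counter

# Illustrative rejection-log entries; real ones come from validator output.
rejections = [
    {"reason": "rule_deprecated", "rule": "SAST-AUTH-003"},
    {"reason": "rule_deprecated", "rule": "SAST-AUTH-003"},
    {"reason": "known_false_positive", "rule": "SAST-SEC-007"},
]

def improvement_backlog(log):
    """Count rejection reasons per rule: the top entries are prompt fixes."""
    return Counter((r["reason"], r["rule"]) for r in log).most_common()
```

The highest-count entries are the prompt edits worth making first.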

This feedback loop compounds. Each iteration of the domain agent prompts reduces the validator's rejection rate. A lower rejection rate means more confirmed findings per scan. More confirmed findings per scan means developers encounter fewer false positives. Fewer false positives means higher trust. Higher trust means higher adoption.

The system gets better at roughly the rate you're willing to read rejection logs and edit prompts. For an internal platform team, that's a sustainable operational model.

When not to use this pattern

Multi-agent orchestration has real costs. Each domain agent is a separate LLM call with its own context window. A full five-lane review with validators runs 8–10 LLM instances simultaneously. For a small diff, that's significant token spend relative to the signal produced.

The conditional fan-out logic mitigates this — only spawn agents relevant to the changed files. A CSS-only change doesn't need IaaC validation. A pure Terraform change doesn't need logic review. But the orchestration overhead is real and shouldn't be obscured.
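That conditional fan-out can be sketched as an extension-to-lane mapping. `LANE_TRIGGERS` here is a hypothetical illustration of logic the real orchestrator expresses in prose in its Markdown prompt:

```python
# Hypothetical mapping from file suffixes to the lanes they activate.
LANE_TRIGGERS = {
    "iaac": (".tf", ".yaml", ".yml"),
    "sast": (".py", ".js", ".ts", ".java"),
    "logic_review": (".py", ".js", ".ts", ".java"),
}

def lanes_for_diff(changed_files):
    """Spawn only the lanes whose file types appear in the diff."""
    active = set()
    for path in changed_files:
        for lane, suffixes in LANE_TRIGGERS.items():
            if path.endswith(suffixes):
                active.add(lane)
    return sorted(active)
```

A CSS-only diff activates no lanes at all; a Terraform-only diff activates just the IaaC lane, so the token spend tracks the review surface.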

Use the following heuristics to help determine when this pattern earns its cost (and when it's counterproductive):

Use it when:

  • Multi-domain review surface: findings in one domain affect decisions in another. Security review genuinely benefits from parallel specialists with cross-lane correlation.
  • False positive rate is a hard constraint: the validator layer exists specifically to get precision high enough that developers trust the output. If you can tolerate noisy output, a single well-prompted agent is cheaper and faster.

Don't use it when:

  • Tasks are inherently sequential: if your review pipeline requires each step to depend on the last, fan-out adds coordination overhead without time savings. Use subagents instead.
  • The problem is well-solved by a single focused agent: multi-agent orchestration solves the context window problem for large, multi-domain tasks. For a focused, narrow task, it's architectural overengineering.

The broader principle

What makes this architecture interesting beyond the security use case is what it says about where the value sits in AI-powered systems.

The instinct when building internal AI tooling is to reach for frameworks, infrastructure, and application code. LangGraph pipelines, vector databases, custom APIs, containerized deployments. That instinct often produces systems that are hard to iterate on, hard to contribute to, and tightly coupled to specific runtime environments.

The prompt-native approach inverts this. The intelligence is in the prompts — the precise scoping of each agent's domain, the adversarial posture of the validator, the do-while control flow of the orchestrator. The runtime (Claude Code's Agent Teams) handles execution. The MCP servers are thin wrappers around existing internal APIs, not net-new application logic.

The result is a system where the primary engineering challenge is thinking clearly about agent responsibilities, scope boundaries, and validation logic — and encoding that thinking in structured natural language. The deployable artifact is a directory of Markdown files. The contribution model is accessible to anyone who can write clearly, not just engineers who can navigate a complex codebase.

That's a different kind of leverage than most engineering infrastructure delivers.

If you're building similar systems or have pushed this architecture into production, the most useful thing to share is what breaks at scale — specifically, how validator accuracy degrades as codebase complexity increases, and what prompt patterns have held up. That's where the interesting engineering still lives.

Originally published on Medium.