AI Agent Audit Program: A CAE's Field Guide

Somewhere in your organization right now, an AI agent has more permission than its job requires.

Not because anyone meant for that to happen. Because the service account got provisioned off a standard onboarding template, the access review never got AI-specific, and nobody asked the question until I started asking it: what is this agent actually allowed to do, versus what it actually needs to do to complete its task?

That gap — between granted permission and required permission — is where most of the real exposure in agentic AI lives. Not in some dramatic rogue-AI scenario. In a service account with write access to fourteen database tables when the agent's job touches four of them, and delete rights on two tables nobody can explain.

If your Audit Committee asked you next quarter whether Internal Audit has independently tested AI agent controls, what would you put in front of them? For most CAEs I talk to, the honest answer is somewhere between "we're working on it" and silence. That's the gap this program is built to close.

Why Your Existing Audit Program Doesn't Cover This

Internal Audit already knows how to test IT general controls, access reviews, and change management. The problem is that AI agents break the assumptions those programs are built on.

Traditional application controls assume a deterministic system: same input, same output, every time. An AI agent is non-deterministic by design, and increasingly, it doesn't just generate text — it acts. It calls APIs, writes to databases, initiates transactions, and does all of this with limited or no per-action human approval. Your standard access review catches over-provisioning at a point in time. It was never built to catch an agent that inherits broad privilege from an orchestrator, or that can be redirected by an instruction hidden inside a document it was simply asked to read.

This is why folding "AI risk" into a generic IT audit checklist produces an opinion nobody can really stand behind. Excessive agency, prompt injection, and goal hijacking aren't variations on existing control language. They are a different category of failure mode, and they need their own test steps, evidence requirements, and standards mapping. This program supplies all three.

What the AI Agent Audit Program Actually Is

This isn't a generic "AI governance framework." It's a control matrix: each agent failure mode it covers is linked to (a) the governing clause in a recognized standard, (b) specific audit test steps, and (c) the evidence an auditor should request to substantiate the control. Three frameworks anchor it throughout:

a. ISO/IEC 42001 — the AI Management System standard. This is the one your organization can actually get certified against, and increasingly regulators and customers treat that certification as a credible maturity signal.

b. NIST AI RMF — a voluntary framework organized around four functions: Govern, Map, Measure, Manage. Useful as a structure even outside the US.

c. EU AI Act — the one binding legal obligation of the three, referenced by article where it actually creates one (high-risk system requirements, GPAI provider duties, transparency obligations).

I deliberately don't conflate these. A management-system standard, a voluntary framework, and binding law are three different kinds of obligation, and a CAE who blurs them in front of a committee or a regulator loses credibility fast.

The program organizes risk into seven domains — Governance & Accountability, Excessive Agency & Authorization, Prompt Injection & Adversarial Exposure, Human Oversight & Escalation, Data Lineage/Quality/Privacy, Vendor & Third-Party Risk, and Monitoring/Logging/Incident Response — and assigns each in-scope agent a risk tier (Critical, Elevated, Limited) based on autonomy, data sensitivity, reversibility, and regulatory exposure. The tier determines how much testing depth that agent actually gets. It also assumes a three-lines-of-defense model — and if your second line for AI risk doesn't exist yet, that absence is itself a finding, not an excuse to skip testing.
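To make the tiering idea concrete, here is a minimal sketch in Python. The four factors come from the program; everything else is an assumption for illustration: the 1-to-3 scoring scale, the additive rule, the thresholds, and the names `AgentProfile` and `risk_tier` are hypothetical and are not the program's actual scoring method.

```python
from dataclasses import dataclass


@dataclass
class AgentProfile:
    """Hypothetical scoring record; each factor is rated 1 (low) to 3 (high)."""
    name: str
    autonomy: int             # 1 = human approves each action, 3 = acts unattended
    data_sensitivity: int     # 1 = public data, 3 = regulated or confidential data
    irreversibility: int      # 1 = easily undone, 3 = cannot be undone
    regulatory_exposure: int  # 1 = none identified, 3 = directly in regulatory scope


def risk_tier(agent: AgentProfile) -> str:
    """Map factor scores to a tier. Thresholds are illustrative assumptions."""
    scores = (
        agent.autonomy,
        agent.data_sensitivity,
        agent.irreversibility,
        agent.regulatory_exposure,
    )
    if any(not 1 <= s <= 3 for s in scores):
        raise ValueError(f"{agent.name}: every factor must be scored 1, 2, or 3")
    total = sum(scores)  # ranges from 4 to 12
    if total >= 10:
        return "Critical"
    if total >= 7:
        return "Elevated"
    return "Limited"


# Hypothetical agents, for illustration only.
payments_agent = AgentProfile("payments-agent", 3, 3, 3, 2)
faq_agent = AgentProfile("faq-agent", 1, 1, 1, 1)
triage_agent = AgentProfile("triage-agent", 2, 2, 2, 1)

for a in (payments_agent, faq_agent, triage_agent):
    print(a.name, risk_tier(a))
```

Whatever scoring rule you adopt, the point is that it is written down and applied the same way to every agent, so the testing depth each one gets can be defended in QA review.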

Walking Through the Resources

1. AI Agent Audit Program & Risk Checklist

This is the working engine. Each of the seven domains is a self-contained control matrix: failure mode, standard mapping, audit test steps, evidence to request, severity rating. There's a working paper template built for QA review and committee reporting, and a heat-map rollup format for presenting findings by domain rather than as a flat issue list.

A concrete example of what it surfaces: testing an agent's service account under the Excessive Agency & Authorization domain and finding it holds read/write access to far more tables than its documented task requires, including delete rights on tables completely outside its scope. That's an Excessive Agency finding, mapped directly to ISO/IEC 42001's least-privilege expectations under Annex A.6. It is not a hypothetical; it is exactly the kind of exception this checklist is built to catch before it becomes an incident.
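The granted-versus-required comparison behind that finding is, at its core, a set difference. A minimal sketch in Python follows, assuming you have already extracted the service account's grants from the database and the required privileges from the agent's documented task. The `Grant` type, the function names, and the table names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One privilege on one table (hypothetical record shape)."""
    table: str
    privilege: str  # e.g. "SELECT", "INSERT", "UPDATE", "DELETE"


def excess_grants(granted: set[Grant], required: set[Grant]) -> set[Grant]:
    """Privileges the account holds that the documented task does not need."""
    return granted - required


def out_of_scope_tables(granted: set[Grant], required: set[Grant]) -> set[str]:
    """Tables the account can touch that the documented task never references."""
    return {g.table for g in granted} - {r.table for r in required}


# Illustrative data only: what the task documentation says the agent needs...
required = {Grant("invoices", "SELECT"), Grant("invoices", "UPDATE")}
# ...versus what the service account was actually provisioned with.
granted = required | {Grant("invoices", "DELETE"), Grant("payroll_runs", "DELETE")}

excess = excess_grants(granted, required)
out_of_scope = out_of_scope_tables(granted, required)

for g in sorted(excess, key=lambda g: (g.table, g.privilege)):
    print(f"excess: {g.privilege} on {g.table}")
print("tables outside documented scope:", sorted(out_of_scope))
```

The hard part in practice is not the comparison but obtaining a trustworthy `required` set. If nobody can produce one, that is worth recording as an observation in its own right.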

2. Companion Guide for CAEs

This is the translation layer — built for the conversations the checklist alone won't help you win. It includes a five-level maturity model (Initial through Optimized) so you can show the committee a trajectory instead of a static red/amber/green grid, a regulatory horizon tracking EU AI Act phased applicability dates, and — critically — honest resourcing guidance: five of the seven domains are executable by a competent generalist audit team with a short briefing. Prompt Injection & Adversarial Exposure is the exception. Testing it properly requires security engineering or red-team capability; don't try to self-certify that domain with a checklist and good intentions.

One nuance worth flagging if you're building a board deck off this: the EU AI Act's core high-risk obligations carry an August 2026 applicability date in the text currently in force, though a deferral has been negotiated and isn't yet formally adopted. Use the binding date as your forcing function for resourcing conversations, but confirm current status against the Official Journal before you put a specific date in front of your committee.

The guide also includes lift-and-use templates: an executive summary built for a committee deck, an engagement planning memo with indicative hours by risk tier, and example findings written in full condition/criteria/cause/effect/recommendation form — including a "rubber-stamp" human oversight finding (a 98% approval rate with a median review time of seconds is not evidence of oversight, it's evidence of the absence of it).
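The rubber-stamp test can be sketched from a human-review log. This is a minimal Python example, assuming the log records an approve/reject decision and the time the reviewer spent on each item. The `Review` type, the function names, and both thresholds (95% approval, 10 seconds median) are hypothetical illustrations, not criteria taken from the guide.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Review:
    """One human review of an agent action (hypothetical log record)."""
    approved: bool
    review_seconds: float


def oversight_metrics(reviews: list[Review]) -> tuple[float, float]:
    """Return (approval rate, median review time in seconds)."""
    if not reviews:
        raise ValueError("no reviews in the sample")
    rate = sum(r.approved for r in reviews) / len(reviews)
    med = median([r.review_seconds for r in reviews])
    return rate, med


def looks_rubber_stamped(
    reviews: list[Review],
    min_rate: float = 0.95,            # assumed threshold, tune to your context
    max_median_seconds: float = 10.0,  # assumed threshold, tune to your context
) -> bool:
    """Flag populations that are approved almost always and almost instantly."""
    rate, med = oversight_metrics(reviews)
    return rate >= min_rate and med <= max_median_seconds


# Illustrative sample: 49 fast approvals and one slow rejection.
sample = [Review(True, 4.0) for _ in range(49)] + [Review(False, 120.0)]
rate, med = oversight_metrics(sample)
print(f"approval rate {rate:.0%}, median review {med:.0f}s,"
      f" flagged: {looks_rubber_stamped(sample)}")
```

A flag here is a prompt for inquiry, not a conclusion. The finding still needs the condition/criteria/cause/effect write-up, including why reviewers are approving so quickly.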

3. External Threat Taxonomies

This one maps the OWASP Top 10 for LLM Applications and the OWASP Top 10 for Agentic Applications — released by the OWASP GenAI Security Project, the latter as recently as December 2025 — to the same three standards, item by item. The value here is less about novelty and more about vocabulary: testing for "ASI02: Tool Misuse" with your engineering team produces a far more specific conversation than discussing "excessive agency" in the abstract. OWASP's lists aren't law. But regulators, cyber-insurance underwriters, and external assessors are increasingly treating them as a de facto technical baseline — an organization that can't show it tested for prompt injection will struggle to credibly claim it has managed AI risk, however polished its governance documentation looks on paper.

How to Start Using This Monday Morning

1. Run the inventory test before anything else. Management's stated list of AI agents is itself an audit object, not a given — corroborate it against procurement records, cloud/API billing, and SaaS discovery tooling, not just the register you're handed.

2. Risk-tier what you find using autonomy, data sensitivity, reversibility, and regulatory exposure before you decide testing depth. Not every agent needs the full seven-domain treatment.

3. Make the build-vs-co-source call on Prompt Injection now, while your Tier 1 (Critical) population is still small enough that bringing in outside testing capability for one domain is cheap relative to building it permanently in-house.

4. Take the maturity tracker into your next risk or audit committee meeting and rate where you actually are today. "We're Developing on Prompt Injection, targeting Defined by Q3" is a far stronger conversation than a static scorecard.

5. Pilot one domain on one Tier 1 agent before trying to scale the whole program across your population. Build credibility with one defensible test before you ask for budget to do ten.
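The inventory test in step 1 boils down to reconciling management's register against independent sources. A minimal Python sketch follows, assuming each source has already been reduced to a set of normalized agent identifiers. That normalization is usually the real work, and the source names and agent names below are hypothetical.

```python
def reconcile_inventory(
    register: set[str],
    evidence: dict[str, set[str]],
) -> tuple[dict[str, list[str]], set[str]]:
    """Compare management's register with independently observed agents.

    Returns:
      unregistered   - agents seen in at least one evidence source but missing
                       from the register, with the sources that saw them
      uncorroborated - agents on the register that no evidence source shows
    """
    observed = set().union(*evidence.values())
    unregistered = {
        agent: sorted(src for src, ids in evidence.items() if agent in ids)
        for agent in observed - register
    }
    uncorroborated = register - observed
    return unregistered, uncorroborated


# Illustrative data only.
register = {"invoice-bot", "hr-helper"}
evidence = {
    "procurement": {"invoice-bot"},
    "cloud_billing": {"invoice-bot", "sales-copilot"},
    "saas_discovery": {"sales-copilot"},
}

unregistered, uncorroborated = reconcile_inventory(register, evidence)
print("not on the register:", unregistered)
print("on the register, not seen elsewhere:", sorted(uncorroborated))
```

Both outputs matter. Unregistered agents are the obvious completeness exception, while register entries that no source corroborates may be decommissioned agents or stale records, and either way they say something about how the register is maintained.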

I built this because "we performed an AI risk assessment" isn't a sentence I want standing between the Audit Committee and a real exposure. If you're a CAE, audit leader, or risk/IT executive trying to get ahead of this rather than reacting to it after the first incident, the full Risk Checklist, the Companion Guide, and the Threat Taxonomies are free to download from the Resources page.

I wrote previously about the broader shift this is part of in Rise of AudTech, and about the analytics foundation this kind of testing builds on in Advanced Techniques in Risk and Audit Analytics for Anomaly Detection. If you're working through this in your own shop, I'm genuinely happy to compare notes. Connect with me on LinkedIn or send me an email.