Key takeaways
  • Traditional software audits assume deterministic, version-stable behaviour with a fixed input-output specification. Generative AI agents break this assumption in four distinct ways: non-determinism, model drift, emergent capability, and tool-use expansion.
  • These four failure modes are not edge cases. They are structural properties of how generative models and agentic orchestration architectures work, and they require different audit methods at every stage of evaluation.
  • Certification must shift from a point-in-time audit to a continuous monitoring regime. A certification issued at a single moment in time provides limited assurance the day after the underlying model is updated or a new tool is connected.
  • The Agent Certified methodology addresses this directly through Dimension 6 (AI Integration) and Dimension 7 (Autonomy Envelope), both of which require documented continuous monitoring rather than periodic manual review.
  • Operators should expect re-certification triggers tied to four event classes: significant model updates, new tool or API additions, new user populations, and detected behavioural drift crossing a defined threshold in production monitoring.

The dominant certification model for software was designed around a simple premise: given the same input, the system should produce the same output. This determinism assumption is the foundation on which unit testing, regression suites, formal verification, and point-in-time security audits all rest. It is the premise that allows an auditor to examine a system at one point in time and produce conclusions that carry forward with reasonable confidence until the next scheduled review.

Generative AI agents do not satisfy this premise. They are probabilistic by construction, they change behaviour when their underlying models are updated, they discover capabilities that were not in their original specification, and they expand their effective action surface each time a new tool is connected. None of this is a bug. These are structural properties of large language model-based systems operating in an agentic architecture. The certification methodology must be rebuilt to match them.

Why software certification worked for deterministic systems

To understand why generative agents break the model, it helps to be precise about what the model assumed. A well-designed certification audit for traditional software rests on four properties that, when they hold, make a point-in-time evaluation valid and useful.

First, determinism: the same input reliably produces the same output. This allows test suites to be written against known expected outputs, regression results to be compared across builds, and security assessments to probe specific code paths with confidence that the paths are consistent.

Second, version stability: between audit cycles, the system changes only through a managed change control process with explicit version numbers and change logs. An auditor who certifies version 3.4.1 of a system can be confident that version 3.4.1 continues to behave as certified until 3.4.2 is released and reviewed.

Third, specification completeness: the system has a defined input-output contract that can be tested exhaustively or at least representatively. The space of possible inputs is bounded, and the acceptable space of outputs for each input class is specified.

Fourth, a closed execution environment: the system's capability set is fixed by its codebase. It cannot acquire new capabilities without a code change that goes through change control.

These four properties are the load-bearing structure of traditional software certification. When they hold, a point-in-time audit is a reasonable proxy for ongoing conformance. When they do not hold, the audit produces a result that may be misleading by the time the ink is dry.

How generative agents break each assumption

Generative AI agents built on large language models (LLMs) fail each of these four properties in specific and measurable ways.

Determinism fails because LLM inference is probabilistic. The same prompt passed to the same model at temperature greater than zero will produce different outputs on successive calls. At temperature zero, outputs are more stable but not guaranteed to be identical across infrastructure changes, and most production deployments do not run at temperature zero. An agent's reasoning chain, tool selection decisions, and output phrasing will vary across runs even when the task is identical.

Version stability fails because model providers update their models continuously, sometimes with version numbers and sometimes without. A model deployed through an API endpoint labelled "gpt-4o" or "claude-3-5-sonnet" may receive updates to the underlying weights without the deploying organisation being notified. NIST AI 100-1 (the NIST AI Risk Management Framework, published January 2023) and the subsequent NIST AI 600-1 Generative AI Profile (July 2024) both acknowledge model provenance opacity as a systemic risk category. The certified behaviour of a system at audit may not be the current behaviour of the same system six weeks later.

Specification completeness fails because natural language is not a formal specification language. The input space for a generative agent is essentially unbounded: any string a user or upstream system can construct is a potential input. No test suite can cover this space. The acceptable output space is equally open-ended. Bommasani et al. (2021, "On the Opportunities and Risks of Foundation Models," Stanford HAI) documented this opacity problem: the same foundation model may exhibit substantially different behaviours across domains and prompting styles in ways that are not predictable from the training specification alone.

The closed execution environment fails because agents are specifically designed to use tools. Adding a new API connection, a new retrieval source, or a new code execution capability expands what the agent can do in ways that the original audit did not evaluate. A customer service agent certified in January that gains access to a payment processing API in March has materially expanded its risk surface without any change to its certification status.

The four failure modes specific to agentic AI

When these broken assumptions combine, four concrete failure modes emerge for generative agents. These are not theoretical categories: each has produced documented incidents in production deployments as of early 2026.

Failure mode: Non-determinism
  Root cause: Probabilistic inference at sampling temperature above zero
  Example manifestation: Agent takes a different tool sequence for identical tasks, producing different outputs that satisfy or violate policy depending on the run
  Traditional audit coverage: None; deterministic regression testing does not surface stochastic failures

Failure mode: Model drift
  Root cause: Provider-side model weight updates behind a stable API label
  Example manifestation: Instruction-following fidelity degrades after a provider update; safety behaviours change; output format shifts break downstream parsing
  Traditional audit coverage: None; version control assumes the versioned artifact is fixed

Failure mode: Emergent capability
  Root cause: Capabilities present in the base model but not elicited until specific prompting patterns are encountered in production
  Example manifestation: Agent exhibits reasoning or knowledge in a domain the operator did not intend, enable, or test for
  Traditional audit coverage: Partial; red-teaming can probe for some emergent behaviours but cannot enumerate them

Failure mode: Tool-use expansion
  Root cause: New tools or APIs added to the agent's accessible action space without re-evaluation
  Example manifestation: Agent that was certified for read-only operations gains write access to a production database through a new tool integration
  Traditional audit coverage: None; certification assumed a fixed capability set

Each failure mode has a different root cause and requires a different mitigation approach. Non-determinism requires statistical sampling across output distributions rather than deterministic test comparison. Model drift requires production monitoring with drift detection against a behavioural baseline established at certification. Emergent capability requires structured red-teaming combined with ongoing operational telemetry. Tool-use expansion requires a formal change control process that treats new tool additions as re-certification triggers.
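The statistical-sampling approach to non-determinism can be sketched in code. The fragment below is an illustrative Python sketch, not part of the Agent Certified methodology: it gates a stochastic evaluation on the lower bound of a Wilson score interval over repeated runs, rather than on a single deterministic pass/fail comparison. The function names and the 95% compliance target are assumptions.

```python
import math

def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for an observed pass rate.

    Used instead of a raw pass/fail comparison because a stochastic agent
    must be judged on a distribution of runs, not on one run.
    """
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

def passes_stochastic_gate(run_results: list[bool],
                           required_rate: float = 0.95) -> bool:
    """Pass only if we are statistically confident the true policy-compliance
    rate meets the required threshold (illustrative 95% default)."""
    return wilson_lower_bound(sum(run_results), len(run_results)) >= required_rate
```

Note the consequence of the interval-based gate: ten clean runs are not enough evidence to certify a 95% compliance requirement, whereas two hundred clean runs are. Sample size becomes an explicit audit parameter.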

What red-teaming adds to certification (and where it does not suffice)

Red-teaming, as codified in NIST AI 600-1 (Generative AI Profile, July 2024) and referenced in the EU AI Act Regulation (EU) 2024/1689 Article 9's risk management requirements, is an adversarial evaluation practice in which trained evaluators attempt to elicit unsafe, harmful, or policy-violating behaviour from an AI system. For generative agents, red-teaming is a necessary component of certification evaluation. It is not sufficient on its own.

Red-teaming is effective at identifying prompt injection vulnerabilities, jailbreaking paths that bypass stated policy constraints, manipulation resistance failures where an agent can be induced to take actions outside its stated scope, and specific high-risk output categories (harmful content, privacy violations, misinformation) that can be probed through targeted adversarial input construction.

Red-teaming is structurally limited in several ways that matter for certification. First, it is a point-in-time evaluation: it probes the system as it exists at audit time against a defined threat model. It does not provide assurance about behaviour after a model update, a new tool addition, or exposure to a user population with different prompting patterns. Second, the coverage of the input space is necessarily incomplete. A red-team exercise of even substantial scope cannot enumerate the adversarial input space for a system with an open natural language interface. NIST AI 600-1 explicitly notes that "red-teaming alone does not provide comprehensive risk coverage for generative AI systems." Third, red-teaming evaluates the system at one configuration. It does not evaluate the governance structure, the data provenance controls, the operator accountability framework, or the autonomy envelope specification that determine how the system behaves when it encounters situations outside the red-team's threat model.

For certification purposes, red-teaming evidence contributes to the Trust and Safety dimension evaluation. It does not substitute for the Governance, Context Integrity, AI Integration, or Autonomy Envelope dimensions. Each of those requires different evidence types that red-teaming does not produce.

Continuous certification: the shift from annual audit to live monitoring

The response to the failure modes described above is not to make the annual audit more rigorous. A more rigorous point-in-time audit does not solve the drift problem, the emergent capability problem, or the tool-use expansion problem, because all three of these can change the system's risk profile between audits without triggering any scheduled review.

The structural response is to treat certification as a continuous state rather than a periodic event. This requires three operational components working in parallel.

The first is a behavioural baseline established at certification time. When a certification assessment is completed, the assessor and operator jointly establish a quantitative behavioural profile of the system: output distribution statistics across a defined evaluation suite, tool-use frequency and pattern data, refusal rates for a set of standardised adversarial probes, and latency and coherence metrics. This baseline is the reference against which ongoing production telemetry is compared.
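A behavioural baseline of this kind can be captured as a simple summary structure. The schema below is hypothetical: the metric names follow the list in the paragraph above (refusal rate, tool-use frequency, latency), but the run-record format is an assumption, not a specified interchange format.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class BehaviouralBaseline:
    """Quantitative profile captured at certification time (illustrative schema)."""
    refusal_rate: float               # refusals / adversarial probes
    tool_call_freq: dict[str, float]  # tool name -> mean calls per task
    mean_latency_ms: float

def build_baseline(runs: list[dict]) -> BehaviouralBaseline:
    """Summarise a certification evaluation into a baseline.

    Each run record is assumed to look like:
      {"refused": bool, "tools": ["search", ...], "latency_ms": float}
    """
    if not runs:
        raise ValueError("baseline requires at least one evaluation run")
    n = len(runs)
    tool_counts = Counter(t for r in runs for t in r["tools"])
    return BehaviouralBaseline(
        refusal_rate=sum(r["refused"] for r in runs) / n,
        tool_call_freq={t: c / n for t, c in tool_counts.items()},
        mean_latency_ms=sum(r["latency_ms"] for r in runs) / n,
    )
```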

The second is production monitoring with defined drift thresholds. The operator runs continuous monitoring against the behavioural baseline using production telemetry. When output distributions shift beyond a defined threshold, when refusal rates change significantly, or when tool-use patterns diverge from the certified profile, a drift alert is triggered. The agent's certification status moves to "monitoring alert" pending investigation.
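A minimal sketch of that drift check, assuming a baseline and a telemetry window that each carry a refusal rate and the set of observed tool names. Both the schema and the 15-percentage-point default are illustrative; as discussed later, real thresholds are calibrated at certification time.

```python
def certification_status(baseline: dict, window: dict,
                         refusal_delta: float = 0.15) -> str:
    """Return 'certified' or 'monitoring alert' for one telemetry window.

    `baseline` and `window` are assumed to be dicts with a `refusal_rate`
    float and a `tools` set. Thresholds are illustrative defaults.
    """
    # Refusal-rate drift beyond the calibrated threshold.
    if abs(window["refusal_rate"] - baseline["refusal_rate"]) > refusal_delta:
        return "monitoring alert"
    # Tool-use pattern divergence: any tool not seen at certification.
    if set(window["tools"]) - set(baseline["tools"]):
        return "monitoring alert"
    return "certified"
```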

The third is a formal change control process for re-certification triggers. Rather than relying on scheduled reviews, the certification is tied to event triggers. The four principal triggers are detailed in the section below. When a trigger fires, the operator initiates a re-certification process against the affected dimensions before the agent returns to certified status.

This model mirrors the post-market monitoring obligations in EU AI Act Article 72, which requires providers of high-risk AI systems to maintain "a post-market monitoring system that actively and systematically collects, documents and analyses relevant data" throughout the system's lifecycle. Continuous certification is the technical implementation of that statutory requirement at the agent level.

Where ISO/IEC 42001 and the EU AI Act fit into this picture

ISO/IEC 42001:2023 and Regulation (EU) 2024/1689 (the EU AI Act) provide the regulatory and standards context within which agent certification operates. Neither instrument resolves the specific technical challenges of certifying generative agents, but both establish obligations that continuous certification must satisfy.

ISO/IEC 42001 is a management system standard. Its Clause 9 on performance evaluation and Clause 10 on improvement provide the organisational scaffolding for a continuous monitoring regime: defined metrics, internal audit cycles, management review, and corrective action processes. The standard does not specify what those metrics should be for a generative agent or how drift thresholds should be calibrated. That is the system-level gap that an agent-specific framework must fill.

The EU AI Act is most relevant through three articles for operators of high-risk generative agents. Article 9 requires a continuous risk management system that is "updated and reviewed regularly." Article 12 requires logging of events sufficient to enable post-incident investigation, which for a generative agent means capturing the reasoning trace, tool calls, and output at a sufficient level of detail. Article 72 requires post-market monitoring throughout the system's operational lifecycle. These obligations align with the continuous certification model but do not specify the technical implementation. The Act's implementing acts and the forthcoming harmonised standards under CEN-CENELEC JTC 21 will add specificity over time, but operators deploying agents before those standards are finalised need a working methodology now.

NIST AI 600-1 (Generative AI Profile, July 2024) is the most technically specific of the three instruments for this problem domain. It identifies twelve primary risk categories for generative AI systems, including confabulation, data privacy, human-AI configuration failure, and value chain and component integration. These categories map closely to the four agentic failure modes described in this article. The NIST AI RMF Govern, Map, Measure, and Manage functions provide a risk management lifecycle that is compatible with the continuous certification model, with the Measure and Manage functions corresponding to production monitoring and re-certification trigger processes respectively.

The Agent Certified methodology's approach

The Agent Certified methodology addresses the specific challenges of generative agent certification through its seven-dimension framework, with the most direct treatment in Dimension 6 (AI Integration) and Dimension 7 (Autonomy Envelope).

Dimension 6 evaluates how responsibly the agent sits inside existing systems of record, identity, approval, and escalation. For a generative agent, this includes assessing whether model updates trigger a review process, whether tool additions go through a documented change control workflow, and whether the agent's outputs are logged at a level of granularity sufficient for drift detection and incident investigation. An agent that uses a third-party model API without any monitoring for provider-side changes scores significantly lower on Dimension 6 than one that implements active baseline comparison and drift alerting. The weight of Dimension 6 in the overall score (12 points) reflects that integration quality is a leading indicator of certification durability: a well-integrated agent is one where drift and capability changes will be detected before they cause harm.

Dimension 7 evaluates the Autonomy Envelope: the documented set of conditions under which the agent may act without human confirmation, the triggers that cause escalation, and the monitoring that confirms the agent is operating within its defined boundaries. For a generative agent, the autonomy envelope is inherently probabilistic rather than categorical. The agent will occasionally take actions that fall outside its intended scope not because of a discrete policy failure but because its probabilistic reasoning produces an unexpected output. Dimension 7 requires applicants to document the envelope boundaries, the escalation conditions, and the monitoring regime that detects boundary violations in production. The weight of Dimension 7 (14 points in the framework) reflects that autonomy boundary failures are the primary source of consequential harm from deployed agents.
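The probabilistic framing of the envelope can be expressed as a rate check rather than a hard invariant. The sketch below is illustrative, not the methodology's specified mechanism; the action names and the tolerance parameter are assumptions, and a tolerance of 0.0 recovers the strict never-violate rule.

```python
def envelope_check(actions: list[str], allowed: set[str],
                   tolerance: float = 0.0) -> dict:
    """Check a window of agent actions against the documented envelope.

    Because the agent is probabilistic, the check is framed as a violation
    rate compared to a tolerance, rather than a single hard invariant.
    """
    violations = [a for a in actions if a not in allowed]
    rate = len(violations) / len(actions) if actions else 0.0
    return {
        "violation_rate": rate,
        "violations": violations,       # retained for incident investigation
        "escalate": rate > tolerance,   # triggers the documented escalation path
    }
```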

The seven-dimension framework as a whole establishes the assessment surface. The continuous monitoring requirements under Dimensions 6 and 7 establish the certification maintenance mechanism. Together, they provide a structure where certification is not a credential issued once but a state that is maintained through ongoing operational discipline and tested against defined re-certification triggers.

Re-certification triggers: what should prompt a reassessment

Defining re-certification triggers precisely is the most operationally demanding part of continuous certification. Triggers that are too sensitive produce assessment fatigue. Triggers that are too coarse allow meaningful risk changes to pass unreviewed. The following table sets out the four principal trigger classes and the assessment scope they activate.

Trigger class: Model update
  Threshold for activation: Any major version change to the underlying model; provider disclosure of infrastructure change affecting output behaviour; detected drift above threshold in baseline comparison
  Dimensions requiring re-assessment: 1 (Trust and Safety), 6 (AI Integration), 7 (Autonomy Envelope)
  Urgency: Before returning to certified status

Trigger class: New tool or API addition
  Threshold for activation: Any addition to the agent's accessible action space, including read access to new data sources
  Dimensions requiring re-assessment: 3 (Distribution Control), 6 (AI Integration), 7 (Autonomy Envelope)
  Urgency: Before enabling in production

Trigger class: New user population
  Threshold for activation: Deployment to a user group with materially different characteristics (sector, language, technical sophistication, risk profile)
  Dimensions requiring re-assessment: 1 (Trust and Safety), 2 (Context Integrity), 7 (Autonomy Envelope)
  Urgency: Before or within 30 days of deployment

Trigger class: Behavioural drift detection
  Threshold for activation: Production monitoring detects output distribution shift above defined threshold, refusal rate change greater than 15% from baseline, or novel tool-use patterns not present at certification
  Dimensions requiring re-assessment: All dimensions relevant to the drifted behaviour; full re-assessment if drift is diffuse
  Urgency: Within 14 days of detection

The threshold for behavioural drift detection (15% refusal rate change as an example) is illustrative. The actual threshold should be established at the time of initial certification based on the agent's deployment context and risk profile. An agent deployed in a high-risk context under EU AI Act Annex III categories should use tighter thresholds than one deployed in a lower-risk environment. The thresholds are part of the certification documentation and are reviewed as part of each re-certification cycle.

Organisations planning their re-certification processes should note that the scope of a re-certification is normally narrower than a full initial assessment. If the trigger is a new tool addition that affects only Dimensions 3, 6, and 7, the re-certification does not require re-assessment of Dimensions 1, 2, 4, and 5 if those are unaffected. Scoped re-certification is faster and less resource-intensive than full re-certification, which is why the trigger framework specifies which dimensions are activated by each event class.
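The scoped-re-certification logic follows directly from the trigger table: union the dimensions activated by whichever triggers fired. A Python sketch, with trigger identifiers as assumed names and diffuse behavioural drift conservatively mapped to a full re-assessment:

```python
# Trigger class -> dimensions it re-opens, per the trigger table.
# A full assessment covers dimensions 1 through 7.
TRIGGER_DIMENSIONS = {
    "model_update": {1, 6, 7},
    "new_tool": {3, 6, 7},
    "new_user_population": {1, 2, 7},
    "behavioural_drift": set(range(1, 8)),  # conservative: treat drift as diffuse
}

def recert_scope(fired_triggers: list[str]) -> set[int]:
    """Union the dimensions activated by all fired triggers, so a scoped
    re-certification covers exactly what changed and nothing more."""
    scope: set[int] = set()
    for trigger in fired_triggers:
        scope |= TRIGGER_DIMENSIONS[trigger]
    return scope
```

For example, a new tool addition alone re-opens only Dimensions 3, 6, and 7, while a concurrent model update widens the scope to include Dimension 1.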

For operators managing multiple concurrent agent deployments, the trigger framework should be embedded in the change management process. Tool addition approvals and model version selection decisions should automatically generate a re-certification flag as part of the approval workflow, rather than requiring manual identification of re-certification obligations after the fact.
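Embedding the trigger framework in change management can be as simple as a hook in the approval workflow that enqueues a re-certification flag whenever a trigger-class change is approved. The change-type names and queue shape below are illustrative assumptions, not a prescribed integration.

```python
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    agent_id: str
    change_type: str  # e.g. "tool_addition", "model_version" (illustrative names)

# Change types that fire a re-certification trigger, mapped to trigger class.
TRIGGERING_CHANGES = {
    "tool_addition": "new_tool",
    "model_version": "model_update",
}

def approve(change: ChangeRequest,
            recert_queue: list[tuple[str, str]]) -> None:
    """Approve a change and automatically enqueue a re-certification flag
    for trigger-class changes -- no after-the-fact manual identification."""
    trigger = TRIGGERING_CHANGES.get(change.change_type)
    if trigger is not None:
        recert_queue.append((change.agent_id, trigger))
```

The design point is that the flag is generated by the same workflow step that approves the change, so a tool addition cannot reach production with its re-certification obligation unrecorded.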