Future Proof The Authority Stack
European AI Agent Certification Standard
Methodology v2.0 · 24 April 2026
Published by Future Proof Intelligence
Document: Methodology specification
Version: 2.0
Published: 24 April 2026
Last reviewed: 24 April 2026
Next review: 1 October 2026
Status: In force
License: CC-BY 4.0 (methodology text)
Supersedes: v1.0 (17 April 2026)

1. Purpose and Scope

Why certification exists for AI agents

The deployment of autonomous AI agents into production environments is accelerating across European organisations. Agents now write and execute code, transact on behalf of businesses, process personal data at scale, advise on medical and financial matters, and manage operational workflows with minimal human review. The speed of deployment has outpaced the governance frameworks designed to contain the associated risks.

Three structural gaps have emerged. First, insurers writing coverage for AI-related liability have no consistent signal by which to assess the risk posture of an agent deployment. Policies are written on proxy indicators rather than direct evidence of controls. Second, regulators and supervisory authorities reviewing deployer compliance under instruments such as the EU AI Act (Regulation (EU) 2024/1689) require structured artefacts to which they can attach their assessments. Point-in-time compliance declarations are insufficient for systems whose behaviour changes continuously. Third, boards, procurement functions, and counterparties need a readable, defensible benchmark that allows them to make risk-informed decisions about the agents they rely on or procure, without themselves having the technical capacity to perform that assessment.

Agent Certified exists to close these gaps. The methodology produces a scored, evidence-backed certification that any of these parties can read, cite, and rely on.

What this methodology covers

This methodology applies to autonomous AI agents and agentic systems deployed in production by professional operators. For the purposes of this framework, an autonomous AI agent is defined as a system that:

  • uses a large language model or equivalent generative model as its primary reasoning component;
  • is granted access to one or more external tools, APIs, data sources, or execution environments;
  • takes actions or produces outputs that have real-world consequences in an organisational or commercial context; and
  • operates with at least partial autonomy, meaning that not every action requires explicit human approval before execution.

The methodology covers agentic systems regardless of their classification under the EU AI Act. It applies equally to agents that meet the Act's definition of a high-risk AI system and to agents operating below that threshold. An operator seeking to demonstrate responsible deployment of a general-purpose agentic tool falls within scope as much as one deploying a system classified as high-risk under Annex III of the Act.

What this methodology does not cover

The following are outside the scope of this methodology:

  • One-off, non-deployed AI models evaluated purely for research purposes;
  • AI systems that produce outputs but take no actions and have no tool access;
  • General-purpose LLM chat interfaces where no autonomous action or retrieval pipeline is involved;
  • AI systems operated entirely by their providers, where the deploying organisation has no meaningful ability to configure safety controls;
  • AI systems embedded in regulated medical devices, where the primary regulatory framework is the EU Medical Device Regulation (EU) 2017/745. Such systems may use this methodology as a supplementary governance layer but must treat the MDR as primary.

Editorial principles

The methodology is governed by four editorial principles. Independence: no carrier, vendor, model provider, or operator pays for placement, score adjustment, or favourable framing. Transparency: the methodology is published openly and in full. Reproducibility: two assessors working from the same evidence should produce scores within one point of each other on any dimension. Fairness: every scored party receives a copy of their evidence file and the right of reply before any public disclosure.


2. The Seven Dimensions

Every agent is evaluated across seven dimensions. Each dimension is scored on a scale from 0 to 10. Each carries a weight reflecting its contribution to overall operational safety and regulatory exposure. The seven dimensions and their weights are fixed for the duration of a major version. Changes to weights require a new major version with public consultation.

Dimension 01

Trust & Safety

Weight 18 · Highest

Definition

Trust and Safety measures how effectively unsafe, unauthorised, or harmful actions by the agent are prevented in production, and the discipline with which unsafe outputs are detected, contained, and remediated. It is the highest-weighted dimension because a failure here is categorically different from a failure in any other dimension: it can cause direct harm to end users, third parties, or the operator's own systems before any other control has the opportunity to intervene.

Why it matters

The EU AI Act imposes human oversight (Article 14) and accuracy and robustness (Article 15) requirements on high-risk AI systems, alongside the risk management and data governance obligations of Articles 9 and 10; deployers of such systems must operate within these controls under Article 26. For systems outside the high-risk classification, the revised EU Product Liability Directive (Directive (EU) 2024/2853) creates liability exposure where unsafe outputs cause damage. The NIST AI RMF Manage function directly addresses this dimension through its harm response and corrective action requirements. ISO/IEC 42001 Clause 8.4 requires operators to implement controls proportionate to identified risk. Insurance carriers assessing coverage terms for AI liability treat the presence and quality of guardrails as a primary underwriting variable.

Scoring rubric

Score 0

No safety consideration present. The agent acts up to its technical capability with no documented limit. No detection, no containment, no playbook.

Score 3

Safety is delegated entirely to model defaults. No organisation-level guardrails, no red team exercise conducted, no misuse monitoring beyond provider telemetry.

Score 6

Guardrails documented and deployed. A red team exercise has been conducted at least once and findings are tracked. Detection is ad hoc. A containment playbook exists but has not been drilled.

Score 8

Guardrails tested on a quarterly cadence. Detection is continuous and alerts to an on-call function. Kill switch is verified and accessible to non-engineering staff. Incident response has been drilled within the prior six months.

Score 10

Layered guardrails covering prompt injection, jailbreak, data exfiltration, and unsafe tool use. Continuous adversarial red teaming with findings linked to a remediation register. Real-time misuse detection tied to automatic containment. An incident record demonstrating at least one real, contained safety event with documented retrospective.

Evidence required

  • Documented guardrail specification, including scope of coverage and update cadence
  • Red team exercise report with dated findings and remediation evidence
  • Monitoring dashboard or alert configuration demonstrating real-time misuse detection
  • Incident response playbook with named roles and verified kill switch access
  • Incident log or confirmation that no material safety event has occurred, with methodology for determining materiality
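As a minimal illustration of an organisation-level guardrail layer distinct from model-default content filters, the following sketch runs heuristic checks before any tool call. The injection patterns and tool names are invented examples, not values prescribed by this methodology:

```python
# Illustrative pre-execution guardrail layer. Patterns and the tool
# allowlist are placeholder examples only.
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.I),
    re.compile(r"reveal your system prompt", re.I),
]
ALLOWED_TOOLS = {"search_kb", "create_ticket"}  # hypothetical tool names

def guardrail_check(user_input: str, requested_tool: str) -> tuple[bool, str]:
    """Layered checks run before execution: injection heuristics first,
    then the tool allowlist. Returns (allowed, reason). A production
    deployment would add model-based classifiers, data-exfiltration
    checks, and alerting into the incident pipeline."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(user_input):
            return (False, "possible prompt injection")
    if requested_tool not in ALLOWED_TOOLS:
        return (False, "tool not in allowlist")
    return (True, "ok")
```

A real guardrail specification would document the coverage scope and update cadence of each layer, per the evidence requirements above.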

Common failure modes

  • Treating model-level content filters as equivalent to organisation-level guardrails
  • Red team exercises conducted once at launch with no subsequent cadence
  • Kill switches that require engineering access to activate, rendering them inaccessible in time-critical events
  • No distinction between safety incidents and product defects in the incident log

Underpinning standards

NIST AI RMF 1.0: Govern 1.1, Manage 4.1, Manage 4.2. ISO/IEC 42001:2023: Clause 8.3 (AI risk treatment), Clause 8.4 (AI system impact assessment). EU AI Act: Article 9 (Risk management system), Article 14 (Human oversight), Article 15 (Accuracy, robustness and cybersecurity). EIOPA Supervisory Statement on the use of AI by insurers (2024): Section 4.3.

Dimension 02

Context Integrity

Weight 14

Definition

Context Integrity measures the quality of the information the agent reasons over. An agent's output is only as reliable as its inputs. This dimension covers provenance, freshness, lineage, and the controls that prevent poisoned, stale, or unauthorised data from entering the agent's working memory or retrieval pipeline.

Why it matters

Agents that operate on unverified or incorrectly attributed information produce outputs that carry hidden liability. Data provenance obligations appear in EU AI Act Article 10 (data and data governance) and apply by extension to retrieval-augmented generation pipelines in which the agent reasons over external documents. Indirect prompt injection through retrieved content is now a recognised attack surface, catalogued by OWASP and referenced in the NIST AI RMF Govern function. ISO/IEC 42001 Annex A.7 addresses data management requirements for AI systems. Insurers writing AI liability policies must assess whether an agent's factual claims are traceable; they cannot price unexplained hallucination risk.

Scoring rubric

Score 0

Context sources are unknown to the operator. No inventory, no provenance, no staleness detection.

Score 3

Sources are implicit or informally understood. No provenance metadata. Staleness is undetected. User-supplied content reaches the agent without validation.

Score 6

Sources catalogued in a documented inventory. Refresh is manual and periodic. Provenance metadata exists for primary sources but not all retrievals. Basic input validation in place for user-supplied content.

Score 8

Full provenance on all retrieved documents. Automated freshness checks with staleness alerts. Input validation covers external and user-supplied content. Data lineage is documented from source to agent decision.

Score 10

End-to-end data lineage from source to decision trace, verified by automated tooling. Tested resistance to prompt injection through retrieved content. Automatic quarantining of unverified or untrusted sources. Source integrity monitoring in production.

Evidence required

  • Documented inventory of all knowledge sources and retrieval surfaces used by the agent
  • Provenance metadata specification or schema applied to retrieved documents
  • Refresh cadence policy and evidence of staleness alerting in production
  • Input validation configuration covering user-supplied content
  • Data lineage diagram from source to reasoning output, with version control

Common failure modes

  • Retrieval pipelines where document origin is not recorded, preventing attribution of erroneous outputs
  • Live data sources with no freshness checks, leading to the agent presenting outdated information as current
  • User-supplied context passed directly into retrieval or reasoning without sanitisation, enabling indirect injection
  • No distinction between primary sources and synthesised summaries in the agent's context window

Underpinning standards

NIST AI RMF 1.0: Map 1.6, Map 3.1, Govern 6.1. ISO/IEC 42001:2023: Annex A.7 (Data for AI systems). EU AI Act: Article 10 (Data and data governance). OWASP Top 10 for LLM Applications: LLM01 (Prompt Injection), LLM02 (Insecure Output Handling), LLM06 (Sensitive Information Disclosure).

Dimension 03

Distribution Control

Weight 12

Definition

Distribution Control covers the controls that determine who can invoke the agent, under what authority, and how its downstream actions are bounded. It is the dimension where identity, authorisation, and blast radius meet. A weak Distribution Control posture means that the agent can be called by parties who should not have access to it, with a scope of action larger than those parties are entitled to authorise.

Why it matters

EU AI Act Article 26 places obligations on deployers, the organisations that put AI systems into use. Those obligations include ensuring that the agent operates within its intended scope and that access is proportionate to the authorisation structure of the deployer organisation. Guidance from the European Data Protection Board on AI reinforces the GDPR principle of data minimisation, which in an agent context translates directly to blast radius limitation. ISO/IEC 42001 Annex A.9 addresses access controls for AI systems. From an insurance standpoint, an undifferentiated invocation posture is equivalent to leaving the keys in the ignition: the underwriter cannot assess the probable maximum loss without knowing who can call the agent and what they can make it do.

Scoring rubric

Score 0

Open invocation. No authentication. No per-caller limits. Any party with network access can call the agent with full capability.

Score 3

Basic authentication is in place but credentials are shared. No per-caller rate limits. Environments are not segregated.

Score 6

Authenticated calls with individual credentials. Basic rate limits in place. Development and production environments are separated. Blast radius is informally understood.

Score 8

Role-based authorisation tied to the organisation's identity provider. Per-caller spend caps and tool quotas enforced. Blast radius for every tool is documented and measured. Environment segregation is enforced by infrastructure controls, not only by convention.

Score 10

Zero trust invocation model. Real-time quota enforcement. Blast radius for every tool has been tested through controlled chaos exercises. Per-call audit trails written to an immutable log. Principle of least privilege enforced at the tool level, not only at the invocation level.

Evidence required

  • Identity and access management configuration for agent invocation, including credential type and rotation policy
  • Role-based access control policy with mapping to organisational roles
  • Rate limit and spend cap configuration per caller type
  • Environment segregation architecture diagram
  • Blast radius assessment for each tool the agent can call, with maximum impact quantified
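Per-caller rate limits and spend caps of the kind listed above can be sketched as a small quota guard. The cap values and class shape are assumptions for illustration:

```python
# Illustrative per-caller quota enforcement: invocation counts and spend
# are tracked against fixed caps so no caller exceeds its entitled scope.

class QuotaExceeded(Exception):
    """Raised when a call would breach a per-caller limit."""

class CallerQuota:
    def __init__(self, max_calls: int, max_spend_eur: float):
        self.max_calls = max_calls
        self.max_spend_eur = max_spend_eur
        self.calls = 0
        self.spend_eur = 0.0

    def authorise(self, estimated_cost_eur: float) -> None:
        """Check and record one invocation; reject before execution if
        either the rate limit or the spend cap would be breached."""
        if self.calls + 1 > self.max_calls:
            raise QuotaExceeded("per-caller rate limit reached")
        if self.spend_eur + estimated_cost_eur > self.max_spend_eur:
            raise QuotaExceeded("per-caller spend cap reached")
        self.calls += 1
        self.spend_eur += estimated_cost_eur
```

Rejecting before execution, rather than alerting after, is what makes the limit a containment control rather than a reporting metric.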

Common failure modes

  • API keys shared across multiple callers, preventing attribution of misuse to an individual
  • Development agent instances with production-level tool access
  • Blast radius assessed only in theory, never tested against actual tool behaviour under adversarial conditions
  • Rate limits set so high as to be meaningless for cost and abuse containment

Underpinning standards

NIST AI RMF 1.0: Govern 1.7, Map 2.2. ISO/IEC 42001:2023: Annex A.9 (Use of AI systems). EU AI Act: Article 26 (Obligations of deployers). NIST SP 800-207 (Zero Trust Architecture): Section 2.1 (Zero trust tenets applied to AI system access).

Dimension 04

Product Maturity

Weight 14

Definition

Product Maturity measures the degree to which the agent behaves as a production-grade system rather than a prototype. It covers reliability, regression discipline, evaluation coverage, and the engineering practices that keep behaviour predictable over time. An agent that scores highly on Product Maturity is one whose operator can state, with evidence, what the agent does, how reliably it does it, and how changes to the system are managed.

Why it matters

EU AI Act Article 15 requires high-risk AI systems to achieve appropriate levels of accuracy, robustness, and cybersecurity. For agents below the high-risk threshold, the same standard applies as a practical matter wherever counterparties, insurers, or boards are asked to rely on agent outputs. ISO/IEC 42001 Clause 9 requires operators to evaluate performance against defined criteria. From an insurance standpoint, the absence of versioning or regression testing means that a coverage decision made at bind may not hold for the system the insured is actually running three months later, a risk that underwriters increasingly price into premiums or exclude.

Scoring rubric

Score 0

Prototype. No versioning. No uptime measurement. No evaluation suite. Behaviour is not monitored between deployments.

Score 3

Version control is in place for code but not for prompts or model configuration. Uptime is informally tracked. No formal regression suite.

Score 6

Prompts and model versions are versioned under change control. Uptime is measured. A partial regression evaluation suite is run on major changes. A change log is maintained, though updated reactively rather than proactively.

Score 8

Published service level objectives. Regression evaluation required on every change. Observability at the reasoning trace level, not only at the response level. Behaviour change log communicated proactively to stakeholders.

Score 10

SLOs enforced with automated alerting. Canary or staged deployment for all changes. Evaluation coverage reviewed and updated quarterly. Full drift detection across all output dimensions, with automatic flagging of distributional shifts.

Evidence required

  • Version history for prompts, model configuration, and tool definitions, with change attribution
  • Uptime and latency metrics for the prior 90 days, with SLO definition
  • Regression evaluation suite specification, including test case count, coverage rationale, and run frequency
  • Observability configuration demonstrating trace-level monitoring
  • Behaviour change log with dated entries covering the prior 12 months

Common failure modes

  • Prompt versioning absent despite code versioning being in place, meaning that the most consequential configuration variable is uncontrolled
  • Regression suites that test only happy-path scenarios, missing edge and adversarial cases
  • Observability limited to response-level logging, making it impossible to diagnose reasoning errors in production
  • Model upgrades applied automatically by provider without triggering an operator-side regression run

Underpinning standards

NIST AI RMF 1.0: Measure 1.1, Measure 2.5, Manage 2.2. ISO/IEC 42001:2023: Clause 9.1 (Monitoring, measurement, analysis and evaluation), Annex A.6 (AI system lifecycle). EU AI Act: Article 15 (Accuracy, robustness and cybersecurity), Article 12 (Record-keeping). IEEE 7000-2021 (IEEE Standard Model Process for Addressing Ethical Concerns during System Design): Section 5.5 (Continuous verification).

Dimension 05

Governance

Weight 16 · Second highest

Definition

Governance is the institutional scaffolding around the agent. It is the evidence that the agent is known to the board, owned by a named accountable senior role, policed by documented policy, and logged in a way that will survive an audit. Strong governance does not make an agent safe in itself, but it ensures that when something goes wrong, the right people know immediately, the right authorities are empowered to act, and a complete record exists to support investigation, remediation, and regulatory response.

Why it matters

EU AI Act Article 9 mandates risk management systems for high-risk AI. Article 26 sets out the operational obligations of deployers, including assigning human oversight to competent natural persons, and Article 49 requires registration of certain high-risk AI systems in the EU database. Article 12 mandates automatic logging for systems that meet specified conditions. The Network and Information Security Directive 2 (NIS2, Directive (EU) 2022/2555) requires senior management accountability for cybersecurity risk, a principle that regulators are extending to AI systems through supervisory guidance. ISO/IEC 42001 Clause 5 (Leadership) and Clause 6 (Planning) directly address governance requirements. EIOPA's 2024 supervisory statement specifies that insurers must be able to demonstrate board-level oversight of AI in their operations, a requirement that extends to any agent an insurer deploys or relies on.

Scoring rubric

Score 0

No formal ownership of the agent. No policy referencing it. Board unaware of its operating scope or associated risk.

Score 3

Agent is known to management but not to the board. No named owner. No risk register entry. Audit trail is incomplete.

Score 6

Named senior owner with documented accountability. AI risk policy exists and references the agent or agentic systems category. Board has been informed at least annually. Audit trail is partial but consistent.

Score 8

Risk register entry with current rating and documented mitigations, reviewed at least twice yearly. Board review on a defined cadence. Supplier and model due diligence documented and current. Audit trail meets sector retention requirements.

Score 10

AI governance embedded in enterprise risk management, not treated as a separate workstream. Board-level review on a quarterly cadence with written minutes. Independent assurance already performed by an external party. Full audit trail, retained, tested, and accessible to regulators on request.

Evidence required

  • Named accountable owner, role title, and scope of accountability documented
  • AI risk policy referencing agentic systems, with effective date and review history
  • Board or risk committee minutes referencing the agent or AI risk category, within the prior 12 months
  • Risk register extract showing the agent as an active entry with current rating and mitigations
  • Vendor and model supplier due diligence records, current within 12 months
  • Audit trail specification, including retention period, access controls, and evidence of log completeness

Common failure modes

  • Accountability assigned to the team that built the agent, rather than to a senior business owner who can be held responsible for operating outcomes
  • AI risk policy written at a generic level that does not address autonomous agent behaviour specifically
  • Board minutes that acknowledge AI as a topic without documenting what the board reviewed or decided
  • Audit trails that capture outputs but not reasoning steps, making root cause analysis impossible

Underpinning standards

NIST AI RMF 1.0: Govern 1.1, Govern 1.2, Govern 4.1, Govern 5.1. ISO/IEC 42001:2023: Clause 5 (Leadership), Clause 6 (Planning), Annex A.2 (Policies for AI). EU AI Act: Article 9 (Risk management system), Article 12 (Record-keeping), Article 26 (Obligations of deployers of high-risk AI systems). NIS2 Directive (EU) 2022/2555: Article 20 (Governance). EIOPA Supervisory Statement on the use of AI by insurers (2024): Section 5.

Dimension 06

AI Integration

Weight 12

Definition

AI Integration measures how the agent sits inside the organisation's existing systems of record, identity, approval, and escalation. Integration maturity determines whether the agent extends institutional memory or bypasses it. An agent with strong integration is indistinguishable, from an audit perspective, from a trusted internal operator: its actions are attributed, its escalations follow the right channels, and its outputs enter systems of record with full provenance.

Why it matters

The legal and regulatory risk from agent actions scales with the degree to which those actions are traceable to a responsible party. An agent that writes to systems of record under a shared service account makes attribution of errors or harmful actions impossible after the fact. EU AI Act Article 26 requires deployers to ensure that natural persons using AI systems are informed and able to oversee outputs. The emerging European AI Liability Directive (proposed 2022/0303/COD) introduces a disclosure-of-evidence mechanism that presupposes that evidence exists: an unintegrated agent leaves no evidence trail. ISO/IEC 42001 Annex A.8 addresses human oversight mechanisms, which in practice require that escalation paths are defined and functional within the organisational structure, not merely in a policy document.

Scoring rubric

Score 0

The agent operates entirely in parallel to core systems. Writes are ad hoc or to shadow systems. Identity is collapsed into a single service account. No escalation path exists.

Score 3

Some outputs reach systems of record but without attribution. Escalation paths exist informally. The agent's log is separate from the organisation's observability stack.

Score 6

Partial integration. Most writes to systems of record are attributed. Approval flows exist but contain bypass paths. Escalations route to a named team, not generic inboxes. Logs are partially centralised.

Score 8

Integration follows the organisational authority chain. Identity is propagated end to end, not collapsed. Escalations route to named individual reviewers. Logs are written to the centralised observability stack.

Score 10

Full integration. The agent is indistinguishable from a trusted internal operator in audit. Every write, approval, escalation, and decision is attributed, timestamped, and co-located with the organisation's existing records. Tested through a simulated audit drill within the prior 12 months.

Evidence required

  • Integration architecture diagram showing how the agent connects to systems of record, identity provider, and approval workflow
  • Evidence that writes to systems of record carry agent identity, timestamp, and action provenance
  • Escalation routing configuration with named reviewers and verified notification paths
  • Log centralisation evidence showing agent logs co-located with organisational observability stack
  • Sample audit trail extract from a real agent-initiated action showing end-to-end attribution
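The attribution requirements above (agent identity, timestamp, and action provenance on every write to a system of record) can be sketched as a minimal audit record. The field names and serialisation choice are assumptions for illustration, not part of the methodology:

```python
# Illustrative attributed-write record: identity is propagated end to end
# rather than collapsed into a shared service account.
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AttributedWrite:
    """One agent-initiated write to a system of record. Field names are
    an assumption for this sketch."""
    action_id: str
    agent_id: str          # which agent instance/version acted
    on_behalf_of: str      # originating human or system principal
    system_of_record: str  # destination system
    action: str            # what was done
    timestamp_utc: str     # ISO 8601, for co-location with existing records

def to_log_line(write: AttributedWrite) -> str:
    """Serialise for the centralised observability stack (append-only)."""
    return json.dumps(asdict(write), sort_keys=True)
```

Keeping `on_behalf_of` distinct from `agent_id` is what prevents the identity collapse described in the failure modes below.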

Common failure modes

  • Agent actions written to a proprietary log that is inaccessible to the compliance or risk function
  • Escalation paths that route to generic email addresses, producing no named accountability
  • Identity propagation that collapses at the API boundary, meaning all downstream systems see a single service identity rather than the originating user
  • Approval workflows that contain an unconditional bypass for the agent, defeating the control

Underpinning standards

NIST AI RMF 1.0: Govern 6.2, Measure 4.1, Manage 3.1. ISO/IEC 42001:2023: Annex A.8 (AI system impact assessment review), Annex A.9 (Use of AI systems). EU AI Act: Article 14 (Human oversight), Article 26 (Obligations of deployers). Proposed AI Liability Directive (2022/0303/COD): Article 3 (Disclosure of evidence).

Dimension 07

Autonomy Envelope

Weight 14 · Critical

Definition

The Autonomy Envelope is the explicit, documented boundary between what the agent may do without human confirmation and what requires a human in the loop. It is the single clearest determinant of the agent's operational risk profile and the first element that insurers and regulators examine when assessing an agent's fitness for deployment. An agent without a defined Autonomy Envelope has no meaningful risk boundary: its scope of action is limited only by its technical capability and its access to tools.

Why it matters

EU AI Act Article 14 requires deployers of high-risk AI systems to implement human oversight measures proportionate to the risk of the system. The requirement is not satisfied by a general statement that humans can override the agent; it requires documented thresholds, accessible controls, and evidence that oversight is exercised in practice. For agents outside the high-risk classification, Article 26 still requires deployers to maintain oversight where appropriate. The NIST AI RMF Manage function specifically addresses the definition of human oversight thresholds. From an insurance standpoint, the Autonomy Envelope is the closest available proxy for probable maximum loss: a tightly defined envelope limits the financial damage the agent can cause unilaterally, and underwriters price coverage accordingly.

Scoring rubric

Score 0

No envelope defined. The agent acts up to its technical capability without any documented limit. Autonomy is assumed unless a technical error prevents action.

Score 3

An informal understanding exists of what the agent should not do autonomously, but it is not documented, enforced, or reviewable by non-technical stakeholders.

Score 6

Autonomy policy is written and referenced in governance documentation. Some thresholds are enforced in code. Revocation requires engineering access. Rollback of agent-initiated actions is technically possible but untested.

Score 8

Envelope is enforced in code, not only in policy. Non-technical revocation is accessible to named roles within two minutes. Rollback has been tested for at least one action class. Hard stops for action classes that are never delegated are documented with rationale.

Score 10

The Autonomy Envelope is the operating contract. Reviewed and formally signed off on a quarterly cadence. Tied to insurance policy wording. Every action class that the agent can take is classified as either fully autonomous, threshold-gated with human confirmation, or permanently prohibited. Classification rationale is documented and version-controlled.

Evidence required

  • Written Autonomy Envelope policy classifying every action class the agent can take as fully autonomous, threshold-gated, or permanently prohibited
  • Technical enforcement configuration demonstrating that thresholds are enforced in code
  • Revocation procedure accessible to named non-technical roles, with verified access timing
  • Rollback procedure and evidence of at least one test of that procedure against a real action class
  • Hard stop register listing permanently prohibited actions with documented rationale
  • Evidence of quarterly review: dated sign-off from named accountable owner
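The three-way classification above (fully autonomous, threshold-gated, permanently prohibited) can be sketched as an enforceable envelope. The action names and the EUR threshold are invented examples; only the classification scheme itself comes from this methodology:

```python
# Illustrative Autonomy Envelope enforced in code, not only in policy.
from enum import Enum

class ActionClass(Enum):
    AUTONOMOUS = "fully autonomous"
    GATED = "threshold-gated, human confirmation required"
    PROHIBITED = "permanently prohibited"

# Hypothetical envelope: (classification, optional EUR threshold).
ENVELOPE = {
    "send_status_email": (ActionClass.AUTONOMOUS, None),
    "issue_refund": (ActionClass.GATED, 250.0),
    "delete_customer_record": (ActionClass.PROHIBITED, None),
}

def requires_human(action: str, amount_eur: float = 0.0) -> bool:
    """True when a human must confirm before execution. Prohibited and
    unclassified actions raise: there is no default autonomy."""
    if action not in ENVELOPE:
        raise PermissionError(f"unclassified action: {action}")
    cls, threshold = ENVELOPE[action]
    if cls is ActionClass.PROHIBITED:
        raise PermissionError(f"hard stop: {action}")
    if cls is ActionClass.GATED:
        return threshold is None or amount_eur >= threshold
    return False
```

Raising on unclassified actions implements the principle that autonomy is never assumed, addressing the failure mode of envelopes left outdated when new tools are added.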

Common failure modes

  • Autonomy policy written at a high level of abstraction that does not map to actual tool capabilities, making it unenforceable
  • Human-in-the-loop thresholds set at levels of convenience rather than impact, allowing consequential actions to proceed without review
  • Revocation documented but requiring engineering access, making it inaccessible during an incident occurring outside business hours
  • No distinction between action classes that should never be delegated and those that may be delegated with appropriate controls
  • Envelope reviewed at launch but not subsequently, leaving it outdated when new tools are added to the agent

Underpinning standards

NIST AI RMF 1.0: Govern 1.4, Govern 1.5, Manage 1.3, Manage 4.1. ISO/IEC 42001:2023: Annex A.8 (Human oversight of AI systems). EU AI Act: Article 14 (Human oversight measures), Article 26 (Obligations of deployers). Council of Europe Framework Convention on AI (CETS No. 225): Article 14 (Safeguards in the context of interactions with AI systems).


3. Scoring and Weighting

Each of the seven dimensions is scored on a 0 to 10 integer scale. Scores of 0 are reserved for cases where no controls or evidence exist. Scores of 10 require positive evidence of best-practice implementation and tested effectiveness. Assessors must record the evidence cited for each score alongside the score itself.

Dimension weights

Weights reflect the relative contribution of each dimension to overall operational risk exposure. They are fixed for the duration of a major version and are not adjusted for sector, agent type, or operator size. Adjustments for sector or autonomy level are applied at the normalisation stage, not the weighting stage.

Dimension                   Weight   Max weighted score
D1 Trust & Safety             18       180
D2 Context Integrity          14       140
D3 Distribution Control       12       120
D4 Product Maturity           14       140
D5 Governance                 16       160
D6 AI Integration             12       120
D7 Autonomy Envelope          14       140
Total                        100     1,000

Score formula

The weighted raw score for each dimension is: raw score (0 to 10) multiplied by the dimension weight. These are summed to produce a total weighted raw score. The maximum achievable total weighted raw score is 1,000 (a score of 10 on every dimension multiplied by the total weight of 100). The overall score is normalised to a 100-point scale:

Overall Score = ( Σ_{d=1..7} ( raw_d × weight_d ) / 1,000 ) × 100
where raw_d ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and weight_d is the published dimension weight.
The Overall Score is rounded to the nearest integer.
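As a minimal sketch of the formula (the dictionary keys and the function name are illustrative, not part of the methodology), the normalisation can be expressed as:

```python
# Published dimension weights for methodology v2.0 (they sum to 100).
WEIGHTS = {"D1": 18, "D2": 14, "D3": 12, "D4": 14, "D5": 16, "D6": 12, "D7": 14}

def overall_score(raw: dict[str, int]) -> int:
    """Normalise the weighted raw scores (max 1,000) to the 100-point scale."""
    if any(not (0 <= raw[d] <= 10) for d in WEIGHTS):
        raise ValueError("each raw score must be an integer from 0 to 10")
    total = sum(raw[d] * WEIGHTS[d] for d in WEIGHTS)  # max 10 × 100 = 1,000
    return round(total / 1000 * 100)
```

A submission scoring 10 on every dimension yields 100; a uniform 5 yields 50.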
Floor rule

A certification tier also sets a minimum raw score on every individual dimension. Certified tier requires a raw score of 4 or above on every dimension. Advanced tier requires 6 or above. Elite tier requires 8 or above. An agent that achieves a high overall score but scores below the floor on any single dimension is capped at the tier immediately below the one its overall score would indicate, until the floor is met. This rule prevents an agent from being certified at a tier that overstates its actual control posture in any assessed area.
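One way to read the floor rule is to demote the agent one tier at a time until every dimension meets the floor of the tier it lands in. The sketch below takes that reading; the names, the list structure, and the floors of 0 for the two lowest tiers are our assumptions, not methodology text.

```python
# (tier name, minimum overall score, per-dimension floor), highest tier first.
TIERS = [
    ("Elite", 75, 8),
    ("Advanced", 55, 6),
    ("Certified", 35, 4),
    ("In Progress", 20, 0),
    ("Pre-Assessment", 0, 0),
]

def assign_tier(overall: int, raw_scores: list[int]) -> str:
    # Start at the tier the overall score indicates ...
    idx = next(i for i, (_, band_min, _) in enumerate(TIERS) if overall >= band_min)
    # ... then apply the floor rule: drop a tier while any dimension is below the floor.
    while idx < len(TIERS) - 1 and min(raw_scores) < TIERS[idx][2]:
        idx += 1
    return TIERS[idx][0]
```

Under this reading, an overall score of 80 with one dimension at a raw 5 lands at Certified rather than Elite, because the 5 is below both the Elite floor of 8 and the Advanced floor of 6.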

Worked examples

Example A

Customer service chatbot with structured escalation

D1 Trust & Safety        6 × 18 = 108
D2 Context Integrity     5 × 14 = 70
D3 Distribution Control  7 × 12 = 84
D4 Product Maturity      6 × 14 = 84
D5 Governance            7 × 16 = 112
D6 AI Integration        6 × 12 = 72
D7 Autonomy Envelope     7 × 14 = 98
Total 628 / 1,000 × 100 = 62.8 → 63. The overall score falls in the Advanced band, but D2 (5) is below the Advanced floor of 6, so the floor rule caps the agent at Certified until D2 is remediated.
Example B

Financial recommendation agent with limited oversight

D1 Trust & Safety        4 × 18 = 72
D2 Context Integrity     6 × 14 = 84
D3 Distribution Control  5 × 12 = 60
D4 Product Maturity      7 × 14 = 98
D5 Governance            5 × 16 = 80
D6 AI Integration        4 × 12 = 48
D7 Autonomy Envelope     4 × 14 = 56
Total 498 / 1,000 × 100 = 49.8 → 50 (Certified)
Example C

Fully autonomous trading agent, investment-grade controls

D1 Trust & Safety        9 × 18 = 162
D2 Context Integrity     9 × 14 = 126
D3 Distribution Control  10 × 12 = 120
D4 Product Maturity      9 × 14 = 126
D5 Governance            10 × 16 = 160
D6 AI Integration        9 × 12 = 108
D7 Autonomy Envelope     9 × 14 = 126
Total 928 / 1,000 × 100 = 92.8 → 93 (Elite)
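The worked totals above can be checked mechanically. This sketch reproduces only the arithmetic (no tier logic); the variable names are ours:

```python
# D1..D7 weights and the raw scores from worked examples A, B, and C.
WEIGHTS = [18, 14, 12, 14, 16, 12, 14]

EXAMPLES = {
    "A": [6, 5, 7, 6, 7, 6, 7],
    "B": [4, 6, 5, 7, 5, 4, 4],
    "C": [9, 9, 10, 9, 10, 9, 9],
}

for name, raws in EXAMPLES.items():
    total = sum(r * w for r, w in zip(raws, WEIGHTS))
    print(f"Example {name}: {total} / 1,000 -> overall {round(total / 1000 * 100)}")
```

This prints totals of 628, 498, and 928, normalising to overall scores of 63, 50, and 93, matching the tables above.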

4. Certification Levels

Every scored agent is placed into one of five certification tiers. Tiers are determined by the overall score, subject to the floor rule defined in Section 3. The tier communicates the agent's risk posture to the parties relying on it, using a shared vocabulary that does not require the reader to understand the underlying scoring mechanics.

Tier 1

Elite — Score 75 to 100

Overall score: 75 or above. Per-dimension floor: 8 on every dimension.

Elite certification signals an agent deployment operating at or near the current frontier of responsible practice. Controls are not only present and documented but tested and effective, with evidence from real production operation. Governance is embedded in institutional risk management, not siloed in a technology function. The Autonomy Envelope is formally reviewed on a quarterly cadence.

Who tends to score here: mature technology organisations, financial institutions subject to significant supervisory scrutiny, and operators whose agents take consequential irreversible actions at scale. Sector overlays are most likely to be necessary here, particularly for medical, legal, and financial advice-generating agents.

Re-certification cadence: annual, plus immediate trigger review on any of the events specified in Section 6. The certification mark is valid for 12 months from issuance date.

Insurance market signal: underwriters treat Elite-certified agents as a separate risk category from uncertified deployments, with material implications for premium and coverage scope where such differentiation is available.

Tier 2

Advanced — Score 55 to 74

Overall score: 55 to 74. Per-dimension floor: 6 on every dimension.

Advanced certification signals a deployment with strong foundational controls and a clear trajectory toward Elite. Most dimensions are well managed; the gap between the current score and Elite typically reflects the absence of continuous monitoring, quarterly review cadences, or independent assurance rather than the absence of controls entirely.

Who tends to score here: mid-market technology and professional services operators who have moved beyond prototype deployments and established governance structures, but have not yet invested in continuous evidence generation and independent review. Advanced is the most common tier for operators seeking their first certification.

Re-certification cadence: annual. Operators are encouraged to conduct an interim self-assessment at six months to track progress toward Elite.

Insurance market signal: sufficient for most standard AI liability coverage applications. Some carriers apply more favourable terms to Advanced and Elite over Certified for high-autonomy deployments.

Tier 3

Certified — Score 35 to 54

Overall score: 35 to 54. Per-dimension floor: 4 on every dimension.

Certified tier signals that the operator has established the essential controls and governance foundations. Policy exists, ownership is assigned, and the agent operates within a defined scope. Gaps at this tier typically relate to continuous monitoring, regression discipline, and the maturity of the Autonomy Envelope rather than the absence of foundational controls.

Who tends to score here: operators who have recently formalised their agent deployment, moved from informal prototype to structured production system, or who are operating simpler agents with lower autonomy profiles. Certified is the entry threshold for the right to carry the Agent Certified mark.

Re-certification cadence: annual. Operators scoring in the lower half of this band are advised to plan for an interim review at six months.

Insurance market signal: meets baseline requirements for coverage applications at most carriers. Some carriers require supplementary questionnaires for Certified-tier deployments above specified autonomy levels.

Tier 4

In Progress — Score 20 to 34

Overall score: 20 to 34. The Agent Certified mark is not awarded at this tier.

In Progress reflects a deployment where foundational work has begun but essential controls in one or more dimensions are still absent or insufficiently evidenced. The scored report identifies specifically which dimensions are below the Certified floor and what evidence or action would move them above it.

Who tends to score here: operators who have made a genuine start on governance and technical controls but are deploying a more complex or higher-autonomy agent than their current maturity level can fully support. In Progress is not a failure; it is an accurate picture of where the operator stands and a roadmap for what comes next.

Path to Certified: the scored report provides a prioritised remediation plan. Most In Progress operators reach Certified tier within three to six months of directed remediation effort. The highest-leverage dimensions at this tier are typically D5 Governance and D7 Autonomy Envelope.

Tier 5

Pre-Assessment — Score below 20

Overall score: below 20. No certification mark issued.

Pre-Assessment reflects deployments where the evidence base for a full assessment is not yet in place. This may indicate an early-stage deployment, an agent being moved from research to production, or an operator who has not yet established the governance and technical foundations the methodology requires.

A Pre-Assessment outcome is accompanied by a gap analysis identifying the minimum steps required to reach In Progress. Pre-Assessment operators are encouraged to treat the methodology as a design checklist rather than a retrospective audit, incorporating controls from the earliest stage of deployment.

Insurance market signal: most AI liability coverage applications require at least Certified tier. Pre-Assessment deployments may face significant exclusions or coverage unavailability for autonomous-action risks.


5. The Assessment Process

The standard assessment follows a six-week process from initial engagement to issuance of the scored report. The process combines operator self-disclosure with independent review. Certifications issued on the basis of self-disclosure alone are not valid under this methodology.

Phase 1: Self-assessment (Weeks 1 and 2)

The operator completes the Agent Certified readiness questionnaire. The questionnaire maps directly to the criteria listed under each dimension in Section 2. For each criterion, the operator provides a yes/partial/no response and identifies the specific document, configuration artefact, or operational evidence that supports the response. Operators are advised not to provide self-assessment scores; scoring is performed by the assessor based on the evidence provided.

Evidence submitted at this phase must be current, meaning produced within the prior 12 months or, for continuously updated artefacts such as risk registers, as of the submission date. Historical evidence may be submitted as supporting context but does not substitute for current evidence.

Phase 2: Evidence collection (Week 3)

The assessor reviews the self-assessment submission and identifies evidence gaps. The operator is notified of gaps within five business days of submission. A supplementary evidence request is issued for any criterion where the submitted evidence is insufficient to assign a score above 3. The operator has five business days to provide supplementary evidence. A second gap request is not issued: if evidence remains insufficient after the supplementary submission, the assessor applies the rubric to the evidence available, which typically results in a score of 0 to 3 for the affected criterion.

Named evidence types required across all dimensions include: policy documents with effective dates and named owners; architecture diagrams with version numbers; configuration exports or screenshots with timestamps; evaluation suite specifications; audit trail extracts; board or risk committee minutes; vendor due diligence records; and incident logs or affirmations of no material incident with methodology for determining materiality.

Phase 3: Independent review (Weeks 4 and 5)

The assessor reviews all submitted evidence against each dimension's rubric. Scores are assigned at the dimension level, not at the criterion level. The assessor records the score, the evidence cited, and any evidence that was noted but not determinative. Where the evidence for a dimension is internally inconsistent, the assessor applies the rubric to the most conservative reading of the evidence and notes the inconsistency in the report.

The operator has the right to review the draft score and the evidence citations before the report is finalised. The operator may submit a factual correction if a score is based on a misreading of the evidence, but may not submit new evidence at this stage. The correction window is three business days.

Phase 4: Issuance of scored report (Week 6)

The scored report sets out: the overall score and certification tier; the dimension-level scores and weights; the evidence cited for each score; any evidence inconsistencies noted; any dimension floors that cap the tier; and a prioritised remediation roadmap if the operator has not achieved the highest tier they sought. The report is issued to the operator and held in the Agent Certified registry. The operator may authorise publication of the tier and overall score; the full scored report requires operator consent for publication.

Phase 5: Continuous monitoring (Ongoing)

Certified operators agree to notify Agent Certified of any event that triggers re-certification under Section 6, within 30 days of that event occurring. Operators at Advanced and Elite tiers commit to quarterly self-attestation confirming that no trigger event has occurred and that the evidence base supporting the last scored report remains materially accurate. Failure to submit a required quarterly attestation suspends the certification until the attestation is received.

  • Week 1 Operator completes readiness questionnaire and submits primary evidence package.
  • Week 2 Assessor reviews submission, identifies gaps, issues supplementary evidence request.
  • Week 3 Operator submits supplementary evidence. Evidence collection closes.
  • Week 4 Assessor completes dimension-level scoring. Internal quality review.
  • Week 5 Draft score shared with operator. Factual correction window (3 business days).
  • Week 6 Final scored report issued. Certification mark released (if Certified or above).

6. Re-Certification and Drift

AI agent certification is not a point-in-time event that remains valid indefinitely. The behaviour of an agent changes continuously: through model updates, prompt revisions, tool additions, changes to retrieval pipelines, and drift in the distribution of inputs. A certification issued against a snapshot of the agent's configuration and controls may not accurately reflect the agent's risk posture six months later. This section defines the triggers that require re-certification and the monitoring requirements that allow operators to maintain the accuracy of their certification between scheduled reviews.

Triggers requiring re-certification

Any of the following events require the operator to notify Agent Certified and initiate a re-certification assessment within 90 days of the triggering event:

  • A change of underlying model, including a provider-initiated version update where the operator has the technical ability to remain on the previous version and chooses not to
  • The addition of a new tool, API, or external integration that expands the agent's action surface or data access
  • A new data source or knowledge base connected to the agent's retrieval pipeline
  • An expansion of the user or caller population that materially changes the agent's exposure profile
  • A change in the agent's deployment context that materially alters the scope or consequence of its actions
  • A change in the regulatory framework that directly affects any of the seven dimensions, where the operator and assessor jointly determine that re-assessment is necessary
  • A material safety, data, or operational incident attributable to the agent
  • A change of named senior accountable owner where governance continuity cannot be demonstrated
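The notification and re-assessment windows that follow a trigger event are simple date offsets. A minimal sketch, assuming the 30-day notification window from Phase 5 and the 90-day re-assessment requirement above (the constant and function names are illustrative):

```python
from datetime import date, timedelta

NOTIFY_WINDOW = timedelta(days=30)    # Phase 5: notify Agent Certified within 30 days
REASSESS_WINDOW = timedelta(days=90)  # Section 6: initiate re-certification within 90 days

def trigger_deadlines(event_date: date) -> dict[str, date]:
    """Deadlines that follow from a re-certification trigger event."""
    return {
        "notify_by": event_date + NOTIFY_WINDOW,
        "reassess_by": event_date + REASSESS_WINDOW,
    }
```

For example, a model change on 1 January 2026 would require notification by 31 January 2026 and an initiated re-assessment by 1 April 2026.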

Continuous monitoring requirements

Between scheduled re-certifications, operators are required to maintain the evidence base that supports their certification. In practice this means:

  • Ongoing logging at the level specified in the scored report, with log retention meeting sector requirements
  • Red team or adversarial testing conducted at least once per quarter for Advanced and Elite tiers; at least once every six months for Certified tier
  • Regression evaluation run on every material change to prompts, model configuration, or tool definitions
  • Quarterly review of the Autonomy Envelope with documented sign-off from the named accountable owner
  • Annual vendor and model supplier due diligence refresh
  • Quarterly self-attestation submitted to Agent Certified for Advanced and Elite tiers

Point-in-time certification and continuous compliance

This methodology produces a point-in-time score based on evidence available at the time of assessment. The continuous monitoring requirements extend the value of that score by ensuring that the underlying controls remain in place between assessments. The relationship between point-in-time certification and continuous compliance is analogous to the relationship between an annual financial audit and the internal controls that the audited entity operates year-round: the audit provides an independent validation; the controls provide the ongoing assurance.

Insurers and regulators relying on an Agent Certified score should note the date of the scored report and the quarterly self-attestation record when evaluating whether the score remains current. A score issued more than 12 months ago without a subsequent re-certification is not considered current for the purposes of this methodology, regardless of the operator's assertions about the stability of their deployment.
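The 12-month currency rule can be sketched as an anniversary check. The function name and the assumption that the anniversary date itself still counts as current are ours, not methodology text:

```python
from datetime import date

def score_is_current(report_date: date, as_of: date) -> bool:
    """True while the scored report is within 12 months of issuance."""
    try:
        anniversary = report_date.replace(year=report_date.year + 1)
    except ValueError:  # report issued on 29 February of a leap year
        anniversary = report_date.replace(year=report_date.year + 1, day=28)
    return as_of <= anniversary
```

A report issued on 24 April 2026 would, under this reading, cease to be current on 25 April 2027 absent re-certification.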


7. Crosswalk to Standards

The seven dimensions of the Agent Certified methodology are grounded in, and consistent with, the primary international standards and regulatory instruments that govern AI system governance and risk management. The crosswalk below maps each dimension to the relevant function or function group in the NIST AI RMF 1.0, the relevant control area in ISO/IEC 42001:2023, and the relevant article or set of articles in the EU AI Act (Regulation (EU) 2024/1689). Where the AI Underwriting Consortium framework (AIUC-1) provides a relevant area, this is also noted.

Dimension NIST AI RMF 1.0 ISO/IEC 42001:2023 EU AI Act AIUC-1
D1 Trust & Safety Govern 1.1, Manage 4.1, Manage 4.2 Clause 8.4, Annex A.6 Art. 9 (Risk management), Art. 14 (Human oversight), Art. 15 (Accuracy) Safety and Robustness (SR-1 through SR-4)
D2 Context Integrity Map 1.6, Map 3.1, Govern 6.1 Annex A.7 Art. 10 (Data governance) Data and Input Quality (DIQ-1 through DIQ-3)
D3 Distribution Control Govern 1.7, Map 2.2 Annex A.9 Art. 26 (Deployer obligations) Access and Scope Control (ASC-1, ASC-2)
D4 Product Maturity Measure 1.1, Measure 2.5, Manage 2.2 Clause 9.1, Annex A.6 Art. 15 (Accuracy, robustness), Art. 12 (Record-keeping) Operational Reliability (OR-1 through OR-3)
D5 Governance Govern 1.1, Govern 1.2, Govern 4.1, Govern 5.1 Clause 5 (Leadership), Clause 6 (Planning), Annex A.2 Art. 9 (Risk management), Art. 12 (Record-keeping), Art. 27 (Deployer obligations, high-risk) Governance and Accountability (GA-1 through GA-5)
D6 AI Integration Govern 6.2, Measure 4.1, Manage 3.1 Annex A.8, Annex A.9 Art. 14 (Human oversight), Art. 26 (Deployer obligations) System Integration and Attribution (SIA-1, SIA-2)
D7 Autonomy Envelope Govern 1.4, Govern 1.5, Manage 1.3, Manage 4.1 Annex A.8 Art. 14 (Human oversight measures), Art. 26 (Deployer obligations) Autonomy and Override Controls (AOC-1 through AOC-4)
Note on AIUC-1

The AI Underwriting Consortium framework (AIUC-1) is an industry reference developed by AI liability underwriters for use in risk assessment for AI-related insurance coverage. It is not a formal standards body publication. References to AIUC-1 areas reflect the terminology in use in the insurance market as of Q1 2026 and are subject to revision as the consortium updates its framework. The NIST AI RMF and ISO/IEC 42001 mappings are the primary standards references.


8. Editorial Firewall and Independence

The credibility of the Agent Certified methodology depends entirely on the independence of the assessments it produces. The following policies are in force for the duration of this methodology version and may only be amended through a major version revision with public consultation.

No pay-for-placement

No carrier, vendor, model provider, technology partner, or operator pays for placement, score adjustment, or favourable framing in any Agent Certified assessment or publication. The methodology and scoring rubric are applied consistently to all assessed agents regardless of the operator's commercial relationship with Future Proof Intelligence or any affiliated entity.

Open methodology

The complete methodology, including all scoring rubrics, dimension weights, evidence requirements, and certification level definitions, is published openly at agentcertified.eu/methodology-v2.html and licensed under CC-BY 4.0. Any party may use, reproduce, or build upon the methodology text with attribution. The certification mark itself is reserved and may not be used without a current, valid assessment issued by Agent Certified.

Public consultation on revisions

Major version revisions to the methodology, including changes to dimension weights, certification level thresholds, or assessment process requirements, are subject to a 30-day public consultation period before taking effect. Minor version revisions in response to regulatory or market developments that do not alter the fundamental scoring structure may take effect without consultation but are published with a 14-day notice period.

Conflicts of interest

An assessor with a current or prior commercial relationship with an operator under assessment must declare that relationship before the assessment commences. If the relationship is material, the assessment is assigned to a different assessor. Agent Certified maintains an internal register of declared conflicts, reviewed quarterly by the editorial board.

Right of reply

Any organisation whose agent has been scored has the right of reply before any public disclosure of its score. The reply window is three business days from receipt of the draft scored report. A factual correction accepted by the assessor results in a revised score. A factual correction rejected by the assessor is noted in the final report alongside the assessor's reasoning. Disagreements about methodology interpretation, as distinct from factual errors, do not constitute grounds for score adjustment under the right of reply but may be submitted as feedback to inform future version revisions.


9. Versioning and Update Cadence

Version history

This is version 2.0 of the Agent Certified methodology, published on 24 April 2026. Version 1.0 was published on 17 April 2026.

Major and minor versions

Major versions of the methodology are designated by the integer before the decimal (v1, v2, v3). A major version change indicates a revision to dimension weights, certification level thresholds, the assessment process, or any other element that materially affects the comparability of scores produced under different major versions. Major versions are issued on an annual review cycle, with the review process commencing in the preceding quarter and including a 30-day public consultation period.

Minor versions are designated by the digit after the decimal (v2.1, v2.2). Minor versions reflect quarterly updates in response to regulatory or market developments, clarifications to rubric language, or additions to the evidence requirements where the underlying scoring criteria are unchanged. Minor versions do not require public consultation but are published with a 14-day notice period.

Changelog: v1.0 to v2.0

The following changes were made from version 1.0 to version 2.0:

  • Extended dimension write-ups. Each dimension now includes an explicit statement of why it matters, with specific references to regulatory and insurance market obligations. v1.0 addressed assessed criteria and rubrics only.
  • Score formula clarification. The formula in v1.0 referenced a denominator of 720, reflecting an early scoring model. v2.0 corrects this to 1,000, consistent with a 0-to-10 per-dimension scale multiplied by weights totalling 100. Scores produced under v1.0 should be treated as indicative only and reassessed under v2.0.
  • Floor rule formalised. The per-dimension floor rule was implicit in v1.0 practice. v2.0 states it explicitly with defined floor scores per tier (4 for Certified, 6 for Advanced, 8 for Elite).
  • Assessment process timeline. v2.0 introduces the six-week assessment timeline and formalises the supplementary evidence request process and operator correction window. These were unspecified in v1.0.
  • Re-certification triggers. v1.0 noted quarterly review without defining triggers. v2.0 enumerates eight specific trigger events and states the 90-day notification and re-assessment requirement.
  • Standards crosswalk table. Added in v2.0. v1.0 included a reference table but did not map to NIST AI RMF functions or ISO/IEC 42001 clause level.
  • Editorial firewall section. Added in v2.0 to formalise independence policies that were in practice but unwritten in v1.0.
  • Worked examples. Three scored examples added in v2.0 to illustrate the application of the formula across different agent types.

Scheduled review

The next scheduled major version review for v3.0 commences in Q3 2026, with public consultation in August and September 2026 and publication targeted for 1 October 2026. Areas identified for examination in the v3.0 review include: the potential introduction of sector-specific overlays for medical device, financial advice, and legal advice agents; the maturation of the Autonomy Envelope rubric in light of incident data from certified deployments; and the alignment of the dimension weights with emerging supervisory guidance from the EU AI Office, expected in H2 2026.


10. Limitations and Acknowledged Gaps

Qualitative assessment and assessor judgement

This methodology sets out fixed rubrics and evidence requirements with the aim of maximising reproducibility. Nevertheless, several scoring decisions require assessor judgement that cannot be fully specified in advance. The determination of whether a guardrail is genuinely effective (D1) or whether an Autonomy Envelope classification is reasonable given the agent's action surface (D7) involves a degree of expert interpretation. The methodology mitigates this through internal quality review of scored reports before issuance, but two assessors working from the same evidence may differ by one point on a dimension where judgement is required. Operators who believe a score is the product of an unreasonable interpretation may invoke the right of reply process described in Section 8.

Sector-specific gaps

The current methodology does not include sector-specific overlays. An agent advising on medical treatment plans carries risks that differ qualitatively from an agent summarising legal precedents or recommending financial products, even if both agents score identically across the seven dimensions. The v3.0 review will examine whether sector-specific overlays are necessary and, if so, how they interact with the base scoring framework. Until sector overlays are available, operators of agents in regulated sectors should treat the Agent Certified score as a necessary but not sufficient assessment of their compliance posture and consult sector-specific regulatory guidance directly.

Multi-agent and pipeline architectures

The methodology is designed for assessment of a single agent or agentic system. Where multiple agents operate in a pipeline or orchestrated architecture, each agent may be assessed independently, but the methodology does not currently address the emergent risks arising from agent-to-agent interaction, including trust propagation, instruction injection between agents in a pipeline, and the distribution of accountability across agents with different control owners. This is an acknowledged gap that the v3.0 review will address.

Model-level versus deployment-level controls

The methodology assesses deployment-level controls implemented by the operator. It does not assess the underlying model or the adequacy of provider-level safety measures. An agent deployed on a model that has passed provider-level safety evaluations may still score poorly on D1 if the operator has not implemented organisation-level guardrails. Conversely, a strong D1 score does not imply any judgement on the adequacy of provider-level controls. Parties seeking an assessment of the underlying model should consult model evaluation frameworks such as METR's task difficulty evaluations or the EU AI Office's model evaluation methodology under the AI Act.

Plans for v3.0

The v3.0 review will specifically address: sector-specific overlays for medical device, financial advice, and legal advice agents; multi-agent pipeline assessment methodology; revised Autonomy Envelope rubric anchors reflecting incident data from certified deployments; and alignment with any binding technical standards adopted under the EU AI Act by the European Artificial Intelligence Board.


11. Citation and Use

How to cite this methodology

Suggested citation for academic, legal, regulatory, and professional use:

Suggested citation

Future Proof Intelligence. (2026). The Agent Certified Methodology: A Published Framework for AI Agent Certification, 2026 Edition. Version 2.0. Agent Certified (agentcertified.eu). Published 24 April 2026. Available at: https://agentcertified.eu/methodology-v2.html

For in-text reference in legal or regulatory filings, the following short-form citation is acceptable: Agent Certified Methodology v2.0 (Future Proof Intelligence, 24 April 2026).

Use licence

The text of this methodology is published under a Creative Commons Attribution 4.0 International licence (CC-BY 4.0). Any party may reproduce, adapt, translate, or build upon the methodology text, including for commercial purposes, provided that appropriate attribution is given to Agent Certified and Future Proof Intelligence, a link to the licence is provided, and any modifications are indicated. The CC-BY 4.0 licence does not extend to the Agent Certified certification mark, the Agent Certified registry, or any scored report produced by Agent Certified. These are reserved and may not be reproduced without written consent.

Inquiries

Methodology inquiries, including questions about the application of the rubric, requests to submit feedback for the v3.0 review, and press inquiries about the framework: methodology@agentcertified.eu

Assessment and certification inquiries: assessments@agentcertified.eu

Registry and certification mark inquiries: registry@agentcertified.eu

The Regulatory Context

Read the law that this methodology operates under.

The Agent Certified framework is calibrated to Article 26 of the EU AI Act, the revised Product Liability Directive, and the supervisory expectations of EIOPA and the AI Office. Agent Liability EU is the operator desk on those instruments.

agentliability.eu