This document sets out the complete methodology by which Agent Certified evaluates autonomous AI agents. It defines the seven assessment dimensions, their weights and scoring rubrics, the five certification tiers, the assessment process and timeline, the triggers for re-certification, and a crosswalk to the underpinning international standards. It is the reference document for assessors, operators, insurers, regulators, and any party seeking to understand how a certification score was produced and what it means.
The deployment of autonomous AI agents into production environments is accelerating across European organisations. Agents now write and execute code, transact on behalf of businesses, process personal data at scale, advise on medical and financial matters, and manage operational workflows with minimal human review. The speed of deployment has outpaced the governance frameworks designed to contain the associated risks.
Three structural gaps have emerged. First, insurers writing coverage for AI-related liability have no consistent signal by which to assess the risk posture of an agent deployment. Policies are written on proxy indicators rather than direct evidence of controls. Second, regulators and supervisory authorities reviewing deployer compliance under instruments such as the EU AI Act (Regulation (EU) 2024/1689) require structured artefacts to which they can attach their assessments. Point-in-time compliance declarations are insufficient for systems whose behaviour changes continuously. Third, boards, procurement functions, and counterparties need a readable, defensible benchmark that allows them to make risk-informed decisions about the agents they rely on or procure, without themselves having the technical capacity to perform that assessment.
Agent Certified exists to close these gaps. The methodology produces a scored, evidence-backed certification that any of these parties can read, cite, and rely on.
This methodology applies to autonomous AI agents and agentic systems deployed in production by professional operators. For the purposes of this framework, an autonomous AI agent is defined as a system that:
The methodology covers agentic systems regardless of their classification under the EU AI Act. It applies equally to agents that meet the Act's definition of a high-risk AI system and to agents operating below that threshold. An operator seeking to demonstrate responsible deployment of a general-purpose agentic tool falls within scope as much as one deploying a system classified as high-risk under Annex III of the Act.
The following are outside the scope of this methodology:
The methodology is governed by four editorial principles. Independence: no carrier, vendor, model provider, or operator pays for placement, score adjustment, or favourable framing. Transparency: the methodology is published openly and in full. Reproducibility: two assessors working from the same evidence should produce scores within one point of each other on any dimension. Fairness: every scored party receives a copy of their evidence file and the right of reply before any public disclosure.
Every agent is evaluated across seven dimensions. Each dimension is scored on a scale from 0 to 10. Each carries a weight reflecting its contribution to overall operational safety and regulatory exposure. The seven dimensions and their weights are fixed for the duration of a major version. Changes to weights require a new major version with public consultation.
Trust and Safety measures the measurable prevention of unsafe, unauthorised, or harmful actions by the agent in production, and the discipline with which unsafe outputs are detected, contained, and remediated. It is the highest-weighted dimension because a failure here is categorically different from a failure in any other dimension: it can cause direct harm to end users, third parties, or the operator's own systems before any other control has the opportunity to intervene.
The EU AI Act imposes human oversight (Article 14) and accuracy and robustness (Article 15) requirements on high-risk AI systems, alongside risk management and data governance obligations on providers under Articles 9 and 10. For systems outside the high-risk classification, the revised EU Product Liability Directive (Directive (EU) 2024/2853) creates liability exposure where unsafe outputs cause damage. The NIST AI RMF Manage function directly addresses this dimension through its harm response and corrective action requirements. ISO/IEC 42001 Clause 8.4 requires operators to implement controls proportionate to identified risk. Insurance carriers assessing coverage terms for AI liability treat the presence and quality of guardrails as a primary underwriting variable.
No safety consideration present. The agent acts up to its technical capability with no documented limit. No detection, no containment, no playbook.
Safety is delegated entirely to model defaults. No organisation-level guardrails, no red team exercise conducted, no misuse monitoring beyond provider telemetry.
Guardrails documented and deployed. A red team exercise has been conducted at least once and findings are tracked. Detection is ad hoc. A containment playbook exists but has not been drilled.
Guardrails tested on a quarterly cadence. Detection is continuous and alerts to an on-call function. Kill switch is verified and accessible to non-engineering staff. Incident response has been drilled within the prior six months.
Layered guardrails covering prompt injection, jailbreak, data exfiltration, and unsafe tool use. Continuous adversarial red teaming with findings linked to a remediation register. Real-time misuse detection tied to automatic containment. An incident record demonstrating at least one real, contained safety event with documented retrospective.
NIST AI RMF 1.0: Govern 1.1, Manage 4.1, Manage 4.2. ISO/IEC 42001:2023: Clause 8.4 (AI system impact assessment), Annex A.6 (Risk treatment). EU AI Act: Article 9 (Risk management system), Article 14 (Human oversight), Article 15 (Accuracy, robustness and cybersecurity). EIOPA Supervisory Statement on the use of AI by insurers (2024): Section 4.3.
Context Integrity measures the quality of the information the agent reasons over. An agent's output is only as reliable as its inputs. This dimension covers provenance, freshness, lineage, and the controls that prevent poisoned, stale, or unauthorised data from entering the agent's working memory or retrieval pipeline.
Agents that operate on unverified or incorrectly attributed information produce outputs that carry hidden liability. Data provenance obligations appear in EU AI Act Article 10 (Training, validation and testing data governance) and apply by extension to retrieval-augmented generation pipelines where the agent reasons over external documents. Indirect prompt injection through retrieved content is now a recognised attack surface cited by OWASP and referenced in the NIST AI RMF Govern function. ISO/IEC 42001 Annex A.7 addresses data management requirements for AI systems. Insurers writing AI liability policies must assess whether an agent's factual claims are traceable; they cannot price unexplained hallucination risk.
Context sources are unknown to the operator. No inventory, no provenance, no staleness detection.
Sources are implicit or informally understood. No provenance metadata. Staleness is undetected. User-supplied content reaches the agent without validation.
Sources catalogued in a documented inventory. Refresh is manual and periodic. Provenance metadata exists for primary sources but not all retrievals. Basic input validation in place for user-supplied content.
Full provenance on all retrieved documents. Automated freshness checks with staleness alerts. Input validation covers external and user-supplied content. Data lineage is documented from source to agent decision.
End-to-end data lineage from source to decision trace, verified by automated tooling. Tested resistance to prompt injection through retrieved content. Automatic quarantining of unverified or untrusted sources. Source integrity monitoring in production.
NIST AI RMF 1.0: Map 1.6, Map 3.1, Govern 6.1. ISO/IEC 42001:2023: Annex A.7 (Data for AI systems). EU AI Act: Article 10 (Data and data governance). OWASP Top 10 for LLM Applications: LLM01 (Prompt Injection), LLM02 (Insecure Output Handling), LLM06 (Sensitive Information Disclosure).
Distribution Control covers the controls that determine who can invoke the agent, under what authority, and how its downstream actions are bounded. It is the dimension where identity, authorisation, and blast radius meet. A weak Distribution Control posture means that the agent can be called by parties who should not have access to it, with a scope of action larger than those parties are entitled to authorise.
EU AI Act Article 26 places deployer obligations on the organisations that make AI systems available to users. Those obligations include ensuring that the agent operates within the scope intended and that access is proportionate to the authorisation structure of the deployer organisation. The revised GDPR enforcement posture under the European Data Protection Board's AI guidelines reinforces the principle of data minimisation, which in an agent context translates directly to blast radius limitation. ISO/IEC 42001 Annex A.9 addresses access controls for AI systems. From an insurance standpoint, an undifferentiated invocation posture is equivalent to leaving the keys in the ignition: the underwriter cannot assess the probable maximum loss without knowing who can call the agent and what they can make it do.
Open invocation. No authentication. No per-caller limits. Any party with network access can call the agent with full capability.
Basic authentication is in place but credentials are shared. No per-caller rate limits. Environments are not segregated.
Authenticated calls with individual credentials. Basic rate limits in place. Development and production environments are separated. Blast radius is informally understood.
Role-based authorisation tied to the organisation's identity provider. Per-caller spend caps and tool quotas enforced. Blast radius for every tool is documented and measured. Environment segregation is enforced by infrastructure controls, not only by convention.
Zero trust invocation model. Real-time quota enforcement. Blast radius for every tool has been tested through controlled chaos exercises. Per-call audit trails written to an immutable log. Principle of least privilege enforced at the tool level, not only at the invocation level.
NIST AI RMF 1.0: Govern 1.7, Map 2.2. ISO/IEC 42001:2023: Annex A.9 (Use of AI systems). EU AI Act: Article 26 (Obligations of deployers). NIST SP 800-207 (Zero Trust Architecture): Section 2.1 (Zero trust tenets applied to AI system access).
Product Maturity measures the degree to which the agent behaves as a production-grade system rather than a prototype. It covers reliability, regression discipline, evaluation coverage, and the engineering practices that keep behaviour predictable over time. An agent that scores highly on Product Maturity is one whose operator can state, with evidence, what the agent does, how reliably it does it, and how changes to the system are managed.
EU AI Act Article 15 requires high-risk AI systems to achieve appropriate levels of accuracy, robustness, and cybersecurity. For agents below the high-risk threshold, the same standard applies as a practical matter wherever counterparties, insurers, or boards are asked to rely on agent outputs. ISO/IEC 42001 Clause 9 requires operators to evaluate performance against defined criteria. From an insurance standpoint, the absence of versioning or regression testing means that a coverage decision made at bind may not hold for the system the insured is actually running three months later, a risk that underwriters increasingly price into premiums or exclude.
Prototype. No versioning. No uptime measurement. No evaluation suite. Behaviour is not monitored between deployments.
Version control is in place for code but not for prompts or model configuration. Uptime is informally tracked. No formal regression suite.
Prompts and model versions are versioned under change control. Uptime is measured. A partial regression evaluation suite is run on major changes. A change log is maintained but reactively rather than proactively.
Published service level objectives. Regression evaluation required on every change. Observability at the reasoning trace level, not only at the response level. Behaviour change log communicated proactively to stakeholders.
SLOs enforced with automated alerting. Canary or staged deployment for all changes. Evaluation coverage reviewed and updated quarterly. Full drift detection across all output dimensions, with automatic flagging of distributional shifts.
NIST AI RMF 1.0: Measure 1.1, Measure 2.5, Manage 2.2. ISO/IEC 42001:2023: Clause 9.1 (Monitoring, measurement, analysis and evaluation), Annex A.6 (AI system lifecycle). EU AI Act: Article 15 (Accuracy, robustness and cybersecurity), Article 12 (Record-keeping). IEEE 7000-2021 (IEEE Standard Model Process for Addressing Ethical Concerns during System Design): Section 5.5 (Continuous verification).
Governance is the institutional scaffolding around the agent. It is the evidence that the agent is known to the board, owned by a named accountable senior role, policed by documented policy, and logged in a way that will survive an audit. Strong governance does not make an agent safe in itself, but it ensures that when something goes wrong, the right people know immediately, the right authorities are empowered to act, and a complete record exists to support investigation, remediation, and regulatory response.
EU AI Act Article 9 mandates risk management systems for high-risk AI. Article 49 requires certain high-risk AI systems to be registered in the EU database, and Article 26 requires deployers to assign oversight to competent natural persons. Article 12 mandates automatic logging for systems that meet specified conditions. The Network and Information Security Directive 2 (NIS2, Directive (EU) 2022/2555) requires senior management accountability for cybersecurity risk, a principle that regulators are extending to AI systems through supervisory guidance. ISO/IEC 42001 Clause 5 (Leadership) and Clause 6 (Planning) directly address governance requirements. EIOPA's 2024 supervisory statement specifies that insurers must be able to demonstrate board-level oversight of AI in their operations, a requirement that extends to any agent an insurer deploys or relies on.
No formal ownership of the agent. No policy referencing it. Board unaware of its operating scope or associated risk.
Agent is known to management but not to the board. No named owner. No risk register entry. Audit trail is incomplete.
Named senior owner with documented accountability. AI risk policy exists and references the agent or agentic systems category. Board has been informed at least annually. Audit trail is partial but consistent.
Risk register entry with current rating and documented mitigations, reviewed at least twice yearly. Board review on a defined cadence. Supplier and model due diligence documented and current. Audit trail meets sector retention requirements.
AI governance embedded in enterprise risk management, not treated as a separate workstream. Board-level review on a quarterly cadence with written minutes. Independent assurance already performed by an external party. Full audit trail, retained, tested, and accessible to regulators on request.
NIST AI RMF 1.0: Govern 1.1, Govern 1.2, Govern 4.1, Govern 5.1. ISO/IEC 42001:2023: Clause 5 (Leadership), Clause 6 (Planning), Annex A.2 (Policies for AI). EU AI Act: Article 9 (Risk management system), Article 12 (Record-keeping), Article 26 (Obligations of deployers of high-risk AI systems). NIS2 Directive (EU) 2022/2555: Article 20 (Governance). EIOPA Supervisory Statement on the use of AI by insurers (2024): Section 5.
AI Integration measures how the agent sits inside the organisation's existing systems of record, identity, approval, and escalation. Integration maturity determines whether the agent extends institutional memory or bypasses it. An agent with strong integration is indistinguishable, from an audit perspective, from a trusted internal operator: its actions are attributed, its escalations follow the right channels, and its outputs enter systems of record with full provenance.
The legal and regulatory risk from agent actions scales with the degree to which those actions are traceable to a responsible party. An agent that writes to systems of record under a shared service account makes attribution of errors or harmful actions impossible after the fact. EU AI Act Article 26 requires deployers to ensure that natural persons using AI systems are informed and able to oversee outputs. The emerging European AI Liability Directive (proposed 2022/0303/COD) introduces a disclosure-of-evidence mechanism that presupposes that evidence exists: an unintegrated agent leaves no evidence trail. ISO/IEC 42001 Annex A.8 addresses human oversight mechanisms, which in practice require that escalation paths are defined and functional within the organisational structure, not merely in a policy document.
The agent operates entirely in parallel to core systems. Writes are ad hoc or to shadow systems. Identity is collapsed into a single service account. No escalation path exists.
Some outputs reach systems of record but without attribution. Escalation paths exist informally. The agent's log is separate from the organisation's observability stack.
Partial integration. Most writes to systems of record are attributed. Approval flows exist but contain bypass paths. Escalations route to a named team, not generic inboxes. Logs are partially centralised.
Integration follows the organisational authority chain. Identity is propagated end to end, not collapsed. Escalations route to named individual reviewers. Logs are written to the centralised observability stack.
Full integration. The agent is indistinguishable from a trusted internal operator in audit. Every write, approval, escalation, and decision is attributed, timestamped, and co-located with the organisation's existing records. Tested through a simulated audit drill within the prior 12 months.
NIST AI RMF 1.0: Govern 6.2, Measure 4.1, Manage 3.1. ISO/IEC 42001:2023: Annex A.8 (AI system impact assessment review), Annex A.9 (Use of AI systems). EU AI Act: Article 14 (Human oversight), Article 26 (Obligations of deployers). Proposed AI Liability Directive (2022/0303/COD): Article 3 (Disclosure of evidence).
The Autonomy Envelope is the explicit, documented boundary between what the agent may do without human confirmation and what requires a human in the loop. It is the single clearest determinant of the agent's operational risk profile and the first element that insurers and regulators examine when assessing an agent's fitness for deployment. An agent without a defined Autonomy Envelope has no meaningful risk boundary: its scope of action is limited only by its technical capability and its access to tools.
EU AI Act Article 14 requires deployers of high-risk AI systems to implement human oversight measures proportionate to the risk of the system. The requirement is not satisfied by a general statement that humans can override the agent; it requires documented thresholds, accessible controls, and evidence that oversight is exercised in practice. For agents outside the high-risk classification, Article 26 still requires deployers to maintain oversight where appropriate. The NIST AI RMF Manage function specifically addresses the definition of human oversight thresholds. From an insurance standpoint, the Autonomy Envelope is the closest available proxy for probable maximum loss: a tightly defined envelope limits the financial damage the agent can cause unilaterally, and underwriters price coverage accordingly.
No envelope defined. The agent acts up to its technical capability without any documented limit. Autonomy is assumed unless a technical error prevents action.
An informal understanding exists of what the agent should not do autonomously, but it is not documented, enforced, or reviewable by non-technical stakeholders.
Autonomy policy is written and referenced in governance documentation. Some thresholds are enforced in code. Revocation requires engineering access. Rollback of agent-initiated actions is technically possible but untested.
Envelope is enforced in code, not only in policy. Non-technical revocation is accessible to named roles within two minutes. Rollback has been tested for at least one action class. Hard stops for action classes that are never delegated are documented with rationale.
The Autonomy Envelope is the operating contract. Reviewed and formally signed off on a quarterly cadence. Tied to insurance policy wording. Every action class that the agent can take is classified as either fully autonomous, threshold-gated with human confirmation, or permanently prohibited. Classification rationale is documented and version-controlled.
NIST AI RMF 1.0: Govern 1.4, Govern 1.5, Manage 1.3, Manage 4.1. ISO/IEC 42001:2023: Annex A.8 (Human oversight of AI systems). EU AI Act: Article 14 (Human oversight measures), Article 26 (Obligations of deployers). Council of Europe Framework Convention on AI (CETS No. 225): Article 14 (Safeguards in the context of interactions with AI systems).
Each of the seven dimensions is scored on a 0 to 10 integer scale. Scores of 0 are reserved for cases where no controls or evidence exist. Scores of 10 require positive evidence of best-practice implementation and tested effectiveness. Assessors must record the evidence cited for each score alongside the score itself.
Weights reflect the relative contribution of each dimension to overall operational risk exposure. They are fixed for the duration of a major version and are not adjusted for sector, agent type, or operator size. Adjustments for sector or autonomy level are applied at the normalisation stage, not the weighting stage.
| Dimension | Weight | Max weighted score |
|---|---|---|
| D1 Trust & Safety | 18 | 180 |
| D2 Context Integrity | 14 | 140 |
| D3 Distribution Control | 12 | 120 |
| D4 Product Maturity | 14 | 140 |
| D5 Governance | 16 | 160 |
| D6 AI Integration | 12 | 120 |
| D7 Autonomy Envelope | 14 | 140 |
| Total | 100 | 1,000 |
The weighted raw score for each dimension is the raw score (0 to 10) multiplied by the dimension weight. These are summed to produce a total weighted raw score; the maximum achievable total is 1,000 (a score of 10 on every dimension multiplied by the total weight of 100). The overall score is normalised to a 100-point scale: overall score = total weighted raw score ÷ 10.
A certification tier also sets a minimum raw score on every individual dimension. Certified tier requires a raw score of 4 or above on every dimension. Advanced tier requires 6 or above. Elite tier requires 8 or above. An agent that achieves a high overall score but scores below the floor on any single dimension is capped at the tier immediately below the one its overall score would indicate, until the floor is met. This rule prevents an agent from being certified at a tier that overstates its actual control posture in any assessed area.
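The normalisation and floor-capping rules above can be sketched in code. This is an illustrative implementation under the weights, tier bands, and floors stated in this section; the dimension keys and function names are hypothetical, not part of the methodology.

```python
# Illustrative sketch of the scoring mechanics described in this section.
# Weights, tier bands, and per-dimension floors are taken from the text;
# all identifiers are hypothetical.

WEIGHTS = {"D1": 18, "D2": 14, "D3": 12, "D4": 14, "D5": 16, "D6": 12, "D7": 14}

# (minimum overall score, per-dimension floor, tier name), highest tier first.
TIERS = [
    (75, 8, "Elite"),
    (55, 6, "Advanced"),
    (35, 4, "Certified"),
    (20, 0, "In Progress"),
    (0, 0, "Pre-Assessment"),
]


def overall_score(raw: dict) -> float:
    """Sum of raw score x weight (max 1,000), normalised to 100 points."""
    return sum(raw[d] * w for d, w in WEIGHTS.items()) / 10


def tier(raw: dict) -> str:
    """Band by overall score, then cap tier by tier until every
    dimension meets the floor of the awarded tier."""
    score = overall_score(raw)
    idx = next(i for i, (band, _, _) in enumerate(TIERS) if score >= band)
    while idx < len(TIERS) - 1 and min(raw.values()) < TIERS[idx][1]:
        idx += 1  # floor not met: drop to the tier immediately below
    return TIERS[idx][2]
```

For instance, raw scores of 6, 5, 7, 6, 7, 6, 7 on D1 through D7 give an overall score of 62.8, which falls in the Advanced band; but the score of 5 on D2 is below the Advanced floor of 6, so the awarded tier is capped at Certified.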
Customer service chatbot with structured escalation
| Dimension | Weighted score |
|---|---|
| D1 Trust & Safety | 6 × 18 = 108 |
| D2 Context Integrity | 5 × 14 = 70 |
| D3 Distribution Control | 7 × 12 = 84 |
| D4 Product Maturity | 6 × 14 = 84 |
| D5 Governance | 7 × 16 = 112 |
| D6 AI Integration | 6 × 12 = 72 |
| D7 Autonomy Envelope | 7 × 14 = 98 |
| Total | 628 |
Financial recommendation agent with limited oversight
| Dimension | Weighted score |
|---|---|
| D1 Trust & Safety | 4 × 18 = 72 |
| D2 Context Integrity | 6 × 14 = 84 |
| D3 Distribution Control | 5 × 12 = 60 |
| D4 Product Maturity | 7 × 14 = 98 |
| D5 Governance | 5 × 16 = 80 |
| D6 AI Integration | 4 × 12 = 48 |
| D7 Autonomy Envelope | 4 × 14 = 56 |
| Total | 498 |
Fully autonomous trading agent, investment-grade controls
| Dimension | Weighted score |
|---|---|
| D1 Trust & Safety | 9 × 18 = 162 |
| D2 Context Integrity | 9 × 14 = 126 |
| D3 Distribution Control | 10 × 12 = 120 |
| D4 Product Maturity | 9 × 14 = 126 |
| D5 Governance | 10 × 16 = 160 |
| D6 AI Integration | 9 × 12 = 108 |
| D7 Autonomy Envelope | 9 × 14 = 126 |
| Total | 928 |
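As a cross-check on the arithmetic in the three worked examples above, the weighted totals can be recomputed directly from the Section 3 weights. This is an illustrative script; the example names are abbreviations of the scenarios above.

```python
# Recompute the weighted totals for the three worked examples,
# using the D1..D7 weights from Section 3 (which sum to 100).
weights = [18, 14, 12, 14, 16, 12, 14]

examples = {
    "Customer service chatbot": [6, 5, 7, 6, 7, 6, 7],
    "Financial recommendation agent": [4, 6, 5, 7, 5, 4, 4],
    "Autonomous trading agent": [9, 9, 10, 9, 10, 9, 9],
}

for name, raw in examples.items():
    total = sum(s * w for s, w in zip(raw, weights))
    print(f"{name}: weighted total {total}, overall score {total / 10}")
```

This yields overall scores of 62.8, 49.8, and 92.8 respectively. Under the tier bands and floors defined in this section, the first example is capped at Certified by the D2 floor despite an Advanced-band overall score, the second lands in the Certified band with all floors met, and the third qualifies for Elite.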
Every scored agent is placed into one of five certification tiers. Tiers are determined by the overall score, subject to the floor rule defined in Section 3. The tier communicates the agent's risk posture to the parties relying on it, using a shared vocabulary that does not require the reader to understand the underlying scoring mechanics.
Overall score: 75 or above. Per-dimension floor: 8 on every dimension.
Elite certification signals an agent deployment operating at or near the current frontier of responsible practice. Controls are not only present and documented but tested and effective, with evidence from real production operation. Governance is embedded in institutional risk management, not siloed in a technology function. The Autonomy Envelope is formally reviewed on a quarterly cadence.
Who tends to score here: mature technology organisations, financial institutions subject to significant supervisory scrutiny, and operators whose agents take consequential irreversible actions at scale. Sector overlays are most likely to be necessary here, particularly for medical, legal, and financial advice-generating agents.
Re-certification cadence: annual, plus immediate trigger review on any of the events specified in Section 6. The certification mark is valid for 12 months from issuance date.
Insurance market signal: underwriters increasingly treat Elite-certified agents as a separate risk category from uncertified deployments, with material implications for premium and coverage scope where such differentiation is available.
Overall score: 55 to 74. Per-dimension floor: 6 on every dimension.
Advanced certification signals a deployment with strong foundational controls and a clear trajectory toward Elite. Most dimensions are well managed; the gap between the current score and Elite typically reflects the absence of continuous monitoring, quarterly review cadences, or independent assurance rather than the absence of controls entirely.
Who tends to score here: mid-market technology and professional services operators who have moved beyond prototype deployments and established governance structures, but have not yet invested in continuous evidence generation and independent review. Advanced is the most common tier for operators seeking their first certification.
Re-certification cadence: annual. Operators are encouraged to conduct an interim self-assessment at six months to track progress toward Elite.
Insurance market signal: sufficient for most standard AI liability coverage applications. Some carriers apply more favourable terms to Advanced and Elite over Certified for high-autonomy deployments.
Overall score: 35 to 54. Per-dimension floor: 4 on every dimension.
Certified tier signals that the operator has established the essential controls and governance foundations. Policy exists, ownership is assigned, and the agent operates within a defined scope. Gaps at this tier typically relate to continuous monitoring, regression discipline, and the maturity of the Autonomy Envelope rather than the absence of foundational controls.
Who tends to score here: operators who have recently formalised their agent deployment, moved from informal prototype to structured production system, or who are operating simpler agents with lower autonomy profiles. Certified is the entry threshold for the right to carry the Agent Certified mark.
Re-certification cadence: annual. Operators scoring in the lower half of this band are advised to plan for an interim review at six months.
Insurance market signal: meets baseline requirements for coverage applications at most carriers. Some carriers require supplementary questionnaires for Certified-tier deployments above specified autonomy levels.
Overall score: 20 to 34. The Agent Certified mark is not awarded at this tier.
In Progress reflects a deployment where foundational work has begun but essential controls in one or more dimensions are still absent or insufficiently evidenced. The scored report identifies specifically which dimensions are below the Certified floor and what evidence or action would move them above it.
Who tends to score here: operators who have made a genuine start on governance and technical controls but are deploying a more complex or higher-autonomy agent than their current maturity level can fully support. In Progress is not a failure; it is an accurate picture of where the operator stands and a roadmap for what comes next.
Path to Certified: the scored report provides a prioritised remediation plan. Most In Progress operators reach Certified tier within three to six months of directed remediation effort. The highest-leverage dimensions at this tier are typically D5 Governance and D7 Autonomy Envelope.
Overall score: below 20. No certification mark issued.
Pre-Assessment reflects deployments where the evidence base for a full assessment is not yet in place. This may indicate an early-stage deployment, an agent being moved from research to production, or an operator who has not yet established the governance and technical foundations the methodology requires.
A Pre-Assessment outcome is accompanied by a gap analysis identifying the minimum steps required to reach In Progress. Pre-Assessment operators are encouraged to treat the methodology as a design checklist rather than a retrospective audit, incorporating controls from the earliest stage of deployment.
Insurance market signal: most AI liability coverage applications require at least Certified tier. Pre-Assessment deployments may face significant exclusions or coverage unavailability for autonomous-action risks.
The standard assessment follows a six-week process from initial engagement to issuance of the scored report. The process combines operator self-disclosure with independent review. Certifications issued on the basis of self-disclosure alone are not valid under this methodology.
The operator completes the Agent Certified readiness questionnaire. The questionnaire maps directly to the criteria listed under each dimension in Section 2. For each criterion, the operator provides a yes/partial/no response and identifies the specific document, configuration artefact, or operational evidence that supports the response. Operators are advised not to provide self-assessment scores; scoring is performed by the assessor based on the evidence provided.
Evidence submitted at this phase must be current, meaning produced within the prior 12 months or, for continuously updated artefacts such as risk registers, as of the submission date. Historical evidence may be submitted as supporting context but does not substitute for current evidence.
The assessor reviews the self-assessment submission and identifies evidence gaps. The operator is notified of gaps within five business days of submission. A supplementary evidence request is issued for any criterion where the submitted evidence is insufficient to assign a score above 3. The operator has five business days to provide supplementary evidence. A second gap request is not issued: if evidence remains insufficient after the supplementary submission, the assessor applies the rubric to the evidence available, which typically results in a score of 0 to 3 for the affected criterion.
Named evidence types required across all dimensions include: policy documents with effective dates and named owners; architecture diagrams with version numbers; configuration exports or screenshots with timestamps; evaluation suite specifications; audit trail extracts; board or risk committee minutes; vendor due diligence records; and incident logs or affirmations of no material incident with methodology for determining materiality.
The assessor reviews all submitted evidence against each dimension's rubric. Scores are assigned at the dimension level, not at the criterion level. The assessor records the score, the evidence cited, and any evidence that was noted but not determinative. Where the evidence for a dimension is internally inconsistent, the assessor applies the rubric to the most conservative reading of the evidence and notes the inconsistency in the report.
The operator has the right to review the draft score and the evidence citations before the report is finalised. The operator may submit a factual correction if a score is based on a misreading of the evidence, but may not submit new evidence at this stage. The correction window is three business days.
The scored report sets out: the overall score and certification tier; the dimension-level scores and weights; the evidence cited for each score; any evidence inconsistencies noted; any dimension floors that cap the tier; and a prioritised remediation roadmap if the operator has not achieved the highest tier they sought. The report is issued to the operator and held in the Agent Certified registry. The operator may authorise publication of the tier and overall score; the full scored report requires operator consent for publication.
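The mechanics described above, a weighted overall score derived from dimension-level scores, with dimension floors capping the achievable tier, can be sketched in code. Note that the weights, tier thresholds, floor value, and tier names below are hypothetical placeholders for illustration only; the authoritative values are those defined in the methodology's scoring sections, not here.

```python
# Illustrative sketch only. WEIGHTS, TIERS, and the floor value are
# HYPOTHETICAL -- the binding figures are defined in the methodology
# itself, not in this example.

WEIGHTS = {  # hypothetical dimension weights, summing to 1.0
    "D1": 0.20, "D2": 0.10, "D3": 0.10, "D4": 0.15,
    "D5": 0.20, "D6": 0.10, "D7": 0.15,
}

TIERS = [  # hypothetical (minimum overall score, tier name), highest first
    (9.0, "Elite"), (7.5, "Advanced"), (6.0, "Certified"),
    (4.0, "Foundation"), (0.0, "Not Certified"),
]

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of dimension-level scores (rubric scale assumed 0-10)."""
    return sum(WEIGHTS[d] * s for d, s in dimension_scores.items())

def tier_for(dimension_scores: dict[str, float], floor: float = 4.0) -> str:
    """Assign a tier from the overall score, then cap it if any single
    dimension falls below the (hypothetical) floor -- mirroring the
    'dimension floors that cap the tier' noted in the scored report."""
    score = overall_score(dimension_scores)
    tier = next(name for cutoff, name in TIERS if score >= cutoff)
    if min(dimension_scores.values()) < floor and tier in ("Elite", "Advanced"):
        tier = "Certified"  # a floor breach caps the achievable tier
    return tier
```

Under these placeholder values, an agent scoring 9 on every dimension would reach Elite, while the same agent with a single dimension at 3 would be capped at Certified despite a strong weighted average, which is the behaviour the floor rule is designed to produce.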
Certified operators agree to notify Agent Certified of any event that triggers re-certification under Section 6, within 30 days of that event occurring. Operators at Advanced and Elite tiers commit to quarterly self-attestation confirming that no trigger event has occurred and that the evidence base supporting the last scored report remains materially accurate. Failure to submit a required quarterly attestation causes the certification to move to a suspended status until the attestation is received.
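The attestation rule above reduces to a simple status machine: a missed quarterly attestation at the Advanced or Elite tier moves the certification to suspended until the attestation is received. A minimal sketch, in which the 92-day upper bound for "quarterly" is my assumption rather than a figure from the methodology:

```python
# Sketch of the quarterly-attestation rule. The 92-day bound for
# "quarterly" is an ASSUMPTION, not a figure taken from the methodology.
from datetime import date, timedelta

QUARTER = timedelta(days=92)  # assumed upper bound for one quarter

def certification_status(tier: str, last_attestation: date, today: date) -> str:
    """Return 'active' or 'suspended' per the quarterly attestation rule."""
    if tier not in ("Advanced", "Elite"):
        return "active"  # quarterly attestation not required at other tiers
    if today - last_attestation > QUARTER:
        return "suspended"  # overdue -> suspended until attestation received
    return "active"
```

Because suspension is lifted as soon as the attestation is received, re-running the check with an updated `last_attestation` date restores the active status without any further state.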
AI agent certification is not a point-in-time event that remains valid indefinitely. The behaviour of an agent changes continuously: through model updates, prompt revisions, tool additions, changes to retrieval pipelines, and drift in the distribution of inputs. A certification issued against a snapshot of the agent's configuration and controls may not accurately reflect the agent's risk posture six months later. This section defines the triggers that require re-certification and the monitoring requirements that allow operators to maintain the accuracy of their certification between scheduled reviews.
Any of the following events require the operator to notify Agent Certified and initiate a re-certification assessment within 90 days of the triggering event:
Between scheduled re-certifications, operators are required to maintain the evidence base that supports their certification. In practice this means:
This methodology produces a point-in-time score based on evidence available at the time of assessment. The continuous monitoring requirements extend the value of that score by ensuring that the underlying controls remain in place between assessments. The relationship between point-in-time certification and continuous compliance is analogous to the relationship between an annual financial audit and the internal controls that the audited entity operates year-round: the audit provides an independent validation; the controls provide the ongoing assurance.
Insurers and regulators relying on an Agent Certified score should note the date of the scored report and the quarterly self-attestation record when evaluating whether the score remains current. A score issued more than 12 months ago without a subsequent re-certification is not considered current for the purposes of this methodology, regardless of the operator's assertions about the stability of their deployment.
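For a relying party, the currency rule stated above is mechanical: a scored report older than 12 months with no subsequent re-certification is not current, whatever the operator asserts. A minimal sketch, where the 365-day approximation of 12 months is mine:

```python
# Currency check for a scored report. The 12-month window comes from the
# text above; approximating it as 365 days is my simplification.
from datetime import date, timedelta

def score_is_current(report_date: date, as_of: date) -> bool:
    """True if the scored report is within the 12-month currency window."""
    return (as_of - report_date) <= timedelta(days=365)
```

A relying party would run this against the date of the most recent scored report, not the original certification date, since a re-certification resets the window.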
The seven dimensions of the Agent Certified methodology are grounded in, and consistent with, the primary international standards and regulatory instruments that govern AI system governance and risk management. The crosswalk below maps each dimension to the relevant function or function group in the NIST AI RMF 1.0, the relevant control area in ISO/IEC 42001:2023, and the relevant article or set of articles in the EU AI Act (Regulation (EU) 2024/1689). Where the AI Underwriting Consortium framework (AIUC-1) provides a relevant area, this is also noted.
| Dimension | NIST AI RMF 1.0 | ISO/IEC 42001:2023 | EU AI Act | AIUC-1 |
|---|---|---|---|---|
| D1 Trust & Safety | Govern 1.1, Manage 4.1, Manage 4.2 | Clause 8.4, Annex A.6 | Art. 9 (Risk management), Art. 14 (Human oversight), Art. 15 (Accuracy) | Safety and Robustness (SR-1 through SR-4) |
| D2 Context Integrity | Map 1.6, Map 3.1, Govern 6.1 | Annex A.7 | Art. 10 (Data governance) | Data and Input Quality (DIQ-1 through DIQ-3) |
| D3 Distribution Control | Govern 1.7, Map 2.2 | Annex A.9 | Art. 26 (Deployer obligations) | Access and Scope Control (ASC-1, ASC-2) |
| D4 Product Maturity | Measure 1.1, Measure 2.5, Manage 2.2 | Clause 9.1, Annex A.6 | Art. 15 (Accuracy, robustness), Art. 12 (Record-keeping) | Operational Reliability (OR-1 through OR-3) |
| D5 Governance | Govern 1.1, Govern 1.2, Govern 4.1, Govern 5.1 | Clause 5 (Leadership), Clause 6 (Planning), Annex A.2 | Art. 9 (Risk management), Art. 12 (Record-keeping), Art. 27 (Fundamental rights impact assessment, high-risk deployers) | Governance and Accountability (GA-1 through GA-5) |
| D6 AI Integration | Govern 6.2, Measure 4.1, Manage 3.1 | Annex A.8, Annex A.9 | Art. 14 (Human oversight), Art. 26 (Deployer obligations) | System Integration and Attribution (SIA-1, SIA-2) |
| D7 Autonomy Envelope | Govern 1.4, Govern 1.5, Manage 1.3, Manage 4.1 | Annex A.8 | Art. 14 (Human oversight measures), Art. 26 (Deployer obligations) | Autonomy and Override Controls (AOC-1 through AOC-4) |
The AI Underwriting Consortium framework (AIUC-1) is an industry reference developed by AI liability underwriters for use in risk assessment for AI-related insurance coverage. It is not a formal standards body publication. References to AIUC-1 areas reflect the terminology in use in the insurance market as of Q1 2026 and are subject to revision as the consortium updates its framework. The NIST AI RMF and ISO/IEC 42001 mappings are the primary standards references.
The credibility of the Agent Certified methodology depends entirely on the independence of the assessments it produces. The following policies are in force for the duration of this methodology version and may only be amended through a major version revision with public consultation.
No carrier, vendor, model provider, technology partner, or operator pays for placement, score adjustment, or favourable framing in any Agent Certified assessment or publication. The methodology and scoring rubric are applied consistently to all assessed agents regardless of the operator's commercial relationship with Future Proof Intelligence or any affiliated entity.
The complete methodology, including all scoring rubrics, dimension weights, evidence requirements, and certification level definitions, is published openly at agentcertified.eu/methodology-v2.html and licensed under CC-BY 4.0. Any party may use, reproduce, or build upon the methodology text with attribution. The certification mark itself is reserved and may not be used without a current, valid assessment issued by Agent Certified.
Major version revisions to the methodology, including changes to dimension weights, certification level thresholds, or assessment process requirements, are subject to a 30-day public consultation period before taking effect. Minor version revisions in response to regulatory or market developments that do not alter the fundamental scoring structure may take effect without consultation but are published with a 14-day notice period.
An assessor with a current or prior commercial relationship with an operator under assessment must declare that relationship before the assessment commences. If the relationship is material, the assessment is assigned to a different assessor. Agent Certified maintains an internal register of declared conflicts, reviewed quarterly by the editorial board.
Any organisation whose agent has been scored has the right of reply before any public disclosure of its score. The reply window is three business days from receipt of the draft scored report. A factual correction accepted by the assessor results in a revised score. A factual correction rejected by the assessor is noted in the final report alongside the assessor's reasoning. Disagreements about methodology interpretation, as distinct from factual errors, do not constitute grounds for score adjustment under the right of reply but may be submitted as feedback to inform future version revisions.
This is version 2.0 of the Agent Certified methodology, published on 24 April 2026. Version 1.0 was published on 17 April 2026.
Major versions of the methodology are designated by the integer before the decimal (v1, v2, v3). A major version change indicates a revision to dimension weights, certification level thresholds, the assessment process, or any other element that materially affects the comparability of scores produced under different major versions. Major versions are issued on an annual review cycle, with the review process commencing in the preceding quarter and including a 30-day public consultation period.
Minor versions are designated by the digit after the decimal (v2.1, v2.2). Minor versions reflect quarterly updates in response to regulatory or market developments, clarifications to rubric language, or additions to the evidence requirements where the underlying scoring criteria are unchanged. Minor versions do not require public consultation but are published with a 14-day notice period.
The following changes were made from version 1.0 to version 2.0:
The next scheduled major version review for v3.0 commences in Q3 2026, with public consultation in August and September 2026 and publication targeted for 1 October 2026. Areas identified for examination in the v3.0 review include: the potential introduction of sector-specific overlays for medical device, financial advice, and legal advice agents; the maturation of the Autonomy Envelope rubric in light of incident data from certified deployments; and the alignment of the dimension weights with emerging supervisory guidance from the EU AI Office, expected in H2 2026.
This methodology sets out fixed rubrics and evidence requirements with the aim of maximising reproducibility. Nevertheless, several scoring decisions require assessor judgement that cannot be fully specified in advance. The determination of whether a guardrail is genuinely effective (D1) or whether an Autonomy Envelope classification is reasonable given the agent's action surface (D7) involves a degree of expert interpretation. The methodology mitigates this through internal quality review of scored reports before issuance, but two assessors working from the same evidence may differ by one point on a dimension where judgement is required. Operators who believe a score is the product of an unreasonable interpretation may invoke the right of reply process described in Section 8.
The current methodology does not include sector-specific overlays. An agent advising on medical treatment plans carries risks that differ qualitatively from an agent summarising legal precedents or recommending financial products, even if both agents score identically across the seven dimensions. The v3.0 review will examine whether sector-specific overlays are necessary and, if so, how they interact with the base scoring framework. Until sector overlays are available, operators of agents in regulated sectors should treat the Agent Certified score as a necessary but not sufficient assessment of their compliance posture and consult sector-specific regulatory guidance directly.
The methodology is designed for assessment of a single agent or agentic system. Where multiple agents operate in a pipeline or orchestrated architecture, each agent may be assessed independently, but the methodology does not currently address the emergent risks arising from agent-to-agent interaction, including trust propagation, instruction injection between agents in a pipeline, and the distribution of accountability across agents with different control owners. This is an acknowledged gap that the v3.0 review will address.
The methodology assesses deployment-level controls implemented by the operator. It does not assess the underlying model or the adequacy of provider-level safety measures. An agent deployed on a model that has passed provider-level safety evaluations may still score poorly on D1 if the operator has not implemented organisation-level guardrails. Conversely, a strong D1 score does not imply any judgement on the adequacy of provider-level controls. Parties seeking an assessment of the underlying model should consult model evaluation frameworks such as METR's task difficulty evaluations or the EU AI Office's model evaluation methodology under the AI Act.
The v3.0 review will specifically address: sector-specific overlays for medical device, financial advice, and legal advice agents; multi-agent pipeline assessment methodology; revised Autonomy Envelope rubric anchors reflecting incident data accumulated from certified deployments; and alignment with any binding technical standards adopted under the EU AI Act by the European Artificial Intelligence Board.
Suggested citation for academic, legal, regulatory, and professional use:
Future Proof Intelligence. (2026). The Agent Certified Methodology: A Published Framework for AI Agent Certification, 2026 Edition. Version 2.0. Agent Certified (agentcertified.eu). Published 24 April 2026. Available at: https://agentcertified.eu/methodology-v2.html
For in-text reference in legal or regulatory filings, the following short-form citation is acceptable: Agent Certified Methodology v2.0 (Future Proof Intelligence, 24 April 2026).
The text of this methodology is published under a Creative Commons Attribution 4.0 International licence (CC-BY 4.0). Any party may reproduce, adapt, translate, or build upon the methodology text, including for commercial purposes, provided that appropriate attribution is given to Agent Certified and Future Proof Intelligence, a link to the licence is provided, and any modifications are indicated. The CC-BY 4.0 licence does not extend to the Agent Certified certification mark, the Agent Certified registry, or any scored report produced by Agent Certified. These are reserved and may not be reproduced without written consent.
Methodology inquiries, including questions about the application of the rubric, requests to submit feedback for the v3.0 review, and press inquiries about the framework: methodology@agentcertified.eu
Assessment and certification inquiries: assessments@agentcertified.eu
Registry and certification mark inquiries: registry@agentcertified.eu
The Agent Certified framework is calibrated to Article 26 of the EU AI Act, the revised Product Liability Directive, and the supervisory expectations of EIOPA and the EU AI Office. Agent Liability EU is the companion operator desk covering those instruments.