
The Agent Certified assessment process, step by step

What an assessment involves: intake, evidence gathering across the seven dimensions, scoring, the tier outcome, the report, and how the result feeds an insurer underwriting file. Written for a risk lead preparing for their first formal assessment.

By Future Proof Intelligence · Published 14 June 2026 · Reading time 12 minutes

Key takeaways

An Agent Certified assessment runs in five stages: intake and scoping, evidence gathering across seven dimensions, a structured scoring panel, a formal tier determination, and a written report with an underwriting-ready summary. The full cycle takes four to six weeks.
The assessment requires a named senior accountability owner, a technical lead, and a compliance representative. Interviews are load-bearing: documentary evidence alone is not sufficient for any dimension above the Pre Assessment tier.
Scoring is normalised to one hundred points across seven weighted dimensions. Each tier from Certified upward also sets a minimum raw score per dimension: a single weak dimension caps the tier regardless of the weighted total.
The final report includes a dedicated underwriting summary structured to align with insurer supplemental questionnaires used by carriers including Munich Re aiSure, Armilla, and Lloyd's AI-risk syndicates. The operator controls third-party disclosure.
An Agent Certified result is complementary to but not equivalent to the EU AI Act conformity assessment for high-risk AI systems under Regulation (EU) 2024/1689. It constitutes significant preparatory evidence, not a substitute.

Risk leads preparing for an Agent Certified assessment frequently ask the same two questions: what exactly happens during the assessment, and what will the output enable? This article answers both. It describes each stage in sequence, explains what the assessor is looking for at each point, and sets out how the final report is structured to serve both internal governance purposes and insurer due diligence.

The article does not reproduce the full scoring rubric, which is available at the methodology page. It does not describe what documentation to assemble in advance, which is covered in the companion article on preparing for an assessment. The focus here is the process itself: what happens, when, and why.

Stage one: Intake and scoping

The assessment begins with an intake session. The purpose is to scope the assessment to a specific, named AI agent and to establish the organisational context around it. Assessments do not cover a platform, a suite, or an organisation's AI posture in general. They cover one agent, identified by name and version, in its current production state.

What the intake session covers

The intake session typically runs ninety minutes and involves the senior accountability owner and the technical lead. The assessor works through five topics.

First, agent identification. The agent is recorded by name, version, model provider, deployment environment, and the specific task it is authorised to perform. If the agent uses multiple foundation models or changes models between pipeline stages, each dependency is noted.

Second, use case and deployment context. The assessor records who invokes the agent, under what authority, in what environment (internal tool, customer-facing product, business process automation), and what the downstream consequences of an error look like. A customer-facing agent that can execute financial transactions carries different baseline risk from an internal agent that drafts documents for human review.

Third, existing governance documentation. The assessor asks which governance artefacts already exist: AI risk policy, risk register entries, incident records, board papers, data governance documentation. The intake session is not an evidence review; it is a map of what exists so the evidence gathering stage can be planned efficiently.

Fourth, applicable regulatory classification. The assessor asks whether the operator has made a determination under Regulation (EU) 2024/1689 as to whether the agent constitutes a high-risk AI system under Annex III, or falls within the general-purpose AI (GPAI) model provisions under Chapter V. The assessment proceeds regardless of that determination, but the regulatory context affects how findings are framed in the final report.

Fifth, counterparty requirements. Many operators come to assessment because a customer, insurer, or board has requested evidence of AI governance maturity. The intake session records those requirements so the final report can be structured to address them directly.

Output of the intake stage

The output is a scoping document confirming the agent under assessment, the evidence gathering schedule, the named participants for each dimension interview, and the target completion date. The operator countersigns the scoping document before evidence gathering begins.
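The scoping document works as a structured record plus a gate: evidence gathering does not start until it is complete and countersigned. The Python sketch below illustrates that gate. The class name, its field names and the `blockers` method are hypothetical, not part of the published methodology.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# The seven assessment dimensions named in the methodology.
DIMENSIONS = (
    "Trust and Safety", "Governance", "Context Integrity", "Product Maturity",
    "Autonomy Envelope", "Distribution Control", "AI Integration",
)


@dataclass
class ScopingDocument:
    """Hypothetical record of the intake output for one named agent."""
    agent_name: str
    agent_version: str
    model_dependencies: list[str]      # one entry per foundation model used
    deployment_environment: str        # e.g. "customer-facing product"
    authorised_task: str
    interview_participants: dict[str, str] = field(default_factory=dict)  # dimension -> person
    target_completion: Optional[date] = None
    countersigned_by: Optional[str] = None

    def blockers(self) -> list[str]:
        """Reasons evidence gathering cannot begin yet (an empty list means ready)."""
        problems = []
        if not self.model_dependencies:
            problems.append("no model dependency recorded")
        missing = [d for d in DIMENSIONS if d not in self.interview_participants]
        if missing:
            problems.append("no named participant for: " + ", ".join(missing))
        if self.target_completion is None:
            problems.append("no target completion date")
        if self.countersigned_by is None:
            problems.append("scoping document not countersigned by the operator")
        return problems


doc = ScopingDocument(
    agent_name="claims-triage-agent",       # hypothetical agent
    agent_version="2.3.1",
    model_dependencies=["example-llm-v1"],  # hypothetical model
    deployment_environment="internal tool",
    authorised_task="draft claim summaries for human review",
)
print(doc.blockers())
```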

Stage two: Evidence gathering across seven dimensions

Evidence gathering is the most intensive stage of the assessment. It runs two to three weeks for most operators and involves both document review and structured interviews. The two activities run in parallel: the assessor reviews submitted documentation while scheduling dimension interviews.

How evidence is submitted

Operators submit documentation through a secure document portal. No document is retained beyond the assessment window without explicit written consent. The submission checklist maps to the seven dimensions; the most efficient operators submit documents grouped by dimension rather than chronologically or by internal document type.

Common document types across all dimensions include: AI risk policy and version history, risk register entries, board papers or minutes referencing the agent, incident records from the prior twelve months, data governance documentation including data lineage maps, vendor due diligence records for model providers, monitoring and observability dashboards (screenshots or exports), autonomy policy with version date, and penetration test or red team reports.

Not all evidence is documentary. Technical telemetry, live system walkthroughs, and demonstration of specific controls in a test environment all count as evidence. Assessors will request a live walkthrough for any dimension where documentary evidence is thin.

Dimension interviews

Each of the seven dimensions has a structured interview. Interviews are sixty minutes each and involve the named participant for that dimension. The assessor works from a fixed question set, but follows up based on the evidence already submitted. Interviews are not a repeat of document review: they test whether the documented controls are in active use and whether the named participants understand them.

The interview for the Trust and Safety dimension is typically the longest and most technically detailed. It covers the guardrail implementation, the incident detection and containment process, the red team schedule, and the operator's demonstrated understanding of their own attack surface. Assessors will ask about real incidents. If the operator has had no incidents in the prior twelve months, the assessor will ask how the operator knows that, which is itself an evidence point.

The interview for the Governance dimension is typically the most sensitive. It requires the senior accountability owner to speak to board-level awareness of the agent, the AI risk policy's status in the governance cycle, and the operator's vendor due diligence process for the model provider. Gaps in board awareness are one of the most common shortfalls found at this stage. Article 17 of Regulation (EU) 2024/1689 requires providers of high-risk AI systems to put a documented quality management system in place; the Governance dimension interview tests whether equivalent discipline exists regardless of high-risk classification.

The Autonomy Envelope interview is the dimension that most consistently surprises operators. Many organisations have an autonomy policy in principle but have not documented the specific impact thresholds that trigger human-in-the-loop requirements, and have not tested whether non-engineering staff can actually exercise revocation. Article 14 of Regulation (EU) 2024/1689 requires human oversight measures for high-risk systems; the Autonomy Envelope dimension applies the same discipline to all agents.

Stage three: Scoring

After evidence gathering is complete, the lead assessor scores each dimension against the published rubric. Scoring is not a single person's judgment: the methodology requires a second assessor to review the scores for any dimension where the lead assessor's raw score is above eight or below three. The review operates on a consensus model: it produces a single agreed score for each dimension it reviews, with a note on any dimension where the assessors disagreed before reaching consensus.
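The review trigger is a simple rule on the lead assessor's raw scores. Read literally, "above eight or below three" means raw scores of 9, 10, 1 and 2 go to a second assessor while 3 and 8 do not; the sketch below assumes that reading, and the function name is hypothetical.

```python
def dimensions_for_second_review(raw_scores: dict[str, int]) -> list[str]:
    """Return the dimensions whose lead-assessor raw score is above 8 or below 3.

    Boundary values 3 and 8 are not flagged under this literal reading.
    """
    return [dim for dim, score in raw_scores.items() if score > 8 or score < 3]


lead_scores = {
    "Trust and Safety": 9,
    "Governance": 3,
    "Context Integrity": 8,
    "AI Integration": 2,
}
print(dimensions_for_second_review(lead_scores))  # ['Trust and Safety', 'AI Integration']
```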

How scores are calculated

Each dimension receives a raw score from one to ten based on the scoring rubric. The raw score is multiplied by the dimension weight (Trust and Safety 18, Governance 16, Context Integrity 14, Product Maturity 14, Autonomy Envelope 14, Distribution Control 12, AI Integration 12) and summed. The result is normalised to a one hundred point scale.
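The calculation can be sketched in a few lines of Python. The text does not spell out the normalisation, so this sketch assumes the weighted sum is divided by its maximum possible value (every dimension at ten) and rescaled to 100; because the weights already sum to 100, that amounts to dividing by ten. The function name is hypothetical.

```python
WEIGHTS = {
    "Trust and Safety": 18,
    "Governance": 16,
    "Context Integrity": 14,
    "Product Maturity": 14,
    "Autonomy Envelope": 14,
    "Distribution Control": 12,
    "AI Integration": 12,
}
MAX_RAW = 10


def weighted_total(raw_scores: dict[str, int]) -> float:
    """Return the weighted total on a 0-100 scale from raw scores of 1-10."""
    if set(raw_scores) != set(WEIGHTS):
        raise ValueError("a raw score is required for every dimension")
    for dim, score in raw_scores.items():
        if not 1 <= score <= MAX_RAW:
            raise ValueError(f"{dim}: raw score must be between 1 and {MAX_RAW}")
    weighted_sum = sum(WEIGHTS[d] * s for d, s in raw_scores.items())
    max_sum = MAX_RAW * sum(WEIGHTS.values())  # 10 * 100 = 1000
    return round(100 * weighted_sum / max_sum, 1)


scores = {
    "Trust and Safety": 6,
    "Governance": 5,
    "Context Integrity": 7,
    "Product Maturity": 6,
    "Autonomy Envelope": 4,
    "Distribution Control": 5,
    "AI Integration": 6,
}
print(weighted_total(scores))  # 55.8
```

Under this assumption the lowest possible total is 10 (every dimension at one), not zero.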

The five tiers and their weighted score thresholds are: Pre Assessment (below 20), In Progress (20 to 34), Certified (35 to 54), Advanced (55 to 74), Elite (75 and above). These thresholds are necessary but not sufficient conditions for a tier. The three upper tiers also set a minimum raw score per dimension. Certified requires a minimum raw score of four on every dimension. Advanced requires six. Elite requires eight. An operator scoring strongly across six dimensions but scoring two on the Trust and Safety dimension will not achieve Certified tier regardless of the weighted total. The framework does not reward lopsided agents.
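The two conditions combine as in the sketch below. It assumes each threshold is a lower bound on the weighted total, that In Progress and Pre Assessment carry no per-dimension floor, and that an agent failing a tier's floor falls to the highest tier whose conditions it does meet. Names are hypothetical.

```python
WEIGHTS = {
    "Trust and Safety": 18, "Governance": 16, "Context Integrity": 14,
    "Product Maturity": 14, "Autonomy Envelope": 14,
    "Distribution Control": 12, "AI Integration": 12,
}

# (tier, minimum weighted total, minimum raw score on every dimension).
# Assumption: tiers below Certified have no per-dimension floor (0 here).
TIERS = (
    ("Elite", 75, 8),
    ("Advanced", 55, 6),
    ("Certified", 35, 4),
    ("In Progress", 20, 0),
)


def determine_tier(raw_scores: dict[str, int]) -> tuple[str, float]:
    """Return (tier, weighted total) for a complete set of raw scores."""
    # Weights sum to 100, so dividing by 10 puts the total on a 0-100 scale.
    total = sum(WEIGHTS[dim] * score for dim, score in raw_scores.items()) / 10
    weakest = min(raw_scores.values())
    for tier, min_total, floor in TIERS:
        # The weighted threshold is necessary but not sufficient:
        # the weakest dimension must also clear the tier's floor.
        if total >= min_total and weakest >= floor:
            return tier, total
    return "Pre Assessment", total


# Strong everywhere except Trust and Safety, as in the example above.
lopsided = {dim: 8 for dim in WEIGHTS}
lopsided["Trust and Safety"] = 2
print(determine_tier(lopsided))  # ('In Progress', 69.2)
```

Raising Trust and Safety from two to four moves the same agent to Certified at 72.8; it would still need six on every dimension to reach Advanced, even though its weighted total already exceeds 55.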

How the framework handles uncertainty

Where evidence is ambiguous, the methodology instructs assessors to score the lower of the two plausible values. This is a deliberate design choice. The value of a certification result rests on its reliability as a signal. A framework that is generous with ambiguous evidence produces scores that are hard to rely on. Operators with genuinely strong controls will produce documentation that eliminates ambiguity. Where documentation is absent and the interview does not resolve the question, the absence is itself an evidence point.

Stage four: Tier determination and quality review

The tier determination is not mechanical. Once the weighted total and per-dimension floors have been applied, the lead assessor writes a brief narrative for each dimension. The narrative records what the operator did well, where the evidence was strong, and where the gaps were. The tier determination is then reviewed by a second assessor who has not seen the dimension narratives before. The reviewing assessor checks the tier against the scores and the narratives, and flags any dimension where the narrative and the score appear inconsistent.

This quality step is the point at which systematic assessor bias is most likely to be caught. Assessors who consistently score operators one point higher than the rubric supports across a specific dimension type will be identified through this review. The process does not eliminate error, but it makes systematic drift visible.

Stage five: The final report

The final report is the deliverable. It is a structured document, typically twenty to thirty pages, issued to the named operator contact. It has six sections.

Report section one: Executive summary

One page. States the agent assessed, the date range of the assessment, the tier result, the weighted total score, and the per-dimension scores in a summary table. Written for a non-technical reader: a board member or a Chief Risk Officer should be able to read the executive summary in five minutes and understand the result without reading the rest of the report.

Report section two: Dimension findings

Seven subsections, one per dimension. Each subsection states the dimension score, the evidence reviewed, the interview findings, and a narrative assessment. The narrative distinguishes between what the operator has in place, what the assessor observed in practice, and where gaps were identified. The language is factual and specific: it names the control that is missing or weak, not merely the dimension that scored low.

Report section three: Standards crosswalk

Maps the assessment findings to relevant reference instruments. For each dimension finding, the crosswalk identifies the corresponding NIST AI Risk Management Framework function and category, the ISO/IEC 42001:2023 clause most directly relevant, and the EU AI Act article that addresses the same concern. This section is used by operators comparing their Agent Certified result to other compliance commitments and by legal and compliance teams preparing regulatory submissions.

Report section four: Priority gap list

A prioritised list of all shortfalls identified during the assessment. Each item in the gap list states the dimension, the specific evidence item that is missing or weak, the scoring impact (which rubric level the current state corresponds to, and what would be required to reach the next level), and an indicative effort estimate (low, medium, high) for closing the gap. The gap list is ordered by weighted impact: the item that would most improve the weighted total if closed is listed first.

The gap list is the practical planning tool for operators who want to move to a higher tier. It is written for the team that will action the gaps, not for the board. Operators consistently report that the gap list is the most used section of the report after the executive summary.
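The ordering rule can be sketched as follows. It assumes that closing a gap moves its dimension exactly from the current rubric level to the next level recorded in the gap list, and that each gap is ranked independently of the others. The `Gap` class, its fields and the example evidence items are hypothetical.

```python
from dataclasses import dataclass

WEIGHTS = {
    "Trust and Safety": 18, "Governance": 16, "Context Integrity": 14,
    "Product Maturity": 14, "Autonomy Envelope": 14,
    "Distribution Control": 12, "AI Integration": 12,
}


@dataclass
class Gap:
    dimension: str
    evidence_item: str   # the specific missing or weak evidence item
    current_level: int   # rubric level the current state corresponds to
    next_level: int      # level reached if the gap is closed
    effort: str          # indicative effort: "low", "medium" or "high"

    @property
    def weighted_impact(self) -> float:
        """Points added to the 0-100 weighted total if the gap is closed."""
        return WEIGHTS[self.dimension] * (self.next_level - self.current_level) / 10


def prioritise(gaps: list[Gap]) -> list[Gap]:
    """Order gaps by weighted impact, largest first (ties keep input order)."""
    return sorted(gaps, key=lambda gap: gap.weighted_impact, reverse=True)


gaps = [
    Gap("Governance", "board paper referencing the agent", 4, 5, "low"),
    Gap("Trust and Safety", "red team report from the prior twelve months", 5, 6, "high"),
    Gap("AI Integration", "documented integration test results", 3, 5, "medium"),
]
for gap in prioritise(gaps):
    print(f"{gap.weighted_impact:.1f}  {gap.dimension}: {gap.evidence_item} ({gap.effort})")
```

Effort is carried along but does not affect the ranking, matching the ordering described above; a team may still choose to close a low-effort item early.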

Report section five: Underwriting summary

This section is structured specifically for insurer use. It is two to three pages and is formatted to align with the supplemental AI questionnaires used by carriers currently active in the European AI liability market, including Munich Re aiSure, Armilla, and Lloyd's AI-risk syndicates. It states the tier result, summarises the Autonomy Envelope and Trust and Safety dimension findings in language carriers recognise, and confirms the assessment scope, methodology version, and assessment date.

Insurers requesting AI governance evidence as part of their underwriting process will typically have asked the operator a set of open-ended questions about the agent. The underwriting summary translates the structured assessment result into a form that answers those questions without requiring the carrier to read the full report. The operator controls disclosure; the underwriting summary is issued to the operator and shared with carriers at the operator's discretion.

For a detailed account of how certification evidence affects insurance outcomes, see the companion article on how AI certification feeds into insurance underwriting. Two instruments form a relevant backdrop for risk leads approaching insurers. Article 26 of Regulation (EU) 2024/1689 places monitoring, log-keeping and human oversight obligations on deployers of high-risk AI systems, and Directive (EU) 2024/2853 brings software, including AI systems, within the EU product liability framework, with provisions for court-ordered disclosure of evidence. Structured certification evidence speaks directly to the documentation that both instruments make relevant.

Report section six: Certification statement

A formal statement recording the tier result, the assessment period, the methodology version, the agent scoped, and the date of issue. The certification statement is the document most commonly shared with third parties. It states the result without the detail that operators may wish to keep internal. The statement is valid for twelve months from the date of issue, after which a reassessment is required to maintain currency.
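Checking whether a certification statement is still current is a small date calculation. This sketch assumes the statement is valid up to, but not including, the same calendar day twelve months after issue, and that an issue date of 29 February rolls back to 28 February in a non-leap year. The function names are hypothetical.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping to the last day of a shorter month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def certification_expiry(issued: date) -> date:
    """First day on which the statement is no longer current."""
    return add_months(issued, 12)


def is_current(issued: date, on: date) -> bool:
    return issued <= on < certification_expiry(issued)


issued = date(2026, 7, 1)
print(certification_expiry(issued))           # 2027-07-01
print(is_current(issued, date(2027, 6, 30)))  # True
print(is_current(issued, date(2027, 7, 1)))   # False
```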

After the report: Reassessment and monitoring

A certification result has a twelve-month validity window. The agent, the infrastructure around it, and the governance context all change faster than an annual reassessment cycle can track. Operators who want their certification result to remain current between annual assessments should implement continuous monitoring on the dimensions most likely to drift: Trust and Safety (guardrail degradation, new attack vectors), Governance (staff changes, board paper gaps), and the Autonomy Envelope (scope creep, undocumented extensions to agent authority).

Operators who change the underlying model, move to a new model provider, or deploy the agent into a materially different use case should request an interim reassessment rather than waiting for the annual cycle. A model change is not a minor technical update: it can materially affect Trust and Safety and Context Integrity dimension scores.
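These triggers lend themselves to a simple comparison between the state recorded in the scoping document and the agent as currently deployed. In this hypothetical sketch, `use_case` is a coarse category label; deciding whether a change of use case is material remains a judgement call that no string comparison captures. Model and provider names are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentState:
    """Hypothetical snapshot of the attributes that trigger an interim reassessment."""
    model: str
    model_provider: str
    use_case: str  # coarse category label, e.g. "internal document drafting"


def interim_reassessment_triggers(scoped: AgentState, deployed: AgentState) -> list[str]:
    """List the reasons, if any, to request an interim reassessment."""
    reasons = []
    if deployed.model != scoped.model:
        reasons.append(f"underlying model changed: {scoped.model} -> {deployed.model}")
    if deployed.model_provider != scoped.model_provider:
        reasons.append(
            f"model provider changed: {scoped.model_provider} -> {deployed.model_provider}"
        )
    if deployed.use_case != scoped.use_case:
        reasons.append(f"use case changed: {scoped.use_case} -> {deployed.use_case}")
    return reasons


scoped = AgentState("example-model-2026-01", "ProviderA", "internal document drafting")
deployed = AgentState("example-model-2026-05", "ProviderA", "internal document drafting")
print(interim_reassessment_triggers(scoped, deployed))
```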

Operators who receive an In Progress or Pre Assessment result on the first assessment can request a targeted reassessment covering only the dimensions where the gap list identified shortfalls. A targeted reassessment takes one to two weeks and focuses exclusively on the evidence relevant to the affected dimensions. It does not re-examine dimensions that already scored at or above the floor for the tier being sought.
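Scoping a targeted reassessment reduces to finding the dimensions below the floor for the tier being sought. Under the divide-by-ten normalisation assumed in this sketch, clearing every floor also clears that tier's weighted threshold (all fours give 40, above the Certified threshold of 35; all sixes give 60; all eights give 80), so the floors alone determine the scope. Names are hypothetical.

```python
# Per-dimension floors for the tiers that have them.
FLOORS = {"Certified": 4, "Advanced": 6, "Elite": 8}


def targeted_scope(raw_scores: dict[str, int], target_tier: str = "Certified") -> list[str]:
    """Dimensions to re-examine: those scoring below the target tier's floor."""
    floor = FLOORS[target_tier]
    return [dim for dim, score in raw_scores.items() if score < floor]


first_result = {
    "Trust and Safety": 3,
    "Governance": 2,
    "Context Integrity": 5,
    "Product Maturity": 6,
    "Autonomy Envelope": 4,
    "Distribution Control": 5,
    "AI Integration": 4,
}
print(targeted_scope(first_result))  # ['Trust and Safety', 'Governance']
print(targeted_scope(first_result, "Advanced"))
```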

Practical timeline for a risk lead

A risk lead coordinating their first Agent Certified assessment should plan for the following sequence.

Week one: Submit the assessment request through the assessment request page. Nominate the senior accountability owner, technical lead, and compliance representative. Schedule the intake session.

Weeks one to two: Hold the intake session. Receive and countersign the scoping document. Begin assembling documentation using the seven-dimension submission checklist. The companion article on preparing for an assessment is the most detailed guide for this stage.

Weeks two to four: Submit documentation through the document portal. Attend dimension interviews as scheduled. Flag any dimension where the operator anticipates a weak score so the assessor can plan the evidence gathering stage accordingly. Operators consistently report that proactive disclosure of known weaknesses produces more useful findings than attempting to minimise gaps during the interview stage.

Weeks four to five: Receive draft dimension narratives for factual review. The review window is five business days. The operator may correct factual errors in the narratives but may not amend scores; a score the operator disputes is escalated to the review panel.

Weeks five to six: Receive the final report. Distribute internally. Extract the underwriting summary for insurer distribution if required. Begin action planning against the priority gap list.
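The five-business-day factual review window is easy to miscount across a weekend. The sketch below assumes business days run Monday to Friday, that the day the drafts arrive is not counted, and that public holidays are supplied by the caller. The function name is hypothetical.

```python
from datetime import date, timedelta


def review_deadline(received: date, business_days: int = 5,
                    holidays: frozenset[date] = frozenset()) -> date:
    """Return the last day of a review window counted in business days."""
    current = received
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        # weekday(): Monday is 0 and Sunday is 6, so 5 and 6 are the weekend.
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


print(review_deadline(date(2026, 7, 3)))  # Friday 3 July 2026 -> 2026-07-10
```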

What the assessment does not cover

Two points of scope are consistently misunderstood by operators approaching their first assessment.

The assessment covers the agent in its current production state. It does not cover a planned future state, a staging environment, or a development version. If the operator intends to deploy a materially different version of the agent within the validity period, they should disclose that at intake so the scoping document reflects it.

The assessment is not a legal opinion on EU AI Act compliance. It is a structured evaluation of an agent against the seven-dimension framework, with findings mapped to relevant regulatory instruments. Operators of high-risk AI systems under Regulation (EU) 2024/1689 Annex III will need a conformity assessment as defined in Article 43 of that regulation, which is a separate process conducted by a notified body or, for most high-risk categories, by internal assessment procedures. An Agent Certified result constitutes significant preparatory evidence for a conformity assessment but is not a substitute for it. See the companion article on EU AI Act conformity assessment for high-risk AI systems for the distinction in detail.

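The Moffatt v Air Canada decision [British Columbia Civil Resolution Tribunal, 2024] and Mata v Avianca [S.D.N.Y., 2023] both illustrate the kind of accountability gap that structured certification is designed to surface. In Moffatt, the tribunal held Air Canada responsible for inaccurate bereavement fare information given by the chatbot on its website, rejecting the argument that the chatbot was responsible for its own actions. In Mata, lawyers were sanctioned for filing a brief containing fictitious case citations generated by ChatGPT. In both cases, the questions of what the AI system was authorised to do, who was accountable for its outputs, and how its limits had been tested were confronted only after harm had occurred. These are precisely the conditions that an Agent Certified assessment probes. The cases are cited here not as legal precedent but as concrete illustrations of why governance documentation matters before an incident, not after.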

Frequently asked questions

How long does a full Agent Certified assessment take from intake to report?

A full assessment for a single production agent takes approximately four to six weeks from intake to final report. Intake and scoping take one week. Evidence gathering runs two to three weeks in parallel with structured interviews. Scoring and panel review take one week. Report finalisation and handoff take three to five business days. Operators with well-organised documentation consistently finish closer to four weeks. Operators assembling documentation for the first time during the assessment should budget six weeks.

Who participates in the assessment from the operator side?

The assessment requires three named participants as a minimum: the senior accountability owner (typically a Chief Risk Officer, Head of AI Governance, or equivalent named in the AI risk policy), the technical lead responsible for the agent build and deployment, and a compliance or legal representative who can speak to data governance, incident reporting obligations, and the operator's EU AI Act classification determination. Additional participants join for dimension-specific interviews. The assessment is not self-service; the interviews are load-bearing.

What happens if the agent scores below Certified tier on the first assessment?

An In Progress or Pre Assessment result is reported with a prioritised gap list. The gap list maps each shortfall to the scoring rubric, identifies the specific evidence items missing, and orders gaps by weighted impact. Operators typically address the highest-impact gaps within four to eight weeks, then request a targeted reassessment covering only the affected dimensions rather than running the full process again. A targeted reassessment takes one to two weeks.

Can the assessment report be shared with an insurance carrier?

Yes. The final report includes an underwriting summary formatted to align with supplemental AI questionnaires used by carriers including Munich Re aiSure, Armilla, and Lloyd's AI-risk syndicates. The operator controls disclosure: the report is issued to the operator and shared with third parties at the operator's discretion. The underwriting summary can be provided separately from the full report if the operator prefers.

Does the assessment cover EU AI Act conformity assessment requirements for high-risk AI systems?

The Agent Certified assessment is complementary to but not equivalent to the conformity assessment process for high-risk AI systems under Regulation (EU) 2024/1689. It maps findings to relevant EU AI Act articles including Articles 9, 10, 14, 15, 17 and 26, and constitutes significant preparatory evidence for a conformity assessment. It is not a substitute for the conformity assessment required under Article 43. The high-risk obligations were originally scheduled for 2 August 2026; a provisional deferral to 2 December 2027 was agreed in trilogue negotiations on the EU Digital Omnibus on 7 May 2026 but has not yet been formally adopted.

What evidence does the Autonomy Envelope dimension require?

The Autonomy Envelope dimension requires a written autonomy policy specifying which action classes the agent may execute without human confirmation, the impact thresholds that trigger a human-in-the-loop requirement, documentation that revocation is exercisable by non-engineering staff, evidence of rollback capability for agent-initiated actions where technically feasible, and a record of at least one autonomy boundary test conducted in the prior twelve months. This dimension maps directly to Article 14 of Regulation (EU) 2024/1689 on human oversight and Article 26 on deployer obligations.
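The first two requirements (which action classes run without confirmation, and which impact thresholds trigger a human-in-the-loop step) can be expressed as a small, testable policy object. This is a hypothetical sketch: the action-class names, the use of a single numeric impact value (here a monetary amount) and the default of escalating any unlisted action are illustrative assumptions, not requirements stated in the methodology.

```python
from dataclasses import dataclass, field


@dataclass
class AutonomyPolicy:
    """Hypothetical encoding of a written autonomy policy."""
    # Action classes the agent may execute without human confirmation.
    autonomous_actions: frozenset[str]
    # Per-class impact threshold at or above which a human must confirm.
    impact_thresholds: dict[str, float] = field(default_factory=dict)
    version_date: str = ""

    def requires_human(self, action_class: str, impact: float) -> bool:
        if action_class not in self.autonomous_actions:
            return True  # unlisted action classes always escalate
        threshold = self.impact_thresholds.get(action_class)
        return threshold is not None and impact >= threshold


policy = AutonomyPolicy(
    autonomous_actions=frozenset({"draft_reply", "issue_refund"}),
    impact_thresholds={"issue_refund": 500.0},  # e.g. an amount in EUR
    version_date="2026-05-01",
)
print(policy.requires_human("draft_reply", 0.0))     # False
print(policy.requires_human("issue_refund", 120.0))  # False
print(policy.requires_human("issue_refund", 750.0))  # True
print(policy.requires_human("close_account", 0.0))   # True
```

A table-driven check of this kind is one way to produce the record of an autonomy boundary test that the dimension asks for.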

How does the Agent Certified score translate to a better insurance outcome?

Insurers pricing AI agent risk currently lack structured data about the agent they are covering. A certified result reduces the underwriting friction caused by that information asymmetry. In practice, carriers have used structured certification evidence to accelerate placement timelines, reduce supplemental questionnaire burden, and support pricing at the lower end of the bracket for a given risk class. A Certified tier result addresses minimum insurability bars for some carriers. Advanced and Elite tier results support broader coverage terms and lower premiums where the carrier uses certification as a pricing input.

References

European Parliament and Council. Regulation (EU) 2024/1689 on harmonised rules on artificial intelligence (EU AI Act). Articles 9, 10, 14, 15, 17, 26 and 43. Official Journal of the European Union, 2024.
European Parliament and Council. Directive (EU) 2024/2853 on liability for defective products. Official Journal of the European Union, 2024.
National Institute of Standards and Technology. AI Risk Management Framework 1.0 (NIST AI 100-1). Gaithersburg, January 2023.
International Organization for Standardization. ISO/IEC 42001:2023, Information technology, Artificial intelligence, Management system. Geneva, 2023.
European Insurance and Occupational Pensions Authority. Supervisory statement on the use of artificial intelligence in the insurance sector. Frankfurt, 2024.
Moffatt v Air Canada, 2024 BCCRT 149 (British Columbia Civil Resolution Tribunal, 2024). Deployer liability for misinformation provided by a customer-service chatbot.
Mata v Avianca, Inc., Case No. 22-cv-1461 (S.D.N.Y. 2023). Sanctions for legal submissions containing AI-generated fictitious citations.
Munich Re. aiSure AI Liability product. Availability and coverage scope vary by market; confirm current terms with Munich Re or a specialist broker. Munich Re, 2024.
Armilla AI. Warranty and liability products for AI model providers and deployers. Availability varies by market; confirm current EU terms with Armilla or a specialist broker. 2025.
Lloyd's of London. Artificial intelligence and autonomous systems: underwriting considerations. Lloyd's of London, 2023.
HSB (Hartford Steam Boiler). AI business interruption and liability cover. Product scope and availability vary by market; confirm current terms with HSB or a specialist broker. 2024.
Agent Certified. Methodology specification, June 2026 version. Published at agentcertified.eu/methodology.html.