Key takeaways
  • An Agent Certified assessment runs in five stages: intake and scoping, evidence gathering across seven dimensions, a structured scoring panel, a formal tier determination, and a written report with an underwriting-ready summary. The full cycle takes four to six weeks.
  • The assessment requires a named senior accountability owner, a technical lead, and a compliance representative. Interviews are load-bearing: documentary evidence alone is not sufficient for any dimension above the Pre Assessment tier.
  • Scoring is normalised to one hundred points across seven weighted dimensions. Every tier sets a minimum raw score per dimension: a single weak dimension caps the tier regardless of the weighted total.
  • The final report includes a dedicated underwriting summary structured to align with insurer supplemental questionnaires used by carriers including Munich Re aiSure, Armilla, and Lloyd's AI-risk syndicates. The operator controls third-party disclosure.
  • An Agent Certified result is complementary to but not equivalent to the EU AI Act conformity assessment for high-risk AI systems under Regulation (EU) 2024/1689. It constitutes significant preparatory evidence, not a substitute.

Risk leads preparing for an Agent Certified assessment frequently ask the same two questions: what exactly happens during the assessment, and what will the output enable? This article answers both. It describes each stage in sequence, explains what the assessor is looking for at each point, and sets out how the final report is structured to serve both internal governance purposes and insurer due diligence.

This article does not reproduce the full scoring rubric, which is available on the methodology page. Nor does it describe what documentation to assemble in advance, which is covered in the companion article on preparing for an assessment. The focus here is the process itself: what happens, when, and why.

Stage one: Intake and scoping

The assessment begins with an intake session. The purpose is to scope the assessment to a specific, named AI agent and to establish the organisational context around it. Assessments do not cover a platform, a suite, or an organisation's AI posture in general. They cover one agent, identified by name and version, in its current production state.

What the intake session covers

The intake session typically runs ninety minutes and involves the senior accountability owner and the technical lead. The assessor works through five topics.

First, agent identification. The agent is recorded by name, version, model provider, deployment environment, and the specific task it is authorised to perform. If the agent uses multiple foundation models or changes models between pipeline stages, each dependency is noted.

Second, use case and deployment context. The assessor records who invokes the agent, under what authority, in what environment (internal tool, customer-facing product, business process automation), and what the downstream consequences of an error look like. A customer-facing agent that can execute financial transactions carries different baseline risk from an internal agent that drafts documents for human review.

Third, existing governance documentation. The assessor asks which governance artefacts already exist: AI risk policy, risk register entries, incident records, board papers, data governance documentation. The intake session is not an evidence review; it is a map of what exists so the evidence gathering stage can be planned efficiently.

Fourth, applicable regulatory classification. The assessor asks whether the operator has made a determination under Regulation (EU) 2024/1689 as to whether the agent constitutes a high-risk AI system under Annex III, or falls within the GPAI model provisions under Chapter V. The assessment proceeds regardless of that determination, but the regulatory context affects how findings are framed in the final report.

Fifth, counterparty requirements. Many operators come to assessment because a customer, insurer, or board has requested evidence of AI governance maturity. The intake session records those requirements so the final report can be structured to address them directly.

Output of the intake stage

The output is a scoping document confirming the agent under assessment, the evidence gathering schedule, the named participants for each dimension interview, and the target completion date. The operator countersigns the scoping document before evidence gathering begins.
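As an illustration only, the contents of the scoping document can be modelled as a small record. The class name, field names, and the readiness check below are assumptions made for this sketch; the article specifies what the document confirms, not how it is stored.

```python
from dataclasses import dataclass
from datetime import date

# The seven dimensions named in the scoring section of this article.
DIMENSIONS = (
    "Trust and Safety",
    "Governance",
    "Context Integrity",
    "Product Maturity",
    "Autonomy Envelope",
    "Distribution Control",
    "AI Integration",
)


@dataclass
class ScopingDocument:
    """Hypothetical record of what the scoping document confirms."""
    agent_name: str
    agent_version: str
    interview_participants: dict[str, str]  # dimension -> named participant
    target_completion: date
    countersigned: bool = False

    def missing_participants(self) -> list[str]:
        """Dimensions that do not yet have a named interview participant."""
        return [d for d in DIMENSIONS if not self.interview_participants.get(d)]

    def ready_for_evidence_gathering(self) -> bool:
        """True once the operator has countersigned and every interview is staffed.

        Requiring a participant for all seven dimensions is an assumption of
        this sketch, based on the scoping document naming participants for
        each dimension interview.
        """
        return self.countersigned and not self.missing_participants()
```

A risk lead could use a check like this to spot an unstaffed dimension interview before countersigning.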

Stage two: Evidence gathering across seven dimensions

Evidence gathering is the most intensive stage of the assessment. It runs two to three weeks for most operators and involves both document review and structured interviews. The two activities run in parallel: the assessor reviews submitted documentation while scheduling dimension interviews.

How evidence is submitted

Operators submit documentation through a secure document portal. No document is retained beyond the assessment window without explicit written consent. The submission checklist maps to the seven dimensions; the most efficient operators submit documents grouped by dimension rather than chronologically or by internal document type.

Common document types across all dimensions include: AI risk policy and version history, risk register entries, board papers or minutes referencing the agent, incident records from the prior twelve months, data governance documentation including data lineage maps, vendor due diligence records for model providers, monitoring and observability dashboards (screenshots or exports), autonomy policy with version date, and penetration test or red team reports.

Not all evidence is documentary. Technical telemetry, live system walkthroughs, and demonstration of specific controls in a test environment all count as evidence. Assessors will request a live walkthrough for any dimension where documentary evidence is thin.

Dimension interviews

Each of the seven dimensions has a structured interview. Interviews are sixty minutes each and involve the named participant for that dimension. The assessor works from a fixed question set, but follows up based on the evidence already submitted. Interviews are not a repeat of document review: they test whether the documented controls are in active use and whether the named participants understand them.

The interview for the Trust and Safety dimension is typically the longest and most technically detailed. It covers the guardrail implementation, the incident detection and containment process, the red team schedule, and the operator's demonstrated understanding of their own attack surface. Assessors will ask about real incidents. If the operator has had no incidents in the prior twelve months, the assessor will ask how the operator knows that, which is itself an evidence point.

The interview for the Governance dimension is typically the most sensitive. It requires the senior accountability owner to speak to board-level awareness of the agent, the AI risk policy's status in the governance cycle, and the operator's vendor due diligence process for the model provider. Gaps in board awareness are one of the most common shortfalls found at this stage. Article 17 of Regulation (EU) 2024/1689 requires documented quality management systems for high-risk AI systems; the Governance dimension interview tests whether equivalent discipline exists regardless of high-risk classification.

The Autonomy Envelope interview is the one that most consistently surprises operators. Many organisations have an autonomy policy in principle but have not documented the specific impact thresholds that trigger human-in-the-loop requirements, and have not tested whether non-engineering staff can actually exercise revocation. Article 14 of Regulation (EU) 2024/1689 requires human oversight measures for high-risk systems; the Autonomy Envelope dimension applies the same discipline to all agents.

Stage three: Scoring

After evidence gathering is complete, the lead assessor scores each dimension against the published rubric. Scoring is not a single person's judgment: the methodology requires a second assessor to review the scores for any dimension where the lead assessor's raw score is above eight or below three. The review panel works by consensus: it produces a single agreed score for each dimension, with a note on any dimension where the reviewers disagreed before consensus.
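The second-assessor trigger is a simple rule on the lead assessor's raw scores. A minimal sketch, assuming "above eight or below three" is read strictly (so raw scores of nine, ten, one, and two are flagged); the function name is hypothetical.

```python
def dimensions_needing_second_review(lead_scores: dict[str, int]) -> list[str]:
    """Return the dimensions whose lead-assessor raw score triggers second review.

    A score is flagged when it is above eight or below three, read strictly.
    """
    return sorted(
        name for name, score in lead_scores.items() if score > 8 or score < 3
    )


flagged = dimensions_needing_second_review(
    {
        "Governance": 9,
        "Trust and Safety": 8,
        "AI Integration": 2,
        "Context Integrity": 3,
    }
)
# Governance (9) and AI Integration (2) are flagged; 8 and 3 sit on the boundary.
```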

How scores are calculated

Each dimension receives a raw score from one to ten based on the scoring rubric. The raw score is multiplied by the dimension weight (Trust and Safety 18, Governance 16, Context Integrity 14, Product Maturity 14, Autonomy Envelope 14, Distribution Control 12, AI Integration 12) and summed. The result is normalised to a one hundred point scale.
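The arithmetic can be sketched as follows. The weights are those listed above; the normalisation step is an assumption of this sketch (the weighted sum has a maximum of 1,000, so dividing by ten yields a one hundred point scale), since the article does not spell out the formula.

```python
# Dimension weights as stated in the article; they sum to 100.
WEIGHTS = {
    "Trust and Safety": 18,
    "Governance": 16,
    "Context Integrity": 14,
    "Product Maturity": 14,
    "Autonomy Envelope": 14,
    "Distribution Control": 12,
    "AI Integration": 12,
}


def weighted_total(raw_scores: dict[str, int]) -> float:
    """Weighted total on a one hundred point scale.

    Assumption: normalisation divides the weighted sum by the maximum
    raw score of ten.
    """
    if set(raw_scores) != set(WEIGHTS):
        raise ValueError("a raw score is required for each of the seven dimensions")
    for name, score in raw_scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"raw score for {name} must be between 1 and 10")
    weighted_sum = sum(raw_scores[name] * weight for name, weight in WEIGHTS.items())
    return weighted_sum / 10


example = {
    "Trust and Safety": 7,
    "Governance": 6,
    "Context Integrity": 5,
    "Product Maturity": 6,
    "Autonomy Envelope": 4,
    "Distribution Control": 5,
    "AI Integration": 6,
}
total = weighted_total(example)  # weighted sum 564, normalised to 56.4
```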

The five tiers and their weighted score thresholds are: Pre Assessment (below 20), In Progress (20 to 34), Certified (35 to 54), Advanced (55 to 74), Elite (75 and above). These thresholds are necessary but not sufficient conditions for a tier. Every tier also sets a minimum raw score per dimension. Certified requires a minimum raw score of four on every dimension. Advanced requires six. Elite requires eight. An operator scoring strongly across six dimensions but scoring two on the Trust and Safety dimension will not achieve Certified tier regardless of the weighted total. The framework does not reward lopsided agents.
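The interaction between thresholds and floors is easiest to see in code. This is a sketch under two stated assumptions: the article gives per-dimension floors only for Certified, Advanced, and Elite, so the two lower tiers are treated here as having no floor; and fractional totals are compared against the lower bound of each band.

```python
# (tier, minimum weighted total, minimum raw score on every dimension).
# A floor of None means the article states no floor for that tier
# (an assumption of this sketch).
TIER_RULES = [
    ("Elite", 75, 8),
    ("Advanced", 55, 6),
    ("Certified", 35, 4),
    ("In Progress", 20, None),
    ("Pre Assessment", 0, None),
]


def determine_tier(weighted_total: float, raw_scores: dict[str, int]) -> str:
    """Return the highest tier whose threshold and per-dimension floor are both met."""
    lowest = min(raw_scores.values())
    for tier, min_total, floor in TIER_RULES:
        if weighted_total >= min_total and (floor is None or lowest >= floor):
            return tier
    return "Pre Assessment"


# Six dimensions at eight and Trust and Safety at two gives a weighted
# total of 69.2, inside the Advanced band, but the floor caps the result.
lopsided = {
    "Trust and Safety": 2,
    "Governance": 8,
    "Context Integrity": 8,
    "Product Maturity": 8,
    "Autonomy Envelope": 8,
    "Distribution Control": 8,
    "AI Integration": 8,
}
result = determine_tier(69.2, lopsided)  # "In Progress"
```

The floor check is applied to the single lowest raw score, which is why one weak dimension is enough to cap the tier.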

How the framework handles uncertainty

Where evidence is ambiguous, the methodology instructs assessors to score the lower of the two plausible values. This is a deliberate design choice. The value of a certification result rests on its reliability as a signal. A framework that is generous with ambiguous evidence produces scores that are hard to rely on. Operators with genuinely strong controls will produce documentation that eliminates ambiguity. Where documentation is absent and the interview does not resolve the question, the absence is itself an evidence point.

Stage four: Tier determination and quality review

The tier determination is not mechanical. Once the weighted total and per-dimension floors have been applied, the lead assessor writes a brief narrative for each dimension. The narrative records what the operator did well, where the evidence was strong, and where the gaps were. The tier determination is then reviewed by a second assessor who has not seen the dimension narratives before. The reviewing assessor checks the tier against the scores and the narratives, and flags any dimension where the narrative and the score appear inconsistent.

This quality step is the point at which systematic assessor bias is most likely to be caught. Assessors who consistently score operators one point higher than the rubric supports on a specific dimension will be identified through this review. The process does not eliminate error, but it makes systematic drift visible.

Stage five: The final report

The final report is the deliverable. It is a structured document, typically twenty to thirty pages, issued to the named operator contact. It has six sections.

Report section one: Executive summary

One page. States the agent assessed, the date range of the assessment, the tier result, the weighted total score, and the per-dimension scores in a summary table. Written for a non-technical reader: a board member or a Chief Risk Officer should be able to read the executive summary in five minutes and understand the result without reading the rest of the report.

Report section two: Dimension findings

Seven subsections, one per dimension. Each subsection states the dimension score, the evidence reviewed, the interview findings, and a narrative assessment. The narrative distinguishes between what the operator has in place, what the assessor observed in practice, and where gaps were identified. The language is factual and specific: it names the control that is missing or weak, not merely the dimension that scored low.

Report section three: Standards crosswalk

Maps the assessment findings to relevant reference instruments. For each dimension finding, the crosswalk identifies the corresponding NIST AI Risk Management Framework function and category, the ISO/IEC 42001:2023 clause most directly relevant, and the EU AI Act article that addresses the same concern. This section is used by operators comparing their Agent Certified result to other compliance commitments and by legal and compliance teams preparing regulatory submissions.

Report section four: Priority gap list

A prioritised list of all shortfalls identified during the assessment. Each item in the gap list states the dimension, the specific evidence item that is missing or weak, the scoring impact (which rubric level the current state corresponds to, and what would be required to reach the next level), and an indicative effort estimate (low, medium, high) for closing the gap. The gap list is ordered by weighted impact: the item that would most improve the weighted total if closed is listed first.

The gap list is the practical planning tool for operators who want to move to a higher tier. It is written for the team that will action the gaps, not for the board. Operators consistently report that the gap list is the most used section of the report after the executive summary.
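The ordering rule can be illustrated with a short sketch. It assumes that closing a gap moves the dimension from its current rubric level to the next level named in the gap list, and that the weighted impact is the resulting change in the one hundred point total. The `Gap` class, its field names, and the sample entries are hypothetical.

```python
from dataclasses import dataclass

# Dimension weights as stated in the scoring section.
WEIGHTS = {
    "Trust and Safety": 18,
    "Governance": 16,
    "Context Integrity": 14,
    "Product Maturity": 14,
    "Autonomy Envelope": 14,
    "Distribution Control": 12,
    "AI Integration": 12,
}


@dataclass
class Gap:
    dimension: str
    missing_evidence: str
    current_raw: int  # rubric level the current state corresponds to
    next_raw: int     # rubric level reached if the gap is closed
    effort: str       # "low", "medium" or "high"

    @property
    def weighted_impact(self) -> float:
        """Points added to the weighted total if this gap is closed (assumed formula)."""
        return WEIGHTS[self.dimension] * (self.next_raw - self.current_raw) / 10


def prioritise(gaps: list[Gap]) -> list[Gap]:
    """Order gaps so the one with the largest weighted impact comes first."""
    return sorted(gaps, key=lambda gap: gap.weighted_impact, reverse=True)


ordered = prioritise(
    [
        Gap("Distribution Control", "no access review record", 4, 5, "low"),
        Gap("Governance", "no board paper referencing the agent", 3, 5, "medium"),
        Gap("Trust and Safety", "red team report older than twelve months", 5, 6, "high"),
    ]
)
# Governance (+3.2) comes first, then Trust and Safety (+1.8),
# then Distribution Control (+1.2).
```

Effort is carried on each item but deliberately not used for ordering, matching the article's statement that the list is ordered by weighted impact.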

Report section five: Underwriting summary

This section is structured specifically for insurer use. It is two to three pages and is formatted to align with the supplemental AI questionnaires used by carriers currently active in the European AI liability market, including Munich Re aiSure, Armilla, and Lloyd's AI-risk syndicates. It states the tier result, summarises the Autonomy Envelope and Trust and Safety dimension findings in language carriers recognise, and confirms the assessment scope, methodology version, and assessment date.

Insurers requesting AI governance evidence as part of their underwriting process will typically have asked the operator a set of open-ended questions about the agent. The underwriting summary translates the structured assessment result into a form that answers those questions without requiring the carrier to read the full report. The operator controls disclosure; the underwriting summary is issued to the operator and shared with carriers at the operator's discretion.

For a detailed account of how certification evidence affects insurance outcomes, see the companion article on how AI certification feeds into insurance underwriting. The connection between Article 26 of Regulation (EU) 2024/1689 (deployer obligations) and the product liability framework under Directive (EU) 2024/2853 is a relevant backdrop for risk leads approaching insurers: both instruments place documentation obligations on deployers that structured certification evidence directly addresses.

Report section six: Certification statement

A formal statement recording the tier result, the assessment period, the methodology version, the agent scoped, and the date of issue. The certification statement is the document most commonly shared with third parties. It states the result without the detail that operators may wish to keep internal. The statement is valid for twelve months from the date of issue, after which a reassessment is required to maintain currency.
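A third party receiving a certification statement may want to check whether it is still within its validity window. A minimal sketch, assuming the statement lapses on the anniversary of the date of issue (the article does not say whether the anniversary itself is included); both function names are hypothetical.

```python
from datetime import date


def expiry_date(issued: date) -> date:
    """Date twelve months after the date of issue."""
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        # Issued on 29 February: fall back to 28 February of the next year.
        return issued.replace(year=issued.year + 1, day=28)


def is_current(issued: date, today: date) -> bool:
    """True while the statement is inside its twelve-month window.

    Assumption: the statement is no longer current on the anniversary itself.
    """
    return issued <= today < expiry_date(issued)


still_valid = is_current(date(2025, 3, 10), date(2026, 3, 9))   # True
lapsed = is_current(date(2025, 3, 10), date(2026, 3, 10))       # False
```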

After the report: Reassessment and monitoring

A certification result has a twelve-month validity window. The agent, the infrastructure around it, and the governance context all change faster than an annual reassessment cycle can track. Operators who want their certification result to remain current between annual assessments should implement continuous monitoring on the dimensions most likely to drift: Trust and Safety (guardrail degradation, new attack vectors), Governance (staff changes, board paper gaps), and the Autonomy Envelope (scope creep, undocumented extensions to agent authority).

Operators who change the underlying model, move to a new model provider, or deploy the agent into a materially different use case should request an interim reassessment rather than waiting for the annual cycle. A model change is not a minor technical update: it can materially affect Trust and Safety and Context Integrity dimension scores.
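Operators who track agent configuration can flag these triggers automatically. The sketch below compares the agent as certified with the agent as currently deployed. The `AgentSnapshot` fields are hypothetical, and a plain string comparison on the use case is only a crude proxy: whether a change is material remains a judgment for the operator and the assessor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSnapshot:
    """Hypothetical record of the agent's state at a point in time."""
    model: str
    model_provider: str
    use_case: str


def interim_reassessment_reasons(
    certified: AgentSnapshot, current: AgentSnapshot
) -> list[str]:
    """List the changes that suggest requesting an interim reassessment."""
    reasons = []
    if current.model != certified.model:
        reasons.append("underlying model changed")
    if current.model_provider != certified.model_provider:
        reasons.append("model provider changed")
    if current.use_case != certified.use_case:
        reasons.append("use case changed")
    return reasons


certified = AgentSnapshot("model-a-v1", "provider-x", "internal document drafting")
current = AgentSnapshot("model-a-v2", "provider-x", "internal document drafting")
reasons = interim_reassessment_reasons(certified, current)
# Only the model version differs, so one reason is returned.
```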

Operators who receive an In Progress or Pre Assessment result on the first assessment can request a targeted reassessment covering only the dimensions where the gap list identified shortfalls. A targeted reassessment takes one to two weeks and focuses exclusively on the evidence relevant to the affected dimensions. It does not re-examine dimensions that were already scored at or above the tier floor.

Practical timeline for a risk lead

A risk lead coordinating their first Agent Certified assessment should plan for the following sequence.

Week one: Submit the assessment request through the assessment request page. Nominate the senior accountability owner, technical lead, and compliance representative. Schedule the intake session.

Weeks one to two: Hold the intake session. Receive and countersign the scoping document. Begin assembling documentation using the seven-dimension submission checklist. The companion article on preparing for an assessment is the most detailed guide for this stage.

Weeks two to four: Submit documentation through the document portal. Attend dimension interviews as scheduled. Flag any dimension where the operator anticipates a weak score so the assessor can plan the evidence gathering stage accordingly. Operators consistently report that proactive disclosure of known weaknesses produces more useful findings than attempting to minimise gaps during the interview stage.

Weeks four to five: Receive draft dimension narratives for factual review. The review window is five business days. The operator may correct factual errors in the narratives but may not request changes to scores directly. If the operator disputes a score, the dispute is escalated to the review panel.

Weeks five to six: Receive the final report. Distribute internally. Extract the underwriting summary for insurer distribution if required. Begin action planning against the priority gap list.

What the assessment does not cover

Two points of scope are consistently misunderstood by operators approaching their first assessment.

The assessment covers the agent in its current production state. It does not cover a planned future state, a staging environment, or a development version. If the operator intends to deploy a materially different version of the agent within the validity period, they should disclose that at intake so the scoping document reflects it.

The assessment is not a legal opinion on EU AI Act compliance. It is a structured evaluation of an agent against the seven-dimension framework, with findings mapped to relevant regulatory instruments. Operators of high-risk AI systems under Regulation (EU) 2024/1689 Annex III will need a conformity assessment as defined in Article 43 of that regulation, which is a separate process conducted by a notified body or, for most high-risk categories, by internal assessment procedures. An Agent Certified result constitutes significant preparatory evidence for a conformity assessment but is not a substitute for it. See the companion article on EU AI Act conformity assessment for high-risk AI systems for the distinction in detail.

Moffatt v Air Canada [British Columbia Civil Resolution Tribunal, 2024] and Mata v Avianca [S.D.N.Y., 2023] both illustrate the kind of accountability gap that structured certification is designed to surface: in both cases, the deploying organisation lacked documentation of what the AI system was authorised to do, who was accountable for its outputs, and how its limits had been tested. These are precisely the conditions that an Agent Certified assessment probes. The cases are cited here not as legal precedent but as concrete illustrations of why governance documentation matters before an incident, not after.