Key takeaways
  • The performance and reliability dimension covers four sub-dimensions: accuracy against documented baselines, robustness to adversarial and distributional inputs, drift detection and response procedures, and failure mode documentation. Each is scored separately; the composite score forms the dimension result.
  • Accuracy assessment requires metrics appropriate to the system's function: precision, recall, and F1 for classifiers; mean absolute error for predictors; task completion rate and factual grounding rate for generative agents. Metrics must be reported against an evaluation dataset representing the production input distribution, not a curated test set.
  • Robustness scoring assesses performance under three categories of deviation: adversarial inputs designed to cause failures, edge cases at the margins of the system's intended operational envelope, and distributional shift from training data to production data. Systems that have not been tested against all three categories score lower regardless of their headline accuracy.
  • Drift detection scoring requires both monitoring infrastructure and a defined response procedure: a documented escalation path that triggers human review, deployment pause, or model refresh when drift metrics cross defined thresholds. Monitoring without a response protocol satisfies the detection obligation only partially.
  • The performance dimension score is the most direct certification signal for Munich Re aiSure eligibility. Parametric coverage triggers require pre-agreed performance thresholds, which can only be defined if the operator has established baselines, implemented monitoring, and documented the acceptable operational range.

Why performance and reliability form a single dimension

Performance and reliability are grouped into a single dimension because they are different aspects of the same underlying question: can this system be trusted to do what it is supposed to do, consistently, across the range of conditions it will actually encounter? Performance addresses the question at a point in time: does the system meet its accuracy specifications under current conditions? Reliability addresses it over time: does the system maintain those specifications as conditions change, as data drifts, as adversarial users probe its edges?

An AI agent that performs well under controlled evaluation conditions but degrades rapidly under production variability is not reliable, even if its benchmark accuracy is impressive. An agent that maintains stable performance over time but whose baseline accuracy was never adequately established is not well-assessed, even if it has been running in production for months. The dimension requires both: documented baseline performance, and evidence that the system maintains that baseline under the actual conditions of production deployment.

The regulatory basis for this grouping is Article 15 of Regulation (EU) 2024/1689, which requires high-risk AI systems to achieve appropriate levels of accuracy, robustness, and cybersecurity throughout their lifecycle, including in the face of errors, faults, inconsistencies, and unexpected situations. The lifecycle requirement is what makes performance and reliability a single regulatory obligation rather than two separate ones. A system that satisfies the accuracy requirement at launch but has no mechanism for detecting or responding to degradation over time has not satisfied the lifecycle dimension of Article 15.

For the relationship between this dimension and the governance infrastructure that supports it, see the governance dimension certification requirements. For the connection between performance monitoring and the trust and transparency dimension's human oversight requirements, see the trust and transparency dimension analysis.

Sub-dimension 1: accuracy baselines

Accuracy assessment in the certification context is not a question of whether the system is accurate in some general sense. It is a question of whether the system's accuracy has been measured, documented, and reported at a level of specificity that allows meaningful comparison against operator claims and regulatory requirements. The distinction matters because "this system is 95 percent accurate" is not a meaningful statement unless the reader knows what accuracy means for this system type, which dataset it was measured on, against which ground truth, and across which input categories.

For classification systems, the dimension assesses whether the operator has established and reports precision, recall, and F1 score for each decision class, not just overall accuracy. A binary classifier that achieves 95 percent overall accuracy by correctly classifying 99 percent of the majority class while misclassifying 50 percent of the minority class has a headline accuracy that overstates its utility and masks a significant performance problem. Certification assessment requires disaggregated reporting: separate metrics for each output class, including the classes where errors are most consequential for affected individuals.
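The masking effect described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in plain Python, not a prescribed certification tool: the function name per_class_metrics, the labels "approve" and "reject", and the 9,200/800 class split are illustrative assumptions chosen so that the headline figure comes out near 95 percent.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    metrics = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical evaluation set: 9,200 majority cases and 800 minority cases.
y_true = ["approve"] * 9200 + ["reject"] * 800
# 99% of the majority class is classified correctly, 50% of the minority class.
y_pred = (["approve"] * 9108 + ["reject"] * 92
          + ["reject"] * 400 + ["approve"] * 400)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)

print(f"overall accuracy: {accuracy:.4f}")
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

On this hypothetical data the overall accuracy is about 95 percent while recall for the minority class is 50 percent, which is the gap that disaggregated reporting is meant to expose.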

False positive and false negative rates are assessed separately for systems where the two error types have different consequences. An employment screening system that misclassifies qualified candidates as unqualified (false negatives) creates one type of harm; a system that misclassifies unqualified candidates as qualified (false positives) creates another. Both matter to the assessment, but the relative weight of each depends on the deployment context and the consequences for affected individuals.

For regression and prediction systems, the relevant metrics are mean absolute error, root mean squared error, or equivalent measures appropriate to the specific prediction task. For time-series forecasting systems used in demand planning or resource allocation, the dimension also assesses whether the operator reports accuracy at different forecast horizons, since accuracy at short horizons frequently differs substantially from accuracy at longer horizons relevant to actual decision-making.
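Reporting error by forecast horizon can be sketched as follows. This is an illustration under stated assumptions, not a required method: the function name errors_by_horizon, the record layout (horizon, actual, forecast), and the sample values are all hypothetical.

```python
import math
from collections import defaultdict


def errors_by_horizon(records):
    """records: iterable of (horizon, actual, forecast).

    Returns {horizon: (mean_absolute_error, root_mean_squared_error)}.
    """
    grouped = defaultdict(list)
    for horizon, actual, forecast in records:
        grouped[horizon].append(actual - forecast)
    return {
        horizon: (
            sum(abs(e) for e in errs) / len(errs),
            math.sqrt(sum(e * e for e in errs) / len(errs)),
        )
        for horizon, errs in sorted(grouped.items())
    }


# Hypothetical demand forecasts at 1-week and 4-week horizons.
records = [
    (1, 100, 98), (1, 110, 113), (1, 95, 94),
    (4, 100, 90), (4, 110, 124), (4, 95, 101),
]

by_horizon = errors_by_horizon(records)
for horizon, (mae, rmse) in by_horizon.items():
    print(f"horizon={horizon}: MAE={mae:.2f} RMSE={rmse:.2f}")
```

In this made-up sample the mean absolute error at the 4-week horizon is five times the error at the 1-week horizon, a difference that a single pooled figure would hide.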

For generative AI agents, accuracy assessment focuses on task completion rate, factual grounding rate, and refusal rate. Task completion rate measures the proportion of user queries the agent successfully addresses within its intended operational scope. Factual grounding rate measures the proportion of factual claims in agent outputs that can be verified against identified source material. Refusal rate measures the proportion of out-of-scope queries the agent correctly declines to address rather than attempting to answer outside its competence. A generative agent with a high task completion rate but a low factual grounding rate is completing tasks through confabulation rather than reliable information retrieval, which is a performance failure even if the headline completion rate looks strong.
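The three generative-agent metrics can be computed from a reviewed sample of interactions. The sketch below assumes each interaction has already been labelled by a reviewer; the Interaction fields, the function name generative_metrics, and the sample data are hypothetical and are not taken from the certification rubric.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    in_scope: bool        # query falls within the intended operational scope
    completed: bool       # reviewer judged the query successfully addressed
    refused: bool         # agent declined to answer
    claims_total: int     # factual claims found in the output
    claims_grounded: int  # claims verified against identified source material


def _rate(numerator, denominator):
    # None signals "not measurable on this sample" rather than a misleading 0.
    return numerator / denominator if denominator else None


def generative_metrics(interactions):
    in_scope = [i for i in interactions if i.in_scope]
    out_of_scope = [i for i in interactions if not i.in_scope]
    return {
        "task_completion_rate": _rate(sum(i.completed for i in in_scope), len(in_scope)),
        "factual_grounding_rate": _rate(
            sum(i.claims_grounded for i in interactions),
            sum(i.claims_total for i in interactions),
        ),
        "refusal_rate": _rate(sum(i.refused for i in out_of_scope), len(out_of_scope)),
    }


# Hypothetical reviewed sample: four in-scope and two out-of-scope queries.
sample = [
    Interaction(True, True, False, 5, 5),
    Interaction(True, True, False, 4, 2),
    Interaction(True, True, False, 3, 3),
    Interaction(True, False, False, 0, 0),
    Interaction(False, False, True, 0, 0),
    Interaction(False, False, False, 2, 0),
]

rates = generative_metrics(sample)
print(rates)
```

Computing the grounding rate over claims rather than over responses is a design choice made for this sketch; an operator might reasonably report both.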

The evaluation dataset requirement is one of the most consequential aspects of accuracy assessment. Certification scoring treats metrics reported against a curated test set that does not represent the production input distribution as less reliable than metrics reported against an evaluation dataset constructed to reflect the actual range of inputs the system will encounter in production. A curated test set optimized for benchmark performance provides weaker evidence of production accuracy than a held-out sample of real production queries evaluated against validated ground truth. First assessments frequently reveal that providers have documented accuracy against benchmark datasets rather than production-representative evaluation sets, which is a common and significant documentation gap.

Sub-dimension 2: robustness testing

Robustness assessment addresses the question that accuracy assessment alone cannot answer: does the system perform acceptably under conditions that deviate from the distribution its accuracy metrics were measured on? Robustness testing covers three categories of deviation that certification assessment treats as distinct.

The first category is adversarial inputs: inputs specifically designed to cause the system to fail, produce incorrect outputs, or behave in ways that differ from its intended function. For language model-based agents, adversarial testing includes prompt injection attempts, jailbreak attempts, and inputs that attempt to redirect the agent from its intended function. For classification systems, adversarial inputs include perturbed inputs that differ minimally from legitimate inputs but are designed to cross decision boundaries incorrectly. Certification scoring does not require that a system be impervious to all adversarial inputs, which is not achievable for most current AI architectures. It requires that the operator has tested the system against a defined set of adversarial inputs, documented the results, and implemented mitigations for failure modes that would be unacceptable in the deployment context.

The second category is edge cases: inputs at the margins of the system's intended operational envelope that were not well-represented in training data. Edge cases differ from adversarial inputs in that they are not designed to cause failures. They arise naturally from the diversity of real-world inputs. A customer service agent trained primarily on standard query types may encounter unusual formulations, technical vocabulary outside its training domain, or queries that combine standard elements in unexpected ways. Edge case testing documents how the system handles these inputs: does it fail gracefully and route to a human, or does it produce confident but incorrect outputs? The graceful failure requirement is explicit in the certification rubric: a system that recognises its own limitations and escalates appropriately scores higher on edge case robustness than one that produces low-confidence outputs without signalling uncertainty to the user or the oversight operator.

The third category is distributional shift: the difference between the distribution of inputs the system was trained on and the distribution of inputs it encounters in production. Distributional shift is the most common and most underappreciated robustness challenge for deployed AI systems. A model trained on data from one time period, one geographic market, or one customer population will encounter distributional differences as the real-world environment changes. Language patterns change. Customer demographics shift. Product catalogues update. Regulations change the questions customers ask. Each of these changes is a form of distributional shift that can degrade performance without any single identifiable failure event. Certification scoring for distributional shift assesses whether the operator has tested the system against a dataset that reflects the expected production distribution, documented the expected magnitude of distributional differences, and implemented monitoring that can detect when actual production inputs diverge from the expected distribution.

Sub-dimension 3: drift detection and response

Drift detection addresses the temporal dimension of performance: the fact that AI system performance can degrade over time as the world changes and the gap between training data and production data widens. Certification scoring for drift detection assesses three elements that together constitute a complete drift management programme.

Input drift monitoring tracks changes in the statistical properties of production inputs. A simple form of input drift monitoring measures the statistical distance between production input distributions and the training distribution using metrics such as population stability index, Kullback-Leibler divergence, or Jensen-Shannon divergence. More sophisticated monitoring tracks specific features that are known to be predictive of the model's outputs and that are likely to shift over time. When the monitored features drift beyond a defined threshold, the monitoring system generates an alert that triggers the response procedure. An operator who monitors production input distributions and can show alert history is in a substantially better position than one who does not monitor at all.
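Two of the distance measures named above can be sketched over binned feature counts. This is a minimal illustration that assumes the feature has already been bucketed into the same bins for training and production data; the bin counts and the ALERT_THRESHOLD value are hypothetical and would need to be set by the operator.

```python
import math

EPS = 1e-6  # floor for empty bins so the logarithms stay defined


def _proportions(counts):
    total = sum(counts)
    return [max(c / total, EPS) for c in counts]


def population_stability_index(expected_counts, actual_counts):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = _proportions(expected_counts)
    actual = _proportions(actual_counts)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def _kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def jensen_shannon_divergence(expected_counts, actual_counts):
    """Symmetric divergence, bounded above by ln(2) for true distributions."""
    p = _proportions(expected_counts)
    q = _proportions(actual_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


# Hypothetical bin counts for one monitored feature.
training_bins = [250, 250, 250, 250]
production_bins = [100, 200, 300, 400]

ALERT_THRESHOLD = 0.2  # hypothetical value, not taken from the certification rubric

psi = population_stability_index(training_bins, production_bins)
jsd = jensen_shannon_divergence(training_bins, production_bins)
alert = psi > ALERT_THRESHOLD
print(f"PSI={psi:.4f} JSD={jsd:.4f} alert={alert}")
```

Jensen-Shannon divergence is symmetric and bounded, which makes thresholds easier to compare across features; PSI is unbounded and more sensitive to sparsely populated bins.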

Output drift monitoring tracks changes in the statistical properties of the system's outputs. Output drift can occur even when input drift is not detected, particularly when the mapping from inputs to outputs changes due to model degradation, infrastructure changes, or model updates in upstream components. Output drift monitoring can be implemented by tracking the distribution of output classes over time for classification systems, tracking the distribution of confidence scores, or for generative systems, sampling and human-reviewing a proportion of outputs on a defined schedule to assess whether output quality is stable. The certification dimension does not specify a single monitoring approach because the appropriate approach depends on the system's architecture and output type. What it requires is that some output monitoring exists, that it is implemented and running in production, and that its results are reviewed on a defined schedule.
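For generative systems, one way to select outputs for scheduled human review is deterministic hash-based sampling, so that the same output is always either in or out of the review set and the selection can be audited later. The sketch below is an assumption about how such sampling might be implemented, not a described requirement; the function name, the identifier format, and the 5 percent rate are hypothetical.

```python
import hashlib


def selected_for_review(output_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of outputs for human review."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and compare it
    # against the sampling rate scaled to the same range.
    return int.from_bytes(digest[:8], "big") < rate * 2**64


# Hypothetical output identifiers from a production log.
output_ids = [f"output-{n}" for n in range(10_000)]
review_queue = [oid for oid in output_ids if selected_for_review(oid, 0.05)]
print(f"{len(review_queue)} of {len(output_ids)} outputs queued for review")
```

Because the selection depends only on the identifier, re-running the job produces the same review queue, which makes the review records reproducible evidence.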

The response procedure is the element most frequently missing from otherwise adequate drift monitoring programmes. An operator who monitors drift metrics, generates alerts, and then has no defined process for responding to those alerts has monitoring infrastructure but not a functioning drift management programme. The certification scoring for this sub-dimension requires that the operator document a specific response procedure: who receives the alert, what review they are expected to conduct, what decision criteria determine whether to continue deployment, pause the system, or initiate a model refresh, and how the response and its rationale are documented. The documentation of responses to past alerts is the strongest possible evidence for this sub-dimension: it demonstrates that the procedure is not only written but executed.
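The elements of the response procedure listed above (who receives the alert, what decision is taken, and how the rationale is recorded) can be captured in a simple structured log. The following is a sketch under stated assumptions: the AlertResponse fields, the three decision labels, and the example values are hypothetical, and a real procedure would also encode the operator's own decision criteria.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Decision options named in the text: continue deployment, pause, or refresh.
ALLOWED_DECISIONS = {"continue", "pause", "refresh"}


@dataclass(frozen=True)
class AlertResponse:
    metric: str
    value: float
    threshold: float
    owner: str       # person or role that received the alert
    decision: str
    rationale: str
    decided_at: str  # ISO 8601 timestamp in UTC


def log_response(metric, value, threshold, owner, decision, rationale):
    """Validate and record the response to a drift alert."""
    if value < threshold:
        raise ValueError("no alert: value is below the threshold")
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"decision must be one of {sorted(ALLOWED_DECISIONS)}")
    if not rationale.strip():
        raise ValueError("a rationale is required for every decision")
    return AlertResponse(
        metric, value, threshold, owner, decision, rationale,
        datetime.now(timezone.utc).isoformat(),
    )


# Hypothetical alert and response.
record = log_response(
    metric="psi:customer_segment",
    value=0.31,
    threshold=0.25,
    owner="ml-oncall",
    decision="pause",
    rationale="PSI above threshold on two consecutive daily checks.",
)
print(record)
```

Rejecting a response that has no rationale is the point of the sketch: the accumulated records become the documented history of executed responses.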

Sub-dimension 4: failure mode documentation

Failure mode documentation is the most qualitative sub-dimension in the performance and reliability assessment. It requires the operator to identify, document, and communicate to deployers and oversight staff the specific ways in which the system is known to fail. Every AI system has failure modes: conditions under which it produces incorrect outputs, behaves unpredictably, or fails to complete its intended function. A system that has no documented failure modes either has never failed or has failed without those failures being systematically analysed. The first is implausible for any system with meaningful production history. The second is a governance failure.

The certification assessment treats failure mode documentation as an indicator of organisational maturity in AI operations. Organisations that have systematically analysed their AI systems' failure patterns, documented those patterns, communicated them to relevant stakeholders, and incorporated them into the risk management and human oversight design are better positioned to prevent harm than those that treat failures as isolated incidents rather than as signals of systemic patterns. This is the logic behind the Article 72 post-market monitoring requirements in Regulation (EU) 2024/1689, which require providers of high-risk AI systems to collect, record, and analyse data on the performance of their systems throughout their operational lifecycle.

Failure mode documentation at the level the dimension requires includes four components. First, a typology of failure modes: a structured categorisation of the ways the system can produce incorrect or harmful outputs, derived from testing, production incident review, and analysis of near-misses. Second, documented frequency and severity estimates for each failure mode: how often does each type of failure occur under current conditions, and what is the potential impact on affected individuals when it does. Third, the mitigation measures in place for each failure mode: the design choices, operational controls, and human oversight arrangements that reduce the frequency or severity of each type of failure. Fourth, the residual risk assessment: the remaining failure probability and impact after mitigations are applied, which determines the oversight sensitivity the deployment requires.
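The four components can be held in one structured register so that each failure mode carries its own estimates, mitigations, and residual risk. The sketch below is illustrative only: the field names, the 1 to 5 severity scale, the frequency unit, and the residual score (residual frequency multiplied by residual severity) are assumptions made for this example, not part of the certification rubric.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    name: str
    category: str                      # typology bucket
    frequency_per_1k: float            # estimated occurrences per 1,000 interactions
    severity: int                      # hypothetical scale: 1 (minor) to 5 (severe)
    mitigations: list = field(default_factory=list)
    residual_frequency_per_1k: float = 0.0  # estimate after mitigations
    residual_severity: int = 1              # estimate after mitigations

    def residual_score(self) -> float:
        # Assumed scoring rule for ranking only: frequency times severity.
        return self.residual_frequency_per_1k * self.residual_severity


# Hypothetical register entries.
register = [
    FailureMode("misrouted escalation", "workflow", 5.0, 2,
                ["routing rule review"], 4.0, 2),
    FailureMode("confabulated policy detail", "factual grounding", 12.0, 4,
                ["retrieval grounding check"], 3.0, 4),
    FailureMode("follows embedded instruction", "adversarial input", 0.5, 5,
                ["input filtering", "tool allow-list"], 0.1, 5),
]

ranked = sorted(register, key=lambda fm: fm.residual_score(), reverse=True)
for fm in ranked:
    print(f"{fm.residual_score():5.1f}  {fm.category:18s} {fm.name}")
```

Ranking by residual rather than unmitigated risk points oversight attention at the failure modes that remain most likely to reach affected individuals.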

This documentation is not only a certification requirement. It is the operational specification for human oversight. Oversight staff who know the system's failure modes are better positioned to detect and respond to failures than staff who have received only general training on AI risks. The certification process for generative AI agents describes how failure mode documentation applies specifically to systems whose outputs vary dynamically rather than being drawn from a fixed decision set.

Mapping to Article 15 and NIST AI RMF MEASURE

Article 15 of Regulation (EU) 2024/1689 sets out the accuracy, robustness, and cybersecurity requirements for high-risk AI systems. Article 15(1) requires high-risk AI systems to achieve appropriate levels of accuracy, robustness, and cybersecurity, and to perform consistently with respect to those levels throughout their lifecycle. Article 15(3) requires that accuracy levels and the relevant accuracy metrics be declared in the instructions for use accompanying the system. Article 15(4) requires that systems be as resilient as possible against errors, faults, or inconsistencies occurring within the system or its environment.

The performance and reliability dimension maps directly to each of these requirements. The accuracy sub-dimension satisfies Article 15(3) by requiring documented, metric-specific accuracy declarations against production-representative evaluation data. The robustness sub-dimension satisfies the resilience requirement in Article 15(4) by requiring testing against adversarial inputs, edge cases, and distributional shift. The drift detection sub-dimension satisfies Article 15(1)'s lifecycle requirement by requiring monitoring and response procedures that maintain performance levels over time. The failure mode documentation sub-dimension also supports the Article 15(4) resilience requirement by requiring systematic analysis and communication of known failure patterns.

The NIST AI Risk Management Framework's MEASURE function, described in NIST AI 100-1 (January 2023) and extended for generative AI in NIST AI 600-1 (July 2024), addresses the same operational territory. The framework is voluntary; its MEASURE function calls on organisations to evaluate, assess, and track AI risks and benefits, and to maintain monitoring of AI systems in production. MEASURE subcategory 2.4 addresses monitoring of the system's functionality and behaviour in production. MEASURE subcategory 2.5 addresses demonstrating that the system is valid and reliable and documenting the limits of its generalisability. MEASURE subcategory 2.6 addresses regular evaluation of safety risks, including whether the system can fail safely. MEASURE subcategory 4.1 addresses connecting measurement approaches to the deployment context and documenting them. The performance and reliability dimension certification requirements are consistent with and in many cases more specific than the NIST AI RMF MEASURE subcategories, making a strong performance dimension score useful evidence of NIST alignment for operators seeking to demonstrate it to enterprise customers or government procurement assessors.

ISO/IEC 42001:2023, the AI management system standard, addresses performance evaluation in Clause 9. Clause 9.1 requires organizations to determine what to monitor and measure regarding AI systems' performance and effectiveness, at what intervals, and how to analyse and evaluate the results. The clause explicitly requires that monitoring cover the system's performance against defined objectives throughout its operational lifecycle. Certification assessment for the performance and reliability dimension is consistent with Clause 9.1 requirements and produces documentation that satisfies the evidence requirements a Clause 9.1 audit would apply. For a detailed analysis of how ISO 42001 implementation maps to the certification framework, see the ISO 42001 implementation guide.

Performance dimension and parametric insurance coverage

The connection between the performance and reliability dimension and insurance coverage is more direct than for any other dimension in the certification framework. Munich Re aiSure, the parametric AI performance insurance product from Munich Re's Special Enterprise Risks division, settles claims on measurable performance data rather than on traditional loss adjustment. The product's coverage structure requires that trigger conditions be defined before coverage begins: specific performance thresholds that, when breached, activate the policy's payment mechanism.

This structure is elegant for insurers and valuable for operators: it removes the uncertainty of claims adjustment, provides fast settlement when triggers are breached, and aligns insurer and operator incentives around maintaining defined performance levels. But it has one absolute prerequisite. The performance thresholds that trigger coverage can only be defined if the operator has established the baseline performance metrics against which deviation is measured. An operator without documented accuracy baselines cannot propose coverage triggers. An operator without drift monitoring infrastructure cannot demonstrate that trigger conditions are detectable. An operator without failure mode documentation cannot credibly represent that the trigger conditions capture the relevant risk scenarios.

The performance and reliability dimension score is, in practice, the most direct certification signal for aiSure eligibility. Operators who score at level 7 or above across the dimension's four sub-dimensions have the baseline documentation, monitoring infrastructure, and failure mode analysis needed to engage in parametric trigger design with Munich Re's Special Enterprise Risks team. Operators who score below 5 on any sub-dimension typically face a preliminary work requirement before trigger design can proceed, because the informational basis for trigger setting is incomplete.
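The two score bands described above can be expressed as a simple check. This sketch encodes only what the text states (7 or above on every sub-dimension, or below 5 on any); the function name, the sub-dimension keys, and the "unspecified" label for scores in between are assumptions, and the outcome is an indicative reading, not an insurer's decision rule.

```python
SUB_DIMENSIONS = ("accuracy", "robustness", "drift_detection", "failure_modes")


def trigger_design_readiness(scores):
    """Classify a set of sub-dimension scores against the two stated bands."""
    missing = [d for d in SUB_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing sub-dimension scores: {missing}")
    lowest = min(scores[d] for d in SUB_DIMENSIONS)
    if lowest >= 7:
        return "ready_for_trigger_design"
    if lowest < 5:
        return "preliminary_work_likely"
    # The text does not say what happens between 5 and 7, so neither does this.
    return "unspecified"


# Hypothetical score sets.
strong = {"accuracy": 8, "robustness": 7, "drift_detection": 7, "failure_modes": 9}
weak = {"accuracy": 8, "robustness": 7, "drift_detection": 4, "failure_modes": 9}
print(trigger_design_readiness(strong), trigger_design_readiness(weak))
```

Using the lowest sub-dimension score rather than an average reflects the wording in the text, where a single weak sub-dimension is enough to delay trigger design.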

Beyond Munich Re aiSure, the performance dimension evidence base is relevant to any insurer considering coverage for an AI deployment. Underwriters pricing AI liability coverage need to assess the probability and severity of performance failures to price the policy appropriately. Operators who can present documented performance baselines, robustness testing results, drift monitoring data, and failure mode analyses give underwriters the information they need to price more precisely and to offer better terms than operators who present only qualitative governance descriptions. For the full analysis of how certification scores feed into the underwriting submission, see how certification feeds insurance underwriting.

Preparing for the performance and reliability assessment

Five steps structure an effective preparation process for the performance and reliability dimension assessment.

Step 1: establish your accuracy metrics and evaluation datasets. For each AI agent in scope, identify the metrics appropriate to the system's function. Confirm that those metrics are reported against evaluation datasets that represent the production input distribution. Where the current evaluation dataset is a benchmark or a curated test set, invest in constructing or commissioning a production-representative evaluation sample before the assessment. The gap between benchmark and production-representative accuracy is consistently the most significant finding in first assessments and the one with the most direct impact on scores.

Step 2: document your robustness testing programme. For each AI agent, list the adversarial testing, edge case testing, and distributional shift testing that has been conducted. Collect the results and, critically, the mitigations that were implemented in response. An assessment that presents only testing results without the mitigation actions that followed is incomplete: the dimension requires evidence that testing findings were acted on, not just recorded.

Step 3: implement and document drift monitoring. If production monitoring is not in place, implement it before requesting an assessment. The dimension cannot be scored adequately without evidence of monitoring infrastructure. Identify the input and output metrics that are most predictive of performance drift for your specific system. Establish alert thresholds. Define the response procedure. Conduct and document a test of the response procedure to confirm it works as designed.

Step 4: conduct a failure mode analysis. Review production incident logs, testing records, and any near-miss reports from oversight staff. Identify the patterns. Create a typology of failure modes, with frequency and severity estimates and the mitigations currently in place. This analysis does not need to be exhaustive, but it needs to be systematic: a failure mode typology derived from structured review of evidence is more credible and more useful than one assembled from memory.

Step 5: assemble the evidence bundle. Before requesting the assessment, assemble the documentation for each sub-dimension: evaluation datasets and metric reports for accuracy, testing records and mitigation documentation for robustness, monitoring configuration and alert records for drift detection, and the failure mode typology for failure mode documentation. Assessment without a prepared evidence bundle produces lower scores not because capability is absent but because the certification process cannot verify undocumented capability. The evidence bundle is the artefact that converts operational practice into certifiable performance.
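A pre-assessment completeness check on the evidence bundle can be as simple as comparing what has been assembled against the artefacts listed in Step 5. The following is a minimal sketch: the dictionary keys and artefact labels are hypothetical names for the items described above, and the bundle contents are made up.

```python
# Artefacts per sub-dimension, paraphrased from Step 5; labels are hypothetical.
REQUIRED_EVIDENCE = {
    "accuracy": {"evaluation_dataset", "metric_report"},
    "robustness": {"testing_records", "mitigation_documentation"},
    "drift_detection": {"monitoring_configuration", "alert_records"},
    "failure_modes": {"failure_mode_typology"},
}


def missing_evidence(bundle):
    """Return {sub_dimension: [missing artefacts]} for incomplete sub-dimensions."""
    gaps = {}
    for sub_dimension, required in REQUIRED_EVIDENCE.items():
        absent = required - set(bundle.get(sub_dimension, ()))
        if absent:
            gaps[sub_dimension] = sorted(absent)
    return gaps


# Hypothetical bundle that is missing mitigation records and all drift evidence.
bundle = {
    "accuracy": ["evaluation_dataset", "metric_report"],
    "robustness": ["testing_records"],
    "failure_modes": ["failure_mode_typology"],
}

gaps = missing_evidence(bundle)
for sub_dimension, artefacts in gaps.items():
    print(f"{sub_dimension}: missing {', '.join(artefacts)}")
```

Running a check like this before requesting the assessment surfaces documentation gaps while there is still time to close them.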

Frequently asked questions

What does the performance and reliability dimension of AI agent certification assess?

The performance and reliability dimension assesses four sub-dimensions: accuracy against documented baselines with appropriate metrics for the system type; robustness to adversarial inputs, edge cases, and distributional shift; drift detection procedures with defined monitoring infrastructure and response protocols; and failure mode documentation with a structured typology of known failure patterns, frequency and severity estimates, mitigation measures, and residual risk assessment. Each sub-dimension is scored separately; the composite forms the dimension result.

How does the performance and reliability dimension map to EU AI Act Article 15?

Article 15 of Regulation (EU) 2024/1689 requires high-risk AI systems to achieve appropriate levels of accuracy, robustness, and cybersecurity throughout their lifecycle, and to be resilient against errors, faults, and inconsistencies. The accuracy sub-dimension satisfies Article 15(3)'s declaration requirement. The drift detection sub-dimension satisfies Article 15(1)'s lifecycle requirement. The robustness sub-dimension satisfies the resilience requirement in Article 15(4) through testing against adversarial inputs, edge cases, and distributional shift, and the failure mode documentation sub-dimension supports the same requirement by requiring systematic analysis and communication of known failure patterns.

What accuracy metrics does the performance and reliability dimension require?

The appropriate metrics depend on the system's function. For classifiers: precision, recall, F1 score, and disaggregated false positive and false negative rates per decision class. For regression systems: mean absolute error or root mean squared error. For generative agents: task completion rate, factual grounding rate, and refusal rate for out-of-scope queries. All metrics must be reported against evaluation datasets representing the production input distribution, not against curated benchmark datasets that do not reflect real-world conditions.

How does drift detection scoring work in the performance dimension?

Drift detection scoring assesses three elements: input drift monitoring (tracking changes in production input distribution), output drift monitoring (tracking changes in output distribution or quality), and a defined response procedure (a documented escalation path that triggers human review, deployment pause, or model refresh when drift metrics cross defined thresholds). A monitoring system that detects drift but has no defined response procedure satisfies the detection obligation only partially and scores lower than one with a documented and tested response protocol.

Why does the performance dimension determine parametric insurance eligibility?

Munich Re aiSure's parametric structure requires pre-agreed performance thresholds as coverage triggers. Those triggers can only be defined if the operator has established baseline performance metrics, implemented monitoring infrastructure, and documented the acceptable operational range. An operator without these elements cannot engage in parametric trigger design and cannot access parametric AI insurance. The performance and reliability dimension score is the most direct certification signal for aiSure eligibility: operators who score at level 7 or above across all four sub-dimensions have the documentation base needed to proceed to trigger design with Munich Re's Special Enterprise Risks team.