Security and Resilience in AI Agent Certification

Security and resilience is the certification dimension that most often surprises operators during assessment. It is not primarily a question about whether the underlying model is safe. It is a question about whether the deployed system, in its specific operational context, is robust against interference, manipulation, and failure. The distinction matters because a model that passes safety benchmarks at training time may fail catastrophically when deployed in a real-world environment with adversarial users, third-party integrations, and operational conditions that differ from its training distribution.

Key takeaways

The security and resilience dimension evaluates four properties: adversarial robustness, cyberattack resistance, failsafe behaviour, and incident response capacity. It maps directly to Article 15 of Regulation (EU) 2024/1689.
Prompt injection is the most common high-severity finding in this dimension for agentic AI systems that can take real-world actions. Operators frequently underestimate the attack surface created by tool use and external data retrieval.
EU AI Act Article 15(3) specifically requires that high-risk AI systems be resilient against attempts by unauthorised third parties to alter their use, outputs, or performance by exploiting system vulnerabilities. This is a legal requirement, not a best practice.
The NIST AI 600-1 GenAI Profile (July 2024) identifies prompt injection, data poisoning, and model extraction as the three primary adversarial threats specific to generative AI. Each requires a distinct mitigation approach.
A passing score on this dimension requires documented testing evidence, not just architectural controls. Deployers must show that adversarial testing has been conducted and that findings have been remediated or accepted with documented rationale.

What the dimension assesses

The security and resilience dimension in the Agent Certified methodology addresses the question of whether an AI agent can be trusted to behave as intended when operating in an adversarial environment. This is a narrower question than general model safety, and a more operationally specific one. It does not ask whether the model's outputs are accurate on average. It asks whether a determined adversary can make the model produce outputs it should not produce, take actions it should not take, or fail in ways that cause harm to users or to the deploying organisation.

Four sub-dimensions structure the evaluation.

The first sub-dimension is adversarial robustness: resistance to inputs specifically designed to cause the agent to behave outside its intended parameters. This includes prompt injection attacks, jailbreak attempts, adversarial examples in image or audio inputs where applicable, and attempts to extract system prompt contents or bypass operating constraints through indirect instruction.

The second sub-dimension is cyberattack resistance: protection of the AI system's infrastructure against conventional and AI-specific cyberattacks. This includes data poisoning attacks on the operational data pipeline, model inversion attacks that attempt to extract training data from model outputs, membership inference attacks that test whether specific data was in the training set, and API abuse patterns that attempt to enumerate the model's behaviour to find exploitable edge cases.

The third sub-dimension is failsafe behaviour: the agent's response when it encounters inputs, outputs, or operational conditions that fall outside its design domain. A well-designed agent fails safely by refusing to act, escalating to a human supervisor, and generating a log entry that supports post-incident analysis. A poorly designed agent fails catastrophically by hallucinating actions, producing harmful outputs, or continuing to operate on corrupted inputs without flagging the anomaly.

The fourth sub-dimension is incident response capacity: the deployer's ability to detect a security incident involving the AI agent, contain the impact, remediate the vulnerability, and restore normal operation. This sub-dimension evaluates process and organisation rather than the technical properties of the AI system itself. A technically secure AI agent deployed by an organisation with no incident response plan scores lower on this sub-dimension than a less technically secure agent deployed by an organisation with a mature security response function.

Article 15 of the EU AI Act: the regulatory standard

The legal foundation for the security and resilience dimension is Article 15 of Regulation (EU) 2024/1689. The article is structured around three properties: accuracy, robustness, and cybersecurity. All three are required throughout the AI system's lifecycle, not just at initial deployment.

Article 15(1) states that high-risk AI systems shall be designed and developed in such a way that they achieve an appropriate level of accuracy, robustness and cybersecurity, and perform consistently in those respects throughout their lifecycle. The phrase "appropriate level" introduces a proportionality consideration: the standard is not absolute but calibrated to the system's risk profile and use context. A high-risk AI system used in credit scoring is held to a different standard than one used in industrial equipment operation, even though both fall within Annex III.

Article 15(3) is the provision most directly relevant to adversarial robustness. It requires that high-risk AI systems be resilient against unauthorised third-party attempts to alter the system's use, outputs, or performance by exploiting system vulnerabilities. The term "system vulnerabilities" is broad. It covers not only conventional software vulnerabilities but also AI-specific vulnerabilities such as prompt injection attack surfaces, model parameter sensitivity to adversarial inputs, and weaknesses in the system's boundary between instruction-following and user-input-following.

Article 15(4) addresses the robustness requirement specifically in the context of errors, faults, and inconsistencies that may arise within the system or its environment, including adversarial examples. The explicit reference to adversarial examples in the legislative text is significant. It signals that the drafters were aware of the adversarial robustness literature and intended the robustness requirement to encompass AI-specific attack vectors, not just conventional software faults.

The cybersecurity requirement in Article 15 connects to the ENISA cybersecurity framework and to NIS2's requirements for operators of essential services and digital infrastructure. Where a high-risk AI system is part of a larger network and information system that falls within NIS2 scope, the AI Act's cybersecurity requirement and NIS2's security obligations apply in parallel and must be met simultaneously.

Prompt injection: the dominant finding in agentic AI assessments

In assessments conducted against the Agent Certified framework, prompt injection vulnerabilities are the most frequently identified high-severity finding in the security and resilience dimension, specifically for AI agents that use tools, access external data sources, or execute actions in real-world systems.

The attack surface is fundamentally different for agentic AI compared to conversational AI that only generates text. A conversational AI that produces a harmful text response causes harm through the content of its output. An agentic AI that executes a harmful action causes harm through the consequence of that action. The difference in impact severity is substantial: a prompt injection attack that causes an AI agent to send an unauthorised email, execute a database query, or initiate a financial transaction produces real-world consequences that cannot be reversed by simply retraining the model.

NIST AI 600-1, the Generative AI Profile of the NIST AI RMF published in July 2024, identifies prompt injection as a primary adversarial threat for generative AI systems. The Profile distinguishes between direct prompt injection, in which an adversarial user inserts malicious instructions into the conversational input, and indirect prompt injection, in which the malicious instructions are embedded in external data that the agent retrieves and processes as part of its reasoning. Indirect prompt injection is particularly difficult to defend against because it exploits the agent's design intent to process external data rather than a vulnerability in the underlying model.

Effective mitigation for prompt injection requires a defence-in-depth approach. Input validation that checks conversational turns for instruction-injection patterns before they reach the model. System prompt protection that prevents the user from overriding core operating constraints through conversational manipulation. Output filtering that checks generated text for anomalous instruction-following before actions are executed. Tool-access controls that require explicit authorisation for each tool invocation rather than allowing the model to call tools freely based on its own judgement. And audit logging that records every tool call, every action taken, and every external data source accessed, so that a prompt injection incident can be reconstructed in post-incident analysis.

Data poisoning: the supply chain threat

Data poisoning attacks target the training or operational data that shapes an AI agent's behaviour. Unlike prompt injection, which attacks the agent at inference time, data poisoning attacks the agent before deployment by introducing corrupted data into the training pipeline. The goal of a data poisoning attack is typically to create a backdoor in the model: a specific trigger input that causes the model to produce a predetermined harmful output while behaving normally on all other inputs.

For deployers using commercially provided foundation models, the primary data poisoning risk is not in the model's core training, which is managed by the provider, but in the retrieval-augmented generation (RAG) pipeline or fine-tuning dataset that the deployer controls. An adversary who can introduce corrupted documents into the deployer's RAG knowledge base, or corrupt the data used for fine-tuning, can potentially introduce backdoor behaviour into the deployed agent without access to the underlying model parameters.

ISO/IEC 42001:2023 Annex A control A.6.2 addresses AI system development processes and requires documented procedures for data quality management including the identification and mitigation of data quality risks. For agentic AI systems, data quality risk extends to the operational data that the agent retrieves and processes at runtime. Deployers must implement controls not only on their training and fine-tuning datasets but on every external data source the agent accesses during operation.

The Agent Certified assessment evaluates whether the deployer has documented the data sources that influence the agent's behaviour, has assessed each source for poisoning risk, has implemented integrity controls on those sources appropriate to the assessed risk level, and has a monitoring capability to detect anomalous patterns in agent output that might indicate a successful poisoning attack.

Failsafe design: what happens when the agent encounters the unexpected

Failsafe design is the engineering discipline of ensuring that a system's failure modes are predictable, bounded, and non-catastrophic. For AI agents, failsafe design addresses the question of what the agent does when it encounters inputs, contexts, or operational conditions outside the distribution on which it was trained and tested.

Three common failure modes appear frequently in Agent Certified assessments. The first is hallucinated confidence: the agent produces a response with high apparent confidence on a topic where it has no reliable information, rather than expressing uncertainty or declining to respond. This failure mode is particularly harmful in agentic contexts because confident-sounding outputs are more likely to trigger subsequent tool calls or actions based on the hallucination.

The second common failure mode is context boundary confusion: the agent fails to distinguish between its own reasoning and user-provided content, either treating user claims about the world as established facts or treating its own previous outputs as authoritative sources when reasoning about subsequent steps. This failure mode amplifies the impact of prompt injection and is particularly difficult to detect because it may not produce obviously wrong outputs on any single turn.

The third failure mode is action scope creep: the agent interprets its operational mandate broadly and takes actions outside its intended scope when it judges them to be helpful or necessary to complete a task. This failure mode is not an attack but an emergent behaviour from the agent's optimisation objective. An agent tasked with scheduling meetings that begins making changes to calendar settings, inviting third parties, or accessing adjacent systems is exhibiting action scope creep. The harm potential is proportional to the breadth of the agent's tool access.

Failsafe design controls for these failure modes include explicit uncertainty handling that requires the agent to decline rather than guess when confidence falls below a threshold; strict context separation that prevents the agent from treating user input as instruction; scope constraints that limit tool access to the minimum required for the task; and human-in-the-loop checkpoints that require authorisation before the agent takes actions above a defined impact threshold. The Agent Certified framework evaluates whether these controls exist, whether they have been tested, and whether the testing evidence is documented.

How security and resilience connects to certification and insurance

The security and resilience dimension is one of the certification dimensions that insurers are most likely to inquire about during the underwriting process. This reflects the specific nature of the risk it addresses: security failures in AI agents create first-party losses (operational disruption, data breach) as well as third-party losses (harm to users, regulatory penalties). Both loss types are potentially insurable, but the insurability depends on the deployer having implemented controls that a reasonable underwriter would expect.

Munich Re's aiSure product and Armilla's AI liability coverage both consider the deployer's security posture as part of their risk assessment. An operator who can demonstrate documented adversarial testing, a maintained patch cycle for the AI system's infrastructure, a prompt injection mitigation architecture, and a tested incident response procedure is presenting a materially different risk profile from an operator who has not addressed these elements. The Agent Certified assessment provides a structured, evidence-based record of the security controls that have been implemented and tested, which translates directly into the underwriting information that a prospective insurer needs to assess and price the risk.

For the full coverage framework and how certification evidence connects to insurance eligibility, see agentinsured.eu's analysis of certification and insurance underwriting. For the regulatory obligations that underpin the technical security requirements, see the Article 15 deployer guide on agentliability.eu.

Frequently asked questions

What does the security and resilience dimension measure?

The security and resilience dimension evaluates whether an AI agent can maintain its intended performance when subject to adversarial inputs, cyberattacks, and operational failures. It assesses four properties: resistance to prompt injection and adversarial manipulation, protection against data poisoning, failsafe behaviour outside the operational design domain, and the deployer's capacity to detect, contain, and recover from security incidents.

How does Article 15 of the EU AI Act define the security standard for high-risk AI?

Article 15 of Regulation (EU) 2024/1689 requires that high-risk AI systems achieve an appropriate level of accuracy, robustness, and cybersecurity throughout their lifecycle. Article 15(3) requires resilience against attempts by unauthorised third parties to alter the system's use, outputs, or performance by exploiting system vulnerabilities. Article 15(4) addresses robustness against errors and adversarial examples.

What is prompt injection and why does it matter for AI agent security?

Prompt injection is an attack in which a user or an external data source inserts malicious instructions into the input stream of an AI agent, causing it to deviate from its intended behaviour. For agentic AI systems that can take real-world actions such as executing code, sending communications, or accessing databases, prompt injection is a high-severity threat because a successful attack can redirect the agent to perform actions that cause direct harm. The Agent Certified security dimension evaluates whether the deployer has implemented input validation, system prompt protection, output filtering, and tool-access controls sufficient to mitigate prompt injection risk.

How does the security dimension relate to NIST AI RMF and ISO 42001?

The Agent Certified security and resilience dimension maps to NIST AI RMF's GOVERN and MEASURE functions, specifically the MEASURE 2.5 and MEASURE 2.6 sub-categories addressing adversarial testing and system resilience. It also maps to ISO/IEC 42001:2023 Annex A controls A.6.2 and A.8, which require documented processes for assessing and managing AI-specific risks including adversarial manipulation. NIST AI 600-1 (GenAI Profile, July 2024) adds specific guidance on prompt injection, data poisoning, and model extraction threats.

References

Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Article 15 (accuracy, robustness and cybersecurity), Article 15(1) (appropriate level standard), Article 15(3) (resilience against unauthorised third-party attacks), Article 15(4) (robustness against adversarial examples). OJ L, 12.7.2024.
NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, January 2023. GOVERN and MEASURE functions, sub-categories MEASURE 2.5 and MEASURE 2.6.
NIST, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 2024. Section 2: Generative AI risks, including prompt injection (GV-1), data poisoning (GV-2), and model extraction (GV-3).
ISO/IEC 42001:2023, Information technology: Artificial intelligence: Management system. Annex A, controls A.6.2 (AI system development processes) and A.8 (AI system operation).
AIUC-1, AI Underwriting and Certification Standard, version 1, ElevenLabs and AIUC, 2024. Security and resilience assessment criteria for AI agent underwriting.
Regulation (EU) 2022/2554 of the European Parliament and of the Council of 14 December 2022 on digital operational resilience for the financial sector (DORA). Relevant to AI security for financial entities.
Directive (EU) 2022/2555 (NIS2). Article 21 (cybersecurity risk-management measures) for operators of essential services and digital infrastructure deploying AI systems.

Security and resilience in AI agent certification. What the dimension measures and why it matters.