Methodology Analysis · 13 May 2026

Certifying AI agents built on GPAI foundation models. The dependency assessment challenge that standard frameworks miss.

By Future Proof Intelligence · Published 13 May 2026 · Reading time: 11 minutes

Most enterprise AI agents deployed today do not rely on proprietary models built from scratch. They rely on GPAI foundation models provided by third parties: OpenAI, Anthropic, Google, Mistral, Meta, or one of the growing population of open-weight model providers. The certification challenge for these systems differs from the challenge of certifying a purpose-built AI, because the deployer does not control the model's training, its capability boundaries, or its evolution over time. The seven-dimension Agent Certified framework addresses this challenge directly. This analysis explains how.

Key takeaways

  • A GPAI-based agent depends on a model the deployer does not control. Certification must address not only the agent layer but the adequacy of the supply-chain arrangements that govern that dependency.
  • Regulation (EU) 2024/1689 Articles 51 to 56 impose obligations on GPAI model providers, including documentation of capabilities and limitations, that deployers should obtain and verify before deployment.
  • Capability boundary testing at the agent layer is the deployer's primary tool for understanding what the model does in their specific deployment context. Provider documentation describes what the model can do generally; boundary testing reveals what it actually does in your prompting architecture and use cases.
  • Model version changes are a material change event requiring re-evaluation of affected use cases. A certification status that is not refreshed after a significant model update does not provide valid assurance.
  • Insurance underwriters treat undocumented GPAI dependencies as unbounded risk. Certification evidence demonstrating supply-chain transparency, boundary testing, and change management is the difference between an insurable and an uninsurable deployment.

The GPAI dependency landscape for enterprise agents

The population of enterprise AI agents deployed across European businesses in 2026 is predominantly built on three types of GPAI dependency. The first is a direct API call to a hosted model from a provider such as OpenAI, Anthropic, or Google. The deployer sends prompts, receives completions, and builds the agent's decision logic around the model's output. The second is a fine-tuned or retrieval-augmented version of a foundation model, where the deployer has adapted the model's behaviour for their domain by adding training data or retrieval context, but the underlying model architecture remains a third-party product. The third is an open-weight model deployed on the deployer's own infrastructure, such as a Llama or Mistral model, where the deployer controls the hosting but still relies on a training run they did not conduct.
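To make the supply-chain facts concrete, the sketch below shows one way a deployer might structure an entry in a model dependency register. It is illustrative only: the field names and the three-way type split mirror the taxonomy above, but nothing here is a prescribed schema from the certification framework.

```python
from dataclasses import dataclass, field
from enum import Enum

class GPAIDependencyType(Enum):
    """The three dependency patterns described above."""
    HOSTED_API = "hosted_api"    # direct API call to a hosted model
    FINE_TUNED = "fine_tuned"    # fine-tuned or retrieval-augmented base model
    OPEN_WEIGHT = "open_weight"  # self-hosted open-weight model

@dataclass
class ModelDependencyRecord:
    """One entry in the deployer's model dependency register."""
    provider: str                      # model provider's legal name
    model_id: str                      # provider's model identifier
    model_version: str                 # pinned version or snapshot date
    dependency_type: GPAIDependencyType
    article_53_docs_reviewed: bool     # provider documentation obtained and reviewed
    permitted_uses: list[str] = field(default_factory=list)
```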

Each dependency type creates different certification considerations. Direct API users are most exposed to model changes outside their control and least able to verify the model's internal governance. Fine-tuners have some control over behaviour in their domain but inherit the base model's failure modes in areas outside their fine-tuning scope. Open-weight self-hosters have operational control but typically lack the technical documentation that a closed-model provider is required to produce under Article 53 of Regulation (EU) 2024/1689.

What all three types share is the absence of independent technical documentation covering the model's full capability and limitation profile. The deployer knows what the model does in the scenarios they have tested. They do not know what it does in scenarios they have not anticipated, how it behaves under adversarial inputs, or what changes the provider may make to the model's alignment and safety behaviour in future updates.

What the EU AI Act requires from GPAI model providers and deployers

Articles 51 to 56 of Regulation (EU) 2024/1689 establish the obligations specific to GPAI models. Article 53 requires GPAI model providers to prepare and keep updated technical documentation, publish a sufficiently detailed summary of the content used to train the model, and establish policies for downstream deployers specifying the purposes they authorise for the model's use. Article 55 imposes additional obligations on providers of GPAI models with systemic risk, a classification presumed where the cumulative compute used for training exceeds 10^25 floating-point operations, including adversarial testing ("red-teaming") and serious-incident reporting to the European AI Office.

For deployers, the key implication of Articles 51 to 56 is that they should request and verify the Article 53 documentation before deploying a GPAI-based agent at scale, particularly for high-risk use cases. The provider's documentation of the model's known capabilities and limitations, combined with the permitted use policy, defines the envelope within which the deployer is authorised to operate. Deploying the model outside that envelope creates both compliance risk under Article 25(1) (which can convert a deployer into a provider for the modified use) and insurance risk, as the deployment moves outside the scope that any available AI liability policy was written to cover.
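In code terms, the envelope test is a two-part condition: documentation reviewed, use case authorised. A deliberately minimal sketch follows; the inputs are assumptions about how a deployer might record the provider's policy, not a mandated format.

```python
def envelope_check(use_case: str,
                   permitted_uses: set[str],
                   article_53_docs_reviewed: bool) -> str:
    """Classify a deployment against the provider's permitted-use envelope."""
    if not article_53_docs_reviewed:
        return "blocked: obtain and review provider documentation first"
    if use_case not in permitted_uses:
        # Out-of-envelope use risks Article 25(1) provider conversion and
        # falls outside the scope most AI liability policies are written for.
        return "out_of_envelope"
    return "in_envelope"
```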

For the broader regulatory context on GPAI model obligations, see agentliability.eu's briefing on Articles 51 to 56. For the insurance coverage dimension of GPAI deployments, the coverage framework on agentinsured.eu addresses how current policies handle foundation model dependencies.

How the seven dimensions apply to GPAI-based agents

The seven dimensions of the Agent Certified framework each require specific adaptation when the agent being assessed relies on a GPAI foundation model rather than a purpose-built or proprietary system.

The Governance and Accountability dimension requires documented ownership of the decision to use a specific GPAI model, the criteria applied in selecting it, and the ongoing accountability arrangement for monitoring the model relationship. For GPAI-based agents, this must include a named owner responsible for tracking provider change notifications and assessing their implications.

The Risk Management and Documentation dimension requires a risk register that explicitly identifies the model dependency as a category of operational risk, with documented mitigations. The residual risks that the model's known limitations create for the specific deployment context must be identified and either mitigated or accepted with documented rationale.
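A register entry for a model dependency risk might be structured as below. The fields are illustrative, chosen to capture the mitigated-or-accepted-with-rationale requirement described above; they are not a schema the framework prescribes.

```python
from dataclasses import dataclass

@dataclass
class ModelDependencyRisk:
    risk_id: str
    description: str  # a known model limitation mapped to this deployment
    category: str = "model_dependency"  # its own category of operational risk
    mitigation: str | None = None
    residual_accepted: bool = False
    acceptance_rationale: str | None = None

    def is_documented(self) -> bool:
        """Each risk must be mitigated or accepted with documented rationale."""
        return bool(self.mitigation) or (
            self.residual_accepted and bool(self.acceptance_rationale))
```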

The Training and Data Governance dimension, which in purpose-built systems addresses the training data used to build the model, applies to GPAI-based agents primarily through the provider's Article 53 documentation. The assessor evaluates whether the deployer has obtained and reviewed that documentation and whether the model's known training data profile creates material risks for the intended use case (for example, a model trained predominantly on English-language data deployed in a multilingual customer service context).

The Autonomy Envelope dimension, addressed in our earlier analysis on specifying autonomous action boundaries, requires particular care for GPAI-based agents because the model's outputs are inherently probabilistic. The deployer must specify not only which categories of action the agent may take autonomously but also the conditions under which the model's output is considered sufficient evidence for an autonomous action. A GPAI model that generates a plausible recommendation is not the same as a deterministic rule engine that produces a verified output. The autonomy specification must reflect that distinction.
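To illustrate the distinction, a sketch of an agent-layer autonomy gate follows. The action categories, the self-consistency threshold, and the deterministic check are hypothetical placeholders: the framework specifies what must be gated, not how a deployer implements the gate.

```python
from enum import Enum

class Disposition(Enum):
    EXECUTE = "execute"    # within the autonomy envelope
    ESCALATE = "escalate"  # route to a human reviewer

# Hypothetical policy: which action categories may ever run autonomously,
# and the minimum evidence required before a probabilistic output is acted on.
AUTONOMOUS_CATEGORIES = {"refund_under_limit", "status_update"}
MIN_SELF_CONSISTENCY = 0.9  # fraction of repeated samples that must agree

def autonomy_gate(action_category: str,
                  self_consistency: float,
                  deterministic_check_passed: bool) -> Disposition:
    """Treat model output as evidence for an action, not as a verified decision."""
    if action_category not in AUTONOMOUS_CATEGORIES:
        return Disposition.ESCALATE
    if not deterministic_check_passed:           # independent rule-based validation
        return Disposition.ESCALATE
    if self_consistency < MIN_SELF_CONSISTENCY:  # output too unstable to act on
        return Disposition.ESCALATE
    return Disposition.EXECUTE
```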

The Human Oversight dimension requires that designated persons can meaningfully interpret the agent's outputs and escalation signals. For GPAI-based agents, this means oversight personnel must understand the model's probabilistic nature and known failure modes. An oversight procedure designed for a traditional decision-support tool will not be adequate for an agent that generates reasoning outputs from a stochastic model.

The Monitoring and Logging dimension requires that the agent layer generates its own monitoring record, independent of any logs the model provider generates. The reason is practical: if the provider changes its logging format or data retention policies, or deprecates the API version the agent uses, the deployer must not lose their operational monitoring capability. The agent-layer logging must be under the deployer's control and must satisfy the Article 12 logging requirements, and the Article 26(6) log-retention obligations, for any high-risk AI system within scope.
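One way to implement a provider-independent monitoring record is a structured, append-only event log in storage the deployer owns, sketched below. The field set is an assumption chosen to support change attribution and retention; it is not a mandated schema.

```python
import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("agent_events.jsonl")  # storage the deployer controls

def log_agent_event(use_case: str, model_id: str, model_version: str,
                    prompt_hash: str, output_hash: str, disposition: str) -> None:
    """Append one agent-layer event; survives provider-side logging changes."""
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "use_case": use_case,
        "model_id": model_id,            # records the dependency per event
        "model_version": model_version,  # supports later change attribution
        "prompt_hash": prompt_hash,      # hashes avoid storing raw personal data
        "output_hash": output_hash,
        "disposition": disposition,      # e.g. "execute" or "escalate"
    }
    with LOG_PATH.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
```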

The Supply Chain and Model Dependency dimension is the dimension most specific to GPAI-based agents. It evaluates the completeness of the deployer's supply-chain transparency, including: the model provider's documentation review; the capability boundary testing conducted at the agent layer; the procedure for detecting and responding to model version changes; the fallback arrangement if the provider deprecates the model or changes the API; and the contractual protections in the provider agreement relevant to the deployer's liability exposure.
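The five evidence items read naturally as a completeness check over the evidence file. A trivial sketch, assuming the file is tracked as a simple mapping of item names to booleans:

```python
SUPPLY_CHAIN_EVIDENCE = [
    "provider_documentation_review",
    "capability_boundary_test_results",
    "model_change_detection_procedure",
    "fallback_arrangement",
    "contractual_liability_protections",
]

def supply_chain_gaps(evidence_file: dict[str, bool]) -> list[str]:
    """Return the evidence items still missing from the evidence file."""
    return [item for item in SUPPLY_CHAIN_EVIDENCE
            if not evidence_file.get(item, False)]
```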

Capability boundary testing: what it is and how to conduct it

Capability boundary testing is the process of systematically evaluating how the GPAI model behaves at the edges of the intended use cases, under adversarial inputs, and in scenarios the deployer can anticipate but has not designed the agent to handle. It is the deployer's primary tool for understanding what the model actually does in their deployment context, as distinct from what the provider's general documentation says it can do.

Effective boundary testing covers at least four categories:

  • Normal-case testing: the model should perform the intended task accurately in the expected input distribution, with documented accuracy benchmarks.
  • Edge-case testing: inputs that are unusual but plausible, including ambiguous requests, incomplete information, and requests that push the boundaries of the intended use case.
  • Adversarial testing: deliberate attempts to produce harmful, misleading, or out-of-scope outputs through prompt injection, jailbreaking attempts, or manipulation of the context window.
  • Refusal testing: verification that the model declines categories of request it should decline and that the refusal behaviour is consistent and not bypassable through rephrasing.
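As one illustration of how such a suite can be organised, the Python sketch below runs a versioned test set across the four categories and returns a result record suitable for the evidence file. The `call_model` wrapper, the case structure, and the pass predicates are assumptions for the sketch, not part of the framework's specification.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BoundaryCase:
    category: str                       # "normal", "edge", "adversarial", "refusal"
    prompt: str
    expectation: Callable[[str], bool]  # predicate the model output must satisfy

def run_boundary_suite(call_model: Callable[[str], str],
                       cases: list[BoundaryCase],
                       model_version: str) -> dict:
    """Run the versioned test set and return a record for the evidence file."""
    results = [{"category": c.category,
                "prompt": c.prompt,
                "passed": c.expectation(call_model(c.prompt))}
               for c in cases]
    failures = [r for r in results if not r["passed"]]
    return {"model_version": model_version,  # ties results to the version tested
            "total": len(results),
            "failed": len(failures),
            "failures": failures}
```

A refusal case, for example, might pair a prohibited request with a predicate that checks the output contains a refusal rather than a completion.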

The test set should be documented, versioned, and re-run when the model version changes. NIST AI 600-1 (the Generative AI Profile, published July 2024) provides a structured framework for evaluating GPAI models that maps onto the boundary testing requirements of the certification framework. The ISO/IEC 42001:2023 management system standard's monitoring and measurement requirements are also relevant.

Model version changes and certification refresh

One of the structural differences between certifying a GPAI-based agent and certifying a purpose-built system is that the model the agent depends on can change without the deployer taking any action. A provider that updates a model's alignment training, changes its safety filters, or modifies its default behaviour may or may not notify deployers in advance, and the notification may or may not specify which use cases are affected.

In the Agent Certified framework, a material model change is treated as triggering a partial reassessment of the affected certification dimensions. The deployer must re-run boundary testing for affected use cases, update the evidence file, and determine whether the change affects the autonomy specification or human oversight procedure. If the reassessment produces evidence that the agent's behaviour in the affected use cases has changed materially, the certification status for those use cases is suspended until the evidence file is updated and verified.
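A minimal sketch of how the trigger could be wired follows, reusing a boundary suite runner like the one sketched earlier. The status values and the suspension rule follow the framework's description; the function names and the shape of the result are illustrative.

```python
from typing import Callable

def handle_model_update(pinned_version: str, observed_version: str,
                        rerun_boundary_suite: Callable[[str], dict],
                        affected_use_cases: list[str]) -> dict:
    """Treat a model version change as a material change event."""
    if observed_version == pinned_version:
        return {"status": "certified", "action": "none"}
    # Re-run the versioned boundary test set against the new model version.
    report = rerun_boundary_suite(observed_version)
    if report["failed"] > 0:
        # Behaviour changed materially: certification for the affected use
        # cases is suspended until the evidence file is updated and verified.
        return {"status": "suspended",
                "use_cases": affected_use_cases,
                "evidence": report}
    return {"status": "certified",
            "action": "evidence_file_updated",
            "evidence": report}
```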

This is not hypothetical. Claude 3 replacing Claude 2, GPT-4 replacing GPT-3.5 Turbo at the API level, and successive Llama releases each produced measurable behavioural changes for downstream deployers. A certification programme that does not account for model version changes provides a static assurance about a dynamic system, which is not assurance in any meaningful sense.

Insurance implications of GPAI dependency

Insurance underwriters evaluating AI agent deployments consistently identify GPAI dependency as one of the most significant risk factors they assess. The concern is not simply that foundation models can fail: it is that the failure modes of a GPAI model are harder to characterise than those of a purpose-built system, the deployer's control over those failure modes is limited, and the deployer's evidence of having exercised appropriate care is harder to produce without a systematic certification process.

A deployer presenting for underwriting with a complete Agent Certified evidence file demonstrating supply-chain transparency, capability boundary testing, and a model change management procedure addresses the underwriter's primary concerns. A deployer with no documentation of how the GPAI model was evaluated before deployment, no procedure for model changes, and no boundary test results presents an unbounded risk that most underwriters will either decline or price at a premium reflecting the absence of controls evidence. For coverage guidance, see the Agent Insured coverage framework.

Frequently asked questions

What is a GPAI-based AI agent and why does the certification challenge differ?

A GPAI-based agent uses a general-purpose AI foundation model from a third-party provider as its core reasoning capability. The certification challenge differs because the deployer does not control the model's training, capability boundaries, or update cadence. Certification must address both the agent layer that the deployer controls and the supply-chain dependency that they do not.

How does the EU AI Act treat GPAI models under Articles 51 to 56?

Articles 51 to 56 of Regulation (EU) 2024/1689 require GPAI model providers to prepare technical documentation, provide capability and limitation summaries, and establish permitted use policies for downstream deployers. Providers of GPAI models with systemic risk (presumed where cumulative training compute exceeds 10^25 FLOP) face additional adversarial testing and incident reporting obligations. Deployers should obtain and verify this documentation before deploying GPAI-based agents in high-risk use cases.

What documentation must a deployer produce for a GPAI-based agent to support certification?

The deployer must document the model provider's permitted use policy and evidence that the deployment falls within it; the capability boundary testing results at the agent layer; the prompt architecture and system instructions; the monitoring and logging arrangements; the human oversight procedure; and the model change management process. The complete evidence file forms the basis for both certification and insurance underwriting.

How does the certification framework handle model version changes?

Model version changes are treated as a material change event requiring re-evaluation of affected use cases. The deployer must re-run boundary testing, update the evidence file, and confirm whether the autonomy specification or oversight procedure requires adjustment. Certification status for affected use cases is suspended until the updated evidence file is verified.

Does certification of a GPAI-based agent affect insurance eligibility?

Yes. Insurance underwriters treat undocumented GPAI dependencies as unbounded risk. Certification evidence demonstrating supply-chain transparency, capability boundary testing, and change management procedures addresses the primary underwriting concerns and is typically the difference between an insurable and an uninsurable deployment at commercially viable terms.

References

  1. Regulation (EU) 2024/1689, Articles 51 to 56: obligations for providers of general-purpose AI models.
  2. Regulation (EU) 2024/1689, Article 53: technical documentation and information obligations for GPAI model providers.
  3. Regulation (EU) 2024/1689, Article 25(1): conditions under which a deployer becomes a provider for purposes of a modified system.
  4. National Institute of Standards and Technology. NIST AI 600-1: Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. July 2024.
  5. ISO/IEC 42001:2023. Information technology: Artificial intelligence: Management system. Monitoring, measurement, and performance evaluation requirements.
  6. Agent Certified. Methodology specification, May 2026 version, Supply Chain and Model Dependency dimension. Published at agentcertified.eu/methodology.html.
  7. AIUC-1 reference standard. AI Underwriting Company, 2025. Supply chain and model dependency documentation requirements for coverage eligibility.
  8. Moffatt v. Air Canada. 2024 BCCRT 149. Civil Resolution Tribunal of British Columbia. February 2024. On operator responsibility for AI agent outputs.

Related reading

  • Full methodology: The seven-dimension framework and what each dimension requires in full.
  • The autonomy envelope: How to specify and certify the boundaries of autonomous action.
  • Request an assessment: Begin the formal assessment process for your GPAI-based agent deployments.