AI Training Data Copyright 2026: IP Risks & EU/US Rules

The EU Regulatory Architecture: AI Act Meets Copyright Directive

General-purpose AI (GPAI) model providers operating in 2026 face a compliance obligation that did not exist two years ago. Article 53 of Regulation (EU) 2024/1689, the EU AI Act, now imposes enforceable copyright obligations on every provider that places a GPAI model on the EU market, regardless of where training occurs. This is not a guidance document or a voluntary framework. It is binding law with direct applicability across Member States, and its territorial scope extends beyond the Union’s borders in ways that many US and Asian providers have yet to fully assess.

The divergence between the EU’s opt-out-based Text and Data Mining framework and the US fair use defense has ceased to be a theoretical legal debate. For organizations training models under US assumptions and deploying them to EU users, it is now an operational governance crisis. The EU’s position, articulated in Recital 106, holds that copyright obligations attach to models offered in the EU market regardless of the jurisdiction in which the copyright-relevant acts underpinning training took place. This territoriality expansion creates simultaneous, sometimes conflicting obligations for any provider with cross-border operations.

Article 53(1)(c) and (d): The Dual Obligation Framework

Article 53(1)(c) requires GPAI providers to put in place a policy to comply with Union copyright law, specifically including the reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790—the DSM Directive. Article 53(1)(d) requires providers to draw up and make publicly available a sufficiently detailed summary of the content used for training the general-purpose AI model, according to a template provided by the AI Office.

These are dual obligations, not alternatives. The copyright compliance policy must be operational; the training data summary must be public. Both apply to all GPAI providers placing models on the EU market, including those headquartered outside the Union. The relationship to the DSM Directive is explicit: Article 53 does not create new copyright exceptions but mandates compliance with existing ones. The TDM exception under Article 4 of the DSM Directive permits commercial text and data mining on works that rights holders have not reserved in machine-readable form. Where a valid reservation exists, the exception does not apply, and licensing becomes necessary.

The AI Office template, published on July 24, 2025, specifies what “sufficiently detailed” means in practice. Providers must disclose dataset modalities, sizes, languages, acquisition dates, major publicly available datasets with identifiers, and web crawler specifications including identifiers, purposes, and content types. The template requires disclosure of the top 10 percent of domains crawled by volume and any large publicly available dataset constituting more than 3 percent of total training data. It also requires description of measures to avoid or remove illegal content, including blacklists, keyword filters, and model-based classifiers.

A critical safeguard is built into the template: providers may withhold information that would compromise confidential business information or trade secrets, provided the summary remains sufficiently detailed to meet the regulatory purpose. The distinction between “sufficiently detailed” and “work-by-work assessment” is deliberate. The AI Office does not require itemization of every copyrighted work used, but it does require enough granularity to enable rights holders and regulators to assess compliance.

The Text and Data Mining Exception Under Pressure

The scope of the TDM exception under Article 4 of the DSM Directive has become the central legal battleground for generative AI training in Europe. Article 4 permits reproductions and extractions for text and data mining of works lawfully accessible, provided the rights holder has not reserved those rights in an appropriate machine-readable form. Article 3, by contrast, covers scientific research TDM and does not permit opt-out.

The definitional question—whether generative AI training constitutes “text and data mining” within the meaning of the Directive—remains unresolved at the highest level. The Hamburg Higher Regional Court addressed this in its 2025 decision concerning LAION, a non-profit organization that compiled datasets for machine learning research. The court held that the creation of training datasets could fall within the TDM exception, but it also clarified that natural language reservations of rights are insufficient. Only machine-readable opt-out signals, such as robots.txt protocols, meta tags, or the emerging TDM Reservation Protocol, satisfy the Article 4(3) requirement.

A reference for a preliminary ruling is pending before the Court of Justice of the EU in case C-250/25 (Like Company), which may provide authoritative guidance on whether generative AI model training falls within the TDM exception and what constitutes a valid machine-readable reservation. Until the CJEU rules, national courts will continue to interpret the scope differently, creating compliance uncertainty for providers operating across multiple Member States.

The machine-readable standards landscape is evolving. robots.txt remains the most widely implemented mechanism, but its limitations are well-documented: it is not standardized for TDM-specific reservations, it does not cover all content types, and its enforcement relies on crawler compliance. The ai.txt initiative and the TDM Reservation Protocol offer more granular approaches, but adoption remains fragmented. Providers must currently monitor multiple signaling systems and maintain dynamic crawler blocking configurations that can be updated as rights holders change their reservations.
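A robots.txt check of the kind described above can be sketched with Python’s standard library. This is a minimal illustration, not a compliance tool: the crawler token ExampleTrainingBot and the sample rules are hypothetical, and a robots.txt disallow is only one of several signals a rights holder might use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler token; a real provider would use its published user agent.
CRAWLER_UA = "ExampleTrainingBot"


def may_fetch(robots_txt: str, url: str, user_agent: str = CRAWLER_UA) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules (assumed for illustration): the training crawler is excluded,
# every other agent is allowed.
ROBOTS = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

training_bot_allowed = may_fetch(ROBOTS, "https://example.org/articles/1")
other_bot_allowed = may_fetch(ROBOTS, "https://example.org/articles/1", "SomeOtherBot")
print(training_bot_allowed, other_bot_allowed)
```

Parsing the file from text rather than fetching it keeps the check reproducible: the same stored robots.txt snapshot can be re-evaluated later when a regulator or rights holder asks what the crawler saw at crawl time.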

The AI Office Template and Transparency Burden

The July 24, 2025 AI Office template creates a documentation burden that many providers were unprepared to meet. The requirement to identify the top 10 percent of crawled domains by volume means that providers must maintain granular logs of web scraping activities, including domain-level acquisition statistics. The 3 percent threshold for large publicly available datasets requires precise calculation of dataset composition ratios. Web crawler disclosure obligations extend to technical identifiers, stated purposes, and content type specifications.
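The two thresholds translate into simple calculations over crawl logs and dataset inventories. The sketch below assumes volume is measured in bytes and that the 3 percent share is computed against total training data size; both are assumptions made for illustration, and the template’s own definitions should be checked before relying on either. All domain and dataset names are hypothetical.

```python
def top_domains_by_volume(bytes_by_domain: dict[str, int], percent: int = 10) -> list[str]:
    """Return the top `percent` of domains, ranked by bytes collected."""
    ranked = sorted(bytes_by_domain, key=lambda d: bytes_by_domain[d], reverse=True)
    # Round the cut-off up using integer arithmetic to avoid float surprises.
    keep = (len(ranked) * percent + 99) // 100
    return ranked[:keep]


def datasets_over_share(public_sizes: dict[str, int], total_size: int, percent: int = 3) -> list[str]:
    """Return public datasets whose size exceeds `percent` of total training data."""
    return [name for name, size in public_sizes.items() if size * 100 > total_size * percent]


# Hypothetical crawl log: ten domains with increasing volume.
crawl_log = {f"site{i}.example": (i + 1) * 1_000 for i in range(10)}
top = top_domains_by_volume(crawl_log)

# Hypothetical inventory: two public datasets inside 1,000 units of training data.
large_public = datasets_over_share({"open_corpus_a": 80, "open_corpus_b": 20}, total_size=1_000)
print(top, large_public)
```

Neither calculation is possible retroactively unless per-domain and per-dataset sizes were logged at acquisition time, which is why the logging requirement matters more than the arithmetic.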

The gap between template requirements and actual enforcement capacity is a live governance issue. The AI Office has not yet published detailed enforcement protocols, and the first compliance review cycles are still underway. However, the statutory framework is clear: non-compliance with Article 53 exposes providers to the full range of AI Act enforcement mechanisms, including requests for information, gap analysis against the Code of Practice, and potential penalties. The transparency requirements are not contingent on the AI Office’s current staffing or review cadence; they are legal obligations that providers must meet now.

The European Parliament’s March 2026 Resolution: A Signal of Future Stricter Regime

On March 10, 2026, the European Parliament adopted a non-binding resolution through its Committee on Legal Affairs (JURI) that signals a potential shift toward stricter copyright obligations for AI providers. The resolution proposed a flat-rate licensing fee of 5 to 7 percent of global turnover for creative industry compensation, a mechanism that would fundamentally alter the current opt-out framework. While the resolution is not binding on the Commission or Member States, its political significance is substantial. It reflects growing parliamentary pressure to move beyond the TDM exception toward mandatory compensation schemes.

The resolution also reinforced the extraterritorial application of EU copyright rules, stating that models offered in the EU should comply with Union copyright law regardless of where training occurred. For US and Asian providers, this means that fair use determinations or permissive domestic frameworks do not shield them from EU regulatory action. The resolution’s language aligns with Recital 106 of the AI Act, creating a consistent political direction across EU institutions.

The implications for global providers are direct. A US company training models under fair use assumptions and deploying them to EU users must now assess whether its training data practices satisfy EU copyright obligations, including opt-out compliance and transparency requirements. The resolution does not change current law, but it indicates the direction of future legislative review. The 2026 Copyright Directive review process is the first formal opportunity to translate these political signals into binding amendments.

The US Landscape: Fair Use Litigation and Regulatory Vacuum

The United States has no federal statute specifically governing AI training data copyright. The regulatory landscape is defined by litigation, agency guidance, and state-level legislation, creating a patchwork of obligations that contrasts sharply with the EU’s systematic framework. For governance professionals, the US environment presents a different category of risk: not the certainty of regulatory non-compliance, but the uncertainty of litigation outcomes and the absence of clear rules.

The Copyright Office’s May 2025 Report

On May 9, 2025, the US Copyright Office released the pre-publication version of Part 3 of its Copyright and Artificial Intelligence report, titled Generative AI Training. The report concluded that no new statutory exception for AI training is warranted at this time. It recommended that the licensing market for training data develop without government intervention, rejecting proposals for compulsory licensing with fixed royalties. The report acknowledged that fair use uncertainty distorts licensing market dynamics—rights holders are reluctant to negotiate licenses when providers may claim fair use, and providers are reluctant to pay for licenses when they believe fair use applies—but it concluded that statutory intervention is premature.

The report’s practical effect is to preserve the status quo: providers must continue to rely on the four-factor fair use test under 17 U.S.C. § 107, with all its attendant litigation risk. The Copyright Office did not provide categorical guidance on how the four factors apply specifically to generative AI training, instead emphasizing that fair use requires a case-by-case, fact-specific inquiry. For governance teams, this means that US training data copyright risk remains fundamentally a litigation risk, not a compliance risk in the regulatory sense.

Active Litigation Shaping the Contours of Fair Use

Several cases, in the United States and abroad, are shaping the legal boundaries for AI training. In the United Kingdom, the High Court ruled in November 2025 in Getty Images v. Stability AI. The court’s technical analysis addressed whether Stable Diffusion models store or reproduce Getty works as infringing copies. The ruling found that the model architecture did not, in the technical sense, store or reproduce the original works as copies, though trademark and other issues remained live in the litigation. UK law has no general fair use doctrine, and the copyright infringement analysis turned on the technical mechanics of diffusion models.

In Germany, the Munich Regional Court ruled in November 2025 in GEMA v. OpenAI that training on unlicensed song lyrics was unlawful under German copyright law. The court held that the TDM exception did not apply because the training was not limited to text and data mining in the statutory sense, and because the use of lyrics in generative outputs went beyond the analytical purpose contemplated by the exception. This decision, issued by a German regional court, may be subject to appeal and illustrates the risk that national courts may interpret TDM exceptions narrowly when generative outputs compete with the original market for the work.

In the United States, the first substantive ruling on AI training and fair use came on February 11, 2025, when Judge Stephanos Bibas of the District of Delaware issued his decision in Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc. The court granted partial summary judgment for Thomson Reuters, holding that Ross infringed 2,243 Westlaw headnotes and that Ross’s fair use defense failed as a matter of law. The court found that Ross’s use was commercial, not transformative, and served the same purpose as Thomson Reuters’s original works—facilitating legal research. The court emphasized that factor four, market effect, was “the single most important element of fair use” and that Ross’s product was a market substitute for Westlaw. The court was careful to note that the case involved non-generative AI, leaving generative AI fair use questions for future decisions. The case is now on appeal to the Third Circuit.

Other pending US cases include Andersen v. Stability AI, in which artists allege that Stability AI’s training on their works constitutes copyright infringement, with fair use as the central defense. Outside the United States, the Shanghai Intellectual Property Court’s 2023 decision in the LoRA/Altman case introduced an “analytical use” doctrine that treats AI training as non-infringing if the model’s internal processing does not communicate the original expression to users. This doctrine does not bind US courts, but it illustrates the global diversity of judicial approaches to the same technical question.

State-Level Transparency Requirements

California’s Generative AI Training Data Transparency Act, which entered into force in January 2026, creates a state-level obligation that partially fills the federal vacuum. The Act requires providers of generative AI systems to publish high-level summaries of training datasets, including sources or categories of sources and whether datasets include copyrighted or licensed material where known. The requirements are less granular than the EU AI Office template: California does not mandate domain-level disclosure, crawler specifications, or dataset size thresholds. It is a high-level disclosure regime rather than the granular transparency framework required in the EU.

The absence of federal preemption means that providers face a potential patchwork of state obligations. California’s approach may be replicated by other states, creating multiple transparency regimes with different content and format requirements. For providers already complying with the EU AI Office template, California’s requirements are largely subsumed, but the divergence in specificity creates documentation complexity. Governance teams must ensure that public disclosures satisfy the most demanding applicable jurisdiction without inadvertently creating litigation exposure in less demanding ones.

The Absence of a Federal TDM Exception

The United States remains an outlier among major jurisdictions in lacking a statutory text and data mining exception. Japan’s copyright law permits TDM for machine learning without rights holder consent, subject to certain conditions. Singapore’s 2021 copyright amendments introduced a broad TDM exception for computational data analysis. The United Kingdom is currently consulting on reforms that would either strengthen copyright protections, introduce a broad TDM exception with opt-out, or create a statutory licensing scheme.

US providers must instead rely on the transformative use doctrine under the four-factor fair use test: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect on the potential market for the original work. Each factor presents challenges in the AI training context. The purpose factor favors providers if training is deemed transformative, but the Thomson Reuters v. Ross Intelligence decision demonstrates that courts may reject transformativeness when the AI product competes with the original work’s market. The nature factor disfavors providers when training on expressive works such as photographs, music, or fiction. The amount factor is problematic because training typically uses entire works. The market effect factor is the most contested: if generative AI outputs substitute for licensed uses of original works, fair use becomes harder to sustain, as the Ross decision confirmed.

This asymmetric risk structure means that US providers face greater litigation uncertainty than their counterparts in jurisdictions with explicit TDM exceptions. A provider training in Japan under a statutory TDM exception and deploying in the EU under the Article 4 framework faces a known regulatory path. A US provider training under fair use assumptions and deploying in the EU faces both litigation uncertainty at home—exemplified by the Ross decision—and regulatory non-compliance risk abroad. This asymmetry is the defining feature of the current global governance landscape for AI training data copyright.

Cross-Jurisdictional Friction Points and Territoriality Risk

The most consequential governance challenge for AI training data copyright in 2026 is not the substance of any single jurisdiction’s rules, but the interaction between them. Providers operating across the EU and US must navigate regimes that were designed independently and now impose overlapping, sometimes contradictory obligations. The EU’s territorial expansion under Recital 106 of the AI Act creates a direct conflict with US fair use doctrine: the same training activity may be lawful under one framework and unlawful under the other, with no clear mechanism for resolving the conflict.

The EU’s Expanding Territorial Reach

Recital 106 of Regulation (EU) 2024/1689 states that providers should comply with the copyright policy obligation under Article 53 “regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of those general-purpose AI models take place.” This is not a jurisdictional claim over the training activity itself; it is a market-access condition. A model trained in the United States under fair use assumptions and offered to EU users must comply with EU copyright law, including the TDM exception and opt-out requirements, as a condition of market access.

The practical implications are substantial for US-trained models deployed in the EU. A provider that scraped web content without checking for machine-readable opt-out signals, relying on US fair use doctrine, may find that the same content is protected under EU law because the rights holder reserved TDM rights. The provider cannot defend its EU operations by pointing to US fair use determinations; fair use is a US doctrine with no direct EU equivalent, and the EU’s market-access regulation does not incorporate foreign copyright exceptions.

The potential for direct conflict with US fair use determinations arises when a US court holds that training on specific works is fair use, while EU law requires licensing for those same works because the rights holder opted out. The two outcomes cannot be reconciled by choosing one regime: licensing the works would be unnecessary under US law but is required under EU law, and declining to license is consistent with a US fair use ruling but violates EU market-access conditions. In practice, the stricter regime governs any model offered in both markets. This conflict has no established resolution mechanism. There is no bilateral treaty governing AI training data copyright, and the WIPO copyright framework does not address TDM exceptions in the context of generative AI.

Enforcement mechanisms are already operational. The AI Office can issue requests for information to any GPAI provider offering models in the EU, regardless of headquarters location. Gap analysis against the Code of Practice applies to all providers, including non-signatories. The AI Office can demand documentation of copyright compliance policies, training data summaries, and technical measures for opt-out compliance. Failure to respond or to demonstrate adequate measures exposes providers to the AI Act’s penalty structure for GPAI providers under Article 101, with fines of up to 3 percent of global annual turnover or EUR 15 million, whichever is higher.
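For budgeting purposes, the exposure ceiling can be expressed as a one-line calculation. The sketch below assumes, on a reading of the Regulation’s fining provision for GPAI providers, that the ceiling is the higher of 3 percent of worldwide annual turnover or a fixed EUR 15 million; the function name and turnover figures are hypothetical, and an actual fine depends on the gravity and circumstances of the infringement.

```python
def gpai_fine_ceiling_eur(worldwide_turnover_eur: int) -> int:
    """Upper bound on a fine: the higher of 3 percent of turnover or EUR 15 million."""
    three_percent = worldwide_turnover_eur * 3 // 100
    return max(three_percent, 15_000_000)


# For a smaller provider, 3 percent falls below the fixed amount, so the fixed amount applies.
small = gpai_fine_ceiling_eur(100_000_000)
# For a large provider, the turnover-based figure dominates.
large = gpai_fine_ceiling_eur(2_000_000_000)
print(small, large)
```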

The Compliance Dilemma: Dual-Jurisdiction Operations

Providers with operations in both jurisdictions face a structural compliance dilemma. Consider a US provider that trains a model under fair use assumptions, using web-scraped data without systematic opt-out checking. The provider now wishes to deploy the model to EU users. Under Recital 106, the provider must comply with EU copyright obligations for the model as offered in the EU, even though the training occurred in the US.

The provider has limited options. Territorial data segmentation—training separate models for EU and non-EU markets using different datasets—is technically possible but economically costly. It requires maintaining parallel training pipelines, duplicate infrastructure, and distinct model versions. For large foundation models, the cost of retraining or maintaining separate weights is prohibitive. Retroactive licensing is theoretically possible but practically difficult: the provider may not know which works were used, and rights holders may not offer licenses for past training or may demand rates that make the model uneconomical.

Documentation requirements create a secondary tension. The EU AI Office template requires detailed disclosure of training data sources, crawler configurations, and rights-reservation compliance measures. In US litigation, such disclosures could be discoverable and used to undermine fair use defenses. A provider that documents its opt-out compliance efforts for EU regulators may inadvertently create evidence that its training practices were more systematic than fair use doctrine contemplates, potentially supporting plaintiffs’ arguments that the use was commercial and substitutive rather than transformative. This transparency-litigation privilege tension is a novel governance challenge that has no established resolution.

The risk of “compliance laundering” accusations adds further complexity. If a provider segregates its most compliant documentation for EU regulators and maintains separate, less transparent records for US litigation purposes, it may face allegations of regulatory manipulation. Governance teams must structure documentation systems that satisfy EU transparency requirements while preserving defensive legal positions in US courts—a balance that requires careful legal and technical architecture.

The Code of Practice and Non-Signatory Obligations

The GPAI Code of Practice, drawn up under Article 56 and published in July 2025, serves as an interim compliance benchmark until harmonized standards become available. The Code was developed through a multi-stakeholder process involving providers, rights holder organizations, and civil society groups. Signatories commit to implementing the Code’s provisions, including copyright compliance policies, transparency measures, and opt-out compliance mechanisms.

Non-signatories face enhanced scrutiny. The AI Office has stated that non-signatory providers will be subject to more intensive gap analysis and information requests, on the basis that the absence of a signed commitment raises questions about the adequacy of voluntary measures. The Code is not legally binding in itself, but it shapes enforcement expectations. A provider that has not signed the Code and cannot demonstrate equivalent measures will find it harder to satisfy the AI Office that its Article 53 compliance is adequate.

The relationship to the future Copyright Directive review is also significant. The review of Directive 2019/790 required by Article 30 of the Directive, which the Commission is to carry out no sooner than 7 June 2026, may result in amendments that strengthen or modify the TDM exception framework. If the Commission proposes mandatory licensing mechanisms or flat-rate fees along the lines suggested by the European Parliament’s March 2026 resolution, the Code of Practice may be superseded by binding legislative requirements. Providers that have built compliance systems around the current Code may need to adapt rapidly to new statutory obligations.

IP Risk Categories and Common Misconceptions

Governance failures in AI training data copyright often stem from misconceptions that conflate superficially similar concepts. The following five misconceptions are prevalent among compliance teams and lead to systematic underestimation of legal exposure.

Misconception 1: “Publicly Available Online Content Is Free to Use”

The distinction between publicly visible and lawfully accessible is central to Article 4 of the DSM Directive. The TDM exception applies only to works that are “lawfully accessible,” not merely visible. Content behind paywalls, content subject to terms of service that prohibit scraping, and content protected by technical access barriers may be publicly visible but not lawfully accessible for TDM purposes.

Paywall circumvention for training data acquisition violates the lawful access requirement regardless of whether the content is technically retrievable. Terms of service violations that prohibit automated access or data extraction may render the access unlawful, even if no technical barrier was breached. The Hamburg Higher Regional Court’s 2025 LAION decision emphasized that lawful access requires compliance with both technical measures and contractual conditions governing the content’s availability.

Providers that scrape content without reviewing terms of service or that bypass rate limits, CAPTCHA challenges, or other technical barriers risk losing the TDM exception entirely. The exception is not a blanket permission to use any content that can be retrieved; it is a conditional permission that depends on the rights holder’s chosen mode of distribution and reservation. Governance teams must implement source verification workflows that distinguish between licensed content, public domain content, opt-out-checked content, and content with unresolved rights status.

Misconception 2: “Opt-Out Is Only About robots.txt”

The Hamburg court clarified that natural language reservations of rights—such as copyright notices in website footers or terms of service clauses—are insufficient to trigger the Article 4(3) opt-out. The reservation must be expressed in machine-readable form. However, the machine-readable landscape is more complex than a single robots.txt file.

Current standards include robots.txt directives, meta tags in HTML headers, HTTP headers with TDM-specific fields, and the emerging TDM Reservation Protocol. For images, video, and audio, metadata embedding standards such as IPTC Photo Metadata and XMP can carry rights reservation signals. The requirement for machine-readable form means that providers must implement multi-modal checking systems: text crawlers must read robots.txt and meta tags, image crawlers must extract embedded metadata, and audio crawlers must parse container metadata.
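For the HTTP header and HTML meta tag channels, a check might look like the sketch below. It assumes the field name tdm-reservation with the value 1 signalling a reservation, which follows a reading of the TDM Reservation Protocol draft and should be verified against the current specification. Embedded image and audio metadata would need separate parsers and are not covered here.

```python
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name:
                self.meta[name] = attributes.get("content") or ""


def tdm_reserved(headers: dict[str, str], html: str) -> bool:
    """Return True if the HTTP headers or the HTML meta tags signal a TDM reservation."""
    # Assumed field name and value, per the TDM Reservation Protocol draft.
    lowered = {key.lower(): value.strip() for key, value in headers.items()}
    if lowered.get("tdm-reservation") == "1":
        return True
    collector = _MetaCollector()
    collector.feed(html)
    return collector.meta.get("tdm-reservation", "").strip() == "1"


page = '<html><head><meta name="tdm-reservation" content="1"></head><body></body></html>'
print(tdm_reserved({}, page))
print(tdm_reserved({"TDM-Reservation": "1"}, ""))
print(tdm_reserved({}, "<p>All rights reserved.</p>"))
```

The last call illustrates the point made above about natural language notices: a footer stating “All rights reserved” is invisible to this check because it is not expressed in a machine-readable field.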

Terms of service limitations are supplementary but insufficient as standalone measures. A website’s terms of service may prohibit TDM, but unless the prohibition is also expressed in machine-readable form, it does not trigger the Article 4(3) opt-out. Conversely, a machine-readable opt-out may exist even where terms of service are silent. Providers must monitor both channels and maintain dynamic crawler blocking configurations that update as rights holders change their reservations. The maintenance burden is substantial: opt-out signals change as websites update, new rights holders enter the market, and standards evolve. A one-time opt-out check at the start of a training project is inadequate; compliance requires ongoing re-crawling and re-verification.
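Ongoing re-verification can be driven by a simple staleness rule over the date each source was last checked. The 30-day interval below is an arbitrary illustration, not a legal threshold, and the domain names are hypothetical.

```python
from datetime import date, timedelta


def due_for_recheck(last_checked: dict[str, date], today: date, max_age_days: int = 30) -> list[str]:
    """Return the sources whose last opt-out check is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(source for source, checked in last_checked.items() if checked < cutoff)


# Hypothetical check log.
checks = {
    "a.example": date(2026, 2, 20),  # checked recently
    "b.example": date(2026, 1, 1),   # stale
    "c.example": date(2026, 1, 30),  # exactly at the cut-off, not yet stale
}
stale = due_for_recheck(checks, today=date(2026, 3, 1))
print(stale)
```

A production system would more plausibly vary the interval by source volume or by how often a site’s signals have changed in the past, but the principle is the same: the check date is stored alongside the data and drives the re-crawl queue.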

Misconception 3: “Training Data Transparency Compromises Trade Secrets”

The AI Office template explicitly accommodates confidential business information. Providers are not required to disclose proprietary algorithms, model architectures, or competitive strategies. The template requires disclosure of dataset characteristics—modalities, sizes, languages, acquisition dates—not the specific proprietary methods used to process or filter those datasets.

The distinction between “sufficiently detailed” and “work-by-work assessment” is critical here. The template does not require itemization of every copyrighted work, every image file, or every text snippet. It requires aggregate information that enables rights holders and regulators to assess whether the provider has complied with opt-out obligations and lawful access requirements. A provider can disclose that it trained on a dataset of approximately 2 billion image-text pairs, sourced from 50 domains, with crawler configurations X, Y, and Z, without revealing the specific weighting schemes, filtering algorithms, or model hyperparameters that constitute trade secrets.

Strategic disclosure involves calibrating what must be revealed against what can be protected. The AI Office template’s structure supports this calibration by specifying required fields and leaving room for aggregation. Providers that treat the template as a threat to trade secrets rather than a structured disclosure framework may withhold too much and trigger regulatory scrutiny, or disclose more than the template requires and weaken their own trade secret protection. Training data provenance systems that integrate copyright compliance and bias auditing requirements can support transparent disclosure while protecting sensitive technical details.

Misconception 4: “US Fair Use Protects Global Training Activities”

Fair use under 17 U.S.C. § 107 is a US doctrine with territorial limits. It is an affirmative defense to copyright infringement, not a blanket authorization to use copyrighted works. A US court’s determination that training on specific works is fair use has no binding effect in EU courts or before the AI Office. The EU’s market-access regulation does not recognize foreign copyright exceptions as satisfying EU obligations.

The territoriality of copyright law means that each jurisdiction applies its own rules to acts occurring within its territory or to products offered within its market. The EU’s Recital 106 approach extends this territoriality to market access: a model offered in the EU is subject to EU copyright obligations regardless of where training occurred. US fair use determinations are irrelevant to this analysis. A provider that has won a fair use ruling in a US district court may still face AI Office enforcement for the same model in the EU if the training data did not comply with EU opt-out requirements.

The litigation risk is also multi-jurisdictional. A provider may face copyright claims in the US, where fair use is the principal defense, in Germany under national law implementing the DSM Directive, in the UK under domestic copyright law, and in France under droit moral protections, all for the same training activity. The February 2025 decision in Thomson Reuters v. Ross Intelligence demonstrates that US courts may reject fair use defenses for AI training when the product competes with the original work’s market. Fair use provides no shield against non-US claims. Governance teams must assess litigation risk in each jurisdiction where the model is offered, not merely the jurisdiction where training occurred.

Misconception 5: “Synthetic Data Eliminates Copyright Risk”

The AI Office template explicitly requires disclosure of synthetic data usage. Synthetic data does not eliminate copyright obligations if the synthetic data was derived from protected works. If a synthetic dataset was generated by a model trained on copyrighted works, the copyright questions attached to that source material can carry over to the synthetic data. The template’s requirement for synthetic data disclosure reflects this understanding: providers must account for synthetic data in their training data summaries, including the provenance of the synthetic generation process.

Downstream copyright in synthetic outputs is an additional concern. If a synthetic dataset contains expressions substantially similar to protected works, it may itself be infringing. The degree of similarity required for infringement varies by jurisdiction, but the risk is not theoretical. Providers that rely on synthetic data to avoid copyright issues may find that the synthetic data creates new copyright problems or that its quality degradation introduces bias and performance issues that undermine the model’s utility.

Quality and bias risks of synthetic-only training regimes are well documented in the research literature. Models trained predominantly on synthetic data can exhibit compounding errors, reduced diversity, and degraded performance on out-of-distribution tasks. For governance teams, synthetic data is a risk mitigation tool, not a risk elimination strategy. It must be accompanied by provenance documentation, quality validation, and copyright assessment of the synthetic generation pipeline.

Governance Implementation Framework

The divergence between EU regulatory obligations and US litigation-driven uncertainty demands governance architectures that can operate across both regimes simultaneously. Organizations that treat training data copyright as a legal afterthought—addressed only when a claim arises or a regulator inquires—will face compounding regulatory, litigation, and reputational risks. The framework below is designed for implementation by governance officers, compliance leads, and legal counsel managing cross-jurisdictional AI deployment.

Training Data Provenance Architecture

Source verification workflows must distinguish between four categories of training data: licensed content with contractual usage rights, public domain content with expired or waived protection, content that has been checked against opt-out signals and confirmed as available for TDM, and synthetic data with documented generation provenance. Each category requires distinct documentation standards and retention policies.

Data lineage documentation systems must track each dataset from acquisition through preprocessing to model training. The lineage record should include the dataset identifier, acquisition date, source URL or provider, modality, approximate size, language distribution, and rights status at the time of acquisition. For web-crawled data, the lineage record should include the crawler version, the robots.txt and meta tag status at crawl time, and any terms of service restrictions identified. Version control for crawler configurations and opt-out lists is essential: a provider must be able to demonstrate that a specific data point was acquired under a specific crawler configuration that reflected the opt-out status at that time.
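As an illustration, the lineage fields listed above could be captured in a single record type. The following is a minimal sketch, not a prescribed schema: the class name, field names, and the `missing_crawl_fields` helper are hypothetical, and the four rights categories simply mirror the source verification categories described earlier.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class RightsStatus(Enum):
    """The four source categories used in this article (names are illustrative)."""
    LICENSED = "licensed"
    PUBLIC_DOMAIN = "public_domain"
    TDM_CHECKED = "tdm_checked_no_reservation"
    SYNTHETIC = "synthetic"


@dataclass(frozen=True)
class LineageRecord:
    dataset_id: str
    acquired_on: date
    source: str                      # source URL or provider name
    modality: str
    approx_size_bytes: int
    languages: dict                  # language code -> share of the dataset
    rights_status: RightsStatus      # status at the time of acquisition
    # Crawl-time evidence, only meaningful for web-crawled data.
    crawler_version: Optional[str] = None
    robots_txt_allowed: Optional[bool] = None
    meta_tag_reservation: Optional[bool] = None
    tos_restrictions: tuple = ()

    def missing_crawl_fields(self) -> list:
        """Return crawl-time fields still missing from a web-crawled record."""
        if self.rights_status is not RightsStatus.TDM_CHECKED:
            return []
        required = {
            "crawler_version": self.crawler_version,
            "robots_txt_allowed": self.robots_txt_allowed,
            "meta_tag_reservation": self.meta_tag_reservation,
        }
        return [name for name, value in required.items() if value is None]


record = LineageRecord(
    dataset_id="news-crawl-2025-11",
    acquired_on=date(2025, 11, 3),
    source="https://example.org/",
    modality="text",
    approx_size_bytes=42_000_000,
    languages={"en": 0.8, "de": 0.2},
    rights_status=RightsStatus.TDM_CHECKED,
    crawler_version="crawler-1.4.2",
    robots_txt_allowed=True,
)
print(record.missing_crawl_fields())
```

Making the record immutable is a deliberate choice in this sketch: a lineage entry documents the state at acquisition time, so later changes in rights status are better stored as new records than as edits to old ones.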

Automated rights reservation checking should supplement, not replace, manual verification. Automated systems can parse robots.txt, meta tags, and embedded metadata at scale, but they cannot resolve ambiguities in natural language terms of service or assess whether a technical barrier constitutes a lawful access restriction. Manual verification should focus on high-risk content categories—press publications, stock photography, music lyrics, software code—where rights holder enforcement activity is concentrated. The combination of automated scanning for scale and manual review for risk concentration provides both efficiency and accuracy.

The Opt-Out Compliance Stack

Technical implementation of opt-out compliance requires integration across multiple signaling systems. robots.txt remains the baseline standard, but its limitations are significant: it applies at the domain or path level, not the individual work level; it does not standardize TDM-specific syntax; and it is not enforceable against non-compliant crawlers. Providers should implement robots.txt parsing as a minimum requirement and supplement it with ai.txt parsing, meta tag extraction, and HTTP header field checking where these standards are deployed.
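A minimal sketch of such a layered check follows, using only the Python standard library. It assumes that the page's robots.txt, HTML, and response headers have already been fetched, and it assumes the `tdm-reservation` signal name from the W3C community group's TDM Reservation Protocol for the meta tag and HTTP header; the function name, the `ExampleAIBot` user agent, and the returned dictionary keys are hypothetical. ai.txt is left out because the article notes that TDM-specific syntax is not yet standardized.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _MetaReservationParser(HTMLParser):
    """Detects <meta name="<signal>" content="1"> in a page."""

    def __init__(self, meta_name):
        super().__init__()
        self.meta_name = meta_name
        self.reserved = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name == self.meta_name and attr.get("content") == "1":
            self.reserved = True


def check_opt_out(url, user_agent, robots_txt, html, headers,
                  signal_name="tdm-reservation"):
    """Collect opt-out signals for one URL from three already-fetched sources."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    meta = _MetaReservationParser(signal_name)
    meta.feed(html)

    # Header names are case-insensitive, so normalize before lookup.
    normalized = {key.lower(): value.strip() for key, value in headers.items()}

    signals = {
        "robots_txt": not robots.can_fetch(user_agent, url),
        "meta_tag": meta.reserved,
        "http_header": normalized.get(signal_name) == "1",
    }
    # Conservative rule (an assumption of this sketch): any signal counts.
    signals["reserved"] = any(signals.values())
    return signals


result = check_opt_out(
    url="https://example.org/article",
    user_agent="ExampleAIBot",
    robots_txt="User-agent: ExampleAIBot\nDisallow: /\n",
    html='<html><head><meta name="tdm-reservation" content="1"></head></html>',
    headers={"TDM-Reservation": "1"},
)
print(result)
```

The sketch treats a reservation in any one channel as sufficient. That is a policy choice rather than a legal conclusion, and it does not address terms-of-service language, which still needs manual review.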

Periodic re-crawling and re-verification schedules must account for the dynamic nature of web content. Rights holders may add or remove opt-out signals, change terms of service, or deploy new technical barriers. A provider that checked opt-out status at the start of a training project but never re-verified may find that content acquired lawfully at one time is no longer lawfully accessible. The re-verification schedule should align with the provider’s model update cadence: if a model is retrained quarterly, opt-out status should be re-verified before each training run. For continuously updated models, re-verification should occur on a rolling basis with defined intervals.
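The scheduling rule can be sketched as a simple staleness check run before each training run. The function name and the 30-day default below are assumptions chosen for illustration; the appropriate interval depends on the provider's retraining cadence.

```python
from datetime import date, timedelta


def sources_due_for_reverification(last_checked, next_training_run, max_age_days=30):
    """Return sources whose last opt-out check is too old for the next run.

    last_checked maps a source (for example a domain) to the date of its
    most recent opt-out verification.
    """
    cutoff = next_training_run - timedelta(days=max_age_days)
    return sorted(src for src, checked in last_checked.items() if checked < cutoff)


due = sources_due_for_reverification(
    last_checked={
        "a.example": date(2026, 1, 2),
        "b.example": date(2026, 3, 20),
    },
    next_training_run=date(2026, 4, 1),
)
print(due)
```

For continuously updated models, the same function can be called on a rolling basis with the current date in place of `next_training_run`.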

Integration with dataset management platforms is necessary for operational sustainability. Opt-out compliance cannot be maintained as a separate manual process outside the data pipeline; it must be embedded in the platform that manages dataset versioning, preprocessing, and model training. The platform should flag datasets with changed opt-out status, block acquisition from newly opted-out sources, and maintain audit trails of compliance actions. Governance tooling solutions that combine automated opt-out checking with dataset management and documentation generation can reduce the operational burden of maintaining compliance at scale.
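One way to picture the flagging and blocking behavior is as a comparison between two opt-out snapshots. The sketch below is illustrative only: the function names and the shape of the snapshot dictionaries (source mapped to a reserved flag) are assumptions, and a production platform would also write each result to its audit trail.

```python
def diff_opt_out_status(previous, current):
    """Compare two snapshots mapping source -> True if rights are reserved."""
    newly_reserved = sorted(
        src for src, reserved in current.items()
        if reserved and not previous.get(src, False)
    )
    # Only sources seen in both snapshots can be reported as released.
    released = sorted(
        src for src, reserved in previous.items()
        if reserved and src in current and not current[src]
    )
    return {"newly_reserved": newly_reserved, "released": released}


def flag_datasets(dataset_sources, newly_reserved):
    """Return datasets that contain at least one newly reserved source."""
    blocked = set(newly_reserved)
    flagged = {}
    for dataset_id, sources in dataset_sources.items():
        hits = sorted(blocked & set(sources))
        if hits:
            flagged[dataset_id] = hits
    return flagged


changes = diff_opt_out_status(
    previous={"a.example": False, "b.example": True},
    current={"a.example": True, "b.example": False, "c.example": True},
)
flags = flag_datasets(
    {"ds1": ["a.example", "x.example"], "ds2": ["x.example"]},
    changes["newly_reserved"],
)
print(changes, flags)
```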

Conflicting opt-out signals across jurisdictions present an emerging challenge. A rights holder may opt out of TDM in the EU but not in the US, or may use different signaling mechanisms in different jurisdictions. A provider training a global model must decide whether to apply the strictest applicable standard globally, maintain jurisdiction-specific datasets, or accept the risk of non-compliance in jurisdictions with stricter requirements. The EU’s Recital 106 approach suggests that the strictest standard may become the de facto global standard for models offered in the EU, since segregating training data by jurisdiction is economically costly for large foundation models.
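The first two options can be expressed as a small policy function, sketched below under stated assumptions: the policy names, the function name, and the jurisdiction codes are hypothetical, and the third option (accepting non-compliance risk) is deliberately not modeled.

```python
def exclusion_by_market(signals_by_jurisdiction, markets, policy):
    """Decide, per deployment market, whether a work is excluded from training.

    signals_by_jurisdiction maps a jurisdiction code to True if the rights
    holder has reserved TDM rights there.
    """
    if policy == "strictest_global":
        # One reservation anywhere excludes the work everywhere.
        excluded = any(signals_by_jurisdiction.values())
        return {market: excluded for market in markets}
    if policy == "per_jurisdiction":
        # Each market's dataset honors only that market's signal.
        return {m: signals_by_jurisdiction.get(m, False) for m in markets}
    raise ValueError(f"unknown policy: {policy}")


signals = {"EU": True, "US": False}
strict = exclusion_by_market(signals, ["EU", "US"], "strictest_global")
split = exclusion_by_market(signals, ["EU", "US"], "per_jurisdiction")
print(strict, split)
```

The per-jurisdiction branch implies a separate dataset, and potentially a separate model, for each market, which is the cost that makes the strictest-standard approach attractive in practice.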

Documentation and Transparency Readiness

EU AI Office template pre-population should begin before the first regulatory inquiry, not after. The template’s fields are known and structured; providers can build internal documentation systems that generate template-compliant summaries as a byproduct of normal data governance activities. Pre-population reduces response time to information requests and demonstrates proactive compliance posture to regulators.

California Act summary preparation runs in parallel. The California Generative AI Training Data Transparency Act requires high-level summaries that are less granular than the EU template but cover overlapping subject matter. A unified documentation system can generate both EU and California disclosures from the same underlying data lineage records, reducing duplication and ensuring consistency. Inconsistencies between disclosures made to different regulators can create enforcement risk: if a provider’s EU template describes a dataset composition that differs from its California summary, regulators in either jurisdiction may question the accuracy of both.
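A unified system of this kind can be sketched as two views over the same lineage records plus a consistency check. The record keys and summary fields below are hypothetical placeholders and do not reproduce the actual fields of the AI Office template or the California statute.

```python
from collections import defaultdict


def totals_by_modality(records):
    """Sum dataset sizes per modality from shared lineage records."""
    totals = defaultdict(int)
    for record in records:
        totals[record["modality"]] += record["size_bytes"]
    return dict(totals)


def eu_style_summary(records):
    """More granular view: per-modality volume plus dataset identifiers."""
    return {
        "modalities": totals_by_modality(records),
        "datasets": sorted(r["dataset_id"] for r in records),
    }


def california_style_summary(records):
    """High-level view: dataset count and the modalities covered."""
    return {
        "dataset_count": len(records),
        "modalities": sorted(totals_by_modality(records)),
    }


def disclosures_consistent(eu, california):
    """Check that both disclosures describe the same underlying inventory."""
    return (
        sorted(eu["modalities"]) == california["modalities"]
        and len(eu["datasets"]) == california["dataset_count"]
    )


records = [
    {"dataset_id": "ds-text-1", "modality": "text", "size_bytes": 700},
    {"dataset_id": "ds-img-1", "modality": "image", "size_bytes": 300},
]
eu = eu_style_summary(records)
ca = california_style_summary(records)
print(disclosures_consistent(eu, ca))
```

Because both summaries are derived from one record set, an inconsistency can only arise if a disclosure is edited by hand afterward, which is the situation the check is meant to catch.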

Internal audit trails must be maintained separately from public disclosures. The audit trail includes detailed records of opt-out checks, licensing negotiations, terms of service reviews, and compliance decisions that may not be suitable for public disclosure but are essential for regulatory defense and internal governance. The audit trail should be structured to support rapid retrieval in response to regulatory requests while preserving litigation privilege where applicable.

The tension between transparency and litigation privilege requires careful documentation architecture. EU-mandated disclosures are public and may be discoverable in US litigation. The February 2025 decision in Thomson Reuters v. Ross Intelligence illustrates how courts scrutinize the commercial purpose and market effect of AI training practices. A provider’s detailed EU transparency disclosures could be used to support arguments that its training was systematic and commercial, undermining fair use defenses. Internal audit trails may be protected by work-product doctrine or attorney-client privilege in US proceedings but are subject to regulatory access in EU enforcement.

Providers should structure their documentation to maximize privilege protection for US litigation while ensuring that sufficient information is available to satisfy EU transparency requirements. This may involve maintaining parallel documentation streams: public summaries for regulatory disclosure, and privileged internal analyses for litigation defense. Legal counsel should review the documentation architecture in both jurisdictions to ensure that privilege is not inadvertently waived by public disclosure or regulatory submission.

Risk Assessment and Mitigation Matrix

Training data risk categories should be assessed based on rights holder enforcement patterns, litigation history, and regulatory focus. High-risk categories include press publications, where publishers have actively pursued licensing claims; stock photography, where image libraries maintain extensive rights management systems; music lyrics, where collecting societies have initiated litigation; and software code, where open-source license compliance intersects with copyright enforcement. Medium-risk categories include academic papers, where publisher policies vary and institutional access agreements may limit TDM; government documents, where copyright status varies by jurisdiction and publication type; and user-generated content, where rights are often unclear and multiple claimants may exist. Low-risk categories include public domain works, self-generated data, and licensed proprietary datasets with clear contractual usage rights.

The risk matrix should inform acquisition decisions, not merely retrospective compliance checks. A provider evaluating a potential training dataset should assess the risk category of its dominant content types, the prevalence of opt-out signals in that category, the likelihood of rights holder enforcement, and the availability of alternative licensed or public domain sources. High-risk datasets should require enhanced verification, including manual review of rights status and documented licensing where the TDM exception does not apply.
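The category tiers described above can be encoded as a lookup that feeds acquisition decisions. This is a simplified sketch: the category keys, the 20 percent dominance threshold, and the rule that unknown categories default to high risk are all assumptions, and the sketch covers only the content-category factor, not opt-out prevalence, enforcement likelihood, or the availability of alternatives.

```python
RISK_TIER = {
    "press_publications": "high",
    "stock_photography": "high",
    "music_lyrics": "high",
    "software_code": "high",
    "academic_papers": "medium",
    "government_documents": "medium",
    "user_generated_content": "medium",
    "public_domain": "low",
    "self_generated": "low",
    "licensed_proprietary": "low",
}
_TIER_ORDER = {"low": 0, "medium": 1, "high": 2}


def assess_dataset(content_shares, dominant_threshold=0.2):
    """Rate a candidate dataset by the riskiest of its dominant content types.

    content_shares maps a content category to its share of the dataset (0-1).
    """
    dominant = [c for c, share in content_shares.items() if share >= dominant_threshold]
    # If no category is dominant, fall back to considering every category.
    considered = dominant or list(content_shares)
    # Unknown categories are treated as high risk (a conservative assumption).
    tiers = [RISK_TIER.get(category, "high") for category in considered]
    tier = max(tiers, key=_TIER_ORDER.__getitem__, default="low")
    return {
        "tier": tier,
        "dominant_categories": sorted(dominant),
        "enhanced_verification": tier == "high",
    }


mixed = assess_dataset({"press_publications": 0.3, "public_domain": 0.7})
print(mixed)
```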

Contingency planning for retroactive licensing demands is essential because the current legal landscape may shift rapidly. The European Parliament’s proposed flat-rate licensing fee, if implemented in the June 2026 Copyright Directive review or subsequent legislation, could impose retroactive obligations on providers that trained models under the current TDM exception. The May 2025 Copyright Office report’s rejection of compulsory licensing does not preclude future US legislative action if Congressional sentiment shifts. Providers should maintain reserve provisions for potential licensing liabilities and document their training data provenance to support rapid licensing negotiations if required. The ability to identify which works were used, from which sources, and under what rights status at the time of use is a prerequisite for any retroactive licensing strategy.

Standards and Framework References

Regulatory compliance with AI training data copyright obligations does not operate in isolation from existing governance frameworks. ISO/IEC standards, the NIST AI Risk Management Framework, and OECD principles provide structural support for implementing the operational requirements described in the preceding sections. Governance professionals can integrate copyright compliance into existing management systems rather than building parallel programs.

ISO/IEC 23053:2022 (Framework for AI Systems Using ML)

ISO/IEC 23053 establishes a framework for AI systems that use machine learning, including data governance requirements relevant to training data provenance. The standard’s data governance provisions address data acquisition, data quality, and data lineage—capabilities that directly support copyright compliance documentation. A provider implementing ISO/IEC 23053 data governance controls can extend those controls to capture rights status information, opt-out verification records, and licensing documentation without fundamental architectural change.

The risk management integration points in ISO/IEC 23053 are particularly relevant. The standard requires identification of risks associated with data sources, including legal and regulatory risks. Training data copyright risk fits within this category and should be explicitly included in the organization’s risk register. The standard’s performance evaluation requirements for data governance can be adapted to monitor opt-out compliance rates, licensing coverage ratios, and transparency documentation completeness.

ISO/IEC 42001 (AI Management Systems)

ISO/IEC 42001 specifies requirements for establishing, implementing, maintaining, and continually improving an AI management system. The standard’s clause on “Context of the organization” requires identification of external regulatory requirements that affect the AI system. For training data copyright, this includes the EU AI Act Article 53 obligations, the DSM Directive TDM exception, the California Generative AI Training Data Transparency Act, and any other applicable national requirements. The management system must document these requirements and ensure they are communicated to relevant personnel.

The “Operational planning and control” clause requires processes for managing data acquisition, including controls to ensure that data is obtained in compliance with legal and contractual requirements. This maps directly onto the opt-out compliance stack and source verification workflows described in the governance implementation framework. The standard’s performance evaluation clause requires monitoring and measurement of process effectiveness, which can include metrics such as the percentage of training data with verified rights status, the timeliness of opt-out re-verification, and the accuracy of public transparency disclosures.
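As one example of such a metric, the share of training data with verified rights status can be computed directly from lineage records. The record keys and function name are hypothetical, and weighting by size rather than by dataset count is a design choice of this sketch.

```python
def verified_rights_share(records):
    """Return the size-weighted share of training data with verified rights."""
    total = sum(r["size_bytes"] for r in records)
    if total == 0:
        return 0.0
    verified = sum(r["size_bytes"] for r in records if r["rights_verified"])
    return verified / total


share = verified_rights_share([
    {"dataset_id": "ds1", "size_bytes": 75, "rights_verified": True},
    {"dataset_id": "ds2", "size_bytes": 25, "rights_verified": False},
])
print(f"{share:.0%} of training data has verified rights status")
```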

Organizations pursuing ISO/IEC 42001 certification or alignment should integrate copyright compliance into their AI management system scope statement, risk assessment, and operational controls. Auditors assessing conformity to ISO/IEC 42001 will expect to see documented procedures for training data rights management, evidence of implementation, and records of monitoring and review. The standard does not prescribe specific copyright compliance measures, but it requires that whatever measures the organization chooses are documented, implemented, and subject to continual improvement.

NIST AI RMF 1.0

The NIST AI Risk Management Framework provides a function-based approach that aligns well with cross-jurisdictional copyright compliance. The Govern function requires policies and procedures for legal and regulatory compliance, including intellectual property rights. A governance program under the NIST AI RMF should include explicit policies for training data acquisition, rights verification, opt-out compliance, and transparency documentation. These policies should be approved at the appropriate organizational level, communicated to relevant personnel, and subject to periodic review.

The Map function requires identification of AI system risks, including risks related to training data. For copyright compliance, this means mapping the organization’s training data sources to jurisdictional requirements, identifying gaps where opt-out compliance has not been verified, and assessing the likelihood and impact of regulatory enforcement or litigation. The Map function should produce a documented inventory of training data risks that informs the organization’s overall AI risk profile.

The Measure and Manage functions require ongoing monitoring and response to identified risks. For training data copyright, this includes measuring opt-out compliance rates, tracking changes in regulatory requirements, monitoring litigation developments that may affect fair use risk assessments—such as the February 2025 decision in Thomson Reuters v. Ross Intelligence and its pending appeal—and managing responses to enforcement actions or rights holder claims. The NIST AI RMF’s iterative approach—govern, map, measure, manage, and repeat—supports the dynamic compliance requirements of a field where regulations, standards, and judicial interpretations are evolving rapidly.

OECD AI Principles

The OECD Recommendation on Artificial Intelligence, updated in 2024 to address intellectual property considerations, provides high-level principles that inform national regulatory approaches. The 2024 update explicitly acknowledges the need for AI systems to respect intellectual property rights and calls for transparency in training data provenance. While the OECD principles are not binding law, they shape the policy environment in which national regulations are developed and provide a common reference point for cross-border cooperation.

The transparency and explainability principles in the OECD Recommendation align with the EU AI Act’s training data summary obligations and the California Generative AI Training Data Transparency Act’s disclosure requirements. Providers that implement transparency practices aligned with the OECD principles will find that they have already addressed many of the substantive requirements of these regimes. The OECD’s cross-border cooperation recommendations are particularly relevant to the territoriality issues discussed earlier: the principles encourage jurisdictions to work toward interoperable approaches to AI governance, including intellectual property, even where full harmonization is not achievable.

For standards practitioners participating in ISO/IEC JTC 1/SC 42 or national mirror committees, the OECD principles provide the policy context for technical standard development. Standards that support training data provenance, rights reservation signaling, and transparency documentation contribute to the implementation of OECD principles at the operational level. The alignment between OECD policy direction and ISO/IEC technical standards creates a coherent governance ecosystem that providers can navigate with a unified approach rather than fragmented compliance efforts.

Emerging Developments and Future Compliance Considerations

The regulatory landscape for AI training data copyright will not stabilize in 2026. Multiple parallel processes—legislative review, judicial interpretation, standardization, and international policy development—will continue to reshape obligations. Governance teams must monitor these developments and build adaptive compliance systems rather than static checklists.

Article 30 of Directive (EU) 2019/790 provides for a statutory review of the Directive, to be carried out no sooner than June 7, 2026. The European Commission must present a report on its main findings to the European Parliament and Council, evaluating the Directive’s application, including the TDM exceptions in Articles 3 and 4. The 2026 review process is the first formal opportunity to amend the TDM framework since its adoption.

Whether the review leads to mandatory licensing mechanisms is the most significant open question. The European Parliament’s March 2026 non-binding resolution proposed a flat-rate licensing fee of 5 to 7 percent of global turnover for creative industry compensation. If the Commission incorporates this proposal into its review report and subsequent legislative proposal, the current opt-out framework could be replaced by a compulsory licensing regime. Such a shift would fundamentally alter the economics of AI training data acquisition: providers would no longer be able to rely on the TDM exception for commercially available works but would instead pay into a collective licensing system.

The operational implications of a flat-rate fee are substantial. A percentage-of-turnover model would require providers to report global AI-related revenue, allocate portions to specific models or services, and remit fees to collecting societies. The administrative burden would exceed current transparency requirements, and the fee level—if set at the upper end of the proposed range—could materially affect the cost structure of foundation model development. The timeline for legislative action, should the Commission propose amendments, would extend into 2027 or beyond, given the ordinary legislative procedure requirements. Providers should not assume that the current opt-out framework will remain unchanged indefinitely.
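To give a sense of scale, the proposed range can be applied to a hypothetical turnover figure. The sketch below assumes the 5 to 7 percent range cited above and a pro-rata allocation across models by attributed revenue; the allocation rule, the function names, and all figures are illustrative assumptions, since the resolution as described does not specify an allocation method.

```python
from decimal import Decimal


def flat_rate_fee_range(global_turnover, low_rate=Decimal("0.05"), high_rate=Decimal("0.07")):
    """Return the (low, high) fee implied by a percentage-of-turnover model."""
    turnover = Decimal(str(global_turnover))
    return turnover * low_rate, turnover * high_rate


def allocate_fee(fee, revenue_by_model):
    """Split a fee across models in proportion to their attributed revenue."""
    total = sum(Decimal(str(v)) for v in revenue_by_model.values())
    if total == 0:
        return {model: Decimal("0") for model in revenue_by_model}
    return {
        model: fee * Decimal(str(revenue)) / total
        for model, revenue in revenue_by_model.items()
    }


low, high = flat_rate_fee_range(200_000_000)
split = allocate_fee(low, {"model_a": 3, "model_b": 1})
print(low, high, split)
```

On a hypothetical turnover of 200 million, the range works out to between 10 million and 14 million per year, which illustrates why the fee level would matter to foundation model cost structures.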

UK Consultation Outcomes (Expected 2026)

The UK Intellectual Property Office has consulted on three options for reforming copyright and AI training data law, alongside a do-nothing baseline (Option 0). Option 1 would strengthen copyright protections and require licensing for all AI training. Option 2 would introduce a broad TDM exception. Option 3, the government’s stated preference in the consultation, would introduce a TDM exception that allows rights holders to reserve their rights, supported by transparency measures, similar to the EU’s Article 4 framework.

The UK’s choice will determine whether it aligns with the EU approach or diverges from it. Alignment would create a contiguous regulatory zone across the Channel, simplifying compliance for providers operating in both markets. Divergence would add a third regime to the EU-US duality, increasing complexity. The UK IPO’s expected report in 2026 will indicate the government’s preferred direction, but legislative implementation would follow only if the government chooses to act on the recommendation. The consultation has attracted significant input from creative industries favoring stronger protection and from technology sectors favoring broader exceptions, reflecting the same stakeholder divisions visible in EU and US debates.

India’s Proposed Statutory Licensing Model

In December 2025, the Department for Promotion of Industry and Internal Trade (DPIIT) published a working paper proposing a mandatory blanket license with collective management organizations for AI training data. The proposal would require AI providers to obtain licenses from designated collecting societies, with fees distributed to rights holders based on usage data. The consultation closed on February 6, 2026, and the government is expected to publish its response in the second half of 2026.

India’s model, if implemented, would create a global precedent for compulsory licensing of AI training data. Unlike the EU’s opt-out framework or the US fair use approach, India’s proposal would make licensing the default and only mechanism: there would be no exception permitting unlicensed use of works whose rights holders have not opted out. The operational implications for providers with Indian operations or Indian training data would be direct: they would need to negotiate with Indian collecting societies and maintain usage records for fee distribution. The model’s potential influence on other jurisdictions, particularly in the Global South, should not be underestimated. If India demonstrates that compulsory licensing is administratively feasible and economically sustainable, other countries may adopt similar frameworks.

Technical Standards for Rights Reservation

The World Wide Web Consortium (W3C) and industry consortia are developing technical standards for machine-readable rights reservation that may eventually supplant the current fragmented landscape of robots.txt, ai.txt, and proprietary metadata formats. Standardization efforts focus on interoperability: a single, widely adopted protocol that rights holders can use to signal TDM reservations across all content types and platforms, and that providers can implement once rather than maintaining multiple parsing systems.

The interoperability challenges between current opt-out mechanisms are a live operational problem. A provider may encounter robots.txt directives, ai.txt files, meta tags, HTTP headers, and embedded metadata within the same domain, with inconsistent or contradictory signals. Resolving these conflicts requires judgment that automated systems cannot yet exercise reliably. The development of automated compliance checking tools is advancing, but current tools are limited to specific signaling systems and cannot yet handle the full complexity of multi-jurisdictional, multi-modal rights management. Governance teams should monitor standardization developments and plan for migration to emerging standards when they achieve sufficient adoption and stability.
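Until a single protocol is adopted, one pragmatic pattern is to detect contradictory signals automatically and route them to a human reviewer. The sketch below assumes each mechanism has been reduced to True (reserved), False (not reserved), or None (signal absent); the function name and the conservative "any reservation wins" default are assumptions, not an established rule.

```python
def reconcile_signals(signals):
    """Combine per-mechanism opt-out signals for one domain or work.

    signals maps a mechanism name to True, False, or None (not present).
    """
    observed = {name: value for name, value in signals.items() if value is not None}
    conflict = len(set(observed.values())) > 1
    return {
        # Conservative default: treat the content as reserved if any signal says so.
        "reserved": any(observed.values()),
        "conflict": conflict,
        # Contradictory or entirely missing signals go to manual review.
        "needs_manual_review": conflict or not observed,
    }


decision = reconcile_signals({"robots_txt": True, "meta_tag": False, "http_header": None})
print(decision)
```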

Why Provenance-by-Design Will Define Competitive Advantage in AI Development

The 2026 landscape for AI training data copyright is not a temporary compliance hurdle that organizations can clear and then return to unconstrained web scraping. It is a permanent structural shift in AI development economics. The era of treating training data as an abundant, unregulated input is ending. The new era requires provenance-by-design governance architectures that embed rights management into data pipelines from acquisition through model deployment.

Organizations that build systematic rights-management into their data pipelines will gain sustainable competitive advantage. They will face lower regulatory enforcement risk, reduced litigation exposure, and faster market access in jurisdictions with strict compliance requirements. Their transparency documentation will be ready for regulatory submission without emergency remediation. Their training data inventories will support rapid licensing negotiations if mandatory regimes emerge. Their governance posture will be defensible to boards, investors, and partners who increasingly scrutinize AI risk management.

Organizations that treat training data copyright as a legal afterthought will face compounding risks. Regulatory enforcement under the AI Act will intensify as the AI Office completes its organizational build-out and begins systematic compliance review. Litigation in the US will produce precedents that narrow fair use defenses for AI training, as the February 2025 decision in Thomson Reuters v. Ross Intelligence has already demonstrated. The June 2026 Copyright Directive review and subsequent legislative processes will likely tighten, not relax, EU obligations. State-level transparency requirements in the US will proliferate, creating additional documentation burdens. The cumulative effect is a governance environment where reactive compliance becomes progressively more expensive and less effective.

The June 2026 Copyright Directive review and ongoing US litigation will further clarify requirements, but the direction of clarification is toward greater obligation, not lesser. Early governance investment is the only prudent strategy. The providers that begin building provenance-by-design systems now will be positioned to adapt to whatever regulatory configuration emerges. Those that delay will face retrofit costs that multiply with each new obligation, each enforcement action, and each judicial decision that narrows their defensive options.

The cross-jurisdictional friction between EU regulatory certainty and US litigation uncertainty will not resolve in the near term. It is a structural feature of the global AI governance landscape that providers must learn to navigate rather than wait out. The frameworks, standards, and implementation guidance in this article are designed for that navigation: not as a prediction of what the law will be, but as a map of what it is, where it is heading, and how governance professionals can build systems that remain effective across the range of plausible regulatory outcomes.

Does the EU AI Act require AI providers to license all copyrighted training data?

No. Article 53 of the EU AI Act requires compliance with existing copyright law, not universal licensing. The TDM exception under Article 4 of Directive (EU) 2019/790 permits commercial text and data mining on works where rights holders have not validly reserved those rights in machine-readable form. Licensing is only required for content where a valid opt-out applies or where the TDM exception does not cover the use. The AI Act does not create a new licensing obligation; it mandates adherence to the existing framework.

Can a US company rely on fair use for models deployed in the EU?

Fair use under 17 U.S.C. § 107 is a US doctrine with no direct equivalent in EU law. Under Recital 106 of the EU AI Act, copyright obligations may apply to models offered in the EU regardless of where training occurred. A US fair use determination does not shield a provider from EU regulatory action. The February 2025 decision in Thomson Reuters v. Ross Intelligence, which rejected fair use for AI training on legal headnotes, illustrates the litigation risk in the US, but that determination has no binding effect on EU regulators. Providers training under fair use assumptions and deploying to EU users must independently assess compliance with EU opt-out and transparency requirements.

What constitutes a “sufficiently detailed” training data summary under Article 53(1)(d)?

The AI Office template, published on July 24, 2025, specifies the required granularity. Providers must disclose dataset modalities, sizes, languages, acquisition dates, major publicly available datasets with identifiers, web crawler specifications, and measures to comply with rights reservations. The template requires disclosure of the top 10 percent of crawled domains by volume and any dataset exceeding 3 percent of total training data. It does not require work-by-work assessment. Confidential business information may be withheld if the summary remains sufficiently detailed to serve its regulatory purpose.
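The two thresholds mentioned in this answer can be computed from crawl and dataset inventories. The sketch below takes the 10 percent and 3 percent figures from the answer above as given; the function names, the round-up rule for the domain count, and the tie-breaking by domain name are assumptions and are not taken from the template itself.

```python
def top_domains_to_disclose(bytes_by_domain, percent=10):
    """Return the top `percent` of crawled domains by volume, rounded up."""
    ranked = sorted(bytes_by_domain, key=lambda d: (-bytes_by_domain[d], d))
    # Integer ceiling division avoids floating-point rounding surprises.
    count = -(-len(ranked) * percent // 100)
    return ranked[:count]


def datasets_over_threshold(size_by_dataset, percent=3):
    """Return datasets whose size exceeds `percent` of total training data."""
    total = sum(size_by_dataset.values())
    return sorted(
        name for name, size in size_by_dataset.items()
        if size * 100 > total * percent
    )


domains = {f"site{i}.example": i for i in range(1, 21)}
top = top_domains_to_disclose(domains)
large = datasets_over_threshold({"a": 50, "b": 2, "c": 48})
print(top, large)
```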

Are synthetic datasets exempt from transparency requirements?

No. The AI Office template explicitly requires disclosure of synthetic data usage. Synthetic data does not eliminate copyright obligations if the synthetic data was derived from protected works. If a model trained on copyrighted works generated the synthetic dataset, the synthetic data inherits the source material’s copyright status. Additionally, synthetic outputs may themselves carry downstream copyright risk if they contain expressions substantially similar to protected works. Providers must document synthetic data provenance, including the generation pipeline and source model training data, as part of their transparency disclosures.

AI Governance Desk
