Two Laws Walk Into a Dataset
On the structural impossibility of obeying both copyright and privacy in the age of foundation models, and what to build in their place.
Tags: AI training, copyright, GDPR, AI Act, machine unlearning, algorithmic disgorgement, data privacy, foundation models, regulatory ontology, Computer Law & Security Review
A note before the argument
The paper this post summarises has just been accepted in Computer Law & Security Review, the Elsevier journal of record for the intersection of technology, law, and security policy. CLSR is Q1 in the Law category, ranked second of ninety on the four-year metric, with a 2024 Journal Impact Factor of 4.89, a CiteScore of 6.2, an SJR of 1.085, and an h-index of 62. The article, Training on the Tightrope: AI Copyright and Data Privacy as Colliding Regulatory Regimes (reference CLSR_106343), will appear under the journal’s open access option, free for anyone to read on the day it goes live. The DOI and issue number will follow once the production team finishes its work.
What follows is the argument in the voice this Substack speaks, rather than the voice the law reviews demand: a summary, mind you, of what is to come.
The species of problem
There is a kind of legal problem that flatters everyone involved. Lawyers love it because it cannot be solved and so will pay them indefinitely. Legislators love it because it can be addressed by the production of more law, which is what legislators are paid to produce. Regulators love it because it provides them with infinite enforcement opportunities against the same defendant for the same conduct. Academics love it because it generates papers. The only people who do not love it are the small minority who believe that legal obligations should be the kind of thing one can actually obey, and these people, fortunately for the system, are not consulted.
The collision between copyright law and data-privacy law in the training of artificial intelligence systems is the cleanest contemporary example of this kind of problem.
The artefact under dispute
A foundation-scale language model is trained on a corpus of, conservatively, several billion documents drawn from the open web. The corpus contains, by anyone’s reckoning, a million news articles in which copyright subsists; thousands of literary works whose authors have not been consulted; the personal correspondence of academics who imagined themselves writing for an audience of fifty; and, threaded through all of it, the names, addresses, photographs, opinions, medical disclosures, and bureaucratic residue of hundreds of millions of identifiable human beings.
The training proceeds. The model emerges. It can produce fluent English on any topic. It cannot tell you which article it has read. It cannot identify which diary it has digested. It cannot, in any sense the law recognises as deletion, forget anything. The bytes that entered the training pipeline as expression and as personal data have been processed into statistical residue from which neither expression nor personal data can be cleanly recovered, and into which neither expression nor personal data can be cleanly returned.
This is the artefact about which the law has opinions.
The three regulators arrive
The copyright plaintiff, in the person of the New York Times or the Authors Guild or whichever publishing concern has discovered that its inventory has been digested without payment, arrives with the standard repertoire. He demands that the training dataset be preserved as evidence. He demands that the model be enjoined from further infringement, a demand whose meaning, applied to a probabilistic generator, nobody has yet specified. He demands damages, calculated by reference to the works copied, at statutory rates that, multiplied by the size of the corpus, exceed the gross domestic product of small republics. He demands, in the most ambitious cases, that the model itself be impounded and destroyed, as one impounds and destroys a printing press.
The data-protection authority, by contrast, arrives with a different vocabulary and a different set of concerns. He demands that the corpus be deleted in the portions that pertain to his citizens — the right to erasure, Article 17 of the GDPR, the cornerstone of the modern European conception of personal autonomy in the data age. He demands that the developer demonstrate a lawful basis for processing, a demand which, given the impracticality of obtaining individual consent from hundreds of millions of data subjects, leaves the developer arguing legitimate interests against an audience of regulators who are not in a charitable mood. He demands that the model itself prove, by some method nobody has yet invented, that it does not contain personal data in any sense that enables identification. He demands data minimisation, applied to a developer whose entire business model depends on the indefinite retention of everything.
The Federal Trade Commission, never to be left out of an enforcement opportunity, arrives with the doctrine of algorithmic disgorgement. He demands not merely the deletion of the data but the destruction of every model trained on it, on the theory that the taint of improper provenance infects the parameters as cholera infects a well.
Three regulators. Same developer. Same corpus. Same model. Demands, between them, that span preservation, deletion, injunction, demonstration of a negative, retention, minimisation, and destruction. They cannot all be obeyed, because some of them contradict each other. They cannot be obeyed in part, because each regime claims the whole of its jurisdiction. They cannot be reconciled by balancing, because — and this is the point that escapes the literature — there is no shared object over which to balance.
The ontological divergence
This is the central observation of the paper, and once it is grasped the rest follows.
Copyright law sees the corpus as a collection of works. Privacy law sees the corpus as a repository of personal data. These are not two perspectives on one thing. They are two legal objects constructed on the same physical substrate. The same byte sequence that copyright analyses through originality and fixation, privacy analyses through identifiability and consent. The two regimes are not disagreeing about how to value a shared thing. They are disagreeing about what the thing is. And until someone constructs a legal object that admits both characterisations simultaneously, no balancing exercise within either regime can resolve the resulting contradictions, because the regime within which the balancing is conducted does not recognise the existence of the other regime’s object.
The five contradictions
The paper extracts from this insight five structural contradictions, derived by enumerating the elementary obligations each regime imposes, pairing them, and identifying the cells in which joint compliance is logically or technologically impossible. The taxonomy is exhaustive within the obligation set examined, and it is falsifiable. A reader who finds a sixth, or who shows that one of mine is not contradictory, refutes it. That is the test, and I welcome anyone who wishes to apply it.
The Retention Paradox
Copyright says preserve the corpus. Privacy says delete it. The GDPR has a legal-claims exception, but it shields only the personal data relevant to the claim, not the whole corpus. The developer cannot retain everything and delete the parts that pertain to data subjects without retaining what he must delete, and cannot delete the parts that pertain to data subjects without compromising what he must preserve. He is told to do both, to do them honestly, and to document the process. This is the kind of instruction that produces compliance officers and bankruptcy lawyers in roughly equal measure.
The Unlearning Impossibility
Copyright wants the model destroyed or filtered. Privacy wants specific personal data surgically erased from the model. Neither operation can be done, in any sense the law would recognise. Exact unlearning — retraining from scratch on a clean corpus — is technically defined and economically prohibitive at foundation scale, where a single training run costs tens of millions of dollars. Approximate unlearning — gradient ascent, representation rewriting — is technically feasible at modest cost and verificationally unsound, in that recent adversarial work demonstrates the residue can be recovered by anyone with a sufficient interest in recovering it. Output filtering does not erase anything; it merely declines to disclose. The law’s preferred remedy is a forensic operation on a substrate the technology cannot perform that operation on, and the law has not yet noticed.
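For the reader who wants the technical gap made concrete, here is a minimal sketch of what the gradient-ascent variant of approximate unlearning amounts to: a handful of optimisation steps that push the model's loss upward on the material to be forgotten. Everything in it is an illustrative assumption (the toy model, the forget set, the learning rate, the step count), and nothing in it certifies that the forgotten content cannot be recovered, which is precisely the problem.

```python
# A minimal sketch of "approximate unlearning" by gradient ascent.
# Everything here is an illustrative assumption: the stand-in model,
# the "forget set", the learning rate, the number of steps.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in next-token predictor: embedding + linear head over a toy vocabulary.
vocab_size, dim, ctx = 100, 32, 8
model = nn.Sequential(
    nn.Embedding(vocab_size, dim),
    nn.Flatten(),
    nn.Linear(dim * ctx, vocab_size),
)
loss_fn = nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(model.parameters(), lr=1e-2)

# The material a regulator wants erased: 8-token contexts with next-token targets.
forget_x = torch.randint(0, vocab_size, (16, ctx))
forget_y = torch.randint(0, vocab_size, (16,))

for _ in range(20):
    loss = loss_fn(model(forget_x), forget_y)
    optimiser.zero_grad()
    # Gradient *ascent*: minimise the negated loss, i.e. make the model worse
    # at reproducing the forget set.
    (-loss).backward()
    optimiser.step()

# The forget-set loss has gone up, but nothing in this procedure proves the
# information is gone; adversarial probes can often still recover it.
with torch.no_grad():
    final = loss_fn(model(forget_x), forget_y).item()
print(f"forget-set loss after unlearning steps: {final:.3f}")
```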
The Consent–Licence Gap
A copyright licence is granted by the author. A privacy consent is granted by the data subject. They are different instruments, granted by different parties, governed by different regimes. The publisher who licenses the news archive does not speak for the politicians, witnesses, victims, and ordinary citizens named in the stories. Their data-subject rights run against the developer independently of any agreement with the publisher. The reverse is equally true: the user who consents to the processing of his personal data does not thereby grant a copyright licence for the third-party expression embedded in the consented content. Collective licensing schemes mitigate the gap; they do not close it, because they cannot speak for parties who never appointed them.
The Jurisdictional Fracture
American copyright is exclusively federal, by force of the Constitution and Section 301. American privacy is overwhelmingly state-level, by force of congressional paralysis. European copyright is partially harmonised through directives that admit national variation. European data protection is fully harmonised through a directly applicable regulation that twenty-seven national authorities enforce twenty-seven different ways. A developer operating across both jurisdictions faces a compliance matrix of bewildering complexity, and a litigant may bring copyright claims in federal court while data-protection authorities run parallel proceedings in twenty-seven national capitals, with no mechanism to coordinate their outcomes. This is not a regulatory system. It is a regulatory weather pattern.
The Remedial Mismatch
Copyright remedies are litigation-driven, backward-looking, and oriented to the rights holder’s economic interest. Privacy remedies are rights-driven, forward-looking, and oriented to the data subject’s autonomy. FTC remedies are oriented to the wrongful conduct of the developer. These are three different theories of what regulation is for, and they generate three different and incompatible accounts of what should be done about the model. Destruction satisfies the rights holder and the FTC and defeats privacy’s preference for targeted erasure. Targeted erasure satisfies privacy and leaves the rights holder unsatisfied. No combination addresses all three.
The European accomplishment
The European Union has built the most comprehensive legislative architecture for the regulation of artificial intelligence in the world: the General Data Protection Regulation, the 2019 Copyright Directive, and the AI Act, layered over each other like geological strata. The Directive’s text-and-data-mining exception, in Articles 3 and 4, was a real legislative achievement. It gave commercial AI developers a legal pathway for training that their American counterparts still lack.
It is also, in its operational complexity, its silence on downstream uses, and its character as a copyright instrument only, an illustration of the pathology I am describing. An Article 4 reservation is a copyright reservation. It says nothing about lawful basis under Article 6 of the GDPR for processing the personal data embedded in the same works. The European legislature created a copyright exception for AI training without considering the privacy consequences of the very same activity. This is what I mean when I say the regime is, in its own terms, comprehensive and incoherent.
The European Data Protection Board’s December 2024 opinion completes the picture by requiring the developer to prove that the model does not store personal data in a manner enabling identification, using a verification technology that does not yet exist. The developer is asked to prove a negative with an instrument that has not been built. He is told this is the law.
The American shrug
The American picture is no better, which is the kind of sentence one wishes one did not have to write.
There is no comprehensive federal privacy statute. There is no comprehensive federal AI statute. Copyright litigation has become the de facto policymaking instrument, with each major case proceeding on its own track, no mechanism for coordinating outcomes, and federal courts hearing copyright cases that have no occasion to consider the privacy implications of their remedial orders. State privacy enforcement and FTC algorithmic disgorgement run in separate institutional silos with no coordination protocol. Congress has been unable to pass comprehensive legislation in either domain for a decade, and shows no sign of remedying this. Executive orders cannot create statutory rights or restructure jurisdictional architecture. The American system, viewed from any altitude, is a regulatory pile-up at an intersection where nobody has thought to install a traffic light.
Why the smart fixes are not fixes
The hybrid approaches in the literature reduce the cost of fragmentation without addressing the source of the contradictions.
Risk-tiering, as in the AI Act and the NIST framework, sorts AI systems by risk profile and calibrates obligations accordingly. This is useful and does not solve the problem. Federated sector-led governance, as developed for the Indian context, coordinates across vertical sectors and regulatory layers. This is also useful and also does not solve the problem. AI incident reporting, modelled on telecommunications frameworks, provides the empirical record needed to make governance evidence-based. The AI Act’s general-purpose provisions impose model-level obligations cutting across copyright and privacy concerns.
Each of these contributes to the design of an adequate regulatory regime. None of them addresses the legal-object problem, because none of them constructs a legal object that admits the dual characterisation. Each operates downstream of the ontological question rather than on it.
An interstitial law of training data
The proposal in the paper is what I call an interstitial law of training data. Not copyright reform. Not privacy reform. A new layer that recognises the artefact for what it is and governs it accordingly. The framework has five pillars.
The unified regulatory category
Recognise training data as a legal object with a dual nature, expressive and personal, and require both courts and regulators to address both dimensions when ruling on it. This is not as radical as it sounds. Personal data is itself a regulatory invention; it did not exist as a legal category before data-protection legislation created it. Fair use is a judicial creation. Trade secrets bridge contract, tort, and property. Categories follow function, and the function here demands a category neither regime currently provides.
The tiered compliance model
Four tiers, one for each combination of copyright and privacy concerns. Most regulatory weight falls on the tier in which both regimes apply, which is the tier in which most web-scraped data sits. Mandatory ex ante classification before training begins, addressing the temporal asymmetry between copyright’s retrospective fair-use analysis and privacy’s prospective lawful-basis requirement.
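Schematically, and purely as an illustration, since the paper defines the tiers doctrinally rather than computationally, the classification is the cross-product of the two regimes' triggers, with the heaviest obligations falling where both fire. The tier names and the two boolean tests below are assumptions standing in for the paper's doctrinal criteria.

```python
# Illustrative sketch of ex ante tier classification. The tier names and the two
# boolean triggers are assumptions standing in for the paper's doctrinal tests.
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    NEITHER = 1          # no protected expression, no personal data
    COPYRIGHT_ONLY = 2   # protected expression, no identifiable individuals
    PRIVACY_ONLY = 3     # personal data, no protected expression
    BOTH = 4             # both regimes apply: most web-scraped data lands here


@dataclass
class TrainingItem:
    copyright_protected: bool      # does original expression subsist in the item?
    contains_personal_data: bool   # does it relate to an identifiable person?


def classify(item: TrainingItem) -> Tier:
    """Classification performed before the item enters the training corpus."""
    if item.copyright_protected and item.contains_personal_data:
        return Tier.BOTH
    if item.copyright_protected:
        return Tier.COPYRIGHT_ONLY
    if item.contains_personal_data:
        return Tier.PRIVACY_ONLY
    return Tier.NEITHER


# A scraped news article naming real people lands in the heaviest tier.
print(classify(TrainingItem(copyright_protected=True, contains_personal_data=True)))
```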
The model-audit regime
This is where the paper does the substantive work of substituting an environmental-regulation logic for the impossible demand of perfect deletion. The Clean Air Act does not require zero emissions. It requires monitored compliance with emission standards. The Clean Water Act does not require the elimination of all pollutants. It requires verified concentration limits. Financial regulation does not demand banks eliminate risk. It demands stress tests and audits.
The model-audit regime applies the same logic. The obligation is not to eliminate every trace of protected expression and personal data from the model, which may be technologically impossible. The obligation is to demonstrate, through rigorous independent testing, that the model does not behave in ways that materially infringe copyright or violate privacy. Two dimensions tested. Copyright memorisation. Personal-data leakage. Concrete thresholds. Adversarial probes drawn from the unlearning-attack literature, so that surface-level filtering does not satisfy the audit. Independent auditors. Periodic certification. Public reporting of incidents.
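Stripped to its skeleton, a memorisation probe might look like the sketch below: feed the model a prefix drawn from a protected work, compare its continuation against the original, and raise a flag when the verbatim overlap crosses a threshold. The generate interface, the fifty-token prefix, and the twenty-five-token threshold are assumptions made for illustration; the paper leaves the concrete numbers to the audit standard-setters, and a real audit would add the adversarial prompt variants just mentioned.

```python
# Illustrative memorisation probe. The generate() interface, prefix length, and
# threshold are assumptions; a real audit would also use adversarial prompt variants.
from typing import Callable


def longest_common_token_run(a: list[str], b: list[str]) -> int:
    """Length of the longest run of consecutive tokens shared by a and b."""
    best = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            best = max(best, k)
    return best


def memorisation_probe(
    generate: Callable[[str, int], str],  # model interface: (prompt, max tokens) -> text
    work: str,
    prefix_tokens: int = 50,
    threshold: int = 25,
) -> bool:
    """Return True if the model reproduces a long verbatim run of the protected work."""
    tokens = work.split()
    prompt = " ".join(tokens[:prefix_tokens])
    reference = tokens[prefix_tokens:]
    completion = generate(prompt, len(reference)).split()
    return longest_common_token_run(completion, reference) >= threshold


# Usage sketch with a stand-in "model" that regurgitates its training text verbatim.
memorised_text = " ".join(f"tok{i}" for i in range(120))
fake_model = lambda prompt, n: " ".join(memorised_text.split()[len(prompt.split()):])
print(memorisation_probe(fake_model, memorised_text))  # True: audit flag raised
```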
Safe harbours
A set of safe harbours, conditioned on five practices: good-faith assessment and tier classification of the training dataset; reasonable technical measures honouring copyright opt-outs and anonymising personal data; record-keeping sufficient for the AI Act’s transparency requirements or their American equivalent; periodic audits with mandatory incident reporting; and prompt response to standardised takedown notices and erasure requests through a notice-and-response mechanism modelled on Section 512 of the DMCA but adapted to the realities of training. Safe harbours are not absolute. They do not protect wilful infringement. They do not protect models that demonstrably memorise protected expression or accurately reveal sensitive personal data. They apply to the training process, not the deployment of the trained model.
Institutional design
Two options, one ambitious and one feasible. The ambitious option is a dedicated federal AI Training Data Office, located in the Department of Commerce alongside NIST and the Patent and Trademark Office, structured as an independent multi-member commission with combined copyright, data-protection, and machine-learning expertise. The feasible option is an interagency AI Training Data Coordination Council, established by executive order, comprising the Copyright Office, the FTC, NIST, and state privacy agencies, organised along a federated five-layer architecture: regulation, standards, assessment, certification, enforcement, with horizontal coordination at each layer. The dedicated agency is more powerful. The coordination council is more achievable. The choice is political, not analytical.
The objections, briefly
The administrative-burden objection is met by the tiering structure and a small-entity exception modelled on environmental and securities regulation. The technological-obsolescence objection is met by deliberately technology-neutral, performance-based design, defining training data by function rather than format. The sovereignty-and-harmonisation objection is met by aiming for mutual recognition of audit certifications rather than full legislative harmonisation, the precedent being international accounting standards. The rights-holder objection is met by the safe harbour’s exclusion of wilful infringement and demonstrable memorisation, and by the observation that the alternative is the current regime of existential legal uncertainty, which is worse for rights holders than the proposal. The privacy-absolutist objection is met by the observation that the absolutist conception of erasure was always a legal fiction, even in the pre-AI world, and that the model-audit approach is more honest about the technological realities while preserving the functional objective of the right.
What the framework does not claim
It does not eliminate the tension between copyright and privacy. That tension is inherent in the technology and follows from the ontological divergence. The framework is a rational structure for managing what cannot be dissolved. The audit thresholds, the safe-harbour conditions, the institutional structures are parameters to be empirically tested and iteratively refined. They are not the final word. They are the first coherent word, after a generation of incoherent ones.
A closing observation
The legal profession has a habit of treating regulatory contradiction as evidence of the law’s sophistication, on the theory that a framework subtle enough to demand impossibilities must be sophisticated indeed. This is a vanity. A framework that demands impossibilities is a framework that has not been thought through, and the function of the academic, when invited to admire it, is to decline the invitation and point out where the thinking stopped.
The collision between copyright and privacy in AI training is not subtle. It is structural. It is consequential. It is not solved by another commentator clearing his throat about the importance of balance.
The paper says so. The journal has agreed to publish it. The reference is CLSR_106343 and you will find it open access in Computer Law & Security Review once production finishes its work. Until then, this is the argument in shorter form, in the voice this Substack speaks. Whether you agree or disagree with the proposal, the contradictions are there. They will not disappear because we look away. The AI systems being trained today will outlive this commentary by decades, and the law that governs them, whatever it turns out to be, will be the work either of design or of accident.
I prefer design.