ABSTRACT
Courts and commentators commonly assume that scraping copyrighted works to train large language models (LLMs) constitutes "copying" under U.S. copyright law. This Article argues that the premise is wrong. Training creates no fixed, human-readable embodiment of the source material; models retain only non-expressive statistical weightings. Without fixation, there is no "copy," and because developers exercise no volitional control over any later verbatim output, direct-infringement liability fails. Even if training were deemed copying, it is a highly transformative use that leaves every traditional market for the underlying works intact, easily satisfying fair-use analysis. The infringement inquiry should therefore shift downstream to users who intentionally prompt a model to reproduce protected expression. Refocusing the law this way preserves copyright's incentive structure while allowing AI research and innovation to flourish.
Hoag, Marc, Why AI Training Should Fall Outside Copyright’s Domain: A Legal Analysis (June 12, 2025).