How AI models learn from online data
In recent years, technology companies have argued that generative artificial intelligence models learn from online content in a way similar to human beings.
However, recent academic studies and journalistic investigations suggest a more complex reality. In several cases, AI systems do not merely learn abstract concepts but can memorize and regenerate copyright-protected content, raising significant legal and economic questions for companies, authors and creative industries.
The Stanford Study on Generative AI
Research published by Stanford University in January 2026 analysed some of the most advanced AI models currently available, including:
- GPT-4.1
- Claude 3.7 Sonnet
- Gemini 2.5 Pro
- Grok 3
The study showed that, when prompted with short initial excerpts and when security filters are bypassed, these models can reconstruct large portions of copyrighted books, sometimes with a very high level of fidelity to the original text.
The most relevant finding concerns the possible storage of portions of literary works within the internal parameters of the models, which could then be retrieved through targeted prompts.
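To make this kind of probing concrete, here is a minimal, hypothetical sketch of how a prefix test might be scored. This is not the Stanford team's actual methodology or code; the stub "model", the function names and the n-gram overlap metric are all illustrative assumptions.

```python
def memorization_score(model_continue, prefix, original_continuation, n=5):
    """Rough overlap metric: the fraction of n-grams in the model's
    continuation that also appear in the original text. A score near 1.0
    suggests verbatim reproduction; near 0.0 suggests novel text."""
    generated = model_continue(prefix)

    def ngrams(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    gen, orig = ngrams(generated), ngrams(original_continuation)
    if not gen:
        return 0.0
    return len(gen & orig) / len(gen)

# Stub "model" that simply parrots a memorized passage, for illustration only.
memorized = "It was the best of times, it was the worst of times"
score = memorization_score(lambda p: memorized, "It was", memorized)
print(score)  # 1.0 when the model reproduces the text verbatim
```

In a real audit, `model_continue` would call an actual AI model's API, and researchers would compare scores across many book excerpts to estimate how much protected text is recoverable.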
Not only books: images, music and articles
This phenomenon is not limited to literature.
An investigation published in The Atlantic by journalist Alex Reisner highlighted that other types of content can also re-emerge from AI models in recognisable forms, including:
- images and photographs
- song lyrics
- journalistic articles
This scenario challenges the widespread belief that AI systems are trained solely at a conceptual level and not through the storage of original content.
Human learning and AI: a fundamental difference
Comparing human learning and artificial intelligence learning can be misleading.
When a person reads a text, they tend to interpret it, connect it to other knowledge and rework it over time. Moreover, much of the information is forgotten or transformed.
An AI model, on the other hand, compresses huge amounts of data into mathematical parameters. When generating a response, it reconstructs statistically probable sequences of words, some of which may closely resemble content contained in the training data.
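The statistical mechanism described above can be illustrated with a deliberately tiny toy model. The bigram table below is a drastic simplification (real models compress data into billions of parameters, not word-pair counts), but it shows the core idea: generation means repeatedly sampling a statistically probable next word, and with a small enough "training set" the output often reproduces training fragments verbatim.

```python
import random

# Tiny "training corpus" for a toy bigram model (illustrative only).
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
bigrams = {}
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams.setdefault(prev, {})
    bigrams[prev][nxt] = bigrams[prev].get(nxt, 0) + 1

def generate(start, length, seed=0):
    """Sample a sequence by repeatedly choosing a likely next word."""
    random.seed(seed)
    words = [start]
    for _ in range(length):
        choices = bigrams.get(words[-1])
        if not choices:
            break
        tokens, counts = zip(*choices.items())
        words.append(random.choices(tokens, weights=counts)[0])
    return " ".join(words)

print(generate("the", 5))
```

Every word this toy model emits comes straight from its training data; scaled up, the same dynamic explains how a large model can emit sequences closely resembling its training content.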
This difference raises doubts about the very use of the term “learning” in relation to generative AI.
Copyright and fair use in the United States
In the United States, many technology companies defend the use of protected works by invoking the principle of fair use, arguing that training models constitutes a transformative use of content.
However, if a system is capable of reproducing entire chapters of a book, the transformative nature of such use becomes questionable.
A case in point is the lawsuit filed by The New York Times against OpenAI, in which the newspaper accuses the company of using millions of articles without authorization and, in some cases, allowing paywalls to be bypassed.
The European Union’s position
In Europe, the regulatory framework is more restrictive than in the United States.
There is no general fair-use principle, and the exceptions to copyright that do exist, such as those for text and data mining, allow rights holders to opt out.
In January 2026, the European Parliament proposed stricter rules for artificial intelligence, calling for:
- greater transparency on training datasets
- fair compensation for authors and creators
- clearer legal liability for AI companies
A significant precedent is the 2025 ruling by the Munich Regional Court, which established that training on protected musical texts constitutes illegal reproduction, even when the works are transformed into numerical parameters.
Opaque datasets and bypassable filters
Another issue concerns the lack of transparency of training datasets.
Currently, there are no complete public lists of the works used to train AI models, nor systematic independent audits. Some datasets are known to have included pirated books, yet no effective external controls exist.
Furthermore, the security filters adopted by companies do not eliminate potentially stored content; they merely block direct access to it. In several cases, these filters can be bypassed using relatively simple techniques.
The impact on the cultural industry
The implications for the cultural industry are significant.
Writers, journalists, musicians and artists risk seeing their works incorporated into AI models without consent or compensation, with a potentially substantial economic impact.
If substantial parts of a work can be obtained for free via chatbots, the commercial value of the original content may decline, putting pressure on the entire creative ecosystem.
Innovation and regulation: finding a balance
Completely restricting access to data would slow the development of artificial intelligence.
At the same time, continuing to argue that these systems “learn like humans” appears increasingly unconvincing.
To ensure the development of sustainable AI, many experts point to the need for a new balance based on:
- clear copyright rules
- greater transparency in training datasets
- exclusion rights for authors
- fair compensation systems for creators
Conclusions: a decisive phase for AI and copyright laws
Regulatory decisions in the coming years will be crucial in defining the relationship between artificial intelligence and human creativity.
Without adequate protection for authors and creators, there is a risk of progressive impoverishment of the cultural ecosystem. Directly addressing the issue of content storage and copyright laws has now become a necessary condition for truly sustainable technological development.
