In the ever-evolving landscape where ink meets code, The New York Times has unleashed a legal storm against artificial intelligence (AI) giants Microsoft and OpenAI. As we bid farewell to a year marked by explosive growth in generative AI, the newspaper’s lawsuit takes centre stage, accusing the tech behemoths of riding on the coattails of journalistic brilliance without fair compensation. The battle lines have been drawn, with The Times boldly claiming that the very essence of its groundbreaking journalism is under threat, setting the stage for a landmark case that could redefine the relationship between media and artificial intelligence. The lawsuit represents a watershed moment, as The New York Times becomes the first major news organisation to take legal action against generative AI. Asserting that there is nothing transformative about using its content without payment, The Times alleges that the models built by OpenAI and Microsoft, including ChatGPT, do not merely transform its work but occasionally indulge in what the newspaper terms memorization. The lawsuit claims the AI models can regurgitate chunks of its articles verbatim, posing a grave threat to the newspaper’s revenue streams.

New York Times vs OpenAI

The New York Times and OpenAI are embroiled in a legal and ethical battle over the use of The Times’ content in training ChatGPT, and over the potential implications for copyright, journalism, and the future of information generation. The Times alleges copyright infringement, claiming OpenAI copied millions of its articles to train ChatGPT without permission or compensation. In a 70-page complaint filed in a Manhattan federal court, the newspaper accuses OpenAI, which is backed by Microsoft, of unauthorized use of copyrighted material and of profiting from its work and name. The Times has bolstered its case with 100 examples of ChatGPT reproducing its articles verbatim, a feat rarely showcased in earlier AI-related copyright suits. Meanwhile, OpenAI has swiftly responded, denying the accusations and dismissing the lawsuit as without merit. Central to its defence is the argument that training AI models on publicly available internet material, including news articles, falls within the bounds of fair use and is essential for fostering innovation. OpenAI contends that The Times exaggerated the problem, accusing it of using manipulated prompts to trigger ChatGPT’s alleged content reproduction. While acknowledging the removal of a feature named Browse due to inadvertent content duplication, OpenAI maintains that ChatGPT typically does not behave as The Times describes.

This legal tussle has ignited broader conversations about AI training data, the interpretation of copyright law in the digital age, and the concept of fair use. The case has the potential to establish crucial precedents for both AI development and intellectual property rights. Beyond the legal intricacies, the case has garnered international attention, prompting discussions about the ethical implications of employing copyrighted material for training AI models. Some advocates argue for AI researchers to seek opt-in permissions from content creators before incorporating their work into training datasets. Moreover, the lawsuit filed by the Times might catalyze other media companies to consider similar legal action against AI developers.

What is Fair Use under copyright laws?

Fair use is a legal doctrine that allows the use of copyrighted material under certain circumstances without obtaining permission from, or paying royalties to, the copyright holder. It is a complex doctrine, and each case is evaluated on its own merits; the outcome depends on the specific circumstances of the use, and legal interpretations may vary. The US Copyright Act outlines four factors that determine whether a use of copyrighted material qualifies as fair use:

  1. The purpose and character of the use, including whether it is commercial or for nonprofit educational purposes. Uses that are transformative, adding something new to the original work, are more likely to be considered fair use.
  2. The nature of the copyrighted work. Creative works receive more protection than factual or informational works, so using largely factual content from the internet may be more likely to qualify as fair use.
  3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole. Using small or less significant portions of a work is more likely to be considered fair use.
  4. The effect of the use upon the potential market for, or value of, the copyrighted work. If the use negatively impacts the market for the original work, it weighs against a finding of fair use.

What are Large Language Models (LLM)?

Large Language Models (LLMs) are advanced natural language processing models that are trained on massive amounts of text data to understand and generate human-like language. These models are a type of artificial intelligence (AI) that falls under the broader category of machine learning, and they have gained significant attention in recent years due to their impressive ability to perform a wide variety of language-related tasks.

LLMs are characterized by their massive scale, both in the amount of training data they are exposed to and in the number of parameters they possess. Training on large datasets helps them learn intricate language patterns and nuances, while the number of parameters determines a model’s complexity and learning capacity. Large language models often have millions or even billions of parameters, enabling them to capture and generate highly sophisticated language structures. They are trained on diverse datasets comprising a wide range of text sources, including books, articles, websites, and other publicly available content from the internet, which helps them develop a broad understanding of language. These models can be fine-tuned for specific applications and used for various natural language processing tasks, such as text completion, translation, summarization, question answering, and even creative writing. Some of the most popular LLMs are BERT, T5, and Gemini (which powers Bard) by Google, ChatGPT by OpenAI, Megatron-Turing NLG by Microsoft and Nvidia, LLaMA by Meta, and ERNIE by Baidu.
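To make the “billions of parameters” claim concrete, the sketch below estimates the parameter count of a GPT-style transformer from its configuration. It is a simplified back-of-the-envelope formula (it ignores biases, layer norms, and position embeddings, and real architectures vary), but it lands close to the roughly 124 million parameters of the smallest GPT-2 model:

```python
# Rough parameter count for a GPT-style transformer.
# Simplified estimate: ignores biases, layer norms, and position embeddings.
def transformer_params(n_layers: int, d_model: int, vocab_size: int, d_ff: int = None) -> int:
    d_ff = d_ff or 4 * d_model                # feed-forward width, conventionally 4x d_model
    attn = 4 * d_model * d_model              # Q, K, V, and output projection matrices
    mlp = 2 * d_model * d_ff                  # two feed-forward weight matrices
    embeddings = vocab_size * d_model         # token embedding table
    return n_layers * (attn + mlp) + embeddings

# GPT-2 small configuration: 12 layers, hidden size 768, 50257-token vocabulary
print(transformer_params(12, 768, 50257))     # ~124 million parameters
```

Scaling the same formula to dozens of layers and a hidden size in the tens of thousands is what pushes modern LLMs into the hundreds of billions of parameters.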

How does OpenAI train ChatGPT?

OpenAI has not publicly disclosed the specific details of how ChatGPT’s training data is collected from the internet. However, it is known that ChatGPT is trained with unsupervised learning on a diverse range of internet text. Unsupervised learning involves training a model without labelled examples, allowing it to learn patterns and information from the data it is exposed to. During the training process, a language model like ChatGPT is typically fed large amounts of text data from various sources on the internet, including articles, websites, forums, and other publicly available text content. The model learns to generate human-like text by predicting the next word or sequence of words in a sentence, based on its understanding of the patterns and contexts present in the training data. It’s important to note that while ChatGPT is trained on a diverse dataset from the internet, OpenAI has implemented measures to address potential issues, such as biases and inappropriate content. Additionally, the model is designed to generate responses based on patterns it has learned during training, rather than by retrieving specific information from the internet in real time.
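The next-word-prediction objective described above can be illustrated with a toy model. The sketch below is not how ChatGPT actually works internally (real LLMs use neural networks, not count tables, and the tiny corpus here is made up), but the training signal is the same idea: observe which word follows which, then continue text with the most likely next word.

```python
from collections import Counter, defaultdict

# Toy bigram "language model": count which word follows which in the
# training text, then predict by picking the most frequent continuation.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1                   # "learn" transition statistics

def predict_next(word: str) -> str:
    # Return the word most frequently seen after `word` in the training data.
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))                    # "cat" — it follows "the" most often
```

A neural LLM replaces the count table with billions of learned parameters and conditions on long contexts rather than a single previous word, but it is optimized for exactly this prediction task.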

OpenAI asserts that training AI models with publicly available internet materials, including news articles, falls under fair use. They argue that the vast and diverse nature of the internet data, combined with the transformative nature of training AI models, supports their claim of fair use. OpenAI contends that the use of copyrighted material is not the primary purpose of training these models and that such use is crucial for innovation in the field of artificial intelligence. The ongoing legal proceedings between The New York Times and OpenAI will likely provide more insights into how fair use is applied in the context of AI models using information from the internet.

A Fight to Watch

The lawsuit between OpenAI and The New York Times (NYT) has significant implications for generative AI, journalism, and the broader landscape of intellectual property rights. It could set a legal precedent for the use of copyrighted material in training generative AI models, establishing boundaries and guidelines for fair use in the context of AI development. A ruling in favour of OpenAI could influence how other AI developers approach the use of data for training purposes. The case will also weigh on the responsibility of AI developers to seek consent from, or provide fair compensation to, content creators. The result will likewise prompt media organizations to reassess how AI technologies affect their ability to generate revenue and maintain reader engagement, and its resolution may shape how AI developers negotiate licensing deals with media organizations. OpenAI may be reluctant to settle with The New York Times, as a settlement could invite similar claims from numerous other organisations. Meanwhile, OpenAI is likely to garner support from other LLM developers, whose businesses are equally affected. Given the high-profile nature of the case, a result does not appear close; the lawsuit might go on for months, if not years.

