OpenAI Deal with Financial Times Another Sign That Data Is the Key LLM Battlefront
These licensing deals are unlikely to last forever as data needs will decline
OpenAI cut a new content licensing deal with the Financial Times. It is the latest indicator of the shift in large language model (LLM) competition from model development to data curation. According to Financial Times reporting:
Under the terms of the deal, the FT will license its material to the ChatGPT maker to help develop generative AI technology that can create text, images and code indistinguishable from human creations.
The agreement also allows ChatGPT to respond to questions with short summaries from FT articles, with links back to FT.com. This means that the chatbot’s 100mn users worldwide can access FT reporting through ChatGPT, while providing a route back to the original source material.
This confirms that OpenAI will have access to Financial Times data for responding to news queries and for model training. The former will provide more value to ChatGPT users. However, the data for model training may be the more significant element. Sam Altman, OpenAI’s CEO, indicated in a 2023 interview with Lex Fridman that data curation for model training would likely become a key differentiator for future LLM performance.
Gaining a Customer
OpenAI also wrote in its blog that the Financial Times became a customer in 2024. That secures a high-profile user of ChatGPT Enterprise despite strong competition from Microsoft and Google for similar applications.
In addition, the FT became a customer of ChatGPT Enterprise earlier this year, purchasing access for all FT employees to ensure its teams are well-versed in the technology and can benefit from the creativity and productivity gains made possible by OpenAI’s tools.
Seeking Performance
LLM performance improvements were largely driven by increased parameter counts up through 2022. That shifted in 2023 to a combination of new model architectures, such as Mixture of Experts (MoE), and larger training datasets. These trends have continued, but training data curation is emerging as a new source of differentiation.
For example, Microsoft indicated in its announcement for the new Phi-3 small language models (emphasis added):
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data.
Curated datasets are helpful in model pre-training because the inclusion of higher-quality data offsets, or in some cases replaces, lower-quality data that can undermine output quality. These datasets are also helpful in supervised fine-tuning, which is employed to refine models after their initial training.
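The kind of curation described above can be illustrated with a minimal sketch of heuristic quality filtering over a raw corpus. The rules and thresholds here are illustrative assumptions for demonstration, not the actual pipeline used for Phi-3 or any production model:

```python
# Illustrative sketch of heuristic pre-training data filtering.
# The rules and thresholds are assumptions for demonstration only,
# not Microsoft's or OpenAI's actual curation pipeline.

def quality_score(text: str) -> float:
    """Score a document on simple quality heuristics, from 0.0 to 1.0."""
    words = text.split()
    if len(words) < 20:  # too short to be useful training data
        return 0.0
    score = 1.0
    # Penalize documents dominated by markup or symbol debris.
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    if alpha_ratio < 0.8:
        score -= 0.5
    # Penalize highly repetitive (spam-like) documents.
    unique_ratio = len(set(words)) / len(words)
    if unique_ratio < 0.3:
        score -= 0.5
    return max(score, 0.0)

def filter_corpus(docs: list[str], threshold: float = 0.75) -> list[str]:
    """Keep only documents scoring at or above the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    "buy now " * 50,                         # repetitive spam
    "<div><div><div>###</div></div></div>",  # markup debris
    "Large language models improve when low-quality web text is "
    "filtered out and replaced with curated or synthetic data, "
    "as the Phi-3 results suggest.",
]
kept = filter_corpus(docs)  # only the third document survives
```

In practice, production pipelines layer many more signals (deduplication, language identification, classifier-based scoring), but the principle is the same: removing low-quality data improves what the model learns per token of training.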
News Revenue
Several news media publishers have welcomed the opportunity to take in OpenAI licensing fees. The Financial Times reported that the Axel Springer licensing deal included an upfront fee for access to the publisher’s content catalog, along with “a larger fee” for annual licensing of newly published information. This is presumably a similar structure to the deal for Financial Times content.
The reporting also indicated that Axel Springer expects to “earn tens of millions of euros a year.” This is larger than the $1-$5 million fees that OpenAI has reportedly been offering other publishers. However, Axel Springer is licensing access to several publications in multiple languages, likely justifying the higher price tag.
Avoiding the Courts
News organizations are rightly pointing out the contrast between the Financial Times deal and The New York Times’ decision to sue OpenAI. NYT contends that OpenAI unlawfully used its content to train AI models and has provided ongoing access to news stories. This, in effect, robs The New York Times of customer relationships, advertising revenue, and new paying subscribers, according to the complaint.
The New York Times may be advocating for property rights and limiting tech giants' expropriation of intellectual property. However, it is more likely using the lawsuit as a negotiating tactic to extract higher fees from OpenAI.
The publisher also faces the risk of losing in court and creating a precedent for a broader definition of fair use of content published on the internet. This is a key reason why most observers believe NYT expects to settle its lawsuit before it comes to the judgment phase. With that said, this approach may backfire for a different reason.
The Economics of Information
OpenAI licensing fees are “found revenue” for news publishers, which typically have difficulty monetizing their catalogs of previously published stories, and they represent a novel income source for news publishing. Even so, the deals are unlikely to become an endless stream of fees. LLM developers need quality data more for training models than for providing real-time access to news.
Once model developers have the training data, their need for multiple news sources decreases significantly. There are fair use rules around search, and most news stories are covered by multiple outlets. Even if a model developer wants access to real-time news for its generative AI assistants, there is so much overlap in coverage that there is little need to engage with any single publisher. In fact, once a deal is in place with the Associated Press (AP), the value of NYT content declines.
This is not to say the NYT lacks unique or high-quality content. Rather, the consideration is binary: if access to certain content is needed, a model developer may strike a licensing deal to avoid a significant gap in its customer offering. The next deal, for largely redundant content, may be viewed as optional.
It is unlikely that the big model developers will continue licensing news content at a large scale after the initial need for high-quality training data is met. The rapid improvement of open-source models offered by Meta, Mistral, Microsoft, and X.ai suggests that pricing compression, if not full commoditization, will intensify. That will lead proprietary model developers to scrutinize “optional costs” that the open-source providers do not absorb. As for filling the content gaps for generative AI assistants, fair use in a search context is unlikely to make publisher licensing agreements a requirement.
OpenAI’s strategy is particularly savvy because it needs the data now. In addition, any deal it strikes with a publisher should negate downstream liability for claims that proprietary publisher content was used to train its AI models. It is unknown whether publishers would win such claims in court. Regardless, OpenAI should be largely insulated from lawsuits by any of its publisher partners, even if it cancels annual deals.
OpenAI has announced licensing deals with the Financial Times, Le Monde, Prisa, Axel Springer, and AP. Outside of publications with depth in niche topics, each new deal makes additional publisher content less valuable. There will be more deals between OpenAI and publishers over the next year. However, they won’t last forever.