Associated Press Licenses Its News Archive to OpenAI and Will Begin Using LLMs
Access to data for one and to capabilities for the other
The Associated Press and OpenAI inked a deal last week to provide OpenAI access to the AP’s large library of news content. Some key AP stats will explain why OpenAI is particularly interested in partnering with the news syndicator:
The AP published over 400,000 stories in 2022.
It has 3.65 million text stories in its archives.
It also published 21,500 hours of live video in 2022.
It has more than 2 million video stories in the archives.
The archives include more than 60 million photos (presumably all with descriptive information).
OpenAI’s large language models (LLMs) have almost certainly been trained on some AP news stories, since they are so widely published on the web. However, there are likely many more stories from before widespread internet publishing that GPT-3 and GPT-4 have never encountered. That is probably doubly true for the transcripts of the video news stories.
The AP’s text archive alone would likely make a deal attractive to OpenAI. However, the video and image libraries may also be helpful for training the DALL-E text-to-image model and any future OpenAI text-to-video models.
Sam Altman, CEO of OpenAI, spoke to the importance of the company’s dataset on the Lex Fridman podcast:
We spend a huge amount of effort pulling that together from many different sources. There's like a lot of, there are open source databases of information.
We get stuff via partnerships. There's things on the internet. It's a lot of our work is building a great data set.
Altman was mostly evasive when Fridman asked about specifics associated with OpenAI’s LLMs. However, you can tell from these comments and others that OpenAI sees data as a competitive advantage. News story data from the AP may just extend that advantage. It also may just lock in a customer.
A News Customer for OpenAI
The announcement also reflects a new customer acquisition for OpenAI. News organizations will certainly become large users of generative AI. Gannett, the largest newspaper publisher in the U.S., recently announced a deal with OpenAI competitor Cohere. The AP is a substantial customer win that counterbalances that deal. The AP announcement suggested both companies will benefit:
The arrangement sees OpenAI licensing part of AP’s text archive, while AP will leverage OpenAI’s technology and product expertise. Both organizations will benefit from each other’s established expertise in their respective industries, and believe in the responsible creation and use of these AI systems.
…
“Generative AI is a fast-moving space with tremendous implications for the news industry. We are pleased that OpenAI recognizes that fact-based, nonpartisan news content is essential to this evolving technology, and that they respect the value of our intellectual property,” said Kristin Heitmann, AP senior vice president and chief revenue officer.
…
AP began automating corporate earnings reports in 2014 and subsequently added automated stories previewing and recapping some sporting events, thereby expanding its content offering. Additionally, AP uses AI technology to aid in the transcription of audio and video from live events like press conferences.
Voicebot.ai also reported that AP executives confirmed that the OpenAI deal is not exclusive. The AP makes 78% of its revenue from content licensing. It is unclear whether OpenAI is paying directly for access to AP text data or getting access in exchange for the use of its foundation models.
It is worth remembering that the LLM wars are actually proxy battles in the larger cloud computing wars, and data is the ammunition.
At least “showing intent”.
Some critics might say too late.
I agree with your analysis of dynamics between both parties.
Might we apply GDPR terminology and call OpenAI a “Data Processor” because they “...process personal data on behalf of the Controller”? The end user would then be the Controller, since they provide the instructions about processing activities and are responsible for the processing.
Will the AP and OpenAI deal indemnify them from future allegations of IP and copyright violations?