OpenAI's Bid to Become a Data Titan - Distilling the 4-Pronged Generative AI Data Strategy
GPTs represent a giant new UGC data source for model training
The phrase “data is the new oil” was provocative in 2006 but may seem dated to many in 2023. There is also the question of whether it was ever true. Sure, data was a central asset for a privileged few. Google, Facebook, and Amazon each created multiple billion-dollar businesses on it. Bloomberg, Nielsen, Yelp, Experian, Fair Isaac, IRI, and others have made data the central asset of their product offerings.
When it comes to creating advanced large language models (LLMs), the analogy is equally strong. Crude oil goes into a refinery, processing units break it down and transform it into different grades and byproducts, and refined products come out. That output can then power machines, businesses, homes, and nations. Data is fed into a large language model, usually housed in a cloud computing center (i.e., the data refinery). It is refined through training, and the data-processing capability that emerges can power other solutions, machines, businesses, and maybe more.
OpenAI makes LLMs that could not exist without feeding them large piles of data. It does not have Google's or Meta’s data aggregation businesses, so it must determine how to efficiently source data to feed the models. Recent announcements and new products paint a clearer picture of how OpenAI intends to win the data war despite its origin as a data consumption and not a data generation business.
An underrecognized story is that GPTs are the latest data source feeding OpenAI’s data-hungry models. The four key data source categories OpenAI is tapping include:
Public Content - web and other sources
Partner Content - proprietary with robust metadata
User-Generated Content - freely shared data directly with OpenAI
Created Content - human and synthetically generated
Public Content
OpenAI, like most of its competitors, began data collection by crawling the web or accessing pre-existing data piles such as Common Crawl, Wikipedia, and WebText2. This is called public data because it is published in public spaces. That does not mean the information is permitted for public use. It may contain proprietary or copyright-protected data.
For these data sources, OpenAI must determine what to collect and, more recently, has begun focusing on how best to curate it. OpenAI’s CEO, Sam Altman, commented in an interview on the Lex Fridman Podcast, “A lot of our work is building a great dataset.” This has led OpenAI to view its data sources internally as no less strategic (and maybe more so) than its LLMs and other foundation models.
A key challenge is data accuracy. It is one thing for a GPT model to collect a lot of natural language data to learn how to generate responses that are humanlike in their textual structure. Ensuring the content of those responses is accurate or truthful is another matter. This is one place where the curation process impacts product performance. It is not necessarily about maximizing the amount of accurate or truthful information. That can be helpful, but the bigger impact comes from choosing data sources that minimize the amount of inaccurate and untruthful information.
However, humanlike textual content generation and accuracy do not address another problem: ownership. Who has rights to the data OpenAI is using to train its models?
OpenAI began by accessing datasets compiled by the open-source and academic communities. The introduction of GPTBot confirms that OpenAI also collects data directly from the public web. To reduce conflicts over ownership rights, the company has created a method for website hosts to indicate they do not want their content used for training data. At the same time, conducting its own internet crawls reduces OpenAI’s reliance on third-party public information sources and improves the recency of the data accessible to its models.
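The opt-out mechanism OpenAI documents for GPTBot is the standard robots.txt convention. A site that wants none of its pages collected for training can add the following (the site-wide `Disallow: /` here is illustrative; owners can scope the path to specific directories instead):

```text
# robots.txt -- block OpenAI's training crawler from the entire site
User-agent: GPTBot
Disallow: /
```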
Partner Content
It can be expensive to curate your own data, particularly when you are indexing the knowledge of the world. In addition, not all useful knowledge resides on the web, and some of that web-published content is copyrighted. These considerations suggest partnering with reliable sources that create or curate high-quality natural language information will benefit OpenAI and its users. Unsurprisingly, the company is already doing this.
In July, OpenAI announced a partnership with The Associated Press (AP), a trusted source of news content. The company also inked a deal with Shutterstock around the same time to gain access to image, video, and music libraries. This data will fuel OpenAI’s DALL-E and forthcoming video and music foundation models. Content from those models will also show up as multimodal output for ChatGPT users.
A key benefit of the Shutterstock data is not just the media files themselves but the metadata descriptions that accompany them. In a large image dataset such as LAION-5B, even with CLIP filtering, errors will remain. A low error count also does not equal robustness. Shutterstock’s library should provide OpenAI with a largely error-free dataset with robust metadata descriptions. This, in turn, should enable OpenAI to produce more accurate image and video output.
Shutterstock’s content library metadata will also include ownership information when available. That can help OpenAI reduce the risk of copyright violations in its model outputs. The AP deal also includes images along with access to text-based information. Anytime this licensed data is used to generate new text or images, OpenAI reduces its risk of copyright infringement.
Last week, OpenAI announced a new program called Data Partnerships. The program formalizes data-sharing arrangements. According to the announcement:
Modern AI technology learns skills and aspects of our world — of people, our motivations, interactions, and the way we communicate — by making sense of the data on which it’s trained. To ultimately make AGI that is safe and beneficial to all of humanity, we’d like AI models to deeply understand all subject matters, industries, cultures, and languages, which requires as broad a training dataset as possible.
Including your content can make AI models more helpful to you by increasing their understanding of your domain.
OpenAI is creating two paths for collaboration. The first is an Open-Source Archive, which would be public for anyone to use in training AI foundation models. The second is Private Datasets. The latter would be proprietary and presumably available only to OpenAI for training. This is also the category where data owners can expect some value exchange, either monetary or through free or discounted access to OpenAI solutions.
OpenAI mentioned last week two previously unannounced partnerships, with the Icelandic Government and the Free Law Project, as examples. It is unclear whether these fall under the Open-Source or Private programs. Other Private Dataset partnerships are sure to follow. However, once OpenAI has a robust and accurate dataset in a domain, the value-sharing arrangements it extends to additional partners may become less attractive. The exception will be for unique intellectual property OpenAI would like to access without taking on the undue legal risk of using it outside a formal agreement.
User-Generated (and Shared) Content
OpenAI’s most valuable data source may turn out to be content generated or shared by users. It appears the scale of this data source was unanticipated before the massive popularity enjoyed by ChatGPT. But OpenAI saw a clear case of product-market fit and ran with it. ChatGPT is a big driver of revenue and data for the company, and it is about to get much bigger.
Data generated by ChatGPT and other foundation models and services provided by OpenAI includes:
Conversations - The questions and comments made by users are valuable to OpenAI, similar to the benefits Google gains from its position as the web’s search giant. This data contains natural language information and is sometimes laden with expertise. More importantly, information about how users react to responses represents signals of model output quality in terms of intent identification, response accuracy, and response robustness.
Source Preferences - As ChatGPT provides more source information through citations, plugins, GPTs, and other features, it will provide insight into source ranking authority. Google and its successor search engines were built on this premise. OpenAI has a path to gain this insight directly and bypass the traditional search giant’s lock on this information.
GPTs (Curation) - We don’t know whether GPTs will succeed in the market. However, there are already thousands available, and many GPT creators point their solutions to preferred data sources. As GPT users interact with the services, they will provide another signal for OpenAI about source quality.
GPTs (Uploads) - GPTs also enable users to upload information to be employed as a primary knowledge base. Much of this data does not exist on the web in the form of the uploaded files. Surely, some of it is proprietary. This data may turn out to be a unique and valuable dataset for OpenAI.
Whisper Transcripts - OpenAI’s Data Partnership announcement mentions that it can “Work with data in almost any form and can use our next-generation in-house AI technology to help you digitize and structure your data. For example, we have world-class optical character recognition (OCR) technology to digitize files like PDFs, and automatic speech recognition (ASR) to transcribe spoken words.” Whisper is poised to become the leading ASR solution. Transcriptions that run directly through OpenAI’s Whisper service, as opposed to the earlier open-source model, will represent yet another form of content. Some of that may arrive via partnerships, but much of it will come through use of the service.
It is true that OpenAI offers users an option to limit how their ChatGPT data is used and states that it does not train on data passing through its API products. According to the terms of use:
(c) Use of Content to Improve Services. We do not use Content that you provide to or receive from our API (“API Content”) to develop or improve our Services. We may use Content from Services other than our API (“Non-API Content”) to help develop and improve our Services. You can read more here about how Non-API Content may be used to improve model performance. If you do not want your Non-API Content used to improve Services, you can opt out by filling out this form. Please note that in some cases this may limit the ability of our Services to better address your specific use case.
ChatGPT and DALL-E are listed as non-API services. You can opt out of data sharing with ChatGPT, but doing so limits the features you can access, and additional protections require a manual, form-based process. Most of the time, for most users, their interactions with ChatGPT and GPTs are available for OpenAI to train its models.
Created Content - Human and Synthetic
Of course, OpenAI also creates its own content. Reinforcement Learning from Human Feedback (RLHF) involves both reviewing model outputs and adding new information to improve model performance. This is an often overlooked data source, but it is entirely controlled by the company that created it.
There is also rising interest in and use of synthetic data in foundation model training. The Financial Times (FT) reported in July:
Artificial intelligence companies are exploring a new avenue to obtain the massive amounts of data needed to develop powerful generative models: creating the information from scratch.
Microsoft, OpenAI and Cohere are among the groups testing the use of so-called synthetic data — computer-generated information to train their AI systems known as large language models (LLMs) — as they reach the limits of human-made data that can further improve the cutting-edge technology.
…
As generative AI software becomes more sophisticated, even deep-pocketed AI companies are running out of easily accessible and high-quality data to train on. Meanwhile, they are under fire from regulators, artists and media organisations around the world over the volume and provenance of personal data consumed by the technology.

At an event in London in May, OpenAI’s chief executive Sam Altman was asked whether he was worried about regulatory probes into ChatGPT’s potential privacy violations. Altman brushed it off, saying he was “pretty confident that soon all data will be synthetic data”.
Aidan Gomez, CEO of Cohere, told the FT that data from the web is “noisy and messy,” but getting human experts to fill in the gaps is very costly. Gomez also said many foundation model developers are already using synthetic data but not broadcasting that fact. A standard process is to generate synthetic data that human experts then review. This saves content-creation time and focuses human effort on the more efficient task of review and validation, which would be needed in any event.
There is certainly a downside to this. At some point, if generative AI foundation models are trained mostly on the outputs of other foundation models, we are likely to see quality degradation over time.
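A toy illustration of this feedback loop, using only Python’s standard library: here a hypothetical “model” is just a Gaussian fit to its training data, and its “synthetic output” is a set of evenly spaced quantiles of that fit — a deliberately simplified stand-in for a generative model that reproduces typical values while under-representing the tails of what it saw. Retraining each generation solely on the previous generation’s synthetic output steadily collapses the spread (diversity) of the data. This is a sketch of the degradation mechanism, not a claim about how LLM training works.

```python
import statistics


def next_generation(data, n=20):
    """Fit a Gaussian 'model' to data, then emit a synthetic dataset of n
    evenly spaced quantiles of the fit. Like a generative model, this
    reproduces typical values but under-samples the tails it was trained on.
    """
    fit = statistics.NormalDist(statistics.mean(data), statistics.pstdev(data))
    return [fit.inv_cdf((i + 0.5) / n) for i in range(n)]


# Generation 0: stand-in "human" data, spread like a unit-variance normal.
data = [statistics.NormalDist(0.0, 1.0).inv_cdf((i + 0.5) / 20) for i in range(20)]
spread = [statistics.pstdev(data)]

# Each subsequent generation trains only on the previous generation's
# synthetic output -- the tails get clipped a little more every time.
for _ in range(50):
    data = next_generation(data)
    spread.append(statistics.pstdev(data))

print(f"spread of generation 0:  {spread[0]:.3f}")
print(f"spread of generation 50: {spread[-1]:.3f}")
```

Each generation multiplies the spread by the same factor below one, so the diversity of the data decays geometrically — the qualitative behavior behind warnings about “model collapse.”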
Data Titan
LLMs are data monsters. As a general rule, the more high-quality data they devour in their domain, the more value they provide. Most foundation model developers are primarily consumers of source data. That data can be costly. New open-source options, such as the RedPajama 30 trillion token dataset, can help reduce data acquisition costs. However, LLM differentiation going forward may depend more on data quality than quantity.
OpenAI is set up to become a data titan. Its large consumer and business user base, substantial resources, and market momentum suggest its biggest advantage may soon be its data. Google became a data titan by dominating search. Facebook achieved the title through social media. OpenAI is on a path to reach this status via generative AI services. The question remains whether other LLM providers that are not Google or Meta can match OpenAI in data differentiation.