What is GPTBot and Why You Want OpenAI's New Web Crawler to Index Your Content

Plus, how to stop the crawler!

Aug 07, 2023

OpenAI has quietly added documentation to its website about GPTBot. According to OpenAI, “GPTBot is OpenAI’s web crawler and can be identified by the following user agent and string.”

User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
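
For publishers who want to spot these requests in their own server logs or application code, here is a minimal Python sketch. The helper name `is_gptbot` is hypothetical, and matching on the `GPTBot` token rather than the full string is an assumption (the version number could change). A User-Agent header can be set by any client, so a match only means the request claims to be GPTBot.

```python
import re

# Full user-agent string as quoted from OpenAI's documentation above.
GPTBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)

# Assumption: match the "GPTBot" token anywhere in the header,
# case-insensitively, instead of comparing the whole string.
_GPTBOT_TOKEN = re.compile(r"\bGPTBot\b", re.IGNORECASE)


def is_gptbot(user_agent):
    """Return True if the User-Agent header value claims to be GPTBot."""
    return bool(_GPTBOT_TOKEN.search(user_agent or ""))


if __name__ == "__main__":
    print(is_gptbot(GPTBOT_UA))
    print(is_gptbot("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```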

OpenAI’s Web Crawler

It is unsurprising that OpenAI has a web crawler. While it could have gathered internet data for training GPT-3 and GPT-4 through third-party sources, OpenAI understandably wants more control over its training data. The notable element of the new information in OpenAI’s docs is a method for identifying the requesting crawler, disallowing access, and “customizing” or filtering access. Disallowing is a binary decision to block all crawling, while customizing lets website publishers allow crawling of only some designated content.
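
Both choices are expressed through robots.txt rules that target the GPTBot user agent token. The Python sketch below embeds two example rule sets as strings and checks them with the standard library's `urllib.robotparser`. The directory paths are placeholders, the helper name `can_gptbot_fetch` is hypothetical, and the parser only approximates how a well-behaved crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# Option 1: block GPTBot from the entire site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Option 2: allow some content and block other content.
# The directory names are placeholders, not real paths.
PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def can_gptbot_fetch(robots_txt, path):
    """Check whether the given robots.txt text lets GPTBot fetch a path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", path)


if __name__ == "__main__":
    print(can_gptbot_fetch(BLOCK_ALL, "/any-page"))
    print(can_gptbot_fetch(PARTIAL, "/directory-1/post"))
    print(can_gptbot_fetch(PARTIAL, "/directory-2/post"))
```

The text of either rule set would go in the robots.txt file at the root of the site. Rules addressed to GPTBot do not affect other crawlers such as Google's or Bing's.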

This approach is similar to how you indicate to Google or Bing your crawling preferences. It is also a way for OpenAI to act as a good digital citizen by providing a choice to website publishers about whether or not they want the generative AI leader to access their content.

The new option could also blunt OpenAI’s risk from complaints about using proprietary information without consent. Granted, legal disputes around intellectual property issues and generative AI models are in their infancy, and this move will not offer OpenAI blanket protection against claims of unauthorized use. It will, however, show a standard of care based on an implicit opt-in and optional opt-out. It also suggests that OpenAI may defend its use of data published on the web as similar to a search engine’s use, which is settled law.

How OpenAI Will Use Web Crawl Data

Many website owners are incentivized to allow Google, Bing, and other search engines to crawl their content so their target users can discover them more easily. Websites want to be found, and search engines provide a way to find them.

This mutually beneficial relationship is not as straightforward when it comes to large language models and the applications built on them. Many, like ChatGPT today, do not provide source links or references. That means they provide information to users but do not typically create a path for discovering the source web page or credit the creator. OpenAI clearly states the value transaction for allowing GPTBot and, by omission, distinguishes it from traditional search.

“Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety. Below, we also share how to disallow GPTBot from accessing your site.”

The value transaction is assumed to be unidirectional. OpenAI benefits from improving its AI models, and there is no mention of facilitating website discovery.

Of course, there are other large language model (LLM) solutions that offer a familiar bi-directional value exchange. Perplexity AI is an example of a generative AI search engine that crawls the web and, in its responses, provides source links to facilitate the discovery of the websites providing information.

Why You Want to Enable GPTBot Anyway

The value transaction may be unidirectional today, but most website publishers will still want to permit GPTBot to crawl their site. ChatGPT surpassed 100 million monthly active users in January and has more than five million users on iOS and at least that many on Android. In addition, GPT-3.5 and GPT-4 are the most widely used LLMs by third-party developers. OpenAI has massive reach today.

If your content is not part of OpenAI model training, it means that your ideas and commentary will not be part of the results provided to users of ChatGPT as well as all of the applications that use these LLMs. That may not seem like a significant issue today, because absence from the results is no different than inclusion if there are no links or credit that drive website discovery. However, what we see today is not necessarily what we should expect to see tomorrow.

OpenAI’s Role in Search Transformation

This market is moving quickly. OpenAI’s generative models are not retrieval models and therefore are limited in what they can do for sourcing information. However, that could change as the foundation models evolve, and it could change more radically with a new GPT model designed for search.

Any search-optimized generative AI model from OpenAI would be trained using GPTBot web crawl data, and that same dataset would almost certainly be updated regularly and used for inference (i.e., generating search results). This is where the bi-directional value transaction would emerge, because OpenAI would then be offering discovery and referral in exchange for access to the data.

Search engine optimization (SEO) is going through a transformation. The keyword-oriented SEO strategies employed today will change as search results become more about answers than links and more driven by what users want than what publishers want to give them. OpenAI is a key company helping usher in that change through the use of GPT-4 with Bing Chat and as foundation model technology for other generative search providers. It strikes me as risky to be absent from that dataset.

Let me know what you think. Should companies permit GPTBot to crawl their websites? If not, what are the exceptions or the decision framework you would apply?
