OpenAI has launched a new web crawler called GPTBot, which collects online content to train large language models like GPT-4, the model behind chatbots such as ChatGPT.
OpenAI Deploys Web Crawler to Read Everything for ChatGPT Training
The company stated in a blog post that allowing the GPTBot to access website content can enhance the accuracy of artificial intelligence models, improve their overall capabilities, and ensure their safety.
The AI leader also mentioned that the pages GPTBot collects are filtered to remove paywalled sources, personal information, and text that violates its policies.
How to prevent your website or blog content from being collected for ChatGPT training
OpenAI provides an easy way to block GPTBot by adding an entry to the website's robots.txt file, which tells crawlers such as those run by Google and Bing which areas they may access.
In addition, website administrators can customize which sections GPTBot can access. OpenAI also publishes the IP address ranges the crawler uses, so it can be blocked at the firewall level as well.
How to block ChatGPT's crawler from your website
All you need to do to stop GPTBot from crawling your site is to add a blocking entry to the robots.txt file, which most hosting platforms let you edit through your blog or website settings.
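As a rough illustration of IP-based blocking, the sketch below checks whether a visitor's address falls inside a list of CIDR ranges. The ranges shown are placeholders taken from the reserved documentation blocks, not OpenAI's real addresses, and the names `HYPOTHETICAL_GPTBOT_RANGES` and `is_in_ranges` are made up for this example. In practice you would substitute the ranges OpenAI publishes and apply the check in your firewall or server configuration.

```python
import ipaddress
from typing import Iterable

# Placeholder CIDR ranges (reserved documentation blocks), NOT OpenAI's real ranges.
# Replace them with the ranges OpenAI publishes for GPTBot.
HYPOTHETICAL_GPTBOT_RANGES = ["192.0.2.0/24", "198.51.100.0/28"]


def is_in_ranges(ip: str, cidr_ranges: Iterable[str]) -> bool:
    """Return True if the address `ip` belongs to any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


if __name__ == "__main__":
    for visitor in ["192.0.2.77", "198.51.100.20", "203.0.113.5"]:
        blocked = is_in_ranges(visitor, HYPOTHETICAL_GPTBOT_RANGES)
        print(visitor, "-> block" if blocked else "-> allow")
```

Blocking by IP complements robots.txt rather than replacing it: robots.txt relies on the crawler choosing to obey it, while an IP rule is enforced by your own server.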
Disallow GPTBot
To prevent GPTBot from accessing your site, add the following entry to your site's robots.txt file:
User-agent: GPTBot
Disallow: /
Customizing GPTBot access
To allow GPTBot to access only specific parts or sections of your website, add the GPTBot user-agent token to your site's robots.txt file with Allow and Disallow rules, as follows:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
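Before publishing a robots.txt change, it can help to confirm that the rules behave as intended. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The helper name `can_fetch` and the `example.com` domain are placeholders for illustration, and the parser only approximates how a real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Rule set that blocks GPTBot from the whole site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Rule set that opens one directory to GPTBot and closes another.
PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

SITE = "https://example.com"  # placeholder domain

if __name__ == "__main__":
    print(can_fetch(BLOCK_ALL, "GPTBot", SITE + "/any-page"))
    print(can_fetch(BLOCK_ALL, "Googlebot", SITE + "/any-page"))
    print(can_fetch(PARTIAL, "GPTBot", SITE + "/directory-1/post"))
    print(can_fetch(PARTIAL, "GPTBot", SITE + "/directory-2/post"))
```

Note that these entries name only GPTBot, so other crawlers, such as search engine bots, are unaffected by them.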
It's worth noting that the large language models used in ChatGPT have been trained on massive amounts of data from the web, collected up until September 2021.
Furthermore, data collected before that date cannot be retroactively removed. However, blocking the new crawler can at least limit future collection, protecting the content of websites that do not want it used for training.
Many website owners who are not keen on AI replicating their content are already taking advantage of the ability to block the crawler.
A notable example is the well-known science fiction magazine Clarkesworld, which announced on the social media platform X (formerly known as Twitter) that it had blocked the GPTBot.
Similarly, The Verge, a technology news website, took the same step, and many articles are now circulating with advice on blocking the crawler.
Web crawlers are not a new concept; they are considered a lifeline of the modern internet. In many cases, websites encourage crawlers from Google and other search engines to visit them because doing so helps bring in web traffic.
However, many website owners now believe that utilizing their data to train generative AI models is unacceptable.
For instance, a recent lawsuit against OpenAI alleged that training ChatGPT on everything others have written online, including books and articles, without permission constitutes theft.