After OpenAI recently announced that web admins can block its systems from crawling their content via an update to their site’s robots.txt file, Google is also looking to give web managers more control over their data, including whether its scrapers can ingest that data for its generative AI tools.
As explained by Google:
“Today we’re announcing Google-Extended, a new control that web publishers can use to manage whether their sites help improve Bard and Vertex AI generative APIs, including future generations of models that power those products. By using Google-Extended to control access to content on a site, a website administrator can choose whether to help these AI models become more accurate and capable over time.”
That’s similar to the wording that OpenAI has used in trying to get more sites to allow data access, with the promise of improving its models.
Indeed, the OpenAI documentation explains that:
“Retrieved content is only used in the training process to teach our models how to respond to a user request given this content (i.e., to make our models better at browsing), not to make our models better at creating responses.”
Obviously, both Google and OpenAI want to keep bringing in as much data from the open web as possible. But the capacity to block AI models from content has already seen many big publishers and creators do so, as a means to protect copyright, and stop generative AI systems from replicating their work.
And with discussion around AI regulation heating up, the big players can see the writing on the wall, with stricter rules around the datasets used to build generative AI models likely to come eventually.
Of course, it’s too late for some, with OpenAI, for example, already building its GPT models (up to GPT-4) based on data pulled from the web prior to 2021. So some large language models (LLMs) were already built before these permissions were made public. But moving forward, it does seem like LLMs will have significantly fewer websites that they’ll be able to access to construct their generative AI systems.
Permission-based access of this kind will become a necessity, though it’ll be interesting to see if this also comes with SEO considerations, as more people use generative AI to search the web. ChatGPT got access to the open web this week, in order to improve the accuracy of its responses, while Google is testing out generative AI in Search as part of its Search Labs experiment.
Eventually, that could mean that websites will want to be included in the datasets for these tools, to ensure they show up in relevant queries, which could see a big shift back to allowing AI tools to access content at some stage.
Either way, it makes sense for Google to move into line with the current discussions around AI development and usage, and ensure that it’s giving web admins more control over their data, before any laws come into effect.
Google further notes that as AI applications expand, web publishers “will face the increasing complexity of managing different uses at scale”, and that it’s committed to engaging with the web and AI communities to explore the best way forward, which will ideally lead to better outcomes from both perspectives.
You can learn more about how to block Google’s AI systems from crawling your site here.
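As a rough illustration of how this kind of robots.txt control works, here is a minimal Python sketch that checks a sample robots.txt with the standard library's `urllib.robotparser`. `Google-Extended` and `GPTBot` are the tokens published by Google and OpenAI, but the sample rules, the `example.com` URL, and the `is_allowed` helper are assumptions made for the example. Python's parser also only approximates how each company's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Google-Extended and GPTBot,
# while leaving every other crawler (including regular search) allowed.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    for agent in ("Google-Extended", "GPTBot", "Googlebot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, page) else "blocked"
        print(f"{agent}: {verdict}")
```

In this sketch, the two AI-related tokens are blocked across the whole site while other crawlers still match the wildcard rule, which is the sort of selective opt-out that both announcements describe.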