The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an "emerging crisis in consent," as publishers and online platforms have taken steps to prevent their data from being harvested.
The researchers estimate that in the three data sets -- called C4, RefinedWeb and Dolma -- 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted.
https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html
Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.