Welcome to Codidact Meta!

Codidact Meta is the meta-discussion site for the Codidact community network and the Codidact software. Whether you have bug reports or feature requests, support questions or rule discussions that touch the whole network – this is the site for you.

Comments on What can be done to block Codidact content from getting used by crawlers/for AI training?

Parent

What can be done to block Codidact content from getting used by crawlers/for AI training?

+9
−2

Given the latest debacle Somewhere Else, where users fear that the site's content will be used to train AI, what can be done at Codidact to prevent that from happening?

  • To what extent can we block "crawlers" and the like from stealing site content? What is technically possible?

  • Do the communities want such a block, if technically possible to achieve?

If we were able to block such crawlers, or at least state that Codidact content will never be deliberately used for training AI (unless perhaps attribution can be guaranteed), I think that could be a major selling point for winning new users over here.

There is a mass exodus of users leaving SE currently because of this and many will be looking for a new "home".


1 comment thread

What's the harm? (2 comments)
Post
+10
−1

To what extent can we block "crawlers" and the like from stealing site content? What is technically possible?

We can block at least the OpenAI crawler and the Google-Extended crawler (for Gemini) through the robots.txt file. We've been discussing this in the admin room for the past few days, and while nothing has been done as of yet, the general sentiment has been leaning towards blocking these AI crawlers.

If the community indicates support for such a move, we'll most likely block AI crawlers to the extent possible, at least for crawlers that we're aware of and have documented methods of blocking. (We don't want to block all crawlers, since that would mess up e.g. the Wayback Machine and search engines.)
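As a rough illustration of the selective approach described here, the sketch below uses Python's standard `urllib.robotparser` to evaluate a robots.txt the way a compliant crawler would. `GPTBot` and `Google-Extended` are the user-agent tokens published by OpenAI and Google; the rule set itself, the `is_allowed` helper, and the example.com URL are assumptions made for illustration, not Codidact's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (not Codidact's real file): disallow two documented
# AI-training crawlers by user-agent token and leave all other crawlers allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.com/posts/1") -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Two AI crawlers, one search engine crawler, one archive crawler.
    for agent in ("GPTBot", "Google-Extended", "Googlebot", "ia_archiver"):
        print(f"{agent}: {'allowed' if is_allowed(agent) else 'blocked'}")
```

The check only models crawlers that read and respect the file; robots.txt is a request, not an enforcement mechanism.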

Update: Cloudflare added the ability to block known LLM bots and we have enabled this for our network.


2 comment threads

So we'll block only responsible bots (3 comments)
Do it. (2 comments)
So we'll block only responsible bots
Olin Lathrop‭ wrote 4 months ago

What you are saying is that since we can't actually block or even detect AI bots, all we can do is ask them not to scrape the site. That means only the responsible bots will stop, leaving the irresponsible bots. Doesn't sound any better.

Monica Cellio‭ wrote 3 months ago

Olin Lathrop, what's the alternative? How do you combat irresponsible/unethical players without also hurting the people we serve -- who should be able to find us through search engines, access the sites without first having to create accounts, etc.? We can't block all bad actors but we can block some; isn't that better than doing nothing?

Olin Lathrop‭ wrote 3 months ago

Monica Cellio, "isn't that better than doing nothing?" That's the question. I'm not sure this issue is worth doing anything about. First, everybody just takes it as a given that bots crawling the site is harmful, but nobody has really made arguments to that effect. Second, stopping a few bots that play by the rules while all the others continue what they do seems like feel-good theater, not a real solution.