
Welcome to Codidact Meta!

Codidact Meta is the meta-discussion site for the Codidact community network and the Codidact software. Whether you have bug reports or feature requests, support questions or rule discussions that touch the whole network – this is the site for you.

Comments on What can be done to block Codidact content from getting used by crawlers/for AI training?

Post

What can be done to block Codidact content from getting used by crawlers/for AI training?

+9
−2

Given the latest debacle Somewhere Else, where users fear that site content will be used to train AI, what can be done at Codidact to prevent this from happening?

  • To what extent can we block "crawlers" and the like from stealing site content? What is technically possible?

  • Do the communities want such a block, if technically possible to achieve?

If we were able to block such crawlers, or at least state that Codidact content will never be deliberately used for training AI (unless perhaps attribution can be guaranteed), I think that could be a major selling point for winning new users over here.

There is currently a mass exodus of users from SE because of this, and many will be looking for a new "home".


1 comment thread

What's the harm? (2 comments)
Olin Lathrop wrote 6 months ago

Prohibiting activities should only be done for activities that are harmful, but you haven't provided any evidence that crawling the site for AI data is harmful. Also, we can't know what a crawler does with the data. You can only prohibit crawling or not; you can't prohibit crawling to gather AI training data while allowing crawling to index web pages, gather word usage frequency, or serve any number of other purposes we might never know about. Some crawlers might be "nice" in that they tell you why they are crawling and respect your request not to crawl for a particular use, but those are probably the ones least likely to cause whatever problem you are trying to avoid.
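As a rough illustration of the point about "nice" crawlers, here is a minimal sketch of how a well-behaved crawler might check a robots.txt file before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt contents, the user-agent names `ExampleAIBot` and `ExampleSearchBot`, and the `example.org` URL are all hypothetical and not taken from Codidact's actual configuration. Note that the rules select crawlers by the name they announce, not by what they do with the data, and a crawler that ignores robots.txt is not affected at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the user-agent names are made up for this
# example and do not reflect any real site's configuration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical page URL; only the path matters for rule matching.
page = "https://example.org/posts/1"

# A cooperative crawler asks this question before fetching a page.
ai_allowed = parser.can_fetch("ExampleAIBot", page)
search_allowed = parser.can_fetch("ExampleSearchBot", page)

print("ExampleAIBot allowed:", ai_allowed)
print("ExampleSearchBot allowed:", search_allowed)
```

The check runs entirely on the crawler's side, so it only constrains crawlers that choose to perform it.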

Lundin wrote 6 months ago

@Olin Lathrop The main harm in the case of GenAI would be that it steals licensed content, bakes it into the training, and then uses that stolen content without any attribution to the original author (since the AI itself normally doesn't even know where its training data came from). Or worse: it uses the content in a hallucination where half of the AI output is stolen from an original author and the rest is complete nonsense and lies.