
Post History

Score: 50% (+1 / −1)
Q&A: What can be done to block Codidact content from getting used by crawlers/for AI training?


posted 4 months ago by matthewsnyder · edited 3 months ago by matthewsnyder

Answer
#2: Post edited by matthewsnyder · 2024-07-23T22:31:00Z (3 months ago)

This edit changed a single bullet point; the rest of the post is unchanged from the initial revision below.

  • Before: Indicate in `robots.txt` that we don't want crawlers
  • After: Indicate in `robots.txt` that we don't want AI crawlers
#1: Initial revision by matthewsnyder · 2024-07-22T18:40:45Z (4 months ago)
There is no technical measure that could possibly guarantee that we won't get scraped. It comes down to symbolic gestures and hoping that crawler operators will comply voluntarily. I do think we should do as many of these "symbolic gestures" as possible.

* Indicate in `robots.txt` that we don't want crawlers
* If there are any heuristic services like the one Mithical mentioned for Cloudflare, enable them. I don't think it's worth putting too much effort into writing our own; the scrapers will win that arms race. But just using an existing service lets us make their life harder at little cost to us.
* The licensing terms should be updated to say "you may not use the answers to train AIs". This will make the bigger projects avoid us, because their legal departments will complain.

These don't actually stop anyone from scraping us, but they make us a less preferred target. While we're small, we become "small risk, tiny reward", and they'll go for other sites that are no risk, small reward. When we're bigger, it ceases to be a technical problem, because they will attempt to bribe or coerce the site admins to do it clandestinely.
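
As a concrete sketch of the first suggestion, here is what a minimal `robots.txt` opt-out for AI crawlers could look like. The user-agent tokens shown (OpenAI's GPTBot, Common Crawl's CCBot, and Google's Google-Extended control) are examples of documented AI-related crawlers; the list is illustrative, would need ongoing maintenance, and, as the post notes, compliance is entirely voluntary on the crawler's side.

```
# Example robots.txt fragment: ask documented AI-training crawlers to stay away.
# Purely advisory; well-behaved crawlers honor it, others can ignore it.

# OpenAI's web crawler
User-agent: GPTBot
Disallow: /

# Common Crawl (datasets widely used for AI training)
User-agent: CCBot
Disallow: /

# Google's token for opting out of AI training (does not affect search indexing)
User-agent: Google-Extended
Disallow: /

# All other crawlers keep normal access.
User-agent: *
Allow: /
```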