What can be done to block Codidact content from getting used by crawlers/for AI training?
Given the latest debacle Somewhere Else, where users fear that site contents will be used to train AI, what can be done at Codidact to prevent the same from happening here?
- To what extent can we block "crawlers" and the like from stealing site content? What is technically possible?
- Do the communities want such a block, if it is technically possible to achieve?
If we were able to block such crawlers, or at least state that Codidact content will never deliberately be used to train AI (unless, perhaps, attribution can be guaranteed), I think that could be a major selling point for winning new users over here.
There is currently a mass exodus of users leaving SE because of this, and many will be looking for a new "home".
2 answers
There is no technical measure that could possibly guarantee that we won't get scraped. It comes down to symbolic gestures and hoping scrapers comply voluntarily. I do think we should make as many of these "symbolic gestures" as possible.
- Indicate in `robots.txt` that we don't want AI crawlers (a sketch of what that could look like follows this list).
- If there are any heuristic services like what Mithical mentioned for Cloudflare, enable them. I don't think it's worth putting too much effort into writing our own; the scrapers will win that arms race. But just using an existing service lets us make their life harder at little cost to us.
- The licensing terms should be updated to say "you may not use the answers to train AIs". This will make the bigger projects avoid us, because their legal department will complain.
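For concreteness, here is a sketch of what such `robots.txt` additions could look like. `GPTBot` and `Google-Extended` are the user-agent tokens OpenAI and Google document for their crawlers; exactly which bots to list, and keeping that list current, is left open here.

```
# Sketch only: the set of user agents to list is an open question and
# would have to be maintained as new crawlers appear.

# OpenAI's crawler
User-agent: GPTBot
Disallow: /

# Google's opt-out token for Gemini / AI training
User-agent: Google-Extended
Disallow: /

# Everything else (search engines, the Wayback Machine, ...) stays allowed.
User-agent: *
Disallow:
```

Note that a crawler follows the most specific group that matches it, so the wildcard group only applies to bots not named above and ordinary search engines keep their normal access.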
These don't actually stop anyone from scraping us, but they make us a less attractive target. While we're small, we become "small risk, tiny reward" and they'll go for other sites that are no risk, small reward. When we're bigger, it ceases to be a technical problem, because they will attempt to bribe or coerce the site admins to do it clandestinely.
> To what extent can we block "crawlers" and the like from stealing site content? What is technically possible?
We can block at least the OpenAI crawler and the Google-Extended crawler (for Gemini) through the `robots.txt` file. We've been discussing this in the admin room for the past few days, and while nothing has been done as of yet, the general sentiment has been leaning towards blocking these AI crawlers.
If the community indicates support for such a move, we'll most likely block AI crawlers to the extent possible, at least for crawlers that we're aware of and have documented methods of blocking. (We don't want to block all crawlers, since that would mess up e.g. the Wayback Machine and search engines.)
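As a rough illustration (not something that is actually deployed), here is how one could sanity-check such rules with Python's standard-library robots.txt parser: the AI crawlers named above should be denied while an ordinary search crawler stays allowed. The rules and the test URL below are placeholders.

```python
# Illustration only: feed proposed robots.txt rules to Python's
# standard-library parser and confirm the AI crawlers are denied
# while a normal crawler is still allowed.
from urllib.robotparser import RobotFileParser

# Placeholder rules mirroring the robots.txt sketch above.
PROPOSED_RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(PROPOSED_RULES)

# Placeholder URL; any page on the site would do.
url = "https://meta.codidact.com/"

assert not parser.can_fetch("GPTBot", url)           # OpenAI crawler: blocked
assert not parser.can_fetch("Google-Extended", url)  # Gemini crawler: blocked
assert parser.can_fetch("Googlebot", url)            # regular search: allowed
print("Proposed robots.txt rules behave as intended.")
```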
Update: Cloudflare added the ability to block known LLM bots and we have enabled this for our network.