Is it okay to scrape Codidact for personal tools?
I'd like to use my own tools for browsing Codidact. Examples include:
- Summarize activity for a proposal I'm interested in, similar to what's described in https://meta.codidact.com/posts/289288/289291#answer-289291
- Summarize latest posts in a way that the site UI doesn't yet support, such as collecting recent posts across multiple sites
- Keep track of responses to specific posts in a way that the builtin notifications don't support
- Create a CLI tool so I can use Codidact without leaving my terminal
There is currently no API. If there were one, I would of course use it, and it would put much less burden on CD than scraping. However, I understand that it may take a lot of time and effort for CD devs to actually implement an API. Instead of waiting for that, I'd like to just write my own tools that scrape the HTML.
A corollary is reverse engineering internal API calls based on what the browser is doing.
Is this okay to do? I understand that I should not expect any kind of support from the devs for this. That's okay, I can figure it out myself - my main question is, will the CD devs get very mad when they see my Python useragent in the access logs?
To be clear, I am asking only about very limited scraping here, which would have usage similar to an active human user (potentially less, because the scraper wouldn't bother downloading many resources like scripts and images). I am not asking about:
1. Mass-downloading non-trivial sections of the site all at once
   - I assume downloading a couple dozen posts is not harmful, though
2. Making so many requests that I am effectively doing a DoS attack
3. Creating my own website, with monetization all going to me, which actually just pulls the content from CD
   - Exception: An alternate Web UI on localhost, intended for a single user, is probably harmless, no?
4. Training AI models on Codidact content
   - Possible exception: Training a small, local, per-user model to filter and rank posts based on that user's preferences, while respecting the constraints imposed by 1 and 2 and without any intent to market or distribute the model to anyone but the primary user
Post
As a general rule: yes, for personal use there's no issue with that. Caveats:
- First & foremost, this is a general rule - if we discover requests or usage that looks like a problem we'll want to know what's going on;
- CloudFlare sits in front of all the public communities and will take issue with anything that looks spammy - we have only limited control over exactly what it blocks or not;
- Please be a considerate scraper: consider that our poor server has to render every bit of content you request[1] and limit yourself to no more than a couple of requests per second at absolute maximum;
- Our frontend HTML can (and does) change frequently and without notice - be aware scraping may break often and require maintenance.
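As a rough illustration of the caveats above, here is a minimal Python sketch of a considerate personal scraper. It is not an official client or a recommended implementation. The `USER_AGENT` string, the `Throttle` and `TitleParser` classes, and the one-second default interval are all assumptions made for the example. The sketch identifies itself, spaces requests out, and fails loudly when the markup it expects is missing. It only reads the standard `<title>` element, since any site-specific selectors would be guesses that may break without notice.

```python
import time
import urllib.request
from html.parser import HTMLParser

# Assumption: a descriptive User-Agent with contact details so that unusual
# traffic can be traced back to you. The name and address are placeholders.
USER_AGENT = "my-codidact-cli/0.1 (personal use; contact: you@example.com)"


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_title(html):
    """Return the page title, or raise if the expected markup is missing."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    if not parser.title or not parser.title.strip():
        raise ValueError("no <title> found; the page markup may have changed")
    return parser.title.strip()


def build_request(url):
    """Attach the identifying User-Agent to every request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, throttle):
    """Wait for the throttle, then download one page as text."""
    throttle.wait()
    with urllib.request.urlopen(build_request(url), timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

A single shared `Throttle()` passed to every `fetch(url, throttle)` call keeps the whole tool at or below one request per second, which stays inside the limit given above. Raising an error on missing markup, rather than returning an empty result, makes a frontend change show up immediately instead of silently producing bad data.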
However, I'd also encourage you (and anyone considering this kind of thing) to get involved - contribute back as much as you can in return, if you're able to:
- If there are ways in which you'd like to see things (posts, notifications) that aren't yet possible, create or support feature requests for them.
- If you can scrape frontend code you probably have a decent idea of how to write it: consider sending a PR to improve the UI yourself.
- If you can write Ruby (and if you can write Python, Ruby's not hard to pick up), you might be able to help improve backend code or implement what you're looking for yourself.
Codidact Collab is available if you'd like to contribute but don't know how to get started; if you're looking to contribute but aren't sure what to work on, there are some easier issues to start with and issues we're looking for contributors for on the repo.
[1] Yes, yes, API better etc etc. It's on the list.