Post History

Q&A Is it okay to scrape Codidact for personal tools?

I'd like to use my own tools for browsing Codidact. Examples include: Summarize activity for a proposal I'm interested in, similar to what's described in https://meta.codidact.com/posts/289288/2...

1 answer · posted 2y ago by matthewsnyder‭ · last activity 2y ago by ArtOfCode‭

#2: Post edited by matthewsnyder · 2023-08-08T18:16:02Z (almost 2 years ago)

I'd like to use my own tools for browsing Codidact. Examples include:

* Summarize activity for a proposal I'm interested in, similar to what's described in https://meta.codidact.com/posts/289288/289291#answer-289291
* Summarize latest posts in a way that the site UI doesn't yet support, such as collecting recent posts across multiple sites
* Keep track of responses to specific posts in a way that the built-in notifications don't support
* Create a CLI tool so I can use Codidact without leaving my terminal

There is [currently no API](https://meta.codidact.com/posts/281407). If there were one, I would of course use it, and it would put much less burden on CD than scraping. However, I understand that it may take a lot of time and effort for the CD devs to implement an API. Instead of waiting for that, I'd like to write my own tools that scrape the HTML.

A related approach is reverse engineering internal API calls based on what the browser is doing.

Is this okay to do? I understand that I should not expect any kind of support from the devs for this. That's okay, I can figure it out myself. My main question is: will the CD devs get very mad when they see my Python user agent in the access logs?

To be clear, I am asking only about very limited scraping here, with usage similar to that of an active human user (potentially less, because the scraper wouldn't bother downloading many resources like scripts and images). I am not asking about:

1. Mass-downloading non-trivial sections of the site all at once
   * I assume downloading a couple dozen posts is not harmful, though
2. Making so many requests that I am effectively doing a DoS attack
3. Creating my own website, with monetization all going to me, which actually just pulls the content from CD
   * Exception: An alternate Web UI on localhost, intended for a single user, is probably harmless, no?
4. Training AI models on Codidact content
   * Possible exception: Training a small, local, per-user model to filter and rank posts based on that user's preferences, while respecting the constraints imposed by 1 and 2 and without any intent to market or distribute the model to anyone but the primary user
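
For concreteness, here is a minimal sketch of what "usage similar to an active human user" could look like in Python, using only the standard library. The `Throttle` class, the `build_request` and `fetch` helpers, the `USER_AGENT` string, and any interval you pass in are my own illustrative assumptions; nothing here is prescribed or endorsed by Codidact.

```python
import time
import urllib.request
from typing import Callable, Optional

# Hypothetical identifier: a descriptive user agent lets the site admins see
# who is behind the requests and how to reach them.
USER_AGENT = "my-codidact-cli/0.1 (personal tool; contact: you@example.com)"


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Sleep if the last request was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


def build_request(url: str) -> urllib.request.Request:
    """Create a GET request that carries the descriptive user agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url: str, throttle: Throttle, timeout: float = 10.0) -> str:
    """Fetch one page as text, waiting first so requests stay spaced out."""
    throttle.wait()
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

A tool would create one shared `Throttle(min_interval=2.0)` (the two-second value is a guess, not a documented limit) and call `fetch(url, throttle)` for every page, so that all requests go through the same spacing regardless of which feature triggered them.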

#1: Initial revision by matthewsnyder · 2023-08-08T18:14:21Z (almost 2 years ago)

Is it okay to scrape Codidact for personal tools?

I'd like to use my own tools for browsing Codidact. Examples include:

* Summarize activity for a proposal I'm interested in, similar to what's described in https://meta.codidact.com/posts/289288/289291#answer-289291
* Summarize latest posts in a way that the site UI doesn't yet support, such as collecting recent posts across multiple sites
* Keep track of responses to specific posts in a way that the built-in notifications don't support
* Create a CLI tool so I can use Codidact without leaving my terminal

There is [currently no API](https://meta.codidact.com/posts/281407). If there were one, I would of course use it, and it would put much less burden on CD than scraping. However, I understand that it may take a lot of time and effort for the CD devs to implement an API. Instead of waiting for that, I'd like to write my own tools that scrape the HTML.

A related approach is reverse engineering internal API calls based on what the browser is doing.

Is this okay to do? I understand that I should not expect any kind of support from the devs for this. That's okay, I can figure it out myself. My main question is: will the CD devs get very mad when they see my Python user agent in the access logs?

To be clear, I am asking only about very limited scraping here, with usage similar to that of an active human user. I am not asking about:

1. Mass-downloading non-trivial sections of the site all at once
   * I assume downloading a couple dozen posts is not harmful, though
2. Making so many requests that I am effectively doing a DoS attack
3. Creating my own website, with monetization all going to me, which actually just pulls the content from CD
   * Exception: An alternate Web UI on localhost, intended for a single user, is probably harmless, no?
4. Training AI models on Codidact content
   * Possible exception: Training a small, local, per-user model to filter and rank posts based on that user's preferences, while respecting the constraints imposed by 1 and 2 and without any intent to market or distribute the model to anyone but the primary user
