Welcome to Codidact Meta!

Codidact Meta is the meta-discussion site for the Codidact community network and the Codidact software. Whether you have bug reports or feature requests, support questions or rule discussions that touch the whole network – this is the site for you.

Post History

80%
+6 −0
Q&A Is it okay to scrape Codidact for personal tools?

I'd like to use my own tools for browsing Codidact. Examples include: Summarize activity for a proposal I'm interested in, similar to what's described in https://meta.codidact.com/posts/289288/2...

1 answer  ·  posted 1y ago by matthewsnyder‭  ·  last activity 1y ago by ArtOfCode‭

#2: Post edited by matthewsnyder · 2023-08-08T18:16:02Z (over 1 year ago)
I'd like to use my own tools for browsing Codidact. Examples include:

* Summarize activity for a proposal I'm interested in, similar to what's described in https://meta.codidact.com/posts/289288/289291#answer-289291
* Summarize latest posts in a way that the site UI doesn't yet support, such as collecting recent posts across multiple sites
* Keep track of responses to specific posts in a way that the built-in notifications don't support
* Create a CLI tool so I can use Codidact without leaving my terminal

There is [currently no API](https://meta.codidact.com/posts/281407). If there were one, I would of course use it, and it would put much less burden on CD than scraping. However, I understand that it may take a lot of time and effort for CD devs to actually implement an API. Instead of waiting for that, I'd like to just write my own tools that scrape the HTML.

A related option is reverse engineering the internal API calls that the browser makes.

Is this okay to do? I understand that I should not expect any kind of support from the devs for this. That's okay, I can figure it out myself. My main question is: will the CD devs get very mad when they see my Python user agent in the access logs?

To be clear, I am asking only about very limited scraping here, which would have usage similar to an active human user (potentially less, because the scraper wouldn't bother downloading many resources like scripts and images). I am not asking about:

1. Mass-downloading non-trivial sections of the site all at once
   * I assume downloading a couple dozen posts is not harmful, though
2. Making so many requests that I am effectively doing a DoS attack
3. Creating my own website, with monetization all going to me, which actually just pulls the content from CD
   * Exception: An alternate Web UI on localhost, intended for a single user, is probably harmless, no?
4. Training AI models on Codidact content
   * Possible exception: Training a small, local, per-user model to filter and rank posts based on that user's preferences, while respecting the constraints imposed by 1 and 2 and without any intent to market or distribute the model to anyone but the primary user
#1: Initial revision by matthewsnyder · 2023-08-08T18:14:21Z (over 1 year ago)
Is it okay to scrape Codidact for personal tools?
I'd like to use my own tools for browsing Codidact. Examples include:

* Summarize activity for a proposal I'm interested in, similar to what's described in https://meta.codidact.com/posts/289288/289291#answer-289291
* Summarize latest posts in a way that the site UI doesn't yet support, such as collecting recent posts across multiple sites
* Keep track of responses to specific posts in a way that the built-in notifications don't support
* Create a CLI tool so I can use Codidact without leaving my terminal

There is [currently no API](https://meta.codidact.com/posts/281407). If there were one, I would of course use it, and it would put much less burden on CD than scraping. However, I understand that it may take a lot of time and effort for CD devs to actually implement an API. Instead of waiting for that, I'd like to just write my own tools that scrape the HTML.

A related option is reverse engineering the internal API calls that the browser makes.

Is this okay to do? I understand that I should not expect any kind of support from the devs for this. That's okay, I can figure it out myself. My main question is: will the CD devs get very mad when they see my Python user agent in the access logs?

To be clear, I am asking only about very limited scraping here, which would have usage similar to an active human user. I am not asking about:

1. Mass-downloading non-trivial sections of the site all at once
   * I assume downloading a couple dozen posts is not harmful, though
2. Making so many requests that I am effectively doing a DoS attack
3. Creating my own website, with monetization all going to me, which actually just pulls the content from CD
   * Exception: An alternate Web UI on localhost, intended for a single user, is probably harmless, no?
4. Training AI models on Codidact content
   * Possible exception: Training a small, local, per-user model to filter and rank posts based on that user's preferences, while respecting the constraints imposed by 1 and 2 and without any intent to market or distribute the model to anyone but the primary user
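
To make "usage similar to an active human user" concrete, here is a minimal sketch of the kind of fetcher I have in mind. It sends an identifiable User-Agent and waits a minimum interval between requests. The class name `PoliteFetcher`, the user agent string, and the 5-second interval are all my own assumptions for illustration, not anything Codidact has specified.

```python
import time
import urllib.request


class PoliteFetcher:
    """Fetch pages one at a time, with a minimum delay between requests."""

    def __init__(self, user_agent, min_interval=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        # user_agent should identify the tool and give a way to contact its owner.
        self.user_agent = user_agent
        # min_interval is an assumed value in seconds, not a documented limit.
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait_time(self):
        """Return how many seconds to wait before the next request is allowed."""
        if self._last_request is None:
            return 0.0
        elapsed = self._clock() - self._last_request
        return max(0.0, self.min_interval - elapsed)

    def build_request(self, url):
        """Build a request that carries the identifying User-Agent header."""
        return urllib.request.Request(url, headers={"User-Agent": self.user_agent})

    def fetch(self, url, timeout=30):
        """Wait if needed, then download the page and return its HTML as text."""
        delay = self.wait_time()
        if delay > 0:
            self._sleep(delay)
        self._last_request = self._clock()
        request = self.build_request(url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical user agent string; replace the contact details with your own.
    fetcher = PoliteFetcher("my-codidact-cli/0.1 (personal tool; you@example.com)")
    html = fetcher.fetch("https://meta.codidact.com/posts/281407")
    print(len(html), "characters downloaded")
```

Because every request goes through `fetch`, a bug in the calling code (say, an accidental loop) still cannot exceed one request per `min_interval` seconds, and the named user agent lets the devs see who is responsible if something does go wrong.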