
Welcome to Codidact Meta!

Codidact Meta is the meta-discussion site for the Codidact community network and the Codidact software. Whether you have bug reports or feature requests, support questions or rule discussions that touch the whole network – this is the site for you.

Comments on Is it okay to scrape Codidact for personal tools?

Post

Is it okay to scrape Codidact for personal tools?

+6
−0

I'd like to use my own tools for browsing Codidact. Examples include:

  • Summarize activity for a proposal I'm interested in, similar to what's described in https://meta.codidact.com/posts/289288/289291#answer-289291
  • Summarize latest posts in a way that the site UI doesn't yet support, such as collecting recent posts across multiple sites
  • Keep track of responses to specific posts in a way that the builtin notifications don't support
  • Create a CLI tool so I can use Codidact without leaving my terminal

There is currently no API. If there were one, I would of course use it, and it would put much less burden on CD than scraping. However, I understand that it may take a lot of time and effort for CD devs to actually implement an API. Instead of waiting for that, I'd like to just write my own tools that scrape the HTML.

A related approach is to reverse-engineer internal API calls based on what the browser is doing.

Is this okay to do? I understand that I should not expect any kind of support from the devs for this. That's okay, I can figure it out myself - my main question is, will the CD devs get very mad when they see my Python useragent in the access logs?

To be clear, I am asking only about very limited scraping here, which would have usage similar to an active human user (potentially less, because the scraper wouldn't bother downloading many resources like scripts and images). I am not asking about:

  1. Mass-downloading non-trivial sections of the site all at once
    • I assume downloading a couple dozen posts is not harmful, though
  2. Making so many requests that I am effectively doing a DoS attack
  3. Creating my own website, with monetization all going to me, which actually just pulls the content from CD
    • Exception: An alternate Web UI on localhost, intended for a single user, is probably harmless, no?
  4. Training AI models on Codidact content
    • Possible exception: Training a small, local, per-user model to filter and rank posts based on that user's preferences, while respecting the constraints imposed by 1 and 2 and without any intent to market or distribute the model to anyone but the primary user
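As a rough illustration of the "similar to an active human user" usage described above, here is a minimal sketch of a fetcher that spaces out its requests and identifies itself. Everything named here is an assumption for illustration only: the `PoliteFetcher` class, the 5-second minimum interval, and the user agent string are hypothetical, not anything Codidact has specified or approved. The `opener`, `clock`, and `sleep` parameters exist only so the pacing logic can be exercised without touching the network.

```python
import time
import urllib.request


class PoliteFetcher:
    """Fetch pages no faster than one request per `min_interval` seconds."""

    def __init__(self, min_interval=5.0,
                 user_agent="personal-codidact-reader/0.1 (hypothetical)",
                 opener=None, clock=time.monotonic, sleep=time.sleep):
        # min_interval and user_agent are illustrative defaults, not site policy.
        self.min_interval = min_interval
        self.user_agent = user_agent
        self._opener = opener or self._default_open
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the most recent request, if any

    def _default_open(self, url, headers):
        # Plain standard-library HTTP GET; returns the body as text.
        request = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")

    def fetch(self, url):
        # Wait until at least min_interval has passed since the last request.
        if self._last_request is not None:
            wait = self.min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        return self._opener(url, {"User-Agent": self.user_agent})
```

A tool built on this would call `PoliteFetcher().fetch(url)` for each page it needs; a descriptive user agent also makes it easier for whoever reads the access logs to tell a personal tool apart from abusive traffic.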

1 comment thread

Making an API (5 comments)
trichoplax‭ wrote over 1 year ago

As you may guess from the API post you linked to, I am very interested in implementing an API for Codidact. I'm also likely to have free time coming up in a couple of months.

I know exactly what I want out of an API, but I don't want to implement just what suits me, and then find it needs to change when other people's requirements become apparent. Ideally I'd like to know as much as possible about various different people's requirements in advance, and then agree on something that will work for all of them.

We can discuss this on Discord (link in the right hand panel), or maybe we should start a Meta discussion about what requirements people have.

matthewsnyder‭ wrote over 1 year ago

Sure, I'll take a look when I have a moment!

I think if there are multiple people interested in this, the logical thing is to have an adapter package: a library that presents like an API client, but behind the scenes does everything through scraping. All actual applications would import this lib and use it like a normal API client.

This allows the CD devs to see what API operations people are interested in, implement them in the backend, and let the client maintainers know that they can replace a bunch of their scraper code with simpler API endpoint requests.
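A minimal sketch of what such an adapter could look like, under stated assumptions: `CodidactClient`, `ScrapingBackend`, and `post_title` are hypothetical names, the `/posts/<id>` URL pattern is inferred from the link in the question, and reading the page's `<title>` element is only a stand-in for whatever real parsing would be needed. Applications talk to the client; the backend is the only part that knows about HTML.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class ScrapingBackend:
    """Hypothetical backend that answers client calls by scraping HTML."""

    def __init__(self, fetch_html, base_url="https://meta.codidact.com"):
        # fetch_html is any callable that takes a URL and returns HTML text.
        self._fetch_html = fetch_html
        self._base_url = base_url

    def post_title(self, post_id):
        # The /posts/<id> URL shape is an assumption based on existing links.
        html = self._fetch_html(f"{self._base_url}/posts/{post_id}")
        parser = _TitleParser()
        parser.feed(html)
        parser.close()
        return parser.title.strip()


class CodidactClient:
    """What applications import; it only delegates to a backend."""

    def __init__(self, backend):
        self._backend = backend

    def post_title(self, post_id):
        return self._backend.post_title(post_id)
```

If a real endpoint for an operation appeared later, a second backend exposing the same `post_title` method could be passed to `CodidactClient` instead, and the applications using the client would not need to change.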

trichoplax‭ wrote over 1 year ago

Interesting idea. Might not be necessary if we can just make the changes directly in the back end. Let me know if there's an advantage I'm overlooking.

I've been working on Codidact bugs and feature requests as practice to get used to the codebase, as part of working up to implementing an API. If people express what they'd like to see in an API I'm happy to try to implement it.

Monica Cellio‭ wrote over 1 year ago

I think a public discussion more visible than comments - about goals, requirements, basically what we want in an API - is a good next step. Then we can figure out an MVP that moves us forward without precluding things we'll want to add later.

trichoplax‭ wrote over 1 year ago

Meta discussion started: How should a Codidact public API work?