Comments on How should a Codidact public API work?
How should a Codidact public API work?
I'm planning to add a public API to Codidact so people can make applications that read Codidact data automatically.
Features and purposes
I have my own purposes in mind for this[1], but there are many other potential uses. Before I start building this, I'd like to hear what features you would like the API to have, and what purposes you might use it for.
General thoughts
Even if you are not planning anything yet, I would still like to hear your thoughts on what might be useful for the future. I'm hoping this discussion will help avoid designing an API that is too specific, or that needs breaking changes later.
Whatever your thoughts, please share them. This could be any of the following or anything else you want to bring up:
- Ideas for projects you might make one day
- Thoughts on the structure of an API
- Things to bear in mind / avoid (I'm new to this, so things that are obvious to you may not be obvious to me)
- Preferred data format(s)
- What to include from the start and what can wait until later
- Limits, such as how many results to permit in a single response (like pagination) and how to handle requesting more
[1] I'll add my own personal API requirements in an answer. Please comment on the answer if there's a better approach I should consider.
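To make the point about limits concrete, here is a minimal sketch of offset-based pagination in Python. Everything in it is an assumption for illustration: `MAX_PAGE_SIZE`, `Page`, `paginate` and `fetch_all` are hypothetical names, and nothing here reflects an existing Codidact API.

```python
from dataclasses import dataclass
from typing import Optional

MAX_PAGE_SIZE = 100  # assumed cap for illustration, not a Codidact value


@dataclass
class Page:
    items: list
    next_offset: Optional[int]  # None when there is nothing more to fetch


def paginate(items, offset=0, limit=20):
    """Return one page of items plus the offset to use for the next request."""
    limit = max(1, min(limit, MAX_PAGE_SIZE))  # clamp to the allowed range
    chunk = items[offset:offset + limit]
    end = offset + len(chunk)
    next_offset = end if end < len(items) else None
    return Page(chunk, next_offset)


def fetch_all(items, limit=20):
    """Client side: follow next_offset until no more results are reported."""
    offset, collected = 0, []
    while offset is not None:
        page = paginate(items, offset, limit)
        collected.extend(page.items)
        offset = page.next_offset
    return collected
```

Clamping the limit on the server means a client asking for too many results still gets a valid, smaller page rather than an error. Whether that is preferable to rejecting the request is a design choice.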
Does access through an API comply with all possible licenses for questions and answers? If yes, shouldn't the licenses be included when this content is programmatically accessed, so that end-users know what they are allowed (not) to do with the retrieved content?
Good point. Thanks for raising this.
I don't know the best approach for this, but the options that come to mind are:
1. Include the licence with all licensed content (so questions, answers, articles, but not comments).
2. Only allow programmatically accessing content that does not require a licence (comments but not licensed content like questions, answers, articles; only metadata about licensed content and not the content itself).
3. Request by licence: every request for licensed content must specify exactly 1 licence, so all the results are implicitly under that licence (or the licence can be explicitly returned but only once per response rather than duplicated for every post in the response).
I'm interested to hear whether there are any other options, and any problems with the options I've suggested.
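As a rough illustration of the "request by licence" option, the sketch below returns the licence once per response instead of once per post. The `POSTS` data and the `posts_by_licence` function are hypothetical, and the licence names are only examples.

```python
# Hypothetical in-memory posts; the licence names are examples only.
POSTS = [
    {"id": 1, "licence": "CC BY-SA 4.0", "body": "First post"},
    {"id": 2, "licence": "CC0 1.0", "body": "Second post"},
    {"id": 3, "licence": "CC BY-SA 4.0", "body": "Third post"},
]


def posts_by_licence(licence):
    """Return posts under exactly one licence, stating the licence once."""
    matching = [
        {"id": p["id"], "body": p["body"]}
        for p in POSTS
        if p["licence"] == licence
    ]
    return {"licence": licence, "posts": matching}
```

A call such as `posts_by_licence("CC BY-SA 4.0")` would return only posts 1 and 3, with the licence named once at the top level of the response.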
I think if you're retrieving content we should include the license in the returned values, and if you're retrieving content metadata like scores or authors or tags, that's not licensed content so no need. Edits are an interesting case as they aren't explicitly licensed but carry the post license.
Since the licence isn't editable I guess it applies to all edits regardless of who does the editing.
So I guess we need to include the licence with any response that contains all or part of the content of any revision.
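A small sketch of that distinction, assuming hypothetical field names (`licence_url`, `revisions`) and hypothetical functions (`post_metadata`, `post_revisions`): metadata responses carry no licence, while any response containing revision text carries the licence of the post.

```python
# Hypothetical stored post; field names are assumptions for illustration.
POST = {
    "id": 42,
    "score": 7,
    "tags": ["api", "licensing"],
    "licence_url": "https://creativecommons.org/licenses/by-sa/4.0/",
    "revisions": [
        {"index": 0, "text": "Original text"},
        {"index": 1, "text": "Edited text"},
    ],
}


def post_metadata(post):
    """Metadata only (score, tags): no licensed content, so no licence."""
    return {"id": post["id"], "score": post["score"], "tags": post["tags"]}


def post_revisions(post):
    """Revision text is licensed content, so the post's licence goes with it."""
    return {
        "id": post["id"],
        "licence_url": post["licence_url"],
        "revisions": post["revisions"],
    }
```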
I'm not sure if I see a need not to include the license with all content; this can be as simple (and small) as a single byte, which is an ID of a predefined license, or an index into a table of licenses. So, this data is very small, and can be transmitted regardless of how the content is requested.
Is there any reason why one would prefer option 2?
Also, all edits are tied to a specific post, so no need to tie a license to them; they are already tied to a post, so if the license is needed from the point of an edit, that's just one extra step of indirection.
Simple idea:
class Post {
    long id;
    byte license;              // ID of a predefined license
    List<Comment> comments;
    List<Revision> revisions;
}

class Revision {
    long authorID;
    String text;
    List<String> tags;
    Date date;
    int index;                 // in case certain revisions are omitted
}

class Comment {
    long authorID;
    String text;
    Date date;
}

class Question extends Post {
    List<Post> answers;
}

Question getQuestion(long id, Map<String, Object> options);

(I likely have no idea what I'm talking about here).

options = { "answer-licenses": [1, 2] }
question = getQuestion(id, options);
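For anyone who wants to experiment, here is one possible runnable reading of the idea above in Python. It is only a sketch under assumptions: the `QUESTIONS` store is a hypothetical stand-in for the database, comments and tags are left out for brevity, and the meaning of `"answer-licenses"` (keep only answers under the listed licence IDs) is my interpretation of the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Revision:
    author_id: int
    text: str
    date: datetime
    index: int  # position in the full history, in case some are omitted


@dataclass
class Post:
    id: int
    license: int  # ID of a predefined licence
    revisions: list = field(default_factory=list)


@dataclass
class Question(Post):
    answers: list = field(default_factory=list)


# Hypothetical in-memory store standing in for the database.
QUESTIONS = {}


def get_question(question_id, options=None):
    """Return a copy of the question, keeping only answers whose licence ID
    is listed in options["answer-licenses"] (all answers if it is absent)."""
    options = options or {}
    question = QUESTIONS[question_id]
    allowed = options.get("answer-licenses")
    answers = [
        a for a in question.answers
        if allowed is None or a.license in allowed
    ]
    return Question(question.id, question.license, question.revisions, answers)
```

Because a revision is only reachable through its post, the licence for an edit is found by looking at the owning post, which matches the "one extra step of indirection" described above.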
Interesting thoughts. I get 2 questions from this:
1. In those places where a licence is required, how much of it is required?
   - a licence id?
   - a licence short name such as CC BY-SA 4.0?
   - a URL for the licence text available online?
   - the full licence text?
2. Where is a licence required?
   - every time any licensed content is included in a response?
   - only in responses that do not depend on a previous response that already provided the licence?
For example, if the revision history can only be requested using the list of revision ids, and those are only available from the post content response, which already included the licence, do we still need to include the licence again in the revision history response? That is, is the licence required per interaction (which may extend to several responses), or for every response that contains licensed content?
As for preferring option 2 of 3 from my first comment: No, I can't imagine anyone preferring that. I only included it as an approach that definitely doesn't cause any legal worries. If we can't agree on what is required for licensing, option 2 is still available.
I don’t think you need to address question 1 upfront, if you go with the idea of using an ID or index. Doesn’t Codidact only permit a predefined set of licenses anyway? So each time Codidact accepts a new license, assign a new ID to it, or add it to the list of licenses, and use its index.
If you ever need more information about a license, other than its ID/index, make a request to Codidact’s server (or bundle it in a library), and use that ID/index to retrieve all the attributes for that license.
For instance:
// dictionary if using IDs; an array if using indices
licenses = {
    436312: {
        "name": "CC BY-SA",
        "version": "4.0",
        "url": "https://.....",
        "text": "...",
    },
    592343: {
        // etc
    },
}
If the API requires downloading this on request, you could, for instance, specify which fields to include, with some included by default and others excluded by default.
licenses = get_licenses({ "url": true, "text": true })
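A minimal Python sketch of how such field selection could behave. The licence ID `1`, the `DEFAULT_FIELDS` choice, and the placeholder text are all assumptions for illustration, not actual Codidact data.

```python
# Hypothetical licence table; the ID and field values are placeholders.
LICENSES = {
    1: {
        "name": "CC BY-SA",
        "version": "4.0",
        "url": "https://creativecommons.org/licenses/by-sa/4.0/",
        "text": "(full licence text would go here)",
    },
}

# Assumed defaults: short fields are included, long ones are opt-in.
DEFAULT_FIELDS = {"name": True, "version": True, "url": False, "text": False}


def get_licenses(fields=None):
    """Return every licence, restricted to the requested fields."""
    wanted = {**DEFAULT_FIELDS, **(fields or {})}
    return {
        license_id: {k: v for k, v in attrs.items() if wanted.get(k, False)}
        for license_id, attrs in LICENSES.items()
    }
```

With these defaults, `get_licenses()` returns only the name and version, and `get_licenses({"url": True, "text": True})` adds the URL and full text.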
- Where is a license required?
  - every time any licensed content is included in a response?
    - No; document that certain content is under a license. (That said, isn't everything under a license? Does Codidact not have a license for its content?) Why would you want to process license data every single time you make these requests? All you need to ensure is that the license is available when required, meaning that the API user can always get hold of the license for the content they request. If that requires a new request to be made, that's fine.
I think the question of when to bundle license information, will depend too much on the overall design of the API. I suggest just postponing this concern until later.
I like this approach of only being required to make the licence available on request, rather than having to include it everywhere. I'd want to know what the legal situation is, and to what extent it varies by country, but my rough understanding of international copyright law is that if a licence is not included, the content is automatically copyright. That is, including the licence gives the recipient additional rights, for their benefit. In the absence of the licence, they must regard the content as copyright protected.
If we can confirm that this is the case, then the licence and its various licence metadata can just be another API endpoint, as you describe.
Possibly relevant aside:
My impression of the problem people had with SE retrospectively changing all content to CC BY-SA 4.0 is that they objected to SE adding a new licence that they did not have the legal authority to decide on.
I see it as very important that we never accidentally provide an incorrect licence for a post. The user provided a licence, and the user still owns the copyright. We can never provide the content under any other licence. (Technically there could be a licence that is legally compatible with all of the rights and constraints provided by the user's licence, but I wouldn't want to make such judgement calls.)
With this as background, the problem would be with adding an incorrect licence, not with omitting the correct licence from any particular response. The correct licence must be available, but it seems an API endpoint would cover that.
@trichoplax I am afraid that these licences are typically stricter than copyright law. E.g. from the CC BY-SA license (4a):
[...] You must include a copy of, or the Uniform Resource Identifier for, this License with every copy or phonorecord of the Work You distribute [or] publicly display [...].
@mrTsjolder My understanding is that this can be satisfied by first downloading all the licenses with one request, then letting subsequent requests just refer to one of the licenses already downloaded. Is that wrong to assume?
@Andreas witnessed the end of the world today: I am afraid I am not enough of a legal person to judge that. The main point I wanted to get across is that it will probably not suffice to provide the licenses only on request, as suggested by trichoplax.
It's good to hear that a URL is sufficient rather than the full licence text.
Even if it turns out that we need the URL in every response that contains licensed content, that doesn't seem too bad as it will be small in proportion to the length of most posts.
I'm not a legal expert, but "you must include the license with content" seems to be part of most licenses, so I'd prefer we do that. A URL is fine and sounds lightweight. Licenses can be added, so IDs shouldn't be baked into the code -- the URL is a unique identifier already, so we can use that.
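A short sketch of using the URL itself as the identifier, so that no numeric IDs are baked into client code. The `LICENCES_BY_URL` registry and the `with_licence` helper are hypothetical names, and the two entries are only examples of what a registry might hold.

```python
# Hypothetical registry keyed by licence URL, so no numeric IDs are hard-coded.
LICENCES_BY_URL = {
    "https://creativecommons.org/licenses/by-sa/4.0/": "CC BY-SA 4.0",
    "https://creativecommons.org/publicdomain/zero/1.0/": "CC0 1.0",
}


def with_licence(content, licence_url):
    """Wrap licensed content in a response that carries its licence URL."""
    if licence_url not in LICENCES_BY_URL:
        # Refuse rather than risk attaching an incorrect licence.
        raise ValueError(f"unknown licence URL: {licence_url}")
    return {"content": content, "licence_url": licence_url}
```

Adding a new licence would then only mean adding a registry entry. Existing responses stay valid because the URL they carry does not change.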
Are APIs usually stateful or stateless? The comment about sending the license with the post and thus not needing it with history entries surprised me because my mental model was stateless, but I don't have an actual good basis or informed opinion here.
In response to the stateful/stateless query:
My earlier comment was a contrived example imagining that we had a history API that could not be accessed without info only found in the response to the post API. I don't think this is realistic, but that's how I was imagining we could know that the request for history had already been preceded by a request for the post. It would not be a stateful API, there would just be no way to request history until you had found out the required ids from the request for the post.
I'm not suggesting we use such a model. It was deliberately restrictive for the purpose of finding out if there could be a way to avoid needing the licence to be included everywhere. Now that I understand that we can just include the licence URL I would rather just include it everywhere, and not make the API needlessly convoluted.
It was the thought of having to include the (often long) full text of each licence that had me looking for artificial ways to reduce how often.