Incident postmortem - 4th October 2024
## Incident

At 00:49 UTC on the 4th October 2024, Codidact communities unexpectedly became unavailable from the web. This outage lasted for 15 hours and 30 minutes - most of that time was spent waiting for someone with access to become available, and about an hour was spent investigating and remediating the causes of the incident. Codidact communities were returned to normal service at 16:19 UTC.

## Root cause

The ultimate cause of the incident was a Cloudflare configuration that left the QPixel web service (the software that serves web requests for Codidact communities) unable to communicate at startup with our main domain, codidact.com, to fetch a list of communities.

It’s not clear whether this was caused by a change we made within Cloudflare’s systems that had a delayed effect, or by a configuration change pushed by Cloudflare that resulted in the recategorization of this specific request. It’s also not clear why the incident started when it did, although the delay is likely to be down to a scheduled service restart - as the request in question is only sent once at service startup, it would only have started failing once the service was restarted.

## Response

We responded to the incident at 15:34 UTC. This was the first time that someone with appropriate access was available - as a small, volunteer, non-profit team, we don’t have the luxury of round-the-clock incident response teams.

Initial investigation revealed that both the QPixel web service and our reverse proxy, nginx, were reporting as running and healthy. This prompted a preliminary look at Cloudflare as a potential cause, but we identified no obvious issues there. It would later turn out that we had been looking in the wrong places on this first pass.

We turned back to our own servers and began digging into logs, which showed an error between the reverse proxy and our web service - the service units were both reporting up, but nginx wasn’t able to connect to QPixel. Requests were reaching our origin server correctly, but nginx wasn’t able to forward them on to be served.

We shut down the QPixel service unit to take manual control of it and gain more debugging information. There was an error message, but it was unhelpful in solving the issue - it was the generic error that our web framework produces when it’s not able to start correctly. However, we were then able to dig into the Rails request log, which revealed a significant number of failing outbound requests to codidact.com. These requests repeated every few seconds, which lined up with the restart timer for the service unit.

Knowing which request was failing, we turned back to Cloudflare and looked into analytics and firewall logs, which finally led us to the cause of the failing requests. We found the requests, saw that they were being actively blocked by Cloudflare, and traced which firewall rule was responsible. As it turns out, this was a bot traffic detection & blocking rule, so perhaps working as intended, if unhelpfully.

## Remediation

We changed the startup procedure for the web service to allow the request to codidact.com to fail without causing the process to exit. This was the quickest available solution at the time, but it does have some ongoing impact that we’ll need to resolve - the request still fails, it just doesn’t brick the service any more.

## Learning & Next Steps

There are a number of takeaways from this incident:

* `systemd` service units report as “up” even if they’re stuck in a restart loop - check them!
* There is further remedial work to do in allowing the request to codidact.com through Cloudflare’s firewall.
  At the time of writing (21st October), this work [has been completed](https://github.com/codidact/qpixel/pull/1424) but is awaiting deployment.
* There is a wider discussion to be started among the project team around the value of Cloudflare’s bot detection & blocking functions. This is not the first issue we’ve had with it, but it does also deal with a fair volume of traffic that’s a nuisance at best and a threat at worst, so turning it off immediately is not the solution here.

Although it’s not something from this incident specifically, it’s always worth mentioning that we’re a small team of volunteers working for a non-profit organisation to try to make the web a better place. The delay in responding to this incident was down to that nature of the team, and we’re always looking to grow and expand. If you’re interested in contributing, look at our [open GitHub issues](https://github.com/codidact/qpixel/issues?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen) and [join us in Discord](https://discord.gg/WZ7aTst) if you’re looking for some help.
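
For anyone wanting to act on the first takeaway above, here is a minimal sketch of one way to spot a restart loop. It is not something we run in production: it assumes a systemd version that exposes the `NRestarts` property on service units, the unit name `qpixel.service` is hypothetical, and the threshold of 3 restarts is an arbitrary choice for illustration.

```python
import subprocess
from typing import Dict


def parse_show_output(text: str) -> Dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def looks_like_restart_loop(props: Dict[str, str], max_restarts: int = 3) -> bool:
    """Heuristic only: flag a unit that is waiting to auto-restart, or that
    systemd has already restarted more than `max_restarts` times.
    The threshold is an assumption, not a recommended value."""
    restarts = int(props.get("NRestarts", "0") or 0)
    return props.get("SubState") == "auto-restart" or restarts > max_restarts


def unit_properties(unit: str) -> Dict[str, str]:
    """Ask systemd for the properties of `unit` (requires systemctl)."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState,SubState,NRestarts"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_show_output(result.stdout)


# Made-up sample output: the unit looks healthy at this instant,
# but the restart counter gives the loop away.
sample = "ActiveState=active\nSubState=running\nNRestarts=412\n"
print(looks_like_restart_loop(parse_show_output(sample)))
```

On a real host, `looks_like_restart_loop(unit_properties("qpixel.service"))` would run the same check against a live unit; the sample string above only exists so the sketch can be tried without systemd.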
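
The idea behind the change described under Remediation can be pictured with a short sketch. The actual change lives in QPixel’s Rails code; the Python below is only an illustration of the pattern, and `load_communities_at_startup`, `fetch_communities`, `blocked_request` and the `fallback` parameter are hypothetical names that do not come from QPixel.

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger("startup")


def load_communities_at_startup(
    fetch_communities: Callable[[], List[str]],
    fallback: Optional[List[str]] = None,
) -> List[str]:
    """Try the startup request once; if it fails, log it and carry on.

    `fetch_communities` stands in for the request to the main domain.
    All names here are hypothetical and exist only for illustration.
    """
    try:
        return fetch_communities()
    except Exception as exc:  # deliberately broad: startup must not exit here
        logger.warning("Community list request failed, continuing: %s", exc)
        return list(fallback or [])


def blocked_request() -> List[str]:
    # Simulates the request being rejected upstream, e.g. by a firewall rule.
    raise ConnectionError("request blocked")


# The process keeps going with an empty list instead of exiting.
communities = load_communities_at_startup(blocked_request)
```

This trades a hard failure for a degraded state, which matches the caveat under Remediation: the request still fails, so whatever it would have fetched stays missing until the underlying block is resolved.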