Incident postmortem - 4th October 2024


Incident

At 00:49 UTC on the 4th October 2024, Codidact communities unexpectedly became unavailable from the web. This outage lasted for 15 hours and 30 minutes - most of that time was waiting for someone with access to become available, and about an hour was spent investigating and remediating the causes of the incident. Codidact communities were returned to normal service at 16:19 UTC.

Root cause

The ultimate cause of the incident was a Cloudflare configuration that resulted in the QPixel web service (the software that serves web requests for Codidact communities) being unable to communicate at startup with our main domain, codidact.com, to fetch a list of communities.
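For illustration, here’s a minimal sketch of the kind of startup fetch described above - the endpoint path, constant name, and error handling are assumptions for the sake of example, not QPixel’s actual code:

    # Sketch of a boot-time fetch that takes the whole process down when it
    # fails. It runs once at startup, which is why the outage only began
    # after a restart.
    require 'net/http'
    require 'json'

    uri = URI('https://codidact.com/communities.json') # hypothetical endpoint
    response = Net::HTTP.get_response(uri)

    # If Cloudflare intercepts the request (e.g. answering with a 403
    # challenge page), this raises during boot and the service never comes up.
    unless response.is_a?(Net::HTTPSuccess)
      raise "Community list fetch failed: #{response.code}"
    end

    COMMUNITIES = JSON.parse(response.body)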

It’s not clear whether this was caused by a change we made within Cloudflare’s systems that had a delayed effect, or by a configuration change pushed by Cloudflare that resulted in this specific request being recategorized. It’s also not clear why the incident started when it did, although the delay is most likely down to a scheduled service restart: the request in question is only sent once, at service startup, so it would only have started failing once the service was restarted.

Response

We responded to the incident at 15:34 UTC. This was the first point at which someone with appropriate access was available - as a small, volunteer, non-profit team, we don’t have the luxury of round-the-clock incident response.

Initial investigation revealed that both the QPixel web service and our reverse proxy, nginx, were reporting as running and healthy. This prompted a preliminary look at Cloudflare as a potential cause, but we identified no obvious issues there - it would later turn out that we were looking in the wrong places on this first pass. We turned back to our own servers and began digging into logs, which showed an error between the reverse proxy and our web service: both service units reported as up, but nginx wasn’t able to connect to QPixel. Requests were reaching our origin server correctly; nginx just couldn’t forward them on to be served.

We shut down the QPixel service unit and took manual control of it to gather more debugging information. There was an error message, but it was unhelpful - just the generic message our web framework produces when it isn’t able to start correctly. However, we were then able to dig into the Rails request log, which revealed a significant number of failing outbound requests to codidact.com. These requests repeated every few seconds, which lined up with the restart timer for the service unit.

Knowing which request was failing, we turned back to Cloudflare and looked into analytics and firewall logs, which finally led us to why the requests were failing. We found the requests, saw that Cloudflare was actively blocking them, and traced the block back to the firewall rule responsible. As it turns out, this was a bot traffic detection & blocking rule - so perhaps working as intended, if unhelpfully.

Remediation

We changed the startup procedure for the web service to allow the request to codidact.com to fail without causing the process to exit. This was the quickest available solution at the time, but does have some ongoing impact that we’ll need to resolve - the request still fails, it just doesn’t brick the service any more.
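In spirit, the change looks something like the following - a hedged sketch rather than the actual QPixel diff, with the method name and empty-list fallback assumed for illustration:

    # Sketch: let the startup fetch fail soft instead of exiting the process.
    def fetch_communities
      response = Net::HTTP.get_response(URI('https://codidact.com/communities.json'))
      raise "Unexpected response: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
      JSON.parse(response.body)
    rescue StandardError => e
      # Log and carry on with a fallback value rather than letting the
      # exception kill the boot - the request still fails, but the service
      # survives it.
      Rails.logger.warn("Community list fetch failed at startup: #{e.message}")
      []
    end

The trade-off, as noted above, is that a service booted this way runs without the fetched data until the blocked request itself is fixed.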

Learning & Next Steps

There are a number of takeaways from this incident:

  • systemd service units report as “up” even if they’re stuck in a restart loop - check them! (See the sketch after this list.)
  • There is further remedial work to do in allowing the request to codidact.com through Cloudflare’s firewall. At the time of writing (21st October), this work has been completed but is awaiting deployment.
  • There is a wider discussion to be started among the project team around the value of Cloudflare’s bot detection & blocking functions. This is not the first issue we’ve had with it, but it does also deal with a fair volume of traffic that’s a nuisance at best and a threat at worst, so turning it off immediately is not the solution here.
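
On the first of those points: “active (running)” in systemctl status only reflects the current process state, so a unit cycling through its restart timer can look healthy at a glance. One way to spot the loop - sketched here with an assumed unit name - is to ask systemd how many times it has auto-restarted the unit, via the NRestarts property:

    # Sketch: detect a restart loop via systemd's NRestarts counter.
    # 'qpixel.service' is an assumed unit name, not necessarily the real one.
    unit = 'qpixel.service'
    restarts = `systemctl show -p NRestarts --value #{unit}`.strip
    warn "#{unit} has auto-restarted #{restarts} times" if restarts.to_i > 0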

Although it’s not something from this incident specifically, it’s always worth mentioning that we’re a small team of volunteers working for a non-profit organisation to try to make the web a better place. The delay in responding to this incident was down to the nature of that team, and we’re always looking to grow and expand. If you’re interested in contributing, take a look at our open GitHub issues and join us on Discord if you’re looking for some help.
