Recent outages and site reliability improvements

We've recently been having some trouble with the uptime of the Codidact Network, with the entire network going down and becoming unavailable for stretches of time. The most egregious case was on November 17–18, 2023, when the network was down for roughly 15 hours. Since then, we've taken steps to make the network more reliable and added some backup tools to help keep us online. In the interest of transparency, here's an explanation of what we've done to keep things stable.

Background

The network runs on a single AWS EC2 t3a.small instance, which has 2 GB of memory. Both the Rails server and Redis are hosted on that instance.

Timeline

On November 1, 2023, after a short series of outages, we lowered the maximum memory allotment in the Redis configuration on the server from 800 MB to 400 MB, to avoid maxing out the memory on the EC2 instance. At the same time, we set up a daily restart of the server process and the Redis process (at 04:00 UTC).

After further outages on November 6, 7, and 8, we investigated the cause more closely. We found that stable memory usage was around 1.45 GB; on an instance with 2 GB of memory, that leaves little headroom, so a spike in usage can exhaust the memory and cause an outage.
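To make the headroom concrete, here is a minimal sketch of the arithmetic using only the two figures quoted above. The helper name is hypothetical and is not part of any tooling we run.

```python
# Figures taken from the post; the helper name is hypothetical.
INSTANCE_MEMORY_GB = 2.0   # t3a.small
STABLE_USAGE_GB = 1.45     # observed steady-state usage


def headroom_gb(total_gb: float, used_gb: float) -> float:
    """Return the memory left over before the instance is exhausted."""
    return total_gb - used_gb


free = headroom_gb(INSTANCE_MEMORY_GB, STABLE_USAGE_GB)
usage_pct = STABLE_USAGE_GB / INSTANCE_MEMORY_GB * 100

print(f"steady-state usage: {usage_pct:.1f}%")
print(f"headroom: {free:.2f} GB")
```

In other words, any spike larger than roughly half a gigabyte is enough to run the instance out of memory.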

Over the next few days, we discussed the possibilities and logistics of increasing the size of our EC2 instance.

On November 17–18, a CPU spike took down the server. It took over fifteen hours before someone with the appropriate access came online and manually rebooted the server through AWS.

On November 18, automated monitoring tools were added to the AWS instance, including an automatic reboot if too many resources (memory or CPU) were being used.
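As an illustration of the kind of rule such monitoring applies, here is a minimal sketch of a threshold check. It is not our actual AWS configuration: the thresholds, the number of consecutive breaches, and all names below are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    cpu_pct: float  # CPU utilisation, 0-100
    mem_pct: float  # memory utilisation, 0-100


# Hypothetical thresholds; the post does not state the real values.
CPU_LIMIT_PCT = 90.0
MEM_LIMIT_PCT = 90.0
CONSECUTIVE_BREACHES = 3


def should_reboot(samples: list[Sample]) -> bool:
    """Return True when the most recent samples all exceed a CPU or memory limit."""
    if len(samples) < CONSECUTIVE_BREACHES:
        return False
    recent = samples[-CONSECUTIVE_BREACHES:]
    return all(
        s.cpu_pct > CPU_LIMIT_PCT or s.mem_pct > MEM_LIMIT_PCT for s in recent
    )
```

Requiring several consecutive breaches, rather than a single one, is a common way to avoid rebooting on a brief, harmless spike.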

On November 19, the automated monitoring tools alerted us to some minor CPU spikes caused by the Rails server (although still within the normal range), confirming that the tools were working as expected.

On November 25, the automated tools caught a memory spike and automatically rebooted the system, preventing a full outage. The logs also revealed what had caused the spike: a request to a specific path. We're now aware of the issue with that path and are working to fix it.

On November 26, on the advice of Andreas, we installed jemalloc on the system to help reduce the memory used by Rails. This brought us from ~60% memory usage down to ~44%.
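For a sense of scale, here is a short worked calculation of that reduction. It assumes the percentages refer to the instance's 2 GB total, which is an assumption for this example and not something stated above.

```python
# Approximate figures from the post.
BEFORE_PCT = 60.0
AFTER_PCT = 44.0
# Assumption: the percentages are of the instance's 2 GB of memory.
INSTANCE_MEMORY_GB = 2.0

points_saved = BEFORE_PCT - AFTER_PCT                 # percentage points
relative_saving_pct = points_saved / BEFORE_PCT * 100  # relative to prior usage
gb_saved = points_saved / 100 * INSTANCE_MEMORY_GB

print(f"{points_saved:.0f} percentage points saved")
print(f"{relative_saving_pct:.1f}% less memory than before")
print(f"about {gb_saved:.2f} GB freed")
```

Under that assumption, the change frees roughly a third of a gigabyte, which is a meaningful share of the headroom on an instance this size.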


Since then, we've been much more stable; you can view the uptime stats on status.codidact.com, which runs on Atlassian's Statuspage monitoring tool. Hopefully, the new tools and improved memory usage will keep us steady. If you notice an outage, please report it by dropping into Discord and pinging @Admin in either the Codidact development server or the Codidact Communities server; that will send a notification to the right people.

Almost all of the work on this was done by ArtOfCode; a particular thank you is in order for the effort put into investigating the issue and working to fix it.

Thank you all for your patience as we investigated the downtime issues and worked to fix them.
