We've recently been having some trouble with the uptime of the Codidact Network, with the entire network going down and becoming unavailable for stretches of time. The most egregious case was on November 17–18, 2023, when the network was down for roughly 15 hours. Since then, we've taken steps to make the network more reliable, and added some backup tools to help keep us online. In the interests of transparency, here's an explanation of what we've done to keep things stable.
The network runs on a single AWS EC2 t3a.small instance, which has 2GB of memory. That instance hosts both the Rails server and Redis.
On November 1, 2023, after a short series of outages, we lowered the maximum memory limit in the Redis configuration from 800MB to 400MB, to avoid maxing out the memory on the EC2 instance. At the same time, we set up a daily restart of the server process and the Redis process (at 04:00 UTC).
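As a rough illustration of the daily restart schedule (not the actual mechanism used on the server, which this post doesn't describe), here's a minimal Python sketch that works out when the next 04:00 UTC restart falls. The function name `next_restart` is hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

# Daily restart time from the post: 04:00 UTC.
RESTART_AT = time(4, 0, tzinfo=timezone.utc)


def next_restart(now: datetime) -> datetime:
    """Return the next 04:00 UTC restart strictly after `now`.

    `now` should be timezone-aware; it is converted to UTC first.
    """
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), RESTART_AT)
    if candidate <= now_utc:
        # Today's slot has already passed, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate
```

In practice a schedule like this would normally be handed to a scheduler such as cron or a systemd timer rather than computed by hand.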
After more outages on November 6, 7, and 8, we investigated further into what was causing them. We found that stable memory usage was around 1.45GB. On an instance with 2GB of memory, that leaves only about 550MB of headroom, so a usage spike can exhaust the available memory and cause an outage.
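To make the headroom arithmetic concrete, here's a small Python sketch. The instance size and stable usage come from the figures above (treating 2GB as 2000MB for simplicity); the spike sizes in the usage note are made-up examples, not measurements.

```python
# Figures from the post, in MB. 2GB is treated as 2000MB for simplicity.
INSTANCE_MEMORY_MB = 2000
STABLE_USAGE_MB = 1450


def headroom_mb(total_mb: int, used_mb: int) -> int:
    """Memory left before the instance runs out."""
    return total_mb - used_mb


def spike_exhausts_memory(total_mb: int, used_mb: int, spike_mb: int) -> bool:
    """True if a spike of `spike_mb` on top of normal usage fills the instance."""
    return used_mb + spike_mb >= total_mb


print(headroom_mb(INSTANCE_MEMORY_MB, STABLE_USAGE_MB))  # 550
```

With these numbers, a hypothetical 300MB spike would fit, while a hypothetical 600MB spike would not.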
Over the next few days, we discussed the possibilities and logistics of increasing the size of our EC2 instance.
On November 17–18, a CPU spike took down the server. It took over fifteen hours for someone with appropriate access to come online and manually reboot the server through AWS.
On November 18, automated monitoring tools were added to the AWS instance, including an automatic reboot if too many resources (memory or CPU) were being used.
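The exact monitoring setup isn't described here, but the general idea of "reboot when memory or CPU stays too high" can be sketched in a few lines of Python. Everything in this example is an assumption for illustration: the `Sample` type, the `should_reboot` function, the 90% limits, and the three-sample window are hypothetical and are not the values configured on the AWS instance.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One resource reading, as percentages of capacity."""
    cpu_percent: float
    memory_percent: float


def should_reboot(
    samples: Sequence[Sample],
    cpu_limit: float = 90.0,      # hypothetical threshold
    memory_limit: float = 90.0,   # hypothetical threshold
    consecutive: int = 3,         # hypothetical window size
) -> bool:
    """Reboot only if CPU or memory stayed at or above its limit
    for the last `consecutive` samples, so brief blips are ignored."""
    if len(samples) < consecutive:
        return False
    recent = samples[-consecutive:]
    cpu_breach = all(s.cpu_percent >= cpu_limit for s in recent)
    memory_breach = all(s.memory_percent >= memory_limit for s in recent)
    return cpu_breach or memory_breach
```

Requiring several consecutive high readings is one common way to avoid rebooting on a momentary spike that is still within normal range.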
On November 19, the automated monitoring tools alerted us to some minor CPU spikes caused by the Rails server (although still in normal range), confirming that the tools were working as expected.
On November 25, the automated tools caught a memory spike and automatically rebooted the system, preventing a full outage. The logs also revealed what had caused the spike.
The spike was caused by a request to a specific path. We're now aware of the issue with that path and are working to fix it.
Since then, we've been much more stable; you can view the uptime stats on status.codidact.com, which is running Atlassian's Statuspage monitoring tool. Hopefully, the new tools and improved memory usage will keep us steady. If you notice an outage, please report it by dropping into Discord and pinging @Admin in either the Codidact development server or the Codidact Communities server; it'll send a notification to the right people.
Almost all of the work on this was done by ArtOfCode; a particular thank you is in order for the effort put into investigating the issue and working to fix it.
Thank you all for your patience as we investigated the downtime issues and worked to fix them.