The Abilities rollout (or, how not to do deployments)
We recently deployed the Abilities system, replacing our old and rusty reputation-based privileges system with a shiny new granular, activity-based system for granting abilities. We spent months working on the system and we’re pleased to finally have it up and working – but the process of getting there was rather more rocky than anticipated.
A granular system for granting abilities, based on what you actually participate in, has been on our TODO list since very early on. The software that we’re running on was developed from an earlier, more primitive system of mine, which – much like Stack Exchange – used voting and reputation to grant privileges.
The major problem with granting privileges based on reputation is that you’re granting all privileges based on reputation – you’re using one number to represent a person’s ability to do a range of different things, which all require different knowledge and skills. Someone who consistently provides good answers will have a high reputation score and should almost certainly be trusted with the ability to edit posts themselves, but why should it mean that they’re capable of knowing when a question should be closed, or when comments have gone off the rails and should be locked?
Our new system is designed to address that. Each ability is granted based on your overall performance at a related activity – the ability to edit, for instance, requires you to have had a good proportion of your edit suggestions approved. You can earn the ability to vote to put posts on hold, but the system requires proof that you know how posting and flagging work beforehand. The thresholds for exactly how many and what proportion of each contribution you need are all configurable, so different communities can set them at different levels – all part of our vision for letting communities decide for themselves how their sites should work.
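As a rough illustration of the idea (not the actual Codidact implementation), a per-ability check against configurable thresholds might look something like the sketch below. The names `Threshold`, `qualifies`, and `SITE_THRESHOLDS`, and the numbers used, are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical per-ability requirement; each community could set its own."""
    min_count: int    # how many contributions of this kind are needed
    min_ratio: float  # what proportion of them must have been successful


def qualifies(successful: int, total: int, threshold: Threshold) -> bool:
    """Return True if a user's track record at one activity meets the threshold."""
    if total == 0 or total < threshold.min_count:
        return False
    return successful / total >= threshold.min_ratio


# Assumed example configuration for one site; the values are made up.
SITE_THRESHOLDS = {
    "edit_posts": Threshold(min_count=10, min_ratio=0.8),
}

# A user with 9 of 10 edit suggestions approved would meet this example threshold.
can_edit = qualifies(9, 10, SITE_THRESHOLDS["edit_posts"])
```

The point of the sketch is that each ability looks only at the activity it relates to, rather than at a single site-wide number.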
The code for the system is the brainchild of luap42, our Tech Lead, who did the majority of the programming work for it. The pull request was initially created back in August, and went back in for some more changes after an initial round of review. Overall, it took 9 cycles of code review and reviews from our documentation team, and multiple rounds of manual testing on a development server before it was ready – and that’s a good thing! Knowing that new code has been thoroughly reviewed, particularly for these kinds of major changes, helps to provide peace of mind that the deploy will run smoothly.
Speaking of which…
Everything was going smoothly up until this point, but the deploy was where we hit roadblocks. Most of our deploys are small changes and run very quickly – apply code changes, run any database migrations, copy in any new assets, and restart the app server. This one, on the other hand, was a big changeset – over 100 new commits, plus some major database migrations and some data migrations.
Initially, the plan was to run through the deploy much as normal – run a backup beforehand so we had a copy of the database before the migration, then go offline so that we could run the migrations without new data interrupting the process, allocate initial abilities, then come back online again and celebrate.
It didn’t go quite that easily. We’d initially planned for an hour or two of downtime while the migrations ran, but it turned out to take much longer than that – long enough that it was impractical to run them on the live server. Either we would have had to stay offline for 24 hours or more while the migrations ran, or we would have had to run them with the sites online, which could potentially corrupt existing data. So, after a few hours, we decided we would have to roll back to the backup, come back online, and complete the upgrade another way.
Part of the reason it was taking so long was the server: it’s small. It runs production well enough, but doing any heavy operations like this doesn’t go down so well. It’s an AWS EC2 t2.small – which gives us 1 vCPU and 2GB of RAM – not a lot. Fortunately, my local development environment has rather more than that – so, with some help from manassehkatz to configure MySQL for the upgrade, we ran the migrations on my local environment on the backup copy of the database. This still took almost 24 hours.
Part of the reason it took so long is that we were adding columns and indexes to the Posts table, which (unsurprisingly) contains all the posts – it’s the biggest table both in terms of complexity and raw size, and adding to it is always slow. The migrations did, eventually, run, and we had an upgraded copy of the database that could be transferred back to the server.
Copying the database back to the server was a non-event: we were offline for 10 minutes while the database copied over and we reconfigured the app to access it. More of an issue: we’d been online while the upgrade was working locally, so there was new old data in the old database that wasn’t in the old new version of the new database but had to be in the new version of the new old database. Make sense? Good. In simpler words, folks had still been posting and voting and commenting and all while the upgrade was running, and we had to make sure that new content was reflected in the upgraded database. I wrote a simple script to copy everything across, ran it, and we came back online.
Except… thanks to the peculiarities of databases, only some things copied correctly, others didn’t copy but said they did, and others copied twice. Or more. The script also didn’t account for things that had changed but that weren’t new, such as edits to user profiles or changes in post score.
We did eventually manage to track down everything that was missing (both new and changed content) and get it copied over, but between the unexpectedly long downtime the first time round, having to have a second round of downtime, and then missing content on the return, there was some disruption.
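With hindsight, one way to make a catch-up copy like this safer is to make it idempotent: select rows by when they last changed rather than only by whether they are new, and write them with an upsert keyed on the primary key, so that re-running the script can't create duplicates. The sketch below is not the script we actually ran. It uses SQLite from Python's standard library so that it is self-contained, and the table layout, column names, and the `sync_changes` and `make_db` helpers are assumptions made for illustration. On MySQL, the equivalent statement would be `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory database with a simplified, hypothetical posts table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE posts ("
        "id INTEGER PRIMARY KEY, body TEXT, score INTEGER, updated_at TEXT)"
    )
    db.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", rows)
    db.commit()
    return db


def sync_changes(old_db, new_db, cutoff):
    """Copy posts created or changed after `cutoff` from old_db into new_db.

    Rows are matched on their primary key, so running this twice leaves
    new_db in the same state as running it once.
    """
    rows = old_db.execute(
        "SELECT id, body, score, updated_at FROM posts WHERE updated_at > ?",
        (cutoff,),
    ).fetchall()
    with new_db:  # one transaction: either every row is applied or none is
        new_db.executemany(
            """
            INSERT INTO posts (id, body, score, updated_at)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                body = excluded.body,
                score = excluded.score,
                updated_at = excluded.updated_at
            """,
            rows,
        )
    return len(rows)


# The old database kept receiving activity after the backup was taken.
old = make_db([
    (1, "first post", 3, "2020-12-01T10:00"),   # score changed after the backup
    (2, "second post", 1, "2020-12-02T09:00"),  # entirely new post
])
# The upgraded database only knows what was in the backup.
new = make_db([(1, "first post", 1, "2020-11-30T08:00")])

sync_changes(old, new, cutoff="2020-11-30T12:00")
sync_changes(old, new, cutoff="2020-11-30T12:00")  # safe to repeat
```

This covers both new rows and changed ones, provided every table records a reliable last-modified timestamp, which is itself an assumption that would need checking table by table.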
Part of this is down to the nature of what we’re doing. Up until this point, everything has been running on private contributions, including our servers. That means we don’t have a dedicated development or test server, so we couldn’t test the upgrade beforehand to see where the pressure points were. That in turn meant that when we did get the upgrade done, we were doing it under more pressure than we expected, and when it was done we hadn’t considered or planned for having content missing, because we’d expected to run the upgrade on the live server under the cover of downtime.
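The kind of rehearsal we were missing can be sketched quite simply: take a copy of the database, run the migration statements against the copy, and time them, so that slow steps show up before any downtime is scheduled. The example below is a minimal illustration using SQLite from Python's standard library, not our MySQL setup. The `rehearse_migration` helper, the table, and the migration statements are hypothetical.

```python
import sqlite3
import time


def rehearse_migration(live, statements):
    """Run migration statements against a scratch copy and time them.

    The live connection is only read from; all changes happen on the copy.
    Returns the scratch connection and the elapsed time in seconds.
    """
    scratch = sqlite3.connect(":memory:")
    live.backup(scratch)  # copy the whole database into the scratch connection

    started = time.perf_counter()
    for statement in statements:
        scratch.execute(statement)
    scratch.commit()
    elapsed = time.perf_counter() - started
    return scratch, elapsed


# A stand-in for the production database, with made-up contents.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
live.executemany(
    "INSERT INTO posts (body) VALUES (?)",
    [("post %d" % i,) for i in range(1000)],
)
live.commit()

# Hypothetical migration: add a column and an index to the posts table.
migration = [
    "ALTER TABLE posts ADD COLUMN category_id INTEGER",
    "CREATE INDEX index_posts_on_category_id ON posts (category_id)",
]

scratch, seconds = rehearse_migration(live, migration)
print(f"rehearsal took {seconds:.3f}s")
```

The timing from a rehearsal only transfers to production if the scratch machine is comparable to the live one, so on a much faster machine it is a lower bound rather than an estimate.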
In good news, we recently incorporated as a non-profit organisation, so we can start looking to move ownership of servers to the organisation and, hopefully, look to set up development or test environments to test things like this on. We’re also rather more circumspect about whether it’s really necessary to add to the Posts table now. We also have to acknowledge that the big upgrades are where it’s likely to go wrong, and hence where we need the most planning – great, well-tested code is useless if you can’t deploy it!