Note for .net players: if you’re interested in how the problems affected you as a player, you are better off reading the announcement thread on the .net forums: link
Yesterday, tribalwars.net experienced an issue that presented itself as all troops disappearing from certain villages. We’ve now managed to wrap everything up, but because of the complexity, finding the bug, fixing it and repairing the damage it caused took us most of the day, and we thought that you guys might be interested in hearing more about it!
12:00 – We had just finished the day’s planned updates to 8.5 on schedule. Besides tribalwars.net, we had also updated the Bosnian, Croatian, Finnish, Indonesian and US versions, and would be starting worlds. The day before, we had updated and started worlds for the Swiss and UK versions, which had gone relatively smoothly. We monitored everything for another hour or so, and then stepped out quickly for lunch, so as to be back on time for some meetings and the world starts at 14:00.
14:00 – On the way back from lunch, I got texted by a colleague “Major bug in TW, villages lost their troops!” We hurried to our seats, and discovered that players with population bonus villages with a population higher than 24,000 no longer had any troops in them.
Now believe me – there’s a lot of odd things that can go wrong with a game, and especially so if that game has had dozens of different developers working on it at different times over more than 9 years – but this was certainly an interesting one.
One of the first things we looked at was the collection of ready-made repair tools that we have accumulated over the years, for all of the little quirks that can pop up from time to time. We’ve got a few of those tools related to village population, so it was a good bet that one of them might be of some help.
14:15 – Jackpot – the repair tools helped, but not in the way we had intended. When checking our repair tool log, we discovered that some of the repair processes were running every hour. About a month ago, at around 23:00 one night, they had been activated to run hourly for the rest of that night to correct a small issue in an earlier version, but due to an oversight, they had not been deactivated again in the morning when the underlying issue was fixed. Normally this would not be a problem, but in this case, since the repair tool itself was the issue, it most certainly was.
15:00 – Finding the error in the repair process was not too difficult – 2 characters in the code had been incorrectly placed. The normal functionality of this particular population tool is to correct freak errors which can cause incorrect population numbers – to solve this, the tool kills troops (noblemen excluded) in a village until it is at the correct population value again. In this case, the incorrectly placed characters caused the tool to determine that any village with a population sum higher than 24,000 was an abomination that needed to be purged completely of all inhabitants. (More details can be found at the bottom of this post.)
The true problem was now that a good number of players had lost their troops to our system’s failure – completely unacceptable. This would be the subject of the next several hours of work.
Our system makes regular backups of the game’s state so that we are always prepared to correct errors when they inevitably occur. It was to these backups that we turned, to retrieve the troop values that the depopulated villages should have had. These backups needed to be made available for every world, and we had to write, test and rewrite some tools to detect which villages and players were affected, grab the information from the backup database, and apply it to the live database.
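To give a rough idea of what such a detect-and-restore tool does, here is a minimal Python sketch. It is not our actual tooling: the in-memory dictionaries stand in for the backup and live databases, and every name in it (`find_affected`, `restore_troops`, the `population` and `troops` fields) is made up for illustration.

```python
# Hypothetical sketch: plain dicts stand in for the backup and live databases.
THRESHOLD = 24_000   # population above which the faulty repair tool acted
PROTECTED = "snob"   # noblemen were not removed by the repair tool


def fighting_troops(troops):
    """Total number of units, ignoring the protected unit type."""
    return sum(count for unit, count in troops.items() if unit != PROTECTED)


def find_affected(live, backup):
    """Return IDs of villages that had troops in the backup but none live."""
    affected = []
    for village_id, snapshot in backup.items():
        current = live.get(village_id)
        if current is None:
            continue  # village no longer exists on the live world
        was_over = snapshot["population"] > THRESHOLD
        had_troops = fighting_troops(snapshot["troops"]) > 0
        now_empty = fighting_troops(current["troops"]) == 0
        if was_over and had_troops and now_empty:
            affected.append(village_id)
    return sorted(affected)


def restore_troops(live, backup, village_ids):
    """Copy the backed-up troop counts into the live data; return the count."""
    for village_id in village_ids:
        live[village_id]["troops"] = dict(backup[village_id]["troops"])
    return len(village_ids)
```

A real tool also has to decide what to do about troops that were recruited or lost between the backup and the incident, which this sketch ignores.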
Discussion about the optimal restoration method.
18:30 – After several hours of work, we had a solution we were sure of and were able to roll it out for all worlds on tribalwars.net, to restore everyone’s troops. That was definitely not the end of things, though: we had inconvenienced players for a good portion of the day. Work commenced on an additional tool for compensating the affected players, and pizza was ordered – it was going to be a long night. (Pizza is very important in any programming work.)
20:00 – Halfway through pizza, additional reports started coming in from players that had not actually received their troops back. After a bit of fiddling, it turned out that these were all coming from the same world: World 55. This confused us to no end, because the output from the fixing process was displaying that world as handled, and even gave us the number of players and villages that supposedly had been properly repopulated.
Cue frantic research. Was the reporting system somehow biased towards only informing us about W55, but the problem more widespread? Did we have more active players on that single world, which was skewing our data? (W55 does actually have more players than others of comparable age.) Did we somehow have our reports handling system set up incorrectly so that we were only seeing reports from W55? (Nope) Did W55 have anything special in its configuration? Start time? Features? (W55 has a higher unit speed than the norm, was the first world that started in 2011, and the second world after 7.0, so it contained, for example, militia.) Nothing from any of these areas of research revealed anything of relevance.
Villages were double- and triple-checked with the help of the moderation team. In the office, we passed player and village IDs back and forth to each other over instant messengers to compare them with the restoration script’s output and with the results of the detection process running on the backup or on the live world.
In the end, it turned out that the output saying that W55 had been handled was incorrectly reporting the W56 values for W55, and W56 looked like it had been omitted. Lamely, we solved it by running the troops restoration process again, for only W55.
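A mislabelled report like this is easier to catch when the script’s own summary is cross-checked against an independent count. The following Python sketch shows one way to do that. The function name, the world keys and the numbers are all hypothetical and are not taken from our restoration script.

```python
# Hypothetical sketch: compare the restoration report with independent counts.
def check_report(requested_worlds, report, detected_counts):
    """Return a list of human-readable discrepancies, one per world."""
    problems = []
    for world in requested_worlds:
        if world not in report:
            problems.append(f"{world}: missing from report")
        elif report[world] != detected_counts.get(world):
            problems.append(
                f"{world}: reported {report[world]}, "
                f"detected {detected_counts.get(world)}"
            )
    return problems


# Made-up numbers: W56's count shows up under W55, and W56 is missing.
issues = check_report(
    ["w55", "w56"],
    report={"w55": 120},
    detected_counts={"w55": 300, "w56": 120},
)
for line in issues:
    print(line)
```

An empty list means that every requested world appears in the report with the count the detection step expected.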
Output from restoration process.
22:00 – With all worlds fixed, the only thing left to do was run the compensation script, which would grant a certain amount of premium points to each affected player and notify them via mail – this had been finished separately, while the other research was still ongoing. We started the script, monitored for another 15 minutes, and called it a night.
There were a total of 25,040 affected villages, belonging to 2,890 players.
/Sirko and Robert
Additional info about the problem, for those who are interested:
The repair script for fixing villages with too high population numbers was the root of all of these problems. Normally it scans through all villages with a population higher than 24,000 (which is the default max population for a level 30 farm), calculates the real allowed maximum population and finally removes units from each such village until it is below the calculated maximum and back to normal.
The bug was in the calculation of the allowed maximum population. Normally the base population per level is calculated and several factors are applied. In this case, the farm level used in the calculation was much too high and outside the normal range of 0-30, which resulted in an overflow that in the end evaluated to 0. That means that all scanned villages (population bonus villages with more than 24,000 pop) got their units reduced until they reached the calculated allowed max – 0.
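For the curious, here is a minimal Python sketch of that logic with the two sanity checks that would have stopped this incident. It is not the actual repair script: the linear population formula, the unit population costs and all names are assumptions made for illustration, and population is simplified to count troops only.

```python
# Hypothetical sketch of the repair logic, not the real repair script.
DEFAULT_MAX_POP = 24_000   # default max population for a level 30 farm
MAX_FARM_LEVEL = 30
PROTECTED = "snob"         # noblemen are never removed
UNIT_POP = {"spear": 1, "axe": 1, "light": 4, "snob": 100}  # illustrative


def allowed_max_population(farm_level, bonus_percent=0):
    """Allowed maximum; rejects farm levels outside the normal 0-30 range."""
    if not 0 <= farm_level <= MAX_FARM_LEVEL:
        raise ValueError(f"farm level {farm_level} out of range")
    # Assumed linear formula; the real game uses its own per-level values.
    base = DEFAULT_MAX_POP * farm_level // MAX_FARM_LEVEL
    return base * (100 + bonus_percent) // 100


def population(units):
    """Population used by a set of units."""
    return sum(UNIT_POP[name] * count for name, count in units.items())


def trim_units(units, allowed_max):
    """Return a copy of units reduced until population <= allowed_max."""
    if allowed_max <= 0:
        # A maximum of 0 would wipe the village, so refuse instead.
        raise ValueError("implausible allowed maximum, refusing to remove units")
    trimmed = dict(units)
    excess = population(trimmed) - allowed_max
    for name in trimmed:
        if excess <= 0:
            break
        if name == PROTECTED:
            continue
        cost = UNIT_POP[name]
        needed = -(-excess // cost)          # ceiling division
        removed = min(trimmed[name], needed)
        trimmed[name] -= removed
        excess -= removed * cost
    return trimmed
```

With a level 30 farm and a 10% population bonus, `allowed_max_population(30, 10)` gives 26,400 in this sketch, so a village at 28,200 population loses only the 1,800 excess. A farm level of 31 or a computed maximum of 0 raises an error before any units are touched.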