FoL 2012-2 – quadratic fail
Intro
Till ca 18:30
Well, everything is going smoothly with the four-core CPU (actually Dual-Core AMD Opteron™ Processor 2216), 8 GB memory dedicated machine. Even the inital load didn’t cause any problems (probably thanks to placing a mirror to different machine).
Dark age
Responses from server take longer and longer and sometimes users get even 5xx errors.
top
utility shows that one core is fully occupied withmysqld
and because almost all web requests work with database, they have to wait for that. What’s the cause? Writes to shared table and locks on that? Restartingmysqld
seems to make situation a bit better. However, several minutes of ineffective thinking about the cause interleaved with few desperate restarts ofmysqld
produce a bunch of upset users on the other end(s).
Obviously, it’s not a hardware insufficiency – a few tens of seconds after the restart everything seems to work smoothly again. Well, that could be caused by regular (every two minutes) invokation of statistics generation. Alright, I’m turning off automatic stats generation, all in all, it should be freezed before the end of game.
Final delay
Generating final statistics tooks nontrivial amount of time even on local unloaded PC. There must be something fatally wrong in it.
Probable cause of the failure
As can be seen on Fig. 2, generating task statistics with all the answers
in the system takes approximately 90 seconds on unloaded machine with
roughly 1.4 times faster CPU than web server’s (ratio of bogomips listed in
/proc/cpuinfo
).
An important constant is 120 seconds – it’s the time between two invocations
of stats generation. Therefore, it might have happened that on loaded game
machine generating the stats reached the constant earlier than in 3500
answers and the system stopped responding. According to logs the system
didn’t accept a single answer between 19:18:00+0100 and 19:30:00+0100 (Fig.
3) as the mysqld
was occupied by stats calculation.
Figure 1 Handled HTTP requests during time (according to Apache log, mean values over each minute). Answers present in the system. (Note the peaks at the beginning and the end of the game as well as at the beginning and the end of Hurry-up series (18:00:00+0100–18:30:00+0100).)
Primary cause
Statistics are in a group of six views that are on demand (or regularly as
cron
ed task) copied into temporary tables that serve as a cache. All but one
view are calculated in reasonable time (see Fig. 2).
Figure 2 Dependency of statistics generation time on number of answers in the system. (Benchmarked with real data on unloaded machine with Intel® Core™ i5-3550.)
The root of all evil was in the definition of view view_task_stat
,
namely counting the number of incorrect answers to a problem in a subquery.
I don’t know how other DBMS would cope with this optimization challenge but MySQL
executes the query in quadratic time (quadratic with respect to the size of
view_incorrect_answer
which is bounded from above with a number of answers
in the system).
Rewriting the query using JOINs (and technicaly joining a bit different views) helps and query is executed in linear time (Fig. 3).
Figure 3 Dependency of statistics generation time on number of answers in the system. Subquery replaced with join. (Benchmarked with real data on unloaded machine with Intel® Core™ i5-3550.)
Conclusion
In the first year of Physics Brawl Online, there were problems with HW that struggled with handling large amount of requests (especially during the initial raid). However, the mentioned primary cause was present in that system too. At that time, I didn’t have that much experience and diagnostics data. Probably, effect of quadratic execution time in stats generation couldn’t be directly observed because of lesser answers count in the system (roughly 1800 in total) and general sluggishness of that machine
To the second year. Because the system responded promptly during the initial phase or during Hurry-up series and there was found serious bug in DB schema, it could be concluded that whole game would have passed smoothly if the bug had been fixed before.
Reference
- Processed data (6.5 KiB) from Apache log, system log and MySQL views benchmarking.
- Game system sources at Github.