Intro

Till ca 18:30

Well, everything is going smoothly with the four-core CPU (actually Dual-Core AMD Opteron™ Processor 2216), 8 GB memory dedicated machine. Even the inital load didn’t cause any problems (probably thanks to placing a mirror to different machine).

Dark age

Responses from server take longer and longer and sometimes users get even 5xx errors. top utility shows that one core is fully occupied with mysqld and because almost all web requests work with database, they have to wait for that. What’s the cause? Writes to shared table and locks on that? Restarting mysqld seems to make situation a bit better. However, several minutes of ineffective thinking about the cause interleaved with few desperate restarts of mysqld produce a bunch of upset users on the other end(s).

Obviously, it’s not a hardware insufficiency – a few tens of seconds after the restart everything seems to work smoothly again. Well, that could be caused by regular (every two minutes) invokation of statistics generation. Alright, I’m turning off automatic stats generation, all in all, it should be freezed before the end of game.

Final delay

Generating final statistics tooks nontrivial amount of time even on local unloaded PC. There must be something fatally wrong in it.

Probable cause of the failure

As can be seen on Fig. 2, generating task statistics with all the answers in the system takes approximately 90 seconds on unloaded machine with roughly 1.4 times faster CPU than web server’s (ratio of bogomips listed in /proc/cpuinfo). An important constant is 120 seconds – it’s the time between two invocations of stats generation. Therefore, it might have happened that on loaded game machine generating the stats reached the constant earlier than in 3500 answers and the system stopped responding. According to logs the system didn’t accept a single answer between 19:18:00+0100 and 19:30:00+0100 (Fig. 3) as the mysqld was occupied by stats calculation.

Figure 1

Figure 1 Handled HTTP requests during time (according to Apache log, mean values over each minute). Answers present in the system. (Note the peaks at the beginning and the end of the game as well as at the beginning and the end of Hurry-up series (18:00:00+0100–18:30:00+0100).)

Primary cause

Statistics are in a group of six views that are on demand (or regularly as croned task) copied into temporary tables that serve as a cache. All but one view are calculated in reasonable time (see Fig. 2).

Figure 2

Figure 2 Dependency of statistics generation time on number of answers in the system. (Benchmarked with real data on unloaded machine with Intel® Core™ i5-3550.)

The root of all evil was in the definition of view view_task_stat, namely counting the number of incorrect answers to a problem in a subquery. I don’t know how other DBMS would cope with this optimization challenge but MySQL executes the query in quadratic time (quadratic with respect to the size of view_incorrect_answer which is bounded from above with a number of answers in the system).

    SELECT
    `view_possibly_available_task`.*,
    MIN(`view_possibly_available_taskcorrect_answer`.`inserted`) AS `best_time`,
    MAX(`view_correct_answer`view_correct_answer.`inserted`) AS `worst_time`,
    FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(`view_correct_answercorrect_answer`.`inserted`)))
        AS `avg_time`,
    COUNT(DISTINCT `view_correct_answercorrect_answerrect_answer`.`id_answer`)
        AS `count_correct_answer`,
    -- vvvv
    IFNULL((
        SELECT COUNT(`view_incorrect_answer`.`id_answer`)
        FROM `view_incorrect_answer`
        WHERE `view_incorrect_answer`.`id_task` = `view_possibly_available_task`.`id_task`
        GROUP BY `view_incorrect_answer`.`id_task`),0)
          AS `count_incorrect_answer`,
    -- ^^^^
    COUNT(DISTINCT `task_state`.`id_team`) AS `count_skipped`
    FROM `view_possibly_available_taskble_task`
    LEFT JOIN `view_correct_answer` USING(`id_task`)
    LEFT JOIN `task_state`
        ON `task_state`.`id_task` = `view_possibly_available_task`.`id_task`
        AND `task_state`.`skipped` = 1
    GROUP BY `view_possibly_available_task`.`view_possibly_available_taskid_task`
    ORDER BY `view_possibly_available_task`.`id_group`,
        `view_possibly_available_tasksibly_available_task`.`number`;

Rewriting the query using JOINs (and technicaly joining a bit different views) helps and query is executed in linear time (Fig. 3).

    SELECT
    `view_possibly_available_task`.*,
    MIN(`view_correct_answer`.`inserted`) AS `best_time`,
    MAX(`view_correct_answer`.`inserted`) AS `worst_time`,
    FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(`view_correct_answer`.`inserted`))) AS `avg_time`,
    COUNT(DISTINCT `view_correct_answer`.`id_answer`) AS `count_correct_answer`,
    -- vvvv
    	COUNT(DISTINCT `view_answer`.`id_answer`) - COUNT(DISTINCT `view_correct_answer`.`id_answer`)
        AS `count_incorrect_answer`,
    -- ^^^^
    	COUNT(DISTINCT `task_state`.`id_team`) AS `count_skipped`
    FROM `view_possibly_available_task`
    LEFT JOIN `view_answer` USING(`id_task`)
    LEFT JOIN `view_correct_answer` USING(`id_task`,`id_team`)
    LEFT JOIN `task_state`
        ON `task_state`.`id_task` = `view_possibly_available_task`.`id_task`
        AND `task_state`.`skipped` = 1
    GROUP BY `view_possibly_available_task`.`id_task`
    ORDER BY `view_possibly_available_task`.`id_group`,
        `view_possibly_available_task`.`number`;

Figure 3

Figure 3 Dependency of statistics generation time on number of answers in the system. Subquery replaced with join. (Benchmarked with real data on unloaded machine with Intel® Core™ i5-3550.)

Conclusion

In the first year of Physics Brawl Online, there were problems with HW that struggled with handling large amount of requests (especially during the initial raid). However, the mentioned primary cause was present in that system too. At that time, I didn’t have that much experience and diagnostics data. Probably, effect of quadratic execution time in stats generation couldn’t be directly observed because of lesser answers count in the system (roughly 1800 in total) and general sluggishness of that machine

To the second year. Because the system responded promptly during the initial phase or during Hurry-up series and there was found serious bug in DB schema, it could be concluded that whole game would have passed smoothly if the bug had been fixed before.

Reference