Akvopedia downtime 2015-08-29 11:03 - 2015-08-30 13:47
Incident Report for Akvo
Resolved
The weblinkchecker bot started running at the same time as the weekly logrotate script was executed. The logs were rotated at 06:25:07, and as part of the rotation SIGUSR1 was sent to Apache to initiate a 'graceful restart'. In the new log file, errors of this type started to appear immediately:

[Sun Aug 30 06:26:57.965856 2015] [proxy_fcgi:error] [pid 8543:tid 140241545422592] [client 94.254.0.115:50484] AH01067: Failed to read FastCGI header
[Sun Aug 30 06:26:57.965926 2015] [proxy_fcgi:error] [pid 8543:tid 140241545422592] (104)Connection reset by peer: [client 94.254.0.115:50484] AH01075: Error dispatching request to :
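
To get a sense of how often these errors occur after a rotation, the log lines can be tallied by their AH code. The following is only a minimal sketch: the regular expression is an assumption based on the two lines quoted above, and the names `LINE_RE`, `SAMPLE_LINES` and `count_error_codes` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, inferred from the two log lines quoted in this report:
# [timestamp] [module:level] [pid N:tid N] <free text> AHnnnnn: message
LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<module>[^:\]]+):(?P<level>[^\]]+)\] "
    r"\[pid (?P<pid>\d+):tid (?P<tid>\d+)\] "
    r".*?(?P<code>AH\d{5}): (?P<message>.*)$"
)

# The two error lines from the incident, copied verbatim.
SAMPLE_LINES = [
    "[Sun Aug 30 06:26:57.965856 2015] [proxy_fcgi:error] "
    "[pid 8543:tid 140241545422592] [client 94.254.0.115:50484] "
    "AH01067: Failed to read FastCGI header",
    "[Sun Aug 30 06:26:57.965926 2015] [proxy_fcgi:error] "
    "[pid 8543:tid 140241545422592] (104)Connection reset by peer: "
    "[client 94.254.0.115:50484] AH01075: Error dispatching request to :",
]


def count_error_codes(lines):
    """Count error-level log lines per AH code; skip lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "error":
            counts[match.group("code")] += 1
    return counts
```

Calling `count_error_codes(SAMPLE_LINES)` counts one AH01067 and one AH01075 entry. Passing an open error-log file instead of the sample list would work the same way, since the function only iterates over lines.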

The server appears to have continued responding to HTTP requests until 11:03, when it stopped responding.

A simple restart of Apache took care of the immediate problem.

I think that the root cause might be this deadlock issue:

https://bz.apache.org/bugzilla/show_bug.cgi?id=56960

We could avoid the problem by switching to mpm_worker. The fix has been backported to the Debian release of Apache, but it is not yet available in the Debian repository. I'm guessing it will be released to the repository soon, so just waiting for it should be OK. Specifically checking the status of Akvopedia on Saturday mornings may also be a good idea.
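
The status check could be automated rather than done by hand. The sketch below is one possible approach and not something that is currently deployed: the function name `is_up`, the 10-second timeout and the injectable `opener` parameter are all assumptions made for illustration. The timeout matters here because a server that has stopped responding may accept a connection and then never answer.

```python
import urllib.request


def is_up(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with HTTP 200 within `timeout` seconds.

    `opener` is injectable so the check can be tested without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused or reset connections, timeouts, and HTTP error
        # statuses (urllib raises HTTPError, a subclass of OSError).
        return False
```

A small script calling `is_up` with the Akvopedia URL could be scheduled shortly after the weekly log rotation and send an alert when it returns False.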
Posted Aug 31, 2015 - 08:45 CEST