Pepperfish downtime last night

Rob Kendrick rob.kendrick at codethink.co.uk
Thu Mar 23 11:19:34 GMT 2017


Hi,
 
Summary: We had some downtime, and we're working on a fix.
Services are currently up but may be slower than usual.
Expect moments of downtime throughout the day.
 
Last night, at approximately 0330 GMT, one of Pepperfish's
virtual machines suffered a kernel issue related to handling
of interrupts, causing it to become unresponsive.  Sadly
this happened exactly as a database update was occurring,
causing some corrupted configuration files to be generated
and pushed to other servers.  The upshot was that DNS
broke on our main server and on our secondaries.  This had
a knock-on effect of also preventing mail being delivered
to us.
 
A second issue relating to very poor IO performance on one
of our virtual machine hosts has compounded the problem and
slowed a correct fix.
 
Given the extremely high uptime of the host (over 1,200 days),
and the lack of anything in any of its logs that might
suggest why IO performance has become disappointing, we plan
on rebooting it (and thus all the virtual machines it hosts)
at some point today.  This will mean the following services
will be down:
    - ssh access to 'platypus'
    - Web server
    - IMAP server
    - Mailing list delivery server
    - One incoming mail server
    - One DNS server
 
Secondary servers hosted elsewhere should receive and store
any email that is delivered while this happens; nothing
should be lost.
 
Apologies for the disruption.

B.


