On Sunday, 3 September 2017 12:47:34 PM AEST Russell Coker wrote:

> The luv server was down this morning because of a KVM error. Also another
> KVM VM on the same system crashed. Sorry for sleeping in.

 

It turned out to be BTRFS mis-managing free space, deciding there was none left, and going into read-only mode. The QEMU/KVM server blocked on disk IO and paused the virtual machines, which meant that they couldn't even respond to pings.

 

I've set up a cron job to run a weekly balance on the BTRFS filesystem, which should prevent this from happening again. I've seen similar things in the past but didn't expect them in this case because the filesystem is only 50% full.
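
For anyone wanting to do something similar, here is a rough sketch of the kind of script a weekly cron entry could call. It is not the script running on the server: the function names, the mount point argument, and the 50% usage filters are all assumptions chosen for illustration. The usage filters limit the balance to block groups that are at most that full, which keeps the run much shorter than a full balance.

```python
#!/usr/bin/env python3
"""Sketch of a weekly BTRFS balance job (hypothetical, not the server's script)."""
import subprocess


def balance_command(mount_point, data_usage=50, metadata_usage=50):
    """Build a filtered 'btrfs balance start' command as an argument list.

    -dusage/-musage restrict the balance to data/metadata block groups
    that are no more than the given percentage full. The 50% defaults
    are an assumption, not a recommendation from the original mail.
    """
    return [
        "btrfs", "balance", "start",
        f"-dusage={data_usage}",
        f"-musage={metadata_usage}",
        mount_point,
    ]


def run_balance(mount_point):
    """Run the balance and return the exit code (needs root and btrfs-progs)."""
    result = subprocess.run(balance_command(mount_point))
    return result.returncode
```

A cron wrapper would call run_balance() with the real mount point and exit with the returned code, so that a failed balance produces cron mail instead of passing silently.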

 

Also, I got an alert about problems before going to sleep last night, but it didn't look important (it looked like just a "certificate is going to expire in 2 weeks" warning, not "can't even talk to SSL server"). I've rewritten the monitor script in question to give more useful information so I won't make that mistake in future.
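
To illustrate the distinction, here is a minimal sketch of a check that reports "certificate expiring soon" and "can't talk to the server" as clearly different messages. This is not the actual monitor script: the function names, the 14 day warning threshold, and the message wording are assumptions.

```python
import socket
import ssl
import time


def classify(days_left, warn_days=14):
    """Turn days until certificate expiry into a status line."""
    if days_left < 0:
        return "CRITICAL: certificate expired %d days ago" % -days_left
    if days_left <= warn_days:
        return "WARNING: certificate expires in %d days" % days_left
    return "OK: certificate valid for %d more days" % days_left


def check_tls(host, port=443, timeout=10, warn_days=14):
    """Connect to host:port over TLS and report a status line.

    Connection and handshake failures are reported separately from
    expiry warnings so the two cannot be mistaken for each other.
    """
    try:
        context = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except ssl.SSLCertVerificationError as err:
        # With verification enabled, an already expired certificate ends up here.
        return "CRITICAL: certificate verification failed for %s:%d (%s)" % (host, port, err)
    except OSError as err:
        # Covers refused connections, timeouts, DNS failures and other TLS errors.
        return "CRITICAL: cannot talk to %s:%d (%s)" % (host, port, err)
    seconds_left = ssl.cert_time_to_seconds(not_after) - time.time()
    return classify(int(seconds_left // 86400), warn_days)
```

The point of the split is that the severity is in the first word of the message, so an alert read late at night can't easily be mistaken for a routine expiry reminder.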

 

--

My Main Blog http://etbe.coker.com.au/

My Documents Blog http://doc.coker.com.au/