
The luv server was down this morning because of a KVM error. Also another KVM VM on the same system crashed. Sorry for sleeping in. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Sunday, 3 September 2017 12:47:34 PM AEST Russell Coker wrote:
The luv server was down this morning because of a KVM error. Also another KVM VM on the same system crashed. Sorry for sleeping in.
It turned out to be BTRFS mis-managing free space, deciding there was none left, and going into read-only mode. The QEMU/KVM server blocked on disk IO and paused the virtual machines, which meant that they couldn't even respond to pings.
I've set up a cron job to run a weekly balance on the BTRFS filesystem which will prevent this happening again. I've seen similar things in the past but didn't expect them in this case because the filesystem is only 50% full.
Also, I had got an alert about problems before going to sleep last night, but it didn't look like an important issue (it looked like just a "certificate is going to expire in 2 weeks", not "can't even talk to SSL server"). I've rewritten the monitor script in question to give more useful information so I won't make that mistake in future.
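[Editor's sketch] The rewritten monitor script was not posted to the list. As a rough illustration of the distinction described above, a check could report "certificate expiring soon" and "cannot talk to the server at all" as clearly different severities. Everything here (the names `classify` and `probe`, the 14-day threshold, the message wording) is an assumption made for illustration, not the script actually in use.

```python
import socket
import ssl
import time

WARN_DAYS = 14  # assumed threshold, chosen to match the "2 weeks" in the alert


def classify(days_left, error=None):
    """Turn a probe result into a message whose severity is obvious."""
    if error is not None:
        # Connection refused, timeout, handshake or verification failure.
        return f"CRITICAL: cannot talk to TLS server: {error}"
    if days_left <= WARN_DAYS:
        return f"WARNING: certificate expires in {days_left:.0f} days"
    return f"OK: certificate valid for {days_left:.0f} more days"


def probe(host, port=443, timeout=10):
    """Return (days_left, error); exactly one of the two is None."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        return None, exc
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - time.time()) / 86400, None
```

A caller would run something like `print(classify(*probe("www.example.org")))`, where the host name is a placeholder. Note that with the default context an already expired certificate fails verification, so it is reported through the CRITICAL branch.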

On 03/09/17 21:23, Russell Coker via luv-main wrote:
It turned out to be BTRFS mis-managing free space, deciding there was none left, and going into read-only mode. The QEMU/KVM server blocked on disk IO and paused the virtual machines, which meant that they couldn't even respond to pings.
I've set up a cron job to run a weekly balance on the BTRFS filesystem which will prevent this happening again. I've seen similar things in the past but didn't expect them in this case because the filesystem is only 50% full.
Ouch! Well worth knowing about that risk. Thanks again, Russell! Cheers, Andrew
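[Editor's sketch] The thread does not say how the weekly balance is invoked. One plausible shape, assuming a small wrapper called from cron, is shown here. The usage filters (`-dusage`, `-musage` at 50%), the function names, and the mount point are assumptions; the filters limit the balance to chunks that are at most half full so the run stays short.

```python
import subprocess


def balance_command(mountpoint, data_usage=50, metadata_usage=50):
    """Build a filtered 'btrfs balance start' command line (assumed filters)."""
    return [
        "btrfs", "balance", "start",
        f"-dusage={data_usage}",      # only data chunks <= data_usage% full
        f"-musage={metadata_usage}",  # only metadata chunks <= metadata_usage% full
        mountpoint,
    ]


def run_balance(mountpoint):
    """Run the balance and return the exit status (needs root and btrfs-progs)."""
    return subprocess.run(balance_command(mountpoint), check=False).returncode
```

A weekly cron entry would then call a script that runs `run_balance("/srv")`, with `/srv` standing in for the real mount point.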

On Monday, 4 September 2017 2:54:04 PM AEST Andrew Pam via luv-main wrote:
I've set up a cron job to run a weekly balance on the BTRFS filesystem which will prevent this happening again. I've seen similar things in the past but didn't expect them in this case because the filesystem is only 50% full.
Ouch! Well worth knowing about that risk. Thanks again, Russell!
I've just joined the BTRFS list again and posted about that issue. I asked for a way of recognising the problem apart from having writes fail and whether it's a known bug.

On 3 Sep 2017 21:23, "Russell Coker via luv-main" <luv-main@luv.asn.au> wrote:
It turned out to be BTRFS mis-managing free space, deciding there was none left, and going into read-only mode.
I've been burnt by this too, on a desktop. I think you need to watch both btrfs fi show and btrfs fi df.
https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space....
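[Editor's sketch] To make "watch btrfs fi df" concrete, the following parses the text printed by `btrfs filesystem df <mountpoint>` and flags a chunk type whose used space is close to its allocated total. The function names and the 90% threshold are assumptions for illustration, and the sample figures in the usage note are made up. This only covers the `fi df` side; the unallocated device space reported by `btrfs fi show` would need a separate check.

```python
import re

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30,
         "TiB": 2**40, "PiB": 2**50, "EiB": 2**60}

# Matches lines such as: "Metadata, DUP: total=1.00GiB, used=972.80MiB"
LINE = re.compile(
    r"^(?P<kind>[\w+]+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>(?:[KMGTPE]i)?B), "
    r"used=(?P<used>[\d.]+)(?P<uunit>(?:[KMGTPE]i)?B)$"
)


def parse_df(text):
    """Map chunk type (Data, Metadata, ...) to (total_bytes, used_bytes)."""
    result = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:  # lines in any other format are skipped
            total = float(m["total"]) * UNITS[m["tunit"]]
            used = float(m["used"]) * UNITS[m["uunit"]]
            result[m["kind"]] = (total, used)
    return result


def nearly_full(df, kind="Metadata", threshold=0.9):
    """True when the used share of the allocated chunks reaches threshold."""
    total, used = df[kind]
    return total > 0 and used / total >= threshold
```

For example, given output containing `Data, single: total=100.00GiB, used=50.00GiB` and `Metadata, DUP: total=1.00GiB, used=972.80MiB`, `nearly_full(parse_df(output), "Metadata")` is true even though the data chunks are only half used.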

On Monday, 4 September 2017 5:48:16 PM AEST Andrew Spiers wrote:
I've been burnt by this too, on a desktop. I think you need to watch both btrfs fi show and btrfs fi df.
https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
Thanks for the reference. Next time I see it I'll get all that information and send it to the list.
participants (3)
- Andrew Pam
- Andrew Spiers
- Russell Coker