server crash

Russell Coker

3 Sep 2017

2:47 a.m.

The luv server was down this morning because of a KVM error. Also another KVM VM on the same system crashed. Sorry for sleeping in.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

Russell Coker

3 Sep 2017

11:23 a.m.

On Sunday, 3 September 2017 12:47:34 PM AEST Russell Coker wrote:

...

The luv server was down this morning because of a KVM error. Also another KVM VM on the same system crashed. Sorry for sleeping in.

It turned out to be BTRFS mis-managing free space, deciding there was none left, and going into read-only mode. The QEMU/KVM server blocked on disk IO and paused the virtual machines, which meant that they couldn't even respond to pings.

I've set up a cron job to run a weekly balance on the BTRFS filesystem, which will prevent this happening again. I've seen similar things in the past but didn't expect them in this case because the filesystem is only 50% full.

Also, I had got an alert about problems before going to sleep last night, but it didn't look like an important issue (it looked like just a "certificate is going to expire in 2 weeks" warning, not "can't even talk to SSL server"). I've rewritten the monitor script in question to give more useful information so I won't make that mistake in future.
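For anyone wanting to set up a similar weekly balance, here is a minimal Python sketch of a job that cron could call. The message doesn't say which balance options were actually used, so the usage filters (`-dusage=50`, `-musage=50`), the function names and the `/srv` mount point are all assumptions for illustration, not the script running on the luv server.

```python
import subprocess


def balance_command(mountpoint, data_usage=50, metadata_usage=50):
    """Build a filtered `btrfs balance start` command line.

    The usage filters limit the balance to block groups that are at most
    that percentage full. The 50% defaults are an assumption, not values
    taken from the thread.
    """
    return [
        "btrfs", "balance", "start",
        f"-dusage={data_usage}",
        f"-musage={metadata_usage}",
        mountpoint,
    ]


def run_balance(mountpoint="/srv"):
    """Run the balance and return its exit status (needs root).

    "/srv" is a placeholder mount point.
    """
    return subprocess.run(balance_command(mountpoint)).returncode
```

A small wrapper script calling `run_balance()` could then be scheduled from a cron entry with a weekly time spec such as `0 3 * * 0`. A filtered balance is usually much cheaper than a full one, which is why the sketch uses filters, but whether 50% is a sensible cut-off depends on the filesystem.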

Andrew Pam

4 Sep 2017

4:54 a.m.

On 03/09/17 21:23, Russell Coker via luv-main wrote:

...

It turned out to be BTRFS mis-managing free space, deciding there was none left, and going into read-only mode. The QEMU/KVM server blocked on disk IO and paused the virtual machines, which meant that they couldn't even respond to pings.

I've setup a cron job to run a weekly balance on the BTRFS filesystem which will prevent this happening again. I've seen similar things in the past but didn't expect them in this case because the filesystem is only 50% full.

Ouch! Well worth knowing about that risk. Thanks again, Russell! Cheers, Andrew

Russell Coker

5:21 a.m.

On Monday, 4 September 2017 2:54:04 PM AEST Andrew Pam via luv-main wrote:

...

...
I've setup a cron job to run a weekly balance on the BTRFS filesystem which will prevent this happening again. I've seen similar things in the past but didn't expect them in this case because the filesystem is only 50% full.

Ouch! Well worth knowing about that risk. Thanks again, Russell!

I've just joined the BTRFS list again and posted about that issue. I asked whether it's a known bug and whether there is a way of recognising the problem other than having writes fail.

Andrew Spiers

7:48 a.m.

On 3 Sep 2017 21:23, "Russell Coker via luv-main" <luv-main@luv.asn.au> wrote:

It turned out to be BTRFS mis-managing free space, deciding there was none left, and going into read-only mode.

I've been burnt by this too, on a desktop. I think you need to watch both btrfs fi show and btrfs df.

https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space....
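A rough Python sketch of watching both outputs together follows. It assumes the second command meant is `btrfs fi df`, and that both commands are run with `--raw` so sizes come out as plain byte counts. The idea is that `btrfs fi show` reports how much of each device has been allocated to chunks, while `btrfs fi df` reports how much of those chunks is actually in use, so a filesystem can look half full by the second measure while being almost fully allocated by the first. The sample text, the function names and the 90% threshold are made up for illustration.

```python
import re


def parse_show(text):
    """Return (size, allocated) byte pairs from `btrfs fi show --raw` device lines."""
    pattern = re.compile(r"devid\s+\d+\s+size\s+(\d+)\s+used\s+(\d+)\s+path")
    return [(int(size), int(used)) for size, used in pattern.findall(text)]


def parse_df(text):
    """Return {block group type: (total, used)} from `btrfs fi df --raw` lines."""
    pattern = re.compile(
        r"^\s*(\w+),\s*\w+:\s*total=(\d+),\s*used=(\d+)", re.MULTILINE
    )
    return {kind: (int(total), int(used)) for kind, total, used in pattern.findall(text)}


def check_allocation(show_text, df_text, threshold=0.9):
    """Warn when most of the raw device space is already allocated to chunks."""
    devices = parse_show(show_text)
    size = sum(s for s, _ in devices)
    allocated = sum(a for _, a in devices)
    data_total, data_used = parse_df(df_text)["Data"]
    return {
        "allocated_fraction": allocated / size,
        "data_fill_fraction": data_used / data_total,
        "warn": allocated / size >= threshold,  # threshold is arbitrary
    }


# Illustrative sample output (invented numbers, not from the luv server).
SAMPLE_SHOW = """\
Label: none  uuid: 00000000-0000-0000-0000-000000000000
\tTotal devices 1 FS bytes used 52505991577
\tdevid    1 size 107374182400 used 105243475968 path /dev/vda1
"""

SAMPLE_DF = """\
Data, single: total=96636764160, used=48318382080
System, DUP: total=8388608, used=16384
Metadata, DUP: total=4294967296, used=4187593113
GlobalReserve, single: total=16777216, used=0
"""

result = check_allocation(SAMPLE_SHOW, SAMPLE_DF)
```

With the invented numbers above, the data chunks are only half used but about 98% of the device is allocated, which is the kind of state where new metadata chunks can no longer be created. A monitor could alert on `result["warn"]` well before writes start failing.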

Russell Coker

7:51 a.m.

On Monday, 4 September 2017 5:48:16 PM AEST Andrew Spiers wrote:

...

I've been burnt by this too, on a desktop. I think you need to watch both btrfs fi show and btrfs df.

https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21

Thanks for the reference. Next time I see it I'll get all that information and send it to the list.

Andrew Pam

4:49 a.m.

On 03/09/17 12:47, Russell Coker via luv-main wrote:

...

The luv server was down this morning because of a KVM error. Also another KVM VM on the same system crashed. Sorry for sleeping in.

Thanks for dealing with it on a Sunday! Regards, Andrew
