zfsonlinux error reporting

Is it possible to find out when errors occurred on a RAID-Z other than just monitoring the output of "zpool status" regularly and looking for changes?

I have a RAID-Z that I just discovered has between 3 and 7 checksum errors on each of 7 disks. I want to know why disks that had worked without errors on ZFS since 6TB was a big disk have got such errors in the past couple of weeks. If I knew the date and time of the errors it might give me a clue.

The system in question has 9*6TB and 9*10TB disks in 2 RAID-Z arrays. None of the 10TB disks had a problem while 7/9 of the 6TB disks reported errors. The 6TB disks are a recent addition to the pool and the 9*10TB RAID-Z was almost full before I added them, so maybe the checksum errors are related to which disks had the most data written.

If I knew which day the errors happened on I might be able to guess at the cause. But ZFS doesn't seem to put anything in the kernel log.

Any suggestions about what I can do?

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
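PS. One interface that might have the timestamps I'm after, assuming the zfsonlinux version here is recent enough to ship the ZFS event daemon - a sketch, not something I've verified on this system:

  # recent events with timestamps; checksum problems should appear as
  # ereport.fs.zfs.checksum entries
  zpool events -v

  # zed can forward these to syslog or email as they happen; the variable
  # below is from the stock zed.rc shipped with zfsonlinux
  grep ZED_EMAIL /etc/zfs/zed.d/zed.rc

The event buffer isn't persistent across reboots, so it only covers errors noticed since the last boot.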

On Sat, May 27, 2017 at 11:09:53PM +1000, Russell Coker via luv-main wrote:
Is it possible to find out when errors occurred on a RAID-Z other than just monitoring the output of "zpool status" regularly and looking for changes?
I have a RAID-Z that I just discovered has between 3 and 7 checksum errors on each of 7 disks. I want to know why disks that had worked without errors on ZFS since 6TB was a big disk have got such errors in the past couple of weeks. If I knew the date and time of the errors it might give me a clue. The system in question has 9*6TB and 9*10TB disks in 2 RAID-Z arrays. None of the 10TB disks had a problem while 7/9 of the 6TB disks reported errors. The 6TB disks are a recent addition to the pool and the 9*10TB RAID-Z was almost full before I added them, so maybe the checksum errors are related to which disks had the most data written.
If I knew which day the errors happened on I might be able to guess at the cause. But ZFS doesn't seem to put anything in the kernel log.
Any suggestions about what I can do?
anything in 'zpool history'?

we're using https://calomel.org/zfs_health_check_script.html

individual SMART errors in drives should also list power on hours next to the errors. perhaps date and time too.

BTW RAIDZ isn't really sufficient for such large arrays of large drives. you should probably be using Z2 or Z3. eg. depending on your drive type, you may have a 55% chance of losing a Z on a rebuild reading 80TB. chance of success:

  % bc -l
  scale=20
  e(l(1-10^-14)*(8*10*10^12))
  .44932896411722159143

https://lwn.net/Articles/608896/

better if they're 10^-15 drives - only an 8% chance of a fail. still not great. or is ZFS smart and just fails to rebuild 1 stripe (8*2^zshift bytes) of the data and the rest is ok?

cheers,
robin
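ps. the same sums as shell one-liners, in case anyone wants to redo them for their own drive count and error-rate spec (numbers as above, 8 surviving 10TB drives to read):

  # unrecoverable-read-error spec of 10^-14: ~0.45 chance of a clean rebuild
  echo 'e(l(1-10^-14)*(8*10*10^12))' | bc -l

  # 10^-15 spec: ~0.92, i.e. only ~8% chance of hitting an error
  echo 'e(l(1-10^-15)*(8*10*10^12))' | bc -l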

On Saturday, 3 June 2017 12:09:14 PM AEST Robin Humble via luv-main wrote:
Any suggestions about what I can do?
anything in 'zpool history'?
  2017-05-03.10:48:38 zpool import -a [user 0 (root) on tank:linux]
  2017-06-01.20:00:03 [txg:5525955] scan setup func=1 mintxg=0 maxtxg=5525955 [on tank]
  2017-06-01.20:00:10 zpool scrub pet630 [user 0 (root) on tank:linux]

Thanks for the suggestion but the above are the most recent entries from "zpool history -il". No mention of the errors that happened after booting on the 3rd of May or the errors that were found in that scrub.
we're using https://calomel.org/zfs_health_check_script.html
Apart from checking for a recent scrub it has similar features to the mon script I wrote.
individual SMART errors in drives should also list power on hours next to the errors. perhaps date and time too.
The drives didn't have errors as such. There were some "correctable" errors logged, but not as many as ZFS found. ZFS reported no errors reading the disk, just checksum errors. The disks are returning bad data and saying it's good.
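For reference, the sort of SMART checks in question, with a made-up device name (smartctl is from smartmontools; hot-swap bays behind a RAID controller may need an extra -d option):

  # per-drive error log; each entry records the power-on-hours count
  # at which it happened
  smartctl -l error /dev/sda

  # attribute table; Reallocated_Sector_Ct, Current_Pending_Sector,
  # Offline_Uncorrectable and UDMA_CRC_Error_Count are the usual red flags
  smartctl -A /dev/sda

On these drives that only turned up a handful of "correctable" entries, nowhere near the number of checksum errors ZFS counted.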
BTW RAIDZ isn't really sufficient for such large arrays of large drives. you should probably be using Z2 or Z3. eg. depending on your drive type, you may have a 55% chance of losing a Z on a rebuild reading 80TB. chance of success:

  % bc -l
  scale=20
  e(l(1-10^-14)*(8*10*10^12))
  .44932896411722159143

https://lwn.net/Articles/608896/

better if they're 10^-15 drives - only an 8% chance of a fail. still not great. or is ZFS smart and just fails to rebuild 1 stripe (8*2^zshift bytes) of the data and the rest is ok?
That link refers to a bug in Linux Software RAID which has been fixed and which didn't apply to ZFS anyway.

Most of the articles about the possibility of multiple drive failures falsely assume that drives will fail entirely. In the last ~10 years the only drive I've been responsible for which came close to entirely failing had about 12,000 errors out of 1.5TB. If you had 2 drives fail in that manner (something I've never seen - the worst I've seen for multiple failures is 2 drives having ~200 errors) then you would probably still get a reasonable amount of data off. Especially as ZFS makes an extra copy of metadata, so on a RAID-Z1 2 disks getting corrupted won't lose any metadata.

With modern disks you are guaranteed to lose data if you use regular RAID-6. I've had lots of disks return bad data and say it's good. With RAID-Z1 losing data is much less likely.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Sun, Jun 04, 2017 at 06:39:34AM +1000, Russell Coker wrote:
On Saturday, 3 June 2017 12:09:14 PM AEST Robin Humble via luv-main wrote:
individual SMART errors in drives should also list power on hours next to the errors. perhaps date and time too.

The drives didn't have errors as such. There were some "correctable" errors logged, but not as many as ZFS found. ZFS reported no errors reading the disk, just checksum errors. The disks are returning bad data and saying it's good.
could it be bad RAM or SATA cables? do you have ECC ram?

kinda sounds like something transient, so power supply or RAM would be my guess if there's no errors and no red flags (Uncorrectable, Pending, >few Remapped etc.) in SMART.

however drives lie about SMART quite a lot. errors may just not show up or may come and go randomly. drives are insane.
BTW RAIDZ isn't really sufficient for such large arrays of large drives. you should probably be using Z2 or Z3. eg. depending on your drive type, you may have a 55% chance of losing a Z on a rebuild reading 80TB. chance of success:

  % bc -l
  scale=20
  e(l(1-10^-14)*(8*10*10^12))
  .44932896411722159143

https://lwn.net/Articles/608896/

better if they're 10^-15 drives - only an 8% chance of a fail. still not great. or is ZFS smart and just fails to rebuild 1 stripe (8*2^zshift bytes) of the data and the rest is ok?
That link refers to a bug in Linux Software RAID which has been fixed and also didn't apply to ZFS.
sorry - I meant to say you need to read down a bit and look for comments by Neil Brown. some good explanations of drive statistics. the above calculations come from there. the stats apply to Z1 and Z2 as much as R5 and R6.
Most of the articles about the possibility of multiple drive failures falsely assume that drives will entirely fail. In the last ~10 years the only drive
if md hits a problem during a rebuild that it can't recover from then it will stop and give up (at least last time I tried it), so it is essentially a whole drive fail. it's a block device, it can't do much more... (*)

our hardware raid systems also fail about a disk or two a week, so these are 'whole disk fails' regardless of whatever metrics they are using. so whole disk fails definitely do happen. we had IIRC 8% of drives dying and being replaced per year on one system.

with ZFS I'm assuming (I haven't experienced it myself) it's more common to have a per-object loss, and if eg. drives in a Z1 have dozens of bad sectors then as long as 2 of those sectors don't line up you won't lose data? (I'm a newbie to ZFS so happy to be pointed at docs. I can't see any obvious ones on this topic.)

but of course in that situation you will definitely lose some data if one drive totally fails. then there'll be sectors with errors and no redundancy. this is why we all do regular scrubs - so there are no known bad sectors sitting around on disks.

the statistics above address what happens after a drive fails in an otherwise clean Z1 array. they say that 1 new error occurs with such-and-such a probability during a rebuild of a whole new drive full of data. that's statistically at a 55% likelihood level for your array configuration.

so if you're convinced you'll never have a drive fail that's fine & good luck to you, but I'm just saying that the stats aren't on your side if you ever do :)
With modern disks you are guaranteed to lose data if you use regular RAID-6.
every scheme is guaranteed to lose data. it's just a matter of time and probabilities.
I've had lots of disks return bad data and say it's good. With RAID-Z1 losing data is much less likely.
in my experience (~17000 Enterprise SATA drive-years or so, probably half that of true SAS and plain SATA drive-years), disks silently giving back bad data is extremely rare. I can't recall seeing it, ever. once an interposer did return bad data, but that wasn't the disk's fault.

we did have data corruption once and noticed it, and tracked it down, and it wasn't the fault of disks. so it's not as if we just didn't notice 'wrong data' at the user level even though (at the time) everything wasn't checksummed like in ZFS.

I'm not saying silent data corruption doesn't happen (I heard of one incident in a company recently which hastened their transition to ZFS on linux), but perhaps there is something else wrong if you are seeing it "lots". what sort of disks are you using?

cheers,
robin

(*) in actuality it's often possible (and I have done it) to get around this and recover all but a tiny amount of data when it shouldn't be possible because there is no redundancy left and a drive has errors. it's a very manual process and far from easy - find the sector from SMART or SCSI errors or the byte where a dd read fails, dd write to it to remap the sector in the drive, cross fingers, attempt the raid rebuild again, repeat.
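roughly like this, with made-up device and sector numbers (and assuming 512 byte logical sectors) - hdparm can do the targeted read/write if you don't trust your dd arithmetic:

  # confirm the LBA from the kernel log or smartctl really is unreadable
  hdparm --read-sector 123456789 /dev/sdc

  # overwrite just that sector so the drive remaps it, then retry the rebuild
  hdparm --write-sector 123456789 --yes-i-know-what-i-am-doing /dev/sdc
  # dd equivalent:
  # dd if=/dev/zero of=/dev/sdc bs=512 seek=123456789 count=1

obviously whatever was in that sector is gone - that's the 'tiny amount of data' lost.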

On Sunday, 4 June 2017 12:52:33 PM AEST Robin Humble via luv-main wrote:
The drives didn't have errors as such. There were some "correctable" errors logged, but not as many as ZFS found. ZFS reported no errors reading the disk, just checksum errors. The disks are returning bad data and saying it's good.
could it be bad RAM or SATA cables? do you have ECC ram?
The server has ECC RAM. It has 9*8TB disks for which ZFS has never reported an error. It has 9*6TB disks of which 1 is Seagate and has never had an error reported. Of the 8*6TB WD disks 7 have ZFS errors reported against them.

The server is a Dell PowerEdge Tower T630 which has hot-swap drive bays. Having damaged SATA cables would be very unlikely. Having SATA cable issues that match the model of disk is extremely unlikely. The disks are in 3 rows of 6 disks, so we have disks in 2 rows failing and no correlation with position other than the fact that the 6TB disks are in the first row and half the second row.
kinda sounds like something transient, so power supply or RAM would be my guess if there's no errors and no red flags (Uncorrectable, Pending, >few Remapped etc.) in SMART. however drives lie about SMART quite a lot. errors may just not show up or may come and go randomly. drives are insane.
Yes, drives are insane. I'm pretty sure the PSU is up to the task; it's designed for high performance SAS disks that draw more power.
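For what it's worth, the ECC error counters are also easy to check. On Linux they're usually visible through EDAC, assuming the right edac driver is loaded for the memory controller - something like:

  # per-memory-controller counts of corrected and uncorrected errors
  grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count

  # or, if rasdaemon is installed
  ras-mc-ctl --error-count

A non-zero ue_count or a fast-growing ce_count would point at the RAM rather than the disks.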
sorry - I meant to say you need to read down a bit and look for comments by Neil Brown. some good explanations of drive statistics. the above calculations come from there. the stats apply to Z1 and Z2 as much as R5 and R6.
I've read all that. But the thing that is missing here is that RAID-Z is fundamentally not the same as RAID-5 and ZFS is not like a regular filesystem.

https://www.illumos.org/issues/3835

ZFS has redundant copies of metadata. This means that if you have enough corruption in a RAID-Z to lose some data it probably won't be enough to lose metadata. Basically a default RAID-Z configuration protects metadata as well as RAID-6 with checksums would (of which BTRFS RAID-6 is the only strict implementation I'm aware of). As an aside I think that NetApp WAFL does something similar with metadata, but as I can't afford it I haven't looked into it much.

The errors that I have just encountered (over 100 corrupted sectors so far) would not have been detected on any other RAID configuration. The design of Linux software RAID is that if the redundancy on RAID-5 or RAID-6 doesn't match the data then it's rewritten. It's theoretically possible to work out which block on a RAID-6 stripe was corrupted, but the assumption is that the data blocks were written and the parity update was lost.

There are 2 ways of fixing this problem of RAID-5/RAID-6 partial stripe writes. One option is to have an external journal, EG the battery-backed write-back caches that vendors such as HP and Dell offer as an expensive option. The other option is to do what BTRFS, ZFS, and WAFL do and write the new data in a different location. That gets you the most fragmented filesystem possible but gives the greatest reliability, and also makes snapshots easy to add.
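The relevant knobs, for anyone who wants to check or tighten this on their own pool (property names as in current zfsonlinux, pool and dataset names are just examples):

  # "all" gives every metadata block an extra copy on top of the RAID-Z
  # redundancy; "most" relaxes that for some leaf metadata
  zfs get redundant_metadata tank

  # file data can also be stored twice, independently of RAID-Z, at the
  # cost of capacity
  zfs set copies=2 tank/important

It's the checksums that turn "disk returned bad data and called it good" into a counted and correctable event in the first place.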
Most of the articles about the possibility of multiple drive failures falsely assume that drives will entirely fail. In the last ~10 years the only drive

if md hits a problem during a rebuild that it can't recover from then it will stop and give up (at least last time I tried it), so it is essentially a whole drive fail. it's a block device, it can't do much more... (*)
I've seen that too. Apparently that's supposed to be fixable with a bad blocks list that you can add WHEN YOU CREATE THE ARRAY. So all of us who have existing Linux Software RAID arrays can't fix it - fortunately that's mostly limited to /boot and sometimes / on my systems.
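For what it's worth, and hedged because I haven't tried it on an array created before the feature existed: recent mdadm documents an assemble-time option for adding a bad block log to existing members, and you can at least see whether a member already has one (device names are examples):

  # show whether this member carries a bad block log and what's in it
  mdadm --examine-badblocks /dev/sdb1

  # attempt to add a bad block log to the members at assembly time
  mdadm --assemble /dev/md0 --update=bbl /dev/sd[bcd]1

Worth testing on something disposable before trusting it on / or /boot.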
our hardware raid systems also fail about a disk or two a week. so these are 'whole disk fails' regardless of whatever metrics they are using.
If these are cases where you have some dozens of errors you don't want the disk just kicked out; you want it to stay in service until you get the replacement online.

As an aside, a disk error of any kind during a drive replacement can be more of a problem than it should be. If you are replacing a disk for a non-error reason (EG you replace disks 1 at a time with bigger disks until you can grow the array) there's no reason why a failure of another disk at the same time should be a big deal. But I've seen it bring down a NAS based on Linux Software RAID. I don't think that would bring down ZFS.
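For comparison, the ZFS version of the grow-by-replacing procedure (pool and device names are examples):

  # let the pool grow automatically once every member of the vdev is bigger
  zpool set autoexpand=on tank

  # replace one disk at a time; with the old disk still present it stays in
  # service until the new one finishes resilvering, so a read error on
  # another disk during the swap still has redundancy to fall back on
  zpool replace tank sdb sdj

"zpool status" shows the resilver progress and any errors hit along the way.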
with ZFS I'm assuming (I haven't experienced it myself) it's more common to have a per-object loss and if eg. drives in a Z1 have dozens of bad sectors, then as long as 2 of those sectors don't line up then you won't lose data? (I'm a newbie to ZFS so happy to be pointed at docs. I can't see any obviously on this topic.)
The design of ZFS is based on the assumption that errors aren't entirely independent and I believe it aims to put multiple copies of metadata on different parts of disks. This plus the extra copy of metadata means when you do hit data loss you are most likely to see only file data lost. Then at least you know which files to restore from backup or regenerate.
the statistics above address what happens after a drive fails in an otherwise clean Z1 array. they say that 1 new error occurs with such-and-such a probability during a rebuild of a whole new drive full of data. that's statistically at a 55% likelihood level for your array configuration.
It's a fairly low probability in my experience given that I've had systems go for years running a monthly ZFS scrub without such errors. I've had BTRFS systems go for years with weekly scrubs without such errors.
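For completeness, the sort of schedule I mean, as a root crontab sketch (pool name and paths are examples, adjust for your distribution):

  # monthly ZFS scrub at 2am on the 1st
  0 2 1 * * /sbin/zpool scrub tank

  # weekly BTRFS scrub of the root filesystem
  0 3 * * 0 /bin/btrfs scrub start -Bq /

"zpool status" and "btrfs scrub status" report what the last run found.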
so if you're convinced you'll never have a drive fail that's fine & good luck to you, but I'm just saying that the stats aren't on your side if you ever do :)
I'm looking into Ceph now. I think an ideal solution to large scale storage might be something like Ceph with RAID-1 SSD for metadata storage and either RAID-Z or BTRFS RAID-0 for data storage depending on how much you rely on Ceph for data redundancy.
With modern disks you are guaranteed to lose data if you use regular RAID-6.

every scheme is guaranteed to lose data. it's just a matter of time and probabilities.
Guaranteed to lose data in a fairly small amount of time if you don't have checksums on everything.
I've had lots of disks return bad data and say it's good. With RAID-Z1 losing data is much less likely.
in my experience (~17000 Enterprise SATA drive-years or so, probably half that of true SAS and plain SATA drive-years), disks silently giving back bad data is extremely rare. I can't recall seeing it, ever. once an interposer did return bad data, but that wasn't the disk's fault.
I've seen it lots of times in all manner of hardware, from old IDE disks to modern SATA. I don't use SAS much. The problem with SAS is that disks are more expensive and that leads the people paying towards cutting corners elsewhere. SATA with adequate redundancy and backups is much better than SAS with all corners cut.
we did have data corruption once and noticed it, and tracked it down, and it wasn't the fault of disks. so it's not as if we just didn't notice 'wrong data' at the user level even though (at the time) everything wasn't checksummed like in ZFS.
I've seen that too. One PC without ECC RAM had BTRFS errors, the BTRFS developers said that it wasn't the pattern of errors they would expect from storage failure or kernel errors and that I should check the RAM. The RAM turned out to be faulty. With Ext4 the system would probably have run for a lot longer in service and the corrupted data would have ended up on backup devices...
I'm not saying silent data corruption doesn't happen (I heard of one incident in a company recently which hastened their transition to ZFS on linux), but perhaps there is something else wrong if you are seeing it "lots".
what sort of disks are you using?
I've explained the disks for this incident above. But in the past I've seen it on a variety of others. I haven't kept a journal, I probably should.
(*) in actuality it's often possible (and I have done it) to get around this and recover all but a tiny amount of data when it shouldn't be possible because there is no redundancy left and a drive has errors. it's a very manual process and far from easy - find the sector from SMART or SCSI errors or the byte where a dd read fails, dd write to it to remap the sector in the drive, cross fingers, attempt the raid rebuild again, repeat.
I've done that before too. I even once wrote a personal version of ddrescue that read from 2 disks and took whichever one would give me data.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/