
On Saturday, 3 June 2017 12:09:14 PM AEST Robin Humble via luv-main wrote:
>> Any suggestions about what I can do?
> anything in 'zpool history'?
2017-05-03.10:48:38 zpool import -a [user 0 (root) on tank:linux]
2017-06-01.20:00:03 [txg:5525955] scan setup func=1 mintxg=0 maxtxg=5525955 [on tank]
2017-06-01.20:00:10 zpool scrub pet630 [user 0 (root) on tank:linux]

Thanks for the suggestion but the above are the most recent entries from "zpool history -il". No mention of the errors that happened after booting on the 3rd of May or the errors that were found in that scrub.
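For what it's worth, as far as I know "zpool history" only records commands and internal pool events, not error reports, so I wouldn't expect the errors to show up there. On ZFS on Linux they go to the event log instead, so something like the following may show them (same pool as above):

zpool events -v pet630   # error events (ereports), including checksum errors
zpool status -v pet630   # error counters plus a list of affected files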
> we're using https://calomel.org/zfs_health_check_script.html
Apart from checking for a recent scrub it has similar features to the mon script I wrote.
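For anyone who doesn't want to read the whole calomel script, a rough sketch of the kind of checks both scripts do is below (the pool name is taken from above and the exact strings grepped for are from memory, so treat it as an outline rather than something to deploy):

#!/bin/sh
# rough outline of a ZFS health check: pool state and error counters
POOL=pet630

# "zpool status -x" prints "all pools are healthy" when every pool is ONLINE
zpool status -x | grep -q 'all pools are healthy' \
    || echo "WARNING: a pool is degraded or faulted"

# "zpool status" ends with "errors: No known data errors" when the pool is clean
zpool status "$POOL" | grep -q 'No known data errors' \
    || echo "WARNING: $POOL has known data errors"

# the calomel script also warns when capacity is above a threshold and when
# the last scrub is older than N days; parsing the scrub date out of
# "zpool status" varies between ZFS versions so it isn't shown here.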
> individual SMART errors in drives should also list power on hours next to the errors. perhaps date and time too.
The drives didn't have errors as such. There were some "correctable" errors logged, but not as many as ZFS found. ZFS reported no errors reading the disk, just checksum errors. The disks are returning bad data and saying it's good.
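In case it's useful to anyone else, the power-on-hours information Robin mentions can be pulled with smartctl (the device name below is just an example):

smartctl -A /dev/sda        # SMART attribute table, including Power_On_Hours
smartctl -l error /dev/sda  # drive error log, each entry is stamped with the
                            # power-on hours at which it occurred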
> BTW RAIDZ isn't really sufficient for such large arrays of large drives.
> you should probably be using Z2 or Z3. eg. depending on your drive type,
> you may have a 55% chance of losing a Z on a rebuild reading 80TB.
> chance of success:
>
>   % bc -l
>   scale=20
>   e(l(1-10^-14)*(8*10*10^12))
>   .44932896411722159143
>
> https://lwn.net/Articles/608896/
>
> better if they're 10^-15 drives - only an 8% chance of a fail. still not
> great. or is ZFS smart and just fails to rebuild 1 stripe (8*2^zshift
> bytes) of the data and the rest is ok?
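As an aside, plugging a 10^-15 error rate into the same bc formula does give roughly the 8% figure mentioned above (with the same assumed 80TB read):

% bc -l
scale=20
e(l(1-10^-15)*(8*10*10^12))

which comes out at about .9231, i.e. roughly an 8% chance of a failed rebuild.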
That link refers to a bug in Linux Software RAID which has been fixed, and which didn't apply to ZFS anyway. Most of the articles about the possibility of multiple drive failures falsely assume that drives will fail entirely. In the last ~10 years the only drive I've been responsible for which came close to entirely failing had about 12,000 errors out of 1.5TB. If you had 2 drives fail in that manner (something I've never seen - the worst I've seen for multiple failures is 2 drives having ~200 errors) then you would probably still get a reasonable amount of data off. Especially as ZFS makes an extra copy of metadata, so on a RAID-Z1 2 disks getting corrupted won't lose any metadata.

With modern disks you are guaranteed to lose data if you use regular RAID-6, as it has no checksums to detect corruption. I've had lots of disks return bad data and say it's good. With RAID-Z1 losing data is much less likely.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/