
On Saturday, 3 June 2017 12:09:14 PM AEST Robin Humble via luv-main wrote:
>> Any suggestions about what I can do?
> anything in 'zpool history'?
2017-05-03.10:48:38 zpool import -a [user 0 (root) on tank:linux]
2017-06-01.20:00:03 [txg:5525955] scan setup func=1 mintxg=0 maxtxg=5525955 [on tank]
2017-06-01.20:00:10 zpool scrub pet630 [user 0 (root) on tank:linux]

Thanks for the suggestion but the above are the most recent entries from "zpool history -il". No mention of the errors that happened after booting on the 3rd of May or the errors that were found in that scrub.
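For what it's worth, as far as I know "zpool history" only records commands and internal pool events, not error reports, so I wouldn't expect the errors to show up there. On ZFS on Linux they go to the event log instead, so something like the following may show them (same pool as above):

zpool events -v pet630   # error events (ereports), including checksum errors
zpool status -v pet630   # error counters plus a list of affected files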
> we're using https://calomel.org/zfs_health_check_script.html
Apart from checking for a recent scrub it has similar features to the mon script I wrote.
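For anyone who doesn't want to read the whole calomel script, a rough sketch of the kind of checks both scripts do is below (the pool name is taken from above and the exact strings grepped for are from memory, so treat it as an outline rather than something to deploy):

#!/bin/sh
# rough outline of a ZFS health check: pool state and error counters
POOL=pet630

# "zpool status -x" prints "all pools are healthy" when every pool is ONLINE
zpool status -x | grep -q 'all pools are healthy' \
    || echo "WARNING: a pool is degraded or faulted"

# "zpool status" ends with "errors: No known data errors" when the pool is clean
zpool status "$POOL" | grep -q 'No known data errors' \
    || echo "WARNING: $POOL has known data errors"

# the calomel script also warns when capacity is above a threshold and when
# the last scrub is older than N days; parsing the scrub date out of
# "zpool status" varies between ZFS versions so it isn't shown here.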
> individual SMART errors in drives should also list power on hours next to the errors. perhaps date and time too.
The drives didn't have errors as such. There were some "correctable" errors logged, but not as many as ZFS found. ZFS reported no errors reading the disk, just checksum errors. The disks are returning bad data and saying it's good.
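In case it's useful to anyone else, the power-on-hours information Robin mentions can be pulled with smartctl (the device name below is just an example):

smartctl -A /dev/sda        # SMART attribute table, including Power_On_Hours
smartctl -l error /dev/sda  # drive error log, each entry is stamped with the
                            # power-on hours at which it occurred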
> BTW RAIDZ isn't really sufficient for such large arrays of large drives.
> you should probably be using Z2 or Z3. eg. depending on your drive type,
> you may have a 55% chance of losing a Z on a rebuild reading 80TB.
> chance of success:
>
>   % bc -l
>   scale=20
>   e(l(1-10^-14)*(8*10*10^12))
>   .44932896411722159143
>
> https://lwn.net/Articles/608896/
>
> better if they're 10^-15 drives - only an 8% chance of a fail. still not
> great. or is ZFS smart and just fails to rebuild 1 stripe (8*2^zshift
> bytes) of the data and the rest is ok?
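As an aside, plugging a 10^-15 error rate into the same bc formula does give roughly the 8% figure mentioned above (with the same assumed 80TB read):

% bc -l
scale=20
e(l(1-10^-15)*(8*10*10^12))

which comes out at about .9231, i.e. roughly an 8% chance of a failed rebuild.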
That link refers to a bug in Linux Software RAID which has been fixed, and which didn't apply to ZFS anyway. Most of the articles about the possibility of multiple drive failures falsely assume that drives will fail entirely. In the last ~10 years the only drive I've been responsible for which came close to entirely failing had about 12,000 errors out of 1.5TB. If you had 2 drives fail in that manner (something I've never seen - the worst I've seen for multiple failures is 2 drives having ~200 errors) then you would probably still get a reasonable amount of data off. Especially as ZFS makes an extra copy of metadata, so on a RAID-Z1 2 disks getting corrupted won't lose any metadata.

With modern disks you are guaranteed to lose data if you use regular RAID-6, as it has no checksums to detect corruption. I've had lots of disks return bad data and say it's good. With RAID-Z1 losing data is much less likely.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/