
On Sat, May 27, 2017 at 11:09:53PM +1000, Russell Coker via luv-main wrote:
> Is it possible to find out when errors occurred on a RAID-Z other than just monitoring the output of "zpool status" regularly and looking for changes?
>
> I have a RAID-Z that I just discovered has between 3 and 7 checksum errors on each of 7 disks. I want to know why disks that had worked without errors on ZFS since 6TB was a big disk have got such errors in the past couple of weeks. If I knew the date and time of the errors it might give me a clue. The system in question has 9*6TB and 9*10TB disks in 2 RAID-Z arrays. None of the 10TB disks had a problem while 7/9 of the 6TB disks reported errors. The 6TB disks are a recent addition to the pool and the 9*10TB RAID-Z was almost full before I added them, so maybe the checksum errors are related to which disks had the most data written.
>
> If I knew which day the errors happened on I might be able to guess at the cause. But ZFS doesn't seem to put anything in the kernel log.
>
> Any suggestions about what I can do?
anything in 'zpool history'? for ongoing monitoring we're using https://calomel.org/zfs_health_check_script.html

individual SMART errors on the drives should also list the power-on hours next to each error, and perhaps a date and time too. (a couple of example invocations in the PS below.)

BTW, RAIDZ isn't really sufficient for such large arrays of large drives; you should probably be using Z2 or Z3. e.g. depending on your drive type, you may have a 55% chance of losing a Z1 on a rebuild that has to read 80TB. chance of success:

  % bc -l
  scale=20
  e(l(1-10^-14)*(8*10*10^12))
  .44932896411722159143

https://lwn.net/Articles/608896/

it's better if they're 10^-15 drives - only an 8% chance of a failure - but still not great. or is ZFS smart enough to fail just the one affected stripe (8*2^ashift bytes) of the rebuild and leave the rest of the data ok?

cheers,
robin
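PS: a rough sketch of the sort of commands i mean, assuming ZFS on Linux with smartmontools installed and a pool called "tank" on /dev/sda (substitute your own pool and device names):

  # pool command history with timestamps; -i adds internally logged
  # events (scrub/resilver start and finish etc.), -l adds user/host details
  % zpool history -il tank

  # the drive's SMART error log; each entry is stamped with the
  # power-on hours at which it occurred
  % smartctl -l error /dev/sda

  # current power-on hours, to map those stamps back to a rough date
  % smartctl -A /dev/sda | grep -i power_on

zpool history won't show the checksum errors themselves, but it at least brackets when scrubs and resilvers ran, and the SMART power-on-hours stamps can be converted to an approximate date.

PPS: the same back-of-envelope sum for a 10^-15 URE drive, which is where the 8% figure above comes from:

  % bc -l
  scale=20
  e(l(1-10^-15)*(8*10*10^12))

that comes out around .92, i.e. roughly an 8% chance of an unreadable sector somewhere in the 80TB read.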