
On Sat, May 27, 2017 at 11:09:53PM +1000, Russell Coker via luv-main wrote:
> Is it possible to find out when errors occurred on a RAID-Z other than just monitoring the output of "zpool status" regularly and looking for changes?
>
> I have a RAID-Z that I just discovered has between 3 and 7 checksum errors on each of 7 disks. I want to know why disks that had worked without errors on ZFS since 6TB was a big disk have got such errors in the past couple of weeks. If I knew the date and time of the errors it might give me a clue. The system in question has 9*6TB and 9*10TB disks in 2 RAID-Z arrays. None of the 10TB disks had a problem while 7/9 of the 6TB disks reported errors. The 6TB disks are a recent addition to the pool and the 9*10TB RAID-Z was almost full before I added them, so maybe the checksum errors are related to which disks had the most data written.
>
> If I knew which day the errors happened on I might be able to guess at the cause. But ZFS doesn't seem to put anything in the kernel log.
>
> Any suggestions about what I can do?
anything in 'zpool history'? for ongoing monitoring we're using https://calomel.org/zfs_health_check_script.html

individual SMART errors on the drives should also list the power-on hours next to each error, and perhaps a date and time too. (a couple of example invocations in the PS below.)

BTW, RAIDZ isn't really sufficient for such large arrays of large drives; you should probably be using Z2 or Z3. e.g. depending on your drive type, you may have a 55% chance of losing a Z1 on a rebuild that has to read 80TB. chance of success:

  % bc -l
  scale=20
  e(l(1-10^-14)*(8*10*10^12))
  .44932896411722159143

https://lwn.net/Articles/608896/

it's better if they're 10^-15 drives - only an 8% chance of a failure - but still not great. or is ZFS smart enough to fail just the one affected stripe (8*2^ashift bytes) of the rebuild and leave the rest of the data ok?

cheers,
robin
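PS: a rough sketch of the sort of commands i mean, assuming ZFS on Linux with smartmontools installed and a pool called "tank" on /dev/sda (substitute your own pool and device names):

  # pool command history with timestamps; -i adds internally logged
  # events (scrub/resilver start and finish etc.), -l adds user/host details
  % zpool history -il tank

  # the drive's SMART error log; each entry is stamped with the
  # power-on hours at which it occurred
  % smartctl -l error /dev/sda

  # current power-on hours, to map those stamps back to a rough date
  % smartctl -A /dev/sda | grep -i power_on

zpool history won't show the checksum errors themselves, but it at least brackets when scrubs and resilvers ran, and the SMART power-on-hours stamps can be converted to an approximate date.

PPS: the same back-of-envelope sum for a 10^-15 URE drive, which is where the 8% figure above comes from:

  % bc -l
  scale=20
  e(l(1-10^-15)*(8*10*10^12))

that comes out around .92, i.e. roughly an 8% chance of an unreadable sector somewhere in the 80TB read.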