
On Fri, 26 Jul 2013 12:53:32 +1000 Craig Sanders <cas@taz.net.au> wrote:
> On Fri, Jul 26, 2013 at 12:38:29PM +1000, Russell Coker wrote:
> > The scrub has completed and I now get the following output. It seems
> > that the status has been cleared, so if it wasn't for the kernel
> > messages I wouldn't know which disk had the problem.
> that suggests to me that there is nothing wrong with the drive - if
> anything is going to stress a drive with problems, a 10+ hour 'zfs
> scrub' will do it.
>
> the drive was probably in standby mode and took too long to wake up and
> respond, causing the kernel to complain. or possibly a dodgy power
> connector or an underpowered PSU or similar.
Entries such as the following from the kernel message log seem to clearly
indicate a drive problem. Also smartctl reports a history of errors.

[1515513.068668] ata4.00: status: { DRDY ERR }
[1515513.068669] ata4.00: error: { UNC }
[1515513.103259] ata4.00: configured for UDMA/133
[1515513.103294] sd 3:0:0:0: [sdd] Unhandled sense code
[1515513.103296] sd 3:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1515513.103298] sd 3:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
[1515513.103301] Descriptor sense data with sense descriptors (in hex):
[1515513.103303]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[1515513.103307]         2f 08 4a d0
[1515513.103310] sd 3:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
[1515513.103313] sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 2f 08 4a 80 00 01 00 00
[1515513.103318] end_request: I/O error, dev sdd, sector 789072592
[1515513.103333] ata4: EH complete
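As a sanity check that these messages all refer to the same failing sector, the hex fields in the log can be decoded by hand. A small sketch (the hex values are copied from the log above; the 512-byte logical sector size is an assumption typical of drives of that era):

```python
# Information field from the descriptor sense data: "2f 08 4a d0"
sense_lba = int("2f084ad0", 16)

# Read(10) CDB from the log: 28 00 2f 08 4a 80 00 01 00 00
#   bytes 2-5 hold the starting LBA, bytes 7-8 the transfer length
cdb = bytes.fromhex("28002f084a8000010000")
cdb_lba = int.from_bytes(cdb[2:6], "big")   # start of the failed read
cdb_len = int.from_bytes(cdb[7:9], "big")   # length in sectors

print("sense LBA: ", sense_lba)             # 789072592
print("read start:", cdb_lba, "length:", cdb_len)
print("byte offset:", sense_lba * 512)      # assuming 512-byte sectors

# The sense LBA matches the sector end_request reported, and falls
# inside the range covered by the Read(10) command.
assert sense_lba == 789072592
assert cdb_lba <= sense_lba < cdb_lba + cdb_len
```

The unrecovered-read-error LBA from the sense data (0x2f084ad0 = 789072592) is exactly the sector named in the end_request line, 80 sectors into the 256-sector read, which is consistent with a genuine media error rather than a whole-drive timeout.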
> i've seen the standby/sleep phantom problem happen with several drives
> over the years - even replaced a few until i figured out what was going
> on (i've since re-used those drives with no problems).
> > I haven't run any "clear" command, so zfs decided by itself to remove
> > the data.
> no, zfs didn't SEE any problem. when it accessed the drive, there were
> no errors. I interpret this as very strong evidence that there is
> nothing wrong with the drive.
It reported 1.4MB of data that needed to be regenerated from parity, so it definitely saw problems.
> if it had seen any errors, they'd be in the error counts in the zfs
> status report.
The status report got it wrong.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/