
On Fri, Jul 26, 2013 at 02:24:22PM +1000, Russell Coker wrote:
Entries such as the following from the kernel message log seem to clearly indicate a drive problem. Also smartctl reports a history of errors.
[1515513.068668] ata4.00: status: { DRDY ERR }
[1515513.068669] ata4.00: error: { UNC }
[1515513.103259] ata4.00: configured for UDMA/133
[1515513.103294] sd 3:0:0:0: [sdd] Unhandled sense code
[1515513.103296] sd 3:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1515513.103298] sd 3:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
[1515513.103301] Descriptor sense data with sense descriptors (in hex):
[1515513.103303]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[1515513.103307]         2f 08 4a d0
[1515513.103310] sd 3:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
[1515513.103313] sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 2f 08 4a 80 00 01 00 00
[1515513.103318] end_request: I/O error, dev sdd, sector 789072592
[1515513.103333] ata4: EH complete
If it wasn't for your mention of smartctl errors, I'd suspect the SATA port as an equally likely culprit, and I still wouldn't rule it (or dodgy power/data connectors) out.

BTW, I recall reading a few years ago that drives only reallocate or remap a sector on a WRITE failure, not a READ failure, so the only way to force a good sector to be remapped over a bad sector on a read error is to write to that sector. I'm not 100% sure this is still the case. Googling for it, I haven't found the page where I originally read that, but found this instead:

http://www.sj-vs.net/forcing-a-hard-disk-to-reallocate-bad-sectors/

The suggestion from there is to use 'hdparm --read-sector' to verify that sector 789072592 has a problem, then 'hdparm --write-sector' to rewrite it. This should force the drive to remap the bad sector. 'hdparm --write-sector' will overwrite the sector with zeroes, but the next zfs scrub (or a read of the file using that sector in normal usage) will detect and correct the error.

You can also force a resilver of the entire drive: 'zpool offline' the disk, use dd to erase it (and thus force a write and remap of any bad sectors), and then 'zpool replace' it with itself.
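Roughly like this (untested, from memory), assuming the device really is /dev/sdd, the bad sector is 789072592 as reported in the kernel log above, and the pool is called 'tank' - substitute your own pool name:

    # verify the sector is actually unreadable (read-only, safe)
    hdparm --read-sector 789072592 /dev/sdd

    # rewrite it with zeroes so the drive remaps it -- this destroys the old
    # contents of that one sector, which zfs will repair from parity
    hdparm --write-sector 789072592 --yes-i-know-what-i-am-doing /dev/sdd

    # let zfs find and fix the zeroed sector
    zpool scrub tank

    # or the heavy-handed alternative: offline the disk, wipe it (forcing
    # writes, and remaps of any bad sectors), then replace it with itself
    # and let the pool resilver
    zpool offline tank sdd
    dd if=/dev/zero of=/dev/sdd bs=1M
    zpool replace tank sdd

The dd-and-replace route obviously takes much longer (it rewrites and then resilvers the whole disk), so the per-sector approach is worth trying first.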
I haven't run any "clear" command, so zfs decided by itself to remove the data.
No, zfs didn't SEE any problem. When it accessed the drive, there were no errors. I interpret this as very strong evidence that there is nothing wrong with the drive.
It reported 1.4MB of data that needed to be regenerated from parity, so it definitely saw problems.
No, that's like saying "there's corruption in this one .tar.gz file, so that proves the entire disk is failing". There are any number of reasons why some data may be corrupt while the disk is still good, and some of those reasons are exactly why error-detecting and error-correcting filesystems like zfs are necessary. If zfs had seen any read errors while scrubbing the disk, it would have shown them in the status report.

If it had seen any errors, they'd be in the error counts in the zfs status report.
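For reference, these are the places to look - 'tank' is just a placeholder pool name here:

    zpool status -v tank    # per-device READ/WRITE/CKSUM counters, the scrub
                            # summary, and any files with permanent errors
    zpool clear tank        # the command that resets those counters and the
                            # error log -- you'd know if you'd run it
    smartctl -a /dev/sdd    # the drive's own view: its error log plus attributes
                            # like Reallocated_Sector_Ct and Current_Pending_Sector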
The status report got it wrong.
Or maybe there's a tiny, minuscule chance that you're just misinterpreting what it's saying because of unfamiliarity with zfs.

craig

--
craig sanders <cas@taz.net.au>