
I'm getting some errors on a zpool scrub operation. It's obvious that sdd has
the problem, both from the below output of zpool status and from the fact that
the kernel message log has read errors about sdd.

But how can I get detail about what has gone wrong? Has sdd given corrupt data
in addition to read failures? Presumably the "(repairing)" string will
disappear as soon as the scrub finishes; when that happens, how would I
determine that sdd was to blame without the kernel error log?

# zpool status
  pool: tank
 state: ONLINE
  scan: scrub in progress since Thu Jul 25 16:38:01 2013
        1.01T scanned out of 10.3T at 164M/s, 16h26m to go
        1.40M repaired, 9.80% done
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0  (repairing)

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

Although I'm not sure I should be helping the Enlightened, I think I can
comment based on the same thing happening with my FreeBSD zpool.

On Thu, Jul 25, 2013 at 6:31 PM, Russell Coker <russell@coker.com.au> wrote:
Presumably the "(repairing)" string will disappear as soon as the scrub finishes
No, it won't. I forget what the string will look like, but it will still
indicate an error. I did some research at the time, and determined that this
wasn't a big deal, so cleared the error:

2013-04-12.07:35:30 zpool scrub storage
2013-04-12.15:35:24 zpool clear storage ada10

At least if it happens again, I have a history of which drive had it. That
said, I've done 3 x monthly scrubs since then, and had no errors re-appear.
That drive is scheduled to be replaced, so that's probably why I was so quick
to clear the error.

cheers,

/ Brett
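P.S. the timestamped lines above look like 'zpool history' output; a minimal
sketch of pulling that log out yourself, assuming the pool name 'storage' as
in my output above:

    # list every administrative command run against the pool, with timestamps
    zpool history storage
    # the long form also records which user and host ran each command
    zpool history -l storage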

On Thu, 25 Jul 2013 18:45:30 +1000 Brett Pemberton <brett.pemberton@gmail.com> wrote:
On Thu, Jul 25, 2013 at 6:31 PM, Russell Coker <russell@coker.com.au> wrote:
Presumably the "(repairing)" string will disappear as soon as the scrub finishes
No, it won't. I forget what the string will look like, but it will still indicate an error. I did some research at the time, and determined that this wasn't a big deal, so cleared the error:
2013-04-12.07:35:30 zpool scrub storage
2013-04-12.15:35:24 zpool clear storage ada10
The scrub has completed and I now get the following output. It seems that the
status has been cleared so if it wasn't for the kernel messages I wouldn't
know which disk had the problem. I haven't run any "clear" command, so zfs
decided by itself to remove the data.

# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 1.40M in 15h55m with 0 errors on Fri Jul 26 08:33:33 2013
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

On Fri, Jul 26, 2013 at 12:38:29PM +1000, Russell Coker wrote:
The scrub has completed and I now get the following output. It seems that the status has been cleared so if it wasn't for the kernel messages I wouldn't know which disk had the problem.
that suggests to me that there is nothing wrong with the drive - if anything is going to stress a drive with problems, a 10+ hour 'zfs scrub' will do it. the drive was probably in standby mode and took too long to wake up and respond, causing the kernel to complain. or possibly a dodgy power connector or an underpowered PSU or similar. i've seen the standby/sleep phantom problem happen with several drives over the years - even replaced a few until i figured out what was going on (i've since re-used those drives with no problems).
I haven't run any "clear" command, so zfs decided by itself to remove the data.
no, zfs didn't SEE any problem. when it accessed the drive, there were no
errors. I interpret this as very strong evidence that there is nothing wrong
with the drive. if it had seen any errors, they'd be in the error counts in
the zfs status report.

craig

--
craig sanders <cas@taz.net.au>

On Fri, 26 Jul 2013 12:53:32 +1000 Craig Sanders <cas@taz.net.au> wrote:
On Fri, Jul 26, 2013 at 12:38:29PM +1000, Russell Coker wrote:
The scrub has completed and I now get the following output. It seems that the status has been cleared so if it wasn't for the kernel messages I wouldn't know which disk had the problem.
that suggests to me that there is nothing wrong with the drive - if anything is going to stress a drive with problems, a 10+ hour 'zfs scrub' will do it.
the drive was probably in standby mode and took too long to wake up and respond, causing the kernel to complain. or possibly a dodgy power connector or an underpowered PSU or similar.
Entries such as the following from the kernel message log seem to clearly
indicate a drive problem. Also smartctl reports a history of errors.

[1515513.068668] ata4.00: status: { DRDY ERR }
[1515513.068669] ata4.00: error: { UNC }
[1515513.103259] ata4.00: configured for UDMA/133
[1515513.103294] sd 3:0:0:0: [sdd] Unhandled sense code
[1515513.103296] sd 3:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1515513.103298] sd 3:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
[1515513.103301] Descriptor sense data with sense descriptors (in hex):
[1515513.103303] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[1515513.103307] 2f 08 4a d0
[1515513.103310] sd 3:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
[1515513.103313] sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 2f 08 4a 80 00 01 00 00
[1515513.103318] end_request: I/O error, dev sdd, sector 789072592
[1515513.103333] ata4: EH complete
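For the record, the sort of commands I'd use to pull that error history out of
the drive itself (a sketch; the flags are standard smartmontools options, and
the device name comes from the kernel log above):

    # full SMART report: health, attributes (including Reallocated_Sector_Ct
    # and Current_Pending_Sector) and the drive's internal error log
    smartctl -a /dev/sdd
    # just the error log
    smartctl -l error /dev/sdd
    # optionally kick off a long self-test and check the result later
    smartctl -t long /dev/sdd
    smartctl -l selftest /dev/sdd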
i've seen the standby/sleep phantom problem happen with several drives over the years - even replaced a few until i figured out what was going on (i've since re-used those drives with no problems).
I haven't run any "clear" command, so zfs decided by itself to remove the data.
no, zfs didn't SEE any problem. when it accessed the drive, there were no errors. I interpret this as very strong evidence that there is nothing wrong with the drive.
It reported 1.4MB of data that needed to be regenerated from parity, it definitely saw problems.
if it had seen any errors, they'd be in the error counts in the zfs status report.
The status report got it wrong.

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

On Fri, Jul 26, 2013 at 02:24:22PM +1000, Russell Coker wrote:
Entries such as the following from the kernel message log seem to clearly indicate a drive problem. Also smartctl reports a history of errors.
[1515513.068668] ata4.00: status: { DRDY ERR }
[1515513.068669] ata4.00: error: { UNC }
[1515513.103259] ata4.00: configured for UDMA/133
[1515513.103294] sd 3:0:0:0: [sdd] Unhandled sense code
[1515513.103296] sd 3:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1515513.103298] sd 3:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
[1515513.103301] Descriptor sense data with sense descriptors (in hex):
[1515513.103303] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[1515513.103307] 2f 08 4a d0
[1515513.103310] sd 3:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
[1515513.103313] sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 2f 08 4a 80 00 01 00 00
[1515513.103318] end_request: I/O error, dev sdd, sector 789072592
[1515513.103333] ata4: EH complete
if it wasn't for your mention of smartctl errors, i'd suspect the sata port as
an equally-likely culprit. and i still wouldn't rule it (or dodgy power/data
connectors) out.

BTW, i recall reading a few years ago that drives only reallocate or remap a
sector on a WRITE failure, not a READ failure - so the only way to force a
good sector to be remapped over a bad sector on a read error is to write to
that sector. I'm not 100% sure if this is still the case. googling for it, i
haven't found the page where i originally read that but found this instead:

http://www.sj-vs.net/forcing-a-hard-disk-to-reallocate-bad-sectors/

the suggestion from there is to use 'hdparm --read-sector' to verify that
sector 789072592 has a problem, then 'hdparm --write-sector' to rewrite it.
this should force the drive to remap the bad sector. 'hdparm --write-sector'
will overwrite the sector with zeroes, but the next zfs scrub (or read of the
file using that sector in normal usage) will detect and correct the error.

you can also force a resilver of the entire drive: 'zpool offline' the disk,
use dd to erase it (and thus force a write and remap of any bad sectors), and
then 'zpool replace' it with itself.
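something like this, as an untested sketch - the sector number and device come
from your kernel log, so double-check both before writing anything, and note
that --write-sector destroys that sector, which is why hdparm demands the
extra confirmation flag:

    # confirm the sector really is unreadable (expect an I/O error)
    hdparm --read-sector 789072592 /dev/sdd
    # overwrite it with zeroes, forcing the drive to remap it if it's bad
    hdparm --yes-i-know-what-i-am-doing --write-sector 789072592 /dev/sdd
    # the read should now succeed (returning zeroes)
    hdparm --read-sector 789072592 /dev/sdd
    # then let zfs find and repair the zeroed block from parity
    zpool scrub tank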
I haven't run any "clear" command, so zfs decided by itself to remove the data.
no, zfs didn't SEE any problem. when it accessed the drive, there were no errors. I interpret this as very strong evidence that there is nothing wrong with the drive.
It reported 1.4MB of data that needed to be regenerated from parity, it definitely saw problems.
no, that's like saying "there's corruption in this one .tar.gz file so that proves the entire disk is failing"....there's any number of reasons why some data may be corrupt while the disk is still good. some of those reasons are, in fact, the reason why error-detecting and error-correcting filesystems like zfs are necessary. if zfs had seen any read errors while scrubbing the disk, it would have shown them in the status report.
if it had seen any errors, they'd be in the error counts in the zfs status report.
The status report got it wrong.
or maybe there's a tiny, minuscule chance that you're just misinterpreting
what it's saying because of unfamiliarity with zfs.

craig

--
craig sanders <cas@taz.net.au>

On Thu, Jul 25, 2013 at 06:31:54PM +1000, Russell Coker wrote:
I'm getting some errors on a zpool scrub operation. It's obvious that sdd has the problem both from the below output of zpool status and from the fact that the kernel message log has read errors about sdd.
what sort of controller is it on, and do you have standby/spin-down enabled?
that can cause drives to be booted from a raid array or zpool if they don't
respond fast enough.

that's the reason for using IT mode firmware rather than RAID mode firmware in
LSI and similar cards - it's far more forgiving of consumer drives and their
slow responses when waking up from standby. RAID mode firmware pretty much
expects enterprise drives with standby disabled.
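if you want to check, a rough sketch (the device name is just an example, and
whether -B is meaningful depends on the drive supporting APM):

    # query the current power mode (active/idle vs standby)
    hdparm -C /dev/sdd
    # show the Advanced Power Management level, if the drive supports it
    hdparm -B /dev/sdd
    # disable the drive's standby (spin-down) timeout
    hdparm -S 0 /dev/sdd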
But how can I get detail about what has gone wrong? Has sdd given corrupt data in addition to read failures? Presumably the "(repairing)" string will disappear as soon as the scrub finishes; when that happens, how would I determine that sdd was to blame without the kernel error log?
# zpool status
  pool: tank
 state: ONLINE
  scan: scrub in progress since Thu Jul 25 16:38:01 2013
        1.01T scanned out of 10.3T at 164M/s, 16h26m to go
        1.40M repaired, 9.80% done
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0  (repairing)
corrupt data on sdd will have been detected and corrected by zfs. if the block
read doesn't match the (sha256 IIRC) hash then it will be corrected from the
redundant copies on the other drives in the pool.

zpool status will tell you how much data was corrected when it has finished.
e.g. on my backup pool, zpool status says this:

  scan: scrub repaired 160K in 4h21m with 0 errors on Sat Jul 20 06:03:58 2013

also the numbers in the READ WRITE and CKSUM columns will show you the number
of errors detected for each drive.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #202:
kernel panic: write-only-memory (/dev/wom0) capacity exceeded.

On Fri, 26 Jul 2013 11:48:28 +1000 Craig Sanders <cas@taz.net.au> wrote:
On Thu, Jul 25, 2013 at 06:31:54PM +1000, Russell Coker wrote:
I'm getting some errors on a zpool scrub operation. It's obvious that sdd has the problem both from the below output of zpool status and from the fact that the kernel message log has read errors about sdd.
what sort of controller is it on, and do you have standby/spin-down enabled? that can cause drives to be booted from a raid array or zpool if they don't respond fast enough.
that's the reason for using IT mode firmware rather than RAID mode firmware in LSI and similar cards - it's far more forgiving of consumer drives and their slow responses when waking up from standby. RAID mode firmware pretty much expects enterprise drives with standby disabled.
It's the onboard controller on a Dell PowerEdge T110, a fairly basic SATA controller. I don't think that spin-down is the issue: if it was, it would affect all disks in the set, it wouldn't give kernel read error messages, and it would only affect the start of the scrub process rather than happening throughout the scrub.
zpool status will tell you how much data was corrected when it has finished.
e.g. on my backup pool, zpool status says this:
scan: scrub repaired 160K in 4h21m with 0 errors on Sat Jul 20 06:03:58 2013
also the numbers in the READ WRITE and CKSUM columns will show you the number of errors detected for each drive.
However those numbers are all 0 for me.

I'm now replacing the defective disk. I've attached a sample of iostat output;
it seems to be reading from all disks and then reconstructing the parity for
the new disk, which is surprising. I had expected it to just read the old disk
and write to the new disk, but instead I get a scrub as well as the
"resilver". So the rate of rebuild is considerably less than half what I had
hoped for: I had expected something like 130MB/s for contiguous reads and
writes, and instead each of the 5 disks is doing about 45MB/s.

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

On Fri, Jul 26, 2013 at 01:00:30PM +1000, Russell Coker wrote:
also the numbers in the READ WRITE and CKSUM columns will show you the number of errors detected for each drive.
However those numbers are all 0 for me.
as i said, i interpret that as indicating that there's no real problem with the drive - unless the kernel is retrying successfully before zfs notices the drive is having problems? is that the case?
I'm now replacing the defective disk. I've attached a sample of iostat output, it seems to be reading from all disks and then reconstructing the parity for the new disk which is surprising, I had expected it to just read the old disk and write to the new disk
there's (at least) two reasons for that.

first is that raidz is only similar to raid5/6 but not exactly the same. the
redundant copies of a data block can exist anywhere on any of the drives in
the vdev, so it's not just a straight dd-style copy from the old drive to the
new.

the second is that when you're replacing a drive, the old one may not be
reliable or trustworthy, or may even be absent from the system.

also note that with a raidz vdev you get the IOPS of a single drive - this is
why for large pools it is better to have multiple smaller raidz vdevs than one
large vdev (e.g. 3 x 5-drive raidz vdevs rather than 1 x 15-drive raidz vdev).
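purely as a hypothetical illustration of that last point (pool name and device
names are made up), the multiple-vdev layout would be created like this - three
5-drive raidz1 vdevs in one pool instead of a single 15-drive raidz1 vdev,
trading two extra parity drives for roughly three times the IOPS:

    zpool create bigtank \
        raidz sda sdb sdc sdd sde \
        raidz sdf sdg sdh sdi sdj \
        raidz sdk sdl sdm sdn sdo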
but instead I get a scrub as well as the "resilver".
that's odd. what makes you say that?
So the rate of rebuild is considerably less than half what I had hoped for, I had expected something like 130MB/s for contiguous reads and writes and instead each of the 5 disks is doing about 45MB/s.
which is about 180MB/s total read.

btw, from what I understand of the way it works, ZFS avoids the performance
penalty on raid writes by always writing the entire (variable-width) stripe.
this can result in writes being faster than reads in some cases, and is the
main reason for the recommendation to use 2, 4, 8 (a power of two) number of
DATA disks plus however many "parity" disks (1 for raidz1, 2 for raidz2 etc)
in a raidz pool.

and, of course, with raidz (or raid5) writes are always going to be limited to
the speed of, at best, a single drive.

the SATA controller is also a factor, many (most?) aren't capable of running
four or more drives at full speed simultaneously. even a cheap-but-midrange
SAS card like my LSI cards couldn't run all 8 ports at full speed with 6Gbps
SSDs flat out (since i'm only running hard disks and not SSDs on them, i will
never be limited by that so don't care)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.69    0.00   20.23    3.79    0.00   74.29

Device:  rrqm/s   wrqm/s     r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda      373.90     0.40  298.30    5.80  92344.00     75.20   303.91     1.36    4.48   2.23  67.96
sdb      195.90     0.40  502.70    5.80  89902.40     75.20   176.95     1.66    3.27   1.25  63.72
sdc      374.20     0.60  300.30    6.00  92286.40     76.80   301.54     1.41    4.59   2.38  72.84
sdd      175.10     0.60  539.30    6.00  89230.40     76.80   163.78     1.78    3.27   1.24  67.76
sdl        0.00   174.30    0.00  681.10      0.00  88107.10   129.36     6.40    9.39   1.32  89.72
hmm. does iostat know about 4K sectors yet? maybe try that with -m for
megabytes/sec rather than rsec/s.

also, what does 'zpool iostat' (or 'zpool iostat -v') and 'zpool status' say?

craig

--
craig sanders <cas@taz.net.au>
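P.S. concretely, something along these lines (the intervals are arbitrary):

    # extended per-device stats in MB/s, refreshed every 5 seconds
    iostat -mx 5
    # the per-vdev ZFS view of the same traffic
    zpool iostat -v tank 5
    # and the resilver/scrub progress
    zpool status tank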

On Fri, 26 Jul 2013 14:18:44 +1000 Craig Sanders <cas@taz.net.au> wrote:
On Fri, Jul 26, 2013 at 01:00:30PM +1000, Russell Coker wrote:
also the numbers in the READ WRITE and CKSUM columns will show you the number of errors detected for each drive.
However those numbers are all 0 for me.
as i said, i interpret that as indicating that there's no real problem with the drive - unless the kernel is retrying successfully before zfs notices the drive is having problems? is that the case?
No, the very first message in this thread included the zpool status output which stated that 1.4M of data had been regenerated.
I'm now replacing the defective disk. I've attached a sample of iostat output, it seems to be reading from all disks and then reconstructing the parity for the new disk which is surprising, I had expected it to just read the old disk and write to the new disk
there's (at least) two reasons for that.
first is that raidz is only similar to raid5/6 but not exactly the same. the redundant copies of a data block can exist anywhere on any of the drives in the vdev, so it's not just a straight dd-style copy from the old drive to the new.
the second is that when you're replacing a drive, the old one may not be reliable or trustworthy, or may even be absent from the system.
zpool replace tank \
    sdd /dev/disk/by-id/ata-ST4000DM000-1F2168_Z300MHWF-part2

In this case the old disk was online, I ran the above replace command so ZFS should know that the new disk needs to be an exact copy of the old.
but instead I get a scrub as well as the "resilver".
that's odd. what makes you say that?
I've attached the zpool status output. It shows the disk as being replaced but is accessing all disks according to the iostat output I attached previously.
and, of course, with raidz (or raid5) writes are always going to be limited to the speed of, at best, a single drive.
Actually for contiguous writes a RAID-5 array can be expected to exceed the performance of a single disk. It's not difficult to demonstrate this in real life.
the SATA controller is also a factor, many (most?) aren't capable of running four or more drives at full speed simultaneously. even a cheap-but-midrange SAS card like my LSI cards couldn't run all 8 ports at full speed with 6Gbps SSDs flat out (since i'm only running hard disks and not SSDs on them, i will never be limited by that so don't care)
Yes, that's always been an issue, dating back to IDE days.
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.69    0.00   20.23    3.79    0.00   74.29

Device:  rrqm/s   wrqm/s     r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda      373.90     0.40  298.30    5.80  92344.00     75.20   303.91     1.36    4.48   2.23  67.96
sdb      195.90     0.40  502.70    5.80  89902.40     75.20   176.95     1.66    3.27   1.25  63.72
sdc      374.20     0.60  300.30    6.00  92286.40     76.80   301.54     1.41    4.59   2.38  72.84
sdd      175.10     0.60  539.30    6.00  89230.40     76.80   163.78     1.78    3.27   1.24  67.76
sdl        0.00   174.30    0.00  681.10      0.00  88107.10   129.36     6.40    9.39   1.32  89.72
hmm. does iostat know about 4K sectors yet? maybe try that with -m for megabytes/sec rather than rsec/s.
also, what does 'zpool iostat' (or 'zpool iostat -v') and 'zpool status' say?
I've attached the zpool iostat output, and it claims that the only real
activity is reading from the old disk at 38MB/s and writing to the new disk at
the same speed. I've attached the iostat -m output, which shows that all disks
are being accessed at a speed just over 45MB/s. I guess that the difference
between 38 and 45 would be due to some random variation: zpool gives an
instant response based on past data, while iostat runs in real time and gives
more current data.

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

On Fri, Jul 26, 2013 at 02:39:37PM +1000, Russell Coker wrote:
On Fri, 26 Jul 2013 14:18:44 +1000 Craig Sanders <cas@taz.net.au> wrote:
On Fri, Jul 26, 2013 at 01:00:30PM +1000, Russell Coker wrote:
also the numbers in the READ WRITE and CKSUM columns will show you the number of errors detected for each drive.
However those numbers are all 0 for me.
as i said, i interpret that as indicating that there's no real problem with the drive - unless the kernel is retrying successfully before zfs notices the drive is having problems? is that the case?
No, the very first message in this thread included the zpool status output which stated that 1.4M of data had been regenerated.
see previous message for why that does not necessarily indicate a failed disk.
I'm now replacing the defective disk. I've attached a sample of iostat output, it seems to be reading from all disks and then reconstructing the parity for the new disk which is surprising, I had expected it to just read the old disk and write to the new disk
there's (at least) two reasons for that.
first is that raidz is only similar to raid5/6 but not exactly the same. the redundant copies of a data block can exist anywhere on any of the drives in the vdev, so it's not just a straight dd-style copy from the old drive to the new.
the second is that when you're replacing a drive, the old one may not be reliable or trustworthy, or may even be absent from the system.
zpool replace tank \
    sdd /dev/disk/by-id/ata-ST4000DM000-1F2168_Z300MHWF-part2
In this case the old disk was online, I ran the above replace command so ZFS should know that the new disk needs to be an exact copy of the old.
1. you're still thinking in raid or mdadm terms. zfs doesn't do exact copies
of disks. it does exact copies of the data on disks. a data block with
redundant copies on multiple disks WILL NOT BE IN THE SAME SECTOR on the
different disks. It will be wherever zfs saw fit to put it at the time it was
writing it.

this also means that it's not copying unused/empty sectors on the disk, it's
only copying data in use... so the replace will likely be finished a lot
sooner than you expect, and sooner than 'zpool status' estimates.

i've also read somewhere that it reads the blocks in the order that they were
written, so if you've created and deleted lots of files or snapshots,
fragmentation will be causing the disk to thrash and slow down reads. i'm not
100% sure if this is - or even was - true, just something i've read.

2. if, as you say, the drive has read errors then that will dramatically slow
down the read performance of the drive due to retries.

3. you're replacing an entire disk with a partition? is the start of the
partition 4k-aligned? if not, that could make a huge performance difference on
writing to the replacement disk (and once the buffers are filled, slow down
reading to match - no point reading faster than you can write).
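a quick, rough way to check that last point, using the by-id device and
partition number (2) from the replace command you quoted - on a 512e drive
like that one, a start sector that's a multiple of 8 means 4K alignment:

    # print the partition table in sector units and eyeball the Start column
    parted /dev/disk/by-id/ata-ST4000DM000-1F2168_Z300MHWF unit s print
    # or let parted do the arithmetic for partition 2
    parted /dev/disk/by-id/ata-ST4000DM000-1F2168_Z300MHWF align-check optimal 2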
but instead I get a scrub as well as the "resilver".
that's odd. what makes you say that?
I've attached the zpool status output. It shows the disk as being replaced but is accessing all disks according to the iostat output I attached previously.
i still don't see a scrub happening as well as a resilver. the zpool iostat output is showing about 6MB/s of other usage, as well as just under 39M/s resilvering the replacement drive. that seems a reasonable amount for doing parity checks on the data as it's reading it.
I've attached the zpool iostat output and it claims that the only real activity is reading from the old disk at 38MB/s and writing to the new disk at the same speed. I've attached the iostat -m output which shows that all disks are being accessed at a speed just over 45MB/s. I guess that the difference between 38 and 45 would be due to some random variation,
the 6M/s of other usage seems to roughly make up the difference. if i wanted to be more precise, i'd say: "38-ish plus 6-ish approximately equals 45-ish. roughly speaking" :)
zpool gives an instant response based on past data while iostat runs in real time and gives more current data.
FYI you can also use 'zpool iostat -v tank nnnn', with nnnn an interval in
seconds, similar to /usr/bin/iostat. as with iostat, ignore the first set of
output and watch it for a while.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #368:
Failure to adjust for daylight savings time.