
On Fri, Jul 26, 2013 at 02:39:37PM +1000, Russell Coker wrote:
On Fri, 26 Jul 2013 14:18:44 +1000 Craig Sanders <cas@taz.net.au> wrote:
On Fri, Jul 26, 2013 at 01:00:30PM +1000, Russell Coker wrote:
also the numbers in the READ WRITE and CKSUM columns will show you the number of errors detected for each drive.
However those numbers are all 0 for me.
as i said, i interpret that as indicating that there's no real problem with the drive - unless the kernel is retrying successfully before zfs notices the drive is having problems? is that the case?
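if you want to check whether the kernel is quietly retrying, you can compare zfs's per-device counters with what the kernel and the drive itself report. e.g. something like this (sdd is just a placeholder for whichever disk you suspect):

    zpool status -v tank        # READ/WRITE/CKSUM error counts per device
    dmesg | grep -i sdd         # kernel-level i/o errors and resets
    smartctl -a /dev/sdd | grep -iE 'reallocated|pending|uncorrect'   # the drive's own error counters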
No, the very first message in this thread included the zpool status output which stated that 1.4M of data had been regenerated.
see previous message for why that does not necessarily indicate a failed disk.
I'm now replacing the defective disk. I've attached a sample of iostat output; it seems to be reading from all disks and then reconstructing the parity for the new disk, which is surprising. I had expected it to just read the old disk and write to the new disk.
there's (at least) two reasons for that.
first is that raidz is only similar to raid5/6, not exactly the same. the data and parity for a given block can live anywhere on any of the drives in the vdev, so it's not just a straight dd-style copy from the old drive to the new.
the second is that when you're replacing a drive, the old one may not be reliable or trustworthy, or may even be absent from the system.
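that's also why the same replace mechanism works even when the old disk is completely dead or has been pulled - the resilver just rebuilds the new disk from the data and parity on the surviving drives. roughly (device names here are just examples):

    zpool offline tank sdd                            # optional: take the suspect disk out of service first
    zpool replace tank sdd /dev/disk/by-id/NEW-DISK   # NEW-DISK is obviously a placeholder
    zpool status tank                                 # watch the resilver progress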
zpool replace tank \
    sdd /dev/disk/by-id/ata-ST4000DM000-1F2168_Z300MHWF-part2
In this case the old disk was online; I ran the above replace command, so ZFS should know that the new disk needs to be an exact copy of the old.
1. you're still thinking in raid or mdadm terms. zfs doesn't do exact copies of disks, it does exact copies of the data on disks. a data block with redundant copies on multiple disks WILL NOT BE IN THE SAME SECTOR on the different disks, it will be wherever zfs saw fit to put it at the time it was writing it.

this also means that it's not copying unused/empty sectors on the disk, it's only copying data in use... so the replace will likely be finished a lot sooner than you expect, and sooner than 'zpool status' estimates.

i've also read somewhere that it reads the blocks in the order that they were written, so if you've created and deleted lots of files or snapshots, fragmentation will cause the disk to thrash and slow down reads. i'm not 100% sure if this is - or even was - true, just something i've read.

2. if, as you say, the drive has read errors then that will dramatically slow down the read performance of the drive due to retries.

3. you're replacing an entire disk with a partition? is the start of the partition 4k-aligned? if not, that could make a huge performance difference on writing to the replacement disk (and once the buffers are filled, slow down reading to match - no point reading faster than you can write). see below for a quick way to check.
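fwiw, a quick way to check the alignment, using the by-id path and partition number from your replace command ('optimal' checks against the alignment the device reports):

    parted /dev/disk/by-id/ata-ST4000DM000-1F2168_Z300MHWF align-check optimal 2
    parted /dev/disk/by-id/ata-ST4000DM000-1F2168_Z300MHWF unit s print   # start sector should be divisible by 8 (8 x 512 = 4096)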
but instead I get a scrub as well as the "resilver".
that's odd. what makes you say that?
I've attached the zpool status output. It shows the disk as being replaced but is accessing all disks according to the iostat output I attached previously.
i still don't see a scrub happening as well as a resilver. the zpool iostat output is showing about 6MB/s of other usage, as well as just under 39MB/s resilvering the replacement drive. that seems reasonable overhead for doing parity checks on the data as it's reading it.
I've attached the zpool iostat output and it claims that the only real activity is reading from the old disk at 38MB/s and writing to the new disk at the same speed. I've attached the iostat -m output which shows that all disks are being accessed at a speed just over 45MB/s. I guess that the difference between 38 and 45 would be due to some random variation,
the 6M/s of other usage seems to roughly make up the difference. if i wanted to be more precise, i'd say: "38-ish plus 6-ish approximately equals 45-ish. roughly speaking" :)
zpool gives an instant response based on past data while iostat runs in real time and gives more current data.
FYI you can also use 'zpool iostat -v tank nnnn', with nnnn in seconds, similar to /usr/bin/iostat. as with iostat, ignore the first set of output and watch it for a while.

craig

-- 
craig sanders <cas@taz.net.au>

BOFH excuse #368:

Failure to adjust for daylight savings time.