
On Sun, Jun 04, 2017 at 06:39:34AM +1000, Russell Coker wrote:
> On Saturday, 3 June 2017 12:09:14 PM AEST Robin Humble via luv-main wrote:
> > individual SMART errors in drives should also list power on hours
> > next to the errors. perhaps date and time too.
>
> The drives didn't have errors as such. There were some "correctable"
> errors logged, but not as many as ZFS found. ZFS reported no errors
> reading the disk, just checksum errors. The disks are returning bad
> data and saying it's good.
could it be bad RAM or SATA cables? do you have ECC RAM? it kinda sounds
like something transient, so power supply or RAM would be my guess if
there are no errors and no red flags (Uncorrectable, Pending, more than a
few Remapped, etc.) in SMART.

however, drives lie about SMART quite a lot. errors may just not show up,
or may come and go randomly. drives are insane.
> > BTW RAIDZ isn't really sufficient for such large arrays of large
> > drives. you should probably be using Z2 or Z3. eg. depending on your
> > drive type, you may have a 55% chance of losing a Z1 on a rebuild
> > reading 80TB. chance of success:
> >
> >   % bc -l
> >   scale=20
> >   e(l(1-10^-14)*(8*10*10^12))
> >   .44932896411722159143
> >
> > https://lwn.net/Articles/608896/
> >
> > it's better if they're 10^-15 drives - only an 8% chance of a fail.
> > still not great. or is ZFS smart and just fails to rebuild 1 stripe
> > (8*2^zshift bytes) of the data while the rest is ok?
> That link refers to a bug in Linux Software RAID which has been fixed,
> and also didn't apply to ZFS.
sorry - I meant to say you need to read down a bit and look for the
comments by Neil Brown. there are some good explanations of drive
statistics there, and the above calculations come from them. the stats
apply to Z1 and Z2 just as much as they do to RAID-5 and RAID-6.
> Most of the articles about the possibility of multiple drive failures
> falsely assume that drives will entirely fail. In the last ~10 years
> the only drive
if md hits a problem during a rebuild that it can't recover from then it
will stop and give up (at least the last time I tried it), so it is
essentially a whole-drive fail. it's a block device, it can't do much
more... (*)

our hardware raid systems also fail about a disk or two a week. those are
'whole disk fails' regardless of whatever metrics the controllers are
using, so whole-disk fails definitely do happen. we had IIRC 8% of drives
dying and being replaced per year on one system.

with ZFS I'm assuming (I haven't experienced it myself) it's more common
to have a per-object loss: if eg. drives in a Z1 have dozens of bad
sectors, then as long as 2 of those sectors don't line up you won't lose
data? (I'm a newbie to ZFS, so happy to be pointed at docs. I can't see
any obviously covering this topic.) but of course in that situation you
will definitely lose some data if one drive totally fails - then there'll
be sectors with errors and no redundancy left. this is why we all do
regular scrubs - so there are no obvious bad sectors sitting around on
disks.

the statistics above address what happens after a drive fails in an
otherwise clean Z1 array. they say that 1 new error occurs with
such-and-such a probability during a rebuild of a whole new drive full of
data - at a 55% likelihood for your array configuration. so if you're
convinced you'll never have a drive fail then that's fine & good luck to
you, but I'm just saying the stats aren't on your side if you ever do :)
> With modern disks you are guaranteed to lose data if you use regular
> RAID-6.
every scheme is guaranteed to lose data. it's just a matter of time and probabilities.
> I've had lots of disks return bad data and say it's good. With RAID-Z1
> losing data is much less likely.
in my experience (~17000 Enterprise SATA drive-years or so, plus probably
half that again of true SAS and plain SATA drive-years), disks silently
giving back bad data is extremely rare. I can't recall ever seeing it.
once an interposer did return bad data, but that wasn't the disk's fault.
we did have data corruption once, and noticed it, and tracked it down,
and it wasn't the fault of the disks. so it's not as if we just didn't
notice 'wrong data' at the user level, even though (at the time) not
everything was checksummed like in ZFS.

I'm not saying silent data corruption doesn't happen (I heard of one
incident at a company recently which hastened their transition to ZFS on
linux), but perhaps there is something else wrong if you are seeing it
"lots". what sort of disks are you using?

cheers,
robin

(*) in actuality it's often possible to get around this (and I have) and
recover all but a tiny amount of data when it shouldn't be possible,
because there is no redundancy left and a drive has errors. it's a very
manual process and far from easy: find the bad sector from smart or scsi
errors, or the byte where a dd read fails; dd write to it to remap the
sector in the drive; cross fingers; attempt the raid rebuild again;
repeat.
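for the curious, the manual remap dance in (*) looks roughly like this.
a sketch only: /dev/sdX and the sector number are hypothetical stand-ins,
and a scratch file is substituted for the failing drive so the example is
safe to run as-is.

```shell
# stand-ins so this is safe to run: in a real recovery DEV would be the
# failing drive (e.g. /dev/sdX) and SECTOR the LBA from SMART/kernel logs
DEV=$(mktemp)
dd if=/dev/urandom of="$DEV" bs=512 count=256 2>/dev/null
SECTOR=100

# 1. try to read the suspect sector (on a failing drive this read errors)
dd if="$DEV" of=/dev/null bs=512 skip="$SECTOR" count=1 2>/dev/null

# 2. overwrite it - on a real drive this makes the firmware remap the
#    sector to a spare (and destroys the 512 bytes that were there)
dd if=/dev/zero of="$DEV" bs=512 seek="$SECTOR" count=1 conv=notrunc 2>/dev/null

# 3. re-read to confirm the sector is readable again, then retry the
#    array rebuild
dd if="$DEV" of=/dev/null bs=512 skip="$SECTOR" count=1 2>/dev/null && echo remapped
```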