
On Sun, Jun 04, 2017 at 06:39:34AM +1000, Russell Coker wrote:
> On Saturday, 3 June 2017 12:09:14 PM AEST Robin Humble via luv-main wrote:
> > individual SMART errors in drives should also list power on hours
> > next to the errors. perhaps date and time too.
>
> The drives didn't have errors as such. There were some "correctable"
> errors logged, but not as many as ZFS found. ZFS reported no errors
> reading the disk, just checksum errors. The disks are returning bad
> data and saying it's good.
could it be bad RAM or SATA cables? do you have ECC RAM? it kinda sounds
like something transient, so power supply or RAM would be my guess if
there are no errors and no red flags (Uncorrectable, Pending, more than a
few Remapped, etc.) in SMART.

however, drives lie about SMART quite a lot. errors may just not show up,
or may come and go randomly. drives are insane.
> > BTW RAIDZ isn't really sufficient for such large arrays of large
> > drives. you should probably be using Z2 or Z3. eg. depending on your
> > drive type, you may have a 55% chance of losing a Z1 on a rebuild
> > reading 80TB. chance of success:
> >
> >   % bc -l
> >   scale=20
> >   e(l(1-10^-14)*(8*10*10^12))
> >   .44932896411722159143
> >
> > https://lwn.net/Articles/608896/
> >
> > it's better if they're 10^-15 drives - only an 8% chance of a fail.
> > still not great. or is ZFS smart and just fails to rebuild 1 stripe
> > (8*2^zshift bytes) of the data while the rest is ok?
> That link refers to a bug in Linux Software RAID which has been fixed,
> and also didn't apply to ZFS.
sorry - I meant to say you need to read down a bit and look for the
comments by Neil Brown. there are some good explanations of drive
statistics there, and the above calculations come from them. the stats
apply to Z1 and Z2 just as much as they do to RAID-5 and RAID-6.
> Most of the articles about the possibility of multiple drive failures
> falsely assume that drives will entirely fail. In the last ~10 years
> the only drive
if md hits a problem during a rebuild that it can't recover from then it
will stop and give up (at least the last time I tried it), so it is
essentially a whole-drive fail. it's a block device, it can't do much
more... (*)

our hardware raid systems also fail about a disk or two a week. those are
'whole disk fails' regardless of whatever metrics the controllers are
using, so whole-disk fails definitely do happen. we had IIRC 8% of drives
dying and being replaced per year on one system.

with ZFS I'm assuming (I haven't experienced it myself) it's more common
to have a per-object loss: if eg. drives in a Z1 have dozens of bad
sectors, then as long as 2 of those sectors don't line up you won't lose
data? (I'm a newbie to ZFS, so happy to be pointed at docs. I can't see
any obviously covering this topic.) but of course in that situation you
will definitely lose some data if one drive totally fails - then there'll
be sectors with errors and no redundancy left. this is why we all do
regular scrubs - so there are no obvious bad sectors sitting around on
disks.

the statistics above address what happens after a drive fails in an
otherwise clean Z1 array. they say that 1 new error occurs with
such-and-such a probability during a rebuild of a whole new drive full of
data - at a 55% likelihood for your array configuration. so if you're
convinced you'll never have a drive fail then that's fine & good luck to
you, but I'm just saying the stats aren't on your side if you ever do :)
> With modern disks you are guaranteed to lose data if you use regular
> RAID-6.
every scheme is guaranteed to lose data. it's just a matter of time and probabilities.
> I've had lots of disks return bad data and say it's good. With RAID-Z1
> losing data is much less likely.
in my experience (~17000 Enterprise SATA drive-years or so, plus probably
half that again of true SAS and plain SATA drive-years), disks silently
giving back bad data is extremely rare. I can't recall ever seeing it.
once an interposer did return bad data, but that wasn't the disk's fault.
we did have data corruption once, and noticed it, and tracked it down,
and it wasn't the fault of the disks. so it's not as if we just didn't
notice 'wrong data' at the user level, even though (at the time) not
everything was checksummed like in ZFS.

I'm not saying silent data corruption doesn't happen (I heard of one
incident at a company recently which hastened their transition to ZFS on
linux), but perhaps there is something else wrong if you are seeing it
"lots". what sort of disks are you using?

cheers,
robin

(*) in actuality it's often possible to get around this (and I have) and
recover all but a tiny amount of data when it shouldn't be possible,
because there is no redundancy left and a drive has errors. it's a very
manual process and far from easy: find the bad sector from smart or scsi
errors, or the byte where a dd read fails; dd write to it to remap the
sector in the drive; cross fingers; attempt the raid rebuild again;
repeat.
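for the curious, the manual remap dance in (*) looks roughly like this.
a sketch only: /dev/sdX and the sector number are hypothetical stand-ins,
and a scratch file is substituted for the failing drive so the example is
safe to run as-is.

```shell
# stand-ins so this is safe to run: in a real recovery DEV would be the
# failing drive (e.g. /dev/sdX) and SECTOR the LBA from SMART/kernel logs
DEV=$(mktemp)
dd if=/dev/urandom of="$DEV" bs=512 count=256 2>/dev/null
SECTOR=100

# 1. try to read the suspect sector (on a failing drive this read errors)
dd if="$DEV" of=/dev/null bs=512 skip="$SECTOR" count=1 2>/dev/null

# 2. overwrite it - on a real drive this makes the firmware remap the
#    sector to a spare (and destroys the 512 bytes that were there)
dd if=/dev/zero of="$DEV" bs=512 seek="$SECTOR" count=1 conv=notrunc 2>/dev/null

# 3. re-read to confirm the sector is readable again, then retry the
#    array rebuild
dd if="$DEV" of=/dev/null bs=512 skip="$SECTOR" count=1 2>/dev/null && echo remapped
```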