On Sunday, 4 June 2017 12:52:33 PM AEST Robin Humble via luv-main wrote:
> > The drives didn't have errors as such. There were some "correctable"
> > errors logged, but not as many as ZFS found. ZFS reported no errors
> > reading the disk, just checksum errors. The disks are returning bad
> > data and saying it's good.
>
> could it be bad RAM or SATA cables?
> do you have ECC ram?

The server has ECC RAM. It has 9*8TB disks for which ZFS has never reported
an error. It has 9*6TB disks, of which 1 is a Seagate that has never had an
error reported. Of the 8*6TB WD disks, 7 have ZFS errors reported against
them. The server is a Dell PowerEdge Tower T630 which has hot-swap drive
bays. Having damaged SATA cables would be very unlikely, and having SATA
cable issues that match the model of disk is extremely unlikely. The disks
are in 3 rows of 6 disks, so we have failing disks in 2 rows and no
correlation with position other than the fact that the 6TB disks are in the
first row and half the second row.
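
For anyone who wants to check their own pools the same way, below is a
rough sketch of pulling the per-device checksum counters out of "zpool
status" (the pool name is just an example, and it only looks at the CKSUM
column that zpool prints for each device):

#!/usr/bin/python3
# Rough sketch: list devices in a pool with a non-zero CKSUM count in
# "zpool status" output. "tank" is just an example pool name.
import subprocess

POOL = "tank"  # change to suit

out = subprocess.run(["zpool", "status", POOL],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    fields = line.split()
    # device lines look like: NAME STATE READ WRITE CKSUM
    if len(fields) >= 5 and fields[4].isdigit() and int(fields[4]) > 0:
        print(f"{fields[0]}: {fields[4]} checksum errors")

zpool status -v also lists the files affected by any errors that couldn't
be repaired.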

> kinda sounds like something transient, so power supply or RAM would
> be my guess if there's no errors and no red flags (Uncorrectable,
> Pending, >few Remapped etc.) in SMART.
>
> however drives lie about SMART quite a lot. errors may just not show
> up or may come and go randomly. drives are insane.

Yes, drives are insane. I'm pretty sure the PSU is up to the task; it's
designed for high-performance SAS disks that draw more power.
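
For reference, the red flags mentioned above are easy to check in bulk;
here's a rough sketch (it assumes smartmontools is installed, runs as root,
and uses a placeholder device name) that just pulls the usual
reallocated/pending/uncorrectable counters out of smartctl -A output:

#!/usr/bin/python3
# Rough sketch: print the "red flag" SMART attributes for one disk.
# Needs smartmontools and root; /dev/sda is a placeholder device name.
import subprocess

DEVICE = "/dev/sda"  # change to suit
RED_FLAGS = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
             "Offline_Uncorrectable")

out = subprocess.run(["smartctl", "-A", DEVICE],
                     capture_output=True, text=True).stdout
for line in out.splitlines():
    fields = line.split()
    if len(fields) > 1 and fields[1] in RED_FLAGS:
        # the raw value is the last column of the attribute table
        print(f"{DEVICE} {fields[1]}: {fields[-1]}")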

> sorry - I meant to say you need to read down a bit and look for
> comments by Neil Brown. some good explanations of drive statistics.
> the above calculations come from there. the stats apply to Z1 and Z2
> as much as R5 and R6.

I've read all that. But the thing that is missing here is that RAID-Z is
fundamentally not the same as RAID-5 and ZFS is not like a regular filesystem.
https://www.illumos.org/issues/3835
ZFS has redundant copies of metadata. This means that if you have enough
corruption in a RAID-Z to lose some data it probably won't be enough to lose
metadata. Basically a default RAID-Z configuration protects metadata as well
as RAID-6 with checksums would (of which BTRFS RAID-6 is the only strict
implementation I'm aware of). As an aside I think that NetApp WAFL does
something similar with metadata, but as I can't afford it I haven't looked
into it much.
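
Getting back to ZFS: if you want to see how much of that metadata
redundancy a dataset is configured for (or increase it), the relevant
dataset properties are copies and redundant_metadata; a quick sketch with a
made-up dataset name:

#!/usr/bin/python3
# Rough sketch: show the metadata/data redundancy properties of a dataset.
# "tank/data" is a made-up dataset name; needs the ZFS utilities installed.
import subprocess

DATASET = "tank/data"  # change to suit
out = subprocess.run(["zfs", "get", "-H", "-o", "property,value",
                      "copies,redundant_metadata", DATASET],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    prop, value = line.split("\t")
    print(f"{DATASET} {prop}={value}")

Note that copies applies to file data too, so copies=2 roughly doubles the
space used by a dataset.
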
The errors that I have just encountered (over 100 corrupted sectors so far)
would not have been detected on any other RAID configuration. The design of
Linux software RAID is that if redundancy on RAID-5 or RAID-6 doesn't match
the data then it's rewritten. It's theoretically possible to find which block
on a RAID-6 stripe was corrupted, but the assumption is that a changed block
was written and the redundancy was lost. There are 2 ways of fixing this
problem of RAID-5/RAID-6 partial stripe writes. One option is to have an
external journal, EG the battery-backed write-back caches that vendors such as
HP and Dell offer as an expensive option. The other option is to do what
BTRFS, ZFS, and WAFL do and write the new data to a different location. That
gets you the most fragmented filesystem possible, but it gives the greatest
reliability and also makes snapshots easy to add.
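
To make the partial stripe write problem concrete, here's a toy example
(plain XOR parity over 3 data blocks; it's not real md or ZFS code, just
the arithmetic) showing how a data block that gets updated without its
parity leads to silently wrong data being reconstructed after a disk loss:

#!/usr/bin/python3
# Toy illustration of the RAID-5/RAID-6 partial stripe write problem:
# XOR parity over 3 data blocks, where an interrupted write updates one
# data block but never updates the parity. Not real md or ZFS code.

def xor_blocks(blocks):
    """XOR equal-sized byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# a clean stripe: 3 data disks plus parity
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# power fails mid-write: disk 1 has the new data, the parity never got updated
data[1] = b"XXXX"

# later disk 0 dies and is reconstructed from the surviving disks + parity
reconstructed = xor_blocks([data[1], data[2], parity])
print("reconstructed disk 0:", reconstructed)  # no longer b"AAAA"
print("matches original?", reconstructed == b"AAAA")  # False: silent bad data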

> > Most of the articles about the possibility of multiple drive
> > failures falsely assume that drives will entirely fail. In the last
> > ~10 years the only drive
>
> if md hits a problem during a rebuild that it can't recover from then
> it will stop and give up (at least last time I tried it), so it is
> essentially a whole drive fail. it's a block device, it can't do
> much more... (*)

I've seen that too. Apparently that's supposed to be fixable with a bad
blocks list that you can add WHEN YOU CREATE THE ARRAY. So all of us who have
existing Linux Software RAID arrays can't fix it - fortunately that's mostly
limited to /boot and sometimes / on my systems.

> our hardware raid systems also fail about a disk or two a week. so
> these are 'whole disk fails' regardless of whatever metrics they are
> using.

If these are cases where you have some dozens of errors you don't want the
disk just kicked out, you want it to stay in service until you get the
replacement online.
As an aside a disk error of any kind during a drive replacement can be more of
a problem than it should be. If you are replacing a disk for a non-error
reason (EG you replace disks 1 at a time with bigger disks until you can grow
the array) there's no reason why a failure of another disk at the same time
should be a big deal. But I've seen it bring down a NAS based on Linux
Software RAID. I don't think that would bring down ZFS.

> with ZFS I'm assuming (I haven't experienced it myself) it's more
> common to have a per-object loss and if eg. drives in a Z1 have dozens
> of bad sectors, then as long as 2 of those sectors don't line up then
> you won't lose data? (I'm a newbie to ZFS so happy to be pointed at
> docs. I can't see any obviously on this topic.)

The design of ZFS is based on the assumption that errors aren't entirely
independent and I believe it aims to put multiple copies of metadata on
different parts of disks. This plus the extra copy of metadata means when you
do hit data loss you are most likely to see only file data lost. Then at
least you know which files to restore from backup or regenerate.

> the statistics above address what happens after a drive fails in an
> otherwise clean Z1 array. they say that 1 new error occurs with
> such-and-such a probability during a rebuild of a whole new drive full
> of data. that's statistically at a 55% likelihood level for your array
> configuration.

It's a fairly low probability in my experience given that I've had systems go
for years running a monthly ZFS scrub without such errors. I've had BTRFS
systems go for years with weekly scrubs without such errors.
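
For reference, the back-of-envelope calculation behind figures like that
55% is easy to redo; here's a sketch with example numbers (the URE rate and
rebuild size are taken from typical drive data sheets, not measured from my
array):

#!/usr/bin/python3
# Back-of-envelope probability of hitting at least one unrecoverable read
# error (URE) while reading the surviving disks during a rebuild. The
# numbers are examples from typical drive data sheets, not measurements.

URE_RATE = 1e-14         # "1 error per 1e14 bits read" is a common consumer
                         # drive spec; enterprise drives often quote 1e-15
DATA_TO_READ_TB = 8 * 6  # e.g. one disk of a 9*6TB RAID-Z1 fails, so the
                         # rebuild reads the other 8 disks

bits = DATA_TO_READ_TB * 1e12 * 8
p_clean_rebuild = (1 - URE_RATE) ** bits
print(f"P(at least one URE during rebuild) = {1 - p_clean_rebuild:.1%}")

The answer swings wildly depending on which quoted URE rate you plug in,
which is a large part of why the predictions in those articles don't match
what we see in practice.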

> so if you're convinced you'll never have a drive fail that's fine &
> good luck to you, but I'm just saying that the stats aren't on your
> side if you ever do :)

I'm looking into Ceph now. I think an ideal solution to large scale storage
might be something like Ceph with RAID-1 SSD for metadata storage and either
RAID-Z or BTRFS RAID-0 for data storage depending on how much you rely on Ceph
for data redundancy.

> > With modern disks you are guaranteed to lose data if you use regular
> > RAID-6.
>
> every scheme is guaranteed to lose data. it's just a matter of time
> and probabilities.

Guaranteed to lose data in a fairly small amount of time if you don't have
checksums on everything.

> > I've had lots of disks return bad data and say it's good. With
> > RAID-Z1 losing data is much less likely.
>
> in my experience (~17000 Enterprise SATA drive-years or so, probably
> half that of true SAS and plain SATA drive-years), disks silently
> giving back bad data is extremely rare. I can't recall seeing it,
> ever. once an interposer did return bad data, but that wasn't the
> disk's fault.

I've seen it lots of times in all manner of hardware, from old IDE disks to
modern SATA. I don't use SAS much. The problem with SAS is that disks are
more expensive and that leads the people paying towards cutting corners
elsewhere. SATA with adequate redundancy and backups is much better than SAS
with all corners cut.

> we did have data corruption once and noticed it, and tracked it down,
> and it wasn't the fault of disks. so it's not as if we just didn't
> notice 'wrong data' at the user level even though (at the time)
> everything wasn't checksummed like in ZFS.

I've seen that too. One PC without ECC RAM had BTRFS errors; the BTRFS
developers said that it wasn't the pattern of errors they would expect from
storage failure or kernel errors and that I should check the RAM. The RAM
turned out to be faulty. With Ext4 the system would probably have run for a
lot longer in service and the corrupted data would have ended up on backup
devices...

> I'm not saying silent data corruption doesn't happen (I heard of one
> incident in a company recently which hastened their transition to ZFS
> on linux), but perhaps there is something else wrong if you are seeing
> it "lots".
>
> what sort of disks are you using?

I've explained the disks for this incident above. But in the past I've seen
it on a variety of others. I haven't kept a journal; I probably should.

> (*) in actuality it's often possible (and I have) got around this and
> recovered all but a tiny amount of data when it shouldn't be possible
> because there is no redundancy left and a drive has errors. it's a
> very manual process and far from easy - find the sector from smart or
> scsi errors or the byte where a dd read fails, dd write to it to remap
> the sector in the drive, cross fingers, attempt the raid rebuild
> again, repeat.

I've done that before too. I even once wrote a personal version of ddrescue
that read from 2 disks and took whichever one would give me data.
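
The idea is simple enough that a sketch of it fits in a page of Python
(just an illustration of the approach, not the original script; the device
names, block size, and output file are placeholders, and you'd only run it
on disks that aren't part of an active array):

#!/usr/bin/python3
# Sketch of a two-source block copier: read each block from the first disk,
# and if that read fails with an I/O error, read the same offset from the
# second disk; if both fail, write zeroes for that block.
# Device paths, block size, and output file are placeholders.
import os

SOURCES = ["/dev/sdX", "/dev/sdY"]   # placeholder device names
OUTPUT = "recovered.img"
BLOCK = 64 * 1024

fds = [os.open(dev, os.O_RDONLY) for dev in SOURCES]
size = min(os.lseek(fd, 0, os.SEEK_END) for fd in fds)

with open(OUTPUT, "wb") as out:
    offset = 0
    while offset < size:
        want = min(BLOCK, size - offset)
        data = None
        for fd in fds:
            try:
                os.lseek(fd, offset, os.SEEK_SET)
                data = os.read(fd, want)
                break                     # first disk that answers wins
            except OSError:
                continue                  # I/O error, try the other disk
        if not data:
            data = b"\0" * want           # both disks failed, zero fill
        out.write(data)
        offset += len(data)

for fd in fds:
    os.close(fd)
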
--
My Main Blog
http://etbe.coker.com.au/
My Documents Blog
http://doc.coker.com.au/