
On Thursday, 30 January 2020 5:14:22 PM AEDT Craig Sanders via luv-main wrote:
> On Tue, Jan 28, 2020 at 08:02:15PM +1100, russell@coker.com.au wrote:
>> Having a storage device fail entirely seems like a rare occurrence. The
>> only time it happened to me in the last 5 years is a SSD that stopped
>> accepting writes (reads still mostly worked OK).

> it's not rare at all, but a drive doesn't have to be completely
> non-responsive to be considered "dead". It just has to consistently cause
> enough errors that it results in the pool being degraded.
In recent times I've only had one disk with such a large number of errors, a 4TB (from memory) disk with about 12,000 errors. ~12,000 errors out of ~1,000,000,000 blocks (4K block size) means about 0.0012% of blocks were bad. ZFS with copies=2 on that disk seems quite likely to give you most of your data back.
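[Editor's note: a quick sketch of the arithmetic above, assuming 4TB means 4 * 10^12 bytes and a 4KiB block size; the disk size and error count are the approximate figures from the email, not measured values.]

```python
# Back-of-envelope error rate for the ~4TB disk described above.
DISK_BYTES = 4 * 10**12   # assumed: "4TB" as a decimal terabyte figure
BLOCK_SIZE = 4096         # 4KiB blocks, as stated
BAD_BLOCKS = 12_000       # approximate error count from the email

total_blocks = DISK_BYTES // BLOCK_SIZE      # ~976 million blocks
error_fraction = BAD_BLOCKS / total_blocks
print(f"{total_blocks:,} blocks, {error_fraction:.4%} bad")  # ~0.0012% bad
```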
> To me, that's a dead drive because it's not safe to use. it can not be
> trusted to reliably store data. it is junk. the only good use for it is
> to scrap it for the magnets.
I've had about a dozen disks in the last ~5 years that would give about 20 ZFS checksum errors a month. I got them replaced at that level of errors; who knows what they might have done if they had remained in service. Presumably if the system in question had run Ext4 we would have discovered the answer to that question.
>> I've had a couple of SSDs have checksum errors recently and a lot of hard
>> drives have checksum errors. Checksum errors (where the drive returns
>> what it considers good data but BTRFS or ZFS regards as bad data) are by
>> far the most common failures I see on the 40+ storage devices I'm running
>> in recent times.

> a drive that consistently returns bad data is not fit for purpose. it is
> junk. it is a dead drive.
That's my opinion too. But sometimes the people who pay have different opinions and are happy to tolerate a small number of checksum errors.
>> BTRFS "dup" and ZFS "copies=2" would cover almost all storage hardware
>> issues that I've seen in the last 5+ years.

> IMO, two copies of data on a drive you can't trust isn't significantly
> better or more useful than one copy. It's roughly equivalent to making a
> photocopy of your important documents and then putting both copies in the
> same soggy cardboard box in a damp cellar.
If a disk gets 20 checksum errors per month out of 6TB or more of storage then the probability of 2 of those checksum errors hitting the same block is very low, even on BTRFS which I believe has a fairly random allocation for dup. I believe that ZFS is designed to allocate data to reduce the possibility of somewhat random errors taking out multiple copies of data, but I haven't investigated the details.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
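[Editor's note: a birthday-bound sketch of the "very low" claim above. It assumes the 20 monthly errors land uniformly and independently over a 6TB disk of 4KiB blocks, which real media defects (often clustered) do not satisfy, so this is an intuition check rather than a reliability model.]

```python
# Upper bound on P(two of the month's errors hit the same block):
# for k errors over n blocks, P(collision) <= C(k, 2) / n.
DISK_BYTES = 6 * 10**12   # assumed: "6TB" as a decimal terabyte figure
BLOCK_SIZE = 4096
ERRORS_PER_MONTH = 20

n_blocks = DISK_BYTES // BLOCK_SIZE                      # ~1.46 billion
pairs = ERRORS_PER_MONTH * (ERRORS_PER_MONTH - 1) // 2   # 190 error pairs
p_collision = pairs / n_blocks
print(f"P(collision) <= {p_collision:.2e} per month")    # on the order of 1e-7
```

Under those (optimistic) uniformity assumptions the chance of both dup copies of a block going bad in a month is on the order of one in ten million, which supports the point that dup/copies=2 mostly protects against scattered errors rather than clustered or whole-drive failures.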