
On Thursday, 30 January 2020 5:14:22 PM AEDT Craig Sanders via luv-main wrote:
> On Tue, Jan 28, 2020 at 08:02:15PM +1100, russell@coker.com.au wrote:
>> Having a storage device fail entirely seems like a rare occurrence. The
>> only time it happened to me in the last 5 years is a SSD that stopped
>> accepting writes (reads still mostly worked OK).

> it's not rare at all, but a drive doesn't have to be completely
> non-responsive to be considered "dead". It just has to consistently cause
> enough errors that it results in the pool being degraded.
In recent times I've only had one disk with such a large number of errors, a 4TB (from memory) disk with about 12,000 errors. ~12,000 errors out of ~1,000,000,000 blocks (4K block size) means about 0.0012% of blocks were bad. ZFS with copies=2 on that disk seems quite likely to give you most of your data back.
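[Editor's note: a quick sketch of the arithmetic above, assuming 4TB means 4 * 10^12 bytes and a 4KiB block size; the disk size and error count are the approximate figures from the email, not measured values.]

```python
# Back-of-envelope error rate for the ~4TB disk described above.
DISK_BYTES = 4 * 10**12   # assumed: "4TB" as a decimal terabyte figure
BLOCK_SIZE = 4096         # 4KiB blocks, as stated
BAD_BLOCKS = 12_000       # approximate error count from the email

total_blocks = DISK_BYTES // BLOCK_SIZE      # ~976 million blocks
error_fraction = BAD_BLOCKS / total_blocks
print(f"{total_blocks:,} blocks, {error_fraction:.4%} bad")  # ~0.0012% bad
```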
> To me, that's a dead drive because it's not safe to use. it can not be
> trusted to reliably store data. it is junk. the only good use for it is
> to scrap it for the magnets.
I've had about a dozen disks in the last ~5 years that would give about 20 ZFS checksum errors a month. I got them replaced at that level of errors; who knows what they might have done if they had remained in service. Presumably if the system in question had run Ext4 we would have discovered the answer to that question.
>> I've had a couple of SSDs have checksum errors recently and a lot of hard
>> drives have checksum errors. Checksum errors (where the drive returns
>> what it considers good data but BTRFS or ZFS regards as bad data) are by
>> far the most common failures I see on the 40+ storage devices I'm running
>> in recent times.

> a drive that consistently returns bad data is not fit for purpose. it is
> junk. it is a dead drive.
That's my opinion too. But sometimes the people who pay have different opinions and are happy to tolerate a small number of checksum errors.
>> BTRFS "dup" and ZFS "copies=2" would cover almost all storage hardware
>> issues that I've seen in the last 5+ years.

> IMO, two copies of data on a drive you can't trust isn't significantly
> better or more useful than one copy. It's roughly equivalent to making a
> photocopy of your important documents and then putting both copies in the
> same soggy cardboard box in a damp cellar.
If a disk gets 20 checksum errors per month out of 6TB or more of storage then the probability of 2 of those checksum errors hitting the same block is very low, even on BTRFS which I believe has a fairly random allocation for dup. I believe that ZFS is designed to allocate data to reduce the possibility of somewhat random errors taking out multiple copies of data, but I haven't investigated the details.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
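[Editor's note: a birthday-bound sketch of the "very low" claim above. It assumes the 20 monthly errors land uniformly and independently over a 6TB disk of 4KiB blocks, which real media defects (often clustered) do not satisfy, so this is an intuition check rather than a reliability model.]

```python
# Upper bound on P(two of the month's errors hit the same block):
# for k errors over n blocks, P(collision) <= C(k, 2) / n.
DISK_BYTES = 6 * 10**12   # assumed: "6TB" as a decimal terabyte figure
BLOCK_SIZE = 4096
ERRORS_PER_MONTH = 20

n_blocks = DISK_BYTES // BLOCK_SIZE                      # ~1.46 billion
pairs = ERRORS_PER_MONTH * (ERRORS_PER_MONTH - 1) // 2   # 190 error pairs
p_collision = pairs / n_blocks
print(f"P(collision) <= {p_collision:.2e} per month")    # on the order of 1e-7
```

Under those (optimistic) uniformity assumptions the chance of both dup copies of a block going bad in a month is on the order of one in ten million, which supports the point that dup/copies=2 mostly protects against scattered errors rather than clustered or whole-drive failures.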