
Quoting "Greg Bromage" <greg@bromage.org>
Craig Sanders wrote:
One error in 10^14 bits is nothing to worry about with 500GB drives. It's starting to get worrisome with 1 and 2TB drives. It's a guaranteed error with 10+TB arrays... and even a single 3 or 4TB drive has roughly a 30-50% chance of having at least one data error.
*nod* that's the root of my concern. And the "at least one" is the issue, because once a disk fails, if there's a second error somewhere then it WILL be encountered, because the rebuild has to traverse every sector of every remaining disk.
Out of curiosity, I googled for the 10^14 bits:
http://www.zdnet.com/blog/storage/how-data-gets-lost/167

Cause of data loss               Perception   Reality
Hardware or system problem       78%          56%
Human error                      11%          26%
Software corruption or problem   7%           9%
Computer viruses                 2%           4%
Disaster                         1-2%         1-2%

I find the 26% human error interesting. IMHO a lot of IT people quite often underestimate the impact of their work and the errors they make. Sometimes a simple solution may look worse on paper, but it may save the day, because it is easy to implement and harder to break ;-)

As for the 10^14 bits: that figure is five years old, and I am not sure whether it is still valid (a terabyte was a lot in 2007).

I have overnight synchronisation of ca. 3 terabytes per night (deltas and some full copies after maintenance), all from mirrored disks, and haven't seen any errors over the last 18 months. ZFS scrubs did not find any errors either.

Regards
Peter

On Thu, Apr 11, 2013 at 07:16:12PM +1000, Peter wrote:
Quoting "Greg Bromage" <greg@bromage.org>
Craig Sanders wrote:
One error in 10^14 bits is nothing to worry about with 500GB drives. It's starting to get worrisome with 1 and 2TB drives. It's a guaranteed error with 10+TB arrays... and even a single 3 or 4TB drive has roughly a 30-50% chance of having at least one data error.
*nod* that's the root of my concern. And the "at least one" is the issue, because once a disk fails, if there's a second error somewhere then it WILL be encountered, because the rebuild has to traverse every sector of every remaining disk.
Yeah, well, the data corruption may or may not be a sign of disk failure. There are numerous ways a disk write can be corrupted (controller, RAM, software bugs, cosmic rays, loose cable / vibration, power spike, and many more) that don't involve disk failure.

That's the point I was making: silent data corruption of this kind becomes at least as significant as actual disk failures as the size of the drive/array increases and as the volume of data being written increases. With large enough drives/arrays the probability of these kinds of errors keeps climbing until it becomes a near-certainty. We're at that point now with 3 and 4TB drives, and with large RAID/RAID-like arrays.

Which is why we need error-detecting and error-correcting filesystems like btrfs and ZFS.
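[To put rough numbers on that claim, here is a minimal back-of-envelope sketch in Python. It assumes the commonly quoted error rate of 1 in 10^14 bits and treats errors as independent via a Poisson-style approximation, which is a simplification of real drive behaviour; the results land in the same ballpark as the figures quoted above (roughly 20-30% for a 3-4TB read, over 50% for 10TB).]

    import math

    def p_at_least_one_ure(capacity_tb, bits_per_error=1e14):
        """Probability of >= 1 unrecoverable read error when reading the given capacity in full."""
        bits_read = capacity_tb * 1e12 * 8               # decimal TB -> bits
        # P(no error) ~= exp(-expected_errors), assuming independent errors
        return 1.0 - math.exp(-bits_read / bits_per_error)

    for tb in (0.5, 1, 2, 3, 4, 10):
        print(f"{tb:>4} TB read end to end: {p_at_least_one_ure(tb):5.1%}")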
As for the 10^14 bits: that figure is five years old, and I am not sure whether it is still valid (a terabyte was a lot in 2007).
I'm not sure where I got that 10^14 figure from. I read it after doing a Google search for a quick fact-check before posting this morning. Re-doing the search, I can't find the page I remember, but I found another that suggests I may have confused it with unrecoverable read error rates (UER) from 2009-era drives.

The ZFS Wikipedia article I mentioned says that modern enterprise SAS drives have UERs of 1 in 10^16 (it doesn't say, but I assume consumer SATA drives are still around 1 in 10^14). The article also mentions that a real-life study of 1.5 million drives in NetApp's database indicates that about 1 in every 90 SATA drives will suffer silent data corruption which will not be caught by RAID verification.

And yes, one of the big reasons for moving to 4K-sector drives is better error detection and recovery; the ECC overhead would be too great for 512-byte sectors.
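[The gap between those two UER classes matters most in the rebuild scenario Greg describes above, where every sector of every surviving disk has to be read. A hypothetical sketch, reusing the same Poisson-style approximation and assuming a 4 x 4TB RAID-5-style array with one drive dead and three survivors read in full; the array geometry here is illustrative, not anything described in the thread.]

    import math

    def p_rebuild_hits_ure(surviving_drives, drive_tb, bits_per_error):
        """P(>= 1 unrecoverable read error while reading every surviving drive in full)."""
        bits_read = surviving_drives * drive_tb * 1e12 * 8
        return 1.0 - math.exp(-bits_read / bits_per_error)

    # Hypothetical 4 x 4TB array: one drive failed, three must be read end to end.
    for label, uer in (("consumer SATA, 1 in 10^14", 1e14),
                       ("enterprise SAS, 1 in 10^16", 1e16)):
        p = p_rebuild_hits_ure(surviving_drives=3, drive_tb=4, bits_per_error=uer)
        print(f"{label}: {p:.1%} chance the rebuild trips over a URE")

[With these assumptions the consumer-rate rebuild has roughly a 60% chance of hitting at least one unreadable sector, versus about 1% at the enterprise rate.]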
I have overnight synchronisation of ca. 3 terabytes per night (deltas and some full copies after maintenance), all from mirrored disks, and haven't seen any errors over the last 18 months. ZFS scrubs did not find any errors either.
Another data point: I've got three zpools here at home, two are 4x1TB RAID-Z, the third is 4x2TB RAID-Z. All three pools are under moderately heavy home use (using a geek definition of "home use" :). I run 'zpool scrub' weekly on all of them.

It's not uncommon to have 'zpool status' report that the scrub repaired some data. Not exactly common, either... but it happens often enough that I'm not surprised or overly concerned when I see it, and often enough to make me glad my data and backups are on ZFS.

The disks are just consumer-grade SATA drives, Seagates or WDs depending on the pool. They aren't faulty (no SMART or other errors), and two of the zpools are on decent SAS controllers (LSI SAS2008 6Gbps running "IT" mode firmware - I had a LOT more problems when I was running the "IR" firmware). The third pool is currently on a 4-port Adaptec 1430SA SATA controller but will be moved to another one of the SAS controllers when I finally get around to replacing the myth box motherboard with one that has a PCI-e 8x slot. The 1430SA isn't a bad controller, but it's nowhere near as good as my SAS controllers (and it has only 4 SATA ports rather than 8 SAS/SATA ports). IIRC it cost me slightly more than the SAS controllers did (about $130 I think, while the SAS controllers cost about $100 each).

craig

--
craig sanders <cas@taz.net.au>
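[For anyone wanting to reproduce that weekly-scrub routine, here is a minimal sketch of a wrapper script that cron could run once a week. The 'zpool scrub' and 'zpool status' commands are the standard ZFS ones; the pool names are placeholders, and the report-before-scrub ordering is just one way to get last week's repair counts into the cron mail.]

    #!/usr/bin/env python3
    # Minimal weekly-scrub helper, intended to be run from cron: print each
    # pool's current status (which includes any repairs from the previous
    # scrub), then start a new scrub.  Scrubs run in the background, so the
    # new scrub's results only show up on the next run.

    import subprocess

    POOLS = ["tank1", "tank2", "tank3"]   # hypothetical pool names; substitute your own

    def scrub_and_report(pool: str) -> None:
        # Report the outcome of the previous scrub before starting a new one.
        status = subprocess.run(["zpool", "status", pool],
                                check=True, capture_output=True, text=True)
        print(status.stdout)
        # Kick off this week's scrub (the command returns immediately).
        subprocess.run(["zpool", "scrub", pool], check=True)

    if __name__ == "__main__":
        for pool in POOLS:
            scrub_and_report(pool)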