
On Thu, Apr 11, 2013 at 02:10:37AM +0000, James Harper wrote:
> > with disks (and raid arrays) of that size, you also have to be concerned about data errors as well as disk failures - you're pretty much guaranteed to get some, either unrecoverable errors or, worse, silent corruption of the data.

> Guaranteed over what time period?
any time period. it's a function of the quantity of data, not of time.
> It's easy to fault your logic as I just did a full scan of my array and it came up clean.
no, it's not. your array scan checks for DISK errors. It does not check for data corruption - THAT is the huge advantage of filesystems like ZFS and btrfs: they can detect and correct data errors.
> If you say you are "guaranteed to get some" over, say, a 10 year period, then I guess that's fair enough. But as you don't specify a timeframe I can't really contest the point.
you seem to be confusing data corruption with MTBF or similar - it's not like that at all. it's not about disk hardware faults, it's about the sheer size of storage arrays these days making it a statistical near-certainty that some corruption will occur - write errors due to, e.g., random bit-flips, controller brain-farts, firmware bugs, cosmic rays, and so on.

e.g. a typical quoted rating of 1 error per 10^14 bits works out to roughly one error per 12 terabytes read - i.e. your four x 3TB array can be expected to have at least one error in the data. one error in 10^14 bits is nothing to worry about with 500GB drives. it's starting to get worrisome with 1 and 2TB drives. it's effectively certain with 10+TB arrays....and even a single 3 or 4TB drive has roughly a 30-50% chance of having at least one data error.
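for the record, the back-of-the-envelope arithmetic behind that ~12TB figure (assuming the commonly-quoted 1-per-10^14-bit error rate, decimal terabytes, and a single end-to-end read of the array):

    1 error / 10^14 bits  =  1 error / 1.25 x 10^13 bytes  ~=  1 error per ~12.5 TB read
    four x 3TB array      =  1.2 x 10^13 bytes  =  9.6 x 10^13 bits  ->  ~1 expected error per full pass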
> I can say though that I do monitor the SMART values which do track corrected and uncorrected error rates, and by extrapolating those figures I can say with confidence that there is not a guarantee of unrecoverable errors.
smart values really only tell you about detected errors in the drive itself. they don't tell you *anything* about data corruption problems - for that, you actually need to check the data...and to check the data you need a redundant copy or copies AND a hash of what it's supposed to be.

with mdadm, such errors can only be corrected if the data can be rewritten to the same sector or if the drive can remap a spare sector to that spot. with zfs, because it's a COW filesystem all that needs to be done is to rewrite the data.
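a crude, filesystem-agnostic way to see what "a hash of what it's supposed to be" means in practice (the paths here are made up for illustration):

    # build a checksum manifest of everything under /data, then verify it later.
    # this only *detects* corruption - to correct it you still need a known-good
    # redundant copy somewhere. zfs does the equivalent per block, automatically,
    # on every read and on every scrub.
    cd /data
    find . -type f -print0 | xargs -0 sha256sum > /root/data.sha256
    # ...some months later...
    cd /data && sha256sum -c /root/data.sha256 | grep -v ': OK$'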
> The part that says "not visible to the host software" kind of bothers me.
yes, that's why it's a problem, and that's why a filesystem that keeps both multiple copies (mirroring or raid5/6-like) AND a hash of each block is essential for detecting and correcting errors in the data.
> AFAICS these are reported via SMART and are entirely visible, with some exceptions of poor SMART implementations.
no. SMART detects disk faults, not data corruption.
> > personally, i wouldn't use raid-5 (or raid-6) any more. I'd use ZFS RAID-Z (raid5 equiv) or RAID-Z2 (raid6 equiv. with 2 parity disks) instead.

> Putting the error correction/detection in the filesystem bothers me. Putting it at the block device level would benefit a lot more infrastructure - LVM volumes for VM's, swap partitions, etc.
having used ZFS for quite some time now, it makes perfect sense to me for it to be in the filesystem layer rather than at the block level - it's the filesystem that knows about the data, what/where it is, and whether it's in use or not (so, faster scrubs - only need to check blocks in use rather than all blocks). but that's partly because ZFS blends the block layer and the fs layer in a way that seems unusual if you're used to ext4 or xfs or pretty much anything else except btrfs. see below for more on this topic.
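the scrub itself is just a couple of commands - the pool name "export" here is only an example:

    zpool scrub export      # walk every allocated block, verify its checksum,
                            # repair from the redundant copy/parity if it's bad
    zpool status export     # shows scrub progress and any checksum errors found
    zpool status -v export  # -v also lists files affected by unrecoverable errors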
> I understand you can run those things on top of a filesystem also, but if you are doing this just to get the benefit of error correction then I think you might be doing it wrong.
Error correction is a big benefit, but it's not the only one. the 2nd major benefit is snapshots (fast and lightweight because ZFS is a copy-on-write or COW fs, so a snapshot is little more than just keeping a copy of the block list at the time of the snapshot, and not deleting/re-using those blocks while any snapshot references them).
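in day-to-day use a snapshot is a one-liner (the dataset and snapshot names below are just examples):

    zfs snapshot export/www@before-upgrade         # instant, takes ~no space initially
    zfs list -t snapshot                           # list snapshots and how much each holds
    ls /export/www/.zfs/snapshot/before-upgrade/   # browse the old version read-only
    zfs rollback export/www@before-upgrade         # put the dataset back as it was
    zfs destroy export/www@before-upgrade          # drop it when no longer needed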
> Actually when I was checking over this email before hitting send it occurred to me that maybe I'm wrong about this, knowing next to nothing about ZFS as I do. Is a zpool virtual device like an LVM lv, and I can use it for things other than running ZFS filesystems on?
yes, ZFS is like a combination of mdadm, lvm, and a filesystem. a zpool is like an lvm volume group. e.g. you might allocate 4 drives to a raid-z array and call that pool "export". unlike LVM you don't have to pre-allocate fixed chunks of the volume to particular uses (e.g. filesystems or logical drives/partitions), you can dynamically change the "allocation" as needed.

it's also like a filesystem in that you can mount that pool directly as, say, /export (or wherever you want) and read and write data to it. you can also create subvolumes (e.g. export/www) and mount them too. each subvolume inherits attributes (quota, compression, de-duping, and lots more) from the parent, or can have individual attributes different from the parent. each subvolume can also have subvolumes (e.g. export/www/site1, export/www/site2). each of these subvolumes is like a separate filesystem that shares in the total pool size, and each can be snapshotted individually. you can create new subvolumes aka filesystems on the fly as needed, or change them (e.g. change the quota from 10G to 20G, or enable compression, etc) or delete them.

you can also create a ZVOL, which is just like a zfs subvolume except that it appears to the system as a virtual disk - i.e. with a device node under /dev. typical use is for xen or kvm VM images, or even swap devices. as with subvolumes, they can have individual attributes like compression or de-duping, and they can also be resized if needed (resize the zvol on the zfs host, and then inside the VM run xfs_growfs or resize2fs so the filesystem recognises the extra capacity). ZVOLs can also be snapshotted just like subvolumes. they can also be exported as iscsi targets, so you can, e.g., easily serve disk images to your VM compute nodes.

in short: a subvolume is like a subdirectory or mount-point, while a ZVOL is like a disk image or partition (incl. an LVM logical volume).

BTW, some of what i've written above isn't strictly accurate....i've tried to translate ZFS concepts into terms that should be familiar to someone who has worked with mdadm and LVM. as an analogy, i've done reasonably well i think. a technological pedant would probably find much to complain about. i'm more interested in having what i write be understood than in having it perfectly correct.
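to make that concrete, here's roughly what all of the above looks like as commands - the pool, dataset, and device names are made up for illustration:

    zpool create export raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd
                                            # 4-disk raid-z pool, mounted at /export
    zfs create export/www                   # a "subvolume" (zfs calls it a dataset)
    zfs create export/www/site1
    zfs set quota=20G export/www/site1      # per-dataset attributes
    zfs set compression=on export/www       # inherited by site1, site2, ...
    zfs create -V 40G export/vm1-disk       # a ZVOL: appears as /dev/zvol/export/vm1-disk
    zfs set volsize=60G export/vm1-disk     # grow it later (then grow the fs inside the VM)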
> Despite my reservations mentioned above, ZFS is still on my (long) list of things to look into and learn about, more so given that you say it is now considered stable :)
it's definitely worth experimenting with on some spare hardware - but be warned, you will almost certainly want to convert appropriate production systems from mdadm+lvm to ZFS asap once you start playing with it.

i got hooked on the idea of what ZFS is doing by experimenting with btrfs. btrfs has a lot of similar ideas, but the implementation (aside from having different goals) is many years behind ZFS. I persevered with btrfs for a while because it was in the mainline kernel and didn't require any stuffing around installing third-party code (zfs) that would never get into the mainline kernel. i lost my btrfs array (fortunately only a /backup mount, so not irreplaceable) one too many times and switched to ZFS.

it is everything i ever wanted in a filesystem and volume manager - it replaces mdadm, lvm2, and the XFS and/or ext4 i was previously using. With the dkms module packages, it isn't even hard to install or use these days (add the debian wheezy zfs repo and apt-get install it).

craig

-- 
craig sanders <cas@taz.net.au>