
On 2013-04-09 02:40, James Harper wrote:
I have a server that had 4 x 1.5TB disks installed in a RAID5 configuration (except /boot is a 'RAID1' across all 4 disks). One of the disks failed recently and so was replaced with a 3TB disk,
I'd be very wary of running RAID5 on disks >2TB. Remember that, when you have a disk failure, in order to rebuild the array it needs to scan every sector of every remaining disk, then write to every sector of the replacement disk. Compare that number of read & write operations to the Mean Time Between Failures of each disk, and you're (statistically) starting to get close to the point where there's a significant risk of a second drive failing before the rebuild finishes. (For a given definition of "significant" risk.)

On 2013-04-09 02:40, James Harper wrote:
I have a server that had 4 x 1.5TB disks installed in a RAID5 configuration (except /boot is a 'RAID1' across all 4 disks). One of the disks failed recently and so was replaced with a 3TB disk,
I'd be very wary of running RAID5 on disks >2TB
Remember that, when you have a disk failure, in order to rebuild the array, it needs to scan every sector of every remaining disk, then write to every sector of the replacement disk.
Debian does a complete scan every month anyway. An HP RAID controller will basically be constantly (slowly) doing a background scan during periods of low use. And a full resync on my 4x3TB array only takes 6 hours, so the window is pretty small. And in this case it's the server holding the backups, so while losing it would be inconvenient, there are other copies of the data too.

Also, SMART monitoring helps catch pending failures before a hard read or write error occurs. The original replacement was done because of a SMART notification. The kernel logged a single SCSI timeout error sometime after that, but the RAID remained consistent, and the monthly surface scan ran after that (but before replacement of the disk) with no hard errors reported, even though SMART reported "failure within 24 hours". With a small number of exceptions this is consistent with my experience of failed disks.
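For reference, the monthly scan mentioned here comes from the mdadm package's checkarray cron job; the commands below are only a sketch, assuming a stock Debian mdadm install, with md0 as an example device name:

    # roughly what the Debian cron job does (see /etc/cron.d/mdadm)
    /usr/share/mdadm/checkarray --cron --all --idle --quiet

    # trigger and inspect a check on a single array by hand via sysfs
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt   # non-zero after a check means mismatched blocks were found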
Compare those number of read & write operations to the Mean Time Between Failures of each disk, and you're (statistically) starting to get close to the point there's a significant risk of a second drive failing prior to the rebuild finishing. (For a given definition of "significant" risk)
For mission-critical data I'd be going with RAID10 (or maybe RAID6 if I had a battery-backed write cache, but the performance is still pretty bad for any workload I would consider mission critical).

The MTBF for the disk is given as 1000000 hours, while most other disks I've seen are around the 750000/800000 hour mark. This server runs in a cupboard in a factory though and runs significantly hotter than room temperature and every time I work on it I end up covered in dirt, but this is the first disk failure after around 3 years of hard use. The disks in question claim a "35% improvement over standard desktop drives" with respect to MTBF, so the marketing hype says it's okay ;)

James
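As a concrete illustration of the RAID10 option (a sketch only; device names are examples, not a recommendation for this particular server):

    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
    mdadm --detail /dev/md0    # confirm layout (near=2 by default) and state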

On 10.04.13 04:48, James Harper wrote:
The MTBF for the disk is given as 1000000 hours, while most other disks I've seen are around the 750000/800000 hour mark. This server runs in a cupboard in a factory though and runs significantly hotter than room temperature and every time I work on it I end up covered in dirt, but this is the first disk failure after around 3 years of hard use.
There are not many hours in a year, so 3 years is little compared to 10^6 hours. (That 114 year MTBF is quite impressive.) It does though halve for each 10°C rise above the temperature at which the lifetime is predicted. (The vendor hasn't run his drives for 114 years for the prediction. Running at 80°C above nominal for 6 months is equivalent to 128 years of life.)

Erik

--
If there are two possible paths from A to B and one is twice as long as [the other], at the beginning, the ants [or] robots start using each path equally. "Because ants taking the shorter path travel faster, the amount of pheromone (or light) deposited on that path grows faster, so more ants use that path." - http://www.bbc.co.uk/news/21956795

Erik Christiansen wrote:
On 10.04.13 04:48, James Harper wrote:
The MTBF for the disk is given as 1000000 hours, while most other ...........snip
There are not many hours in a year, so 3 years is little compared to 10^6 hours. (That 114 year MTBF is quite impressive.) It does though halve for each 10°C rise above the temperature at which the lifetime is predicted.
Would the converse hold true, i.e. doubling for each 10°C fall below the temperature at which the lifetime is predicted? and
"Running at 80°C above nominal for 6 months is equivalent to 128 years of life.)" would imply the test was conducted at 0°C; is this standard or just for example?

regards,
Rohan McLeod

On 11.04.13 08:06, Rohan McLeod wrote:
Erik Christiansen wrote:
On 10.04.13 04:48, James Harper wrote:
The MTBF for the disk is given as 1000000 hours, while most other ...........snip There are not many hours in a year, so 3 years is little compared to 10^6 hours. (That 114 year MTBF is quite impressive.) It does though halve for each 10°C rise above the temperature at which the lifetime is predicted.
Would the converse hold true, i.e. doubling for each 10°C fall below the temperature at which the lifetime is predicted?
Yes, if the equipment is rated to operate at those temperatures. The examples I gave are simply based on Arrhenius' law, which has long served for describing the failure rate of electronics, in addition to its use in chemistry. It's only an approximation, as mentioned here:

http://www.osti.gov/bridge/purl.cover.jsp?purl=/841248-BSrmuy/webviewable/84...

The last time I used the equation was when doing qualification testing for the LED clock in the XD Falcon. We ran about two dozen clocks at 80°C for a couple of months, then jacked it up to 100°C for some more months. I don't expect that it holds for the mechanical parts of the disk drive. The point I was making was that the hot cupboard housing the drive would reduce the equipment's lifetime in a significantly non-linear way, out of proportion to the temperature rise.

How low you can reduce the temperature, in expectation of a lifetime doubling, depends on the operating range of the equipment. Wet electrolytic capacitors would quickly set a limit, and commercial-grade semiconductor chips are only rated 0°C to 70°C, industrial -40°C to 85°C, and military -55°C to 125°C. In the case of the hard drive, the mechanicals and their lubrication would probably give up first.
and
"Running at 80°C above nominal for 6 months is equivalent to 128 years of life.)" would imply the test was conducted at 0°C; is this standard or just for example?
OK, with only commercial or industrial grade devices in the equipment, 80°C above nominal would be out of spec, since nominal is usually 25°C. It was just an example, showing how a century-long MTBF could be extrapolated from less than a year of testing, if device ratings permit.

Erik

--
I really didn't foresee the Internet. But then, neither did the computer industry. Not that that tells us very much, of course - the computer industry didn't even foresee that the century was going to end. - Douglas Adams
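The arithmetic behind that 128-year figure, as a quick sketch (using only the doubling-per-10°C rule of thumb discussed above, not the full Arrhenius equation):

    # acceleration factor for running 80°C above nominal, lifetime halving per 10°C
    awk 'BEGIN { af = 2^(80/10); print af "x acceleration"; print 0.5 * af " years equivalent from 6 months of testing" }'
    # prints: 256x acceleration
    #         128 years equivalent from 6 months of testing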

On Wed, Apr 10, 2013 at 04:48:17AM +0000, James Harper wrote:
On 2013-04-09 02:40, James Harper wrote:
I have a server that had 4 x 1.5TB disks installed in a RAID5 configuration (except /boot is a 'RAID1' across all 4 disks). One of the disks failed recently and so was replaced with a 3TB disk,
I'd be very wary of running RAID5 on disks >2TB
Remember that, when you have a disk failure, in order to rebuild the array, it needs to scan every sector of every remaining disk, then write to every sector of the replacement disk.
Debian does a complete scan every month anyway. A HP raid controller will basically be constantly (slowly) doing a background scan during periods of low use.
And a full resync on my 4x3TB array only takes 6 hours, so the window is pretty small.
with disks (and raid arrays) of that size, you also have to be concerned about data errors as well as disk failures - you're pretty much guaranteed to get some, either unrecoverable errors or, worse, silent corruption of the data.

this is why error-detecting and error-correcting filesystems like ZFS and btrfs exist - they're not just a good idea, they're essential with the large disk and storage array sizes common today. see, for example:

http://en.wikipedia.org/wiki/ZFS#Error_rates_in_harddisks

personally, i wouldn't use raid-5 (or raid-6) any more. I'd use ZFS RAID-Z (raid5 equiv) or RAID-Z2 (raid6 equiv. with 2 parity disks) instead.

actually, i wouldn't have used RAID-5 without a good hardware raid controller with non-volatile write cache - the performance sucks without that - but ZFS allows you to use an SSD as ZIL (ZFS Intent Log, or sync. write cache) and as read cache.

if performance was more important than capacity, I'd use RAID-1 or so-called raid-"10" or ZFS mirrored disks - a ZFS pool of mirrored pairs is similar to raid-10 but with all the extra benefits (error detection, volume management, snapshots, etc) of zfs.

ZFSonLinux just released version 0.6.1, which is the first release they're happy to say is ready for production use. i've been using prior versions for a year or two now(*) with no problems and just switched from my locally compiled packages to their release .debs (for amd64 wheezy, although they work fine with sid too).

http://zfsonlinux.org/debian.html

BTW, btrfs just got raid5/6 emulation support too...in a year or so (after the early-adopter guinea pigs have discovered the bugs), it could be worth considering that as an alternative. my own personal experience with btrfs raid1 & raid10 emulation was quite bad, but some people swear by it and lots of bugs have been fixed since i last used it. for large disks and large arrays, it's still a better choice than ext3/4 or xfs.

(*) i was using it at work on a file-server (main purpose was to be a target for rsync backups of other machines) but i switched jobs last year. AFAIK, it is still running fine. i also use it on two machines at home. one with two pools, one in active normal daily use (called "export" as a generic-but-still-useful mountpoint name) and the other called "backup" which takes zfs send backups from "export" and rsync backups from other machines on my home LAN. The other machine is my mythtv box which has a ZFS pool for the recordings - mostly for convenience if a disk dies and needs to be replaced. all three pools are under regular heavy use, without problems.

my inclination is to use ZFS anywhere I would otherwise be tempted to use mdadm and/or LVM....which is pretty much everywhere since i'm inclined to use mdadm RAID-1 even on desktop machines.

craig

--
craig sanders <cas@taz.net.au>
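To make the RAID-Z suggestion concrete, a minimal sketch (pool name and device names are examples; by-id device names are what you would normally feed it):

    zpool create export raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd   # raid5-like, single parity
    zpool status export
    zpool scrub export      # walk every allocated block, verify checksums, repair from parity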

On Wed, Apr 10, 2013 at 04:48:17AM +0000, James Harper wrote:
On 2013-04-09 02:40, James Harper wrote:
I have a server that had 4 x 1.5TB disks installed in a RAID5 configuration (except /boot is a 'RAID1' across all 4 disks). One of the disks failed recently and so was replaced with a 3TB disk,
I'd be very wary of running RAID5 on disks >2TB
Remember that, when you have a disk failure, in order to rebuild the array, it needs to scan every sector of every remaining disk, then write to every sector of the replacement disk.
Debian does a complete scan every month anyway. A HP raid controller will basically be constantly (slowly) doing a background scan during periods of low use.
And a full resync on my 4x3TB array only takes 6 hours, so the window is pretty small.
with disks (and raid arrays) of that size, you also have to be concerned about data errors as well as disk failures - you're pretty much guaranteed to get some, either unrecoverable errors or, worse, silent corruption of the data.
Guaranteed over what time period? It's easy to fault your logic as I just did a full scan of my array and it came up clean. If you say you are "guaranteed to get some" over, say, a 10 year period, then I guess that's fair enough. But as you don't specify a timeframe I can't really contest the point. I can say though that I do monitor the SMART values which do track corrected and uncorrected error rates, and by extrapolating those figures I can say with confidence that there is not a guarantee of unrecoverable errors.
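For what it's worth, the SMART figures being extrapolated here are the usual smartctl attributes; a sketch only (attribute names vary between vendors, and /dev/sda is an example):

    smartctl -H /dev/sda    # overall health self-assessment
    smartctl -A /dev/sda | egrep -i 'Raw_Read_Error|Reallocated|Current_Pending|Offline_Uncorrectable|CRC'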
this is why error-detecting and error-correcting filesystems like ZFS and btrfs exist - they're not just a good idea, they're essential with the large disk and storage array sizes common today.
see, for example:
The part that says "not visible to the host software" kind of bothers me. AFAICS these are reported via SMART and are entirely visible, with some exceptions of poor SMART implementations.
personally, i wouldn't use raid-5 (or raid-6) any more. I'd use ZFS RAID-Z (raid5 equiv) or RAID-Z2 (raid6 equiv. with 2 parity disks) instead.
Putting the error correction/detection in the filesystem bothers me. Putting it at the block device level would benefit a lot more infrastructure - LVM volumes for VMs, swap partitions, etc. I understand you can run those things on top of a filesystem also, but if you are doing this just to get the benefit of error correction then I think you might be doing it wrong.

Actually, when I was checking over this email before hitting send it occurred to me that maybe I'm wrong about this, knowing next to nothing about ZFS as I do. Is a zpool virtual device like an LVM LV, and can I use it for things other than running ZFS filesystems on?
actually, i wouldn't have used RAID-5 without a good hardware raid controller with non-volatile write cache - the performance sucks without that - but ZFS allows you to use an SSD as ZIL (ZFS Intent Log or sync. write cache) and as read cache.
For anything for which performance is a constraint I don't use RAID5 at all. This case is an exception in that it stores backup volumes from Bacula (i.e. streaming writes), and only needs to write as fast as data can come off the 1 Gbit/sec wire, so disk performance isn't an issue here: my array can easily handle 100 Mbytes/second of streaming writes, and backup compression means it never gets sent data that fast anyway.
if performance was more important than capacity, I'd use RAID-1 or so-called raid-"10" or ZFS mirrored disks - a ZFS pool of mirrored pairs is similar to raid-10 but with all the extra benefits (error detection, volume management, snapshots, etc) of zfs.
Yes I use RAID10 almost exclusively these days.
ZFSonLinux just released version 0.6.1, which is the first release they're happy to say is ready for production use. i've been using prior versions for a year or two now(*) with no problems and just switched from my locally compiled packages to their release .debs (for amd64 wheezy, although they work fine with sid too).
Despite my reservations mentioned above, ZFS is still on my (long) list of things to look into and learn about, more so given that you say it is now considered stable :)
BTW, btrfs just got raid5/6 emulation support too...in a year or so (after the early-adopter guinea pigs have discovered the bugs), it could be worth considering that as an alternative. my own personal experience with btrfs raid1 & raid10 emulation was quite bad, but some people swear by it and lots of bugs have been fixed since i last used it. for large disks and large arrays, it's still a better choice than ext3/4 or xfs.
As above, but I'll continue to let others find bugs :) James

On 2013-04-11 02:10, James Harper wrote: [...]
with disks (and raid arrays) of that size, you also have to be concerned about data errors as well as disk failures - you're pretty much guaranteed to get some, either unrecoverable errors or, worse, silent corruption of the data.
Guaranteed over what time period? It's easy to fault your logic as I just did a full scan of my array and it came up clean. If you say you are "guaranteed to get some" over, say, a 10 year period, then I guess that's fair enough. But as you don't specify a timeframe I can't really contest the point. [...]
With a pair of 2TB Western Digital SATA drives in my server, both in RAID 1:

| mattcen@adam:tmp$ zgrep -h 'mismatches found' /var/log/syslog* | sort -n
| 2012-02-05T22:05:03.118792+11:00 adam mdadm[1545]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
| 2012-03-04T17:00:12.084923+11:00 adam mdadm[1724]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 11008
| 2012-04-01T18:20:08.394369+10:00 adam mdadm[1724]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
| 2012-05-06T16:42:59.386193+10:00 adam mdadm[1724]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10240
| 2012-06-03T19:26:13.770869+10:00 adam mdadm[1559]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10112
| 2012-07-01T17:32:36.678284+10:00 adam mdadm[2500]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 8960
| 2012-08-05T17:40:38.175882+10:00 adam mdadm[1859]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 8960
| 2012-09-04T05:52:52.107219+10:00 adam mdadm[1859]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 8832
| 2012-11-04T17:00:11.945288+11:00 adam mdadm[2475]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 1152
| 2012-12-02T20:34:26.204077+11:00 adam mdadm[2475]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 640
| 2013-01-06T19:51:39.551018+11:00 adam mdadm[2475]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 640

With some of the systems we have at work, also in RAID 1:

| mattcen@logos:tmp$ sudo zgrep -h 'mismatches found' /var/log/syslog* | sort -n
| 2012-07-01T03:05:23.155717+10:00 theta mdadm[1307]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 1664
| 2012-07-02T15:39:28.372216+10:00 omega mdadm[4719]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 54528
| 2012-08-05T03:25:43.174057+10:00 theta mdadm[1307]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 1664
| 2012-08-06T19:11:30.325608+10:00 omega mdadm[4719]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 7552
| 2012-09-03T09:32:21.701674+10:00 omega mdadm[3691]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 7040
| 2012-09-15T22:29:27.704673+10:00 omega mdadm[3691]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 7040
| 2012-10-07T05:02:00.117419+11:00 theta mdadm[1307]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 512
| 2012-10-08T17:50:00.910978+11:00 omega mdadm[3691]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 6784
| 2012-11-04T03:40:25.763854+11:00 theta mdadm[1307]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 384
| 2012-11-05T20:41:37.945275+11:00 omega mdadm[3691]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 6528
| 2012-12-02T04:20:47.193559+11:00 theta mdadm[1307]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 2304
| 2012-12-03T23:18:57.661636+11:00 omega mdadm[3691]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 6400
| 2013-01-06T04:14:42.396266+11:00 theta mdadm[1307]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 2176
| 2013-02-03T05:00:35.431187+11:00 theta mdadm[1307]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 3328
| 2013-02-04T14:47:29.782264+11:00 omega mdadm[3300]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 3840
| 2013-03-03T04:34:35.005451+11:00 theta mdadm[1307]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 33152
| 2013-03-04T19:34:09.974437+11:00 omega mdadm[3300]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 5248
| 2013-04-07T03:55:32.146377+10:00 theta mdadm[1307]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 36992
| 2013-04-08T06:22:45.468885+10:00 omega mdadm[3300]: RebuildFinished event detected on md device /dev/md1, component device mismatches found: 72064

This is all on a monthly check schedule, so in the unlisted months everything came up clean.

--
Regards,
Matthew Cengia

On 2013-04-11 02:10, James Harper wrote: [...]
with disks (and raid arrays) of that size, you also have to be concerned about data errors as well as disk failures - you're pretty much guaranteed to get some, either unrecoverable errors or, worse, silent corruption of the data.
Guaranteed over what time period? It's easy to fault your logic as I just did a full scan of my array and it came up clean. If you say you are "guaranteed to get some" over, say, a 10 year period, then I guess that's fair enough. But as you don't specify a timeframe I can't really contest the point. [...]
With a pair of 2TB Western Digital SATA drives in my server, both in RAID 1:
| mattcen@adam:tmp$ zgrep -h 'mismatches found' /var/log/syslog* | sort -n
| 2012-02-05T22:05:03.118792+11:00 adam mdadm[1545]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
| 2012-03-04T17:00:12.084923+11:00 adam mdadm[1724]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 11008
...
Interesting. And somewhat alarming! What does smartctl -H and smartctl -a report for those drives? Does "11008" mismatches mean that 11008 bytes were found to be different, or that 11008 sectors were found to be different? In either case I would suggest to you that you have a serious problem with your servers and that this is not normal. I have many servers running linux md RAID1 and have never seen such a thing. James

On 2013-04-11 02:52, James Harper wrote:
Interesting. And somewhat alarming!
What does smartctl -H and smartctl -a report for those drives?
Does "11008" mismatches mean that 11008 bytes were found to be different, or that 11008 sectors were found to be different? In either case I would suggest to you that you have a serious problem with your servers and that this is not normal. I have many servers running linux md RAID1 and have never seen such a thing.
The output of 'smartctl -a' is a superset of 'smartctl -H'. Output of the former is attached. I'm not sure whether mdadm is measuring blocks or bytes. If this were just one server, I'd be concerned that something is more wrong than it should be, but given there are at least 3 servers, at 2 different sites, with different sorts of disks, and running different distros and software, I'm less worried. Do your systems do regular RAID checks to confirm there are no mismatches, as opposed to you just not knowing about them? -- Regards, Matthew Cengia

On Thu, 11 Apr 2013, James Harper <james.harper@bendigoit.com.au> wrote:
Does "11008" mismatches mean that 11008 bytes were found to be different, or that 11008 sectors were found to be different? In either case I would suggest to you that you have a serious problem with your servers and that this is not normal. I have many servers running linux md RAID1 and have never seen such a thing.
Linux Software RAID-1 seems to report large numbers of mismatches in a multiple of 64 when nothing appears to be wrong. It happens on all the systems I run.

On Thu, 11 Apr 2013, Craig Sanders <cas@taz.net.au> wrote:
On Thu, Apr 11, 2013 at 02:10:37AM +0000, James Harper wrote:
with disks (and raid arrays) of that size, you also have to be concerned about data errors as well as disk failures - you're pretty much guaranteed to get some, either unrecoverable errors or, worse, silent corruption of the data.
Guaranteed over what time period?
any time period. it's a function of the quantity of data, not of time.
So far I haven't seen a corruption reported by "btrfs scrub". Admittedly I have less than 3TB of data on BTRFS at the moment.
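For reference, a scrub of a mounted btrfs filesystem is just the following (the mountpoint is an example):

    btrfs scrub start /data     # runs in the background, reading and verifying all checksummed data
    btrfs scrub status /data    # progress, plus any checksum/read errors found so far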
If you say you are "guaranteed to get some" over, say, a 10 year period, then I guess that's fair enough. But as you don't specify a timeframe I can't really contest the point.
you seem to be confusing data corruption with MTBF or similar, it's not like that at all. it's not about disk hardware faults, it's about the sheer size of storage arrays these days making it a mathematical certainty that some corruption will occur - write errors due to, e.g., random bit-flips, controller brain-farts, firmware bugs, cosmic rays, and so on.
I've got a BTRFS filesystem that was corrupted by a RAM error (I discarded a DIMM after doing all the relevant Memtest86+ tests). Currently I have been unable to get btrfsck to work on it and make it usable again. But at least I know the data was corrupted which is better than having the system keep going and make things worse.
Putting the error correction/detection in the filesystem bothers me. Putting it at the block device level would benefit a lot more infrastructure - LVM volumes for VM's, swap partitions, etc.
having used ZFS for quite some time now, it makes perfect sense to me for it to be in the filesystem layer rather than at the block level - it's the file system that knows about the data, what/where it is, and whether it's in use or not (so, faster scrubs - only need to check blocks in use rather than all blocks).
http://etbe.coker.com.au/2012/04/27/btrfs-zfs-layering-violations/

There are real benefits to having separate layers; I've written about this at the above URL. But there are also significant benefits to doing things the way that BTRFS and ZFS do it, and it seems that no-one is interested in developing any other way of doing it (e.g. a version of Linux Software RAID that does something like RAID-Z).

Also, if you use ZVOLs then ZFS can be considered to be an LVM replacement with error checking (as Craig already noted).

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

Russell Coker <russell@coker.com.au> wrote:
I've got a BTRFS filesystem that was corrupted by a RAM error (I discarded a DIMM after doing all the relevant Memtest86+ tests). Currently I have been unable to get btrfsck to work on it and make it usable again. But at least I know the data was corrupted which is better than having the system keep going and make things worse.
Agreed. Is development effort being devoted to Btrfsck at the moment? I know it was long delayed and still not considered complete when released, but that was quite a while ago. I also know that online fsck is among their longer-term project objectives, since a full off-line scan and repair of today's high-capacity drives is considered unacceptably time-consuming in some environments.

On Fri, 12 Apr 2013, Jason White <jason@jasonjgw.net> wrote:
Is development effort being devoted to Btrfsck at the moment? I know it was long delayed and still not considered complete when released, but that was quite a while ago.
As long as there are filesystems which can't be repaired online they need to maintain such tools. Currently there are some errors which are almost theoretically impossible to repair online (e.g. certain types of log corruption that might make it impossible to mount the filesystem) and some errors which are practically impossible due to the kernel being buggy (my filesystem causes a kernel oops on mount).
I also know that online fsck is among their longer-term project objectives, since a full off-line scan and repair of today's high-capacity drives is considered unacceptably time-consuming in some environments.
The aim is to have routine errors fixed online, the old thing of having a long fsck every certain number of days or certain number of boots has to go. But there will always be situations where a filesystem can't be fixed online. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

Russell Coker writes:
[is btrfsck ready?] The aim is to have routine errors fixed online, the old thing of having a long fsck every certain number of days or certain number of boots has to go. But there will always be situations where a filesystem can't be fixed online.
There is, at least, a tool to pull data out of a btrfs given the number of the subvol (or tree?). I used this to recover /home subvol after my / subvol mysteriously went tits-up under 2.6.38, and btrfs refused to mount the filesystem at all (regardless of -o subvol &c options). So y'know, that wasn't a btrfsck, but it did what I needed.

The aim is to have routine errors fixed online, the old thing of having a long fsck every certain number of days or certain number of boots has to go. But there will always be situations where a filesystem can't be fixed online.
What is a "routine error"? Is this an error caused by the underlying disk corruption that we have been discussing, or is that already implemented as correct-on-read? James

On Fri, 12 Apr 2013, James Harper <james.harper@bendigoit.com.au> wrote:
The aim is to have routine errors fixed online, the old thing of having a long fsck every certain number of days or certain number of boots has to go. But there will always be situations where a filesystem can't be fixed online.
What is a "routine error"? Is this an error caused by the underlying disk corruption that we have been discussing, or is that already implemented as correct-on-read?
Routine errors are random small corruption of data and metadata. These are fixed on read if discovered, or on a scrub otherwise. Similar corruption to superblocks and other important data structures may not be "routine" in that it can prevent mounting.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Thu, Apr 11, 2013 at 11:43:12PM +1000, Russell Coker wrote:
I've got a BTRFS filesystem that was corrupted by a RAM error (I discarded a DIMM after doing all the relevant Memtest86+ tests). Currently I have been unable to get btrfsck to work on it and make it usable again. But at least I know the data was corrupted which is better than having the system keep going and make things worse.
yeah, well, btrfs is currently buggy. it's the main reason I use zfs instead of btrfs (if it was just the incomplete feature set compared to zfs, i probably wouldn't have bothered). i have no doubt that btrfs will eventually get to a safely usable state, and I hear that it's getting close....but i'm already committed to ZFS on my current machines/drives. i've read that the bugs which caused me to abandon btrfs and switch to zfs have been fixed, but i just don't have any compelling reason to go back right now.
Putting the error correction/detection in the filesystem bothers me. Putting it at the block device level would benefit a lot more infrastructure - LVM volumes for VM's, swap partitions, etc.
having used ZFS for quite some time now, it makes perfect sense to me for it to be in the filesystem layer rather than at the block level - it's the file system that knows about the data, what/where it is, and whether it's in use or not (so, faster scrubs - only need to check blocks in use rather than all blocks).
http://etbe.coker.com.au/2012/04/27/btrfs-zfs-layering-violations/
excellent post. thanks for the reminder about it.
There are real benefits to having separate layers, I've written about this at the above URL.
yep, there are. personally, i think that the practical advantages of integrating the layers (as btrfs and zfs do) more than outweigh the disadvantages.

in particular, the reason why RAID-Z is so much better than mdadm RAID (which is, in turn, IMO much better than most hardware RAID) is that the "raid" knows about the filesystem and the data, allowing ZFS to fix data corruption as it discovers it (you lose this ability of ZFS if you give it a raid array to work with rather than JBOD).

there's also the usability benefit of the btrfs and zfs tools - using them is far simpler and far less hassle than using mdadm and lvm. for many people, this will be reason enough in itself to use btrfs or zfs, as the complexity of mdadm and LVM is a significant barrier to entry.
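To illustrate that usability point with a rough side-by-side (a sketch only; device, VG and pool names are examples, and the two setups aren't exactly equivalent):

    # mdadm + LVM + filesystem, step by step
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    pvcreate /dev/md0
    vgcreate vg0 /dev/md0
    lvcreate -L 100G -n home vg0
    mkfs.ext4 /dev/vg0/home
    mount /dev/vg0/home /home

    # roughly the same end result with ZFS
    zpool create tank mirror /dev/sda /dev/sdb
    zfs create -o mountpoint=/home -o quota=100G tank/home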
But there are also significant benefits to doing things in the way that BTRFS and ZFS do it and it seems that no-one is interested in developing any other way of doing it (EG a version of Linux Software RAID that does something like RAID-Z).
probably because development effort in that direction is going into btrfs and zfs, and it's hard to see any good reason to re-implement parts of zfs or btrfs in mdadm - it would be just a tick-a-box-feature without the practical benefits offered by btrfs and zfs.

IMO with btrfs gaining raid5/6-like support, there'll be even less reason to use mdadm (once the initial bugs have been shaken out), even for people who don't want to use out-of-tree code like zfsonlinux.

My guess is that within a few years btrfs will be the mainstream default choice (possibly with ZFS being the second most common option), and technologies like mdadm, LVM and "old-fashioned" filesystems like ext2/3/4 and XFS etc will be considered obsolete, existing mostly on legacy systems (and on VMs running on block devices exported from zfs or btrfs servers). even laptops with single small drives will commonly use btrfs because of its snapshotting and btrfs send/receive for backups (same concept as zfs send/receive).

both btrfs and zfs offer enough really compelling advantages over older filesystems that I see this as inevitable (and a Good Thing).
Also if you use ZVOLs then ZFS can be considered to be a LVM replacement with error checking (as Craig already noted).
(and, finally, we get to the bit that motivated me to reply)

it can also be considered an LVM replacement even if you don't use ZVOLs. while there are other uses for them, ZVOLs are mostly of interest to people who run kvm or xen or similar, or who want to use iscsi rather than NFS or Samba to export a chunk of storage space for use by other systems.

One of the common uses for LVM is to divide up a volume group (VG) into logical volumes (LVs) to be formatted and mounted as particular directories - e.g. one for /, /home, /var, /usr or whatever. With LVM you have to decide in advance how much of each VG is going to be dedicated to each LV you create. LVs *can* be resized and (depending on the filesystem it's formatted with) the fs can be grown to match the new size (e.g. with xfs_growfs or resize2fs), but the procedure is moderately complicated and can't be done while the fs is mounted and in use. Practically, you can increase the size of an LV, but shrinking it is best done by backup, delete the LV, recreate the LV and restore.

With ZFS, the analogous concept is a sub-volume or filesystem. You can create and change a filesystem at any time, and you can resize it while it is in use (including shrinking it to any size >= currently used space). In fact, you don't even have to set a quota or a reservation on it if you don't want to - its size will be limited by the total size of the pool (as shared with all other sub-volumes).

(FYI a quota sets a limit on the filesystem's maximum size but does not reserve space for that fs. a reservation guarantees that space in the pool WILL be available / reserved for that filesystem: http://docs.oracle.com/cd/E23823_01/html/819-5461/gazvb.html)

e.g. if i have a zpool called "tank" and want to create a filesystem (aka sub-volume) to be mounted as /home with a quota of 100G and compression enabled:

    zfs create tank/home
    zfs set quota=100G tank/home
    zfs set compression=on tank/home
    zfs set mountpoint=/home tank/home

if i start running out of space in my 100G /home, it is trivial to change the quota:

    zfs set quota=200G tank/home

i don't need to unmount it (a PITA if i have open files on it, as is extremely likely with /home) or run xfs_growfs on it or do anything else. from memory, it's just as easy to do the same thing with btrfs.

similarly, if i've reserved way too much space for e.g. /var and urgently need more space in /home, i can shrink /var's reservation and increase /home's quota.

back in the bad old days of small disks, allocating too much space for one partition and not enough for another used to be extremely common, and solving it involved time-consuming and tedious amounts of downtime with file-system juggling (backup, repartition, format, restore)....which is pretty much why the idea of "one big root filesystem" took over from the idea of lots of separate small partitions for /, /home, /var, /usr, and so on. btrfs and zfs give us back the benefits of separating filesystems like that but without the drawbacks (LVM did too, but it was much more difficult to use, so most people didn't unless they had a good reason to).

BTW, you can also use zfs sub-volumes for container-style virtualisation (e.g. Solaris Containers, or FreeBSD Jails, or OpenVZ on Linux, and the like), and this apparently works quite well to save disk space with de-duping if you have hundreds of very similar VMs (with the caveat that de-duping takes shitloads of RAM, and disk space is much cheaper than RAM. OTOH de-duping can offer significant performance benefits due to disk caching of the duped blocks/files).

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #431:
Borg implants are failing
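A sketch of the sort of layout described above for de-duped VM images (pool and dataset names are examples, and dedup's RAM appetite applies):

    zfs create tank/vmimages
    zfs set dedup=on tank/vmimages              # dedup table lives in ARC/L2ARC - needs plenty of RAM
    zfs set compression=on tank/vmimages
    zfs create -V 20G tank/vmimages/guest0      # a ZVOL; appears as /dev/zvol/tank/vmimages/guest0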

in particular, the reason why RAID-Z is so much better than mdadm RAID (which is, in turn, IMO much better than most hardware RAID)
Disagree about the hardware raid comment. You say "most hardware RAID", but if you consider the set of hardware RAID implementations that you would actually use on a server, mdadm is pretty feature-poor. In particular, the advantages of hardware RAID are:

. Battery backed write cache. Bcache/flashcache offer this but they have their shortcomings, in particular that most available cache modules are still on top of the SATA channel.
. Online resize/reconfigure
. BIOS boot support (see recent thread "RAID, again" by me)

A Linux-based BIOS would make some of this better though!

If you are saying mdadm is better than "most hardware RAID" where "hardware RAID" is the set of all possible hardware RAIDs, including the really crap ones that you wouldn't even consider using on a workstation, then I guess I agree, but it's not really a fair comparison ;)

Does ZFS have any native support for battery- or flash-backed write cache? With mdadm I can run bcache on top of it, then LVM on top of that, but with ZFS's tight integration of everything I'm not sure that would be possible and I'd have to run bcache on top of the component disks or the individual zvols, maybe (I'm probably mucking up the ZFS terminology here but I hope you know what I mean).

James

On Fri, 12 Apr 2013, James Harper <james.harper@bendigoit.com.au> wrote:
Disagree about the hardware raid comment. You say "most hardware RAID", but if you consider the set of hardware RAID implementations that you would actually use on a server, mdadm is pretty feature poor. In particular the advantages of hardware RAID are:
Also, Linux Software RAID doesn't have some of the problems that some hardware RAID has. For example, on an HP DL-360 server some years ago I had terrible performance on SATA disks and had to upgrade to SAS. Linux Software RAID gives decent performance on SATA.
Does ZFS have any native support for battery or flash backed write cache?
It has the ZIL and L2ARC, which can both be on any type of fast storage; an NVRAM device would do. Recent versions of ZFS won't be totally broken if the ZIL is corrupted, so that's a safe thing to do on NVRAM.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Fri, Apr 12, 2013 at 03:12:23AM +0000, James Harper wrote:
in particular, the reason why RAID-Z is so much better than mdadm RAID (which is, in turn, IMO much better than most hardware RAID)
Disagree about the hardware raid comment. You say "most hardware RAID", but if you consider the set of hardware RAID implementations that you would actually use on a server, mdadm is pretty feature poor. In particular the advantages of hardware RAID are:
with the sole exception of non-volatile write cache for HW RAID5/6, i'm including them too. RAID5's write performance sucks without a good write cache (and it has to be non-volatile or battery backed for safety), and RAID6 is even worse. as you suggest bcache and flashcache seem to offer a way around this for mdadm but i've never used either of them - i was already using zfs by the time they became available. i don't think the SATA interface speed is a deal-breaker for them because the only way around that is spending huge amounts of money.
. Battery backed write cache. Bcache/flashcache offer this but they have their shortcomings, in particular that most available cache modules are still on top of the SATA channel.
this is the only real advantage of hardware raid over mdadm. IMO, ZFS's ability to use an SSD or other fast block device as cache completely eliminates this last remaining superiority of hardware raid over software raid.

i personally don't see any technical reason to choose hardware raid over ZFS (although CYA managerial reasons and specifications written by technologically-illiterate buffoons who have picked up some cool buzzwords like "RAID" will often override technical best practice).
. Online resize/reconfigure
both btrfs and zfs offer this.
. BIOS boot support (see recent thread "RAID, again" by me)
this is a misfeature of a crappy BIOS rather than a fault with software raid. any decent BIOS not only has the ability to choose which disk to boot from (rather than hard-code it to only boot from whichever disk is plugged into the first disk port) but will also let you specify a boot order so that it will try disk 1 followed by disk 2 and then disk 3 or whatever. they'll also typically let you press F2 or F12 or whatever at boot time to pop up a boot device selection menu. even server motherboards like supermicro let you choose the boot device and have a boot menu option accessible over IPMI.
Does ZFS have any native support for battery or flash backed write cache? With mdadm I can run bcache on top of it, then lvm on top of that, but with ZFS's tight integration of everything I'm not sure that would be possible and I'd have to run bcache on top of the component disks or the individual zvols, maybe (I'm probably mucking up the zfs terminology here but I hope you know what I mean).
yep. ZFS has built-in support for both "log" devices AKA synchronous write cache (ZIL or ZFS Intent Log) and "cache" or read-cache devices (aka L2ARC, "2nd-level Adaptive Replacement Cache", which is block-device based). ZIL is kind of like a write-back cache for synchronous writes, and is what makes RAID-Z performance not suck like software RAID5 does.

(L2ARC is in addition to the RAM-based ARC, which is pretty much like your common garden-variety linux disk caching. ARC and L2ARC are used for read caching of frequently/recently accessed data as well as to support de-duplication by keeping block hashes in ARC or L2ARC.)

Anyway, ZFS can use any block device(s) for "log" or "cache", including a disk or partition on SATA or SAS or whatever interface. also including faster devices such as PCI-e SSDs like the Fusion-IO and PCI-e battery-backed RAM disks. they're out of my personal budget range so i haven't bothered checking on availability for x86/etc systems, but I know that in the Sun world there were extremely expensive specialised cache devices sold specifically for use with ZFS. In the x86/pc world you'd just use a generic super-fast block device like the Fusion-IO, I guess. if you could afford it.

dunno what they cost now but i looked into the Fusion-IO PCI-e cards for someone at work a year or two ago (they needed an extremely fast device to write massive amounts of data as it was captured by a scientific instrument camera). IIRC it was about $5K for a 1TB drive. which isn't bad considering what it was, but was out of the budget for that researcher. instead, i cobbled together a system with loads of RAM for a large ramdisk....clumsy, and they had to manually copy data onto more permanent storage later, but it was fast enough to keep up with the camera.

hmmm, seems to be $1995 for 420GB (read speed of 1.4GB/s, write speed 700MB/s - their higher end models get 1.1GB/s write speed) at the moment according to these articles:

http://www.techworld.com.au/article/458387/fusion-io_releases_1_6tb_flash_ca...
http://www.computerworld.com/s/article/9226103/Fusion_io_releases_cheaper_io...

i expect that bcache and flashcache could use high-speed devices like these too. for those on a tighter budget, a current SATA SSD offers about 500-550MB/s read and slightly lower write speeds, at around $1/GB. from my own experience, even a small 4GB partition on a SATA SSD for the ZIL makes a massive performance difference. and a larger cache partition also helps.

BTW, the principal author of btrfs (Chris Mason, IIRC) left Oracle last year and is now working at Fusion IO. He's still working on btrfs.

craig

--
craig sanders <cas@taz.net.au>
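For reference, attaching those ZIL and L2ARC partitions is a one-liner each (the pool name and partition paths are examples):

    zpool add tank log /dev/disk/by-id/ata-SOME-SSD-part1     # small partition as ZIL / sync write log
    zpool add tank cache /dev/disk/by-id/ata-SOME-SSD-part2   # larger partition as L2ARC read cache
    zpool iostat -v tank                                      # shows the log and cache vdevs and their activity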

Hi,

On 11/04/2013 12:28 PM, Matthew Cengia wrote:
With a pair of 2TB Western Digital SATA drives in my server, both in RAID 1:
| mattcen@adam:tmp$ zgrep -h 'mismatches found' /var/log/syslog* | sort -n
| 2012-02-05T22:05:03.118792+11:00 adam mdadm[1545]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
| 2012-03-04T17:00:12.084923+11:00 adam mdadm[1724]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 11008
| 2012-04-01T18:20:08.394369+10:00 adam mdadm[1724]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
| 2012-05-06T16:42:59.386193+10:00 adam mdadm[1724]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10240
I would lay odds (and I'm not a betting man) that /dev/md/1 is your swap file, this is perfectly normal in this case. I've seen this before, my solution to stop seeing this is to have swap on its own RAID, and LVM volumes on different RAID devices.

Cheers
AndrewM

On 2013-04-13 00:42, Andrew McGlashan wrote:
Hi,
On 11/04/2013 12:28 PM, Matthew Cengia wrote:
With a pair of 2TB Western Digital SATA drives in my server, both in RAID 1:
| mattcen@adam:tmp$ zgrep -h 'mismatches found' /var/log/syslog* | sort -n
| 2012-02-05T22:05:03.118792+11:00 adam mdadm[1545]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
| 2012-03-04T17:00:12.084923+11:00 adam mdadm[1724]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 11008
| 2012-04-01T18:20:08.394369+10:00 adam mdadm[1724]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
| 2012-05-06T16:42:59.386193+10:00 adam mdadm[1724]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10240
I would lay odds (and I'm not a betting man) that /dev/md/1 is your swap file, this is perfectly normal in this case.
I've seen this before, my solution to stop seeing this is to have swap on its own RAID, and LVM volumes on different RAID devices.
Andrew, You're exactly right; I'd forgotten this particular data point; all 3 systems from which I pulled these mismatch messages have swap space inside the LVM PV living on /dev/md/1, which completely explains these messages. -- Regards, Matthew Cengia

Matthew Cengia <mattcen@gmail.com> writes:
| 2012-02-05T22:05:03.118792+11:00 adam mdadm[1545]: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
I would lay odds (and I'm not a betting man) that /dev/md/1 is your swap
You're exactly right; I'd forgotten this particular data point; all 3 systems from which I pulled these mismatch messages have swap space inside the LVM PV living on /dev/md/1, which completely explains these messages.
Matt, IIRC we checked for that and found it happened even without swap. You (ICBF) would have to pull the ticket out of alloc to check.

On Thu, Apr 11, 2013 at 02:10:37AM +0000, James Harper wrote:
with disks (and raid arrays) of that size, you also have to be concerned about data errors as well as disk failures - you're pretty much guaranteed to get some, either unrecoverable errors or, worse, silent corruption of the data.
Guaranteed over what time period?
any time period. it's a function of the quantity of data, not of time.
It's easy to fault your logic as I just did a full scan of my array and it came up clean.
no, it's not. your array scan checks for DISK errors. It does not check for data corruption - THAT is the huge advantage of filesystems like ZFS and btrfs: they can detect and correct data errors.
If you say you are "guaranteed to get some" over, say, a 10 year period, then I guess that's fair enough. But as you don't specify a timeframe I can't really contest the point.
you seem to be confusing data corruption with MTBF or similar, it's not like that at all. it's not about disk hardware faults, it's about the sheer size of storage arrays these days making it a mathematical certainty that some corruption will occur - write errors due to, e.g., random bit-flips, controller brain-farts, firmware bugs, cosmic rays, and so on.

e.g. a typical quoted rating of 1 error per 10^14 bits is one error per 12 terabytes - i.e. your four x 3TB array is guaranteed to have at least one error in the data.

one error in 10^14 bits is nothing to worry about with 500GB drives. it's starting to get worrisome with 1 and 2TB drives. It's a guaranteed error with 10+TB arrays....and even a single 3 or 4TB drive has roughly a 30-50% chance of having at least one data error.
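The arithmetic behind that 12TB figure, as a quick sketch (the 10^14 rating is an order-of-magnitude vendor number, so treat the result the same way):

    awk 'BEGIN {
        bits_per_error = 1e14
        printf "one error per %.1f TB read\n", bits_per_error / 8 / 1e12
        array_bits = 4 * 3e12 * 8                 # reading a 4 x 3TB array end to end
        printf "expected errors per full read: %.2f\n", array_bits / bits_per_error
    }'
    # prints: one error per 12.5 TB read
    #         expected errors per full read: 0.96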
I can say though that I do monitor the SMART values which do track corrected and uncorrected error rates, and by extrapolating those figures I can say with confidence that there is not a guarantee of unrecoverable errors.
smart values really only tell you about detected errors in the drive itself. they don't tell you *anything* about data corruption problems - for that, you actually need to check the data...and to check the data you need a redundant copy or copies AND a hash of what it's supposed to be. with mdadm, such errors can only be corrected if the data can be rewritten to the same sector or if the drive can remap a spare sector to that spot. with zfs, because it's a COW filesystem all that needs to be done is to rewrite the data.
The part that says "not visible to the host software" kind of bothers me.
yes, that's why it's a problem, and that's why a filesystem that keeps both multiple copies (mirroring or raid5/6-like) AND a hash of each block is essential for detecting and correcting errors in the data.
AFAICS these are reported via SMART and are entirely visible, with some exceptions of poor SMART implementations.
no. SMART detects disk faults, not data corruption.
personally, i wouldn't use raid-5 (or raid-6) any more. I'd use ZFS RAID-Z (raid5 equiv) or RAID-Z2 (raid6 equiv. with 2 parity disks) instead.
Putting the error correction/detection in the filesystem bothers me. Putting it at the block device level would benefit a lot more infrastructure - LVM volumes for VM's, swap partitions, etc.
having used ZFS for quite some time now, it makes perfect sense to me for it to be in the filesystem layer rather than at the block level - it's the file system that knows about the data, what/where it is, and whether it's in use or not (so, faster scrubs - only need to check blocks in use rather than all blocks).

but that's partly because ZFS blends the block layer and the fs layer in a way that seems unusual if you're used to ext4 or xfs or pretty much anything else except btrfs. see below for more on this topic.
I understand you can run those things on top of a filesystem also, but if you are doing this just to get the benefit of error correction then I think you might be doing it wrong.
Error correction is a big benefit, but it's not the only one. the 2nd major benefit is snapshots (fast and lightweight because ZFS is a copy-on-write or COW fs, so a snapshot is little more than just keeping a copy of the block list at the time of the snapshot, and not deleting/re-using those blocks while any snapshot references them).
Actually when I was checking over this email before hitting send it occurred to me that maybe I'm wrong about this, knowing next to nothing about ZFS as I do. Is a zpool virtual device like an LVM lv, and I can use it for things other than running ZFS filesystems on?
yes, ZFS is like a combination of mdadm, lvm, and a filesystem.

a zpool is like an lvm volume group. e.g. you might allocate 4 drives to a raid-z array and call that pool "export". unlike LVM you don't have to pre-allocate fixed chunks of the volume to particular uses (e.g. filesystems or logical drives/partitions), you can dynamically change the "allocation" as needed.

it's also like a filesystem in that you can mount that pool directly as, say, /export (or wherever you want) and read and write data to it. you can also create subvolumes (e.g. export/www) and mount them too. each subvolume inherits attributes (quota, compression, de-duping, and lots more) from the parent or can have individual attributes different from the parent. each subvolume can also have subvolumes (e.g. export/www/site1, export/www/site2).

each of these subvolumes is like a separate filesystem that shares in the total pool size, and each can be snapshotted individually. you can create new subvolumes aka filesystems on the fly as needed, or change them (e.g. change the quota from 10G to 20G or enable compression etc) or delete them.

you can also create a ZVOL, which is just like a zfs subvolume except that it appears to the system as a virtual disk - i.e. with a device node under /dev. typical use is for xen or kvm VM images. or even swap devices. as with subvolumes, they can have individual attributes like compression or de-duping, and they can also be resized if needed (resize the zvol on the zfs host, and then inside the VM you need to run xfs resize or ext4 resize so that it recognises the extra capacity). ZVOLs can also be snapshotted just like subvolumes. they can also be exported as iscsi targets, so you can, e.g., easily serve disk images to your VM compute nodes.

in short: a subvolume is like a subdirectory or mount-point, while a ZVOL is like a disk image or partition (incl. an LVM partition).

BTW, some of what i've written above isn't strictly accurate....i've tried to translate ZFS concepts into terms that should be familiar to someone who has worked with mdadm and LVM. as an analogy, i've done reasonably well i think. a technological pedant would probably find much to complain about. i'm more interested in having what i write be understood than in having it perfectly correct.
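A small sketch of the ZVOL side of that, since it answers the original question about using ZFS for things other than ZFS filesystems (pool and zvol names are examples):

    zfs create -V 20G tank/vm1            # a ZVOL: appears as /dev/zvol/tank/vm1 (and a /dev/zdN node)
    zfs set volsize=40G tank/vm1          # grow it later, then grow the filesystem inside the guest
    zfs snapshot tank/vm1@pre-upgrade     # snapshot it like any other dataset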
Despite my reservations mentioned above, ZFS is still on my (long) list of things to look into and learn about, more so given that you say it is now considered stable :)
it's definitely worth experimenting with on some spare hardware - but be warned, you will almost certainly want to convert appropriate production systems from mdadm+lvm to ZFS asap once you start playing with it.

i got hooked on the idea of what ZFS is doing by experimenting with btrfs. btrfs has a lot of similar ideas, but the implementation (aside from having different goals) is many years behind ZFS. I persevered with btrfs for a while because it was in the mainline kernel and didn't require any stuffing around installing third-party code (zfs) that would never get into the mainline kernel. i lost my btrfs array (fortunately only a /backup mount, so not irreplaceable) one too many times and switched to ZFS.

it is everything i ever wanted in a filesystem and volume management - it replaces mdadm, lvm2, and the XFS and/or ext4 i was previously using. With the dkms module packages, it isn't even hard to install or use these days (add the debian wheezy zfs repo and apt-get install it).

craig

--
craig sanders <cas@taz.net.au>

On Thu, Apr 11, 2013 at 02:10:37AM +0000, James Harper wrote:
with disks (and raid arrays) of that size, you also have to be concerned about data errors as well as disk failures - you're pretty much guaranteed to get some, either unrecoverable errors or, worse, silent corruption of the data.
Guaranteed over what time period?
any time period. it's a function of the quantity of data, not of time.
It's easy to fault your logic as I just did a full scan of my array and it came up clean.
no, it's not. your array scan checks for DISK errors. It does not check for data corruption - THAT is the huge advantage of filesystems like ZFS and btrfs: they can detect and correct data errors.
This is the md 'check' function that compares the two copies of the data together. If there was corruption in my RAID1 then it's incredibly unlikely that this corruption would have occurred on both disks and register as a match, at least from a disk based corruption issue.
If you say you are "guaranteed to get some" over, say, a 10 year period, then I guess that's fair enough. But as you don't specify a timeframe I can't really contest the point.
you seem to be confusing data corruption with MTBF or similar, it's not like that at all. it's not about disk hardware faults, it's about the sheer size of storage arrays these days making it a mathematical certainty that some corruption will occur - write errors due to, e.g., random bit-flips, controller brain-farts, firmware bugs, cosmic rays, and so on.
e.g. a typical quoted rating of 1 error per 10^14 bits is one error per 12 terabytes - i.e. your four x 3TB array is guaranteed to have at least one error in the data.
Not according to my visible history of parity checks of the underlying data (when it was 4 x 1.5TB - last 3TB disk still on order). I will be monitoring it more closely now though!
I can say though that I do monitor the SMART values which do track corrected and uncorrected error rates, and by extrapolating those figures I can say with confidence that there is not a guarantee of unrecoverable errors.
smart values really only tell you about detected errors in the drive itself. they don't tell you *anything* about data corruption problems - for that, you actually need to check the data...and to check the data you need a redundant copy or copies AND a hash of what it's supposed to be.
Not entirely true. It gives reports of correctable errors, first-read-uncorrectable errors that were corrected on re-read, etc. For an undetected disk read error to occur (e.g. one that still passed ECC or whatever correction codes are used), there would need to be significant quantities of the former, statistically speaking.

I wonder if the undetected error rates differ with the 4K-sector disks? That is supposed to be one of the other advantages. Of course that still doesn't detect errors that occur beyond the disk (e.g. PCI, controller or cabling), so I guess your point still stands.
with mdadm, such errors can only be corrected if the data can be rewritten to the same sector or if the drive can remap a spare sector to that spot. with zfs, because it's a COW filesystem all that needs to be done is to rewrite the data.
Correct. It can be detected though.
...
Thanks for taking the time to write out that stuff about ZFS. I'm somewhat wiser about it all now :) James

On Thu, 11 Apr 2013, James Harper <james.harper@bendigoit.com.au> wrote:
no, it's not. your array scan checks for DISK errors. It does not check for data corruption - THAT is the huge advantage of filesystems like ZFS and btrfs, they can detect and correct data errors
This is the md 'check' function that compares the two copies of the data together. If there was corruption in my RAID1 then it's incredibly unlikely that this corruption would have occurred on both disks and register as a match, at least from a disk based corruption issue.
With RAID-1 the check operation makes the second copy the same as the first in the case of discrepancy, even though the second might have contained the correct data. With Linux Software RAID-5/6 the parity is regenerated if it doesn't match the data - even though with RAID-6 it's possible to regenerate a single corrupted data sector to make the parity match. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

Russell Coker <russell@coker.com.au> writes:
With RAID-1 the check operation makes the second copy the same as the first in the case of discrepancy, even though the second might have contained the correct data.
Is it smarter for a three-way RAID1?

On Fri, 12 Apr 2013, "Trent W. Buck" <trentbuck@gmail.com> wrote:
Russell Coker <russell@coker.com.au> writes:
With RAID-1 the check operation makes the second copy the same as the first in the case of discrepancy, even though the second might have contained the correct data.
Is it smarter for a three-way RAID1?
No. It's not smart for any type of RAID. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

Craig Sanders <cas@taz.net.au> writes:
e.g. a typical quoted rating of 1 error per 10^14 bits is one error per 12 terabytes - i.e. your four x 3TB array is guaranteed to have at least one error in the data.
I think more formally you'd say something like "the probability of no data errors over 12TB is not statistically significant" or something. The way you're phrasing it seems prone to misinterpretation, like saying if 25% of the global population is Chinese, then if I have four kids one of them is GUARANTEED to be Chinese.
one error in 10^14 bits is nothing to worry about with 500GB drives. it's starting to get worrisome with 1 and 2TB drives. It's a guaranteed error with 10+TB arrays....and even a single 3 or 4TB drive has roughly a 30-50% chance of having at least one data error.
^ that reads better.

On Fri, Apr 12, 2013 at 12:20:12PM +1000, Trent W. Buck wrote:
Craig Sanders <cas@taz.net.au> writes:
e.g. a typical quoted rating of 1 error per 10^14 bits is one error per 12 terabytes - i.e. your four x 3TB array is guaranteed to have at least one error in the data.
I think more formally you'd say something like "the probability of no data errors over 12TB is not statistically significant" or something.
umm, yes. that's much better.
The way you're phrasing it seems prone to misinterpretation, like saying if 25% of the global population is Chinese, then if I have four kids one of them is GUARANTEED to be Chinese.
so? you got a problem with that? damn all racist maths-haters :)
one error in 10^14 bits is nothing to worry about with 500GB drives. it's starting to get worrisome with 1 and 2TB drives. It's a guaranteed error with 10+TB arrays....and even a single 3 or 4TB drive has roughly a 30-50% chance of having at least one data error.
^ that reads better.
true. craig -- craig sanders <cas@taz.net.au> BOFH excuse #246: It must have been the lightning storm we had (yesterday) (last week) (last month)
participants (10):
- Andrew McGlashan
- Craig Sanders
- Erik Christiansen
- Greg Bromage
- James Harper
- Jason White
- Matthew Cengia
- Rohan McLeod
- Russell Coker
- trentbuck@gmail.com