
Last night, after some updates on my media server, one of the disks in a mirror set failed a SMART check and was kicked out (fortunately I have good backups). It's failing on the bad sector count but I haven't done a deeper analysis yet.

However, it's time to bite the bullet and do some upgrade work. I'm intending to replace all of the disks with larger ones to increase the storage size and, when I do, use ZFS.

The box is based around a Gigabyte GA-880GM-USB3 mobo, with an AMD Phenom II X6 1055T CPU, buckets of RAM and the OS on a 240GB SSD.

So I'm looking for a strategy re the implementation of ZFS. I can install up to 4 SATA disks onto the mobo (5 in total with one slot used by the SSD).

-- Colin Fee tfeccles@gmail.com

On 12 July 2013 12:06, Colin Fee <tfeccles@gmail.com> wrote:
Last night after some updates on my media server, one of the disks in a mirror set failed a SMART check and was kicked out (fortunately I have good backups). It's failing on the bad sector count but I haven't done a deeper analysis yet.
However it's time to bite the bullet and do some upgrade work. I'm intending to replace all of the disks with larger ones to increase the storage size and when I do, use ZFS.
The box is based around a Gigabyte GA-880GM-USB3 mobo, with an AMD Phenom II X6 1055T cpu and buckets of RAM and the OS on a 240Gb SSD.
So I'm looking for a strategy re the implementation of ZFS. I can install up to 4 SATA disks onto the mobo (5 in total with one slot used by the SSD)
I should add to that I've begun reading through necessary literature and sites, and a colleague at work who uses ZFS extensively on his home Mac server stuff has given me some intro videos to watch. I guess what I'm asking is for pointers to the gotchas that most people overlook or those gems that people have gleaned from experience. -- Colin Fee tfeccles@gmail.com

Make sure you have lots of RAM. If you are using raidz or raidz2 you will need to ensure your vdevs are designed right from the start, as they cannot be changed in the future. Pools are cool, snapshots are good. Check out btrfs.

On Fri, Jul 12, 2013 at 2:30 PM, Colin Fee <tfeccles@gmail.com> wrote:
On 12 July 2013 12:06, Colin Fee <tfeccles@gmail.com> wrote:
I guess what I'm asking is for pointers to the gotchas that most people overlook or those gems that people have gleaned from experience. -- Colin Fee tfeccles@gmail.com

Lots is certainly the wrong word now that I think about it. You need 4GB minimum for a reasonable ZFS server, plus more if you use dedupe.

On Fri, Jul 12, 2013 at 2:43 PM, Kevin <kevin@fuber.org> wrote:
Make sure you have lots of ram.
If you are using raidz or raidz2 you will need to ensure your zdevs are designed from the start as they cannot change in the future. Pools are cool snapshots are good
checkout btrfs.

I'd say don't. It's nowhere near ready for production yet. Don't plan on doing anything else with the server. zfsonlinux still is not integrated with the page cache and Linux memory management (due for 0.7 maybe, if we're lucky, but I see no one working on the code). It does fragment memory, and even if you limit it to half your RAM or even far smaller, it will still use more than you allow it to, and crash your machine after a couple of weeks of uptime (mind you, it's never lost data for me, and now that I have a hardware watchdog, I just put up with losing my NFS shares for 10s of minutes every few weeks as it detects a softlockup and reboots).

It also goes on frequent go-slows, stopworks and strikes, before marching down Spring St (seriously, it goes into complete meltdown above about 92% disk usage, but even when not nearly full, it will frequently get bogged down so much that NFS autofs mounts time out before successfully mounting a share). rsync, hardlink and metadata-heavy workloads are particularly bad for it.

snapshots don't give you anything at all over LVM, and just introduce a different set of commands. zfs send/recv is seriously overrated compared to dd (the data is fragmented enough on read, because of COW, that incremental sends are frequently as slow as just sending the whole device in the first place). raidz is seriously inflexible compared to mdadm. It doesn't yet balance IO to different speed devices, but a patch has just been committed to HEAD (so it might make it out to 0.6.2 - but I haven't yet tested it and have serious doubts).

Think of zvols as lvm logical volumes. With different syntax (and invented by people who have never heard of LVM). You can't yet swap to zvols without having hard lockups every day or so (after dipping into about 200MB of a swap device), whereas I've never ever had swapping to lvm cause a problem (and we have that set up for hundreds of VMs here at $ORK).

In short, overrated. I wouldn't do it again if I didn't already have a seriously large (for the hardware) amount of files and hardlinks that take about a month to migrate over to a new filesystem. ext4 on bcache on lvm (or vice versa) on mdadm sounds like a much more viable way to go in the future. bcache isn't mature yet, but given the simplicity of it I don't think it will be long before it is. Separation of mature layers is good. Whereas zfs is a massive spaghetti of new code that hooks into too many layers and takes half an hour to build on my machine (but at least it's now integrated into dkms on debian).

On Fri, 12 Jul 2013, Kevin wrote:
Make sure you have lots of ram.
If you are using raidz or raidz2 you will need to ensure your zdevs are designed from the start as they cannot change in the future. Pools are cool snapshots are good
checkout btrfs.
-- Tim Connors

On Fri, Jul 12, 2013 at 12:06:20PM +1000, Colin Fee wrote:
The box is based around a Gigabyte GA-880GM-USB3 mobo, with an AMD Phenom II X6 1055T cpu and buckets of RAM and the OS on a 240Gb SSD.
more than adequate. my main zfs box here is a Phenom II x6 1090T with 16GB RAM and a pair of 240GB SSDs for OS/L2ARC/ZIL. it's also my desktop machine, internet gateway, firewall, virtualisation server, and a box for everything/anything I want to run or experiment with. so a 1055T dedicated to just a ZFS fileserver will have no problem coping with the load.
So I'm looking for a strategy re the implementation of ZFS.
0. zfsonlinux is pretty easy to work with, easy to learn and to use. i'd recommend playing around with it to get a feel for how it works and to experiment with some features before putting it into 'production' use - otherwise you may find later on that there was a more optimal way of doing what you want.

e.g. once you've got a pool set up, get into the habit of creating subvolumes on the pool for different purposes rather than just sub-directories. You can enable different quotas (soft quota or hard limit) on each volume, and have different attributes (e.g. compression makes sense for text documents, but not for already-compressed video files). my main pool is mounted as /export (a nice, traditional *generic* name), and i have /export/home, /export/src, /export/ftp, /export/www, /export/tftp and several other subvols on it. as well as zvols for VMs.

if performance is important to you then do some benchmarking on your own hardware with various configurations. Russell's bonnie++ is a great tool for this.

1. if your disks are new and have 4K sectors OR if you're likely to add 4K-sector drives in future, then create your pool with 'ashift=12'. 4K-sector drives, if not quite the current standard right now, are pretty close to it and will inevitably replace the old 512-byte sector standard in a very short time. (btw, ashift=12 works even with old 512-byte sector drives, because 4096 is a multiple of 512. there is an insignificantly tiny reduction in usable space when using ashift=12 on 512-byte sector drives).

2. The SSD can be partitioned so that some of it is for the OS (50-100GB should be plenty), some for a small ZIL write intent cache (4 or 8GB is heaps), and the remainder for L2ARC cache.

3. if you're intending to use some of that SSD for ZIL (write intent cache) then it's safer to have two ZIL partitions on two separate SSDs in a mirror configuration, so that if one SSD dies, you don't lose recent unflushed writes. this is one of those things that is low risk but high damage potential. in fact, an mdadm raid-1 for the OS, two non-raid L2ARC cache partitions, and mirrored ZIL is, IMO, an ideal configuration. if you can only fit one SSD in the machine, then obviously you can't do this.

4. if performance is far more important than size, then create your pool with two mirrored pairs (i.e. the equivalent of RAID-10) rather than RAID-Z1. This will give you the capacity of two drives, whereas RAID-Z1 would give you the capacity of 3 of your 4 drives. It also has the advantage of being relatively easier/cheaper to expand: just add another mirrored pair of drives to the pool. expanding a RAID-Z involves either adding another complete RAID-Z to the pool (i.e. 4 drives, so that you have a pool consisting of two RAID-Zs) or replacing each individual drive one after the other. e.g. 4 x 2TB drives would give 4TB total in RAID-10, compared to 6TB if you used RAID-Z1.

I use RAID-Z1 on my home file server and find the performance to be acceptable. The only time I really find it slow is when a 'zpool scrub' is running (i have a weekly cron job on my box)...my 4TB zpools are now about 70% full, so it takes about 6-8 hours for scrub to run. It's only a problem if i'm using the machine late at night. I also use RAID-Z1 on my mythtv box.
I've never had an issue with performance on that, although I do notice if I transcode/cut-ads more than 3 or 4 recordings at once (but that's a big IO load for a home server, read 2GB or more - much more for HD recordings - and write 1GB or so for each transcode, all simultaneous). It's mostly large sequential files, which is a best-case scenario for disk performance.
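To make that concrete, here is a rough sketch of the kind of setup points 0, 1 and 4 describe (pool name, device ids and sizes are invented - substitute your own):

  # a pool of two mirrored pairs (RAID-10 style), aligned for 4K sectors
  zpool create -o ashift=12 export \
      mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
      mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4

  # subvolumes instead of plain sub-directories, each with its own attributes
  zfs create export/home
  zfs set compression=on export/home    # documents compress well
  zfs set quota=500G export/home        # limit how much of the shared pool this subvol can use
  zfs create export/video
  zfs set compression=off export/video  # already-compressed media gains nothing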
I can install up to 4 SATA disks onto the mobo (5 in total with one slot used by the SSD)
if you ever plan to add more drives AND have a case that can physically fit them, then I highly recommend LSI 8-port SAS PCI-e 8x controllers. They're dirt cheap (e.g. IBM M1015 can be bought new off ebay for under $100), high performance, and can take 8 SAS or SATA drives. They're much better and far cheaper than any available SATA expansion cards. craig -- craig sanders <cas@taz.net.au> BOFH excuse #188: ..disk or the processor is on fire.

On Fri, 12 Jul 2013, Craig Sanders wrote:
e.g. once you've got a pool set up, get into the habit of creating subvolumes on the pool for different purposes rather than just sub-directories. You can enable different quotas (soft quota or hard-limit) on each volume, and have different attributes (e.g. compression makes sense for text documents, but not for already-compressed video files).
Note that `mv <tank/subvol1>/blbah <tank/subvol2>/foo` to a different subvol on the same device still copies-then-deletes the data as if it's a separate filesystem (it is - think of subvols just like LVM LVs), despite the fact that it's all just shared storage in the one pool. Slightly inconvenient and surprising, but it makes sense in a way (even if zfs could be convinced to mark a set of blocks from a file as changing its owner from subvol1 to subvol2, how do you teach `mv` that this inter-mount move is special?).
1. if your disks are new and have 4K sectors OR if you're likely to add 4K-sector drives in future, then create your pool with 'ashift=12'
Do it anyway, out of habit. You will one day have 4k sectors, and migrating is a bitch (when you've maxed out the number of drives in your 4 bay NAS).
3. if you're intending to use some of that SSD for ZIL (write intent cache) then it's safer to have two ZIL partitions on two separate SSDs in a mirror configuration, so that if one SSD dies, you don't lose recent unflushed writes. this is one of those things that is low risk but high damage potential.
Terribly, terribly low risk. It has to die, undetected, at the same time that the machine suffers a power failure (ok, so these tend to be correlated failures). (I say undetected, because zpool status will warn you when any of its devices are unhealthy - set up a cron.hourly job to monitor zpool status.) And if it does die, then COW will mean you'll get a consistent 5-second-old copy of your data. Which for non-database, non-mailserver workloads (and database workloads for those of us who pay zillions of dollars for oracle, and then don't use it to its full potential) doesn't actually matter (5 seconds old data vs non-transactional data that would be 5 seconds old if the power failed 5 seconds ago? Who cares?)

For me, SLOG just means that hardlinks over NFS don't take quite so long. Sometimes I even see SLOG usage grow up to 10MB(!) for a few seconds at a time!

-- Tim Connors
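A minimal sketch of that hourly check (script path is just a suggestion; 'zpool status -x' reports nothing of interest unless a pool is unhealthy, so cron only sends mail when something is wrong):

  #!/bin/sh
  # e.g. /etc/cron.hourly/zpool-health
  zpool status -x | grep -v 'all pools are healthy'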

Tim Connors writes:
Note that `mv <tank/subvol1>/blbah <tank/subvol2>/foo` to different subvols on the same device still copies-then-deletes the data as if its a separate filesystem (it is - think of it just like lvms) despite the fact that it's all just shared storage in the one pool. Slightly inconvenient and surprising, but makes sense in a way (even if zfs could be convinced to mark a set of blocks from a file as changing its owner from subvol1 to subvol2, how do you teach `mv` that this inter-mount move is special?
GNU cp has --reflink which AIUI is extra magic for ZFS & btrfs users. It wouldn't really surprise me if there was extra magic for mv as well, though I doubt it would apply in this case (as you say, inter-device move). Under btrfs, if / is btrfs root and you make a subvol "opt", it'll be visible at /opt even if you forget to mount it. Maybe in that case, mv *is* that clever, if only by accident. Incidentally, I have this in my .bashrc: grep -q btrfs /proc/mounts && alias cp='cp --reflink=auto' (The grep is just because it's shorter than checking if cp is new enough to have --reflink.)
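For example (file names made up):

  cp --reflink=auto bigfile.iso bigfile-copy.iso

On btrfs the copy is near-instant and shares extents until one of the files is modified; --reflink=auto falls back to an ordinary copy on filesystems that don't support the clone ioctl.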

On Fri, 12 Jul 2013, Tim Connors <tconnors@rather.puzzling.org> wrote:
Note that `mv <tank/subvol1>/blbah <tank/subvol2>/foo` to different subvols on the same device still copies-then-deletes the data as if its a separate filesystem (it is - think of it just like lvms) despite the fact that it's all just shared storage in the one pool. Slightly inconvenient and surprising, but makes sense in a way (even if zfs could be convinced to mark a set of blocks from a file as changing its owner from subvol1 to subvol2, how do you teach `mv` that this inter-mount move is special?
cp has the --reflink option, mv could use the same mechanism if it was supported across subvols. I believe that there is work on supporting reflink across subvols in BTRFS and there's no technical reason why ZFS couldn't do it too. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Sat, 13 Jul 2013 01:43:02 AM Russell Coker wrote:
cp has the --reflink option, mv could use the same mechanism if it was supported across subvols. I believe that there is work on supporting reflink across subvols in BTRFS and there's no technical reason why ZFS couldn't do it too.
Cross-subvol reflinks were merged for the 3.6 kernel. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=36... You can't reflink across mount points though, it was mentioned by Christoph Hellwig back in 2011 when he first NAK'd the cross-subvol patches http://www.spinics.net/lists/linux-btrfs/msg09229.html . cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On Fri, 12 Jul 2013, Colin Fee <tfeccles@gmail.com> wrote:
So I'm looking for a strategy re the implementation of ZFS. I can install up to 4 SATA disks onto the mobo (5 in total with one slot used by the SSD)
Firstly plan what you are doing, especially regarding boot. Do you want to have /boot be a RAID-1 across all 4 of the disks? Do you want it just on the SSD?

My home server uses BTRFS (which is similar to ZFS in many ways) for mass storage and has an SSD for the root filesystem. That means that when some server task uses a lot of IO capacity on the mass storage it won't slow down local workstation access. If you have a shared workstation/server plan then this might be worth considering.

Don't bother with ZIL or L2ARC, most home use has no need for more performance than modern hard drives can provide and it's best to avoid the complexity.

On Fri, 12 Jul 2013, Kevin <kevin@fuber.org> wrote:
You need 4GB min for a reasonable zfs server plus more if you use dedupe
I've had a server with 4G have repeated kernel OOM failures running ZFS even though dedupe wasn't enabled. I suggest that 8G is the bare minimum.

On Fri, 12 Jul 2013, Craig Sanders <cas@taz.net.au> wrote:
0. zfsonlinux is pretty easy to work with, easy to learn and to use.
Actually it's a massive PITA compared to every filesystem that most Linux users have ever used. You need DKMS to build the kernel modules and then the way ZFS operates is very different from traditional Linux filesystems.
1. if your disks are new and have 4K sectors OR if you're likely to add 4K-sector drives in future, then create your pool with 'ashift=12'
4K sector drives, if not quite the current standard right now, are pretty close to current standard and will inevitably replace the old 512-byte sector standard in a very short time.
If you are using disks that don't use 4K sectors then they are probably small by modern standards, in which case you don't have such a need for ZFS.

-- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Sat, Jul 13, 2013 at 01:43:40AM +1000, Russell Coker wrote:
On Fri, 12 Jul 2013, Colin Fee <tfeccles@gmail.com> wrote:
So I'm looking for a strategy re the implementation of ZFS. I can install up to 4 SATA disks onto the mobo (5 in total with one slot used by the SSD)
Firstly plan what you are doing especially regarding boot.
Do you want to have /boot be a RAID-1 across all 4 of the disks?
not a good idea with ZFS. Don't give partitions to ZFS, give it entire disks.

from the zfsonlinux faq: http://zfsonlinux.org/faq.html#PerformanceConsideration

"Create your pool using whole disks: When running zpool create use whole disk names. This will allow ZFS to automatically partition the disk to ensure correct alignment. It will also improve interoperability with other ZFS implementations which honor the wholedisk property."
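As a sketch, a whole-disk raidz1 pool would look something like this (device ids invented; /dev/disk/by-id naming is covered further down the thread):

  zpool create -o ashift=12 export raidz1 \
      /dev/disk/by-id/ata-MODEL_SERIAL1 /dev/disk/by-id/ata-MODEL_SERIAL2 \
      /dev/disk/by-id/ata-MODEL_SERIAL3 /dev/disk/by-id/ata-MODEL_SERIAL4
  # whole-disk names, no partition suffix - zfs partitions and aligns them itself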
Do you want it just on the SSD?
much better to do it this way, which is one of the reasons I suggested a second SSD to have mdadm raid-1 for / and /boot.
Don't bother with ZIL or L2ARC, most home use has no need for more performance than modern hard drives can provide and it's best to avoid the complexity.
if he's got an SSD then partitioning it to give some L2ARC and ZIL is easy enough, and both of them will provide noticeable benefits even for home use.

it takes a little bit of advance planning to set up (i.e. partitioning the SSD and then issuing 'zpool add' commands), but it's set-and-forget. doesn't add any ongoing maintenance complexity.

e.g. here's the 'zpool history' from when I installed my old SSD in my myth box after upgrading the SSD on my main system. it had previously been running for almost a year without L2ARC or ZIL (but with 16GB RAM, just because I had some spare RAM lying around)

2013-04-09.09:59:01 zpool add export log scsi-SATA_Patriot_WildfirPT1131A00006353-part5
2013-04-09.09:59:07 zpool add export cache scsi-SATA_Patriot_WildfirPT1131A00006353-part6
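and if there are two SSDs available, the mirrored-ZIL arrangement mentioned earlier is only a couple more commands - a sketch with made-up partition names:

  zpool add export log mirror /dev/disk/by-id/ata-SSD1-part5 /dev/disk/by-id/ata-SSD2-part5
  zpool add export cache /dev/disk/by-id/ata-SSD1-part6 /dev/disk/by-id/ata-SSD2-part6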
On Fri, 12 Jul 2013, Kevin <kevin@fuber.org> wrote:
You need 4GB min for a reasonable zfs server plus more if you use dedupe
I've had a server with 4G have repeated kernel OOM failures running ZFS even though dedupe wasn't enabled. I suggest that 8G is the bare minimum.
i'd agree with that. I haven't had the OOM failures even on 4GB (possibly because i've always used ZFS with either 8+GB or with some SSD L2ARC) but ZFS does benefit from more RAM.
On Fri, 12 Jul 2013, Craig Sanders <cas@taz.net.au> wrote:
0. zfsonlinux is pretty easy to work with, easy to learn and to use.
Actually it's a massive PITA compared to every filesystem that most Linux users have ever used.
yeah, well, it's a bit more complicated than mkfs. but a *lot* less complicated than mdadm and lvm. and gaining the benefits of sub-volumes or logical volumes of any kind is going to add some management complexity whether you use btrfs(8), zfs(8), or (worse) lvcreate/lvextend/lvresize/lvwhatever.
You need DKMS to build the kernel modules and then the way ZFS operates is very different from traditional Linux filesystems.
dkms automates and simplifies the entire compilation process, so it just takes time. it's not complicated or difficult. or you could just use the zfsonlinux binary repo for the distribution of your choice and install a pre-compiled binary. for debian, instructions are here: http://zfsonlinux.org/debian.html craig -- craig sanders <cas@taz.net.au> BOFH excuse #54: Evil dogs hypnotised the night shift

On Sat, 13 Jul 2013, Craig Sanders <cas@taz.net.au> wrote:
Firstly plan what you are doing especially regarding boot.
Do you want to have /boot be a RAID-1 across all 4 of the disks?
not a good idea with ZFS. Don't give partitions to ZFS, give it entire disks.
from the zfsonlinux faq:
http://zfsonlinux.org/faq.html#PerformanceConsideration
"Create your pool using whole disks: When running zpool create use whole disk names. This will allow ZFS to automatically partition the disk to ensure correct alignment. It will also improve interoperability with other ZFS implementations which honor the wholedisk property."
Who's going to transfer a zpool of disks from a Linux box to a *BSD or Solaris system? Almost no-one. If Solaris is even a consideration then just use it right from the start, ZFS is going to work better on Solaris anyway. If you have other reasons for choosing the OS (such as Linux being better for pretty much everything other than ZFS) then you're probably not going to change.
Don't bother with ZIL or L2ARC, most home use has no need for more performance than modern hard drives can provide and it's best to avoid the complexity.
if he's got an SSD then partitioning it to give some L2ARC and ZIL is easy enough, and both of them will provide noticable benefits even for home use.
it takes a little bit of advance planning to set up (i.e. partitioning the SSD and then issuing 'zfs add' commands, but it's set-and-forget. doesn't add any ongoing maintainence complexity.
But it does involve more data transfer. Modern SSDs shouldn't wear out, but I'm not so keen on testing that theory. For a system with a single SSD you will probably have something important on it. Using it for nothing but ZIL/L2ARC might be a good option, but also using it for boot probably wouldn't.
On Fri, 12 Jul 2013, Craig Sanders <cas@taz.net.au> wrote:
0. zfsonlinux is pretty easy to work with, easy to learn and to use.
Actually it's a massive PITA compared to every filesystem that most Linux users have ever used.
yeah, well, it's a bit more complicated than mkfs. but a *lot* less complicated than mdadm and lvm.
I doubt that claim. It's very difficult to compare complexity, but the layered design of mdadm and lvm makes it easier to determine what's going on IMHO.
and gaining the benefits of sub-volumes or logical volumes of any kind is going to add some management complexity whether you use btrfs(8), zfs(8), or (worse) lvcreate/lvextend/lvresize/lvwhatever.
It's true that some degree of complexity is inherent in solving more complex problems. That doesn't change the fact that it's difficult to work with. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Sat, Jul 13, 2013 at 02:42:22PM +1000, Russell Coker wrote:
On Sat, 13 Jul 2013, Craig Sanders <cas@taz.net.au> wrote:
from the zfsonlinux faq:
http://zfsonlinux.org/faq.html#PerformanceConsideration
"Create your pool using whole disks: When running zpool create use whole disk names. This will allow ZFS to automatically partition the disk to ensure correct alignment. It will also improve interoperability with other ZFS implementations which honor the wholedisk property."
Who's going to transfer a zpool of disks from a Linux box to a *BSD or Solaris system? Almost no-one.
read the first reason again: automatic alignment of the partitions increases performance - and it's the sort of thing that's a PITA to do manually, calculating the correct starting locations for all partitions. it'll be a lot more significant to most linux users than the second reason...as you say, that's irrelevant to most people.
But it does involve more data transfer. Modern SSDs shouldn't wear out, but I'm not so keen on testing that theory. For a system with a single SSD you will probably have something important on it. Using it for nothing but ZIL/L2ARC might be a good option, but also using it for boot probably wouldn't.
if you're that paranoid about SSDs wearing out then just don't use them for anything except a read-cache. or better yet, not at all. it really isn't worth worrying about, though. current generation SSDs are at least as reliable as hard disks, and with longevity to match. they're not going to wear out any faster than a hard disk.
On Fri, 12 Jul 2013, Craig Sanders <cas@taz.net.au> wrote:
0. zfsonlinux is pretty easy to work with, easy to learn and to use.
Actually it's a massive PITA compared to every filesystem that most Linux users have ever used.
yeah, well, it's a bit more complicated than mkfs. but a *lot* less complicated than mdadm and lvm.
I doubt that claim. It's very difficult to compare complexity, but the layered design of mdadm and lvm makes it easier to determine what's going on IMHO.
mdadm alone is simple enough and well documented, with great built-in help - about on a par with zfs. they're both very discoverable, which is particularly important for tools that you don't use daily; you typically only use them when adding or replacing hardware or when something has gone wrong. the btrfs tools are similarly easy to learn.

lvm is the odd one out - it's reasonably well documented, but the built-in help is crap, and it is not easy to discover how to use the various lv tools or, indeed, which of the tools is the appropriate one to use - "is it lvresize or lvextend to resize a partition, or both, and in what order?"

and, unlike lvm, with both btrfs and zfs, once you've created the pool you don't need to remember or specify the device names of the disks in the pool unless you're replacing one of them - most operations (like creating, deleting, or changing attributes of a volume or zvol) are done with just the name of the pool and volume. this leads to short, simple, easily understood command lines. btrfs and zfs are at a higher abstraction level, allowing you to work with the big picture of volumes and pools rather than get bogged down in little details of disks and partitions.

the 'zpool history' command is also invaluable - you can easily see *exactly* which commands you used to make changes to the pool or a file system - built-in, time-stamped cheat-notes of everything you've done to the pool. this makes it trivially easy to remember how you did something six months ago.
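for instance (pool and volume names invented), the sort of short command lines being described:

  zfs create export/photos               # new subvolume, no device names needed
  zfs set compression=on export/photos   # per-volume attribute
  zpool history export                   # time-stamped log of every change to the pool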
and gaining the benefits of sub-volumes or logical volumes of any kind is going to add some management complexity whether you use btrfs(8), zfs(8), or (worse) lvcreate/lvextend/lvresize/lvwhatever.
It's true that some degree of complexity is inherent in solving more complex problems. That doesn't change the fact that it's difficult to work with.
if you mean 'zfs is difficult to work with', then you must be using some definition of 'difficult' of which i was previously unaware. craig -- craig sanders <cas@taz.net.au> BOFH excuse #347: The rubber band broke

On 14/07/13 01:53, Craig Sanders wrote:
and, unlike lvm, with both btrfs and zfs, once you've created the pool you don't need to remember or specify the device names of the disks in the pool unless you're replacing one of them
Can you clarify what you mean by this? My volume group is /dev/vg_glimfeather. I haven’t interacted with the raw PV, /dev/sda6, in over a year. All modern distros autodetect all VGs upon boot.

On Sun, Jul 14, 2013 at 12:15:59PM +1000, Jeremy Visser wrote:
On 14/07/13 01:53, Craig Sanders wrote:
and, unlike lvm, with both btrfs and zfs, once you've created the pool you don't need to remember or specify the device names of the disks in the pool unless you're replacing one of them
Can you clarify what you mean by this?
My volume group is /dev/vg_glimfeather. I haven't interacted with the raw PV, /dev/sda6, in over a year.
here's an example from the lvextend man page:

    Extends the size of the logical volume "vg01/lvol10" by 54MiB on physical volume /dev/sdk3. This is only possible if /dev/sdk3 is a member of volume group vg01 and there are enough free physical extents in it:

    lvextend -L +54 /dev/vg01/lvol10 /dev/sdk3

and another from lvresize:

    Extend a logical volume vg1/lv1 by 16MB using physical extents /dev/sda:0-1 and /dev/sdb:0-1 for allocation of extents:

    lvresize -L+16M vg1/lv1 /dev/sda:0-1 /dev/sdb:0-1

to extend or resize the volume, you've got to tell it which disk(s) to allocate the free space from.

with zfs, you'd just do 'zfs set quota=nnnn pool/volume', and not have to give a damn about which particular disks in the pool the extra space came from. actually, 'set reservation' would be closer in meaning to the lvm commands - 'quota' is a limit drawn from the shared space used by all volumes on that pool, so overcommit is possible (even the default). e.g. on a 1TB pool, you can create 5 volumes each with quotas of 1TB. 'reservation' actually reserves that amount of space for that volume from the pool; no other volume gets to use that reserved space.
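i.e. something like this sketch (names and sizes invented):

  zfs set quota=200G export/home        # cap home at 200G of the shared pool; overcommit is allowed
  zfs set reservation=50G export/home   # actually set aside 50G of the pool for home alone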
All modern distros autodetect all VGs upon boot.
craig -- craig sanders <cas@taz.net.au> BOFH excuse #57: Groundskeepers stole the root password

On Sun, Jul 14, 2013 at 4:54 PM, Craig Sanders <cas@taz.net.au> wrote:
On Sun, Jul 14, 2013 at 12:15:59PM +1000, Jeremy Visser wrote:
My volume group is /dev/vg_glimfeather. I haven't interacted with the raw PV, /dev/sda6, in over a year.
here's an example from the lvextend man page:
Extends the size of the logical volume "vg01/lvol10" by 54MiB on physical volume /dev/sdk3. This is only possible if /dev/sdk3 is a member of volume group vg01 and there are enough free physical extents in it:
lvextend -L +54 /dev/vg01/lvol10 /dev/sdk3
If you look at the synopsis for the man page, you'll realise that the physical device is completely optional. It's for when you might have multiple PVs and wish to force this extend to happen on one in particular. For most use-cases you don't need to specify it at all.
and another from lvresize:
Extend a logical volume vg1/lv1 by 16MB using physical extents /dev/sda:0-1 and /dev/sdb:0-1 for allocation of extents:
lvresize -L+16M vg1/lv1 /dev/sda:0-1 /dev/sdb:0-1
to extend or resize the volume, you've got to tell it which disk(s) to allocate the free space from.
Nope. You can, but you don't have to. / Brett

On Sun, Jul 14, 2013 at 05:25:29PM +1000, Brett Pemberton wrote:
to extend or resize the volume, you've got to tell it which disk(s) to allocate the free space from.
Nope. You can, but you don't have to.
oh well, consider that an example of lvm's difficulty to learn and remember. I've been using LVM for a lot longer than ZFS but I still find it difficult and complicated to work with, and the tools are clumsy and awkward.

nowadays, i see them as first generation pool & volume management tools, functional but inelegant. later generation tools like btrfs and ZFS not only add more features (like compression, de-duping, etc etc) by blurring the boundaries between block devices and file systems but, more importantly, they add simplicity and elegance that is missing from the first generation.

and as for lvextend, lvreduce, and lvresize - it's completely insane for there to be three tools that do basically the same job. sure, i get that there are historical and compatibility reasons to keep lvextend and lvreduce around, but they should clearly be marked in the man page as "deprecated, use lvresize instead" (or even just "same as lvresize") because for a newcomer to LVM, or even a long term but infrequent user (as is normal, lvm isn't exactly daily-usage), they just add confusion - i have to remind myself every time I use them that extend doesn't mean anything special in LVM, it's not something extra i have to do before or after a resize, it's just a resize.

craig

-- craig sanders <cas@taz.net.au> BOFH excuse #144: Too few computrons available.

On Mon, 15 Jul 2013 14:30:43 +1000 Craig Sanders <cas@taz.net.au> wrote:
oh well, consider that an example of lvm's difficulty to learn and remember. I've been using LVM for a lot longer than ZFS but I still find it difficult and complicated to work with, and the tools are clumsy and awkward.
They are designed for first-generation filesystems where fsck is a standard utility that's regularly used, as opposed to more modern filesystems where there is no regularly used fsck because they are just expected to work all the time.

The ratio of block device size to RAM for affordable desktop systems is now around 500:1 (4T:8G) where it was about 35:1 in the mid 90's (I'm thinking 70M disk and 2M of RAM). The time taken to linearly read an entire block device is now around 10 hours when it used to be about 2 minutes in 1990 (I'm thinking of a 1:1 interleave 70MB MFM disk doing 500KB/s). Those sort of numbers change everything about filesystems, and they aren't big filesystems. The people who designed ZFS were thinking of arrays of dozens of disks as a small filesystem.

Ext4 and LVM still have their place. I'm a little nervous about the extra complexity of ZFS and BTRFS when combined with a system that lacks ECC RAM. I've already had two BTRFS problems that required backup/format/restore due to memory errors.
nowadays, i see them as first generation pool & volume management tools, functional but inelegant. later generation tools like btrfs and ZFS not only add more features (like compression, de-duping, etc etc) by blurring the boundaries between block devices and file systems but more importantly they add simplicity and elegance that is missing from the first generation.
True, but they also add complexity. Did you ever run debugfs on an ext* filesystem? It's not that hard to do. I really don't want to imagine running something similar on BTRFS or ZFS.
and as for lvextend, lvreduce, and lvresize - it's completely insane for there to be three tools that do basically the same job.
It's pretty much standard in Unix to have multiple ways of doing it.

gzip -d / zcat
cp / tar / cat / mv all have overlapping functionality
tar / cpio
sure, i get that there are historical and compatibility reasons to keep lvextend and lvreduce around, but they should clearly be marked in the man page as "deprecated, use lvresize instead" (or even just "same as lvresize") because for a newcomer to LVM or even a long term
So it's a man page issue. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Mon, Jul 15, 2013 at 03:25:29PM +1000, Russell Coker wrote:
They are designed for first-generation filesystems where fsck is a standard utility that's regularly used as opposed to more modern filesystems where there is no regularly used fsck because they are just expected to work all the time.
and as you point out, disks are getting so large that fsck is just impractical. fsck on boot even more so - 10 hours downtime just to reboot? (i particularly hate ext2/3/4's default mount-count and interval settings for auto-fsck on boot...i don't need an fsck just because this is the first time i've rebooted the server in 6 months)
Ext4 and LVM still have their place.
ext4, yes - e.g. on laptops and in VM images. not so sure about lvm, there isn't anything it does that zfs and btrfs don't do better.
I'm a little nervous about the extra complexity of ZFS and BTRFS when combined with a system that lacks ECC RAM. I've already had two BTRFS problems that required backup/format/restore due to memory errors.
i'm far more nervous about undetected errors in current size hard disks using traditional filesystems like ext4 and xfs.

RAM is cheap, and the price difference for ECC is not prohibitive. e.g. without spending too much time shopping around for best prices, I found Kingston 8GB DDR3-1333 non-ECC being sold pretty much everywhere for around $75. ECC is a bit harder to find but i spotted that megabuy is selling Kingston 8GB DDR3-1333 ECC RAM for $97. http://www.megabuy.com.au/kingston-kvr13e98i-8gb-1333mhz-ddr3-ecc-cl9-dimm-i... they also have DDR3-1600 ECC for the same price. that's about 30% more for ECC, but still adds up to a difference of less than $50 for a 16GB RAM system.

oddly, mwave is currently selling 4GB ECC Kingston RAM for $22.29 http://www.mwave.com.au/product/sku-aa64505-kingston_4gb_memory_ddr31333mhz_... that price is so low I suspect a sale to get rid of an overstocked item, or maybe an error.

BTW, all AMD CPU motherboards (even home/hobbyist boards from asus, asrock, gigabyte etc) support ECC and have done for several years. Some Intel home/hobbyist motherboards do too (until recently they disabled ECC on non-server CPUs as part of their usual artificial market segmentation practices). All server boards support ECC, of course.
nowadays, i see them as first generation pool & volume management [...]
True, but they also add complexity. Did you ever run debugfs on an ext* filesystem? It's not that hard to do. I really don't want to imagine running something similar on BTRFS or ZFS.
zfs should prevent (detect and correct) the kind of small errors that are fixable manually with debugfs, and for anything more serious restoring from backup is the only practical solution.
and as for lvextend, lvreduce, and lvresize - it's completely insane for there to be three tools that do basically the same job.
It's pretty much standard in Unix to have multiple ways of doing it.
yeah, from multiple alternative implementations, or by using variations of multiple small tools together in a pipeline. not generally in the same software package. when it is within the one package, it's generally for legacy reasons (i.e. backwards-compatibility, so as to not break scripts that were written to use the old tool) and clearly documented as such.
So it's a man page issue.
documentation in general but yes, that's what i said. craig -- craig sanders <cas@taz.net.au> BOFH excuse #448: vi needs to be upgraded to vii

A most enlightening thread so far. I've begun playing with ZFS within a VM to get used to the various commands and their outcomes etc.
From what I've been reading elsewhere, what I like about ZFS is the reduction (or is simplification more appropriate?) of something like this (hoping my ASCII art holds up):
   [/]    [/usr]   [/var]  [/home]
    |        |        |        |
   +---------------------------+
   |     volume group "vg"     |
   +---------------------------+
        |               |
   +-----------+  +-----------+
   |PV /dev/foo|  |PV /dev/bar|
   +-----------+  +-----------+
     |       |      |       |
    sda     sdb    sdc     sdd

....to something like this:

   [zfs "tank/"]   [zfs "tank/usr"]   [zfs "tank/home"]
         |                |                  |
   +-------------------------------------------------+
   |               storage pool "tank"                |
   +-------------------------------------------------+
       |          |          |          |
      sda        sdb        sdc        sdd

I also like how block-device management knows about the file system, or vice versa, and can therefore cope with errors. I've experienced the disconnection between fs, mdadm and the disk not long ago when a corruption occurred in an episode of the West Wing that we didn't discover until we re-watched it. The corruption was even passed on to the back-up copies. Recovery was as simple as re-capturing from source, but a pain none the less.

One thing I read yesterday was that zfs can be prone to issues from fragmentation - is there a preventive strategy or remedial measure to take into account?
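In command terms the two stacks come out roughly like this sketch (names invented, redundancy layers left out):

  # lvm: build each layer yourself
  pvcreate /dev/sda /dev/sdb
  vgcreate vg /dev/sda /dev/sdb
  lvcreate -L 200G -n home vg
  mkfs.ext4 /dev/vg/home
  mount /dev/vg/home /home

  # zfs: one pool, and subvolumes arrive with their own mountpoints
  zpool create tank mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2
  zfs create tank/home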
-- Colin Fee tfeccles@gmail.com

On Mon, 15 Jul 2013, Colin Fee wrote:
A most enlightening thread so far. I've begun playing with ZFS within a VM to get used to the various commands and their outcomes etc.
From what I've been reading elsewhere what I like about ZFS is the reduction (or is simplification more appropriate) of something like this:
You can do that with lvm, by the way. Set up arbitrary redundancy of desired logical volumes too.
One thing I read yesterday was that zfs can be prone to issues from fragmentation - is there a preventive strategy or remedial measure to take into account?
Don't ever use more than 80% of your file system? Yeah, I know that's not a very acceptable alternative.

It's funny: with much, much reduced memory usage, btrfs has the default behaviour of caching its free block list (possibly only when running low on space?). ZFS doesn't, so when it starts getting low, it has to do a linear search for where to place the next copy-on-write block, so absolutely every single write (including rewrites, this being the brave new world of COW) becomes a painful, year-long process.

-- Tim Connors
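If you go the ZFS route it's worth watching pool capacity so you notice well before that threshold - a trivial check (pool name made up):

  zpool list tank           # SIZE / ALLOC / FREE / CAP at a glance
  zpool get capacity tank   # just the percentage used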

On Mon, Jul 15, 2013 at 03:38:07PM +1000, Tim Connors wrote:
One thing I read yesterday was that zfs can be prone to issues from fragmentation - is there a preventive strategy or remedial measure to take into account?
Don't ever use more than 80% of your file system?
btrfs has a 'balance' command to rebalance storage of files across the disks (e.g. after you've added or replaced disks in the pool). That has the side-effect of de-fragging files. of course, on a large pool, it takes ages to run.

zfs doesn't have this.
Yeah, I know that's not a very acceptable alternative.
it's not that bad. you're supposed to keep an eye on your systems and add more disks to the pool (or replace current disks with larger ones) as it starts getting full. my main zpools (4x1TB in RAIDZ1) are about 70% full. I probably should start thinking about replacing the drives with 4x2TB soon...or deleting some crap. craig -- craig sanders <cas@taz.net.au> BOFH excuse #108: The air conditioning water supply pipe ruptured over the machine room

On Tue, Jul 16, 2013 at 12:28 PM, Craig Sanders <cas@taz.net.au> wrote:
On Mon, Jul 15, 2013 at 03:38:07PM +1000, Tim Connors wrote:
One thing I read yesterday was that zfs can be prone to issues from fragmentation - is there a preventive strategy or remedial measure to take into account?
Don't ever use more than 80% of your file system?
btrfs has a 'balance' command to rebalance storage of files across the disks (e.g. after you've added or replaced disks in the pool. That has the side-effect of de-fragging files.
of course, on a large pool, it takes ages to run.
zfs doesn't have this.
balance in btrfs is also used to change the pool replication (single, to raid1 (1 replica), to raid10 (1 replica with striping)) and, i would imagine, for moving in and out of parity-based replication systems (erasure codes).
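e.g. something along these lines (mount point made up) to convert an existing filesystem's data and metadata profiles:

  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/pool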

Craig Sanders <cas@taz.net.au> wrote:
btrfs has a 'balance' command to rebalance storage of files across the disks (e.g. after you've added or replaced disks in the pool. That has the side-effect of de-fragging files.
of course, on a large pool, it takes ages to run.
Btrfs also has online de-fragmentation, performed while the file system is otherwise idle. In addition, there's a defragment command that the administrator can invoke on files and directories.
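For the manual case that's something like (path made up):

  btrfs filesystem defragment -r /mnt/pool/videos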

On Wed, 17 Jul 2013 06:03:13 PM Jason White wrote:
Btrfs also has online de-fragmentation, performed while the file system is otherwise idle. In addition, there's a defragment command that the administrator can invoke on files and directories.
But be aware that snapshot-aware defrag is only in kernel v3.9 and later; in other words, earlier kernels will (if you defrag) possibly end up duplicating blocks as part of the defrag. See commit 38c227d87c49ad5d173cb5d4374d49acec6a495d for more.

All the best, Chris

-- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On Tue, 16 Jul 2013, Craig Sanders <cas@taz.net.au> wrote:
my main zpools (4x1TB in RAIDZ1) are about 70% full. I probably should start thinking about replacing the drives with 4x2TB soon...or deleting some crap.
2*4TB would give you twice the storage you've currently got with less noise and power use. 3*4TB in a RAID-Z1 will give you 3* the storage you've currently got.

How do you replace a zpool? I've got a system with 4 small disks in a zpool that I plan to replace with 2 large disks. I'd like to create a new zpool with the 2 large disks, do an online migration of all the data, and then remove the 4 old disks.

Also are there any issues with device renaming? For example if I have sdc, sdd, sde, sdf used for a RAID-Z1 and I migrate the data to a RAID-1 on sdg and sdh, when I reboot the RAID-1 will be sdc and sdd, will that cause problems?

-- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

Happy SysAdmin Day http://sysadminday.com/

-- Tom Robinson

On Thu, Jul 25, 2013 at 07:13:25PM +1000, Russell Coker wrote:
On Tue, 16 Jul 2013, Craig Sanders <cas@taz.net.au> wrote:
my main zpools (4x1TB in RAIDZ1) are about 70% full. I probably should start thinking about replacing the drives with 4x2TB soon...or deleting some crap.
2*4TB would give you twice the storage you've currently got with less noise and power use.
and faster than raidz or raid5 too, but it would only give me a total of 4TB in a mirrored/raid-1 pair. seagate ST4000DM000 drives seem to be the cheapest at the moment at $195...so a total of $390.

noise isn't an issue (the system fans are louder than the drives, and they're low noise fans), and power isn't much of a concern (the drives use 4.5W each idle, 6W in use).

i've been reading up on drive reviews recently - advance planning for the upgrade - and while the ST4000DM000 has got good reviews, the WD RED drives seem better. 4TB RED drives aren't available yet, and 3TB WD30EFRX drives cost $158 each.
3*4TB in a RAID-Z1 will give you 3* the storage you've currently got.
my current plan is to replace my 4x1TB raidz backup pool with one or two mirrored pairs of either 3TB or 4TB drives. i'll wait a while and hope the 4TB WD REDs are reasonably priced. if they're cheap enough i'll buy four. or i'll get seagates. otherwise i'll get either 2 x 4TB drives or 4 x 3TB - most likely the former, as I can always add another pair later (which is one of my reasons for switching from raidz to mirrored pairs - cheaper upgrades. i've always preferred raid1 or mirrors anyway).

i'm in no hurry, i can wait to see what happens with pricing. and i've read rumours that WD are likely to be releasing 5TB drives around the same time as their 4TB drives too, which should make things "interesting".

my current 1TB drives can also be recycled into two more 2x1TB pairs to put some or all of them back into the backup pool - giving e.g. a total of 2x4TB + 2x1TB + 2x1TB, or 6TB usable space. Fortunately, I have enough SAS/SATA ports and hot-swap drive bays to do that. I'll probably just use two of them, with the idea of replacing them with 4TB or larger drives in a year or three.

i don't really need more space in my main pool. growth in usage is slow and I can easily delete a lot of stuff (i'd get back two hundred GB or more if i moved my ripped CD and DVD collection to the zpool on my myth box, which is only about 60% full - and i could easily bring that back to 30 or 40% by deleting old recordings i've already watched)
How do you replace a zpool? I've got a system with 4 small disks in a zpool that I plan to replace with 2 large disks. I'd like to create a new zpool with the 2 large disks, do an online migration of all the data, and then remove the 4 old disks.
there are two ways.

1. replace each drive individually with 'zpool replace olddrive newdrive' commands. this takes quite a while (depending on the amount of data that needs to be resilvered). when every drive in a vdev (a single drive, a mirrored pair, a raidz) in the pool is replaced with a larger one, the extra space is immediately available. for a 4-drive raidz it would take a very long time to replace each one individually. IMO this is only worthwhile for replacing a failed or failing drive.

2. create a new pool, 'zfs send' the old pool to the new one, and destroy the old pool. this is much faster, but you need enough sata/sas ports to have both pools active at the same time. it also has the advantage of rewriting the data according to the attributes of the new pool (e.g. if you had gzip compression on the old pool and lz4[1] compression on the new, your data will be recompressed with lz4 - highly recommended BTW).

if you're converting from a 4-drive raidz to a 2-drive mirrored pair, then this is the only way to do it.

[1] http://wiki.illumos.org/display/illumos/LZ4+Compression
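as a sketch of option 2 (pool and snapshot names invented - check the zfs send/receive man pages before trusting it with real data):

  zfs snapshot -r oldpool@migrate
  zfs send -R oldpool@migrate | zfs receive -F newpool
  # verify the data on newpool, then:
  zpool destroy oldpool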
Also are there any issues with device renaming? For example if I have sdc, sdd, sde, sdf used for a RAID-Z1 and I migrate the data to a RAID-1 on sdg and sdh, when I reboot the RAID-1 will be sdc and sdd, will that cause problems?
http://zfsonlinux.org/faq.html#WhatDevNamesShouldIUseWhenCreatingMyPool

using sda/sdb/etc has the usual problems of drives being renamed if the kernel detects them in a different order (e.g. new kernel version, different load order for driver modules, variations in drive spin-up time, etc) and is not recommended except for testing/experimental pools.

it's best to use the /dev/disk/by-id device names. they're based on the model and serial number of the drive, so are guaranteed unique and will never change.

e.g. my backup pool currently looks like this (i haven't bothered running zpool upgrade on it yet to take advantage of lz4 and other improvements)

  pool: backup
 state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on software that does not support
        feature flags.
  scan: scrub repaired 160K in 4h21m with 0 errors on Sat Jul 20 06:03:58 2013
config:

        NAME                           STATE     READ WRITE CKSUM
        backup                         ONLINE       0     0     0
          raidz1-0                     ONLINE       0     0     0
            ata-ST31000528AS_6VP3FWAG  ONLINE       0     0     0
            ata-ST31000528AS_9VP4RPXK  ONLINE       0     0     0
            ata-ST31000528AS_9VP509T5  ONLINE       0     0     0
            ata-ST31000528AS_9VP4P4LN  ONLINE       0     0     0

errors: No known data errors

craig

-- craig sanders <cas@taz.net.au> BOFH excuse #193: Did you pay the new Support Fee?

On Fri, Jul 26, 2013 at 11:37 AM, Craig Sanders <cas@taz.net.au> wrote:
On Thu, Jul 25, 2013 at 07:13:25PM +1000, Russell Coker wrote:
On Tue, 16 Jul 2013, Craig Sanders <cas@taz.net.au> wrote:
my main zpools (4x1TB in RAIDZ1) are about 70% full. I probably should start thinking about replacing the drives with 4x2TB soon...or deleting some crap.
2*4TB would give you twice the storage you've currently got with less noise and power use.
and faster than raidz or raid5 too, but it would only give me a total of 4TB in a mirrored/raid-1 pair. seagate ST4000DM000 drives seem to be the cheapest at the moment at $195...so a total of $390.
noise isn't an issue (the system fans are louder than the drives, and they're low noise fans), and power isn't much of a concern (the drives use 4.5W each idle, 6W in use)
i've been reading up on drive reviews recently - advance planning for the upgrade - and while the ST4000DM000 has got good reviews, the WD RED drives seem better. 4TB RED drives aren't available yet, and 3TB WD30EFRX drives cost $158 each.
I would not give the red drives much consideration as they are mostly a marketing change over the blacks. Based on reports from companies like backblaze I would consider hitachi deskstars if the budget permitted, else go with the seagate DM series. Anything advertised for enterprise purposes is really pointless. WD blacks or reds are also appropriate but avoid the greens (stupid idle timeout.)
How do you replace a zpool? I've got a system with 4 small disks in a zpool that I plan to replace with 2 large disks. I'd like to create a new zpool with the 2 large disks, do an online migration of all the data, and then remove the 4 old disks.
*snipping text*
if you're converting from a 4-drive raidz to a 2-drive mirrored pair, then this is the only way to do it.
the static vdev sizing is truly one of the few annoying things about zfs

On 26/07/13 11:53, Kevin wrote:
On Fri, Jul 26, 2013 at 11:37 AM, Craig Sanders <cas@taz.net.au <mailto:cas@taz.net.au>> wrote:
On Thu, Jul 25, 2013 at 07:13:25PM +1000, Russell Coker wrote:
On Tue, 16 Jul 2013, Craig Sanders <cas@taz.net.au <mailto:cas@taz.net.au>> wrote:
my main zpools (4x1TB in RAIDZ1) are about 70% full. I probably should start thinking about replacing the drives with 4x2TB soon...or deleting some crap.
2*4TB would give you twice the storage you've currently got with less noise and power use.
and faster than raidz or raid5 too, but it would only give me a total of 4TB in a mirrored/raid-1 pair. seagate ST4000DM000 drives seem to be the cheapest at the moment at $195...so a total of $390.
noise isn't an issue (the system fans are louder than the drives, and they're low noise fans), and power isn't much of a concern (the drives use 4.5W each idle, 6W in use)
i've been reading up on drive reviews recently - advance planning for the upgrade - and while the ST4000DM000 has got good reviews, the WD RED drives seem better. 4TB RED drives aren't available yet, and 3TB WD30EFRX drives cost $158 each.
I would not give the red drives much consideration as they are mostly a marketing change over the blacks. Based on reports from companies like backblaze I would consider hitachi deskstars if the budget permitted, else go with the seagate DM series. Anything advertised for enterprise purposes is really pointless.
WD blacks or reds are also appropriate but avoid the greens (stupid idle timeout.)
Have I done something wrong? I've been running the WD Greens for nearly two years without any issues (raidz mirror). There's a flash tool (wdidle3) that allows you to change the LCC timeout to a maximum of five minutes or disable it completely. Here's at least one reference http://alfredoblogspage.blogspot.com.au/2013/01/western-digital-wd20earx-loa...
How do you replace a zpool? I've got a system with 4 small disks in a zpool that I plan to replace with 2 large disks. I'd like to create a new zpool with the 2 large disks, do an online migration of all the data, and then remove the 4 old disks.
*snipping text*
if you're converting from a 4-drive raidz to a 2-drive mirrored pair, then this is the only way to do it.
the static vdev sizing is truly one of the few annoying things about zfs
-- Tom Robinson
19 Thomas Road, Healesville, VIC 3777, Australia
Mobile: +61 4 3268 7026   Home: +61 3 5962 4543
GPG Key: 8A4CB7A7
CONFIDENTIALITY: Copyright (C). This message with any appended or attached material is intended for addressees only and may not be copied or forwarded to or used by other parties without permission.

On Fri, Jul 26, 2013 at 12:20:57PM +1000, Tom Robinson wrote:
Have I done something wrong? I've been running the WD Greens for nearly two years without any issues (raidz mirror).
if they're working without error for you, i wouldn't worry about it.
There's a flash tool (wdidle3) that allows you to change the LCC timeout to a maximum of five minutes or disable it completely.
i didn't reflash my WD Greens, I reflashed my SAS cards to "IT" mode instead...haven't had a problem with them since then.
Here's at least one reference
http://alfredoblogspage.blogspot.com.au/2013/01/western-digital-wd20earx-loa...
too bad that's on blogspot.com.au. i really hate blogspot pages, crappy (and really bloody irritating) javascript and they don't work at all with NoScript. I don't even bother clicking on a link to an interesting-sounding article if i notice the URL is blogspot. craig -- craig sanders <cas@taz.net.au> BOFH excuse #173: Recursive traversal of loopback mount points

On Fri, Jul 26, 2013 at 11:53:10AM +1000, Kevin wrote:
I would not give the red drives much consideration as they are mostly a marketing change over the blacks.
given that 3TB WD Black drives are about $70 more then 3TB WD RED, that would make the RED drives a bargain.
Based on reports from companies like backblaze I would consider hitachi deskstars if the budget permitted else go with seagate DM series
hitachi 4TB Deskstar 0S03358 is about $270 at the moment, much more expensive than the seagate ST4000DM000 at $195.
Anything advertised for enterprise purposes is really pointless.
true.
WD blacks or reds are also appropriate but avoid the greens (stupid idle timeout.)
i use WD Greens in my myth box. my main /export pool is made up of a mixture of WD Green 1TB and seagate ST31000528AS. i had to reflash my LSI-based SAS card to "IT" mode to avoid idle timeout problems, but since I did that they've been working flawlessly.
the static vdev sizing is truly one of the few annoying things about zfs
yep. it would be nice to do stuff like just add a fifth drive to a 4-drive raidz and have it rebalance all the data. but you can't do that (and zfs doesn't do rebalancing like btrfs either). craig -- craig sanders <cas@taz.net.au> BOFH excuse #111: The salesman drove over the CPU board.

On Fri, 26 Jul 2013, Craig Sanders <cas@taz.net.au> wrote:
noise isn't an issue (the system fans are louder than the drives, and they're low noise fans), and power isn't much of a concern (the drives use 4.5W each idle, 6W in use)
With noise it's not just a matter of the loudest source being the only one you hear. The noise from different sources adds and you can have harmonics too. As an aside, I've got a 6yo Dell PowerEdge T105 whose CPU fan is becoming noisy. Any suggestions on how to find a replacement?
How do you replace a zpool? I've got a system with 4 small disks in a zpool that I plan to replace with 2 large disks. I'd like to create a new zpool with the 2 large disks, do an online migration of all the data, and then remove the 4 old disks.
there are two ways.
1. replace each drive individually with 'zpool replace olddrive newdrive' commands. this takes quite a while (depending on the amount of data that needs to be resilvered). when every drive in a vdev (a single drive, a mirrored pair, a raidz) in the pool is replaced with a larger one, the extra space is immediately available. for a 4-drive raidz it would take a very long time to replace each one individually.
IMO this is only worthwhile for replacing a failed or failing drive.
2. create a new pool, 'zfs send' the old pool to the new one and destroy the old pool. this is much faster, but you need enough sata/sas ports to have both pools active at the same time. it also has the advantage of rewriting the data according to the attributes of the new pool (e.g. if you had gzip compression on the old pool and lz4[1] compression on the new your data will be recompressed with lz4 - highly recommended BTW)
if you're converting from a 4-drive raidz to a 2-drive mirrored pair, then this is the only way to do it.
According to the documentation you can add a new vdev to an existing pool. So if you have a pool named "tank" you can do the following to add a new RAID-1:

  zpool add tank mirror /dev/sde2 /dev/sdf2

Then if you could remove a vdev you could easily migrate a pool. However it doesn't appear possible to remove a vdev, you can remove a device that contains mirrored data but nothing else.
Using a pair of mirror vdevs would allow you to easily upgrade the filesystem while only replacing two disks. But the down-side to that is that if two disks fail that could still lose most of your data while a RAID-Z2 over 4 disks would give the same capacity as 2*RAID-1 but cover you in the case of two failures.
Also are there any issues with device renaming? For example if I have sdc, sdd, sde, sdf used for a RAID-Z1 and I migrate the data to a RAID-1 on sdg and sdh, when I reboot the RAID-1 will be sdc and sdd, will that cause problems?
http://zfsonlinux.org/faq.html#WhatDevNamesShouldIUseWhenCreatingMyPool
using sda/sdb/etc has the usual problems of drives being renamed if the kernel detects them in a different order (e.g. new kernel version, different load order for driver modules, variations in drive spin up time, etc) and is not recommended except for testing/experimental pools.
it's best to use the /dev/disk/by-id device names. they're based on the model and serial number of the drive so are guaranteed unique and will never change.
zpool replace tank sdd /dev/disk/by-id/ata-ST4000DM000-1F2168_Z300MHWF-part2

That doesn't seem to work. I used the above command to replace a disk and then after that process was complete I rebooted the system and saw the following:

# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 11h38m with 0 errors on Thu Aug 1 11:38:57 2013
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd2    ONLINE       0     0     0

It appears that ZFS is scanning /dev for the first device node with a suitable UUID.
e.g. my backup pool currently looks like this (i haven't bothered running zpool upgrade on it yet to take advantage of lz4 and other improvements)
  pool: backup
 state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on software that does not
        support feature flags.
  scan: scrub repaired 160K in 4h21m with 0 errors on Sat Jul 20 06:03:58 2013
config:

        NAME                           STATE     READ WRITE CKSUM
        backup                         ONLINE       0     0     0
          raidz1-0                     ONLINE       0     0     0
            ata-ST31000528AS_6VP3FWAG  ONLINE       0     0     0
            ata-ST31000528AS_9VP4RPXK  ONLINE       0     0     0
            ata-ST31000528AS_9VP509T5  ONLINE       0     0     0
            ata-ST31000528AS_9VP4P4LN  ONLINE       0     0     0

errors: No known data errors
Strangely that's not the way it works for me. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Sat, Aug 10, 2013 at 08:11:32PM +1000, Russell Coker wrote:
As an aside, I've got a 6yo Dell PowerEdge T105 whose CPU fan is becoming noisy. Any suggestions on how to find a replacement?
sorry, no idea.
How do you replace a zpool? [...] zpool that I plan to replace with 2 large disks. I'd like to create a new zpool with the 2 large disks, do an online migration of all the data, and then remove the 4 old disks.
there are two ways.
1. replace each drive individually [...] 2. create a new pool, 'zfs send' the old pool to the new one [...]
According to the documentation you can add a new vdev to an existing pool.
yep, but as you note, you can't remove a vdev from a pool...so it's not much use for replacing a zpool.
Using a pair of mirror vdevs would allow you to easily upgrade the filesystem while only replacing two disks. But the down-side to that is that if two disks fail that could still lose most of your data while a RAID-Z2 over 4 disks would give the same capacity as 2*RAID-1 but cover you in the case of two failures.
yep, but you've got the performance of raidz2 rather than mirrored pairs (which, overall, is not as bad as raid5/raid6 but still isn't great). that may be exactly what you want, but you need to know the tradeoffs for what you're choosing.
zpool replace tank sdd /dev/disk/by-id/ata-ST4000DM000-1F2168_Z300MHWF-part2
That doesn't seem to work. I used the above command to replace a disk and then after that process was complete I rebooted the system and saw the following:
firstly, you don't need to type the full /dev/disk/by-id/ path. just ata-ST4000DM000-1F2168_Z300MHWF-part2 would do. typing the full path isn't wrong - works just as well, it just takes longer to type and uglifies the output of zpool status and iostat.
secondly, why add a partition and not a whole disk?
third, is sdd2 the same drive/partition as ata-ST4000DM000-1F2168_Z300MHWF-part2? if so, then it added the correct drive.
to get the naming "correct"/"consistent", did you try 'zpool export tank' and then 'zpool import -d /dev/disk/by-id' ?
It appears that ZFS is scanning /dev for the first device node with a suitable UUID.
sort of. it remembers (in /etc/zfs/zpool.cache) what drives were in each pool and only scans if you tell it to with zpool import. if zpool.cache lists sda, sdb, sdc etc then that's how they'll appear in 'zpool status'. and if zpool.cache lists /dev/disk/by-id names then that's what it'll show. try hexdumping zpool.cache to see what i mean.
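for reference, a minimal sketch of that export/re-import (pool name 'tank' as in the example above; don't try this on a pool the running system needs mounted):

  zpool export tank
  zpool import -d /dev/disk/by-id tank
  zpool status tank   # should now list the ata-* names

zpool.cache is rewritten on import, so the by-id names stick across reboots.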
Strangely that's not the way it works for me.
i created mine with the ata-ST31000528AS_* names, so that's what's in my zpool.cache file. to fix on your system, export and import as mentioned above. craig -- craig sanders <cas@taz.net.au> BOFH excuse #108: The air conditioning water supply pipe ruptured over the machine room

On Mon, 15 Jul 2013, Craig Sanders wrote:
On Sun, Jul 14, 2013 at 05:25:29PM +1000, Brett Pemberton wrote:
to extend or resize the volume, you've got to tell it which disk(s) to allocate the free space from.
Nope. You can, but you don't have to.
oh well, consider that an example of lvm's difficulty to learn and remember. I've been using LVM for a lot longer than ZFS but I still find it difficult and complicated to work with, and the tools are clumsy and awkward.
Really? You haven't come across HP-UX's LVM or veritas cluster storage's lvm then :) I find zfs obtuse, awkward, inflexible, prone to failure and unreliable. One day when I got sick of the linux kernel's braindead virtual memory management, I tried to install debian kfreebsd, but gave up before I finished installation because not having lvm seemed so primitive. I was probably just trying to use the wrong tool for the job.
nowadays, i see them as first generation pool & volume management tools, functional but inelegant. later generation tools like btrfs and ZFS not only add more features (like compression, de-duping, etc etc) by blurring the boundaries between block devices and file systems but more importantly they add simplicity and elegance that is missing from the first generation.
Does anyone use zfs's dedup in practice? Completely useless. Disk is a heck of a lot cheaper than RAM. Easier to add more disk too compared to maxing out a server and having to replace *everything* to add more RAM (disappointingly, I didn't pay $250 to put 32GB in my new laptop. I hoped to replace the 16GB with 32GB when it becomes more economical, but when I opened it up, I found there were only 2 sockets. According to the manual that wasn't supplied with the laptop, the other two sockets are under the heatsink wedged underneath the keyboard, and not user-replaceable. According to ebay, within the past couple of weeks, they did start selling 16GB sodimms, but no idea whether my haswell motherboard will be able to take them when it comes time to upgrade (which is probably very soon, judging by how iceape's memory usage just bloated even further beyond reasonableness)).
and as for lvextend, lvreduce, and lvresize - it's completely insane for there to be three tools that do basically the same job. sure, i get that there are historical and compatibility reasons to keep lvextend and lvreduce around, but they should clearly be marked in the man page as "deprecated, use lvresize instead" (or even just "same as lvresize")
Au contraire. If you use lvresize habitually, one day you're going to accidentally shrink your LV instead of expand it, and the filesystem below it will then start accessing beyond end of device, with predictably catastrophic results. Use lvextend prior to resize2fs, and resize2fs shrink prior to lvreduce, and you'll be right. -- Tim Connors
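as a concrete illustration of that ordering, with made-up VG/LV names (vg01/home) and an ext4 filesystem:

  # growing: extend the LV first, then the filesystem (online growth is fine for ext4)
  lvextend -L +10G /dev/vg01/home
  resize2fs /dev/vg01/home

  # shrinking: shrink the filesystem first, then the LV, never the other way round
  umount /home
  e2fsck -f /dev/vg01/home
  resize2fs /dev/vg01/home 20G
  lvreduce -L 20G /dev/vg01/home
  mount /home

the point being that lvextend simply refuses to shrink, so a slip on the grow step errors out instead of truncating the filesystem underneath you.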

On Mon, Jul 15, 2013 at 03:33:09PM +1000, Tim Connors wrote:
On Mon, 15 Jul 2013, Craig Sanders wrote:
I've been using LVM for a lot longer than ZFS but I still find it difficult and complicated to work with, and the tools are clumsy and awkward.
Really? You haven't come across HP-UX's LVM or veritas cluster storage's lvm then :)
they're not that much worse than lvm. OK, so call them 1st generation, lvm gen 1.5 and lvm2 gen 1.75
I find zfs obtuse, awkward, inflexible, prone to failure and unreliable.
you must be using a different zfs than the one i'm using because that description is the exact opposite in every respect of my experience with zfs.
One day when I got sick of the linux kernel's braindead virtual memory management, I tried to install debian kfreebsd, but gave up before I finished installation because not having lvm seemed so primitive. I was probably just trying to use the wrong tool for the job.
probably. the freebsd kernel has zfs built in, so zfs would be right tool there.
Does anyone use zfs's dedup in practice? Completely useless.
yes, people do. it's very heavily used on virtualisation servers, where there are numerous almost-identical copies of the same VM with minor variations. it's also useful on backup servers where you end up with dozens or hundreds of copies of the same files (esp. if you're backing up entire systems, including OS)
Disk is a heck of a lot cheaper than RAM. Easier to add more disk too compared to
yep, that fits my usage pattern too...i don't have that much duplicate data. i'm probably wasting less than 200GB or so on backups of linux systems in my backup pool including snapshots, so it's not worth it to me to enable de-duping. other people have different requirements.
maxing out a server and having to replace *everything* to add more RAM (disappointingly, I didn't pay $250 to put 32GB in my new laptop. I hoped to replace the 16GB with 32GB when it becomes more economical, but when I opened it up, I found there were only 2 sockets. According to the manual that wasn't supplied with the laptop, the other two sockets are under the heatsink wedged underneath the keyboard, and not user-replaceable. According to ebay, within the past couple of weeks, they did start selling 16GB sodimms, but no idea whether my haswell motherboard will be able to take them when it comes time to upgrade (which is probably very soon, judging by how iceape's memory usage just bloated even further beyond reasonableness)).
i've found 16GB more than adequate to run 2 4TB zpools, a normal desktop with-the-lot (including firefox with dozens of windows and hundreds of tabs open at the same time) and numerous daemons (squid, apache, named, samba, and many others). of course, a desktop system is a lot easier to upgrade than a laptop if and when 32GB becomes both affordable and essential. FYI: NoScript and AdBlock Plus stop javascript from using too much RAM as well as spying on you.
Au contraire. If you use lvresize habitually, one day you're going to accidentally shrink your LV instead of expand it, and the filesystem below it will then start accessing beyond end of device, with predictably catastrophic results. Use lvextend prior to resize2fs, and resize2fs shrink prior to lvreduce, and you'll be right.
the risk of typing '-' rather than '+' does not scare me all that much. i tend to check and double-check potentially dangerous command lines before i hit enter, anyway. craig -- craig sanders <cas@taz.net.au> BOFH excuse #286: Telecommunications is downgrading.

Craig Sanders <cas@taz.net.au> wrote:
On Mon, Jul 15, 2013 at 03:33:09PM +1000, Tim Connors wrote: [...]
Does anyone use zfs's dedup in practice? Completely useless.
yes, people do. it's very heavily used on virtualisation servers, where there are numerous almost-identical copies of the same VM with minor variations.
it's also useful on backup servers where you end up with dozens or hundreds of copies of the same files (esp. if you're backing up entire systems, including OS)
Are you actually talking about retroactive deduplication here, or just COW? IMHO taking a little extra care when copying VM images or taking backups and ensuring use of snapshots and/or --reflink are usually good enough as opposed to going back and hunting for duplicate data. [...]
Au contraire. If you use lvresize habitually, one day you're going to accidentally shrink your LV instead of expand it, and the filesystem below it will then start accessing beyond end of device, with predictably catastrophic results. Use lvextend prior to resize2fs, and resize2fs shrink prior to lvreduce, and you'll be right.
the risk of typing '-' rather than '+' does not scare me all that much.
That of course assumes you're using a relative size. If you're using an absolute size this is far less obvious. That's the other thing: lvextend -L 32G on a 64G LV will do nothing, as would lvreduce -L 64G on a 32G LV. This is useful when ensuring an LV meets minimum size requirements and saves significant (potentially buggy) testing code.
i tend to check and double-check potentially dangerous command lines before i hit enter, anyway.
craig
-- Sent from my phone. Please excuse my brevity. Regards, Matthew Cengia

On Tue, Jul 16, 2013 at 05:49:49PM +1000, Matthew Cengia wrote:
it's also useful on backup servers where you end up with dozens or hundreds of copies of the same files (esp. if you're backing up entire systems, including OS)
Are you actually talking about retroactive deduplication here, or just COW? IMHO taking a little extra care when copying VM images or taking backups and ensuring use of snapshots and/of --reflink are usually good enough as opposed to going back and hunting for duplicate data.
neither. zfs has a de-dupe attribute which can be set. it then keeps a table of block hashes so that blocks about to be written which have a hash already in the table get substituted with a pointer to the original block. Like the compression attribute, it only affects writes performed after the attribute has been set so it isn't retroactive.
NOTE: de-duplication takes enormous amounts of RAM, and slows down write performance significantly. it's really not worth doing unless you are certain that you have a very large proportion of duplicate data....and once it has been turned on for a pool, it can't be turned off without destroying, re-creating, and restoring the pool (technically, you can turn off the dedup attribute but you'll still be paying the overhead price for it but without any benefit).
Here's the original post announcing de-duplication in zfs, from 2009 (still Sun, but old Sun webpages are now on oracle.com)
https://blogs.oracle.com/bonwick/entry/zfs_dedup
for more details, see http://zfsonlinux.org/docs.html which contains links to:
. Illumos ZFS Documentation
. ZFS on Linux User Guide
. Solaris ZFS Administration Guide
there are also numerous blog posts (of varying quality and cluefulness) describing how to do stuff with zfs or how something in zfs works, just a google search away.
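as a small sketch of the mechanics described above (pool and dataset names are hypothetical):

  zdb -S tank                      # simulate dedup on existing data and estimate the ratio first
  zfs set dedup=on tank/vmimages   # only affects blocks written from now on
  zpool list tank                  # the DEDUP column shows the pool-wide dedup ratio

running zdb -S before committing is worthwhile, since an estimated ratio close to 1.0x means you'd pay the RAM cost for nothing.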
That of course assumes you're using a relative size. If you're using an absolute size this is far less obvious. That's the other thing: lvextend -L 32G on a 64G LV will do nothing, as would lvreduce -L 64G on a 32G LV. This is useful when ensuring an LV meets minimum size requirements and saves significant (potentially buggy) testing code.
or i could just use zfs where there is no risk of making a mistake like that. IIRC, zfs will refuse to set the quota below current usage, but even if you could accidentally set the quota on a subvol to less than its current usage, zfs isn't going to start deleting or truncating files to make them fit.
it's a different conceptual model to LVM, anyway. With lvm you allocate chunks of disk to be dedicated to specific LVs. if you allocate too much to one LV, you waste space. if you allocate too little, you fill up the fs too quickly. detailed forward planning is almost essential if you don't want to be micro-managing storage (as in "i'm running out of space on /home so take 50GB from /usr and add it to /home") all the time.
with zfs, it's a global pool of space usable by all filesystems created on it - the default quota is no quota, so any and all sub-volumes can use up all available space. you can set quotas so that subvolumes (and children of that subvolume) are limited to using no more than X amount of the total pool (but there's still no guaranteed/reserved space - it's all shared space from the pool).
with zfs, if /home is running out of quota then I can just do something like 'zfs set quota=xxxx export/home' as long as there's still free space available in the pool.
you can also set a 'reservation' where a certain amount of storage in the pool is reserved for just one subvolume (and its descendants). reserved storage still comes from the pool, but no other subvolume or zvol can use any of it.
both reservations and quotas are soft - you can change them at any time, very easily and without any hassle. reservations are more similar to lvm space allocations than are quotas, but even reservations are flexible - you can increase, decrease, or unset them at whim.
craig -- craig sanders <cas@taz.net.au> BOFH excuse #67: descramble code needed from software company
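a short sketch of those two knobs using the export/home example above (the sizes are arbitrary):

  zfs set quota=200G export/home          # export/home and its children can use at most 200G of the pool
  zfs set reservation=50G export/home     # 50G of the pool is held back for export/home alone
  zfs get quota,reservation export/home
  zfs set quota=none export/home          # both are soft settings and can be unset at any time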

On Tue, 16 Jul 2013, Craig Sanders wrote:
On Mon, Jul 15, 2013 at 03:33:09PM +1000, Tim Connors wrote:
I find zfs obtuse, awkward, inflexible, prone to failure and unreliable.
you must be using a different zfs than the one i'm using because that description is the exact opposite in every respect of my experience with zfs.
I've got one for you; just came in on the mailing list (someone got stale mounts when their kernel nfs server restarted):

  ZFS and the NFS kernel server are not that tightly integrated. When you do a 'zfs set sharenfs="foo,bar" pool/vol' or a 'zfs share pool/vol', the zfs tools just give a call to the NFS kernel server saying 'Hey I want you to share this over NFS'. If the NFS kernel server is restarted it unshares everything and only reads back whatever is in /etc/exports. This is actually expected as NFS doesn't know anything about ZFS. Doing a 'zfs share -a' exports all your NFS/SMB shares again.

When you don't use your system native tools, or when someone tries to solve something at the wrong layer (zfs trying to mess around with NFS? That's almost as bad as the GUI at my previous workplace that tried to keep track of the state of something and then changing that state through a different layer to usual and then remembering the old state instead of directly querying it), you've got to expect problems.
This particular problem I get around by just ignoring ZFS's NFS settings. I have no idea what value they're meant to add.
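for reference, the mechanism being described boils down to something like this (the dataset name is hypothetical):

  zfs set sharenfs=on tank/media   # ask zfs to export the dataset via the kernel NFS server
  zfs share -a                     # re-export everything, e.g. after nfs-kernel-server restarts

which is why a restart of the NFS server alone, reading only /etc/exports, drops the zfs-managed shares until 'zfs share -a' is run again.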
One day when I got sick of the linux kernel's braindead virtual memory management, I tried to install debian kfreebsd, but gave up before I finished installation because not having lvm seemed so primitive. I was probably just trying to use the wrong tool for the job.
probably. the freebsd kernel has zfs built in, so zfs would be right tool there.
Not when I tried it, mind you.
Does anyone use zfs's dedup in practice? Completely useless.
yes, people do. it's very heavily used on virtualisation servers, where there are numerous almost-identical copies of the same VM with minor variations.
Even then, for it to be worthwhile, when you have 800TB of VMs deployed, you can't easily dedup that (although from memory, a TB of RAM only costs about $100K). For any VMs I've ever seen, the identical shared data isn't all that much (our templates are 40GB in size) compared to the half TB deployed on average per VM. Hardly seems worth all the pain.
it's also useful on backup servers where you end up with dozens or hundreds of copies of the same files (esp. if you're backing up entire systems, including OS)
On my home backup server, the backup software dedups at the file level (but shared between VMs - achieved by hardlinking all files detected to be the same, comparing actual content rather than hash collisions). It does a very good job according to its own stats. Block level dedup is a bit overkill except if you're backing up raw VM snapshot images.
yep, that fits my usage pattern too...i don't have that much duplicate data. i'm probably wasting less than 200GB or so on backups of linux systems in my backup pool including snapshots, so it's not worth it to me to enable de-duping.
other people have different requirements.
I acknowledge that there are some uninteresting systems out there that are massively duplicated SOEs with bugger-all storage. Might fit that pattern. And yet I believe our VDI appliances that they're trying to roll out at work *still* won't be backed by ZFS with dedup.
i've found 16GB more than adequate to run 2 4TB zpools, a normal desktop with-the-lot (including firefox with dozens of windows and hundreds of tabs open at the same time) and numerous daemons (squid, apache, named, samba, and many others).
of course, a desktop system is a lot easier to upgrade than a laptop if and when 32GB becomes both affordable and essential.
Unfortunately, little Atom NASes seem to max out at 4GB.
Au contraire. If you use lvresize habitually, one day you're going to accidentally shrink your LV instead of expand it, and the filesystem below it will then start accessing beyond end of device, with predictably catastrophic results. Use lvextend prior to resize2fs, and resize2fs shrink prior to lvreduce, and you'll be right.
the risk of typing '-' rather than '+' does not scare me all that much.
Like Matthew said, the issue is when you provide absolute size, and might get the units wrong. Woops, just shrank it to 1000th of its original size!
i tend to check and double-check potentially dangerous command lines before i hit enter, anyway.
No <up>-<up>-<enter> sudo reboots? ;P -- Tim Connors

On Wed, Jul 17, 2013 at 12:17:18PM +1000, Tim Connors wrote:
On Tue, 16 Jul 2013, Craig Sanders wrote:
On Mon, Jul 15, 2013 at 03:33:09PM +1000, Tim Connors wrote:
I find zfs obtuse, awkward, inflexible, prone to failure and unreliable.
you must be using a different zfs than the one i'm using because that description is the exact opposite in every respect of my experience with zfs.
I've got one for you; just came in on the mailing list (someone got stale mounts when their kernel nfs server restarted):
ZFS and the NFS kernel server are not that tightly integrated.
yep, that's a known issue - the integration with NFS and CIFS was designed for solaris, and it's a low-priority issue to get it as tightly integrated for linux. it also doesn't qualify as either "prone to failure" or "unreliable"
personally, i prefer defining NFS & CIFS exports in /etc/exports and smb.conf anyway. if i wanted to make use of the zfs attributes, it wouldn't be at all difficult to write a script to generate the appropriate config fragments for samba and exports. i'd start with a loop like this:

  for fs in $(zfs get sharesmb,sharenfs -t filesystem) ; do
    [...]
  done

i've already got code to generate smb.conf stanzas for websites in my vhosting system (so that web sites can be shared to appropriate users on the LAN or via VPN) - it's trivial. and generating /etc/exports is even easier.
BTW, speaking of samba, zfs snapshots also work nicely with shadow copy if you're backing up windows PCs.
http://forums.freebsd.org/showthread.php?t=32282
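a slightly fuller (and entirely hypothetical) version of that sketch, generating NFS export lines rather than leaving the [...] placeholder - the client network and export options here are made up and would need adjusting:

  #!/bin/sh
  # write one exports line for every zfs filesystem that has sharenfs enabled
  zfs list -H -t filesystem -o name,mountpoint,sharenfs |
  while read -r name mountpoint sharenfs; do
      case "$sharenfs" in
          off|-) continue ;;   # skip filesystems with sharing disabled or unset
      esac
      echo "$mountpoint 192.168.1.0/24(rw,no_subtree_check)"
  done > /etc/exports.zfs      # merge or include this into /etc/exports by hand

generating smb.conf stanzas would be the same loop with a different echo.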
Even then, for it to be worthwhile, when you have 800TB of VMs deployed, you can't easily dedup that (although from memory, a TB of RAM only costs about $100K).
no, de-duping would not be practical at that scale.
For any VMs I've ever seen, the identical shared data isn't all that much (our templates are 40GB in size) compared to the half TB deployed on average per VM. Hardly seems worth all the pain.
your usage obviously can't make use of de-duping - but for lots of virtualisation scenarios (e.g. web-hosting with lots of little web-servers) it can make sense.
it's also useful on backup servers where you end up with dozens or hundreds of copies of the same files (esp. if you're backing up entire systems, including OS)
On my home backup server, the backup software dedups at the file level (but shared between VMs - achieved by hardlinking all files detected to be the same, comparing actual content rather than hash collisions). It does a very good job according to its own stats.
yeah, well, i hate backuppc - every time i've tried to use it, it's been a disaster. rsync/zfs send and snapshots work far better for me. somebody, might have been you, mentioned in this thread that rsync is a worst-case load for zfs....not in my experience. but the link farms in backuppc absolutely kill performance on every filesystem i've tried it on, including zfs.
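for comparison, the zfs-send side of that workflow might look roughly like this (host, dataset and snapshot names are made up; the very first transfer needs a full send without -i):

  zfs snapshot tank/home@2013-07-26
  zfs send -i tank/home@2013-07-25 tank/home@2013-07-26 | ssh backuphost zfs receive backup/home
  zfs destroy tank/home@2013-07-25   # optionally prune old snapshots once they exist on the backup box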
I acknowledge that there are some uninteresting systems out there that are massively duplicated SOEs with bugger-all storage. Might fit that pattern. And yet I believe our VDI appliances that they're trying to roll out at work *still* won't be backed by ZFS with dedup.
i agree - there are only very limited circumstances where de-duping is worthwhile. in most cases, it's just too expensive - the cost of ram vs disk or ssd is prohibitive. but there *are* some scenarios where it is the right solution, or at least a good solution.
i've found 16GB more than adequate to run 2 4TB zpools, [...]
Unfortunately, little Atom NASes seem to max out at 4GB.
"Don't do that then" this kind of idiotic arbitrary limitation is the main reason i prefer to build my own than to buy an appliance. not only is it usually cheaper (sometimes far cheaper - half or a third of the total cost because appliance NASes are absurd over-priced for what you get), but you get much better hardware with much better specs. and upgradability and repairitude. (i also really like that I'm the one making the design trade-off decisions when building my own, rather than accepting someone else's choices. i can design the system to suit my exact needs rather than adapt my needs to fit the capabilities of a generic appliance) for qnap and similar applicances you're looking at over $500 for 2 bay NAS - two drives, that's crap. for 5 or 6 drive bays, it's $800 - $1000. 8 bays, around $1300. and that's with no drives. Who in their right mind would pay that when you can build your own 8 or even 12 bay (or more if you use 2.5" drives) ZFS system with a great motherboard and CPU and bucketloads of ram for between $500 and $800? you can even still use an atom or fusion cpu if low-power is a requirement (but, really, with that many drives the CPU power usage isn't that big a deal, and a better CPU helps with compression) for HP microsevers and the like, i posted about that in luv-talk a few days ago, with rough costings for a slightly cheaper and significantly better AMD Fusion based system.
i tend to check and double-check potentially dangerous command lines before i hit enter, anyway.
No <up>-<up>-<enter> sudo reboots? ;P
nope, i have shutdown and reboot excluded from bash history:

  export HISTIGNORE='shutdown*:reboot*'

having to retype shutdown commands saves me from the aaaaaaararggggghhhhhh! feeling of accidentally rebooting when i didn't mean to...i've done that too many times in the past. craig -- craig sanders <cas@taz.net.au> BOFH excuse #179: multicasts on broken packets

On Sun, 14 Jul 2013, Craig Sanders wrote:
On Sun, Jul 14, 2013 at 12:15:59PM +1000, Jeremy Visser wrote:
On 14/07/13 01:53, Craig Sanders wrote:
and, unlike lvm, with both btrfs and zfs, once you've created the pool you don't need to remember or specify the device names of the disks in the pool unless you're replacing one of them
Can you clarify what you mean by this?
My volume group is /dev/vg_glimfeather. I haven?t interacted with the raw PV, /dev/sda6, in over a year.
here's an example from the lvextend man page:
Extends the size of the logical volume "vg01/lvol10" by 54MiB on physical volume /dev/sdk3. This is only possible if /dev/sdk3 is a member of volume group vg01 and there are enough free physical extents in it:
lvextend -L +54 /dev/vg01/lvol10 /dev/sdk3
Yes, but why would you do that instead of just letting lvm pick where to put its extents? If I add a disk in lvm or zfs, I have to either zfs add tank <dev-id> (or is it zpool?) or vgextend <vg> <dev-node>. Same difference. In both cases, I have to read the manpage to remember quite how to use it.
and another from lvresize:
Extend a logical volume vg1/lv1 by 16MB using physical extents /dev/sda:0-1 and /dev/sdb:0-1 for allocation of extents:
lvresize -L+16M vg1/lv1 /dev/sda:0-1 /dev/sdb:0-1
Wow, damn. Maybe the manpage is deficient, or maybe you're reading the wrong section. It's *definitely* not that complicated. Just cargo cult off the net like everyone else! -- Tim Connors

On Mon, Jul 15, 2013 at 12:02:34PM +1000, Tim Connors wrote:
Wow, damn. Maybe the manpage is deficient, or maybe you're reading the wrong section. It's *definitely* not that complicated.
Just cargo cult off the net like everyone else!
cargo-culting from the man-page obviously doesn't work - as Brett highlighted, the examples given are way more complicated than they need to be. *all* of the examples in the lvextend & lvresize man pages have the PhysicalVolumePath on the command line. craig -- craig sanders <cas@taz.net.au> BOFH excuse #316: Elves on strike. (Why do they call EMAG Elf Magic)

On Mon, 15 Jul 2013, Craig Sanders wrote:
On Mon, Jul 15, 2013 at 12:02:34PM +1000, Tim Connors wrote:
Wow, damn. Maybe the manpage is deficient, or maybe you're reading the wrong section. It's *definitely* not that complicated.
Just cargo cult off the net like everyone else!
cargo-culting from the man-page obviously doesn't work - as Brett highlighted, the examples given are way more complicated than they need to be.
*all* of the examples in the lvextend & lvresize man pages have the PhysicalVolumePath on the command line.
Ah so they do. But the EBNF usage gives PhysicalVolumePath in optional square brackets. PhysicalVolumePath is of course useful if you want to move /usr mostly to ssd but keep the rest preferentially going to your 3TB spinning rust, but have it all in the same VG anyway. -- Tim Connors

One small problem with LVM (and a reason not to see it as a valid comparison to ZFS and BTRFS, which solve this) is that lvm snapshots suck. http://johnleach.co.uk/words/613/lvm-snapshot-performance TL;DR lvm snapshots rewrite the original data before writing the change.

On Mon, Jul 15, 2013 at 02:58:37PM +1000, Tim Connors wrote:
Ah so they do. But the EBNF usage gives PhysicalVolumePath in optional square brackets.
true, but when you're operating in full-on cargo-cult mode you skip all that verbose waffle and jump straight to the Examples section :-) craig -- craig sanders <cas@taz.net.au> BOFH excuse #152: My pony-tail hit the on/off switch on the power strip.

On Sun, 14 Jul 2013, Craig Sanders wrote:
On Sat, Jul 13, 2013 at 02:42:22PM +1000, Russell Coker wrote:
On Sat, 13 Jul 2013, Craig Sanders <cas@taz.net.au> wrote:
from the zfsonlinux faq:
http://zfsonlinux.org/faq.html#PerformanceConsideration
"Create your pool using whole disks: When running zpool create use whole disk names. This will allow ZFS to automatically partition the disk to ensure correct alignment. It will also improve interoperability with other ZFS implementations which honor the wholedisk property."
Other than the compatibility reason stated there, this FAQ, and the differing behaviour of the scheduling elevator dependent upon whether it's a disk or a partition, has always *smelt* to me. If dealing with a whole disk, it still creates -part1 & -part9 anyway.
Who's going to transfer a zpool of disks from a Linux box to a *BSD or Solaris system? Almost no-one.
read the first reason again, automatic alignment of the partitions increases performance - and it's the sort of thing that it's a PITA to do manually, calculating the correct starting locations for all partitions. it'll be a lot more significant to most linux users than the second reason...as you say, that's irrelevant to most people.
eh? parted has done proper alignment by default since just after the dinosaurs were wiped out. I was actually surprised the other day when ext4.ko issued an alignment warning about a filesystem on one of the new Oracle database VMs at work, and I looked at the many year old template it was deployed from, and alignment was already correct. The Oracle forums (I was surprised that there was actually an Oracle community who talk out in the open on web fora - I thought it was all enterprisey type people who don't like to talk) mentioned this was a known issue with how Oracle does redo log IO, and it was just a cosmetic issue.
and, unlike lvm, with both btrfs and zfs, once you've created the pool you don't need to remember or specify the device names of the disks in the pool unless you're replacing one of them - most operations (like creating, deleting, changing attributes of a volume or zvol) are done with just the name of the pool and volume. this leads to short, simple, easily understood command lines.
eh? That's pretty close to the occasions where you need to specify raw device names in lvm too. -- Tim Connors

On Sun, Jul 14, 2013 at 02:07:41PM +1000, Tim Connors wrote:
Other than the compatibility reason stated there, this FAQ, and the differing behaviour of the scheduling elevator dependent upon whether it's a disk or a partition, has always *smelt* to me.
it's not about whether it's a disk or a partition, it's about the alignment of the partition with a multiple of the sector size. 4K alignment works for both 512-byte sector drives and 4K or "advanced format" drives
If dealing with a whole disk, it still creates -part1 & -part9 anyway.
yep, it creates a GPT partition table so that the data partition is 4K-aligned. interesting discussion here: https://github.com/zfsonlinux/zfs/issues/94
in theory, zfs could use the raw disk and start at sector 0, but one of the risks there is that careless use of tools like grub-install (or fdisk/gdisk/parted etc) could bugger up your pool....and there's really no good reason not to use a partition table, at worst you lose the first 1MB (2048 x 512 bytes) of each disk.
eh? parted has done proper alignment by default since just after the dinosaurs were wiped out.
it's still quite common for drives to be partitioned so that the first partition starts on sector 63 rather than 2048 - which works fine for 512-byte sectors but not so great for 4K-sector drives. and i know i've spent significant amounts of time in the past, tediously calculating the optimum offsets for partitions to use with mdadm. craig -- craig sanders <cas@taz.net.au> BOFH excuse #424: operation failed because: there is no message for this error (#1014)
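for what it's worth, parted can both create and verify a 4K-friendly layout these days (/dev/sdX is a placeholder, and mklabel destroys the existing partition table):

  parted -a optimal /dev/sdX mklabel gpt
  parted -a optimal /dev/sdX mkpart primary 1MiB 100%
  parted /dev/sdX align-check optimal 1    # reports whether partition 1 is optimally aligned

starting the first partition at 1MiB (sector 2048) keeps it aligned for both 512-byte and 4K-sector drives, which is the same offset the zfs whole-disk partitioning aims for.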

On Sat, 13 Jul 2013, Craig Sanders wrote:
You need DKMS to build the kernel modules and then the way ZFS operates is very different from traditional Linux filesystems.
dkms automates and simplifies the entire compilation process, so it just takes time. it's not complicated or difficult.
It's unreliable. I've had to rescue my headless NAS more than once when zfs failed to build. In-kernel ext4 has the benefits of being a little more mature, and in-tree. -- Tim Connors

On Sun, Jul 14, 2013 at 01:57:16PM +1000, Tim Connors wrote:
[...dkms...]
It's unreliable. I've had to rescue my headless NAS more than once when zfs failed to build.
what, you rebooted without checking that the compile had completed successfully and the modules were correctly installed? craig -- craig sanders <cas@taz.net.au> BOFH excuse #442: Trojan horse ran out of hay
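for anyone else bitten by this, a quick sanity check before rebooting a box whose zfs modules come from dkms (module names as shipped by zfsonlinux at the time, i.e. spl and zfs):

  dkms status zfs
  dkms status spl
  modinfo zfs | head -n 3   # shows the zfs.ko the running kernel would load; add -k <version> for a newly installed kernel

if dkms reports anything other than "installed" against the kernel you're about to boot, fix that before rebooting.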
participants (12)
- Brett Pemberton
- Chris Samuel
- Colin Fee
- Craig Sanders
- Jason White
- Jeremy Visser
- Kevin
- Matthew Cengia
- Russell Coker
- Tim Connors
- Tom Robinson
- trentbuck@gmail.com