
Hi,

I've begun testing/commissioning a new storage server at home, trying to make it expandable in the future as 3TB drives become cheaper. I decided to give ZFS on FreeBSD a try, and that's what I'm running now; however, the issues I'm having apply equally to zfsonlinux / ZFS using FUSE, so I'm hoping to get some help here.

I knew up front that you can't add drives to an existing raidz vdev, so to make this expandable I took the approach of maxing out my case's drive bays, with the plan of simply replacing drives with bigger-capacity ones as they fail, or as new drives become cheap. My current system has 2 x 64GB SSDs for the OS and a ZIL (mirrored), and started with 8 x 500GB drives. A mix of PATA and SATA, which isn't ideal, but is what I had spare.

Yesterday I replaced one of the 500GB drives with a 1.5TB drive I also had spare. The issue I've hit is that although the zpool has seen an increase in capacity, the zfs filesystem has not. Googling and IRC have been unable to help with this, so I'm running it by LUV, since this portion of the system is OS independent.

My relevant history lines are:

    2012-07-17.20:04:15 zpool create -f storage raidz ada0 ada1 ada4 ada5 ada6 ada7 ada8 ada9
    2012-07-17.20:04:41 zfs create storage/tbla
    2012-07-17.20:04:52 zfs set sharenfs=on storage/tbla
    2012-07-17.20:04:59 zfs set atime=off storage/tbla
    2012-07-17.20:05:15 zpool add -f storage log mirror/gm0s2a
    2012-07-17.20:19:27 zfs set sharenfs=-maproot=0:0 storage/tbla
    2012-07-21.13:33:29 zpool offline storage ada8
    ... reboot, replace drive ...
    2012-07-21.15:20:40 zpool replace storage 5231172179341362844 ada8
    2012-07-21.16:06:17 zpool set autoexpand=on storage
    ... ^^^ note that this was run while the resilver was in operation ...
    2012-07-21.16:46:16 zpool export storage
    2012-07-21.16:46:34 zpool import storage
    2012-07-21.16:54:02 zpool set autoexpand=off storage
    2012-07-21.16:54:07 zpool set autoexpand=on storage
    2012-07-21.17:34:12 zpool scrub storage

    [root@swamp /usr/home/brett]# zpool list
    NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    storage  3.62T   701G  2.94T    18%  1.00x  ONLINE  -

Before the replace, storage was showing 3.1T size, I believe. So it jumped by 500GB, which seems about right.

    [root@swamp /usr/home/brett]# zfs list
    NAME           USED  AVAIL  REFER  MOUNTPOINT
    storage        609G  2.51T  50.4K  /storage
    storage/tbla   609G  2.51T   609G  /storage/tbla

    [root@swamp /usr/home/brett]# df -h
    Filesystem            Size    Used   Avail  Capacity  Mounted on
    /dev/mirror/gm0s1a     18G    3.1G     14G       18%  /
    devfs                 1.0k    1.0k      0B      100%  /dev
    storage               2.5T     50k    2.5T        0%  /storage
    storage/tbla          3.1T    609G    2.5T       19%  /storage/tbla

I'm also slightly confused as to why storage shows a size of 2.5T while storage/tbla has 3.1T, and clearly neither has the 3.62T which is in the zpool. Can anyone explain this to me?

Note that this is still in the testing phase, so I'm happy to start from scratch, destroy the FS, and try again with all 500GB drives, then do the replace again, if I missed a step somehow. It's more important to me that I get the procedure correct, so that once this is in production, I know what's going on.

cheers,

/ Brett

On Sun, Jul 22, 2012 at 03:12:22PM +1000, Brett Pemberton wrote:
My current system has 2 x 64GB SSDs for OS, and a ZIL (mirrored) and
is the ZIL mirrored, or just the OS partitions?

i'd set that up as maybe 10 or 20GB mirrored for the OS, plus a small (1-4GB) ZIL on each SSD, swap space, and the remainder of each SSD as two separate L2ARC cache devices. in total that would give 2-8GB of ZIL (very generous) and about 80GB of L2ARC read cache.

in linux terms:

    sda1 & sdb1: OS, 10-20GB (raid-1)
    sda2 & sdb2: swap, ??? (optional, non-mirrored)
    sda3 & sdb3: ZIL 1 and ZIL 2, 1-4GB (non-mirrored)
    sda4 & sdb4: cache 1 and cache 2 (L2ARC), about 40GB each (non-mirrored)

the biggest variable is how much you want for the mirrored OS partitions. that could be a lot smaller than 10-20GB if the system won't have much installed and/or you're planning to have /var and other stuff on your zpool. or it could be more if you intend to use the zpool just for bulk data storage, with everything else on the SSD.
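e.g. (untested sketch, and the pool/partition names here are just examples) the zfs-specific pieces of that layout would be attached with something like:

    # gmirror/md raid handles the mirrored OS and swap as usual; zfs only
    # needs the ZIL and cache partitions added (names are hypothetical)
    zpool add storage log /dev/sda3 /dev/sdb3      # two independent ZIL devices
    zpool add storage cache /dev/sda4 /dev/sdb4    # two L2ARC read-cache devices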
started with 8 x 500GB drives. A mix of PATA and SATA, which isn't ideal, but is what I had spare.
Yesterday I did a replace of one of the 500GB drives with a 1.5TB drive I also had spare. The issue I've hit is that although the zpool has seen an increase in capacity, the zfs filesystem has not.
like md raid, you won't see an increase in filesystem capacity until all drives in a vdev (not the entire pool, the vdev) have been upgraded. so if you have 8x500GB in one vdev then you have to upgrade all 8 drives before you see the extra space, but if you have 8 drives in 4 mirrored (raid1) vdevs, then you'll see the increased capacity after upgrading just two drives.

summary & examples:

- 8x500GB drives in one raidz-1 vdev is about 3.5TB. more storage capacity now, but to increase capacity you have to replace all 8 drives.

- 8x500GB drives in 4 mirrored vdevs is about 2TB. less storage, but cheaper to upgrade in stages later (two drives at a time). this will also give better performance.

- alternatively, 8 drives could also be set up as 2 raidz vdevs, giving about 3TB capacity. to increase that later you'd have to upgrade four drives at a time.

- and if you don't care about redundant copies of your data, you could have 8 vdevs with one drive each, giving about 4TB of space, upgradable one drive at a time. (BTW, this would be slightly better than just using md raid0, but given that the only sane use for raid0 is as fast scratch space for data you can afford to lose, it's hard to see much benefit in using zfs. more convenient drive replacement and maybe ssd caching might make it worthwhile.)

i'll leave it up to someone else to answer the rest of your questions, i'm tired and just can't force my brain to concentrate enough to get it right.
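a rough sketch of the first two layouts (pool/device names are just examples):

    # one 8-drive raidz-1 vdev: ~3.5TB usable, all 8 drives must grow before capacity does
    zpool create tank raidz da0 da1 da2 da3 da4 da5 da6 da7

    # four 2-drive mirror vdevs: ~2TB usable, grows two drives at a time
    zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7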
Note that this is still in the testing phase, so I'm happy to start from scratch, destroy the FS and try again with all 500GB drives, then do the replace again, if I missed a step somehow. It's more important to me that I get the procedure correct, so that once this is in production, I know what's going on.
good idea. it's a shame more people don't do this kind of experimentation before diving in.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #19: floating point processor overflow

On Sun, Jul 22, 2012 at 4:14 PM, Craig Sanders <cas@taz.net.au> wrote:
On Sun, Jul 22, 2012 at 03:12:22PM +1000, Brett Pemberton wrote:
My current system has 2 x 64GB SSDs for OS, and a ZIL (mirrored) and
is the ZIL mirrored, or just the OS partitions?
Right now, the entire block devices are mirrored. So yes, the ZIL is. And I'm not really intending to put in a read cache, as I doubt it'll match my use case. I wouldn't expect a speed-up, and performance isn't my goal anyway.
like md raid, you won't see an increase in filesystem capacity until all drives in a vdev (not the entire pool, the vdev) have been upgraded.
Ahh, I caught the wrong end of a stick, and assumed otherwise. Re-reading said stick, I believe the point at hand was JBOD, basically, so in that case you would see the increase on a replace.

Bit disappointing, as I had chalked that down as a major advantage over mdadm. I'm still confused as to why zpool list showed an increase in capacity after the replace operation.

Thinking about it further, I believe my best compromise might be to add one more drive into a 5.25" bay, and do:

3 x raidz vdevs, each with 3 drives:
[500, 500, 500], [500, 500, 500], [1500, 500, 500] ~ 3TB

And if I replace the two 500GB drives in the last vdev with the best bang for buck right now, 2TB, this will jump to 5.5TB (guessing). Giving me a fully populated max of 12TB. Seems a nice mix of upgrade-ability, storage and redundancy to me.

cheers,

/ Brett
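PS: Something like this, I assume (device names are made up):

    # three 3-drive raidz vdevs in one pool
    zpool create storage raidz ada0 ada1 ada2 raidz ada3 ada4 ada5 raidz ada6 ada7 ada8

    # or grow an existing pool one raidz vdev at a time
    zpool add storage raidz ada9 ada10 ada11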

On Mon, Jul 23, 2012 at 12:54:12PM +1000, Brett Pemberton wrote:
On Sun, Jul 22, 2012 at 4:14 PM, Craig Sanders <cas@taz.net.au> wrote:
On Sun, Jul 22, 2012 at 03:12:22PM +1000, Brett Pemberton wrote:
My current system has 2 x 64GB SSDs for OS, and a ZIL (mirrored) and
is the ZIL mirrored, or just the OS partitions?
Right now, the entire block devices are mirrored. So yes, the ZIL is.
there's no benefit in having the ZIL mirrored, but it's probably not worth the hassle of changing it.
Bit disappointing, as I had chalked that down as a major advantage over mdadm.
the advantage here over mdadm (and/or lvm) is that you don't have to stuff around with manually resizing the fs. if autoexpand is on, as soon as a vdev has usable extra capacity, the pool uses it. "usable" depends on the exact details of the vdev(s).
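roughly (untested, and pool/device names are just examples):

    # zfs: once every drive in a vdev is bigger, the extra space just appears
    zpool set autoexpand=on storage
    # or, per replaced drive, if autoexpand isn't set:
    zpool online -e storage ada8

    # the md/lvm equivalent is several manual steps, along the lines of:
    #   mdadm --grow /dev/md0 --size=max
    #   pvresize /dev/md0 && lvextend -l +100%FREE vg/lv && resize2fs /dev/vg/lv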
I'm still confused as to why zpool list showed an increase in capacity after the replace operation.
iirc zpool knows about the extra capacity of the 1.5TB drive, but it's not available to be used by a filesystem.
Thinking about it further, I believe my best compromise might be to add one more drive into a 5.25 bay, and do:
3 x raidz, each with 3 drives.
[500, 500, 500], [500, 500, 500], [1500, 500, 500] ~ 3TB
sounds good.
And if I replace the two drives in the last vdev with the best bang for buck right now, 2TB,
best bang for buck depends on the difference in price between 1.5TB and 2TB drives (if you can still get 1.5TB drives - i'm not sure if they're still available new - maybe you can find some old but unused stock on ebay).
this will jump to 5.5TB (guessing)
unless/until you also replace the 1.5TB with a 2TB, you will only be able to use 1.5TB of each drive in that vdev. raidz-1 is like raid-5, so gives n-1 capacity for n disks. so, ignoring any potential gain from zfs compression or de-duping[1], 3 x 1.5TB drives = 3TB raidz. that would be 1TB + 1TB + 3TB = total 5TB.

upgrading that 1.5TB to 2TB as well would give you 1+1+4 = 6TB... an extra TB for the price of a 2TB drive (approx $130). probably not worth it. imo hold onto the money until 2TB drives are a lot cheaper, or one of the drives dies and needs replacing, or until it's time to upgrade one of the other vdevs to 2 or 3TB drives.

[1] re: de-duping - not worth the bother, imo. with the amount of ram needed for it to be viable (estimates range from 1GB to 5GB RAM per TB of disk, just for the dedup table), it's cheaper just to have more & bigger disks. zfs compression, OTOH, is definitely worthwhile if most of the data you're storing is not already compressed.
Giving me a fully populated max of 12TB.
or more with 3 and 4TB drives.
Seems a nice mix of upgrade-ability, storage and redundancy to me.
yep.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #198: Post-it Note Sludge leaked into the monitor.

[1] re: de-duping - not worth the bother, imo. with the amount of ram
needed for it to be viable (estimates range from 1GB to 5GB RAM per TB of disk, just for the dedup table), it's cheaper just to have more & bigger disks.
I'm actually about to give that a test, enabling it for a specific backup section of my fs, which may have a few dupes. I just want to see exactly how much RAM it will now use, and how much the dedup helps.

I'm doing this by creating a new fs for this section, moving the files into there, and then turning dedup on for that FS. I presume this is the proper method.

That section is around 400GB. The machine in question currently has 16GB of RAM, so it'll be interesting to see how things go with it on.
zfs compression, OTOH, is definitely worth-while if most of the data you're storing is not already compressed.
Again, this will be limited to certain directories. Is the best practice to create separate filesystems for those areas?

/ Brett

On Mon, Jul 23, 2012 at 02:44:26PM +1000, Brett Pemberton wrote:
I'm doing this by creating a new fs for this section, moving the files into there, and then turning dedup on for that FS. Presume this is the proper method.
almost. turn de-dupe on first. dupe-checking is done at time of write. same applies to enabling compression (only files written after compression is enabled will be compressed), and to balancing of files over all spindles if you add another vdev to a zpool (btrfs has an auto-rebalance option, zfs doesn't. it's the one nice thing that btrfs has that zfs doesn't. OTOH zfs can be trusted not to lose your data and btrfs can't yet).

you could manually de-dupe, compress, rebalance, etc. by writing a script to copy & delete each file, but the technical term for this procedure is "a massive PITA" :)
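i.e. something like this (dataset name is just an example):

    zfs create storage/backups
    zfs set dedup=on storage/backups          # set before any data is written
    zfs set compression=on storage/backups    # likewise for compression
    # now move the files in; only blocks written from this point on
    # get dedup-checked and compressed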
That section is around 400GB. The machine in question currently has 16GB of RAM, so it'll be interesting to see how things go with it on.
planning to post your findings here? i'd be interested to read how it turns out.
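in case it helps, the effect can be checked afterwards with something like:

    zpool list storage    # the DEDUP column shows the pool-wide dedup ratio
    zdb -DD storage       # dedup table details, including its in-core size
    zdb -S storage        # or simulate the likely ratio before enabling it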
zfs compression, OTOH, is definitely worth-while if most of the data you're storing is not already compressed.
Again, this will be limited to certain directories. Is the best practice to create separate filesystems for those areas?
you enable/disable compression on a per-filesystem basis, so yes.

note, however, that the one minor drawback with multiple filesystems vs just a subdirectory is that even though the files are on the same physical disks, moving files from one zfs fs to another is a copy-and-delete operation (so time-consuming), i.e. the same as having multiple filesystems on partitions or LVM volumes.

craig

--
craig sanders <cas@taz.net.au>
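PS: a minimal sketch of the per-filesystem approach (dataset names are just examples):

    zfs create storage/photos                # already-compressed data: leave compression off
    zfs create storage/docs
    zfs set compression=on storage/docs      # only this dataset gets compressed
    zfs get compressratio storage/docs       # later: see how much it's actually saving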

On Mon, 23 Jul 2012, Craig Sanders <cas@taz.net.au> wrote:
Right now, the entire block devices are mirrored. So yes, the ZIL is.
there's no benefit in having the ZIL mirrored, but it's probably not worth the hassle of changing it.
http://en.wikipedia.org/wiki/Zfs#ZFS_cache:_L2ARC.2C_ZIL

The paragraph below is from the above Wikipedia page. It seems that mirroring the log device is a really good idea, particularly if it's an SSD. Also it would be interesting to know whether the zfsonlinux code isn't one of the "earlier versions of ZFS" mentioned.

    The write SSD cache is called the Log Device, and it is used by the ZIL (ZFS Intent Log). ZIL basically turns synchronous writes into asynchronous writes, which helps e.g. NFS or databases.[39] All data is written to the ZIL like a journal log, but only read after a crash. Thus, the ZIL data is normally never read. Every once in a while, the ZIL will flush the data to the zpool, this is called Transaction Group Commit. In case there is no separate log device added to the zpool, a part of the zpool will automatically be used as ZIL, thus there is always a ZIL on every zpool. It is important that the log device use a disk with low latency; for superior performance a disk consisting of battery backed up RAM, such as the ZeusRAM, should be used. Because the log device is written to a lot, an SSD disk will eventually be worn out, but a RAM disk will not. If the log device is lost, it is possible to lose the latest writes, therefore the log device should be mirrored. In earlier versions of ZFS, loss of the log device could result in loss of the entire zpool, therefore one should upgrade ZFS if planning to use a separate log device.

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

On Mon, Jul 23, 2012 at 04:41:53PM +1000, Russell Coker wrote:
there's no benefit in having the ZIL mirrored, but it's probably not worth the hassle of changing it.
http://en.wikipedia.org/wiki/Zfs#ZFS_cache:_L2ARC.2C_ZIL
The paragraph below is from the above Wikipedia page. It seems that mirroring the log device is a really good idea, particularly if it's an SSD.
nice to know, i didn't realise that. will have to look into this in more detail, might have to get another SSD so I can mirror my ZIL.

it's probably not urgent though - SSD erase lifetimes are a lot better than they used to be, and they also have good wear-leveling algorithms built in... and most SSDs reserve a sizable percentage of their capacity to 'replace' worn out sections (which is one of the reasons you see 120GB SSDs rather than 128GB).
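if i do, it should just be a matter of attaching a partition on the new SSD to the existing log device, something like (untested, device names made up):

    # turn an existing single log device into a mirrored pair
    zpool attach tank sda3 sdb3

    # (or, when adding a log from scratch, add it mirrored in one step)
    # zpool add tank log mirror sda3 sdb3

    zpool status tank    # the log should then show up as a mirror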
Also it would be interesting to know whether the zfsonlinux code isn't one of the "earlier versions of ZFS" mentioned.
nope. zfsonlinux is based on the latest version 28 ZFS, the most recent (and probably last) open source release from Oracle... the same as what illumos and recent versions of freebsd have.

due to oracle's abandonment of open solaris and related technologies, illumos etc. have effectively forked zfs, and all future open-source improvements will come from there (the zfsonlinux project is a participant in that process).

craig

--
craig sanders <cas@taz.net.au>