
On Sun, Oct 14, 2012 at 09:01:49PM -0700, Daniel Pittman wrote:
On Sun, Oct 14, 2012 at 7:25 PM, Russell Coker <russell@coker.com.au> wrote:
I'm looking at converting some Xen servers to ZFS. This includes a couple of servers for a reasonably sized mail store (8,000,000 files and 600G of Maildir storage).
For much of the Xen on ZFS stuff I'll just use zvols for block devices and then use regular Linux filesystems such as Ext3 inside them. This isn't particularly efficient but for most DomUs it doesn't matter at all. Most of the DomUs have little disk access as they don't do much writing and have enough cache to cover most reads.
For the mail spool a zvol would be a bad idea: fsck on a 400G Ext3/4 filesystem is a bad thing, and having the double filesystem overhead of Ext3/4 on top of a zvol is going to hurt on the most disk-intensive filesystem.
zvol is more like an LVM logical volume than a filesystem, so the overhead isn't nearly as much as this comment suggests.
yep.
That said, running ext3 (especially) or ext4 on top of it is going to be slower, means you can't use the RAID-style features of ZFS, and gives up object-level checksums.
that's not exactly true - the guest won't know anything about the ZFS features, but the ZFS file-server certainly will. the zvol will be a chunk of allocated space from one of the zpools on the system. it can optionally be sparse-allocated (for thin-provisioning, which greatly reduces the space used, but performance can suffer).

the zvol has all the benefits of the zfs pool, including snapshotting and cloning, COW, error checking and recovery, and SSD read and write caching. the zvol can be backed up (or moved to another ZFS server) with 'zfs send' & 'zfs receive'. it can also be exported as an iscsi volume (e.g. so that a remote virtualisation cpu node can access the volume storage on the zfs file server).

cloning is particularly useful for VMs - in short, set up a 'template' VM image, clean it up (e.g. run 'apt-get clean', delete /etc/udev/rules.d/70-persistent-net.rules, and so on), snapshot it, and then clone the snapshot whenever you need a new VM. you could even, for example, build a squeeze 6.0 VM template, snapshot it, then later boot it up and upgrade to 6.0.1, 6.0.2, ..., 6.0.6, and have a cleaned-up snapshot of each point release, any of which could be cloned into a new VM at any time.
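as a rough illustration, that template/clone workflow is only a handful of commands - the pool and volume names here are made up:

    # create a 10G zvol and install the template VM into it
    zfs create -V 10G export/squeeze-template
    # ...install Debian, clean it up, shut the VM down...
    # take a snapshot of the cleaned-up template
    zfs snapshot export/squeeze-template@clean
    # each new VM gets a writable clone of that snapshot, which initially
    # consumes almost no extra space thanks to COW
    zfs clone export/squeeze-template@clean export/new-vm-01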
From the guest VM's point-of-view, it's just a disk with nothing special about it.
ext3 or ext4 performance in the guest will be similar to performance if the guest were given an LVM lv. I haven't done any benchmarking to compare zvol with lv (mostly because I can't afford to add 4 drives to my ZFS server just to test LVM lv vs ZFS zvol performance), but I can give a subjective anecdote that the performance improvement from using a ZFS zvol instead of a qcow2 disk image is about the same as using an LVM lv instead of a qcow2 file. i.e. *much* faster.

if i had to guess, i'd say that there are probably some cases where LVM (with its nearly direct raw access to the underlying disks) would be faster than ZFS zvols but in most cases, ZFS' caching, compression, COW and so on would give the performance advantage to ZFS. ZFS's other advantages, especially lightweight and unlimited snapshots, make it worth using over LVM anyway.

FYI, here are the details on one of several zvols of various sizes that I have on my home ZFS server. They're all used by KVM virtual machines.

    # zfs get all export/sid
    NAME        PROPERTY              VALUE                  SOURCE
    export/sid  type                  volume                 -
    export/sid  creation              Sun Mar 25 14:19 2012  -
    export/sid  used                  5.16G                  -
    export/sid  available             694G                   -
    export/sid  referenced            1.91G                  -
    export/sid  compressratio         1.69x                  -
    export/sid  reservation           none                   default
    export/sid  volsize               5G                     local
    export/sid  volblocksize          8K                     -
    export/sid  checksum              on                     default
    export/sid  compression           on                     inherited from export
    export/sid  readonly              off                    default
    export/sid  copies                1                      default
    export/sid  refreservation        5.16G                  local
    export/sid  primarycache          all                    default
    export/sid  secondarycache        all                    default
    export/sid  usedbysnapshots       0                      -
    export/sid  usedbydataset         1.91G                  -
    export/sid  usedbychildren        0                      -
    export/sid  usedbyrefreservation  3.25G                  -
    export/sid  logbias               latency                default
    export/sid  dedup                 off                    default
    export/sid  mlslabel              none                   default
    export/sid  sync                  standard               default
    export/sid  refcompressratio      1.69x                  -
    export/sid  written               1.91G                  -

Note that this zvol has compression enabled - this would be a good choice for a mail server's storage disk - mail is highly compressible. depending on available RAM in the server and the kind of mail typically received (e.g. multiple copies of the same email), de-duping the zvol may also be worthwhile.
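creating a zvol like that for a mail-server guest, with compression on from the start, would look something like this (the 400G size and the dataset name are just examples):

    # '-s' makes the zvol sparse (thin-provisioned); compression applies to its blocks
    zfs create -s -V 400G -o compression=on export/mailstore
    # later, check how well the mail actually compresses
    zfs get compressratio export/mailstore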
Any suggestions?
I would aim to run ZFS in the mail domU, and treat the zvol as a "logical volume" block device. You will have some overhead from the double checksums, but still robust performance. Essentially it treats the underlying dom0 ZFS as a fancy LVM. You would probably also need to allocate substantially more memory to the domU than you would otherwise.
That's really not needed. Most VMs just need fast, reliable storage, and don't know or care what the underlying storage is (nor should they have to) - it's abstracted away as a virtio disk (/dev/vda or /dev/vdb) or as an iscsi disk.

There may be some exceptions where the VM needs to run ZFS itself on a bunch of zvols, but the only real use-case i've found is for experimenting with and testing zfs itself (e.g. i've created numerous zvols of a few hundred MB each and used them in a VM to create a zpool from them). being able to snapshot and zfs send within the VM itself could be useful. OTOH rsync provides a similar incremental backup.

craig

--
craig sanders <cas@taz.net.au>

On Wed, 17 Oct 2012, Craig Sanders wrote:
There may be some exceptions where the VM needs to run ZFS itself on a bunch of zvols, but the only real use-case i've found is for experimenting with and testing zfs itself (e.g. i've created numerous zvols of a few hundred MB each and used them in a VM to create a zpool from them)
being able to snapshot and zfs send within the VM itself could be useful. OTOH rsync provides a similar incremental backup.
I have a samba server, and it is very handy to clone and mount yesterday's snapshot to retrieve data quickly. Regards Peter
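For that kind of ad-hoc restore the clone step is only a couple of commands - the dataset and snapshot names here are made up:

    # clone yesterday's snapshot and give the clone its own mountpoint
    zfs clone -o mountpoint=/restore tank/samba@2012-10-16 tank/restore
    # ...copy out whatever the user needs, then discard the clone
    zfs destroy tank/restore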

On 2012-10-17 13:54, Peter Ross wrote:
On Wed, 17 Oct 2012, Craig Sanders wrote:
There may be some exceptions where the VM needs to run ZFS itself on a bunch of zvols, but the only real use-case i've found is for experimenting with and testing zfs itself (e.g. i've created numerous zvols of a few hundred MB each and used them in a VM to create a zpool from them)
being able to snapshot and zfs send within the VM itself could be useful. OTOH rsync provides a similar incremental backup.
I have a samba server, and it is very handy to clone and mount yesterday's snapshot to retrieve data quickly.
Peter, if you don't already know about it, and are running Windows desktops, you may like to investigate the Samba Shadow Copy VFS modules: http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/VFS.html#id265169...

It makes snapshots accessible via the "Previous Versions" tab of the Windows File Properties dialogue box.

--
Regards,
Matthew Cengia
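For ZFS-backed shares the usual approach (untested here; in particular the snapshot-name format is an assumption and must match how your snapshots are actually named) is the shadow_copy2 module pointed at ZFS's .zfs/snapshot directory, roughly:

    [data]
        path = /tank/samba/data
        vfs objects = shadow_copy2
        # ZFS exposes snapshots under .zfs/snapshot relative to the dataset mountpoint
        shadow:snapdir = .zfs/snapshot
        # example naming scheme, e.g. a snapshot called daily-2012-10-16
        shadow:format = daily-%Y-%m-%d
        shadow:sort = desc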

Hi Matthew,
if you don't already know about it, and are running Windows desktops, you may like to investigate the Samba Shadow Copy VFS modules: http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/VFS.html#id265169...
It makes snapshots accessible via the "Previous Versions" tab of the Windows File Properties dialogue box.
I wasn't aware of it. At the moment I do mounts on request only, and the retrieval is manual work by the user. I want to upgrade to Samba 4 soon; that may be part of the revamp then.

Thanks
Peter

On Wed, 17 Oct 2012, Craig Sanders <cas@taz.net.au> wrote:
From the guest VM's point-of-view, it's just a disk with nothing special about it.
ext3 or ext4 performance in the guest will be similar to performance if the guest were given an LVM lv.
In theory the write performance of Ext3/4 should benefit significantly from the way the zpool layer gathers writes into mostly contiguous batches, avoiding write seek time. Creating a file in Ext3/4 involves writing to the directory, the inode table, the block allocation structures, and the journal - all of which tend to be in different parts of the disk. With ZFS underneath it should theoretically be close to the overhead of a single journal entry, without seeking all over the disk to write the rest.

I haven't had a chance to test this theory due to a lack of hardware suitable for running ZFS. It seems that the minimum hardware for doing a test is a system with 4G of RAM, a 64-bit CPU, and at least 3 disks, and currently I don't have access to such a test system.
if i had to guess, i'd say that there are probably some cases where LVM (with its nearly direct raw access to the underlying disks) would be faster than ZFS zvols but in most cases, ZFS' caching, compression, COW and so on would give the performance advantage to ZFS.
If you want to optimise for read performance without caching (IE no L2ARC and a working set significantly bigger than RAM) then LVM would probably win in some ways.
Note that this zvol has compression enabled - this would be a good choice for a mail server's storage disk - mail is highly compressible. depending on available RAM in the server and the kind of mail typically received (e.g. multiple copies of the same email), de-duping the zvol may also be worthwhile.
The last time I checked the average message size on a medium size mail spool it was about 70K. The headers are essentially impossible to dedup as they differ in the final stage of delivery even if a single SMTP operation was used to send to multiple local users. Deduping the message body seems unlikely to provide a significant benefit as there usually aren't that many duplicates, not even when you count spam and jokes - I'm assuming that ZFS is even capable of deduplicating files which have the duplicate part at different offsets, but I don't care enough about this to even look it up.

For every server I run that has any duplicate content RAM is a more limited resource than disk space. For example the server which is full of raw files from digital cameras is never going to benefit from dedup even though it has enough RAM to run it. So there's no possibility of me gaining anything from it.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On 17/10/12 14:15, Russell Coker wrote:
Note that this zvol has compression enabled - this would be a good choice for a mail server's storage disk - mail is highly compressible. depending on available RAM in the server and the kind of mail typically received (e.g. multiple copies of the same email), de-duping the zvol may also be worthwhile.
Do be very careful with ZFS dedup -- as you increase the amount of data on the disk, the amount of memory required to hold the dedup tables goes up. Initially this reduces the amount of memory available for caching, which hurts performance. After a while, the amount of memory required *exceeds* the size of the ARC, and at this point writes effectively stop working. (They do continue to work, but it is soooooooooooooooooo sloooooooooow as to be useless.)

So when speccing up a machine for ZFS w/dedup, keep that in mind -- as a rough guide you need something like 5G of ARC per TB of data, and your ARC is a quarter of your system memory... so you'll need 20G of RAM for 1TB of deduped data! I've seen people suggest using an SSD as L2ARC, thus enabling the dedup tables to live there; I tried it myself and although it was better than before, it still wasn't really great.

Here's an article with more info: http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe
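The rule of thumb follows from the size of the dedup-table entries: each unique block needs very roughly 320 bytes of DDT in core, so at a 64K average block size 1TB of data is about 16 million blocks, or around 5G of table (those figures are approximate). If you want to see what dedup would actually buy you before turning it on, zdb can simulate it against existing data ('tank' is a placeholder pool name):

    # simulate dedup across the pool and print the projected DDT histogram
    # and dedup ratio (this itself takes time and RAM on a big pool)
    zdb -S tank
    # on a pool that already has dedup enabled, show the real DDT statistics
    zpool status -D tank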

On Wed, Oct 17, 2012 at 02:15:58PM +1100, Russell Coker wrote:
On Wed, 17 Oct 2012, Craig Sanders <cas@taz.net.au> wrote:
Note that this zvol has compression enabled - this would be a good choice for a mail server's storage disk - mail is highly compressible. depending on available RAM in the server and the kind of mail typically received (e.g. multiple copies of the same email), de-duping the zvol may also be worthwhile.
The last time I checked the average message size on a medium size mail spool it was about 70K.
compression would bring that down to (very roughly) an average of about 5-15K per message.
The headers are essentially impossible to dedup as they differ in the final stage of delivery even if a single SMTP operation was used to send to multiple local users. Deduping the message body seems unlikely to provide a significant benefit as there usually aren't that many duplicates, not even when you count spam and jokes -
the scenario I was thinking of was internal email memos sent to "all staff", with a stupidly large word .doc or .pdf file attached. for an ISP mail server, de-duping isn't likely to help much (if at all). For a small-medium business or corporate mail server, it could help a lot.
I'm assuming that ZFS is even capable of deduplicating files which have the duplicate part at different offsets, but I don't care enough about this to even look it up.
zfs de-duping is done at block level. if a block's hash is an exact match with another block's hash then it can be de-duped.
For every server I run that has any duplicate content RAM is a more limited resource than disk space. For example the server which is full of raw files from digital cameras is never going to benefit from dedup even though it has enough RAM to run it. So there's no possibility of me gaining anything from it.
me too. i don't use zfs de-dupe at all. it is, IMO, of marginal use. adding more disks (or replacing with larger disks) is almost always going to be cheaper and better. but there are some cases where it could be useful...so I don't want to dismiss it just because I have no personal need for it.

Editing large video files, perhaps. multiple cycles of edit & versioned save would use not much more space than the original file + the size of the diffs.

VMs are quite often touted as a good reason for de-duping - hundreds of almost identical zvols. I remain far from convinced that de-duping is the best use of available RAM on a virtualisation server, or that upgrading/adding disks wouldn't be better.

craig

--
craig sanders <cas@taz.net.au>

On Wed, 17 Oct 2012, Craig Sanders <cas@taz.net.au> wrote:
The last time I checked the average message size on a medium size mail spool it was about 70K.
compression would bring that down to (very roughly) an average of about 5-15K per message.
Yes, that could be a real win.

If multiple processes are doing synchronous writes at the same time, does ZFS bundle them into the same transaction? At busy times I have 12 processes doing synchronous delivery at once. If ZFS were to slightly delay a couple of them to create a bundle of 5+ synchronous file writes in the same operation it could improve overall performance and save bandwidth for the occasional read from disk.
The headers are essentially impossible to dedup as they differ in the final stage of delivery even if a single SMTP operation was used to send to multiple local users. Deduping the message body seems unlikely to provide a significant benefit as there usually aren't that many duplicates, not even when you count spam and jokes -
the scenario I was thinking of was internal email memos sent to "all staff", with a stupidly large word .doc or .pdf file attached.
for an ISP mail server, de-duping isn't likely to help much (if at all). For a small-medium business or corporate mail server, it could help a lot.
True. But if that sort of thing comprises a significant portion of your email then there are better ways of solving the problem - a wiki is often one part of the solution.
I'm assuming that ZFS is even capable of deduplicating files which have the duplicate part at different offsets, but I don't care enough about this to even look it up.
zfs de-duping is done at block level. if a block's hash is an exact match with another block's hash then it can be de-duped.
So I guess it does no good for email then unless your MTA stores the attachments as separate files (IE not Maildir).
me too. i don't use zfs de-dupe at all. it is, IMO, of marginal use. adding more disks (or replacing with larger disks) is almost always going to be cheaper and better. but there are some cases where it could be useful...so I don't want to dismiss it just because I have no personal need for it.
Presumably the Sun people who dedicated a lot of engineering and testing time to developing the feature had some reason to do so.
Editing large video files, perhaps. multiple cycles of edit & versioned save would use not much more space than the original file + the size of the diffs.
For uncompressed video that could be the case. One of my clients currently has some problems with that sort of thing. They are using local non-RAID storage on Macs and then saving the result to the file server because of file transfer problems with files >2G.
VMs are quite often touted as a good reason for de-duping - hundreds of almost identical zvols. I remain far from convinced that de-duping is the best use of available RAM on a virtualisation server, or that upgrading/adding disks wouldn't be better.
For a VM you have something between 500M and 5G of OS data; if it's closer to 5G then it's probably fairly usage-specific, so there is less to dedup. For most of the VMs I run the application data vastly exceeds the OS data, so the savings would be at most 10%. Not to mention the fact that the most common VM implementations use local storage, which means that with a dozen VMs on a single system running multiple distributions there is little opportunity for dedup. Finally, if you have 5G OS images for virtual machines then you could fit more than 500 such images on a 3TB disk, so even if you can save disk space it's still going to be easier and cheaper to buy more disk.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Wed, 17 Oct 2012, Craig Sanders wrote:
The headers are essentially impossible to dedup as they differ in the final stage of delivery even if a single SMTP operation was used to send to multiple local users. Deduping the message body seems unlikely to provide a significant benefit as there usually aren't that many duplicates, not even when you count spam and jokes -
the scenario I was thinking of was internal email memos sent to "all staff", with a stupidly large word .doc or .pdf file attached.
for an ISP mail server, de-duping isn't likely to help much (if at all). For a small-medium business or corporate mail server, it could help a lot.
I'm assuming that ZFS is even capable of deduplicating files which have the duplicate part at different offsets, but I don't care enough about this to even look it up.
zfs de-duping is done at block level. if a block's hash is an exact match with another block's hash then it can be de-duped.
And guess what happens when 200 bytes into the message, Delivered-To: changes from 123@abc.corp to 1234@abc.corp? Every subsequent byte is out by one and no subsequent block looks the same.
Editing large video files, perhaps. multiple cycles of edit & versioned save would use not much more space than the original file + the size of the diffs.
Would multiple large video edits that insert or delete a frame here or there result in a non-integer number of filesystem blocks being inserted? I can't imagine things lining up neatly on filesystem block boundaries like that.
VMs are quite often touted as a good reason for de-duping - hundreds of almost identical zvols. I remain far from convinced that de-duping is the best use of available RAM on a virtualisation server, or that upgrading/adding disks wouldn't be better.
We don't do it. Meh, maybe 20GB per VM in common, when the rest of the 500GB-20TB is unique on each system. I get the feeling parts of our SAN are ZFS underneath, but the SAN controller is going to have a bit of trouble keeping track of cache for hundreds of TB of disk.

--
Tim Connors

On Wed, 17 Oct 2012, Craig Sanders wrote:
From the guest VM's point-of-view, it's just a disk with nothing special about it.
ext3 or ext4 performance in the guest will be similar to performance if the guest were given an LVM lv.
I haven't done any benchmarking to compare zvol with lv (mostly because I can't afford to add 4 drives to my ZFS server just to test LVM lv vs ZFS zvol performance), but I can give a subjective anecdote that the performance improvement from using a ZFS zvol instead of a qcow2 disk image is about the same as using an LVM lv instead of a qcow2 file.
i.e. *much* faster.
if i had to guess, i'd say that there are probably some cases where LVM (with its nearly direct raw access to the underlying disks) would be faster than ZFS zvols but in most cases, ZFS' caching, compression, COW and so on would give the performance advantage to ZFS.
Just make sure the sum of the zvols remains below 80% of the total disk usage, I guess. ext4 + lvm can effectively use more than 99% of disk space (I've done it for years), but the moment you try to do lots of rewrites on a device on zfs, the lack of a free-space-cache that btrfs has means that the highly fragmented nature of the remaining 20% of space makes zfs completely unusable the first time you try it (multi-minute pauses and 250kB/s write rates vs 100MB/s. I quickly bought new disks).

I'm actually regretting my move to zfs because of it - I can hardly afford to repeat the month it took to rsync the backuppc pool to zfs in the first place. If ext4+md ain't broke, don't fix it. Sure it sucked, but the alternatives suck harder.

The 200 or so VMs at work are split across datastores, many of which are 99% full, and they're still surprisingly healthy. I'm guessing the SAN is *not* zfs based.

--
Tim Connors
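Keeping an eye on that is straightforward, since zpool list reports how full the pool is, e.g.:

    # the CAPACITY column is the percentage of the pool that is allocated;
    # the usual advice is to keep it below about 80%
    zpool list -o name,size,allocated,free,capacity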

On Tue, 30 Oct 2012, Tim Connors <tconnors@rather.puzzling.org> wrote:
if i had to guess, i'd say that there are probably some cases where LVM (with its nearly direct raw access to the underlying disks) would be faster than ZFS zvols but in most cases, ZFS' caching, compression, COW and so on would give the performance advantage to ZFS.
Just make sure the sum of the zvols remains below 80% of the total disk usage, I guess. ext4 + lvm can effectively use more than 99% of disk space (I've done it for years), but the moment you try to do lots of rewrites on a device on zfs, the lack of a free-space-cache that btrfs has means that the highly fragmented nature of the remaining 20% of space makes zfs completely unusable the first time you try it (multi-minute pauses and 250kB/s write rates vs 100MB/s. I quickly bought new disks).
That's one of several bad reports I've read about ZFS performance on mostly full filesystems. But it shouldn't be a big deal. I've got some servers that are all less than 25% full which I plan to convert to ZFS; even with snapshots I doubt that they will go over 50%.

For my home server I currently have 1TB of RAID-1 which is more than 90% full (really bad given all the different Ext4 filesystems, some of which are 99% full). When I convert that to ZFS on 3TB disks it will be a lot easier and I won't have the issue of filesystem A filling up while filesystem B is empty. If I upgrade my home server from its 3TB disks when they have 2.5TB of data, instead of waiting until they have 3TB of data, it won't be that much of a big deal.
I'm actually regretting my move to zfs because of it - I can hardly afford to repeat the month it took to rsync the backuppc pool to zfs in the first place. If ext4+md aint broke, don't fix it. Sure it sucked, but the alternatives suck harder.
Ext4+md is broken. Linux software RAID has no support for recovering from silent data corruption, not even on RAID-6. Such corruption is becoming more common as data volumes steadily increase. We need BTRFS/ZFS features for data integrity. We also need something better than Ext4 fsck times.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
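For what it's worth, checking for (and repairing) that kind of silent corruption on ZFS is a routine operation ('tank' is just a placeholder pool name):

    # read every block in the pool, verify checksums, and repair any bad
    # copies from the redundant ones
    zpool scrub tank
    # report scrub progress and any checksum errors that were found
    zpool status tank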

On Tue, Oct 30, 2012 at 08:37:39PM +1100, Tim Connors wrote:
Just make sure the sum of the zvols remains below 80% of the total disk usage, I guess.
yep, that's well documented. for read-mostly usage, it's not a problem...but write performance can really suck on ZFS as the disk gets close to full.
I'm actually regretting my move to zfs because of it - I can hardly afford to repeat the month it took to rsync the backuppc pool to zfs in the first place.
IMO, backuppc and zfs (or btrfs) are not a good mix. admittedly, i don't like backuppc much (my personal experiences with it have been pretty bad), but that's just my subjective preference.

the real problem is that backuppc is creating squillions of hard links in order to retain a backup history - which made sense in the days before filesystem (btrfs, zfs) or volume level (lvm) snapshots were available. in other words, it's solving a problem that isn't actually a problem with these filesystems.

you've got a multi-year investment in backuppc and obviously don't want to lose your backup history... my suggestion is that you investigate ways to convert your existing backuppc archive into a series of zfs snapshots. i doubt very much that there's an existing tool to do this, but it's probably not difficult to write a shell script to automate the process of iterating through the list of backuppc "snapshots": restore each one individually to a zfs filesystem, snapshot it, and repeat until they're all done. write the script so that it can be stopped and started at will (e.g. skip any restores that have already been done) so you can run it only during idle times (or just use ionice to make it a very low priority IO job). once it has completed, you'd use rsync+snapshot for your backups.

unless you used ZFS de-duping (with all the hardware requirements - massive amounts of RAM and L2ARC) you'd lose multi-host de-duping, but you'd keep de-duping on a per-backup-host basis due to the COW nature of ZFS and zfs snapshots.

this would, no doubt, take a LONG time to run... though probably nowhere near as long as it took you to rsync squillions of hard links (huge numbers of hard links are a problem for rsync - it has to keep track of them all in memory by inode number). you'd also have to have a second zpool to restore the backups to.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #350: paradigm shift...without a clutch
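a very rough sketch of that conversion script (all the paths, pool/filesystem names and the exact BackupPC_tarCreate invocation here are assumptions - check them against your own backuppc setup before trusting it):

    #!/bin/sh
    # convert the backuppc history for one host into a series of zfs snapshots:
    # restore each backup number in order into the same zfs filesystem, then
    # snapshot it. already-converted backups are skipped, so the script can be
    # stopped and restarted at will.
    HOST=somehost                      # backuppc host name (example)
    DEST=tank/backups/$HOST            # zfs filesystem to restore into (example)
    TOPDIR=/var/lib/backuppc           # backuppc data directory (Debian default)
    MNT=$(zfs get -H -o value mountpoint "$DEST")

    for num in $(ls "$TOPDIR/pc/$HOST" | grep '^[0-9][0-9]*$' | sort -n); do
        # skip any backup that has already been converted
        zfs list -H -t snapshot "$DEST@backuppc-$num" >/dev/null 2>&1 && continue
        # restore backup $num into the filesystem at idle IO priority.
        # note: files deleted between backups will linger unless you empty
        # $MNT first, at the cost of some COW sharing between snapshots.
        ionice -c3 /usr/share/backuppc/bin/BackupPC_tarCreate \
            -h "$HOST" -n "$num" -s / . | tar -x -C "$MNT"
        zfs snapshot "$DEST@backuppc-$num"
    done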

Craig Sanders <cas@taz.net.au> wrote:
IMO, backuppc and zfs (or btrfs) are not a good mix.
admittedly, i don't like backuppc much (my personal experiences with it have been pretty bad), but that's just my subjective preference.
the real problem is that backuppc is creating squillions of hard links in order to retain a backup history - which made sense in the days before filesystem (btrfs, zfs) or volume level (lvm) snapshots were available. in other words, it's solving a problem that isn't actually a problem with these filesystems.
Nowadays, I am in the habit of running git init in any directory in which I edit files and therefore wish to maintain a history. Thereafter, I simply start committing changes. A backup then comprises nothing more than a copy of the directory (with its .git subdirectory included).

Obnam is a relatively new backup tool that looks interesting. As with a file system, though, I would wait for new backup software to mature before entrusting data to it.
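That habit amounts to only a few commands per directory (the paths and backup host here are just examples):

    cd ~/documents/project
    git init
    git add -A
    git commit -m 'initial import'
    # ...edit files, then record each round of changes...
    git commit -a -m 'describe the change'
    # the backup is simply a copy of the directory, .git included
    rsync -a ~/documents/project/ backuphost:backups/project/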

Jason White <jason@jasonjgw.net> wrote:
Obnam is a relatively new backup tool that looks interesting. Like a file system, though, I would wait for new backup software to mature before entrusting data to it.
Also of interest is Btrfs send/receive: http://lwn.net/Articles/506244/ (According to the article, ZFS has it too, so it may solve problems for ZFS users interested in this discussion.)

On Fri, 2 Nov 2012, Jason White <jason@jasonjgw.net> wrote:
Also of interest is Btrfs send/receive: http://lwn.net/Articles/506244/
(According to the article, ZFS has it too, so it may solve problems for ZFS users interested in this discussion.)
The difference is that ZFS send/recv has been used in production for years while the BTRFS functionality is an experimental patch set that was released less than four months ago. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On 02/11/12 13:43, Jason White wrote:
Also of interest is Btrfs send/receive: http://lwn.net/Articles/506244/
This code is still in active development; changes for it were sent for 3.6 after the merge window had closed and the first RC had been released, which Linus rejected as being too late. There is at least one patch being discussed on the btrfs list as a possible submission for the next 3.7 RC.
(According to the article, ZFS has it too, so it may solve problems for ZFS users interested in this discussion.)
Interestingly, this has been stated on the list: http://permalink.gmane.org/gmane.comp.file-systems.btrfs/21015 # That's the whole point of the btrfs-send design: It's very easy # to receive on different filesystems. A generic receiver is in # preparation. And to make it even more generic: A sender using # the same stream format is also in preparation for zfs. So you may well be able to do bidirectional exchanges between btrfs and ZFS filesystems. cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

On Fri, 2 Nov 2012, Chris Samuel wrote:
(According to the article, ZFS has it too, so it may solve problems for ZFS users interested in this discussion.)
Interestingly, this has been stated on the list:
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/21015
# That's the whole point of the btrfs-send design: It's very easy # to receive on different filesystems. A generic receiver is in # preparation. And to make it even more generic: A sender using # the same stream format is also in preparation for zfs.
So you may well be able to do bidirectional exchanges between btrfs and ZFS filesystems.
Cute! And just to be sure, you're not just talking about snapshot incrementals, are you? You'd be able to do this for the entire filesystem full snapshot? Hopefully it'd be able to get close to wire/disk speed. -- Tim Connors

On 02/11/12 15:18, Tim Connors wrote:
Cute! And just to be sure, you're not just talking about snapshot incrementals, are you? You'd be able to do this for the entire filesystem full snapshot? Hopefully it'd be able to get close to wire/disk speed.
No idea sorry, you'd need to ask on the btrfs list I'm afraid. It's not something I'm following in detail, I just happened to notice it on the train on the way in this morning. cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Hi all,
Cute! And just to be sure, you're not just talking about snapshot incrementals, are you? You'd be able to do this for the entire filesystem full snapshot? Hopefully it'd be able to get close to wire/disk speed.
Just for ZFS (under FreeBSD): I have a throttle in between (15 Megabit/second) because it was "hammering" the disks to a standstill for the rest of the system.

Yep, this, the 20% free space and the generous use of RAM are my main issues with ZFS, but they are not big enough to outweigh the advantages I see in everyday administration. The snapshots (and zfs send/receive - I mirror every mission-critical filesystem to a "partner" machine) are part of the advantages; it is all very easy.

Regards
Peter
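The mirroring itself is just an incremental send piped through a rate limiter to the partner box - something along these lines (the snapshot names, hosts, and the use of pv for throttling are only examples):

    # initial full copy of the filesystem to the partner machine
    zfs snapshot tank/data@mirror-1
    zfs send tank/data@mirror-1 | ssh partner zfs receive -F tank/data
    # later: send only the changes since the previous mirrored snapshot,
    # limited to roughly 15 Mbit/s (~1.8 MB/s) so it doesn't hammer the disks
    zfs snapshot tank/data@mirror-2
    zfs send -i tank/data@mirror-1 tank/data@mirror-2 \
        | pv -q -L 1800k | ssh partner zfs receive tank/data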

On Fri, 2 Nov 2012, Peter Ross wrote:
Just for ZFS (under FreeBSD): I have a throttle in between (15 Megabit/second) because it was "hammering" the disks to a standstill for the rest of the system.
Yep, this, the 20% free space and the generous use of RAM are my main issues with ZFS, but they are not big enough to outweigh the advantages I see in everyday administration.
"Generous memory usage" indeed. My zfs server with 4GB of ram crashes once a week on average with zfs arc backtraces or just livelocks with txg_sync taking 100% CPU usage (with no zfs access, which is bad for a machine with zfs mounted /var by necessity of /boot being a 128MB flash IDE card and / being a 8GB USB nanostick). Each -rc release says "we finally fixed that txg_sync and arc_reclaim issue", but alas it never comes good. My favourite is where sometimes mysteriously arc usage increases tens of times beyond arc_max before the machine grinds to a halt. Someone ballsed up the memory accounting! (Just saying this just in case anyone thought zfs was preduction ready yet. Apparently the same issues sometimes affects bsd and indiana).
The snapshots (and zfs send/receive - I mirror every mission-critical filesystem to a "partner" machine) are part of the advantages; it is all very easy.
I am surprised people find zfs send so useful though. My only use for it would be if I could get data off it onto another filesystem at close to wire speed :)

I mean, I thought it was pretty routine back in days of yore to do an:

    lvm snapshot
    dd if=/dev/base/home-snap | ssh blah dd of=/dev/base/home

Ok, a couple more commands, but infinitely flexible.

--
Tim Connors

On Fri, 2 Nov 2012, Tim Connors wrote:
On Fri, 2 Nov 2012, Peter Ross wrote:
"Generous memory usage" indeed. My zfs server with 4GB of ram crashes once a week on average with zfs arc backtraces or just livelocks with txg_sync taking 100% CPU usage (with no zfs access, which is bad for a machine with zfs mounted /var by necessity of /boot being a 128MB flash IDE card and / being a 8GB USB nanostick). Each -rc release says "we finally fixed that txg_sync and arc_reclaim issue", but alas it never comes good. My favourite is where sometimes mysteriously arc usage increases tens of times beyond arc_max before the machine grinds to a halt. Someone ballsed up the memory accounting!
(Just saying this in case anyone thought zfs was production ready yet. Apparently the same issues sometimes affect BSD and OpenIndiana.)
I have been using ZFS under FreeBSD in production for ca. 18 months (and used it under OpenSolaris before, but there it was more a proof of concept - and then came Oracle..)

Amongst the machines are two with 3 or 4GB of RAM, but they are doing "low-volume work" (DNS, DHCP, a wiki, a mailing list server, a syslog server etc). The others have 16 GB each, and ARC is limited to 8 GB (vfs.zfs.arc_max=8589934592) because a lot of the stuff is more hungry for I/O than memory (proxy servers, file servers, database servers etc.)

I haven't seen an issue. At the beginning I was bothering a FreeBSD mailing list with a "ZFS problem" (when I backed up virtual disks via ssh) but in the end it was netstack related; since then my loader.conf has "net.graph.maxdata=65536" in it. Well, it wasn't easy to spot, and it helped to solve older cases.

ZFS adds a level of complexity, and I wouldn't mind having that "outsourced" to the periphery so it does not interfere with the "main system" (imagine a RAID controller with a ZFS API to the main system, and an internal processor+memory to serve it).
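In /boot/loader.conf those two tunables look something like this (the ARC cap is the 8 GB figure mentioned above; adjust it to your own RAM):

    # /boot/loader.conf
    vfs.zfs.arc_max="8589934592"    # cap the ARC at 8 GB
    net.graph.maxdata="65536"       # netgraph tunable that fixed the ssh backup problem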
I am surprised people find zfs send so useful though. My only use for it would be if I could get data off it onto another filesystem at close to wire speed :)
You are probably running different stuff than I do. Anyway, 15 MBit/sec is enough for a high-volume document server system I maintained before, and for many other cases. Someone has to produce 15 MBit of (useful ;-) data per second first.
I mean, I thought it was pretty routine back in days of yore to do an:

    lvm snapshot
    dd if=/dev/base/home-snap | ssh blah dd of=/dev/base/home
Ok, a couple more commands, but infinitely flexible.
Nothing wrong with that, but I don't need the ropes around it anymore, as long as I take care not to fall into the potential traps (now known to me).

Cheers
Peter
participants (8):
- Chris Samuel
- Craig Sanders
- Jason White
- Matthew Cengia
- Peter Ross
- Russell Coker
- Tim Connors
- Toby Corkindale