
On Sun, Oct 14, 2012 at 09:01:49PM -0700, Daniel Pittman wrote:
On Sun, Oct 14, 2012 at 7:25 PM, Russell Coker <russell@coker.com.au> wrote:
I'm looking at converting some Xen servers to ZFS. This includes a couple of servers for a reasonable size mail store (8,000,000 files and 600G of Maildir storage).
For much of the Xen on ZFS stuff I'll just use zvols for block devices and then use regular Linux filesystems such as Ext3 inside them. This isn't particularly efficient but for most DomUs it doesn't matter at all. Most of the DomUs have little disk access as they don't do much writing and have enough cache to cover most reads.
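For illustration only, the zvol-plus-ext3/4 arrangement described above might look roughly like this - 'tank' and 'domu1-root' are made-up names, not from the actual setup:

  # on the dom0/ZFS host: create a 20G zvol to use as the domU's disk
  zfs create -V 20G tank/domu1-root

  # in the Xen domU config the zvol is passed through like any other block device:
  #   disk = [ 'phy:/dev/zvol/tank/domu1-root,xvda,w' ]

  # inside the domU it is just a plain disk, so a normal mkfs works
  mkfs.ext4 /dev/xvda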
For the mail spool a zvol would be a bad idea, fsck on a 400G Ext3/4 filesystem is a bad thing and having the double filesystem overhead of Ext3/4 on top of a zvol is going to suck for the most disk intensive filesystem.
zvol is more like an LVM logical volume than a filesystem, so the overhead isn't nearly as much as this comment suggests.
yep.
That said, running ext3 (especially) or ext4 on top of it is going to be slower, and means you can't use the RAID style features of ZFS, and you give up object level checksums.
That's not exactly true - the guest won't know anything about the ZFS features, but the ZFS file server certainly will. The zvol is a chunk of allocated space from one of the zpools on the system. It can optionally be sparse-allocated (for thin provisioning, which greatly reduces space used, but performance can suffer).

The zvol has all the benefits of the zfs pool, including snapshotting and cloning, COW, error checking and recovery, and SSD read and write caching. The zvol can be backed up (or moved to another ZFS server) with 'zfs send' & 'zfs receive'. It can also be exported as an iscsi volume (e.g. so that a remote virtualisation cpu node can access the volume storage on the zfs file server).

Cloning is particularly useful for VMs - in short, set up a 'template' VM image, clean it up (e.g. run 'apt-get clean', delete /etc/udev/rules.d/70-persistent-net.rules, and so on), snapshot it, and then clone the snapshot whenever you need a new VM. You could even, for example, build a squeeze 6.0 VM template, snapshot it, then later boot it up and upgrade to 6.0.1, 6.0.2, ..., 6.0.6, and have a cleaned-up snapshot of each point release, any of which could be cloned into a new VM at any time.
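A rough sketch of that template/clone workflow - the pool and dataset names here are made up for illustration:

  # snapshot the cleaned-up template zvol ('tank/squeeze-template' is a placeholder name)
  zfs snapshot tank/squeeze-template@clean-6.0.1

  # clone it whenever a new VM is needed - the clone is created instantly and,
  # being copy-on-write, initially consumes almost no extra space
  zfs clone tank/squeeze-template@clean-6.0.1 tank/newvm-root

  # a zvol can also be replicated to another ZFS server for backup or migration
  zfs snapshot tank/newvm-root@backup1
  zfs send tank/newvm-root@backup1 | ssh otherhost zfs receive backuppool/newvm-root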
From the guest VM's point-of-view, it's just a disk with nothing special about it.
ext3 or ext4 performance in the guest will be similar to what you'd get if the guest were given an LVM lv. I haven't done any benchmarking to compare zvol with lv (mostly because I can't afford to add 4 drives to my ZFS server just to test LVM lv vs ZFS zvol performance), but I can give a subjective anecdote: the performance improvement from using a ZFS zvol instead of a qcow2 disk image is about the same as using an LVM lv instead of a qcow2 file, i.e. *much* faster.

If I had to guess, I'd say that there are probably some cases where LVM (with its nearly direct raw access to the underlying disks) would be faster than ZFS zvols, but in most cases ZFS's caching, compression, COW and so on would give the performance advantage to ZFS. ZFS's other advantages, especially lightweight and unlimited snapshots, make it worth using over LVM anyway.

FYI, here are the details on one of several zvols of various sizes that I have on my home ZFS server. They're all used by KVM virtual machines.

# zfs get all export/sid
NAME        PROPERTY              VALUE                  SOURCE
export/sid  type                  volume                 -
export/sid  creation              Sun Mar 25 14:19 2012  -
export/sid  used                  5.16G                  -
export/sid  available             694G                   -
export/sid  referenced            1.91G                  -
export/sid  compressratio         1.69x                  -
export/sid  reservation           none                   default
export/sid  volsize               5G                     local
export/sid  volblocksize          8K                     -
export/sid  checksum              on                     default
export/sid  compression           on                     inherited from export
export/sid  readonly              off                    default
export/sid  copies                1                      default
export/sid  refreservation        5.16G                  local
export/sid  primarycache          all                    default
export/sid  secondarycache        all                    default
export/sid  usedbysnapshots       0                      -
export/sid  usedbydataset         1.91G                  -
export/sid  usedbychildren        0                      -
export/sid  usedbyrefreservation  3.25G                  -
export/sid  logbias               latency                default
export/sid  dedup                 off                    default
export/sid  mlslabel              none                   default
export/sid  sync                  standard               default
export/sid  refcompressratio      1.69x                  -
export/sid  written               1.91G                  -

Note that this zvol has compression enabled - that would be a good choice for a mail server's storage disk, as mail is highly compressible. Depending on available RAM in the server and the kind of mail typically received (e.g. multiple copies of the same email), de-duping the zvol may also be worthwhile.
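As a concrete (hypothetical) example of that, a compressed, sparse zvol for the mail spool could be created along these lines - 'export' matches the pool above, but 'mailstore' and the size are placeholders:

  # create a 400G sparse (thin-provisioned), compressed zvol for the mail spool
  zfs create -s -V 400G -o compression=on export/mailstore

  # dedup can optionally be enabled per dataset, but the dedup table needs plenty of RAM
  zfs set dedup=on export/mailstore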
Any suggestions?
I would aim to run ZFS in the mail domU, and treat the zvol as a "logical volume" block device. You will have some overhead from the double checksums, but you get robust performance. It treats the underlying dom0 ZFS as a fancy LVM, essentially. You probably also need to allocate substantially more memory to the domU than you would otherwise.
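As a rough illustration of that suggestion (every pool, volume and device name below is made up), the setup would look something like:

  # on the dom0: carve a zvol out of the pool for the mail domU
  zfs create -V 600G tank/mailstore

  # pass it to the domU in the Xen config, e.g.
  #   disk = [ 'phy:/dev/zvol/tank/mailstore,xvdb,w' ]

  # inside the mail domU: build a single-device pool on top of that zvol
  zpool create mailpool /dev/xvdb
  zfs create -o compression=on mailpool/maildir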
That's really not needed. Most VMs just need fast, reliable storage, and don't know or care exactly what the underlying storage is (nor should they have to) - it's abstracted away as a virtio disk, /dev/vda or /dev/vdb, or as an iscsi disk.

There may be some exceptions where the VM needs to run ZFS itself on a bunch of zvols, but the only real use-case I've found is for experimenting with and testing zfs itself (e.g. I've created numerous zvols of a few hundred MB each and used them in a VM to create a zpool from them). Being able to snapshot and zfs send from within the VM itself could be useful; OTOH rsync provides similar incremental backups.

craig

-- 
craig sanders <cas@taz.net.au>