
On Fri, Apr 06, 2012 at 11:43:40AM -0400, Robin Humble wrote:
fair enough. I guess if remapped sectors are incrementing in the drive's SMART data then it's probably working, but if the drives are just timing out SCSI commands randomly (should they really do that?!) then that wouldn't show up.
they're consumer drives so, yeah, they have long retry times. enterprise drives are far quicker to return an error... consumer drives keep on trying (non-tunable), and the kernel has a few retries too (tunable), so it can sometimes take a minute or more. that can be enough for zfs (or the SAS card, if it's still flashed with raid firmware like mine is) to decide the drive is failing.
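for reference, a rough sketch of the kernel-side and drive-side knobs involved. the device name sda is just a placeholder, and most consumer drives will reject the ERC command entirely:

  # how long the kernel's SCSI layer waits for a command before giving up, in seconds (default 30)
  $ cat /sys/block/sda/device/timeout
  # raise it (as root) if the drive's internal retries regularly outlast the kernel
  $ echo 120 > /sys/block/sda/device/timeout

  # query, then set, the drive's error recovery control to a 7 second read/write limit.
  # most enterprise drives support this, most consumer drives don't.
  $ smartctl -l scterc /dev/sda
  $ smartctl -l scterc,70,70 /dev/sda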
any idea what's causing the deadlocks?
the traces, and some builds back and forward through git commits, give some idea. I'm guessing that attempts by lustre to send data using the zero-copy write hooks in zfs are racing with (non-zero-copy?) metadata (attr) updates. I'll email Brian and see if he can suggest something, and/or which mailing list, jira, or github issue to post to.
I assume regular ZFS is ok and stable because it doesn't attempt zero-copy writes.
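for what it's worth, a minimal sketch of how that kind of back-and-forward search through commits can be driven with git bisect. the repository path and the known-good revision are placeholders:

  $ cd zfs
  $ git bisect start
  $ git bisect bad                      # the current HEAD deadlocks under the write load
  $ git bisect good <last-known-good>   # a tag or commit that survived the same test
  # ... build, install, re-run the write test, then tell bisect the result:
  $ git bisect good                     # or 'git bisect bad', as appropriate
  $ git bisect reset                    # when finished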
ah, okay. well, at least you know it is a priority for the LLNL zfs + lustre project :)
only when writing, or reading too? random or sequential writes?
just writes. sometimes sequential and sometimes random. always with at least 32, and often 128, 1MB i/os in flight from the clients.
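for anyone else reading: those in-flight counts are client-side lustre tunables. a hedged sketch, assuming the stock osc parameter names (the values are just examples):

  # on a lustre client: RPCs kept in flight per OST, and dirty cache per OSC (in MB)
  $ lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb
  $ lctl set_param osc.*.max_rpcs_in_flight=32
  $ lctl set_param osc.*.max_dirty_mb=128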
you got any SSDs for ZIL?
so I guess you're running with ashift=12 and a limit on zfs_arc_max?
yep. 4GB zfs_arc_max. I set it to that when my machine only had 8GB RAM, and didn't bother changing it when i upgraded to 16GB. seems good for a shared desktop/server running a few bloated apps like chromium.

$ cat /etc/modprobe.d/zfs.conf
# use minimum 1GB and maximum of 4GB RAM for ZFS ARC
options zfs zfs_arc_min=1073741824 zfs_arc_max=4294967296

and yes, using ashift=12 because some of my drives are 4K-sector ("advanced format") drives, and i expect all of them will be replaced with 4K drives in the not-too-distant future.

on my zfs backup server with 24GB RAM at work, i have:

# use minimum 4GB and maximum of 12GB RAM for ZFS ARC
options zfs zfs_arc_min=4294967296 zfs_arc_max=12884901888

i'd be comfortable increasing that up to 20 or 22GB. usage is mostly writes (rsync backups), so it doesn't matter much.
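a small aside: the running ARC counters and the module parameter can both be inspected without a reboot (paths as on a stock zfsonlinux install):

  # current ARC size vs. its configured ceiling, in bytes
  $ awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
  # the module parameter is also exposed here; whether writing a new value takes
  # effect immediately or only after a module reload depends on the zfsonlinux release
  $ cat /sys/module/zfs/parameters/zfs_arc_max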
I'm also using zfs_prefetch_disable=1 (helps lustre reads), but apart from that no other zfs tweaks, no l2arc SSDs yet etc.
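for completeness, a sketch of where that setting would go, in the same modprobe.d style as the arc options quoted above:

  # disable the file-level prefetcher
  options zfs zfs_prefetch_disable=1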
L2ARC isn't any use for writes anyway, except indirectly as it helps reduce the read i/o load on the disks.

i'd try putting in a good fast SLC SSD as ZIL and see if that helps. it should certainly smooth out the metadata updates. for ZIL you probably don't need more than a few GB (maybe as much as 4 or even 8; OTOH it doesn't hurt to have more ZIL than you need, it just doesn't get used, so use the entire device), so the write speed of the SSD is far more important than its capacity. and for multiple simultaneous writers you're probably better off with multiple small ZIL devices than one larger one.

note, however, that larger SSDs tend to be faster than smaller ones due to the internal raid0-like configuration of the individual flash chips. 120GB and 240GB SSDs, for example, are noticeably faster than 60GB ones. the excess space could just be wasted, or perhaps the SSD could be partitioned and the excess used as L2ARC (but that would impact the ZIL partition's write performance, so it's probably better to just use the entire ssd as ZIL regardless of size).

fortunately, log (ZIL) and cache (L2ARC) devices can be added and removed at whim with zfs, so you can experiment easily with different configurations (a quick sketch of the commands is below, after the sig). watching 'zpool iostat -v' will tell you ZIL usage.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #327:

The POP server is out of Coke
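PS: the sketch mentioned above. the pool name 'tank' and the ssd-* device ids are placeholders:

  # add two small SSDs as separate log (ZIL) devices, and a third as cache (L2ARC)
  $ zpool add tank log /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B
  $ zpool add tank cache /dev/disk/by-id/ssd-C
  # log and cache devices can be removed again without touching the data vdevs
  $ zpool remove tank /dev/disk/by-id/ssd-A
  # per-vdev activity, including the log devices, refreshed every 5 seconds
  $ zpool iostat -v tank 5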