
I've been running a CentOS 6.2 system for two months now. I've been using XFS for the /home file system since I had the impression it would be better for large file systems by avoiding long fsck times. At the same time the nvidia driver has proved to be buggy and has crashed the system several times.

XFS has the habit of zeroing out some files each time there is a crash. This would be understandable for files that were being written around the time of the crash. But I've had files erased that were created hours before the crash and were read-only after creation.

I've set some sysctl parameters down to 5 seconds to flush more frequently:

    fs.xfs.xfssyncd_centisecs = 500
    fs.xfs.xfsbufd_centisecs = 100
    fs.xfs.age_buffer_centisecs = 500
    fs.xfs.filestream_centisecs = 500

(The file system is mounted with options "defaults,relatime")

You wouldn't expect files last modified more than 12 hours before the crash to be erased, but that is what has just happened this Friday evening. Perhaps there is some care and feeding of XFS that I'm not doing? Should I be using xfs_repair?

I expect I'll just switch to ext4.

--
Anthony Shipman                    Mamas don't let your babies
als@iinet.net.au                   grow up to be outsourced.
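(For reference, a sketch of how such settings are typically applied; the values are the ones above. Changes made with sysctl -w are runtime-only, so persistent copies would also go in /etc/sysctl.conf:)

    # apply one setting at runtime
    sysctl -w fs.xfs.xfssyncd_centisecs=500
    # persist across reboots: add the four lines above to /etc/sysctl.conf, then reload
    sysctl -p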

Anthony Shipman <als@iinet.net.au> wrote:
XFS has the habit of zeroing out some files each time there is a crash. This would be understandable for files that were being written around the time of the crash. But I've had files be erased that were created hours before the crash and were read-only after creation.
I've been running XFS for years on several machines and never seen this. I would suggest taking it up on the XFS list, because it isn't normal behaviour.

On Sat, Mar 24, 2012 at 2:11 AM, Anthony Shipman <als@iinet.net.au> wrote:
At the same time the nvidia driver has proved to be buggy and has crashed the system several times.
What nvidia driver? What version? I've been running 280.13, and I have to say I haven't seen a crash due to nvidia in literally years. Perhaps the hardware is to blame? / Brett

On Sat, 24 Mar 2012 10:57:42 am Brett Pemberton wrote:
On Sat, Mar 24, 2012 at 2:11 AM, Anthony Shipman <als@iinet.net.au> wrote:
At the same time the nvidia driver has proved to be buggy and has crashed the system several times.
What nvidia driver? What version?
I've been running 280.13, and I have to say I haven't seen a crash due to nvidia in literally years. Perhaps the hardware is to blame?
/ Brett
This is the typical kernel backtrace for the oops that precedes the system crash.

Mar 24 12:25:57 kernel: ? warn_slowpath_common+0x87/0xc0
Mar 24 12:25:57 kernel: ? match_pci_dev_by_id+0x0/0x70
Mar 24 12:25:57 kernel: ? os_map_kernel_space+0x85/0xf0 [nvidia]
Mar 24 12:25:57 kernel: ? warn_slowpath_null+0x1a/0x20
Mar 24 12:25:57 kernel: ? __ioremap_caller+0x35f/0x390
Mar 24 12:25:57 kernel: ? pci_conf1_read+0xc3/0x120
Mar 24 12:25:57 kernel: ? pci_conf1_read+0xc3/0x120
Mar 24 12:25:57 kernel: ? ioremap_cache+0x14/0x20
Mar 24 12:25:57 kernel: ? os_map_kernel_space+0x85/0xf0 [nvidia]
Mar 24 12:25:57 kernel: ? _nv014557rm+0xeb/0x10b [nvidia]
Mar 24 12:25:57 kernel: ? _nv009538rm+0x89/0x142 [nvidia]
Mar 24 12:25:57 kernel: ? _nv014127rm+0xb8/0x102 [nvidia]
Mar 24 12:25:57 kernel: ? _nv014167rm+0x58/0x9e [nvidia]
Mar 24 12:25:57 kernel: ? _nv014137rm+0xbe/0x2f0 [nvidia]
Mar 24 12:25:57 kernel: ? _nv014172rm+0xab/0x174 [nvidia]
Mar 24 12:25:57 kernel: ? _nv014136rm+0x50/0x5d [nvidia]
Mar 24 12:25:57 kernel: ? _nv014112rm+0x9ef/0xb29 [nvidia]
Mar 24 12:25:57 kernel: ? _nv012389rm+0x174/0x662 [nvidia]
Mar 24 12:25:57 kernel: ? _nv012389rm+0xf2/0x662 [nvidia]
Mar 24 12:25:57 kernel: ? _nv003997rm+0x1e5/0x1deb [nvidia]
Mar 24 12:25:57 kernel: ? _nv004022rm+0xa96e/0xd078 [nvidia]
Mar 24 12:25:57 kernel: ? _nv004022rm+0x9070/0xd078 [nvidia]
Mar 24 12:25:57 kernel: ? _nv009823rm+0x25/0x40 [nvidia]
Mar 24 12:25:57 kernel: ? _nv014614rm+0x7c8/0x942 [nvidia]
Mar 24 12:25:57 kernel: ? _nv001086rm+0x522/0x7a1 [nvidia]
Mar 24 12:25:57 kernel: ? rm_init_adapter+0xae/0x1bb [nvidia]
Mar 24 12:25:57 kernel: ? enable_irq+0x64/0xa0
Mar 24 12:25:57 kernel: ? nv_kern_open+0x494/0x7f0 [nvidia]
Mar 24 12:25:57 kernel: ? chrdev_open+0x125/0x230
Mar 24 12:25:57 kernel: ? chrdev_open+0x0/0x230
Mar 24 12:25:57 kernel: ? __dentry_open+0x10a/0x360
Mar 24 12:25:57 kernel: ? selinux_inode_permission+0x72/0xb0
Mar 24 12:25:57 kernel: ? security_inode_permission+0x1f/0x30
Mar 24 12:25:57 kernel: ? nameidata_to_filp+0x54/0x70
Mar 24 12:25:57 kernel: ? do_filp_open+0x6c0/0xd60
Mar 24 12:25:57 kernel: ? alloc_fd+0x92/0x160
Mar 24 12:25:57 kernel: ? do_sys_open+0x69/0x140
Mar 24 12:25:57 kernel: ? sys_chown+0x68/0xa0
Mar 24 12:25:57 kernel: ? sys_open+0x20/0x30
Mar 24 12:25:57 kernel: ? system_call_fastpath+0x16/0x1b

/var/log/messages had this warning a little later:

Mar 24 12:27:49 kernel: WARNING: at kernel/sched.c:5914 thread_return+0x232/0x79d()
Mar 24 12:27:49 kernel: ? warn_slowpath_common+0x87/0xc0
Mar 24 12:27:49 kernel: ? warn_slowpath_null+0x1a/0x20
Mar 24 12:27:49 kernel: ? thread_return+0x232/0x79d
Mar 24 12:27:49 kernel: ? schedule_hrtimeout_range+0x13d/0x160
Mar 24 12:27:49 kernel: ? add_wait_queue+0x46/0x60
Mar 24 12:27:49 kernel: ? __pollwait+0x75/0xf0
Mar 24 12:27:49 kernel: ? poll_schedule_timeout+0x39/0x60
Mar 24 12:27:49 kernel: ? do_sys_poll+0x45b/0x520
Mar 24 12:27:49 kernel: ? __pollwait+0x0/0xf0
Mar 24 12:27:49 kernel: ? pollwake+0x0/0x60
Mar 24 12:27:49 kernel: ? pollwake+0x0/0x60
Mar 24 12:27:49 kernel: ? pollwake+0x0/0x60
Mar 24 12:27:49 kernel: ? pollwake+0x0/0x60
Mar 24 12:27:49 kernel: ? pollwake+0x0/0x60
Mar 24 12:27:49 kernel: ? sock_aio_write+0x0/0x170
Mar 24 12:27:49 kernel: ? do_sync_readv_writev+0xfb/0x140
Mar 24 12:27:49 kernel: ? handle_mm_fault+0x1e4/0x2b0
Mar 24 12:27:49 kernel: ? evdev_read+0xdf/0x280
Mar 24 12:27:49 kernel: ? selinux_file_permission+0xfb/0x150
Mar 24 12:27:49 kernel: ? security_file_permission+0x16/0x20
Mar 24 12:27:49 kernel: ? vfs_read+0xb5/0x1a0
Mar 24 12:27:49 kernel: ? sys_read+0x62/0x90
Mar 24 12:27:49 kernel: ? sys_poll+0x7c/0x110

--
Anthony Shipman                    Mamas don't let your babies
als@iinet.net.au                   grow up to be outsourced.

On Sat, 24 Mar 2012 10:57:42 am Brett Pemberton wrote:
On Sat, Mar 24, 2012 at 2:11 AM, Anthony Shipman <als@iinet.net.au> wrote:
At the same time the nvidia driver has proved to be buggy and has crashed the system several times.
What nvidia driver? What version?
I've been running 280.13, and I have to say I haven't seen a crash due to nvidia in literally years. Perhaps the hardware is to blame?
/ Brett
I've discovered more error messages in /var/log/messages just before the crash:

Mar 24 03:41:16 newpc kernel: NVRM: Xid (0000:01:00): 13, 0005 00000000 00009297 00001c08 0923e460 00000000
Mar 24 03:41:16 newpc kernel: NVRM: Xid (0000:01:00): 32, Channel ID 00000005 intr 00040000
Mar 24 03:41:16 newpc kernel: NVRM: Xid (0000:01:00): 31, Ch 00000005, engmask 00000101, intr 10000000

and a forum comment:

A. "Xid" messages indicate that a general GPU error occurred, most often due to the driver misprogramming the GPU or to corruption of the commands sent to the GPU.

I suppose this doesn't exonerate the hardware.

--
Anthony Shipman                    Mamas don't let your babies
als@iinet.net.au                   grow up to be outsourced.

Anthony Shipman <als@iinet.net.au> wrote:
and a forum comment
A. "Xid" messages indicate that a general GPU error occurred, most often due to the driver misprogramming the GPU or to corruption of the commands sent to the GPU.
I suppose this doesn't exonerate the hardware.
Correct. Have you tried it with the Nouveau driver?

On Sat, 24 Mar 2012 03:04:31 pm Jason White wrote:
Correct.
Have you tried it with the Nouveau driver?
I've had bad experiences with it in the past due to poor performance. IIRC there is no configuration tool for it. Redhat expects that the default configuration is all you will need, and if you need something different you're supposed to write your own xorg.conf file from scratch. The nvidia driver at least comes with the nvidia-settings and nvidia-xconfig tools.

It was some months ago, but I think some of the default settings for the nouveau driver were wrong.

--
Anthony Shipman                    Mamas don't let your babies
als@iinet.net.au                   grow up to be outsourced.

Anthony Shipman <als@iinet.net.au> wrote:
I've had bad experiences with it in the past due to poor performance. IIRC there is no configuration tool for it. Redhat expects that the default configuration is all you will need, and if you need something different you're supposed to write your own xorg.conf file from scratch. The nvidia driver at least comes with the nvidia-settings and nvidia-xconfig tools.
Try a newer kernel with the Nouveau driver; that will exclude bugs in the proprietary driver as a possible cause. If it crashes too, then there are either bugs in both drivers or hardware issues (and I would suggest the probability of a hardware fault is raised if there are similar crashes in two independently written drivers). I understand that kernel developers will be less willing to help you if you're running a proprietary driver, so you should now perform the testing that they'll probably ask you to do anyway when you report your issue later on.

Anthony Shipman wrote:
I expect I'll just switch to ext4.
My rule of thumb is: use ext unless you KNOW it won't cut it and have benchmarked your specific use case and shown XFS (or whatever) WILL cut it. If your goal is "to avoid long fscks", any journalled filesystem should suffice, incl. ext >2.

By default ext* will fsck every ~20 mounts and every ~6 months, unlike the more aggressive journaled filesystems like xfs and btrfs.

--
Sent from my Samsung Galaxy S Android phone with K-9 Mail.

On Sat, Mar 24, 2012 at 02:11:28AM +1100, Anthony Shipman wrote:
I've been running a Centos 6.2 system for two months now. I've been using XFS for the /home file system since I had the impression it would be better for large file systems by avoiding long fsck times. At the same time the nvidia driver has proved to be buggy and has crashed the system several times.
more details required...

please show output of lspci, and 'lspci -v' for your nvidia card. what brand/model card is it?

does your BIOS have IOMMU support enabled? (i've found that that can cause problems for the nvidia driver)

also details on your kernel version and hardware would be useful, including details of your motherboard, disk controller, and the drives used by your XFS fs. also is XFS on raw disk partition(s) or LVM?
XFS has the habit of zeroing out some files each time there is a crash. This would be understandable for files that were being written around the time of the crash. But I've had files be erased that were created hours before the crash and were read-only after creation.
this zero-ing issue has got to be one of the most annoyingly common misconceptions about XFS.

http://xfs.org/index.php/XFS_FAQ#Q:_Why_do_I_see_binary_NULLS_in_some_files_...

  Q: Why do I see binary NULLS in some files after recovery when I unplugged the power?

  Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.

  XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash. Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you'll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the xfs_bmap(8) command).

BTW, the reason why you see NUL bytes in unflushed sectors after a crash is to avoid leaking data from whatever those sectors were used for before. e.g. say those sectors used to hold the data from /etc/shadow, or a script or config file containing passwords, or some other privacy/security-sensitive data....you wouldn't want that data exposed in, say, a user-owned file after a crash.

after a crash, those sectors may contain some or all of the data you wanted/expected it to, some or all of the data from any previous use, or (most likely) some combination of both. either way, you can't trust that the data is good.
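(A sketch of the check the FAQ suggests; the path below is just a placeholder for one of the affected files:)

    # an inode that was flushed without its data shows a size but no allocated extents
    stat /home/user/somefile
    xfs_bmap -v /home/user/somefile    # an empty extent map here matches the FAQ's description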
I expect I'll just switch to ext4.
ext4 is fine and, unlike xfs, will provide an easy upgrade path to btrfs when it's stable (or if you want to experiment). BTW, for most purposes, btrfs is stable enough. wouldn't trust a production server to it yet, but home use ought to be OK, esp if you make backups. craig -- craig sanders <cas@taz.net.au> BOFH excuse #252: Our ISP is having {switching,routing,SMDS,frame relay} problems

On Sat, 24 Mar 2012 01:56:05 pm Craig Sanders wrote:
BTW, the reason why you see NUL bytes in unflushed sectors after a crash is to avoid leaking data from whatever those sectors were used for before. e.g. say those sectors used to hold the data from /etc/shadow, or a script or config file containing passwords, or some other privacy/security-sensitive data....you wouldn't want that data exposed in, say, a user-owned file after a crash.
after a crash, those sectors may contain some or all of the data you wanted/expected it to, some or all of the data from any previous use, or (most likely) some combination of both. either way, you can't trust that the data is good.
I'm not seeing NUL bytes in the file. I am seeing files with size=0.

The most common case is that after rebooting, my kmailrc file has size=0 so all of my e-mail accounts, folders, mailing list configuration etc have been deleted. I've taken to keeping kmail shut down most of the time to reduce the risk of this happening. The same has happened to some other kde configuration files.

After each crash I have had to do a "find /home -size 0" to find destroyed files to be restored from backup. This doesn't find all of them though, since some are restored to default values after reboot.

On Friday I downloaded a zip file and unpacked it. I then spent some time browsing the HTML documentation in the unpacked directory. Hours later, after the crash, many or most of the HTML files I had read had size=0.

--
Anthony Shipman                    Mamas don't let your babies
als@iinet.net.au                   grow up to be outsourced.
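(A sketch of tightening that search a little; the paths are placeholders and the output is just a worklist for restoring from backup:)

    # zero-length regular files under /home, without crossing into other filesystems
    find /home -xdev -type f -size 0 -ls > /tmp/zeroed-files.txt
    # each listed path is a candidate for restore from backup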

On Sat, 24 Mar 2012, Anthony Shipman wrote:
On Sat, 24 Mar 2012 01:56:05 pm Craig Sanders wrote:
BTW, the reason why you see NUL bytes in unflushed sectors after a crash is to avoid leaking data from whatever those sectors were used for before. e.g. say those sectors used to hold the data from /etc/shadow, or a script or config file containing passwords, or some other privacy/security-sensitive data....you wouldn't want that data exposed in, say, a user-owned file after a crash.
after a crash, those sectors may contain some or all of the data you wanted/expected it to, some or all of the data from any previous use, or (most likely) some combination of both. either way, you can't trust that the data is good.
I'm not seeing NUL bytes in the file. I am seeing files with size=0. The most common case is that after rebooting, my kmailrc file has size=0 so all of my e-mail accounts, folders, mailing list configuration etc have been deleted. I've taken to keeping kmail shut down most of the time to reduce the risk of this happening. The same has happened to some other kde configuration files. After each crash I have had to do a "find /home -size 0" to find destroyed files to be restored from backup. This doesn't find all of them though since some are restored to default values after reboot.
kde (and gnome) programs have a long history of failing to do even the most basic of atomic file operations (I'm not talking of failing to do fsync() here, I'm talking simply open(),write(),close() as being their entire file writing sequence).

perhaps kmail is frequently updating its dotfile?

Get a better desktop environment and mailer :)
On Friday I downloaded a zip file and unpacked it. I then spent some time browsing the HTML documentation in the unpacked directory. Hours later, after the crash, many or most of the HTML files I had read had size=0.
XFS is perfect. You and the multitude of other people over the years that have reported this very same problem of 12 hour old files disappearing must all be mistaken. QED.

XFS has recently taken the place of ext4 on my backuppc server with its 14 million inodes[1]. I like the fact that it has 20GB more space available to it, because somehow it simultaneously uses less space for the metadata, dynamically allocates the inodes, but only uses 7% of the available inodes, whereas the previous ext4 incarnation had been using 25% of the available inodes and still took 20GB more to store them.

But it's slower, I'm worried about the NFS lockup problem Craig mentioned, and I don't like disappearing files, even (especially?) on a backup/archival server. Am I going to spend another month moving it all back to ext4?

[1] The transfer took almost a month because it was only making hard links at the rate of about 10 a second[2]

[2] hard links are atomic operations as far as the filesystem is concerned, so perhaps every single link involved at least a seek - but god knows why the filesystem operations weren't sorted[3] so it couldn't get slightly better tps, and being atomic at the filesystem level doesn't imply that the inodes need to be forced to disk as barrier calls - and as far as I'm aware, BackupPC_tarPCCopy doesn't make any calls to fsync().

[3] yes, I tried different elevators

--
Tim Connors

Tim Connors <tconnors@rather.puzzling.org> wrote:
XFS has recently taken the place of my backuppc server with its 14million inodes[1]. I like the fact that it has 20GB more space available to it, because somehow it simultaneously uses less space for the metadata, dynamically allocates the inodes, but only uses 7% of the available inodes, whereas the previous ext4 incarnation has been using 25% of the available inodes, and still took 20GB more to store them. But it's slower, I'm worried about the NFS lockup problem Craig mentioned, and I don't like disappearing files, even (especially?) on a backup/archival server. Am I going to spend another month moving it all back to ext4?
Have you had disappearing files, or confirmation that there's a bug which will cause them on your kernel version, or are you just worried by this thread that data might be lost?

On Mon, 26 Mar 2012, Jason White wrote:
Tim Connors <tconnors@rather.puzzling.org> wrote:
XFS has recently taken the place of my backuppc server with its 14million inodes[1]. I like the fact that it has 20GB more space available to it, because somehow it simultaneously uses less space for the metadata, dynamically allocates the inodes, but only uses 7% of the available inodes, whereas the previous ext4 incarnation has been using 25% of the available inodes, and still took 20GB more to store them. But it's slower, I'm worried about the NFS lockup problem Craig mentioned, and I don't like disappearing files, even (especially?) on a backup/archival server. Am I going to spend another month moving it all back to ext4?
Have you had disappearing files, or confirmation that there's a bug which will cause them on your kernel version, or are you just worried by this thread that data might be lost?
Years ago when I first tried non-standard filesystems like jfs and xfs, I encountered missing files. As having to restore from backup after an unclean shutdown sort of defeats the non-fscking benefits of having a journalled filesystem, I decided to give them a miss and move back to ext3 for many years, until I decided that a backup installation is not as critical as $HOME and I could probably afford to experiment again.

So far, not quite so sure the benefits outweigh the disadvantages. Nice to have 20GB extra, but now that I have 7TB of spinning metal, less important. And plus, XFS fanbois annoy me.

(I also tried btrfs this time around. But very very very slow for the lots of hardlink activity that backuppc causes, and it mistakenly optimises writes with its elevator allocations instead of reads. You do a heck of a lot more reads with backuppc than writes (well, after the initial backup, anyway)).

--
Tim Connors

On 26/03/12 15:50, Tim Connors wrote:
I also tried btrfs this time around. But very very very slow for the lots of hardlink activity that backuppc causes
You will also run into the horrible hard link limit in btrfs with backuppc; it's one of the applications used to illustrate the problem. It'll need an on-disk format rev to fix, apparently.

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

On Mon, 26 Mar 2012, Tim Connors <tconnors@rather.puzzling.org> wrote:
kde (and gnome) programs have a long history of failing to do even the most basic of atomic file operations (I'm not talking of failing to do fsync() here, I'm talking simply open(),write(),close() as being their entire file writing sequence)
perhaps kmail is frequently updating its dotfile?
KDE programs do frequently update things. Some years ago the kmail code which wrote messages to disk returned an unsigned integer to indicate the index (starting from 0) of the message that was written - this left no possibility of a return code indicating error. So when you ran out of disk space mail just disappeared!
Get a better desktop environment and mailer :)
KDE is very user friendly and if you are moving from a CUA environment it can be configured to be familiar. There are many objective criteria by which KDE scores very well. But data integrity probably isn't one of them.
On Friday I downloaded a zip file and unpacked it. I then spent some time browsing the HTML documentation in the unpacked directory. Hours later, after the crash, many or most of the HTML files I had read had size=0.
XFS is perfect. You and the multitude of other people over the years that have reported this very same problem of 12 hour old files disappearing, must all be mistaken. QED.
Actually a large part of the problem here is apps expecting things that POSIX doesn't guarantee. XFS delivers fewer of the things that certain app developers expect but that POSIX doesn't require.

Programmers should learn about what filesystems are required to deliver and code accordingly; assuming that the success of write() means that data is on disk is just wrong.

--
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/
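(A sketch of the safe-update idiom being described: write a complete new copy, force it out, then rename it over the old name. The names are placeholders, and a real application would fsync() the file descriptor rather than calling a global sync:)

    # write the full new contents to a temporary file in the same directory
    printf '%s\n' "$new_contents" > config.tmp
    sync                      # crude shell stand-in for fsync() on just that file
    mv config.tmp config      # rename() within one filesystem atomically replaces the old file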

On 26/03/12 17:00, Russell Coker wrote:
Actually a large part of the problem here is apps expecting things that POSIX doesn't guarantee. XFS delivers fewer of the things that certain app developers expect but that POSIX doesn't require.
Plus, of course, ext* was updated to have those same semantics as XFS, etc, so unless you have CONFIG_EXT3_DEFAULTS_TO_ORDERED set, or mount your ext* filesystems with data=ordered, you'll get data=writeback as the default.

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
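(A sketch of making the data journalling mode explicit in fstab rather than relying on the compiled-in default; the device and mount point are placeholders:)

    # /etc/fstab
    /dev/sda3   /home   ext4   defaults,data=ordered   1 2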

On Mon, 26 Mar 2012, Russell Coker wrote:
XFS is perfect. You and the multitude of other people over the years that have reported this very same problem of 12 hour old files disappearing, must all be mistaken. QED.
Actually a large part of the problem here is apps expecting things that POSIX doesn't guarantee. XFS delivers fewer of the things that certain app developers expect but that POSIX doesn't require. Programmers should learn about what filesystems are required to deliver and code accordingly, assuming that the success of write() means that data is on disk is just wrong.
What about the success of write()/close()/sleep(43200)?

Because quite frankly, unless the admin has deliberately turned on laptop_mode (and patched the kernel to increase the maximum commit interval to greater than the standard something like 6000 deciseconds, IIRC[1]), then filesystems that aren't committing data to disk after more than 12 hours of a file being closed (his .zip example) are slightly deficient and should be consigned to the nuclear waste disposal facility along with the planetary-scale hydrogen bomb that was used to ensure there were no traces left of such dodgy dodgy code.

Because a filesystem that doesn't actually bother to store anything is somewhat write-once-read-never and is thus a complete waste of everyone's time. Data is important, otherwise we would never have bothered to migrate off clay tablets.

(this thread is hilarious - particularly Chris Mason's reply 3rd down. people doing fsync() (and worse, the people insisting you need to use it otherwise you're a data hater) are probably making their data more fragile. xfs? fragile. ext4 without barriers? Probably just fine, thanks very much!: http://thread.gmane.org/gmane.linux.file-systems/23709)

[1] Yes, I have done this. I like my disks spun down until I issue a manual sync(1). Yes, I use libeatmydata on programs that assume I'm using some crappy filesystem like XFS, whereas I prefer filesystems that allow me to do write()/close()/rename() without having to also do fsync().

--
Tim Connors

On Sat, 24 Mar 2012 01:56:05 pm Craig Sanders wrote:
more details required...

I've got lots if you want:
Asus P8H67 motherboard.

Linux 2.6.32-220.2.1.el6.x86_64 #1 SMP Fri Dec 23 02:21:33 CST 2011 x86_64 x86_64 x86_64 GNU/Linux

01:00.0 VGA compatible controller: nVidia Corporation GF119 [GeForce GT 520] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. Device 83a0
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Memory at f0000000 (64-bit, prefetchable) [size=128M]
        Memory at f8000000 (64-bit, prefetchable) [size=32M]
        I/O ports at e000 [size=128]
        [virtual] Expansion ROM at fb000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [b4] Vendor Specific Information <?>
        Capabilities: [100] Virtual Channel <?>
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [600] Vendor Specific Information <?>
        Kernel driver in use: nvidia
        Kernel modules: nvidia, nouveau, nvidiafb

I'll probably just revert to the Intel GPU though, if I can get a decent configuration out of it. IIRC RedHat systems no longer come with an X11 configuration tool.
please show output of lspci, and 'lspci -v' for you nvidia card. what brand/model card is it? does your BIOS have IOMMU support enabled? (i've found that that can cause problems for the nvidia driver)
The EFI manual talks about an Intel VT-d thingy which is disabled by default. I expect it is still disabled.
also details on your kernel version and hardware would be useful, including details your motherboard, disk controller, on the drives used by your XFS fs. also is XFS on raw disk partition(s) or LVM?
The XFS fs is on a raw disk partition (/dev/sdb5). This is a GPT partition on a WD SATA drive.

--
Anthony Shipman                    Mamas don't let your babies
als@iinet.net.au                   grow up to be outsourced.

On Sat, Mar 24, 2012 at 03:07:08PM +1100, Anthony Shipman wrote:
On Sat, 24 Mar 2012 01:56:05 pm Craig Sanders wrote:
more details required...

I've got lots if you want:
Asus P8H67 motherboard.
Linux 2.6.32-220.2.1.el6.x86_64 #1 SMP Fri Dec 23 02:21:33 CST 2011 x86_64 x86_64 x86_64 GNU/Linux
01:00.0 VGA compatible controller: nVidia Corporation GF119 [GeForce GT 520] (rev a1) (prog-if 00 [VGA controller])
OK, so fairly new then. i somehow had the impression it was older gear. btw, what version of the nvidia driver are you running? support for the GT 520 was added in version 285.x.x - it may have worked before then, but maybe not.
please show output of lspci, and 'lspci -v' for you nvidia card. what brand/model card is it? does your BIOS have IOMMU support enabled? (i've found that that can cause problems for the nvidia driver)
The EFI manual talks about an Intel VT-d thingy which is disabled by default. I expect it is still disabled.
VT-d is completely unrelated. VT-d is virtualisation support. if you ever want to run kvm or virtualbox or similar you'll need it enabled. the IOMMU is the memory manager unit. some BIOSes allow enabling it for 64-bit linux with a 64MB buffer for transfers to/from 32-bit devices. i've found it's best to turn it off, and make sure Memory Hole remapping is enabled in the bios.
also details on your kernel version and hardware would be useful, including details your motherboard, disk controller, on the drives used by your XFS fs. also is XFS on raw disk partition(s) or LVM?
The XFS fs is on a raw disk partition (/dev/sdb5). This is a GPT partition on a WD SATA drive.
what kind of controller? ahci or ide mode? or "RAID" mode?

in my experience, it's best to set the BIOS to ahci mode for all drives, and to avoid "RAID" mode like the pestilential POS that it is (linux software raid is *much* better).

what happens if you do something like 'dd if=/dev/sdb5 of=/dev/null' - does it run to completion or does it trigger a fault? if the latter, then it's the controller or the disk itself, not the fs. if the former, well, cause is still unknown.

similarly, try a write test with dd direct to a disk partition if you have one available to be wiped/overwritten - maybe turn off your swap partition temporarily and use that. e.g.

  swapoff /dev/XXXX
  dd if=/dev/urandom of=/dev/XXXX

where XXXX is your swap partition.

remember to mkswap it before you use it as swap again. probably best to make note of the UUID first and re-create it with the same UUID like so:

  mkswap -U [uuid] /dev/XXXX

you can get the current uuid with blkid. e.g.

  # blkid /dev/sda2
  /dev/sda2: UUID="1ac8f2ba-39c7-43b4-8c26-a039d3eda76e" TYPE="swap"

craig

ps: i've been using XFS on numerous systems for years. can't remember when, exactly, late 90s probably. anyway, I've never seen anything like the problems you describe, so my current favourite guess is that it's caused by faulty hardware or faulty bios settings.

the only serious outstanding problem with XFS that i know of is that, under some rare and not very-well defined circumstances, NFS activity on an NFS-exported XFS volume can cause the nfs server to lock up. there's some kind of conflict with the XFS code and the NFS code.

--
craig sanders <cas@taz.net.au>

BOFH excuse #14: sounds like a Windows problem, try calling Microsoft support

On Sat, 24 Mar 2012 04:08:45 pm Craig Sanders wrote:
OK, so fairly new then. i somehow had the impression it was older gear.
btw, what version of the nvidia driver are you running?
support for the GT 520 was added in version 285.x.x - it may have worked before then, but maybe not.
I was running 295.20 which was the latest until a few days ago. I'm now running 295.33.
The EFI manual talks about an Intel VT-d thingy which is disabled by default. I expect it is still disabled.
VT-d is completely unrelated. VT-d is virtualisation support. if you ever want to run kvm or virtualbox or similar you'll need it enabled.
the IOMMU is the memory manager unit. some BIOSes allow enabling it for 64-bit linux with a 64MB buffer for transfers to/from 32-bit devices.
i've found it's best to turn it off, and make sure Memory Hole remapping is enabled in the bios.
Wikipedia says that VT-d is Intel's idea of an IOMMU. But my BIOS doesn't show it despite what the manual says. There is nothing else in the BIOS resembling an IOMMU or memory hole remapping.
what kind of controller? ahci or ide mode? or "RAID" mode? in my experience, it's best to set the BIOS to ahci mode for all drives. and to avoid "RAID" mode like the pestilential POS that it is (linux software raid is *much* better)..
The controller was in IDE mode. I've changed it to AHCI.
what happens if you do something like 'dd if=/dev/sdb5 of=/dev/null' - does it run to completion or does it trigger a fault? if the latter, then it's the controller or the disk itself, not the fs. if the former, well, cause is still unknown.
No fault at all (with the controller in IDE mode).

I'll have to wait and see what happens next.

--
Anthony Shipman                    Mamas don't let your babies
als@iinet.net.au                   grow up to be outsourced.

On Sat, Mar 24, 2012 at 05:41:40PM +1100, Anthony Shipman wrote:
Wikipedia says that VT-d is Intel's idea of an IOMMU.
you're right. i just read the first few lines of the wikipedia page and thought "it's intel's virtualisation cpu flag....not what i'm looking for"
But my BIOS doesn't show it despite what the manual says. There is nothing else in the BIOS resembling an IOMMU
on further reading...it IS the iommu setting on intel. try turning it off.

(i'm far more used to amd systems and bios settings than intel ones)
or memory hole remapping.
memory hole remapping is where the huge chunk of memory used by a modern GPU (up to 3GB on some cards) is mapped out of lower memory. there's probably a setting for it somewhere....some BIOSes have settings in extremely weird places (a HP desktop machine i needed to run kvm on a few weeks ago had virtualisation support under "Security" in the BIOS. would never have thought of looking for it there, only found it by exhaustive search of every single option)
The controller was in IDE mode. I've changed it to AHCI.
good. that's the native mode for sata these days, ide mode is an emulation for older operating systems (i.e. win xp. or even a later version of windows that was installed on IDE but has been upgraded to sata drives & controller. linux is cool with the underlying hardware changing - at worst, the device name may change so fstab might need editing - but windows doesn't like it at all, and it can be a real pain getting windows booting again if you change the drive type). in short: ahci gives you the full feature set of ahci and (probably) better performance.
what happens if you do something like 'dd if=/dev/sdb5 of=/dev/null'
No fault at all (with the controller in IDE mode).
I'll have to wait and see what happens next.
did you try the write test as well?

one other thing to try - should have suggested it earlier. unmount the XFS file system (boot to single-user if necessary) and run xfs_check and/or xfs_repair on it. if the fs has been corrupted badly enough in the past (e.g. due to a crash or power failure), xfs can get terribly confused when it encounters the corruption again.

worst case, if it's corrupted so bad that xfs_repair can't fix it, you'll have to backup and restore.

craig
--
craig sanders <cas@taz.net.au>

BOFH excuse #393: Interference from the Van Allen Belt.
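(A sketch of that sequence, assuming the filesystem is the /dev/sdb5 /home mentioned earlier in the thread:)

    umount /home
    xfs_check /dev/sdb5        # read-only consistency check
    xfs_repair -n /dev/sdb5    # dry run: report what would be repaired without changing anything
    xfs_repair /dev/sdb5       # actually repair
    mount /home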

On Sat, 24 Mar 2012 08:43:37 pm Craig Sanders wrote:
On Sat, Mar 24, 2012 at 05:41:40PM +1100, Anthony Shipman wrote:
there's probably a setting for it somewhere....some BIOSes have settings in extremely weird places (a HP desktop machine i needed to run kvm on a few weeks ago had virtualisation support under "Security" in the BIOS. would never have thought of looking for it there, only found it by exhaustive search of every single option)
The BIOS has no settings for any of that.
The controller was in IDE mode. I've changed it to AHCI.
good. that's the native mode for sata these days, ide mode is an emulation for older operating systems (i.e. win xp. or even a later version of windows that was installed on IDE but has been upgraded to sata drives & controller. linux is cool with the underlying hardware changing - at worst, the device name may change so fstab might need editing - but windows doesn't like it at all, and it can be a real pain getting windows booting again if you change the drive type).
There were no changes to the device names.
I'll have to wait and see what happens next.
did you try the write test as well?

Not at the moment.
one other thing to try - should have suggested it earlier. unmount the XFS file system (boot to single-user if necessary) and run xfs_check and/or xfs_repair on it. if the fs has been corrupted badly enough in the past (e.g. due to a crash or power failure), xfs can get terribly confused when it encounters the corruption again.
I tried xfs_repair earlier on Sat after rebooting from the last crash. It reported no errors.

--
Anthony Shipman                    Mamas don't let your babies
als@iinet.net.au                   grow up to be outsourced.

On Sun, Mar 25, 2012 at 12:24:28AM +1100, Anthony Shipman wrote:
I tried xfs_repair earlier on Sat after rebooting from the last crash. It reported no errors.
sorry, i'm out of ideas on this one.

craig
--
craig sanders <cas@taz.net.au>

BOFH excuse #128: Power Company having EMP problems with their reactor

On 26/03/12 08:39, Craig Sanders wrote:
On Sun, Mar 25, 2012 at 12:24:28AM +1100, Anthony Shipman wrote:
I tried xfs_repair earlier on Sat after rebooting from the last crash. It reported no errors.
sorry, i'm out of ideas on this one.
If Anthony greps for 'barrier' in the system logs, I'd be interested to know if there's anything about xfs failing to set them up. They're important for journalled filesystems to operate reliably. (Unless you have battery-backed-up RAID controllers or super-capacitor SSDs)

-Toby
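(A sketch of what to look for; /home is the filesystem from earlier in the thread, and the exact wording of any barrier-related message varies by kernel:)

    # any complaints from xfs about barriers at mount time?
    grep -i barrier /var/log/messages*
    dmesg | grep -i barrier
    # current mount options for /home -- on kernels of this vintage, xfs tends to
    # list 'nobarrier' here only when barriers have been disabled
    grep ' /home ' /proc/mounts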

On Mon, 26 Mar 2012 11:42:54 am Toby Corkindale wrote:
If Anthony greps for 'barrier' in the system logs, I'd be interested to know if there's anything about xfs failing to set them up.. They're important for journalled filesystems to operate reliably. (Unless you have battery-backed-up RAID controllers or super-capacitor SSDs)
-Toby
The word barrier does not appear in any file in /var/log.

The file system is mounted with options 'defaults,relatime'.

--
Anthony Shipman                    Mamas don't let your babies
als@iinet.net.au                   grow up to be outsourced.

On 26/03/12 12:00, Anthony Shipman wrote:
On Mon, 26 Mar 2012 11:42:54 am Toby Corkindale wrote:
If Anthony greps for 'barrier' in the system logs, I'd be interested to know if there's anything about xfs failing to set them up.. They're important for journalled filesystems to operate reliably. (Unless you have battery-backed-up RAID controllers or super-capacitor SSDs)
-Toby
The word barrier does not appear in any file in /var/log The file system is mounted with options 'defaults,relatime'
Which kernel version are you running? (XFS started enabling barriers by default from 2.6.17) Are you accessing direct drive partitions, or are you going via LVM or the software raid block devices? (Barriers weren't passed-through on those until much later) Toby

On Mon, 26 Mar 2012 12:05:44 pm Toby Corkindale wrote:
Which kernel version are you running? (XFS started enabling barriers by default from 2.6.17)
Are you accessing direct drive partitions, or are you going via LVM or the software raid block devices? (Barriers weren't passed-through on those until much later)
Toby
This is CentOS 6.2, kernel 2.6.32-220.2.1.el6.x86_64. The file system is on a GPT partition "/dev/sdb5". Made by:
mkfs.xfs /dev/sdb5
meta-data=/dev/sdb5              isize=256    agcount=4, agsize=14468809 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=57875233, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=28259, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
--
Anthony Shipman                    Mamas don't let your babies
als@iinet.net.au                   grow up to be outsourced.

On Sat, 24 Mar 2012, Craig Sanders wrote:
one other thing to try - should have suggested it earlier. unmount the XFS file system (boot to single-user if necessary) and run xfs_check and/or xfs_repair on it. if the fs has been corrupted badly enough in the past (e.g. due to a crash or power failure), xfs can get terribly confused when it encounters the corruption again.
Now if only there was a fsck.xfs and regular checks every ~20 mounts. -- Tim Connors

On Mon, Mar 26, 2012 at 03:01:18PM +1100, Tim Connors wrote:
[ ... xfs_repair ... ]
Now if only there was a fsck.xfs and regular checks every ~20 mounts.
alternatively, and with 50% less facetiousness, you can even make ext[234] filesystems behave in a non-annoying manner:

  tune2fs -i 0 -c 0 /dev/ext[234]partition

:-)

craig
--
craig sanders <cas@taz.net.au>

BOFH excuse #285: Telecommunications is upgrading.

On Mon, 26 Mar 2012, Craig Sanders wrote:
On Mon, Mar 26, 2012 at 03:01:18PM +1100, Tim Connors wrote:
[ ... xfs_repair ... ]
Now if only there was a fsck.xfs and regular checks every ~20 mounts.
alternatively, and with 50% less facetiousness, you can even make ext[234] filesystems behave in a non-annoying manner:
tune2fs -i 0 -c 0 /dev/ext[234]partition
But didn't you just demonstrate why regular checks are a Good Idea? Bad shit sometimes happens, and better to know about it than silently ignore it.

Personally, I use Ted Tso's trick of taking LVM snapshots and 'e2fsck -fy'ing the snapshot. If it succeeds, tune2fs the original device to reset the counter; if it fails, send an email to the admin to schedule a reboot and tune2fs the device to say it's been mounted a bajillion times.

But a recent update to lvm and/or kernel no longer seems to properly quiesce the filesystem[1], so I get these:

e2fsck 1.42 (29-Nov-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (27450610, counted=27450649).
Fix? yes
Free inodes count wrong (19841641, counted=19841642).
Fix? yes

(I do a preen with e2fsck -p just to get rid of the

  /dev/gamow/2012-03-25+04.44.28-root-snapshot: Clearing orphaned inode 1156235 (uid=0, gid=0, mode=0100644, size=161848)

warnings that always happened after a snapshot. The inode counts requiring a fix never used to happen though.)

[1] I'd love to submit a bug report, but I don't know which part has broken, and I fear I'm doing things non-standard enough (even with Tso's blessing) that no one will care enough to hunt for the bug. Plus, my system is a bastardised sid/testing/stable machine, so it's all probably my fault, even though I think I took care to make sure that lvm/e2utils/kernel/libraries were taken from the same branch at the same time.

--
Tim Connors
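(A sketch of that snapshot-check cycle; the volume group, LV names, snapshot size and mount counts are all placeholders:)

    # check a snapshot of the live filesystem instead of the filesystem itself
    lvcreate -s -L 5G -n home-check /dev/vg0/home
    if e2fsck -fy /dev/vg0/home-check; then
        tune2fs -C 0 /dev/vg0/home         # clean: reset the mount count
    else
        tune2fs -C 16000 /dev/vg0/home     # dirty: force a full fsck at the next boot
        echo "fsck of /home snapshot failed" | mail -s "schedule a reboot" root
    fi
    lvremove -f /dev/vg0/home-check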

On 26/03/12 16:22, Tim Connors wrote:
But didn't you just demonstrate why regular checks are a Good Idea?
Indeed, my ext4 /home failed its check last night after a reboot for a new kernel and I had to sort out a bunch of duplicate blocks with a manual fsck.

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

On Mon, Mar 26, 2012 at 04:22:01PM +1100, Tim Connors wrote:
On Mon, 26 Mar 2012, Craig Sanders wrote:
On Mon, Mar 26, 2012 at 03:01:18PM +1100, Tim Connors wrote:
[ ... xfs_repair ... ]
Now if only there was a fsck.xfs and regular checks every ~20 mounts.
alternatively, and with 50% less facetiousness, you can even make ext[234] filesystems behave in a non-annoying manner:
tune2fs -i 0 -c 0 /dev/ext[234]partition
But didn't you just demonstrate why regular checks are a Good Idea?
no, how? on the contrary, if the machine has crashed then you should run a check. or if you suspect there's a problem. the user having a clue is a Good Idea.

IMO automatic time-interval and mount-count fscks are stupid. i don't see that they do anything useful. they just cause me to be annoyed by a long fsck on every reboot because I (typically) reboot my systems once every six months or more. or 20 times in an hour trying to solve some problem (and no, i don't need an auto-fsck every 5th or 10th reboot, i want to get on with solving whatever problem i'm working on)
Bad shit sometimes happens, and better to know about it than silently ignore it.
who said anything about ignoring the bad shit that happens? not me.
Personally, I use Ted Tso's trick to taking LVM snapshots and 'e2fsck -fy'ing the snapshot. If it succeeds, tune2fs the original device
that sounds worthwhile. craig -- craig sanders <cas@taz.net.au> BOFH excuse #312: incompatible bit-registration operators

On Mon, 26 Mar 2012, Craig Sanders <cas@taz.net.au> wrote:
on the contrary, if the machine has crashed then you should run a check. or if you suspect there's a problem. the user having a clue is a Good Idea.
IMO automatic time-interval and mount-count fscks are stupid. i don't see that they do anything useful.
they just cause me to be annoyed by a long fsck on every reboot because I (typically) reboot my systems once every six months or more. or 20
It seems to me that the concept of BTRFS with scrubbing is a good one. If you have a BTRFS filesystem with RAID-1 for data and meta-data (which can be done with only one disk) then you can scrub it after boot and not delay the boot process.

--
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/

On Mon, Mar 26, 2012 at 06:31:05PM +1100, Russell Coker wrote:
It seems to me that the concept of BTRFS with scrubbing is a good one. If you have a BTRFS filesystem with RAID-1 for data and meta-data (which can be done with only one disk) then you can scrub it after boot and not delay the boot process.
yep, that's what i do with ZFS - run a weekly scrub on both my ZFS pools ("export" and "backup") and i can check status with:

# zpool status -v
  pool: backup
 state: ONLINE
  scan: scrub repaired 0 in 3h9m with 0 errors on Sat Mar 24 04:36:50 2012
config:

        NAME                                                   STATE     READ WRITE CKSUM
        backup                                                 ONLINE       0     0     0
          raidz1-0                                             ONLINE       0     0     0
            scsi-SATA_ST31000528AS_6VP3FWAG                    ONLINE       0     0     0
            scsi-SATA_ST31000528AS_9VP4RPXK                    ONLINE       0     0     0
            scsi-SATA_ST31000528AS_9VP509T5                    ONLINE       0     0     0
            scsi-SATA_ST31000528AS_9VP4P4LN                    ONLINE       0     0     0
        logs
          scsi-SATA_Patriot_Torqx_278BF0715010800025492-part5  ONLINE       0     0     0

errors: No known data errors

  pool: export
 state: ONLINE
  scan: scrub repaired 0 in 4h30m with 0 errors on Sat Mar 24 05:57:48 2012
config:

        NAME                                                   STATE     READ WRITE CKSUM
        export                                                 ONLINE       0     0     0
          raidz1-0                                             ONLINE       0     0     0
            scsi-SATA_WDC_WD10EACS-00_WD-WCASJ2114122          ONLINE       0     0     0
            scsi-SATA_WDC_WD10EACS-00_WD-WCASJ2195141          ONLINE       0     0     0
            scsi-SATA_WDC_WD10EARS-00_WD-WMAV50817803          ONLINE       0     0     0
            scsi-SATA_ST31000528AS_9VP18CCV                    ONLINE       0     0     0
        logs
          scsi-SATA_Patriot_Torqx_278BF0715010800025492-part6  ONLINE       0     0     0
          scsi-SATA_Patriot_WildfirPT1131A00006353-part5       ONLINE       0     0     0
        cache
          scsi-SATA_Patriot_Torqx_278BF0715010800025492-part7  ONLINE       0     0     0
          scsi-SATA_Patriot_WildfirPT1131A00006353-part6       ONLINE       0     0     0

errors: No known data errors

the scrub on /backup is about 1.5 hours faster than on /export because a) /backup is mostly idle, just regular rsync cron jobs, and b) it's made up of fast seagate drives rather than mostly WD Green drives. yes, i should have done it the other way around...but they were my btrfs drives until i experimented enough with ZFS on the WDs to decide to switch.

/export has one seagate because one of the WDs (another WD10EARS) died, and i had a seagate handy to replace it with :)

btw, I use the /dev/disk/by-id names to avoid kernel-related device renaming issues.

craig
--
craig sanders <cas@taz.net.au>

BOFH excuse #309: firewall needs cooling
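(A sketch of driving such scrubs from cron; the pool names are the ones above, the times are arbitrary:)

    # /etc/cron.d/zfs-scrub -- weekly scrubs early on Saturday morning
    30 3 * * 6  root  /sbin/zpool scrub backup
    30 4 * * 6  root  /sbin/zpool scrub export
    # 'zpool status -x' afterwards reports anything that isn't healthy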

On Mon, Mar 26, 2012 at 07:04:18PM +1100, Craig Sanders wrote:
yep, that's what i do with ZFS - run a weekly scrub on both my ZFS pools ("export" and "backup")
...
        NAME                                                   STATE     READ WRITE CKSUM
        export                                                 ONLINE       0     0     0
          raidz1-0                                             ONLINE       0     0     0
            scsi-SATA_WDC_WD10EACS-00_WD-WCASJ2114122          ONLINE       0     0     0
            scsi-SATA_WDC_WD10EACS-00_WD-WCASJ2195141          ONLINE       0     0     0
            scsi-SATA_WDC_WD10EARS-00_WD-WMAV50817803          ONLINE       0     0     0
            scsi-SATA_ST31000528AS_9VP18CCV                    ONLINE       0     0     0
        logs
          scsi-SATA_Patriot_Torqx_278BF0715010800025492-part6  ONLINE       0     0     0
          scsi-SATA_Patriot_WildfirPT1131A00006353-part5       ONLINE       0     0     0
        cache
          scsi-SATA_Patriot_Torqx_278BF0715010800025492-part7  ONLINE       0     0     0
          scsi-SATA_Patriot_WildfirPT1131A00006353-part6       ONLINE       0     0     0

errors: No known data errors
...
/export has one seagate because one of the WDs (another WD10EARS) died, and i had a seagate handy to replace it with :)
how did ZoL handle the drive failure? gracefully? no problems putting in a new disk? I ask because I've been hammering ZoL (with Lustre) for a while now looking at performance and stability, but haven't tested with any drive failures (real or simulated) yet. cheers, robin

On Thu, Apr 05, 2012 at 10:59:10PM -0400, Robin Humble wrote:
how did ZoL handle the drive failure? gracefully?
no problem at all. as gracefully as possible under the circumstances (a missing disk). 'zpool status' reported that the pool was degraded, and that one of the component disks was offline and needed to be replaced. the error reporting in zpool status is quite good: an error code, a short meaningful paragraph and a URL for a web page with more details (earlier versions used to refer to a sun.com URL, but now they refer to pages under http://zfsonlinux.org/msg/ - the sun.com zfs msg links weren't properly redirected when oracle rebranded the old sun web sites)
no problems putting in a new disk?
I issued the 'zpool replace ...' command and it started resilvering the data from the available disks onto the new disk. i continued using the system, and barely even noticed when it finished. the entire process was smooth and painless and Just Worked. i've had similar experiences with mdadm and hw raid resyncs, but this just felt smoother. and unlike block-level raid, it didn't have to sync every block, just those blocks that needed to have a copy on the replacement.
I ask because I've been hammering ZoL (with Lustre) for a while now looking at performance and stability, but haven't tested with any drive failures (real or simulated) yet.
how's lustre looking?

i know it's the major motivation behind behlendorf @ LLNL's work on zfsonlinux, but haven't been following news about it.

craig
--
craig sanders <cas@taz.net.au>

BOFH excuse #446: Mailer-daemon is busy burning your message in hell.

On Fri, Apr 06, 2012 at 03:55:19PM +1000, Craig Sanders wrote:
On Thu, Apr 05, 2012 at 10:59:10PM -0400, Robin Humble wrote:
how did ZoL handle the drive failure? gracefully?

no problem at all. as gracefully as possible under the circumstances (a missing disk). 'zpool status' reported that the pool was degraded, and that one of the component disks was offline and needed to be replaced.
cool. what about sector re-writes - have you seen ZFS do any of those? I presume ZFS does this, but again, I haven't seen it yet in my testing. I'm familiar with these from md raid1/5/6 which when it hits a disk read error will over-write the unreadable disk blocks with reconstructed (if necessary) info from the other drives. the write causes the drive to remap the sector from an internal spare and fixes the read error. we couldn't survive without this... a vital feature.
the entire process was smooth and painless and Just Worked. i've had similar experiences with mdadm and hw raid resyncs, but this just felt smoother.
nice :)
and unlike block-level raid, it didn't have to sync every block, just those blocks that needed to have a copy on the replacement.
yeah, that's a great feature.
I ask because I've been hammering ZoL (with Lustre) for a while now looking at performance and stability, but haven't tested with any drive failures (real or simulated) yet.

how's lustre looking?
improving. it's quite an 'experimental' lustre version that they're using, even without the ZFS additions, so lots of new and shiny things to break! :)

https://github.com/chaos/lustre/tags

unfortunately the zfs backend usually still deadlocks when I push it hard :-/ md backend is ok.

my typical test config is 4 lustre clients to one lustre server with 40 SAS disks in raidz2 8+2's - either in 4 separate zpools (4 OSTs in Lustre-speak) or all 4 z2's in one zpool (one OST). using 8 rpc's in flight per client works, but using 32 (more clients would prob have the same effect) isn't stable yet IMHO.

Lustre ZFS is very fast for writes when it works though... even random 1m writes. md is faster for reads, but then again, it's not checksum'ing anything.
i know it's the major motivation behind behlendorf @ LLNL's work on zfsonlinux, but haven't been following news about it.
I suspect we should know more after LUG in a few weeks time: http://www.opensfs.org/lug/program

judging by their hw config and my benchmarks (and assuming it's the same config as in Brian's talk at LUG last year) I think they'll easily get to their target of 1TB/s write, but 1TB/s reads will be a bit harder. I suspect they'll still get there though.

cheers,
robin

On Fri, Apr 06, 2012 at 02:47:32AM -0400, Robin Humble wrote:
what about sector re-writes - have you seen ZFS do any of those? I presume ZFS does this, but again, I haven't seen it yet in my testing.
i've seen resilvering in tests where i've yanked a drive from a vdev, written data, and then put it back in. i've also seen the same when i've corrupted one of the disks in a VM's vdev by writing to it from the host OS and then either tried to cause the VM to read that data or run zfs scrub on the pool.

also, one of my home pools is mostly made up of WD Green drives, so i've also occasionally seen rewrites when the long timeouts cause zfs to decide that the block is bad.

whether the sector(s) actually got remapped by the drive is hard to tell. i'd assume so.
how's lustre looking?
improving. it's quite an 'experimental' lustre version that they're using, even without the ZFS additions, so lots of new and shiny things to break! :) https://github.com/chaos/lustre/tags unfortunately the zfs backend usually still deadlocks when I push it hard :-/ md backend is ok.
any idea what's causing the deadlocks? only when writing, or reading too? random or sequential writes?
I suspect we should know more after LUG in a few weeks time http://www.opensfs.org/lug/program
judging by their hw config and my benchmarks (and assuming it's the same config as in Brian's talk at LUG last year) I think they'll easily get to their target of 1TB/s write, but 1TB/s reads will be a bit harder. I suspect they'll still get there though.
cool, i'll keep an eye out for that. I definitely want to read his presentation or watch it if it ends up on youtube or somewhere.

craig
--
craig sanders <cas@taz.net.au>

BOFH excuse #14: sounds like a Windows problem, try calling Microsoft support

On Fri, Apr 06, 2012 at 05:38:28PM +1000, Craig Sanders wrote:
On Fri, Apr 06, 2012 at 02:47:32AM -0400, Robin Humble wrote:
what about sector re-writes - have you seen ZFS do any of those? I presume ZFS does this, but again, I haven't seen it yet in my testing.

whether the sector(s) actually got remapped by the drive is hard to tell. i'd assume so.
fair enough. I guess if remapped sectors are incrementing up in the drive's smart data then it's probably working, but if the drives are just timing out SCSI cmds randomly (should they really do that?!) then that wouldn't show up.
any idea what's causing the deadlocks?
the traces and some builds back and forward through git commits give some idea. I'm guessing attempts by lustre to send data using zero copy write hooks in zfs are racing with (non-zero copy?) metadata (attr) updates. I'll email Brian and see if he can suggest something and/or which mailing list or jira or github issue to post to. I assume regular ZFS is ok and stable because it doesn't attempt zero copy writes.
only when writing, or reading too? random or sequential writes?
just writes. sometimes sequential and sometimes random. always with at least 32 and often with 128 1M i/o's in flight from clients.

so I guess you're running with ashift=12 and a limit on zfs_arc_max? I'm also using zfs_prefetch_disable=1 (helps lustre reads), but apart from that no other zfs tweaks, no l2arc SSDs yet etc.

cheers,
robin

On Fri, Apr 06, 2012 at 11:43:40AM -0400, Robin Humble wrote:
fair enough. I guess if remapped sectors are incrementing up in the drive's smart data then it's probably working, but if the drives are just timing out SCSI cmds randomly (should they really do that?!) then that wouldn't show up.
they're consumer drives so, yeah, they have long retry times. enterprise drives are far quicker to return an error...consumer drives keep on trying (non-tunable), and the kernel has a few retries too (tunable) so it can sometimes take a minute or more. that can be enough for zfs (or the SAS card, if it's still flashed with raid firmware like mine is) to decide the drive is failing.
any idea what's causing the deadlocks?
the traces and some builds back and forward through git commits give some idea. I'm guessing attempts by lustre to send data using zero copy write hooks in zfs are racing with (non-zero copy?) metadata (attr) updates. I'll email Brian and see if he can suggest something and/or which mailing list or jira or github issue to post to.
I assume regular ZFS is ok and stable because it doesn't attempt zero copy writes.
ah, okay. well, at least you know it is a priority for the LLNL zfs + lustre project :)
only when writing, or reading too? random or sequential writes?
just writes. sometimes sequential and sometimes random. always with at least 32 and often with 128 1M i/o's in flight from clients.
you got any SSDs for ZIL?
so I guess you're running with ashift=12 and a limit on zfs_arc_max?
yep. 4GB zfs_arc_max. I set it to that when my machine only had 8GB RAM, and didn't bother changing it when I upgraded to 16GB. seems good for a shared desktop/server running a few bloated apps like chromium.

$ cat /etc/modprobe.d/zfs.conf
# use minimum 1GB and maximum of 4GB RAM for ZFS ARC
options zfs zfs_arc_min=1073741824 zfs_arc_max=4294967296

and yes, using ashift=12 because some of my drives are 4K sectors ("advanced format") and i expect all of them will be replaced with 4K drives in the not-too-distant future. on my zfs backup server with 24GB RAM at work, i have:

# use minimum 4GB and maximum of 12GB RAM for ZFS ARC
options zfs zfs_arc_min=4294967296 zfs_arc_max=12884901888

i'd be comfortable increasing that up to 20 or 22GB. usage is mostly writes (rsync backups), so it doesn't matter much.
I'm also using zfs_prefetch_disable=1 (helps lustre reads), but apart from that no other zfs tweaks, no l2arc SSDs yet etc.
L2ARC isn't any use for writes anyway, except indirectly as it helps reduce the read i/o load on the disks. i'd try putting in a good fast SLC SSD as ZIL and see if that helps. it should certainly smooth out the metadata updates.

for ZIL you probably don't need more than a few GB (maybe as much as 4 or even 8; OTOH it doesn't hurt to have more ZIL than you need, it just doesn't get used, so use the entire device), so write speed of the SSD is far more important than capacity. and for multiple simultaneous writers you're probably better off with multiple small ZIL devices than one larger one.

note, however, that larger SSDs tend to be faster than smaller ones due to the internal raid0-like configuration of the individual flash chips. 120GB and 240GB SSDs, for example, are noticeably faster than 60GB. excess space could just be wasted, or perhaps the SSD could be partitioned and the excess used as L2ARC (but that would impact the ZIL partition's write performance, so it's probably better to just use the entire ssd as ZIL regardless of size).

fortunately, log (ZIL) and cache (L2ARC) devices can be added and removed at whim with zfs, so you can experiment easily with different configurations. watching 'zpool iostat -v' will tell you ZIL usage.

craig -- craig sanders <cas@taz.net.au> BOFH excuse #327: The POP server is out of Coke
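(a sketch of the experiment being suggested, assuming a pool called "tank" and placeholder SSD device names)

  # dedicate an SSD (or partition) as the ZIL / slog, and optionally another as L2ARC
  zpool add tank log /dev/disk/by-id/ata-FAST_SLC_SSD-part1
  zpool add tank cache /dev/disk/by-id/ata-BIG_MLC_SSD

  # per-vdev stats, including how busy the log device actually is
  zpool iostat -v tank 5

  # both can be removed again if they don't help
  zpool remove tank /dev/disk/by-id/ata-FAST_SLC_SSD-part1
  zpool remove tank /dev/disk/by-id/ata-BIG_MLC_SSD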

Tim Connors wrote:
Personally, I use Ted Tso's trick of taking LVM snapshots and 1) 'e2fsck -fy'ing the snapshot. 2) If it succeeds, tune2fs the original device to reset the counter; if it fails, send an email to the admin to schedule a reboot and tune2fs the device to say it's been mounted a bajillion times.
Hmm, I've been doing (1) but not (2). Thanks for the idea.
But a recent update to lvm and/or kernel no longer seems to properly quiesce the filesystem[1], so I get these:
e2fsck 1.42 (29-Nov-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (27450610, counted=27450649). Fix? yes
Free inodes count wrong (19841641, counted=19841642). Fix? yes
(I do a preen with e2fsck -p just to get rid of the "/dev/gamow/2012-03-25+04.44.28-root-snapshot: Clearing orphaned inode 1156235 (uid=0, gid=0, mode=0100644, size=161848)" warnings that always happened after a snapshot. The inode counts requiring a fix never used to happen though.)
Oh. Why do you think I'm fscking my snapshots in the first place? :P Maybe I'll dig up the exact complaints I'm getting from e2fsck and we can compare notes.
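(for reference, a minimal sketch of the snapshot-check trick described above; the VG name is borrowed from the snapshot path earlier in the thread, and the sizes, snapshot name and mail command are placeholders)

  # snapshot the live filesystem, fsck the snapshot, then act on the result
  lvcreate --snapshot --size 4G --name root-check /dev/gamow/root
  if e2fsck -fy /dev/gamow/root-check; then
      tune2fs -C 0 /dev/gamow/root         # snapshot clean: reset the mount counter
  else
      tune2fs -C 16000 /dev/gamow/root     # pretend it's been mounted a bajillion times
      echo "e2fsck of root snapshot failed" | mail -s "schedule a reboot" root
  fi
  lvremove -f /dev/gamow/root-check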

On Mon, 26 Mar 2012, Craig Sanders wrote:
On Mon, Mar 26, 2012 at 03:01:18PM +1100, Tim Connors wrote:
[ ... xfs_repair ... ]
Now if only there was a fsck.xfs and regular checks every ~20 mounts.
alternatively, and with 50% less facetiousness, you can even make ext[234] filesystems behave in a non-annoying manner:
tune2fs -i 0 -c 0 /dev/ext[234]partition
:-)
craig
-- craig sanders <cas@taz.net.au>
Interesting. I still use ext2/3, and what I have done to partly get around the problem is to increase the number of mounts before a forced fsck. This was done by simply multiplying the value Debian set up by 4; the limits on my systems are now in the range of 70 to around 130 mounts. As the systems are shut down at night and usually only started every 2nd day, a forced fsck every 6 to 8 months is not a major disaster.

But I must say I question the necessity of such a check. From memory, the only time it showed up any problems was some days before the voltage regulation on a 12V line on one of my systems completely failed. Every other time fsck has run was caused by something I had done.

As for data backups, I have 5 copies of all important data (Note 1). At least once a year the entire lot is written to a drive and then compared with the others to make sure it is complete, so a single drive failure is not a disaster. The backup procedure also includes backing up copies of items I have just done or am in the process of doing (Note 3).

Note 1: Important data is mostly data I have entered myself, or data where the original is unlikely to be available or is no longer available: things such as old engineering documents, programs and graphics I have done myself, my vinyl record collection and, more recently, my digital photos. The greatest share of the space is taken up by the photos, a single raw photo from my Nikon D700 being in the order of 17 MB.

Note 2: I have never really bothered about newer filesystem formats. For nearly 30 years I looked after some very complex systems, and what this taught me was that if such a thing is not broken, _____DON'T_____ fix it; that is, if the old system is providing what one requires, why change it? The change will cause pain.

Note 3: I have actually never lost any data. I came up with this way of treating data from one of the first PCs I ever encountered, a Time 4500, way back in the "bad old days". It had a quite unreliable drive setup and one very quickly got used to backing up everything, no matter how unimportant. A __real__ good training tool it was.

Lindsay
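(for completeness, the mount-count knob being adjusted here is just tune2fs; /dev/sda1 is a placeholder)

  # see the current count and the threshold that triggers a forced fsck
  tune2fs -l /dev/sda1 | grep -i 'mount count'

  # raise the threshold, e.g. to roughly 4x the Debian default
  tune2fs -c 120 /dev/sda1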

On 27/03/12 11:59, Lindsay Sprinter wrote:
But I must say I question the necesity of such a check.
Up until Sunday night I would have agreed with you, but then a routine fsck on a reboot after a kernel upgrade picked up errors in my ext4 /home filesystem which I had to fix by hand. -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

On Tue, 27 Mar 2012, Chris Samuel <chris@csamuel.org> wrote:
Up until Sunday night I would have agreed with you, but then a routine fsck on a reboot after a kernel upgrade picked up errors in my ext4 /home filesystem which I had to fix by hand.
What manual fixing did you do? The last time I had to manually fix a filesystem was in about 1997, when an Ext2 filesystem got corrupted due to a kernel bug and put a few Inodes in states whereby they couldn't be deleted and wouldn't be fixed by a fsck -f. The Inodes in question had flags that indicated an obvious error, so it was a combination of a kernel bug and a fsck bug. I used debugfs to zero the Inode, which made it obviously wrong in a way that let fsck clean up the mess. I don't recall using debugfs in the last decade; I think this is an indication of Ext* working well. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
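(the debugfs step described here would look roughly like this; the device and inode number are hypothetical)

  # open the (unmounted) filesystem read-write and clear the broken inode,
  # then let a forced fsck tidy up the orphaned blocks it leaves behind
  debugfs -w -R "clri <12345>" /dev/sdb1
  e2fsck -fy /dev/sdb1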

Anthony Shipman wrote:
VT-d is completely unrelated. VT-d is virtualisation support. if you ever want to run kvm or virtualbox or similar you'll need it enabled. [...]
Wikipedia says that VT-d is Intel's idea of an IOMMU. But my BIOS doesn't show it despite what the manual says. There is nothing else in the BIOS resembling an IOMMU or memory hole remapping.
AIUI,

1. Intel's CPU virtualization is called "VT-x". You need it to modprobe kvm.

2. Intel's I/O virtualization is called "VT-d". According to #kvm, it is ONLY useful if you assign an ENTIRE disk (as in sata0, not an arbitrary block device) to a VM. Since I don't do that (I use mdadm and LVM), that's where I lost interest in VT-d.

3. Strictly, VT-x is CPU virtualization for x86-64. There is a separate VT-i or something for itanium, but nobody cares.
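(a couple of quick checks that don't depend on what the BIOS menus claim; nothing here is board-specific)

  # VT-x / AMD-V: the cpu flag shows hardware support (the kvm module will still
  # refuse to load and complain if the BIOS has it disabled)
  grep -cE 'vmx|svm' /proc/cpuinfo

  # VT-d / IOMMU: the kernel logs DMAR / IOMMU lines at boot when the firmware exposes it
  dmesg | grep -iE 'dmar|iommu'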
participants (11)

- Anthony Shipman
- Brett Pemberton
- Chris Samuel
- Craig Sanders
- Jason White
- Lindsay Sprinter
- Robin Humble
- Russell Coker
- Tim Connors
- Toby Corkindale
- Trent W. Buck