
If anyone has a few seconds spare, could you please run the following and post the results:

dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync

along with the kernel version and arch, filesystem, layers (lvm, md, etc), and underlying hardware?

When I do it on a bare 7200RPM SATA disk on a modern server running xfs I get 10-15 kbytes/second. I've repeated this on 3 other servers on different hardware with similar results, but when I do it on a ~10-year-old PC with ext3 I get around 500 kbytes/second - 50x faster. I suspect it might be the kernel version (3.8 on the new server, 2.6 on the old PC) and the implementation of O_SYNC in older kernels with respect to metadata, but I don't have enough data points to form any conclusions...

Thanks

James

On 29/04/13 20:21, James Harper wrote:
dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
All tests below on ext4 except the RAIDZ one.

SATA spinning disks, sw RAID0, kernel 3.5.0:
65536 bytes (66 kB) copied, 2.65235 s, 24.7 kB/s

SATA spinning disks, sw RAID5, kernel 2.5.0:
65536 bytes (66 kB) copied, 4.36826 s, 15.0 kB/s

SATA spinning disks, sw RAID10, kernel 3.2.0:
65536 bytes (66 kB) copied, 2.65235 s, 24.7 kB/s
(the size of this test is small enough that it's below the stripe size, hence the similarities above)

SATA spinning disks, sw RAIDZ w/SSD log:
65536 bytes (66 kB) copied, 0.32133 s, 204 kB/s

Older SSD, kernel 3.5.0:
65536 bytes (66 kB) copied, 0.826156 s, 79.3 kB/s

A modern but budget SSD, kernel 3.2.0:
65536 bytes (66 kB) copied, 1.26909 s, 51.6 kB/s

Higher-end SSD, kernel 3.2.0:
65536 bytes (66 kB) copied, 0.0313182 s, 2.1 MB/s

However, after all that... my guess is that your 10-year-old PC is one where the filesystem (ext3 there, not xfs) didn't use write barriers. Try mounting the filesystem on your modern server with "nobarrier" and see what happens to the performance? (A sketch follows below.)
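A quick way to test that guess might look like this (a sketch, assuming the xfs filesystem under test is mounted at /mnt/test; "nobarrier" is the xfs spelling, ext4 uses barrier=0, and this is only safe as a throwaway experiment since it drops crash-consistency guarantees):

    mount -o remount,nobarrier /mnt/test
    dd if=/dev/zero of=/mnt/test/test.bin bs=512 count=128 oflag=sync
    mount -o remount,barrier /mnt/test    # restore the default

If the barrier-less run is dramatically faster, barriers are the difference rather than the kernel version.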

On 29/04/2013, at 8:21 PM, James Harper <james.harper@bendigoit.com.au> wrote:
along with the kernel version and arch, filesystem, layers (lvm, md, etc), and underlying hardware?
I get:

# dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 0.113709 s, 576 kB/s

That's an Oracle Linux VM running on Oracle VM 3.2 running our 3.7 playground kernel. The filesystem is ext4. The disk is virtual, sitting on an OCFS2 filesystem on a local 7.2k RPM disk.

Alternatively:

# dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 0.183353 s, 357 kB/s

That's another Oracle Linux VM running our UEK2 (3.0) kernel. Also ext4; the disk is virtual, backed over iSCSI by an OCFS2 shared filesystem on a QNAP NAS.

Cheers,
Avi

On Mon, 2013-04-29 at 10:21 +0000, James Harper wrote:
If anyone has a few seconds spare, could you please run the following and post the results:
dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
along with the kernel version and arch, filesystem, layers (lvm, md, etc), and underlying hardware?
When I do it on a bare 7200RPM sata disk on a modern server running xfs I get 10-15kbytes/second. I've repeated this on 3 other servers on different hardware with similar results, but when I do it on a ~10yo PC with ext3 I get around 500kbytes/second - 50x faster. I suspect it might be the kernel version (3.8 on new server, 2.6 on old pc) and the implementation of O_SYNC in older kernels wrt metadata but I don't have enough data points to form any conclusions...
Thanks
James
Hi James,

andrew@linux-bczj:~> dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 6.00686 s, 10.9 kB/s

andrew@linux-bczj:~> uname -a
Linux linux-bczj.site 3.7.10-1.1-desktop #1 SMP PREEMPT Thu Feb 28 15:06:29 UTC 2013 (82d3f21) x86_64 x86_64 x86_64 GNU/Linux

AMD Phenom, and a hardware list attached.

Andrew Greig

So what does this all mean? Is it a regression in Linux 3?
Or were previous versions not actually blocking while sync was called?
I think the latter. Probably the filesystems didn't sync as expected in previous versions of Linux.

I was surprised by the numbers, but according to Wikipedia a modern 7200RPM SATA disk maxes out at around 75-100 IOPS. An average run of my dd script is:

# dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 4.19463 s, 15.6 kB/s

So that's 128 sync writes in 4 seconds, or 32 writes per second. Every time you write a 512-byte chunk to the test.bin file there will also be a metadata update, and even if that is only a single write operation we are now at 64 IOPS, which is the same order of magnitude as the 75-100 figure given by Wikipedia (the arithmetic is spelled out below).

Small synchronous writes suck. I think the suckiness increases even more with 512-byte writes on 4k sectors, because of the read-modify-write situation.

I'm hoping that ceph with the journal on an SSD will improve my situation somewhat... my SSDs arrive tomorrow.

James
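Spelling out that arithmetic (a sketch using the measured numbers from the run above; the factor of 2 assumes exactly one metadata write per data write):

    # 128 sync writes over the measured 4.19463 s
    echo "scale=1; 128 / 4.19463" | bc    # ~30.5 data writes/s
    echo "scale=1; 256 / 4.19463" | bc    # ~61.0 IOPS once metadata writes are counted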

James Harper <james.harper@bendigoit.com.au> wrote:
I'm hoping that ceph with the journal on an SSD will improve my situation somewhat... my SSD's arrive tomorrow.
I found out that to take full advantage of a modern SSD, it is necessary to have a machine with a SATA 3.0 controller capable of a 6 Gbps transfer rate. This rules out my current desktop and laptop systems.

On 30/04/13 11:09, Jason White wrote:
James Harper <james.harper@bendigoit.com.au> wrote:
I'm hoping that ceph with the journal on an SSD will improve my situation somewhat... my SSD's arrive tomorrow.
I found out that to take full advantage of a modern SSD, it is necessary to have a machine with a SATA 3.0 controller capable of a 6 GBPS transfer rate.
this rules out my current desktop and laptop systems.
Not really.. You'll still reap all the benefits of instant seek times and huge IOPS performance. You'll just cap out at ~300 Mbyte/sec transfers instead of ~600 Mbyte/sec, but honestly, how often does that matter?

On 30/04/13 11:09, Jason White wrote:
James Harper <james.harper@bendigoit.com.au> wrote:
I'm hoping that ceph with the journal on an SSD will improve my situation somewhat... my SSD's arrive tomorrow.
I found out that to take full advantage of a modern SSD, it is necessary to have a machine with a SATA 3.0 controller capable of a 6 GBPS transfer rate.
this rules out my current desktop and laptop systems.
Not really..
You'll still reap all the benefits of instant seek times and huge IOPS performance. You'll just cap out at ~300 Mbyte/sec transfers instead of ~600 Mbyte/sec, but honestly, how often does that matter?
The latency is what I'm going for here. The SSD is only holding the journal, so it has the benefit of being able to turn tiny synchronous writes into bigger writes to the rotating media, and while higher SSD throughput would help smooth out the peaks, I'm ultimately still limited by the throughput of the rotating media. If I was using the SSD as a cache (bcache/flashcache/etc) then the faster throughput might matter more, but we all have budgets :)

James

Hi all

In my work (Open Query - MySQL and MariaDB database backed infrastructure) we tend to find that it's not data transfer rate or RPMs that's the biggest hindrance. On a disk or array, it's seek time, and with SANs it's latency.

The disk story is because seek time is still counted in milliseconds, and rarely do you transfer a lot of data in one chunk. More commonly you'll read or write smaller chunks in different locations. Thus you're bound by the seek speed.

A SAN is only fast when you request big chunks of data, or request smaller chunks that have in some recent past already been accessed (by you or another host). Otherwise you're not using its cache, and thus it suffers from the same issue as any disk or array. (FYI, MySQL/MariaDB of course keep lots of data/index info in memory, so when they read disk data it won't be in the cache - we tend to make SAN people cry, as the *effective* performance for these tasks is just so dismal.)

SSD is of course very nice, as it gets rid of the seeks. You can also use SSD as an intermediate caching mechanism, as it's persistent. There are now some RAID controllers available that implement this, using SSD as well as RAM in multiple layers of caching. This might come in handy when you need more storage.

We sometimes play with SATA RAID rather than SAS for fast yet cost-efficient storage. SAS is freaking expensive for less space. SAS has a longer command queue, but when you stick a RAID controller in front of it that becomes irrelevant, as the controller will work that out for you.

No matter which physical device or filesystem you use, you'll find that setting 'noatime' (in /etc/fstab) helps; a sketch follows below. You generally don't need to know the last access time (unless it's for some strict read-access security auditing requirement), and it'll prevent at least one and possibly two seeks.

Cheers,
Arjen.
--
Exec.Director @ Open Query (http://openquery.com) MariaDB/MySQL services
Sane business strategy explorations at http://upstarta.com.au
Personal blog at http://lentz.com.au/blog/
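A minimal sketch of the noatime setting mentioned above (the device and mount point are hypothetical; note that kernels since 2.6.30 already default to the milder relatime):

    # /etc/fstab - noatime avoids an inode update (and a seek) on every read
    /dev/sda1  /data  ext4  defaults,noatime  0  2

    # or apply it to a mounted filesystem without rebooting:
    mount -o remount,noatime /data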

On Tue, Apr 30, 2013 at 09:09:14AM +1000, hannah commodore wrote:
So what does this all mean? Is it a regression in Linux 3?
Or were previous versions not actually blocking while sync was called?
i think Toby indicated the source of the problem with his comment about barriers. from kernel 2.6.28 onwards, write barriers are turned on by default in ext4[1]. i'm not sure what version they got added for XFS, but they've definitely been on by default for a few years now, and are known to have a massive performance penalty for mysql and innodb[2].

similarly, LVM got full write barrier support in 2.6.33. mdadm RAID0/1 has had write barriers for several years, and RAID5/6 got them in late 2009/early 2010 IIRC.

it's only safe to turn barriers off if you disable any write-caching in the drive OR if you have a non-volatile write cache (e.g. hardware raid, something like bcache with an SSD, or ZFS with an SSD for ZIL); otherwise you risk data loss and filesystem corruption in the event of a crash or power failure. (see the sketch below.)

[1] http://kernelnewbies.org/Ext4#head-25c0a1275a571f7332fa196d4437c38e79f39f63
this also links to a May 2008 article on write barriers: http://lwn.net/Articles/283161/
(see also http://lwn.net/Articles/349970/ "Ext3 and RAID: silent data killers?", which inspired mdadm's author to add write barriers for RAID5)

[2] https://pracops.com/wiki/index.php/Write_barriers
this one links to useful info at: http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache...

craig

ps: there's also the fact that dd just isn't a very good tool for performance benchmarking, especially with such small files.

pps: my suggestion would be to run mysql VMs on ZFS volumes (rather than LVM+mdadm LVs) with 16KB record size, an SSD for L2ARC and ZIL, and "skip-innodb_doublewrite" in mysql.conf, as suggested here: https://blogs.oracle.com/realneel/entry/mysql_innodb_zfs_best_practices and here: http://ftp.nchu.edu.tw/MySQL/tech-resources/articles/mysql-zfs.html

for non-VM mysql, a zfs filesystem for /var/lib/mysql with 16KB record size, SSD caching/ZIL, and skip-innodb_doublewrite. similar for postgresql, although the tuning recommendation there is for 8K recordsize for pgsql zfs filesystems/volumes.

also, enabling zfs compression has been shown to improve performance with some kinds of data and IO loads. I don't know if anyone has done similar testing with mysql.

--
craig sanders <cas@taz.net.au>
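An illustrative sketch of the safety check described above (device and mount point are hypothetical; the mount option spelling differs per filesystem - ext4 takes barrier=0/1, xfs takes nobarrier/barrier):

    # is the drive's volatile write cache enabled?
    hdparm -W /dev/sda

    # only after disabling the drive cache is a barrier-less mount reasonable
    hdparm -W0 /dev/sda
    mount -o remount,barrier=0 /mnt/data    # ext4 spelling; xfs uses nobarrier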

Craig Sanders <cas@taz.net.au> writes:
it's only safe to turn barriers off if disable any write-caching in the drive OR if you have a non-volatile write cache (e.g. *hardware raid*, something like bcache with an SSD, or ZFS with an SSD for ZIL),
Not just hw raid, but hardware raid with a BBU or equivalent. I've seen at least one case where the BBU was an "optional extra" to make the base price look more reasonable, and $manager didn't notice it was a separate line item until it was too late...

On 29/04/13 20:21, James Harper wrote:
If anyone has a few seconds spare, could you please run the following and post the results: dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
some older machines:

2.6.32-5-686 on a 2.40GHz Xeon running Debian 6
plain ext3 on an 80GB 7200RPM Maxtor drive
65536 bytes (66 kB) copied, 0.150762 s, 435 kB/s

3.2.0-4-686-pae on a 1.86GHz Core2 running Debian 7
plain ext4 on a 500GB 7200RPM WD drive
65536 bytes (66 kB) copied, 5.45952 s, 12.0 kB/s

regards,
Glenn
--
sks-keyservers.net 0x6d656d65

Thanks everyone. You've confirmed that what I'm seeing isn't something strange with my servers, even if it isn't exactly what I expected. James

On 29/04/13 20:21, James Harper wrote:
If anyone has a few seconds spare, could you please run the following and post the results:
dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
along with the kernel version and arch, filesystem, layers (lvm, md, etc), and underlying hardware?
65536 bytes (66 kB) copied, 0.681949 s, 96.1 kB/s

Linux chris-ultralap 3.9.0-rc8-g824282c-1+ #4 SMP Tue Apr 23 22:15:53 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

ext4 (no layers)

Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz, 8GB RAM, Crucial M4-CT064M4SSD3 mSATA SSD

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

On 29/04/13 20:21, James Harper wrote:
If anyone has a few seconds spare, could you please run the following and post the results:
dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
along with the kernel version and arch, filesystem, layers (lvm, md, etc), and underlying hardware?
65536 bytes (66 kB) copied, 0.108355 seconds, 605 kB/s

Linux merri-v 2.6.18-308.13.1.el5 #1 SMP Thu Jul 26 05:45:09 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

GPFS over QDR InfiniBand

Intel(R) Xeon(R) CPU X5550 @ 2.67GHz, 48GB RAM

DDN SFA10K array (900GB 10K SAS) for data, IBM V7000 with SSD for metadata.

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

On 29/04/13 20:21, James Harper wrote:
If anyone has a few seconds spare, could you please run the following and post the results:
dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
along with the kernel version and arch, filesystem, layers (lvm, md, etc), and underlying hardware?
65536 bytes (66 kB) copied, 0.066523 seconds, 985 kB/s

Linux bruce-m.vlsci.unimelb.edu.au 2.6.18-348.4.1.el5 #1 SMP Tue Apr 16 15:40:06 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

PanFS (from Panasas)

Intel(R) Xeon(R) CPU X5550 @ 2.67GHz, 24GB RAM

4 Panasas PAS8 shelves (3 generations old).

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

On 29.04.13 10:21, James Harper wrote:
If anyone has a few seconds spare, could you please run the following and post the results:
dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
along with the kernel version and arch, filesystem, layers (lvm, md, etc), and underlying hardware?
65536 bytes (66 kB) copied, 0.0362834 s, 1.8 MB/s

2.6.32-41-generic, x86

model name : VIA C7 Processor 1500MHz
cpu MHz    : 1500.000

/dev/sdb1: Linux rev 1.0 ext3 filesystem data

On 29/04/13 20:21, James Harper wrote:
dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync

hi

------------------ box 1

$ uname -r
3.8.8-203.fc18.i686.PAE (32 bit Fedora 18)

disk: mdadm RAID 5 array over 4*1TB 7200 SATA disks

$ dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 0.176816 s, 371 kB/s

------------------ box 2

$ uname -r
3.8.8-203.fc18.x86_64 (64 bit Fedora 18)

disk: ext4 on a Samsung SSD

$ dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 1.30531 s, 50.2 kB/s

Steve

James Harper <james.harper@bendigoit.com.au> writes:
If anyone has a few seconds spare, could you please run the following and post the results:
dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
Since no one has suggested it yet, you might try bonnie++ instead.
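For reference, a typical invocation might be (a sketch; the target directory and size are hypothetical, -s should be at least twice RAM to defeat caching, -n 0 skips the small-file tests, and -u is required when running as root):

    bonnie++ -d /mnt/test -s 4g -n 0 -u nobody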

James Harper <james.harper@bendigoit.com.au> writes:
If anyone has a few seconds spare, could you please run the following and post the results:
dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
Since no one has suggested it yet, you might try bonnie++ instead.
My strace of mysql showed the behaviour:

write(512 bytes)
fsync()
write(1024 bytes)
fsync()

and so on, and using dd to write 512-byte chunks with O_SYNC is a good approximation of this, and very consistent with the behaviour I was seeing from mysql. Initially I was looking for general performance problems, but in the end it is this very specific case, and I have now proven it's not specific to this server.

The reason why mysql suddenly started doing these very small sync writes is still a mystery though. Other backup servers don't show this behaviour.

James
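For anyone who wants to reproduce that observation, something like this should show the same pattern (a sketch; the process name is an assumption and varies by distro and packaging):

    # trace just the write/fsync system calls of a running mysqld
    strace -f -e trace=write,fsync -p $(pidof mysqld)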

Hi James
My strace of mysql showed the behaviour:
write(512 bytes) fsync() write(1024 bytes) fsync() and so on
Indeed, that'll be writes to the ib_logfile* and/or binlog. Essentially sequential writes (but that's only useful if they're on separate spindles). For any commit (or single write command in autocommit mode), MySQL (or MariaDB) has to fsync to the InnoDB transaction logfile as well as the binary log (if enabled - for replication and point-in-time recovery). So that's two fsyncs per transaction commit; the relevant settings are sketched below.
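The fsync-per-commit behaviour described above is governed by two standard MySQL/MariaDB settings (shown as a sketch at their most durable values; relaxing either trades durability for speed):

    [mysqld]
    # 1 = fsync the InnoDB transaction log at every commit (safest)
    innodb_flush_log_at_trx_commit = 1
    # 1 = fsync the binary log at every commit; 0 leaves flushing to the OS
    sync_binlog = 1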
and using dd to write 512 byte chunks with O_SYNC is a good approximation of this and very consistent with the behaviour I was seeing from mysql. Initially I was looking for general performance problems but in the end it is this very specific case, and I have now proven it's not specific to this server.
The reason why mysql suddenly started doing these these very small sync writes is still a mystery though. Other backup servers don't show this behaviour.
You can approximate MySQL's different access methods (sequential for logs, random access for tablespace) using the hdlatency tool (on Launchpad). It's a bit of C without any external dependencies. Run it with --quick.

It'll test the different access methods similar to the way MySQL/MariaDB use them, and also check with direct I/O, which is generally what you'd use with InnoDB (flush-method=O_DIRECT; see the sketch below). Direct I/O tends to be faster, except on some SANs and other attached storage. Again, we use hdlatency to just see which one is faster and configure accordingly.

hdlatency is also known as "the tool that makes SAN operators cry" ;-) Generally a desktop SATA comes up with better numbers than the average SAN... I kid you not.

Cheers,
Arjen.
--
Exec.Director @ Open Query (http://openquery.com) MariaDB/MySQL services
Sane business strategy explorations at http://upstarta.com.au
Personal blog at http://lentz.com.au/blog/
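The direct I/O setting referred to above goes in my.cnf (a sketch; O_DIRECT bypasses the OS page cache for InnoDB data files, avoiding double-buffering, though as noted it can be slower on some SANs):

    [mysqld]
    innodb_flush_method = O_DIRECT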

On Mon, 29 Apr 2013, James Harper wrote:
If anyone has a few seconds spare, could you please run the following and post the results:
dd if=/dev/zero of=test.bin bs=512 count=128 oflag=sync
along with the kernel version and arch, filesystem, layers (lvm, md, etc), and underlying hardware?
When I do it on a bare 7200RPM sata disk on a modern server running xfs I get 10-15kbytes/second. I've repeated this on 3 other servers on different hardware with similar results, but when I do it on a ~10yo PC with ext3 I get around 500kbytes/second - 50x faster. I suspect it might be the kernel version (3.8 on new server, 2.6 on old pc) and the implementation of O_SYNC in older kernels wrt metadata but I don't have enough data points to form any conclusions...
What a fun exercise.

At work with you-beaut SANs, tier 1 storage with 15k SAS disks and SSD cache, still only getting 600k (some of our dev tier 3 stuff comes out quicker - too many VM datastores to disentangle how heavily loaded each datastore is, though). About the same as my 5-year-old laptop with a 240G SSD in it running 3.8. But about 60 times faster than my Raspberry Pi. Imagine a Beowulf cluster of those.

And my root filesystem and (empty) home directory on my ZFS fileserver boot off a USB nanostick, which gets the same performance as a ZFS pool on a real disk (er, the raidz 3-disk pool hasn't come back after a minute, but that disk is rather busy at this moment. That box never has been sane).

--
Tim Connors
participants (14): Andrew Greig, Arjen Lentz, Avi Miller, Chris Samuel, Craig Sanders, Erik Christiansen, Glenn McIntosh, hannah commodore, James Harper, Jason White, Steve Roylance, Tim Connors, Toby Corkindale, trentbuck@gmail.com