[luv-main] PostgreSQL benchmarked on XFS vs ZFS vs btrfs vs ext4

Resending this now the LUV list seems to be back up, as I never saw it come through yesterday. Apologies if this is a duplicate.

-------- Original Message --------

Hi,
Some months ago, I ran some (probably naive) benchmarks looking at how pgbench performed on an identical system with differing filesystems and mount options.

Since then some of you have pointed out that ZFS is looking pretty good on Linux now, and I'm sure there's been a bunch of btrfs fixes too, and no doubt various updates in the Linux kernel and PostgreSQL that should help performance.

I ran the tests on Ubuntu 11.04 with Pg 9.0 first, then upgraded the system to Ubuntu 11.10 (beta) with Pg 9.1 and ran them again. The latter combination showed a considerable performance improvement overall, although I didn't investigate to find out whether this was due to kernel improvements, postgres improvements, or virtio improvements.

The results are measured in transactions per second, with higher numbers being better.

Results:
ext4 (data=writeback,relatime):            natty: 248   oneiric: 297
ext4 (data=writeback,relatime,nobarrier):  natty: n/t   oneiric: 1409
XFS (relatime):                            natty: n/t   oneiric: 171
btrfs (relatime):                          natty: 61.5  oneiric: 91
btrfs (relatime,nodatacow):                natty: n/t   oneiric: 128
ZFS (defaults):                            natty: 171   oneiric: 996
(n/t = didn't test)

Conclusion:
Last time I ran these tests, xfs and ext4 pulled very similar results, and both were miles ahead of btrfs. This time around, ext4 has managed a significantly faster result than xfs.

However we have a new contender: ZFS performed *extremely* well on the latest Ubuntu setup, achieving triple the performance of regular ext4! I'm not sure how it achieved this, and whether we're losing some kind of data protection (eg. like the "barrier" options in XFS and ext4) to get it. If ext4 has barriers disabled, it surpasses even ZFS's high score. Perhaps some of you can shed some light on this for me?

Oddly, ZFS performed wildly differently on Ubuntu 11.04 vs 11.10b. I can't explain this, as the ZFS kernel module was identical in both cases (coming from a third-party apt repository). Any ideas?

Cheers,
Toby
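The post doesn't give the exact pgbench invocation used, so the scale factor, client count and duration below are purely illustrative; a run of the kind described above might look something like:

```shell
# Hypothetical reconstruction -- the post doesn't state the actual scale
# factor, client count or run duration, so these values are illustrative.

# initialise a test database at scale factor 100 (roughly 1.5GB of data)
createdb pgbench_test
pgbench -i -s 100 pgbench_test

# run the TPC-B-like workload: 8 clients, 2 worker threads, 5 minutes.
# the "tps" figure pgbench reports at the end is the number being
# compared between filesystems above.
pgbench -c 8 -j 2 -T 300 pgbench_test
```

For filesystem comparisons like this one, the database cluster (or at least /var/lib/postgresql) needs to sit on the filesystem under test, and the cluster should be recreated between runs so earlier runs don't skew layout or caching.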

On Wed, Sep 14, 2011 at 04:00:54PM +1000, Toby Corkindale wrote:
ZFS (defaults): natty: 171 oneiric: 996
did you tune the fs for database use? e.g. make a separate zfs for ~postgres and give it an 8K block size.

apparently it's also a good idea to reduce/limit the size of the ARC (ZFS's equivalent to Linux's disk buffering & caching) because, by default, it will use up all available unused memory less 1GB. The db server will want some or all of that RAM, and db servers often have their own built-in app-specific caching too (mysql certainly does).

i haven't really looked into tuning ZFS yet, but i have run across some web pages with useful info.

http://www.solarisinternals.com/wiki/index.php/ZFS_for_Databases#PostgreSQL_...
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Tuning_...

also, ZFS has built-in support for write intent logging (ZIL) and caching (L2ARC) on a separate block device (i.e. disk or partition), e.g. on an SSD or other fast disk. a couple of GB of ZIL will really speed up random write performance (similar to having a battery-backed or SSD non-volatile write cache on a raid controller), and a separate cache will speed up reads.

no idea about your questions. i really don't know ZFS well enough yet to even guess.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #277: Your Flux Capacitor has gone bad.
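The tuning Craig describes could be sketched roughly as below, assuming zfsonlinux and a pool named "tank" (the pool, dataset and device names are illustrative, not from the thread):

```shell
# Sketch of the suggested ZFS tuning -- pool/dataset/device names here
# are illustrative assumptions, not taken from the original posts.

# separate dataset for postgres with an 8K record size, matching
# PostgreSQL's default page size
zfs create -o recordsize=8k -o mountpoint=/var/lib/postgresql tank/postgres

# cap the ARC, e.g. at 2GB (value in bytes; zfsonlinux module option,
# takes effect when the zfs module is next loaded)
echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf

# put the ZIL and L2ARC on fast separate devices, e.g. SSD partitions
zpool add tank log /dev/sdb1
zpool add tank cache /dev/sdb2
```

The recordsize only applies to files written after it is set, so it should be configured before initialising the database cluster.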

On Wed, Sep 14, 2011 at 04:38:22PM +1000, Craig Sanders wrote:
apparently it's also a good idea to reduce/limit the size of the ARC
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Memo...

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #195: We only support a 28000 bps connection.

On Wed, Sep 14, 2011 at 04:40:13PM +1000, Craig Sanders wrote:
On Wed, Sep 14, 2011 at 04:38:22PM +1000, Craig Sanders wrote:
apparently it's also a good idea to reduce/limit the size of the ARC
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Memo...
also, this ARC memory usage thing is really pushing me towards building a separate workstation and turning this current machine into a dedicated server. i've been gradually working towards this for months, but i think i'm going to have to pick up the pace a bit(*)

ARC is a memory-hungry pig. Chromium browser is a memory-hungry pig, and Iceweasel is too (but nowhere near as bad as Chromium). running all these and more on the same 8GB machine is mostly OK, but there are occasionally annoying swap delays. especially annoying as i've got used to how fast and responsive my machine has become since i installed an SSD as the OS disk a few weeks ago. having swap on an SSD is nice too... but even swapping to an SSD results in noticeable delays sometimes.

(*) what a shame :)

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #177: sticktion

On 14/09/11 16:38, Craig Sanders wrote:
On Wed, Sep 14, 2011 at 04:00:54PM +1000, Toby Corkindale wrote:
ZFS (defaults): natty: 171 oneiric: 996
did you tune the fs for database use? e.g. make a separate zfs for ~postgres and give it an 8K block size.
All tests were performed with the /var/lib/postgresql directory being a separate partition, with a freshly-created filesystem for the test. I didn't do any tuning beyond the mount options mentioned in my post, so no, Pg didn't get an 8K block size.

I'm not convinced on the 8K vs 4K block size question though. Some benchmarks I've seen actually indicate that smaller sizes perform better, and that swapping 4K for 8K doesn't make a whole lot of difference on the benchmark I'm using. Eg:
http://www.fuzzy.cz/en/articles/benchmark-results-hdd-read-write-pgbench/
apparently it's also a good idea to reduce/limit the size of the ARC (ZFS's equivalent to Linux's disk buffering & caching) because, by default, it will use up all available unused memory less 1GB. The db server will want some or all of that RAM, and db servers often have their own built-in app-specific caching too (mysql certainly does).
That sounds much like Linux's own VM system then, in which case, it's fine to do that. Without going into details, if you've tuned postgres correctly, then this is actually the optimal setup.
i haven't really looked into tuning ZFS yet, but i have run across some web pages with useful info.
http://www.solarisinternals.com/wiki/index.php/ZFS_for_Databases#PostgreSQL_... http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Tuning_...
also, ZFS has built in support for write intent logging (ZIL) and caching (L2ARC) on a separate block device (i.e. disk or partition). e.g on an SSD or other fast disk. a couple of GB of ZIL will really speed up random write performance (similar to having a battery-backed or SSD non-volatile write cache on a raid controller) and a separate cache will speed up reads.
I noticed that. I'd like to experiment with those again some time, but as it stands, just having three fast drives striped seems to work pretty well anyway.

Looking at disk stats as I ran the benchmarks, I noticed that the one that managed to generate the most disk I/O (in MB/sec) was ext4-nobarrier, followed by ZFS. Everything else was quite a way behind.

ZFS appeared to be using a lot more CPU than ext4 though... not sure what to make of that. I guess the extra complexity in the FS has to cause something!

On Wed, 14 Sep 2011 05:06:01 PM Toby Corkindale wrote:
ZFS appeared to be using a lot more CPU than ext4 though.. not sure what to make of that..
Well it has to checksum all data being written to disk, and verify the checksums of anything being read back from disk. Same as btrfs will be doing.

cheers,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On Wed, 14 Sep 2011, Chris Samuel <chris@csamuel.org> wrote:
ZFS appeared to be using a lot more CPU than ext4 though.. not sure what to make of that..
Well it has to checksum all data being written to disk, and verify the checksums of anything being read back from disk.
[521878.448007] raid6: int64x1   2043 MB/s
[521878.516018] raid6: int64x2   2160 MB/s
[521878.584012] raid6: int64x4   1894 MB/s
[521878.652042] raid6: int64x8   1599 MB/s
[521878.720007] raid6: sse2x1    4056 MB/s
[521878.788004] raid6: sse2x2    4831 MB/s
[521878.856005] raid6: sse2x4    7119 MB/s
[521878.856007] raid6: using algorithm sse2x4 (7119 MB/s)

I just loaded the raid6 module on my E4600 (2*2.4GHz) system. Apparently the RAID-6 operations, which are probably more computationally intensive than any basic checksum, can be performed at a rate of 7GB/s. As the transfer rates for a SQL test are probably a couple of orders of magnitude less than that, I doubt that checksums are enough to explain the CPU use.

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
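Russell's order-of-magnitude argument is easy to check with userspace tools. Timing a plain CRC over in-memory data gives a rough feel for checksum cost (cksum's CRC-32, not ZFS's default fletcher4 checksum, so treat it only as an illustration):

```shell
# Rough probe of checksum throughput vs disk transfer rates: pipe 256 MiB
# of zeros through cksum (POSIX CRC-32) and time it.  This is NOT ZFS's
# fletcher4 checksum; it just illustrates that simple checksums run at
# rates far above the MB/s figures a pgbench run generates.
time dd if=/dev/zero bs=1M count=256 2>/dev/null | cksum
```

cksum prints the CRC followed by the byte count; dividing 256 MiB by the elapsed time gives the checksum rate to compare against the benchmark's I/O rates.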

On 14/09/11 23:33, Russell Coker wrote:
On Wed, 14 Sep 2011, Chris Samuel<chris@csamuel.org> wrote:
ZFS appeared to be using a lot more CPU than ext4 though.. not sure what to make of that..
Well it has to checksum all data being written to disk, and verify the checksums of anything being read back from disk.
[521878.448007] raid6: int64x1   2043 MB/s
[521878.516018] raid6: int64x2   2160 MB/s
[521878.584012] raid6: int64x4   1894 MB/s
[521878.652042] raid6: int64x8   1599 MB/s
[521878.720007] raid6: sse2x1    4056 MB/s
[521878.788004] raid6: sse2x2    4831 MB/s
[521878.856005] raid6: sse2x4    7119 MB/s
[521878.856007] raid6: using algorithm sse2x4 (7119 MB/s)
I just loaded the raid6 module on my E4600 (2*2.4GHz) system. Apparently the RAID-6 operations, which are probably more computationally intensive than any basic checksum, can be performed at a rate of 7GB/s. As the transfer rates for a SQL test are probably a couple of orders of magnitude less than that, I doubt that checksums are enough to explain the CPU use.
The Linux RAID6 module is using highly optimised routines targeted at specific x86 CPU features - note the SSE2 mentioned. ZFS may not have such optimisations, or they may have been targeted at SPARC CPUs instead, or something..

Hiya,

If you want to have an easy test of db-like access on a filesystem or hardware configuration, try https://launchpad.net/hdlatency which I wrote a few years back. There are no external dependencies; it's simple C.

It tests specific sequential as well as random access reads/writes in various block sizes, with and without direct I/O. When used in --quick mode it limits the tests to the specific patterns used by MySQL server.

The issue with using a more complex app (PostgreSQL, MySQL, etc) for benchmarking is that you'd need to do very specific tuning on that end also to gain optimal results.

Of the filesystems mentioned, extN would tend to be slowest (do set noatime in fstab), and XFS is particularly fast when properly configured. We haven't had any clients use either zfs or btrfs so haven't delved into that.

Cheers,
Arjen.

--
Exec.Director @ Open Query (http://openquery.com) MySQL services
Sane business strategy explorations at http://Upstarta.biz
Personal blog at http://lentz.com.au/blog/

Thanks Arjen,

While synthetic testing tools can be useful, I still think the most valuable test is one using the actual target application. Postgres, for instance, includes all sorts of techniques to try and improve disk access, and so the way it accesses the disk isn't going to be as random (or sequential) as a synthetic test.

I agree that there is tuning to be done in the application to optimise it for certain storage backends - however in my experience with Pg, those adjustments tend more towards the physical setup (ie. disk cache, system cache, number of disk spindles, seek time) rather than the file system in use. I'm interested to hear examples of when this has not been the case.

Also with the file systems, while there are optimal creation-time and mount-time things that can be set (eg. noatime, nodatacow, journal mode), these only tend to make a small difference (<10%) compared to the overall filesystem performance.

I'm interested in your comment that extN performs the slowest and XFS the fastest with proper configuration. My experience has been that ext4 actually performs very well - but perhaps my knowledge of how to optimally configure an ext4 system is better than for xfs. (Or perhaps Postgres has a different access pattern to MySQL?) Would you mind sharing your tips for configuring xfs optimally for a database load? I'd like to re-run my tests with them.

Cheers,
Toby

On 15/09/11 11:40, Arjen Lentz wrote:
Hiya,
If you want to have an easy test of db-like access on a filesystem or hardware configuration, try https://launchpad.net/hdlatency which I wrote a few years back. There are no external dependencies, it's simple C.
It tests specific sequential as well as random access reads/writes in various block sizes, with and without direct I/O. When used in --quick mode it limits the tests to the specific patterns used by MySQL server.
The issue with using a more complex app (PostgreSQL, MySQL, etc) for benchmarking is that you'd need to do very specific tuning on that end also to gain optimal results.
Of the filesystems mentioned, extN would tend to be slowest (do set noatime in fstab), and XFS is particularly fast when properly configured. We haven't had any clients use either zfs or btrfs so haven't delved into that.
Cheers, Arjen.
-- .signature

On 15/09/11 11:40, Arjen Lentz wrote:
From those filesystems mentioned, extN would tend to be slowest (do set noatime in fstab), and XFS is particularly fast when properly configured. We haven't had any clients use either zfs or btrfs so haven't delved in to that.
Interestingly, this recent comprehensive benchmark of Postgres on differing filesystems, with differing block sizes, had XFS coming out at around 75% of the speed of ext4 (so 25% slower) when barriers were enabled. http://bit.ly/ri3vWt

Curiously, with barriers disabled, the speeds were only a few percent different.

In general this guy was getting slower top speeds than I did, but he was using a single drive, whereas I had three striped.

-Toby

On Wed, Sep 14, 2011 at 05:06:01PM +1000, Toby Corkindale wrote:
All tests were performed with the /var/lib/postgresql directory being a separate partition, with a freshly-created filesystem for the test.
'zpool create ...' followed by 'zfs create -o mountpoint=/var/lib/postgresql postgresql' or similar?
I didn't do any tuning beyond the mount options mentioned in my post, so no, Pg didn't get an 8K block size.
I'm not convinced on the 8K vs 4K block size though - some benchmarks
i have no opinion one way or the other yet. haven't done the testing myself, just reporting what i'd read.

4K block sizes (i.e. ashift=12 rather than the default ashift=9 when you create the zpool), on the other hand, are really important. especially if you have some "Advanced Format" 4K-sector drives, or there's a chance that you'll be adding some to the pool in the future - which is pretty much a certainty, with most 2TB and (i think) all 3TB drives being 4K. I've even got a few 1TB drives that are 4K sectors (2 of them in my zpool).

apparently that's a reason to use whole disks rather than partitions when creating/adding to a zpool, so zfs can autodetect whether the drive uses 512-byte or 4KB sectors. (of course, i didn't know this when i created my test zpool. didn't know about the ashift=12 option either. no problem, i was expecting to blow it away and start from scratch once i'd played with it enough to know how it worked.)

I didn't realise what I had when i saw my first 4K-sector drive, so just went ahead and formatted it as usual with 512-byte alignment. performance really sucked, and i had no idea why until i googled the model number. had to backup, reformat, and restore.
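The sector-size check and the ashift=12 pool creation Craig mentions could look roughly like this (device and pool names are illustrative):

```shell
# Checking sector sizes and forcing 4K alignment -- device and pool
# names here are illustrative assumptions.

# "Advanced Format" 4K drives often report 512-byte logical sectors for
# compatibility, so compare the physical and logical sizes:
blockdev --getpbsz /dev/sda   # physical sector size
blockdev --getss  /dev/sda    # logical sector size

# force 4K alignment at pool creation (ashift is a power of two:
# 2^12 = 4096 bytes).  ashift is fixed per vdev and can't be changed
# later, so it has to be right when the pool is created.
zpool create -o ashift=12 tank /dev/sda /dev/sdb /dev/sdc
```

A pool created with ashift=9 on a 4K drive forces read-modify-write cycles for every sub-4K write, which matches the "performance really sucked" symptom described above.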
I've seen actually indicate that smaller sizes perform better - but that swapping 4 for 8 doesn't make a whole lot of difference on the benchmark I'm using. Eg: http://www.fuzzy.cz/en/articles/benchmark-results-hdd-read-write-pgbench/
thanks for the link.
apparently it's also a good idea to reduce/limit the size of the ARC (ZFS's equivalent to Linux's disk buffering& caching) because [...]
That sounds much like Linux's own VM system then, in which case, it's fine to do that. Without going into details, if you've tuned postgres correctly, then this is actually the optimal setup.
very much like it. AIUI, the effort to hack zfsonlinux to use Linux's own caching probably isn't worth it, and would make keeping up with ZFS on Illumos, FreeBSD, etc much more difficult. it's not a huge problem unless you're running other stuff (like a standard desktop) on the same system and don't have enough memory.
I noticed that.. I'd like to experiment with those again some time, but as it stands, just having three fast drives striped seems to work pretty well anyway.
just striped...scary for anything but "I really don't care if i lose it all" data. zfs is good, but it's not magic. it can't recover data when a disk dies if there isn't another copy on the other disks in the pool. for safety, stick another drive in and use raidz. it's like raid-5 but without (most of) the raid-5 write performance problems.
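The difference between the two setups discussed here, sketched as commands (pool and device names are illustrative):

```shell
# Illustrative only -- pool and device names are assumptions.

# what was benchmarked: a plain stripe across three disks.  fast, but
# no redundancy: losing any one disk loses the whole pool.
zpool create scratch /dev/sdb /dev/sdc /dev/sdd

# the suggestion: add a fourth disk and use raidz (single parity,
# survives one disk failure), avoiding most of raid-5's write penalty
# because ZFS always writes full variable-width stripes.
zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
```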
Looking at disk stats as I ran the benchmarks, I noticed that the one that managed to generate the most disk I/O (in MB/sec) was ext4-nobarrier, followed by ZFS. Everything else was quite a way behind. ZFS appeared to be using a lot more CPU than ext4 though.. not sure what to make of that.. I guess the extra complexity in the FS has to cause something!
compression enabled?

ZFS does use a fair amount of CPU power. it does a lot more than most filesystems. also, given that it's designed for "Enterprise" environments, its developers have been able to make stronger assumptions about hardware capabilities and performance than more consumer-oriented dev teams can get away with. i.e. modern desktop machines won't be troubled by it. aging ones will, especially if they don't have enough RAM.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #28: CPU radiator broken

On 14/09/11 22:20, Craig Sanders wrote:
On Wed, Sep 14, 2011 at 05:06:01PM +1000, Toby Corkindale wrote:
All tests were performed with the /var/lib/postgresql directory being a separate partition, with a freshly-created filesystem for the test.
'zpool create ...' followed by 'zfs create -o mountpoint=/var/lib/postgresql postgresql' or similar?
Yeah, almost exactly that, I think.
I didn't do any tuning beyond the mount options mentioned in my post, so no, Pg didn't get an 8K block size.
I'm not convinced on the 8K vs 4K block size though - some benchmarks
i have no opinion one way or the other yet. haven't done the testing myself. just reporting what i'd read.
4K block-sizes (i.e. ashift=12 rather than default ashift=9 when you create the zpool), on the other hand, are really important. especially if you have some "Advanced Format" 4K sector drives, or there's a chance that you'll be adding some to the pool in the future. which is pretty much a certainty with most 2TB and (i think) all 3TB drives being 4K. I've even got a few 1TB drives that are 4K sectors (2 of them in my zpool).
Yep, I'm aware of the 4K Sector stuff, and do configure it correctly when needed. (Although Ubuntu has auto-detected and handled it correctly itself for a while, I've noticed.) The disks in question here were 1TB disks with standard 512-byte sectors though.
I noticed that.. I'd like to experiment with those again some time, but as it stands, just having three fast drives striped seems to work pretty well anyway.
just striped...scary for anything but "I really don't care if i lose it all" data.
zfs is good, but it's not magic. it can't recover data when a disk dies if there isn't another copy on the other disks in the pool.
for safety, stick another drive in and use raidz. it's like raid-5 but without (most of) the raid-5 write performance problems.
Oh, don't worry - I know what I'm doing! This was a purpose-built setup for benchmarking these filesystems - basically a stock install of Ubuntu Server - so it's trivial to recreate. Anything important gets stored on RAID 5, 6 or 10. In the case of this benchmark, I wanted to have something more closely approximating real-world DB servers, which usually have several spindles in a RAID-10 configuration.
Looking at disk stats as I ran the benchmarks, I noticed that the one that managed to generate the most disk I/O (in MB/sec) was ext4-nobarrier, followed by ZFS. Everything else was quite a way behind. ZFS appeared to be using a lot more CPU than ext4 though.. not sure what to make of that.. I guess the extra complexity in the FS has to cause something!
compression enabled?
Nope. I suspect Chris had it right, about all the checksumming being done. (Although btrfs didn't use that much juice... but it also didn't achieve much throughput, so perhaps its CPU usage would have risen if it had.)
ZFS does use a fair amount of CPU power. it does a lot more than most filesystems. also, given that it's designed for "Enterprise" environments, its developers have been able to make stronger assumptions about hardware capabilities and performance than more consumer-oriented dev teams can get away with.
Agreed, although, wasn't OSX planning to use ZFS at some point? What happened with that?

On Thu, 15 Sep 2011 11:14:13 AM Toby Corkindale wrote:
Agreed, although, wasn't OSX planning to use ZFS at some point? What happened with that?
Apparently silently dumped by Apple in October 2009. It was a year into the NetApp/Sun lawsuit over ZFS and WAFL, so perhaps they wanted to reduce their risk.

cheers,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP
participants (5):
- Arjen Lentz
- Chris Samuel
- Craig Sanders
- Russell Coker
- Toby Corkindale