
I've got a lot of servers with the root filesystem (including /boot) on a RAID-1 array. For years I've had systems configured such that if one drive fails then the other can boot the system (although in the case of IDE, if drive 0 fails then the other drive has to be moved to the other cable before booting).

http://etbe.coker.com.au/2012/04/17/zfs-btrfs-cheap-servers/

Now I'm considering the case of a reasonable size ZFS server. Would it work to have an 8-disk mirror for the root filesystem? I'll probably put /var on ZFS so the root filesystem won't need to be particularly big. In terms of wasting space, a 10G root filesystem mirrored across all disks isn't a big deal when you have 3TB disks. It seems that the ability to boot from any disk and having a symmetric layout provide more benefit than saving maybe 60G of raw disk capacity.

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
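A minimal sketch of the layout being proposed, assuming a small (roughly 10G) first partition on each of the eight disks; all device names below are illustrative, not taken from the post:

    # hypothetical 8-way RAID-1 for the root filesystem
    mdadm --create /dev/md0 --level=1 --raid-devices=8 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 \
        /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
    mkfs.ext3 /dev/md0
    # install the boot loader on every disk so any one of them can boot the box
    for d in a b c d e f g h; do grub-install /dev/sd$d; done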

On Wed, 18 Apr 2012, Russell Coker wrote:
Now I'm considering the case of a reasonable size ZFS server. Would it work to have an 8-disk mirror for the root filesystem?
I don't know. But I use multiple copies to keep older versions in case an update causes trouble. Just an idea.

Peter

On 18/04/12 15:01, Russell Coker wrote:
Now I'm considering the case of a reasonable size ZFS server. Would it work to have an 8-disk mirror for the root filesystem? I'll probably put /var on ZFS so the root filesystem won't need to be particularly big. In terms of wasting space a 10G root filesystem mirrored across all disks isn't a big deal when you have 3TB disks. It seems that the ability to boot from any disk and having a symmetric layout provide more benefit than saving maybe 60G of raw disk capacity.
I've done it with 4 disks using GPT under FreeBSD 9, works like a charm. Booting takes a little while extra though, but nothing too serious.

Regards,
--
 .''`.   Philipp Huebner <debalance@debian.org>
: :'  :  pgp fp: 6719 25C5 B8CD E74A 5225 3DF9 E5CA 8C49 25E4 205F
`. `'`   HP: http://www.debalance.de, Skype: philipp-huebner
  `-     ICQ: 235-524-440, Jabber: der_schakal@jabber.org
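For reference, a rough sketch of the per-disk GPT layout commonly used for root-on-ZFS under FreeBSD 9; the device names and labels are assumptions, not Philipp's actual setup:

    # repeat for each of the four disks (ada0..ada3), adjusting the label
    gpart create -s gpt ada0
    gpart add -t freebsd-boot -s 128k ada0
    gpart add -t freebsd-zfs -l disk0 ada0
    gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
    # then build a 4-way mirrored root pool from the labelled partitions
    zpool create rpool mirror gpt/disk0 gpt/disk1 gpt/disk2 gpt/disk3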

On Wed, Apr 18, 2012 at 01:01:14PM +1000, Russell Coker wrote:
Now I'm considering the case of a reasonable size ZFS server. Would it work to have an 8-disk mirror for the root filesystem?
i don't boot off ZFS(*), but IIRC from what I've read on the zfsonlinux list and github issues it:

a) depends on your version of grub. It has to support zfs.

b) can't boot from a compressed filesystem. if you enable compression (a good idea in most cases, IMO), then that means you have to create a separate /boot fs and turn off compression. something like:

    zfs create rpool/boot
    zfs set compression=off rpool/boot

BTW, the zfsonlinux web site, mailing list and github issue tracker are the best places to look for answers to questions like this.

(*) I boot off an SSD, with about 80GB for / (xfs), as well as 4GB for ZIL and about 20GB for zfs cache. I used to boot from RAID-1 disks. I may one day end up doing the same with SSDs, or i may switch to booting off ZFS.

craig

--
craig sanders <cas@taz.net.au>
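As an aside on the footnote above, attaching separate SSD partitions to a pool as ZIL (log) and L2ARC (cache) devices looks roughly like this; the pool name and partition paths are invented for illustration:

    # assumed: one ~4GB partition for the ZIL, one ~20GB partition for the cache
    zpool add tank log /dev/disk/by-id/ata-SomeSSD-part3
    zpool add tank cache /dev/disk/by-id/ata-SomeSSD-part4
    # enabling compression on a dataset (but leave it off on the boot fs, as above)
    zfs set compression=on tank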

On Wed, 18 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
On Wed, Apr 18, 2012 at 01:01:14PM +1000, Russell Coker wrote:
Now I'm considering the case of a reasonable size ZFS server. Would it work to have an 8-disk mirror for the root filesystem?
i don't boot off ZFS(*), but IIRC from what I've read on the zfsonlinux list and github issues it:
Sorry, my previous message wasn't clear. I will use ZFS for the main data storage but Ext3 for booting in a separate mdadm device. Booting from ZFS sounds like too much effort for no real benefit.

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

On Wed, Apr 18, 2012 at 04:55:01PM +1000, Russell Coker wrote:
Sorry, my previous message wasn't clear.
I will use ZFS for the main data storage but Ext3 for booting in a separate mdadm device.
OK, you mean having a small partition on each disk in the zpool, with mdadm raid-1 for the rootfs?

sure, that'll work. mdadm has no problem with multiple extra devices in a RAID-1, and reads should be amazingly fast.

however, zfs works much better if you give it entire disks rather than partitions. in particular, it can disable write barriers and get much better performance. IIRC it can't do that with partitions because it doesn't know what else may be writing to the disk.

making zpools from partitions does work, but is very strongly discouraged.
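A sketch of the two approaches being contrasted; the pool names, mirror layout and device names are assumptions for illustration only:

    # whole-disk vdevs, as recommended:
    zpool create tank mirror /dev/disk/by-id/ata-disk0 /dev/disk/by-id/ata-disk1
    # partition-based vdevs also work, but are strongly discouraged:
    zpool create tank2 mirror /dev/sda2 /dev/sdb2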
Booting from ZFS sounds like too much effort for no real benefit.
for an existing system, yeah. for a new one, probably not that much hassle. one method that should work might be:

do the initial install on a spare disk (because only debian's kfreebsd supports installation on zfs right now), reboot, create your zpool(s), rsync rootfs to zfs, chroot, reconfigure and re-install grub, make sure that zfs-initramfs package is installed and run update-initramfs. rebooting again without the spare disk should give you a working root on zfs.

if it doesn't work first time, plug the spare disk back and fix :)

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #63: not properly grounded, please bury computer
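A rough command-level outline of the method described above, with pool and dataset names invented for illustration; details such as setting mountpoints and the pool's bootfs property are omitted:

    # on the system installed to the spare disk, after rebooting:
    zpool create rpool mirror /dev/disk/by-id/ata-disk0 /dev/disk/by-id/ata-disk1
    zfs create rpool/ROOT
    rsync -aHAX --one-file-system / /rpool/ROOT/
    mount --bind /dev  /rpool/ROOT/dev
    mount --bind /proc /rpool/ROOT/proc
    mount --bind /sys  /rpool/ROOT/sys
    chroot /rpool/ROOT /bin/bash
    # inside the chroot:
    apt-get install zfs-initramfs
    grub-install /dev/disk/by-id/ata-disk0
    update-initramfs -u
    update-grub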

On Wed, 18 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
On Wed, Apr 18, 2012 at 04:55:01PM +1000, Russell Coker wrote:
Sorry, my previous message wasn't clear.
I will use ZFS for the main data storage but Ext3 for booting in a separate mdadm device.
OK, you mean having a small partition on each disk in the zpool, with mdadm raid-1 for the rootfs?
Yes.
sure, that'll work. mdadm has no problem with multiple extra devices in a RAID-1. and reads should be amazingly fast.
It works for 2 disks; I've never had a reason to try with more. I wouldn't expect reads to be fast. Last time I tested such things I had great difficulty in demonstrating any read benefit from Linux Software RAID-1.
however, zfs works much better if you give it entire disks rather than partitions. in particular, it can disable write barriers and get much better performance. IIRC it can't do that with partitions because it doesn't know what else may be writing to the disk.
Hmm, so it relies entirely on its journalling to get consistent data? So what happens when an application calls fsync() or fdatasync()? How does ZFS know that the data is on disk? I'll have to do some benchmarks of these things.
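A crude way to compare synchronous-write behaviour, just as a starting point; the paths and sizes are arbitrary:

    # one O_DSYNC write per 4k block:
    dd if=/dev/zero of=/tank/testfile bs=4k count=10000 oflag=dsync
    # versus a single fsync() at the end:
    dd if=/dev/zero of=/tank/testfile bs=4k count=10000 conv=fsync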
making zpools from partitions does work, but is very strongly discouraged.
It sounds like ZFS will be more difficult than I thought. Do you have a reference for a good ZFS sysadmin guide?
do the initial install on a spare disk (because only debian's kfreebsd supports installation on zfs right now), reboot, create your zpool(s), rsync rootfs to zfs, chroot, reconfigure and re-install grub, make sure that zfs-initramfs package is installed and run update-initramfs. rebooting again without the spare disk should give you a working root on zfs.
if it doesn't work first time, plug the spare disk back and fix :)
That sounds horrible, like a 1990s Linux install!

With a bit of luck the server I'll get will have an internal USB port, then I can use a USB boot device. This is easy to set up and it's also easy to have a spare device just in case of corruption. So far I've only seen one USB device fail properly in production: it stopped accepting WRITE requests but could still be read. Apart from panicking the kernel when a rw mount was attempted it seemed to work.

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

On Wed, Apr 18, 2012 at 08:02:22PM +1000, Russell Coker wrote:
sure, that'll work. mdadm has no problem with multiple extra devices in a RAID-1. and reads should be amazingly fast.
It works for 2 disks, I've never had a reason to try with more.
I've tried it with three. it worked.
I wouldn't expect reads to be fast. Last time I tested such things I had great difficulty in demonstrating any read benefit for Linux Software RAID-1.
i didn't do any timing tests on three drives, so i'll take your word for it.
however, zfs works much better if you give it entire disks rather than partitions. [...]
Hmm, so it relies entirely on it's journalling to get consistent data? So what happens with an application calls fsync() or fdatasync()? How does ZFS know that the data is on disk?
I'll have to do some benchmarks of these things.
can't remember the details right now, and i'm kind of keen to go get some dinner :) try the zfsonlinux site: http://zfsonlinux.org/ see especially the ZoL github linked from there and the github issues.
making zpools from partitions does work, but is very strongly discouraged.
It sounds like ZFS will be more difficult than I thought. Do you have a reference for a good ZFS sysadmin guide?
i wouldn't say "more difficult", just some different factors to take into account when planning the server build.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
http://www.solarisinternals.com/wiki/index.php/ZFS_for_Databases

focused on solaris, but most of it is still relevant to zfsonlinux (and to zfs on freebsd too).
That sounds horrible, like a 1990's Linux install!
yeah, well, debian's installer doesn't support zfs yet - debian gnu/linux's doesn't, debian gnu/kfreebsd's does. if you want root on zfs you have to do some stuffing around. six months ago I would have said don't bother, but from what i've read on the zfsonlinux sites it sounds like it's pretty much sorted out now.
With a bit of luck the server I'll get will have an internal USB port. Then I can use a USB boot device. This is easy to setup and it's also easy to have a spare device just in case of corruption.
that would work too. some motherboards have internal USB ports for plugging USB drives inside the case. you can also buy jumper-block to USB port adapters off ebay for not much money.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #417: Computer room being moved. Our systems are down for the weekend.

On Wed, 18 Apr 2012, Craig Sanders wrote:
however, zfs works much better if you give it entire disks rather than partitions. in particular, it can disable write barriers and get much better performance. IIRC it can't do that with partitions because it doesn't know what else may be writing to the disk.
making zpools from partitions does work, but is very strongly discouraged.
There is the problem that swap over ZFS may not work (when under stress). The reason is similar to the Linux swap-over-NFS problem:

http://kerneltrap.org/Linux/Swap_Over_NFS

"The problem with swap over network is the generic swap problem: needing memory to free memory."

The same applies to swap over ZFS because the ARC is not integrated into the virtual memory management.

If it is a "storage only" server it shouldn't be a problem: you just limit the ARC to use most of the RAM (leave enough for the NFS/Samba/AFS servers) and the ARC should handle it safely without adding additional stress to the "rest" (I believe).

Overall, don't be stingy with RAM and you will be much happier. Otherwise it may be a good idea to use an SSD for root and swap (if you have swap on it you may as well put root on it too, just to save you the hassle).

Regards
Peter
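For what it's worth, capping the ARC is a one-line setting; the 8 GiB value below is only an example, not a recommendation:

    # zfsonlinux: module parameter, in bytes (takes effect when the module loads)
    echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
    # FreeBSD equivalent: vfs.zfs.arc_max="8G" in /boot/loader.conf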

On Wed, Apr 18, 2012 at 06:14:19PM +1000, Trent W. Buck wrote:
Craig Sanders wrote:
(*) I boot off an SSD, with about 80GB for / (xfs), as well as 4GB for ZIL and about 20GB for zfs cache.
Do you RAID1 or otherwise back up the SSD?
not at the moment. I haven't decided yet whether it's worth the expense or whether i should just convert to booting from ZFS (since i'm planning on doing that eventually anyway). it's not high on my priority list, either way. I have a way of recovering in case of disaster.
If not, what's your recovery plan for when the SSD shits itself completely and without warning?
buy a new SSD, pxeboot to a restore image (possibly clonezilla with the zfs dkms modules and utils added) and restore / and /boot from backup.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #28: CPU radiator broken
participants (5): Craig Sanders, Peter Ross, Philipp Huebner, Russell Coker, Trent W. Buck