
I originally sent this directly to James; resending it to the list.
On Fri, Apr 12, 2013 at 3:17 PM, James Harper <james.harper@bendigoit.com.au> wrote:
Online resize/reconfigure
both btrfs and zfs offer this.
Can it seamlessly continue over a reboot? Obviously it can't progress while the system is rebooting the way a hardware RAID can, but I'd hope it could pick up where it left off automatically.
Yes it does.
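To make the online resize point concrete, here is a rough sketch of growing a ZFS pool and a btrfs filesystem while they stay mounted; the pool name 'tank', the mountpoint '/data' and the device names are made-up examples, not anything from this thread.

    #!/usr/bin/env python
    # Rough sketch: online grow/reshape while the filesystems stay mounted.
    # Pool name 'tank', mountpoint '/data' and device names are invented examples.
    import subprocess

    def run(*cmd):
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    # ZFS: after replacing a mirror member with a larger disk, expand into the new space.
    run("zpool", "online", "-e", "tank", "sdb")

    # ZFS: or grow the pool by adding another mirrored vdev.
    run("zpool", "add", "tank", "mirror", "sdc", "sdd")

    # btrfs: add a device and rebalance, all while mounted.
    run("btrfs", "device", "add", "/dev/sde", "/data")
    run("btrfs", "balance", "start", "/data")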
This is where a lot of people get this wrong. Once the BIOS has succeeded in reading the bootsector from a boot disk it's committed. If the bootsector reads okay (even after a long time on a failing disk) but anything between the bootsector and the OS fails, your boot has failed. This 'anything between' includes the grub bootstrap, xen hypervisor, linux kernel, and initramfs, so it's a substantial amount of data to read from a disk that may be on its last legs. A good hardware RAID will have long since failed the disk by this point and booting will succeed.
My last remaining reservation about going ahead with some testing: is there an equivalent of clvm for ZFS, or is that even the right approach for ZFS? My main server cluster is:
- 2 machines, each with 2 x 2TB disks, running DRBD between them, with the primary exporting the whole disk as an iSCSI volume
- 2 machines, each importing that iSCSI volume, running LVM (clvm) on top, and using the LVs as backing stores for Xen VMs
How would this best be done using zfs?
If I were building new infrastructure today with 2 or more machines hosting VMs, I would probably look at using CEPH as the storage layer for the virtual machines. This would provide distributed, mirrored storage accessible from all machines; all machines could then be both storage and VM hosts.
ref: http://www.slideshare.net/xen_com_mgr/block-storage-for-vms-with-ceph http://ceph.com/
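To give a feel for what "CEPH as the storage layer for the VMs" means in practice, here is a minimal sketch (not from the original message) of creating an RBD image to use as a guest's disk. It assumes a working cluster, /etc/ceph/ceph.conf, the python-rados/python-rbd bindings and the default 'rbd' pool; the image name is invented.

    #!/usr/bin/env python
    # Minimal sketch: create an RBD (RADOS block device) image for a VM disk.
    # Assumes a reachable Ceph cluster, /etc/ceph/ceph.conf and the python-rados /
    # python-rbd bindings; the pool and image names are example values.
    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("rbd")                          # default RBD pool
        try:
            rbd.RBD().create(ioctx, "vm01-disk0", 20 * 1024 ** 3)  # 20 GiB image
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

Every host in the cluster can then attach the same image (qemu via librbd, or the rbd kernel driver exposing it as an ordinary block device), which is what makes the storage accessible from all machines.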

On Fri, Apr 12, 2013 at 03:31:20PM +1000, Kevin wrote:
On Fri, Apr 12, 2013 at 3:17 PM, James Harper <james.harper@bendigoit.com.au> wrote:
This is where a lot of people get this wrong. Once the BIOS has succeeded in reading the bootsector from a boot disk it's committed. If the bootsector reads okay (even after a long time on a failing disk) but anything between the bootsector and the OS fails, your boot has failed. This 'anything between' includes the grub bootstrap, xen hypervisor, linux kernel, and initramfs, so it's a substantial amount of data to read from a disk that may be on its last legs. A good hardware RAID will have long since failed the disk by this point and booting will succeed.
i think we're talking about different things here. if you can tell the BIOS "don't boot from sda, boot from sdb instead", then it really doesn't matter how messed up sda is - the system's not going to use it, it's going to boot from sdb like you told it to.
My last remaining reservation about going ahead with some testing: is there an equivalent of clvm for ZFS, or is that even the right approach for ZFS? My main server cluster is:
- 2 machines, each with 2 x 2TB disks, running DRBD between them, with the primary exporting the whole disk as an iSCSI volume
- 2 machines, each importing that iSCSI volume, running LVM (clvm) on top, and using the LVs as backing stores for Xen VMs
interesting. that's kind of the opposite of how google's ganeti works, where each node exports small LVs which are combined with DRBD on the host that actually runs the VM to provide a "disk" for a particular VM. ganeti is scalable to multiple machines (up to 40 according to the docs) as a VM's DRBD volume can be constructed from LVs exported by any two machines, but this sounds like it's limited to two and only two machines as the storage servers. (in theory you could make ganeti work with ZFS, ZVOLs and iscsi but i don't think anyone's actually done it)
How would this best be done using zfs?
short answer: zfs doesn't do that. in theory you could export each disk individually with iscsi and build ZFS pools (two mirrored pools). if that actually worked, you'd have to do a lot of manual stuffing around to make sure that the pools were only in use on one machine at a time, and more drudgery to handle fail-over events. seems like a fragile PITA and not worth the bother, even if it could be made to work. i can think of a few other ugly kludgy ways you could emulate something like clvm (like iscsi export ZVOLs from each server and combine with drbd) but they would just be shoe-horning the wrong technology into a particular model. better to look around for other alternatives actually designed to do the job.
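Just to make that kludge concrete (and not to recommend it), here is a rough sketch of the "export ZVOLs over iSCSI" idea, assuming ZFS-on-Linux and the tgt iSCSI target daemon; the pool, volume and IQN names are invented.

    #!/usr/bin/env python
    # Sketch of the "ZVOL over iSCSI" kludge mentioned above -- shown only to make
    # the idea concrete, not as a recommendation. Assumes ZFS-on-Linux and the tgt
    # iSCSI target daemon; pool/volume/IQN names are invented.
    import subprocess

    def run(*cmd):
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    # Carve a 50G block device (ZVOL) out of the pool.
    run("zfs", "create", "-V", "50G", "tank/xen-vm01")

    # Export it as an iSCSI LUN; the other node would import it and layer DRBD or
    # clvm on top -- which is exactly the fragile, manual part.
    run("tgtadm", "--lld", "iscsi", "--op", "new", "--mode", "target",
        "--tid", "1", "--targetname", "iqn.2013-04.au.example:tank.xen-vm01")
    run("tgtadm", "--lld", "iscsi", "--op", "new", "--mode", "logicalunit",
        "--tid", "1", "--lun", "1", "--backing-store", "/dev/zvol/tank/xen-vm01")
    run("tgtadm", "--lld", "iscsi", "--op", "bind", "--mode", "target",
        "--tid", "1", "--initiator-address", "ALL")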
If I were building new infrastructure today with 2 or more machines hosting VMs, I would probably look at using CEPH as the storage layer for the virtual machines. This would provide distributed, mirrored storage accessible from all machines; all machines could then be both storage and VM hosts.
ref: http://www.slideshare.net/xen_com_mgr/block-storage-for-vms-with-ceph http://ceph.com/
i'd agree with this - CEPH is cool.

in fact, i'd also be inclined to use CEPH as the object store with Openstack instead of Swift - ceph's object store does everything that swift does and also offers the distributed block storage layer on top of that - thus avoiding the need for QCOW2 over NFS (yuk!) or a dedicated netapp server or similar for shared VM images.

(apparently ceph's distributed filesystem layer isn't ready for production use yet but the object store and block storage are)

craig

--
craig sanders <cas@taz.net.au>

On Fri, Apr 12, 2013 at 4:14 PM, Craig Sanders <cas@taz.net.au> wrote:
i'd agree with this - CEPH is cool.
in fact, i'd also be inclined to use CEPH as the object store with Openstack instead of Swift - ceph's object store does everything that swift does and also offers the distributed block storage layer on top of that - thus avoiding the need for QCOW2 over NFS (yuk!) or a dedicated netapp server or similar for shared VM images.
Actually, Swift can do one thing Ceph doesn't: Swift has really awesome geographic replication built in.
http://swiftstack.com/blog/2012/09/16/globally-distributed-openstack-swift-cluster/
http://www.mirantis.com/blog/configuring-multi-region-cluster-openstack-swift/
Once that gets added to Ceph there is very little use for anything but Ceph. (Ceph can also act as a back-end store for Hadoop.)
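For a bit of context on the object-store comparison, here is a tiny sketch (not something from the thread) of talking to Ceph's object store directly through the python-rados bindings; it assumes /etc/ceph/ceph.conf and an already-existing pool named 'data'.

    #!/usr/bin/env python
    # Tiny sketch: store and fetch an object in Ceph's native object store (RADOS),
    # the layer the Swift comparison is about. Assumes /etc/ceph/ceph.conf, the
    # python-rados bindings, and an existing pool named 'data'.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("data")
        try:
            ioctx.write_full("hello.txt", b"stored as a RADOS object")
            print(ioctx.read("hello.txt"))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

(The Swift- and S3-compatible HTTP front end is the separate radosgw daemon, which sits on top of this same object store.)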

On Fri, Apr 12, 2013 at 4:14 PM, Craig Sanders <cas@taz.net.au> wrote:
i'd agree with this - CEPH is cool.
in fact, i'd also be inclined to use CEPH as the object store with Openstack instead of Swift - ceph's object store does everything that swift does and also offers the distributed block storage layer on top of that - thus avoiding the need for QCOW2 over NFS (yuk!) or a dedicated netapp server or similar for shared VM images.
Actually, Swift can do one thing Ceph doesn't: Swift has really awesome geographic replication built in.
http://swiftstack.com/blog/2012/09/16/globally-distributed-openstack-swift-cluster/
http://www.mirantis.com/blog/configuring-multi-region-cluster-openstack-swift/
Once that gets added to Ceph there is very little use for anything but Ceph. (Ceph can also act as a back-end store for Hadoop.)
Hmmm... my list of new things to learn is growing rapidly!

James

On Fri, Apr 12, 2013 at 03:31:20PM +1000, Kevin wrote:
On Fri, Apr 12, 2013 at 3:17 PM, James Harper <james.harper@bendigoit.com.au> wrote:
This is where a lot of people get this wrong. Once the BIOS has succeeded in reading the bootsector from a boot disk it's committed. If the bootsector reads okay (even after a long time on a failing disk) but anything between the bootsector and the OS fails, your boot has failed. This 'anything between' includes the grub bootstrap, xen hypervisor, linux kernel, and initramfs, so it's a substantial amount of data to read from a disk that may be on its last legs. A good hardware RAID will have long since failed the disk by this point and booting will succeed.
i think we're talking about different things here. if you can tell the BIOS "don't boot from sda, boot from sdb instead", then it really doesn't matter how messed up sda is - the system's not going to use it, it's going to boot from sdb like you told it to.
My original argument in favour of hardware RAID was good BIOS boot support (implying that it still works seamlessly even in the case where /dev/sda is partly dead). You then contested that you could change the BIOS boot order manually, and also that the BIOS could try sda, then sdb, and so on.

Changing the BIOS boot order manually is a kludge that you don't have to perform with hardware RAID, and my rant above was addressing the reasons why having the BIOS try sda then sdb etc. isn't really solving the problem in some cases. If I'm using hardware RAID it's one less thing I have to worry about when doing a remote reboot. A good fakeraid implementation would also address this (and coreboot with Linux md support would too!).

James

On Fri, Apr 12, 2013 at 07:01:15AM +0000, James Harper wrote:
Changing the BIOS boot order manually is a kludge that you don't have to perform with hardware RAID, and my rant above was addressing the reasons why having the BIOS try sda then sdb etc isn't really solving the problem in some cases.
ah, okay. i get it now. i guess i just don't see changing the boot device as being a huge problem, as i'm likely to have to do lots more manual fixup stuff anyway in the case of a major hardware failure - a minor annoyance, really. in situations like that, i reserve my bitterness and loathing for the extreme crappiness of IPMI consoles that can only be accessed via some POS Java GUI.

craig

--
craig sanders <cas@taz.net.au>

James Harper writes:
My original argument in favour of hardware RAID was good BIOS boot support (implying that it still works seamlessly even in the case where /dev/sda is partly dead).
You then contested that you could change the BIOS boot order manually, and also that the BIOS could try sda, then sdb, and so on.
Changing the BIOS boot order manually is a kludge that you don't have to perform with hardware RAID, and my rant above was addressing the reasons why having the BIOS try sda then sdb etc isn't really solving the problem in some cases.
These issues were why I originally switched from grub to extlinux. While I can think of cases that would still break under extlinux, I haven't run into them. Mostly it was grub's device.map not matching after the BIOS shuffled (or didn't) the disk order to boot sdb. extlinux doesn't *have* a device.map, so it Just Works.
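Not part of the original message, but a rough sketch of what that extlinux setup looks like, assuming Debian-ish syslinux paths and a two-disk mirror; the paths and device names are assumptions to adjust for your distro.

    #!/usr/bin/env python
    # Rough sketch of the extlinux setup described above: install the loader once
    # into /boot and stamp the generic syslinux MBR onto *both* disks of the
    # mirror, so whichever disk the BIOS settles on can boot, with no device.map
    # involved. The mbr.bin path is a Debian-ish assumption; it varies by distro.
    import subprocess

    BOOT_DIR = "/boot/extlinux"            # holds extlinux.conf and the loader files
    MBR_BIN = "/usr/lib/syslinux/mbr.bin"  # distro-dependent location

    def run(*cmd):
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    run("extlinux", "--install", BOOT_DIR)

    for disk in ("/dev/sda", "/dev/sdb"):
        run("dd", "if=" + MBR_BIN, "of=" + disk, "bs=440", "count=1", "conv=notrunc")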

If I were building new infrastructure today with 2 or more machines hosting VMs, I would probably look at using CEPH as the storage layer for the virtual machines. This would provide distributed, mirrored storage accessible from all machines; all machines could then be both storage and VM hosts.
ref: http://www.slideshare.net/xen_com_mgr/block-storage-for-vms-with-ceph http://ceph.com/
CEPH does sound exciting. Is anyone here doing it with Xen?

James

James Harper <james.harper@bendigoit.com.au> wrote:
CEPH does sound exciting. Is anyone here doing it with Xen?
It was still under heavy development when last I read about it. The kernel module entered the mainline and, I assume, has undergone further work since then. File systems take a long time to mature. Testing and bug finding efforts always help, however.
participants (5):
- Craig Sanders
- James Harper
- Jason White
- Kevin
- trentbuck@gmail.com