
Russell Coker <russell@coker.com.au> writes:
One significant difference between a home network and a serious server network is that most of the functions of your home network don't matter much when you are asleep or away. Therefore a redundancy which involves you logging in as root and running a route command will work a lot better on a home network. Of course my experience is that having a sysadmin login as root and manually fail things over is better than any cluster software implementation I've seen, but that's another issue.
+1. This is the approach I have taken for SMEs as well -- I usually try to simplify things so that there is a (metaphorical or actual) throw switch: if it goes tits-up, the customer can flip it and hopefully things will be at least usable until a trained sysadmin can clean up the mess properly. Trying to teach the computer when to flip it tends to be nontrivial and tends to introduce new SPOFs.
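For the curious, a minimal sketch of what such a throw switch can look like, assuming a Linux box with the iproute2 `ip` tool and a spare gateway. All interface names and addresses here are made up for illustration; adjust for your own network. The DRY_RUN wrapper is there so the script can be rehearsed safely before anyone has to run it for real at 3am.

```shell
#!/bin/sh
# Hypothetical "throw switch": swap the default route between a primary
# and a backup gateway with a single command each way.
# Addresses below are example values, not a recommendation.

PRIMARY_GW=192.168.1.1      # hypothetical primary gateway
BACKUP_GW=192.168.1.254     # hypothetical backup gateway

# Run a command, or just print it when DRY_RUN=1 is set, so the
# procedure can be tested without touching the routing table.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Flip to the backup gateway.
failover() { run ip route replace default via "$BACKUP_GW"; }

# Flip back once a sysadmin has cleaned up.
restore()  { run ip route replace default via "$PRIMARY_GW"; }
```

The point is that the customer's entire runbook is "run `failover` when things go bad, run `restore` when told to" -- one command each way, nothing to decide.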
I *want* to believe things are handled better in the Enterprise! space, but my personal experience there is negligible and I'm a pessimist.
The problem is that there are so many modes of failure, and when you've accounted for them all, Murphy will invent new ones. Even RAID (especially software RAID) often requires user intervention when disks play up but don't outright die.

We recently had a customer server (an HP ML350 with HP hardware RAID) play up, running extremely slowly for roughly 2 minutes out of every 3. We suspected a bad disk, but _nothing_ in any logs or other measurable thing indicated which disk, and pulling the potentially wrong disk out of a RAID5 is not something to be considered lightly. In the end the onsite manager noted that disk 1 would flash strangely on occasion, and those occasions lined up exactly with the freezing. After confirming that the most recent backups were successful and making sure the customer understood the consequences, we pulled the oddly behaving disk, and the effect on performance was positive and instant. A new disk failed instantly in that slot, but worked in another slot. Replacing the backplane and reinserting the original disk brought the problem back, with no indication of actual failure. Only with a new disk and a new backplane were things good again. My best guess is that there was some sort of failure on the original disk such that inserting it corrupted the slot in the backplane until a reboot, but that's a guess, and having solved the problem I didn't much care anymore as the hardware is covered by a service contract.

The biggest problem you'll have with any sort of simple failover arrangement is flapping, where something has failed such that it works for 5 seconds then doesn't work for 5 seconds, for various values of "works" and "doesn't work". I've seen datacentres fall victim to this before: a redundant link doesn't help much if the primary link comes up and down every few seconds and the failover keeps swapping back to the original link just as it fails again. So even "Enterprise" can fall victim to such things.
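One common defence against flapping is a hold-down timer: fail over quickly, but only fail back after the primary has been continuously healthy for a sustained period. A rough sketch of that logic in shell, assuming a health check you supply yourself; the threshold and the ping-based check in the comment are illustrative values, not a recommendation:

```shell
#!/bin/sh
# Sketch of hold-down logic to damp flapping: only declare "failback"
# after HOLD_DOWN consecutive successful health checks, so a link that
# is up for 5 seconds and down for 5 seconds never triggers a failback.

HOLD_DOWN=30    # consecutive good checks required before failing back
good_count=0

# Process one health-check result: $1 is 1 if the primary answered
# this round, 0 if it did not. Sets ACTION to "failback" or
# "stay-on-backup" (a variable rather than echo, so the counter
# survives -- command substitution would run this in a subshell).
update_state() {
    if [ "$1" = "1" ]; then
        good_count=$((good_count + 1))
    else
        good_count=0            # any failure resets the hold-down timer
    fi
    if [ "$good_count" -ge "$HOLD_DOWN" ]; then
        ACTION=failback
    else
        ACTION=stay-on-backup
    fi
}

# A real watchdog loop might look something like:
#   while sleep 5; do
#       if ping -c1 -W1 "$PRIMARY_GW" >/dev/null 2>&1; then ok=1; else ok=0; fi
#       update_state "$ok"
#       [ "$ACTION" = "failback" ] && restore_primary_route
#   done
```

The asymmetry is deliberate: failover can be fast, but failback is slow and suspicious, which is exactly the property a flapping primary link defeats in naive implementations.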
Fortunately such problems are relatively rare, and STP/OSPF will be all that is required most of the time, with someone needing to pull the plug manually on rare occasions.

James