
On Thu, 21 Nov 2013, Toby Corkindale wrote:
On 20 November 2013 15:16, Tim Connors <tconnors@rather.puzzling.org> wrote:
When we have these overloads, nothing else we measure seems to be approaching any limit. The servers have plenty of CPU left, and there's no real difficulty logging into them. Anything else I should be looking at? Fork rate is tiny (1 or 2 per second). Network bandwidth is fine. Not sure that I've noticed network packet limitations (4k packets per second per host when it failed last time, generating 16000 interrupts/second total per host).
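For anyone wanting to watch the same counters, here's a rough sketch (assumes Linux, reads /proc/stat; the 1-second sample interval is arbitrary) that derives system-wide interrupts/sec and context switches/sec, the two figures quoted above:

```shell
# Sketch only: sample /proc/stat twice, one second apart, and print the
# per-second deltas for total interrupts ("intr") and context switches
# ("ctxt"). Counter names are the standard proc(5) fields.
rates() {
    s0=$(awk '/^intr/ {i=$2} /^ctxt/ {c=$2} END {print i, c}' /proc/stat)
    sleep 1
    s1=$(awk '/^intr/ {i=$2} /^ctxt/ {c=$2} END {print i, c}' /proc/stat)
    set -- $s0; i0=$1; c0=$2
    set -- $s1; i1=$1; c1=$2
    echo "interrupts/s: $((i1 - i0))  cswch/s: $((c1 - c0))"
}
rates
```

This gives roughly the same numbers as `sar -w` and the interrupt rate from `sar -I SUM`, just on demand rather than at sar's sampling interval.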
What is going wrong in the "overload"?
Something hits a tipping point, the number of apache worker slots (3000-6000 depending on hardware specs) rapidly fills up, then apache stops accepting new connections and www.bom.gov.au goes dark (since this happens on all machines in the load balanced cluster simultaneously).
Ah, you are probably already well aware of this, so stop me if so... In my experience, there's definitely an upper bound to the number of web-serving worker threads you can run on a machine, beyond which you start to see a drop in aggregate performance rather than a gain.
That I believe. But the reason I bring up context switch limitations is that we've still got 80% CPU free when we hit whatever limit we're hitting. iowait is bugger all; system time (i.e. what I would have thought would accurately account for any context switch overhead, but maybe it misses a few CPU cycles between user space relinquishing control and the kernel swapping its pointers in and starting to account for resources against itself?) is small; so what else can it be? (Notwork is fine, interrupts appear fine, but again I don't know whether there's an invisible limitation there; NFS server and disk array are trivially loaded.)

Here's a sar -u output from last Saturday, when we filled 5000 slots on some of the cluster nodes and started refusing connections (a scary thought, because the load balancer then drops out the first offending node, so the others have to take up the slack, and all drop out too):

00:00:01    CPU    %user   %nice   %system   %iowait   %steal    %idle
06:20:01    all     8.25    0.00      9.44      0.25     0.00    82.05

And sar -w:

00:00:01     cswch/s
06:10:01    80444.64
06:20:01   108955.60
06:30:01    55500.00

(I've seen 150000 on these 16 core machines before they really started struggling. The older 8 core nodes in the cluster achieve pretty much half that.)
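As an aside, the peak context-switch rate is easy to pull out of sar's output with awk; a small sketch, using the sample values quoted above (the two-column timestamp/cswch-per-second layout is the sysstat format shown):

```shell
# Sketch: find the peak cswch/s across a run of `sar -w` data lines.
# The sample values below are the ones from the Saturday incident.
peak=$(printf '%s\n' \
  '06:10:01 80444.64' \
  '06:20:01 108955.60' \
  '06:30:01 55500.00' |
  awk 'max < $2 {max = $2} END {printf "%.2f\n", max}')
echo "peak cswch/s: $peak"
```

In practice you'd feed it `sadf -d` output or `sar -w -f /var/log/sa/saNN` rather than hand-typed lines.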
Three to six thousand slots sounds like a lot for one machine, to me.* I wondered why so many? Are you not running a reverse-proxy accelerator in front of Apache? (eg. Varnish or some configurations of nginx)
We're government. Let's throw resources at it (until we run out of money) rather than think about it carefully (actually, it's served us pretty well up til now. But someone made Wise Choices last year, and then <rant elided>).
If you were just serving static content directly, I'd go with something lighter-weight than Apache; and if you're serving dynamic content (ie. the php you mention) then I'd definitely not do so without a good reverse-proxy in front of it, and a much-reduced number of apache threads.
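For what it's worth, the sort of front-end I mean is only a few lines; a minimal sketch, assuming nginx, with Apache moved to a backend port (all names, paths, and TTLs here are illustrative, not a recommendation for your config):

```
# Illustrative nginx reverse-proxy cache in front of Apache.
# Even a short cache TTL absorbs a stampede on an effectively-static page.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=edge:64m max_size=1g;

server {
    listen 80;

    location / {
        proxy_pass http://127.0.0.1:8080;    # Apache listening on a backend port
        proxy_cache edge;
        proxy_cache_valid 200 60s;           # hypothetical TTL
        proxy_cache_use_stale updating timeout;
    }
}
```

With that in place the Apache worker count can drop by an order of magnitude, since the cache answers the repeat requests.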
The php wasn't a big thing until a few months ago. It's obvious that it's causing the problems, but chucking a cache in front of each node will be impossible now that we're not allowed to buy any equipment, or even replacements for so-old-they're-out-of-warranty machines (annoyingly, the offending .php page could just as easily be a static page, but we outsourced that to an "industry expert"). The httpd.conf configuration is complex enough that it'll never be replaced with another httpd server, particularly now that the only two people in the web group who knew enough about it have retired.
* But I'm a bit out of date; current spec hardware is quite a bit more powerful than it was last time I was seriously working with high-thread-count code.
Heh. I spent all day in a hot aisle of one of our data centres tracing unlabelled cables a few weeks ago. Some of these 64 core blades are seriously powerful boxes. Fortunately it wasn't a day like today. -- Tim Connors