On Thu, 21 Nov 2013, Toby Corkindale wrote:
> On 20 November 2013 15:16, Tim Connors <tconnors@rather.puzzling.org> wrote:
>> When we have these overloads, nothing else we measure seems to be
>> approaching any limit. The servers have plenty of CPU left, and there's
>> no real difficulty logging into them. Anything else I should be looking
>> at? Fork rate is tiny (1 or 2 per second). Network bandwidth is fine.
>> Not sure that I've noticed network packet limitations (4k packets per
>> second per host when it failed last time, generating 16,000
>> interrupts/second total per host).
> What is going wrong in the "overload"?
Something hits a tipping point: the number of apache worker slots
(3000-6000 depending on hardware specs) rapidly fills up, then apache
stops accepting new connections and www.bom.gov.au goes dark (since
this happens on all machines in the load-balanced cluster
simultaneously).
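(In case it helps anyone searching the archives: assuming the worker
MPM, the slot count is just the arithmetic of ServerLimit x
ThreadsPerChild. A sketch with illustrative numbers -- not our actual
httpd.conf:

    <IfModule mpm_worker_module>
        ServerLimit          200
        ThreadsPerChild       25
        MaxClients          5000   # 200 x 25 worker slots
    </IfModule>

Once MaxClients is hit, new connections queue in the kernel listen
backlog until that overflows too, which is the "goes dark" above.)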
> Ah, you are probably already well aware of this, so stop me if so...
> In my experience, there's definitely an upper bound to the number of
> web-serving worker threads you can run on a machine, beyond which you
> start to see a drop in aggregate performance rather than a gain.
That I believe. But the reason I bring up context-switch limitations
is that we've still got 80% CPU free when we hit whatever limit we're
hitting. iowait is bugger all, and %system is small -- I'd have
thought %system accurately accounts for any context-switch overhead,
but maybe it misses a few CPU cycles between user space relinquishing
control and the kernel swapping its pointers in and starting to
account for resources against itself. So what else can it be? Notwork
is fine, interrupts appear fine (though again, I don't know whether
there's an invisible limitation there), and the NFS server and disk
array are trivially loaded.
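(For anyone wanting to dig at the same thing, per-process
context-switch rates are visible with pidstat, which ships in sysstat;
the PID below is illustrative:

    # one 5-second sample of per-process context-switch rates
    pidstat -w 5 1
    # voluntary vs involuntary switches for a single apache process
    grep ctxt_switches /proc/12345/status

Lots of voluntary switches point at threads sleeping on I/O or locks;
lots of involuntary ones point at scheduler pressure.)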
Here's sar -u output from last Saturday, when we filled 5000 slots on
some of the cluster nodes and started refusing connections (a scary
thought, because the load balancer then drops out the first offending
node, the others have to take up the slack, and they all drop out
too):
00:00:01        CPU     %user     %nice   %system   %iowait    %steal     %idle
06:20:01        all      8.25      0.00      9.44      0.25      0.00     82.05
sar -w:
00:00:01      cswch/s
06:10:01     80444.64
06:20:01    108955.60
06:30:01     55500.00
(I've seen 150,000 on these 16-core machines before they really
started struggling. The older 8-core nodes in the cluster achieve
pretty much half that.)
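(Back of the envelope, assuming something like 5 microseconds per
switch: 150,000 cswch/s x 5 us = 0.75 CPU-seconds of overhead per
second, under 5% of a 16-core box -- consistent with the small %system
above, so the switches themselves probably aren't eating the missing
CPU.)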
> Three to six thousand slots sounds like a lot for one machine, to
> me.* I wondered: why so many? Are you not running a reverse-proxy
> accelerator in front of Apache? (e.g. Varnish, or some
> configurations of nginx)
We're government. Let's throw resources at it (until we run out of
money) rather than think about it carefully. (Actually, that approach
has served us pretty well up til now. But someone made Wise Choices
last year, and then <rant elided>.)
> If you were just serving static content directly, I'd go with
> something lighter-weight than Apache; and if you're serving dynamic
> content (i.e. the php you mention) then I'd definitely not do so
> without a good reverse-proxy in front of it, and a much-reduced
> number of apache threads.
The php wasn't a big thing until a few months ago. It's obvious that
it's causing the problems, but chucking a cache in front of each node
will be impossible now that we're not allowed to buy any equipment, or
replacements for machines so old they're out of warranty. (Annoyingly,
the offending .php page could just as easily be a static page, but we
outsourced that to an "industry expert".) The httpd.conf configuration
is complex enough that it'll never be replaced with another httpd
server, particularly now that the only two people in the web group who
knew enough about it have retired.
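(For the archives, the sort of thing Toby means can live on the same
box, no new kit required: nginx on port 80 with apache shuffled to
another port. Every path, zone name and number below is illustrative,
not our config:

    # /etc/nginx/conf.d/cache.conf
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=pagecache:100m
                     inactive=10m max_size=1g;
    server {
        listen 80;
        location / {
            proxy_pass http://127.0.0.1:8080;  # apache moved here
            proxy_cache pagecache;
            proxy_cache_valid 200 60s;  # even a 60s TTL absorbs a
                                        # thundering herd on one hot .php
        }
    }

The point being that the cache takes the hammering for the one hot
page, and the apache thread count can come right down behind it.)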
> * But I'm a bit out of date; current-spec hardware is quite a bit
> more powerful than it was last time I was seriously working with
> high-thread-count code.
Heh. I spent all day in a hot aisle of one of our data centres tracing
unlabelled cables a few weeks ago. Some of these 64-core blades are
seriously powerful boxes. Fortunately it wasn't a day like today.
--
Tim Connors