
On 20 November 2013 15:16, Tim Connors <tconnors@rather.puzzling.org> wrote:
When we have these overloads, nothing else we measure seems to be approaching any limit. The servers have plenty of CPU left, and there's no real difficulty logging into them. Anything else I should be looking at? Fork rate is tiny (1 or 2 per second). Network bandwidth is fine. Not sure that I've noticed network packet limitations (4k packets per second per host when it failed last time, generating 16000 interrupts/second total per host).
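A rough way to put per-second numbers on those counters over time is to sample /proc/stat on each host. The sketch below is illustrative only (the 5-second interval is arbitrary); it prints system-wide context-switch, interrupt and fork rates:

#!/usr/bin/env python3
# Rough sketch only: sample the monotonic counters in /proc/stat (Linux)
# and print system-wide context-switch, interrupt and fork rates per second.
import time

FIELDS = {"ctxt": "ctxsw/s", "intr": "intr/s", "processes": "forks/s"}

def read_counters():
    counters = {}
    with open("/proc/stat") as f:
        for line in f:
            parts = line.split()
            if parts[0] in FIELDS:
                # For "intr" the first number is the total across all IRQs.
                counters[parts[0]] = int(parts[1])
    return counters

def main(interval=5):
    prev = read_counters()
    while True:
        time.sleep(interval)
        cur = read_counters()
        rates = ["%s=%.0f" % (FIELDS[k], (cur[k] - prev[k]) / interval)
                 for k in sorted(FIELDS)]
        print("  ".join(rates))
        prev = cur

if __name__ == "__main__":
    main()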
What is going wrong in the "overload"?
Something hits a tipping point: the number of apache worker slots (3000-6000 depending on hardware specs) rapidly fills up, then apache stops accepting new connections and www.bom.gov.au goes dark (since this happens on all machines in the load-balanced cluster simultaneously).
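One rough way to see that tipping point coming, rather than just the aftermath, is to poll mod_status's machine-readable output and alert as BusyWorkers approaches the configured slot limit. A minimal sketch, assuming mod_status is enabled at the default /server-status URL; the host, limit and threshold below are placeholders:

#!/usr/bin/env python3
# Minimal sketch: poll Apache mod_status (?auto output) and warn as the
# busy-worker count approaches the configured slot limit.
# STATUS_URL and SLOT_LIMIT are placeholders, not real values.
import time
import urllib.request

STATUS_URL = "http://localhost/server-status?auto"  # assumes mod_status is enabled here
SLOT_LIMIT = 3000                                   # e.g. MaxClients / MaxRequestWorkers

def busy_workers():
    with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("BusyWorkers:"):
                return int(line.split(":", 1)[1])
    return None

while True:
    busy = busy_workers()
    if busy is not None and busy > 0.8 * SLOT_LIMIT:
        print("WARNING: %d of %d worker slots busy" % (busy, SLOT_LIMIT))
    time.sleep(10)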
Ah, you are probably already well aware of this, so stop me if so.

In my experience there's definitely an upper bound to the number of web-serving worker threads you can run on a machine, beyond which aggregate performance starts to drop rather than gain. Three to six thousand slots sounds like a lot for one machine, to me.* Why so many? Are you not running a reverse-proxy accelerator in front of Apache (e.g. Varnish, or some configurations of nginx)?

If you were serving only static content, I'd go with something lighter-weight than Apache; and if you're serving dynamic content (i.e. the PHP you mention), then I'd definitely not do so without a good reverse proxy in front of it and a much-reduced number of Apache threads. There's a rough sketch of what I mean below.

Sorry this doesn't really help with the context-switching question, but maybe it helps with the overall issue.

-Toby

* But I'm a bit out of date; current-spec hardware is quite a bit more powerful than it was the last time I was seriously working with high-thread-count code.
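For concreteness, here is a minimal sketch of the kind of front-end being suggested: nginx listening on port 80, caching what it can and passing the rest to Apache moved to a local high port with a much smaller worker pool. The paths, ports and cache times are placeholders rather than tuned values, and the fragment sits inside nginx's http{} block:

# Minimal sketch only: nginx as a caching front-end on :80, Apache moved
# to 127.0.0.1:8080 with far fewer workers.  All values are placeholders.
proxy_cache_path /var/cache/nginx keys_zone=pagecache:100m inactive=10m;

upstream apache_backend {
    server 127.0.0.1:8080;
    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://apache_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_cache pagecache;
        proxy_cache_valid 200 301 1m;
    }
}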