
Hi all,

On your most overloaded server (cpu/fork rate/context switches - ignoring memory, network, disk, swap etc), what is the maximum number of context switches per second per core that you measure (ie divide sar -w output by 16 if you have a 16 core box)?

Does anyone know what the maximum number of context switches per core you can expect on xeon level hardware?

I'm trying to claim we get overloaded when we reach a little less than 10,000 cswch/s per core, but we've lost all the historical data.

-- 
Tim Connors

On Wed, 20 Nov 2013, Tim Connors wrote:
Hi all,
On your most overloaded server (cpu/fork rate/context switches - ignoring memory, network, disk, swap etc), what is the maximum number of context switches per second per core that you measure (ie divide sar -w output by 16 if you have a 16 core box)?
Does anyone know what the maximum number of context switches per core you can expect on xeon level hardware?
I'm trying to claim we get overloaded when we reach a little less than 10,000 cswch/s per core, but we've lost all the historical data.
Indeed, is there going to be a maximum for a given piece of hardware (eg, maximum number of interrupts that can be generated per second; time spent in the interrupt handler that all has to be handled by only one CPU, which would explain why CPU system usage never looks alarming (divide by 8 on some servers, by 16 on others); a big kernel lock somewhere in the context switch code)?

When we have these overloads, nothing else we measure seems to be approaching any limit. The servers have plenty of CPU left, and there's no real difficulty logging into them. Anything else I should be looking at? Fork rate is tiny (1 or 2 per second). Network bandwidth is fine. Not sure that I've noticed network packet limitations (4k packets per second per host when it failed last time, generating 16000 interrupts/second total per host).

-- 
Tim Connors

On Wed, 20 Nov 2013, Tim Connors <tconnors@rather.puzzling.org> wrote:
On your most overloaded server (cpu/fork rate/context switches - ignoring memory, network, disk, swap etc), what is the maximum number of context switches per second per core that you measure (ie divide sar -w output by 16 if you have a 16 core box)?
Of all the systems I can measure in that regard (ones with sar installed and running) the highest is an average of 3428.79 for a 10 minute period. That's a workstation running KDE and Chromium which probably had some cron jobs running at the time (but nothing particularly intensive, maybe an iView download). The system in question has 4 cores.

Of my servers, the ones that have any significant load are running Xen. That means that they have several copies of sar for different DomUs (which is difficult to collate at best) and also means that there may be context switches between DomUs that don't show up in sar output and which I don't know how to measure (if you know then let me know and I'll get the output). I really doubt that 850 context switches per second per core is any sort of hardware limit.
Does anyone know what the maximum number of context switches per core you can expect on xeon level hardware?
I'm trying to claim we get overloaded when we reach a little less than 10,000 cswch/s per core, but we've lost all the historical data.
Indeed, is there going to be a maximum for a given piece of hardware (eg, maximum number of interrupts that can be generated per second; time spent in the interrupt handler that all has to be handled by only one CPU, which would explain why CPU system usage never looks alarming (divide by 8 on some servers, by 16 on others); a big kernel lock somewhere in the context switch code)?
When we have these overloads, nothing else we measure seems to be approaching any limit. The servers have plenty of CPU left, and there's no real difficulty logging into them. Anything else I should be looking at? Fork rate is tiny (1 or 2 per second). Network bandwidth is fine. Not sure that I've noticed network packet limitations (4k packets per second per host when it failed last time, generating 16000 interrupts/second total per host).
What is going wrong in the "overload"?

Why not just write a context switch benchmark? It should be simple to have 50+ pairs of processes and for each pair have them send a byte to a pipe and then wait to receive a byte from another pipe.

http://manpages.ubuntu.com/manpages/hardy/lat_ctx.8.html

From a quick Google search it seems that my above idea has already been implemented.

http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
https://github.com/tsuna/contextswitch

The above looks interesting too. A google search for the words context, switch, and benchmark will find you other things as well.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
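For anyone who wants to try Russell's suggestion directly, here is a minimal sketch of the pipe ping-pong idea, written for this thread rather than taken from lat_ctx or the linked contextswitch benchmark. One pair of processes is shown; run several copies to approximate the 50+ pairs, and pin them to one core with taskset for a cleaner per-switch figure.

/* pingpong.c -- rough two-process context switch benchmark (one pair).
 * Each byte bounced across the pipe pair forces at least two switches.
 * Build: cc -O2 -o pingpong pingpong.c
 * Run:   taskset -c 0 ./pingpong 100000
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    long iters = argc > 1 ? atol(argv[1]) : 100000;
    int a[2], b[2];                 /* a: parent -> child, b: child -> parent */
    char byte = 'x';

    if (pipe(a) || pipe(b)) { perror("pipe"); return 1; }

    if (fork() == 0) {              /* child: echo every byte straight back */
        for (long i = 0; i < iters; i++) {
            if (read(a[0], &byte, 1) != 1) break;
            if (write(b[1], &byte, 1) != 1) break;
        }
        _exit(0);
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (long i = 0; i < iters; i++) {   /* parent: send a byte, wait for echo */
        write(a[1], &byte, 1);
        read(b[0], &byte, 1);
    }
    gettimeofday(&t1, NULL);
    wait(NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    /* each round trip is roughly two context switches */
    printf("%ld round trips in %.3fs, ~%.2f us per switch\n",
           iters, secs, secs * 1e6 / (iters * 2));
    return 0;
}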

On Wed, 20 Nov 2013, Russell Coker wrote:
On Wed, 20 Nov 2013, Tim Connors <tconnors@rather.puzzling.org> wrote:
Does anyone know what the maximum number of context switches per core you can expect on xeon level hardware?
I'm trying to claim we get overloaded when we reach a little less than 10,000 cswch/s per core, but we've lost all the historical data.
Indeed, is there going to be a maximum for a given piece of hardware (eg, maximum number of interrupts that can be generated per second; time spent in the interrupt handler that all has to be handled by only one CPU, which would explain why CPU system usage never looks alarming (divide by 8 on some servers, by 16 on others); a big kernel lock somewhere in the context switch code)?
When we have these overloads, nothing else we measure seems to be approaching any limit. The servers have plenty of CPU left, and there's no real difficulty logging into them. Anything else I should be looking at? Fork rate is tiny (1 or 2 per second). Network bandwidth is fine. Not sure that I've noticed network packet limitations (4k packets per second per host when it failed last time, generating 16000 interrupts/second total per host).
What is going wrong in the "overload"?
Something hits a tipping point, the number of apache worker slots (3000-6000 depending on hardware specs) rapidly fills up, then apache stops accepting new connections and www.bom.gov.au goes dark (since this happens on all machines in the load balanced cluster simultaneously). woops!
Why not just write a context switch benchmark? It should be simple to have 50+ pairs of processes and for each pair have them send a byte to a pipe and then wait to receive a byte from another pipe.
http://manpages.ubuntu.com/manpages/hardy/lat_ctx.8.html
From a quick Google search it seems that my above idea has already been implemented.
http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html https://github.com/tsuna/contextswitch
The above looks interesting too. A google search for the words context, switch, and benchmark will find you other things as well.
Believe me, I searched. All the snot in my head seems to be clogging up my synapses today, unfortunately. But the blog entry looks good.

I imagine that the 140,000 cswitches/second on 16 core machines running httpd+php interpreter is pretty much a fundamental limit on E5410 level hardware, given that apache is heavyweight enough that it's going to be more towards the 50,000ns end of the spectrum presented in that blog.

Now I just have to convince the powers that be that php is a stupid thing to rely on when you don't have to, and that it's obviously that recent change that broke a system which formerly coped with many times the traffic it now croaks on.

Now I've got some benchmarks to run. I mean, fight some fires.

-- 
Tim Connors
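As a back-of-envelope check on those figures, the sketch below just does the division; the 5 microsecond per-switch cost is an assumed ballpark in the range that blog reports, not a measurement from these servers, and the 140,000/16-core figures are taken from the message above as illustrative inputs.

/* switchbudget.c -- back-of-envelope: what fraction of CPU do N context
 * switches per second cost at an assumed per-switch price?  The numbers
 * below (140000 cswch/s, 16 cores, 5 us per switch) are illustrative
 * assumptions, not measurements from the machines discussed above.
 */
#include <stdio.h>

int main(void)
{
    double cswch_per_sec = 140000.0;   /* whole-box figure from sar -w */
    int    cores         = 16;
    double cost_us       = 5.0;        /* assumed direct cost of one switch */

    double per_core = cswch_per_sec / cores;
    double cpu_frac = per_core * cost_us / 1e6;   /* fraction of one core */

    printf("%.0f cswch/s per core; at %.1f us each that is %.1f%% of each core,\n"
           "before any indirect cache/TLB refill costs.\n",
           per_core, cost_us, cpu_frac * 100);
    return 0;
}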

On 20 November 2013 15:16, Tim Connors <tconnors@rather.puzzling.org> wrote:
When we have these overloads, nothing else we measure seems to be approaching any limit. The servers have plenty of CPU left, and there's no real difficulty logging into them. Anything else I should be looking at? Fork rate is tiny (1 or 2 per second). Network bandwidth is fine. Not sure that I've noticed network packet limitations (4k packets per second per host when it failed last time, generating 16000 interrupts/second total per host).
What is going wrong in the "overload"?
Something hits a tipping point, the number of apache worker slots (3000-6000 depending on hardware specs) rapidly fills up, then apache stops accepting new connections and www.bom.gov.au goes dark (since this happens on all machines in the load balanced cluster simultaneously).
Ah, you are probably already well aware of this, so stop me if so..

In my experience, there's definitely an upper bound to the number of web-serving worker threads you can run on a machine, beyond which you start to see a drop in the aggregate performance rather than gain.

Three to six thousand slots sounds like a lot for one machine, to me.* I wondered why so many? Are you not running a reverse-proxy accelerator in front of Apache? (eg. Varnish or some configurations of nginx)

If you were just serving static content directly, I'd go with something lighter-weight than Apache; and if you're serving dynamic content (ie. the php you mention) then I'd definitely not do so without a good reverse-proxy in front of it, and a much-reduced number of apache threads.

Sorry this doesn't really help with the context-switching question, but maybe helps with the overall issue.

-Toby

* But I'm a bit out of date; current spec hardware is quite a bit more powerful than it was last time I was seriously working with high-thread-count code.

On Thu, 21 Nov 2013, Toby Corkindale wrote:
On 20 November 2013 15:16, Tim Connors <tconnors@rather.puzzling.org> wrote:
When we have these overloads, nothing else we measure seems to be approaching any limit. The servers have plenty of CPU left, and there's no real difficulty logging into them. Anything else I should be looking at? Fork rate is tiny (1 or 2 per second). Network bandwidth is fine. Not sure that I've noticed network packet limitations (4k packets per second per host when it failed last time, generating 16000 interrupts/second total per host).
What is going wrong in the "overload"?
Something hits a tipping point, the number of apache worker slots (3000-6000 depending on hardware specs) rapidly fills up, then apache stops accepting new connections and www.bom.gov.au goes dark (since this happens on all machines in the load balanced cluster simultaneously).
Ah, you are probably already well aware of this, so stop me if so.. In my experience, there's definitely an upper bound to the number of web-serving worker threads you can run on a machine, beyond which you start to see a drop in the aggregate performance rather than gain.
That I believe. But the reason I bring up context switch limitations is that we've still got 80% cpu free when we hit whatever limit we're hitting. iowait is bugger all; system time (ie, what I would have thought would accurately account for any context switch overhead, though maybe it misses a few cpu cycles between user space relinquishing control and the kernel swapping its pointers in and starting to account for resources against itself?) is small; so what else can it be? Network is fine, interrupts appear fine (but again I don't know whether there's an invisible limitation there), and the NFS server and disk array are trivially loaded.

Here's the sar -u output from last Saturday when we filled 5000 slots on some of the cluster nodes and started refusing connections (a scary thought, because then the load balancer drops out the first offending node, so the others have to take up the slack, and they all drop out too):

00:00:01     CPU     %user     %nice   %system   %iowait    %steal     %idle
06:20:01     all      8.25      0.00      9.44      0.25      0.00     82.05

and sar -w:

00:00:01    cswch/s
06:10:01   80444.64
06:20:01  108955.60
06:30:01   55500.00

(I've seen 150000 on these 16 core machines before it really started struggling. The older 8 core nodes in the cluster achieve pretty much half that.)
Three to six thousand slots sounds like a lot for one machine, to me.* I wondered why so many? Are you not running a reverse-proxy accelerator in front of Apache? (eg. Varnish or some configurations of nginx)
We're government. Let's throw resources at it (until we run out of money) rather than think about it carefully (actually, it's served us pretty well up til now. But someone made Wise Choices last year, and then <rant elided>).
If you were just serving static content directly, I'd go with something lighter-weight than Apache; and if you're serving dynamic content (ie. the php you mention) then I'd definitely not do so without a good reverse-proxy in front of it, and a much-reduced number of apache threads.
The php wasn't a big thing until a few months ago. It's obvious that it's causing the problems, but chucking a cache in front of each node will be impossible now that we're not allowed to buy any equipment or replacements for so-old-they're-out-of-warranty machines (annoyingly, the offending .php page could just as easily be a static page, but we outsourced that to an "industry expert"). The httpd.conf configuration is complex enough that it'll never be replaced with another httpd server particularly now that the only two people who knew enough about it in the web group have retired.
* But I'm a bit out of date; current spec hardware is quite a bit more powerful than it was last time I was seriously working with high-thread-count code.
Heh. I spent all day in a hot aisle of one of our data centres tracing unlabelled cables a few weeks ago. Some of these 64 core blades are seriously powerful boxes. Fortunately it wasn't a day like today.

-- 
Tim Connors

On 27 November 2013 13:06, Tim Connors <tconnors@rather.puzzling.org> wrote: [snip]
Three to six thousand slots sounds like a lot for one machine, to me.* I wondered why so many? Are you not running a reverse-proxy accelerator in front of Apache? (eg. Varnish or some configurations of nginx)
We're government. Let's throw resources at it (until we run out of money) rather than think about it carefully (actually, it's served us pretty well up til now. But someone made Wise Choices last year, and then <rant elided>).
If you were just serving static content directly, I'd go with something lighter-weight than Apache; and if you're serving dynamic content (ie. the php you mention) then I'd definitely not do so without a good reverse-proxy in front of it, and a much-reduced number of apache threads.
The php wasn't a big thing until a few months ago. It's obvious that it's causing the problems, but chucking a cache in front of each node will be impossible now that we're not allowed to buy any equipment or replacements for so-old-they're-out-of-warranty machines (annoyingly, the offending .php page could just as easily be a static page, but we outsourced that to an "industry expert"). The httpd.conf configuration is complex enough that it'll never be replaced with another httpd server particularly now that the only two people who knew enough about it in the web group have retired.
If you were just serving static content before, then Apache (w/sendfile) is fairly efficient and can handle a lot of simultaneous connections. I really wouldn't do it myself for a busy site, and it's susceptible to a few problems, but it's something you can mostly get away with.

As soon as you throw dynamically-generated stuff in (ie. CGI of any sort) all that changes. You see, Apache can use a kernel facility (the sendfile system call) to attach an open filehandle (to static content) directly to a socket, and then the kernel just handles the rest, so it uses very little memory or CPU time or context switching. But dynamic content involves spawning an additional process, doing a whole lot of memory allocation, lots of file i/o and system calls, and worse: that heavy execution environment has to stick around as long as it takes to send the results off to a client, gradually feeding in a few kilobytes at a time. If your client is on a slow connection (mobile, dial-up, busy adsl, DoS attack) then you're holding up a lot of resources for a long time.

Again, you are probably aware of this -- but I'm trying to illustrate just *how much* heavier the PHP processes are compared to serving static content. I'm not particularly surprised that servers which could handle things fine w/static content are falling over now.

I do advise limiting the maximum number of threads per machine, to some amount lower than where you see the problems occurring. Set up a benchmarking rig against one machine (that's taken out of the load-balancing pool) and find out where the optimum amount is -- I'm pretty sure you'll find it's lower than at the many-thousand-processes mark. It's better to allow connections to pile up in the "pending" queue, but be able to process them quickly, than to accept them all and then serve them all very slowly or not at all.

Secondly, if you're really stuck with Apache, and can't put decent reverse proxy accelerators in front of them, then try switching over to the event-based worker? http://httpd.apache.org/docs/current/mod/event.html

Toby
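To make the sendfile fast path concrete, here is a toy, hand-rolled sketch of serving a single static file that way — nothing like Apache's real code, with error handling and request parsing left out, and the file name and port below are arbitrary placeholders.

/* staticserve.c -- toy illustration of the sendfile(2) fast path: serve one
 * static file to one client.  Nothing like Apache's real code; error paths
 * and HTTP request parsing are omitted.
 * Build: cc -O2 -o staticserve staticserve.c
 * Run:   ./staticserve index.html 8080   (then fetch http://localhost:8080/)
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "index.html";
    int port = argc > 2 ? atoi(argv[2]) : 8080;

    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(port),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 128);

    int conn = accept(srv, NULL, NULL);

    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    char hdr[128];
    int n = snprintf(hdr, sizeof hdr,
                     "HTTP/1.0 200 OK\r\nContent-Length: %ld\r\n\r\n",
                     (long)st.st_size);
    write(conn, hdr, n);

    /* The interesting bit: the kernel copies file -> socket directly; no
     * user-space buffer, no fat worker process shuffling the bytes along. */
    off_t off = 0;
    while (off < st.st_size)
        if (sendfile(conn, fd, &off, st.st_size - off) <= 0)
            break;

    close(fd);
    close(conn);
    close(srv);
    return 0;
}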

On Wed, 27 Nov 2013, Toby Corkindale wrote:
I do advise limiting the maximum number of threads per machine, to some amount lower than where you see the problems occurring. Set up a benchmarking rig against one machine (that's taken out of the load-balancing pool) and find out where the optimum amount is -- I'm pretty sure you'll find it's lower than at the many-thousand-processes mark. It's better to allow connections to pile up in the "pending" queue, but be able to process them quickly, than to accept them all and then serve them all very slowly or not at all.
ListenBacklog: https://sites.google.com/site/beingroot/articles/apache/socket-backlog-tunin...

512 by default (according to that page; http://<server>/server-info doesn't list the current value), but the kernel in its current config only allows 128 per port. Aha, we'll change that next week.

(I actually thought that when it got to MaxClients, ie, all the slots filled up, it didn't accept any new connections at all. At least, in practice, we find that once it hits MaxClients, the servers start dropping connections very soon after, and this propagates through the load balancers. That would be explained by the kernel limit only allowing an extra 10% above the current number of max slots. Easy to fix.)
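For the record, that 128-per-port cap is net.core.somaxconn, and listen(2) silently truncates whatever backlog Apache's ListenBacklog requests down to it, so ListenBacklog only helps once the sysctl is raised as well. A small sketch (the port number is arbitrary) showing where the numbers come from:

/* backlog.c -- the kernel-side cap on listen() backlogs.  listen(2) silently
 * truncates the requested backlog to /proc/sys/net/core/somaxconn, so an
 * Apache "ListenBacklog 512" only buys 512 pending connections if somaxconn
 * has also been raised (e.g. sysctl -w net.core.somaxconn=1024).
 */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int cap = -1;
    FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");
    if (f) {
        if (fscanf(f, "%d", &cap) != 1)
            cap = -1;
        fclose(f);
    }
    printf("net.core.somaxconn = %d\n", cap);

    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET,
                             .sin_port = htons(8081),        /* arbitrary port */
                             .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(s, (struct sockaddr *)&a, sizeof a);

    /* Ask for a huge backlog; the kernel quietly clamps it to somaxconn. */
    if (listen(s, 65535) == 0 && cap > 0)
        printf("listen(s, 65535) succeeded, but the effective backlog is "
               "min(65535, somaxconn) = %d\n", cap < 65535 ? cap : 65535);

    close(s);
    return 0;
}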
Secondly, if you're really stuck with Apache, and can't put decent reverse proxy accelerators in front of them, then try switching over to the event-based worker? http://httpd.apache.org/docs/current/mod/event.html
That would be good, but rhel5&6 are still on apache 2.2, and event is marked as experimental there :( http://httpd.apache.org/docs/2.2/mod/event.html We'll be stuck on rhel5 in production for years to come, at the current rate.

I wonder about worker vs prefork? linux processes are lightweight, so I don't imagine threading is going to be much better. We only fork one process per second typically, and I don't think there'll be many differences in context switch overhead between the two. Worker apparently "sucks for php", but I don't know whether that's for mod_php or cgi or whatever.

I like the sound of mod_pagespeed: https://www.digitalocean.com/community/articles/how-to-get-started-with-mod_... but the risk of rewriting stuff on the fly won't be accepted for most of our website.

Hey, we just rediscovered a longstanding problem in that the most common static image on the site (5 million hits in an hour, and the image hasn't changed in a year) was in a directory that was marked as non-cacheable! Whee!

-- 
Tim Connors
Midrange Systems | ITB | Bureau of Meteorology
Phone: (03) 9669 4208 | E-mail: T.Connors@bom.gov.au

On 29 November 2013 17:08, Tim Connors <tconnors@rather.puzzling.org> wrote:
On Wed, 27 Nov 2013, Toby Corkindale wrote: [snip]
Secondly, if you're really stuck with Apache, and can't put decent reverse proxy accelerators in front of them, then try switching over to the event-based worker? http://httpd.apache.org/docs/current/mod/event.html
That would be good, but rhel5&6 are still on apache 2.2, and event is marked as experimental there :( http://httpd.apache.org/docs/2.2/mod/event.html
We'll be stuck on rhel5 in production for years to come, at the current rate.
Ugh :(
I wonder about worker vs prefork? linux processes are lightweight, so I don't imagine threading is going to be much better. We only fork one process per second typically, and I don't think there'll be many differences in context switch overhead between the two. Worker apparently "sucks for php", but I don't know whether that's for mod_php or cgi or whatever.
That's OK though, because you're following the highly recommended practice of not using the same Apache instance for dynamic (ie. php) content as for static content, right? So you can switch the static handling over to mpm worker, and leave PHP on the prefork setup. T

On Fri, 29 Nov 2013 05:08:58 PM Tim Connors wrote:
I wonder about worker vs prefork? linux processes are lightweight, so I don't imagine threading is going to be much better. We only fork one process per second typically, and I don't think there'll be many differences in context switch overhead between the two. Worker apparently "sucks for php", but I don't know whether that's for mod_php or cgi or whatever.
Well I switched from prefork to worker on my personal Debian VM so that it could cope with lots of PHP; I kept running out of RAM otherwise.

Worked really well and my problems went away.

Now of course it's not loaded anything like the BoM servers, but this is the first I've heard that it "sucks for php".

cheers,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

You may also want to investigate Apache in worker mode with php-fpm running via FastCGI. I switched a VM to that combo recently and the memory consumption by Apache dropped significantly. You can then also disable lots of apache modules to make it even leaner.

Also, consider New Relic monitoring of your servers. That allowed us to spot all sorts of PHP and MySQL related bottlenecks.

Sent from my mobile device.
On 30 Nov 2013, at 8:53 am, Chris Samuel <chris@csamuel.org> wrote:
On Fri, 29 Nov 2013 05:08:58 PM Tim Connors wrote:
I wonder about worker vs prefork? linux processes are lightweight, so I don't imagine threading is going to be much better. We only fork one process per second typically, and I don't think there'll be many differences in context switch overhead between the two. Worker apparently "sucks for php", but I don't know whether that's for mod_php or cgi or whatever.
Well I switched from prefork to worker on my personal Debian VM so that it could cope with lots of PHP; I kept running out of RAM otherwise.
Worked really well and my problems went away.
Now of course it's not loaded anything like the BoM servers, but this is the first I've heard that it "sucks for php".
cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On 30 November 2013 08:53, Chris Samuel <chris@csamuel.org> wrote:
On Fri, 29 Nov 2013 05:08:58 PM Tim Connors wrote:
I wonder about worker vs prefork? linux processes are lightweight, so I don't imagine threading is going to be much better. We only fork one process per second typically, and I don't think there'll be many differences in context switch overhead between the two. Worker apparently "sucks for php", but I don't know whether that's for mod_php or cgi or whatever.
Well I switched from prefork to worker on my personal Debian VM so that it could cope with lots of PHP; I kept running out of RAM otherwise.
Worked really well and my problems went away.
Now of course it's not loaded anything like the BoM servers, but this is the first I've heard that it "sucks for php".
It's even in the PHP manual: http://www.php.net/manual/en/faq.installation.php#faq.installation.apache2

On 03/12/13 10:46, Toby Corkindale wrote:
It's even in the PHP manual:
I suspect that's about using mod_php, not PHP per se; it says:

# If you want to use a threaded MPM, look at a FastCGI
# configuration where PHP is running in its own memory space.

which is pretty much what I do (mod_fcgid) and it works well.

cheers,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

On 3 December 2013 10:55, Chris Samuel <chris@csamuel.org> wrote:
On 03/12/13 10:46, Toby Corkindale wrote:
It's even in the PHP manual:
I suspect that's about using mod_php, not PHP per se, it says:
# If you want to use a threaded MPM, look at a FastCGI
# configuration where PHP is running in its own memory space.
which is pretty much what I do (mod_fcgid) and it works well.
I'd agree with that. Using FastCGI is similar to using a reverse proxy: rather than apache running the dynamic content generation, it's just dispatching queries to back-end app servers and then forwarding the results back to the client. (Which is different from trying to efficiently manage many php interpreters within a multi-threaded apache process.)

Given what we've heard about Tim's environment, I suspect the machines will have taken the simplistic option, and be running PHP within the Apache process, rather than as separate processes connected to Apache via a socket.

T
participants (5):
- Avi Miller
- Chris Samuel
- Russell Coker
- Tim Connors
- Toby Corkindale