
# strace -p 20033
Process 20033 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>^C <unfinished ...>
Process 20033 detached

The above is from a system running a recent update of Debian/Testing. Running strace on process 20033 shows that it's stuck in a system call, so it appears that one of the child threads of 20033 is using the CPU time. How do I determine which one it is?

Does anyone have any general suggestions for debugging the case where mysqld is using a lot of CPU time while not doing anything? The mysql command "show processlist;" shows that there are currently NO connections to mysqld at all.

top - 17:23:54 up 22:44,  2 users,  load average: 1.22, 2.00, 2.59
Tasks: 191 total,   1 running, 190 sleeping,   0 stopped,   0 zombie
%Cpu(s): 21.6 us,  2.6 sy,  0.0 ni, 75.0 id,  0.8 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   2034264 total,  1621676 used,   412588 free,    51228 buffers
KiB Swap:  1949692 total,    66244 used,  1883448 free,   879412 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
20033 mysql     20   0  344m  37m 1480 S 130.9  1.9  33:31.93 mysqld
27420 root      20   0 23300 1548 1084 R   5.9  0.1   0:00.02 top

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
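[Editor's note: on the per-thread question above, Linux exposes each thread of a process under /proc/PID/task/, and ps/top can show per-thread CPU usage directly. A minimal sketch using standard procps options; the script takes the mysqld PID as an argument and falls back to the current shell only so it is safe to run as-is:

```shell
#!/bin/sh
# Sketch: find which thread (LWP) of a process is burning CPU.
# Pass the mysqld PID as the first argument; defaults to the current
# shell here purely so the example runs without a mysqld present.
PID=${1:-$$}

# -L lists one row per thread; LWP is the kernel thread ID and
# pcpu is that thread's share of CPU time.
ps -L -o lwp,pcpu,state,comm -p "$PID"

# top can show the same per-thread view interactively:
#   top -H -p "$PID"
# and once the busy thread ID is known, strace can attach to that
# thread directly instead of the whole process:
#   strace -p "$TID"
```

Both ps -L and top -H read the per-task entries under /proc/PID/task/, so they report the same thread IDs that strace -p accepts.]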

On 01/07/2012, at 5:28 PM, Russell Coker wrote:
Does anyone have any general suggestions for debugging the case where mysqld is using a lot of CPU time while not doing anything? The mysql command "show processlist;" shows that there are currently NO connections to mysqld at all.
No general suggestions, but this could be the leap second bug. See http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and...

On Sun, 1 Jul 2012, Alan Harper <alan@aussiegeek.net> wrote:
No general suggestions, but this could be the leap second bug. See http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix/
Thanks. I used date(1) to set the date and then CPU use suddenly dropped.

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=679723

There's already a Debian bug; I should have checked the BTS before asking on the list.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
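[Editor's note: the workaround Russell describes, also the one in the Mozilla post, amounts to stopping ntpd and then setting the clock to its current value, which clears the kernel's stuck leap-second state. A hedged sketch; the ntp init script name varies by distribution, and the clock-setting step is left commented out here because it requires root and has side effects:

```shell
#!/bin/sh
# Sketch of the widely-circulated leap-second workaround.
# 1. Stop ntpd first so it cannot immediately re-arm the leap flag
#    (root required; path varies by distribution):
#      /etc/init.d/ntp stop

# 2. Set the clock to its current value with date(1); this is what
#    clears the stuck state. The format below is date's bare
#    MMDDhhmmCCYY.ss argument form.
NOW=$(date +"%m%d%H%M%C%y.%S")
echo "on an affected system, run as root: date $NOW"
#      date "$NOW"
```

No time actually changes; setting the clock to "now" is enough to reset the kernel's hrtimer state.]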

Hello Russell, On Sun, 2012-07-01 at 18:57 +1000, Russell Coker wrote:
On Sun, 1 Jul 2012, Alan Harper <alan@aussiegeek.net> wrote:
No general suggestions, but this could be the leap second bug. See http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix/
Thanks. I used date(1) to set the date and then CPU use suddenly dropped.
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=679723
There's already a Debian bug, I should have checked the BTS before asking on the list.
But then the rest of us would not have had the news. I have been reading about leap seconds, and knowing that they can have a real impact is of value, as is knowing that you cleared the matter by using date.

Regards,

Mark Trickett

On Sun, 1 Jul 2012, Mark Trickett <marktrickett@bigpond.com> wrote:
But then the rest of us would not have had the news. I have been reading things about leap seconds, and now knowing that it can have a real impact is of value, along with that you cleared the matter by using date.
The below message, which was sent out by Hetzner.de (a hosting company that provides excellent value for money and a quality service, and which incidentally owns the server that runs my blog), should be of interest. 1MW for a couple of bugs which didn't even affect all servers! I have root on 5 Hetzner systems, which includes two MySQL instances, and for some reason none of them were afflicted by this.

This is getting to the level where it becomes an issue that should be of concern to national governments. Hetzner is only one German hosting company, and there's also a lot of private computer use that has mostly idle servers (EG pretty much every corporate server I've ever run). It's easy to imagine this bug having added a few hundred MW of load to the power grid. That sort of sudden load could cause a blackout. If the systems which manage the power grid to prevent cascading failures had also been hit by the same bug then it would have been particularly nasty.

On Tuesday, 03.07.2012 at 14:59 +0200, info@hetzner.de wrote:
During the night of 30.06.2012 to 01.07.2012 our internal monitoring systems registered an increase in the level of IT power usage by approximately one megawatt.
The reason for this huge surge is the additional switched leap second which can lead to permanent CPU load on Linux servers.
According to heise.de, various Linux distributions are affected by this. Further information can be found at: http://www.h-online.com/open/news/item/Leap-second-Linux-can-freeze-1629805.html
In order to reduce CPU load to a normal level again, a restart of the whole system is necessary in many cases. First, a soft reboot via the command line should be attempted. Failing that, you have the option of performing a hardware reset via the Robot administration interface. For this, select menu item "Server" and the "Reset" tab for the respective server in the administration interface.
Please do not hesitate to contact us, should you have any queries.
Kind regards,
Hetzner Online AG
Stuttgarter Str. 1
91710 Gunzenhausen / Germany
info@hetzner.de
http://www.hetzner.com
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Wed, Jul 04, 2012 at 12:11:59AM +1000, Russell Coker wrote:
On Sun, 1 Jul 2012, Mark Trickett <marktrickett@bigpond.com> wrote:
But then the rest of us would not have had the news. I have been reading things about leap seconds, and now knowing that it can have a real impact is of value, along with that you cleared the matter by using date.
The below message which was sent out by Hetzner.de (a hosting company that provides excellent value for money and a quality service - which incidentally owns the server that runs my blog) should be of interest.
1MW for a couple of bugs which didn't even affect all servers!
another take on this would be that it's a shocking waste that these machines aren't using 1MW all the time - it means that they are basically idle and wasted. cloud is highly inefficient. those of us in HPC expect all machines to be running at ~90% of max power all the time. if they're not, then something is wrong.
I have root on 5 Hetzner systems which includes two MySQL instances and for some reason none of them were afflicted by this.
we had 2 nodes out of ~2000 that might have been affected by the leap second. very minor.
This is getting to the level where it becomes an issue that should be of concern to national governments.
WTF?! I think you must be trolling... back on planet earth -> AFAICT the bug is now understood if not fixed. inflating this to anything more than "it was a bug in linux" is tabloid sensationalism. indeed that itself is only news if you don't understand that all software over 10 lines has bugs in it.
Hetzner is only one German hosting company and there's also a lot of private computer use that has mostly idle servers (EG pretty much every corporate server I've ever run).
that is shocking. machines use ~30-50% of max power when idle. they should either be off or at max power doing useful work. anything else is a total waste. I guess virtualisation doesn't work now any better than it ever has done.
It's easy to imagine this bug as having added a few hundred MW of load to the power grid. That sort of sudden load could cause a blackout. If the systems which manage the power grid to prevent cascading failures were also hit by the same bug then it would have been particularly nasty.
servers are a tiny proportion of the baseload. think of all the air conditioners and aluminium smelters out there. I believe servers are at the 1 to 2% level of total power used. also if power companies can't supply to the sum of their rated substations then that would be negligent of them. they strictly regulate their substations - you can't just plug one in. major blackouts usually occur because of poor maintenance and preparation (eg. the current USA storm blackouts) compounded by storms, ice storms, geomagnetic storms, or by bugs in power company software and protocols, such as those that took out the east coast of the USA a few years ago.

cheers,
robin

On Thu, 5 Jul 2012, Robin Humble <robin.humble@anu.edu.au> wrote:
1MW for a couple of bugs which didn't even affect all servers!
another take on this would be that it's a shocking waste that these machines aren't using 1MW all the time - it means that they are basically idle and wasted. cloud is highly inefficient. those of us in HPC expect all machines to be running at ~90% of max power all the time. if they're not, then something is wrong.
An advantage of the Linode model or the EC2 model is that the resources may be used more efficiently. Hetzner just rents servers, so if you only need a fraction of the resources you still get an entire server. The Hetzner servers I run are far from fully utilised, but they are still a lot cheaper than any other option for getting the same job done.
I have root on
5 Hetzner systems which includes two MySQL instances and for some reason none of them were afflicted by this.
we had 2 nodes out of ~2000 that might have been affected by the leap second. very minor.
I just had Chromium and MySQL on my workstation get afflicted. It's strange that it apparently only happened today and didn't appear to happen in the past (I ran top a couple of days ago and saw nothing unusual).
Hetzner is only one German hosting company and there's also a lot of private computer use that has mostly idle servers (EG pretty much every corporate server I've ever run).
that is shocking. machines use ~30-50% of max power when idle. they should either be off or at max power doing useful work. anything else is a total waste. I guess virtualisation doesn't work now any better than it ever has done.
We need more grid computing tasks like SETI@home. It's a pity that they all seem to have proprietary clients which makes them undesirable to us.
It's easy to imagine
this bug as having added a few hundred MW of load to the power grid. That sort of sudden load could cause a blackout. If the systems which manage the power grid to prevent cascading failures were also hit by the same bug then it would have been particularly nasty.
servers are tiny proportion of the baseload. think of all the air conditioners and aluminium smelters out there. I believe they are at the 1 to 2% level of total power used.
I believe that aluminium smelters have close arrangements with the power companies, they don't just surprise the power company by turning things on. Air conditioners are quite predictable, you won't suddenly have a few hundred MW of air-conditioning turn on at midnight!
also if power companies can't supply to the sum of their rated substations then that would be negligent of them. they strictly regulate their substations - you can't just plug one in.
It's a well known fact that in almost every part of the world the power companies can NEVER supply to the sum of their substations (*). Doing so would be simply uneconomical, as a lot of unused generating capacity would need to be built and maintained at significant expense. It's not uncommon for this to be demonstrated in summer.

(*) Antarctica may be the only exception.
major blackouts usually occur because of poor maintenance and preparation (eg. the current USA storm blackouts) compounded by storms, ice storms, geomagnetic storms, or by bugs in power company software and protocols, such as those that took out the east coast of the USA a few years ago.
A few years ago power to a large portion of the Melbourne CBD was cut when the connection to Tasmania failed. It seems that in hot weather we are one power station failure away from having a massive power cut. Cutting off the CBD is a major issue.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Thursday 05 July 2012 22:22:53 Russell Coker wrote:
We need more grid computing tasks like SETI@home. It's a pity that they all seem to have proprietary clients which makes them undesirable to us.
*cough* BOINC *cough* http://boinc.berkeley.edu/

Not all problems are amenable to that though (not many desktop systems have the 1TB of RAM needed by some codes).

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On Sun, 8 Jul 2012, Chris Samuel <chris@csamuel.org> wrote:
We need more grid computing tasks like SETI@home. It's a pity that they all seem to have proprietary clients which makes them undesirable to us.
cough BOINC cough
Not all problems are amenable to that though (not many desktop systems have 1TB of RAM needed by some codes).
I hadn't realised that they open-sourced that one. It looks like it won't be in Wheezy (it was in Squeeze), why is that?

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Sunday 08 July 2012 20:28:50 Russell Coker wrote:
I hadn't realised that they open-sourced that one.
I thought it had always been open.
It looks like it won't be in Wheezy (it was in Squeeze), why is that?
I presume because it was removed from testing: http://packages.qa.debian.org/b/boinc/news/20120514T163914Z.html

This appears to be why: http://release.debian.org/migration/testing.pl?package=boinc

# boinc is not yet built on powerpc: 7.0.27+dfsg-3 vs 7.0.27+dfsg-5
# (missing 8 binaries)

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

Chris Samuel wrote:
I thought [boinc] had always been open.
IIRC, seti@home was originally closed-source, and from that grew boinc which was/is open source. WP doesn't make it clear.
http://packages.qa.debian.org/b/boinc/news/20120514T163914Z.html http://release.debian.org/migration/testing.pl?package=boinc
Also wnpp-alert found this: RFH 511243 boinc -- BOINC distributed computing

On Monday 09 July 2012 11:49:46 Trent W. Buck wrote:
IIRC, seti@home was originally closed-source, and from that grew boinc which was/is open source.
That's my memory and understanding too; here's the page about BOINC from 2002: http://web.archive.org/web/20021005205228/http://boinc.berkeley.edu/intro.ht...

and here's a Nature article about it: http://www.nature.com/news/2005/051212/full/news051212-10.html

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On Wed, Jul 04, 2012 at 12:11:59AM +1000, Russell Coker wrote:
On Sun, 1 Jul 2012, Mark Trickett <marktrickett@bigpond.com> wrote:
But then the rest of us would not have had the news. I have been reading things about leap seconds, and now knowing that it can have a real impact is of value, along with that you cleared the matter by using date.
The below message which was sent out by Hetzner.de (a hosting company that provides excellent value for money and a quality service - which incidentally owns the server that runs my blog) should be of interest.
1MW for a couple of bugs which didn't even affect all servers!
another take on this would be that it's a shocking waste that these machines aren't using 1MW all the time - it means that they are basically idle and wasted. cloud is highly inefficient. those of us in HPC expect all machines to be running at ~90% of max power all the time. if they're not, then something is wrong.
That statement and most of the rest of what you said only makes sense if the hosting company is running an HPC cluster.
I guess virtualisation doesn't work now any better than it ever has done.
That's the stupidest thing I've heard today. Virtualisation allows you to take a mostly idle workload off of (say) 100 servers and run it on a single server[1] which is well under 100x the cost and power consumption of the 100 servers. Even if that single server is still mostly idle, it still represents a massive saving of hardware and power, and certainly doesn't equate to virtualisation "not working".

James

[1] I know you wouldn't use a single server - I'm illustrating the ratio that a data centre might use.

On Thu, 5 Jul 2012, James Harper <james.harper@bendigoit.com.au> wrote:
I guess virtualisation doesn't work now any better than it ever has done.
That's the stupidest thing I've heard today. Virtualisation allows you to take a mostly idle workload off of (say) 100 servers and run it on a single server[1] which is well under 100x the cost and power consumption of the 100 servers. Even if that single server is still mostly idle, it still represents a massive saving of hardware and power, and certainly doesn't equate to virtualisation "not working".
One issue with virtualisation is that not everyone does the same things. One of my clients has four Hetzner servers because they need 4*RAID-1 arrays for the size and performance of their data. They could probably get by with CPU power equivalent to one server, but would really need two servers for the RAM they need. Hetzner has just offered a new server with enough storage and RAM to satisfy my client's needs - but it costs more than the four current servers.

Presumably there are other Hetzner customers with lots of CPU use but little disk IO. While CPU utilisation accounts for most variation in power use, disk IO also counts for something. Amazon has various offerings for different ratios of IO and compute performance so they can try to match things to use all the resources. If you instruct EC2 to create a high CPU instance it could be run on a system that has mostly low-CPU instances.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
participants (7):
- Alan Harper
- Chris Samuel
- James Harper
- Mark Trickett
- Robin Humble
- Russell Coker
- Trent W. Buck