Periodic ntp loss of synchronisation problems

I am having periodic ntp synchronisation problems. ntp doesn't directly log anything - but I am using nagios3 to track it's synchronisation and I periodically get problems - several times a day. I can't figure out what is causing the loss of synchronisation. Perhaps a burst of ntp packets dropped? But wouldn't ntp log something about this? Any suggestions on logging more info to find the cause or deeper insights into what is going wrong would be appreciated. Andrew Here's the output of commands when things are bad: config(0)# check_ntp_peer -H 127.0.0.1 -w 1.0 -c 2.0 NTP WARNING: Server has the LI_ALARM bit set, Offset 0.210925 secs|offset=0.210925s;1.000000;2.000000; LI_ALARM apparently means not in sync ??? config(1)# ntpq -c rl associd=0 status=c618 leap_alarm, sync_ntp, 1 event, no_sys_peer, version="ntpd 4.2.6p2@1.2194-o Sun Oct 17 13:35:13 UTC 2010 (1)", processor="x86_64", system="Linux/2.6.32-5-amd64", leap=11, stratum=3, precision=-23, rootdelay=95.696, rootdisp=263.117, refid=192.189.54.33, reftime=d2bccf2f.4854b34f Sun, Jan 15 2012 15:06:07.282, clock=d2bcd1d6.e9ffea4c Sun, Jan 15 2012 15:17:26.914, peer=16519, tc=10, mintc=3, offset=0.000, frequency=500.000, sys_jitter=35.804, clk_jitter=0.000, clk_wander=91.828 It thinks it's in error by 16s??? config(0)# ntpdc -c kerninfo pll offset: 0 s pll frequency: 500.000 ppm maximum error: 16 s estimated error: 16 s status: 4041 pll unsync mode=fll pll time constant: 10 precision: 1e-06 s frequency tolerance: 500 ppm ntptime gives the same info config(0)# ntptime ntp_gettime() returns code 5 (ERROR) time d2bcd2be.08d3a000 Sun, Jan 15 2012 15:21:18.034, (.034479), maximum error 16000000 us, estimated error 16000000 us ntp_adjtime() returns code 5 (ERROR) modes 0x0 (), offset 0.000 us, frequency 500.000 ppm, interval 1 s, maximum error 16000000 us, estimated error 16000000 us, status 0x4041 (PLL,UNSYNC,MODE), time constant 10, precision 1.000 us, tolerance 500 ppm, Then mysertiously everything is okay: config(0)# ntpdc -c kerninfo pll offset: 0.00998 s pll frequency: 500.000 ppm maximum error: 1.6291 s estimated error: 0.004771 s status: 0001 pll pll time constant: 10 precision: 1e-06 s frequency tolerance: 500 ppm My leap becomes none (no leap_alarm) and things are ok? config(0)# ntpq -c rl associd=0 status=0618 leap_none, sync_ntp, 1 event, no_sys_peer, version="ntpd 4.2.6p2@1.2194-o Sun Oct 17 13:35:13 UTC 2010 (1)", processor="x86_64", system="Linux/2.6.32-5-amd64", leap=00, stratum=3, precision=-23, rootdelay=95.272, rootdisp=983.007, refid=192.189.54.33, reftime=d2bcd33b.bbc10580 Sun, Jan 15 2012 15:23:23.733, clock=d2bcd852.ec6be1b9 Sun, Jan 15 2012 15:45:06.923, peer=16519, tc=10, mintc=3, offset=13.497, frequency=500.000, sys_jitter=7.251, clk_jitter=4.772, clk_wander=151.809

I assume you would have mentioned it, but are you running on a virtual machine? If the host is tinkering with the VM's wallclock it could confuse ntp. James

no this is a real machine. On Jan 15, 2012 5:02 PM, "James Harper" <james.harper@bendigoit.com.au> wrote:
I assume you would have mentioned it, but are you running on a virtual
machine? If the host is tinkering with the VM's wallclock it could confuse ntp.
No it's a real machine. Andrew

On Sun, 15 Jan 2012, Andrew Worsley wrote:
Any suggestions on logging more info to find the cause or deeper insights into what is going wrong would be appreciated.
More logs with these settings: logfile /var/log/ntpd statsdir /var/log/ntpstats/ statistics loopstats peerstats clockstats filegen loopstats file loopstats type day enable filegen peerstats file peerstats type day enable filegen clockstats file clockstats type day enable (don't forget to logrotate!) Probably won't show anything other than time went off by a small amount.
Here's the output of commands when things are bad:
config(0)# check_ntp_peer -H 127.0.0.1 -w 1.0 -c 2.0 NTP WARNING: Server has the LI_ALARM bit set, Offset 0.210925 secs|offset=0.210925s;1.000000;2.000000;
LI_ALARM apparently means not in sync ???
config(1)# ntpq -c rl associd=0 status=c618 leap_alarm, sync_ntp, 1 event, no_sys_peer, version="ntpd 4.2.6p2@1.2194-o Sun Oct 17 13:35:13 UTC 2010 (1)", processor="x86_64", system="Linux/2.6.32-5-amd64", leap=11, stratum=3, precision=-23, rootdelay=95.696, rootdisp=263.117, refid=192.189.54.33, reftime=d2bccf2f.4854b34f Sun, Jan 15 2012 15:06:07.282, clock=d2bcd1d6.e9ffea4c Sun, Jan 15 2012 15:17:26.914, peer=16519, tc=10, mintc=3, offset=0.000, frequency=500.000, sys_jitter=35.804, clk_jitter=0.000, clk_wander=91.828
It thinks it's in error by 16s???
config(0)# ntpdc -c kerninfo pll offset: 0 s pll frequency: 500.000 ppm maximum error: 16 s estimated error: 16 s status: 4041 pll unsync mode=fll pll time constant: 10 precision: 1e-06 s frequency tolerance: 500 ppm
warrane.connect.com.au (192.189.54.33) got dodgy hardware? (do a check_ntp from your host against another ntp server perhaps, then compare against warrane.connect.com.au). ntptrace show anything? Writing a DVD just before? DMA accidentally turned itself off? -- Tim Connors

On 16 January 2012 12:31, Tim Connors <tconnors@rather.puzzling.org> wrote:
On Sun, 15 Jan 2012, Andrew Worsley wrote:
Any suggestions on logging more info to find the cause or deeper insights into what is going wrong would be appreciated.
More logs with these settings:
logfile /var/log/ntpd statsdir /var/log/ntpstats/ statistics loopstats peerstats clockstats filegen loopstats file loopstats type day enable filegen peerstats file peerstats type day enable filegen clockstats file clockstats type day enable
(don't forget to logrotate!)
Probably won't show anything other than time went off by a small amount.
.... Hmm got some complaints about syntax errors - turns out to be old configuration settings ... see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=637941 and this http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616684 and also default flag on restrict appears appears to cause complaints. See if it makes any difference...
warrane.connect.com.au (192.189.54.33) got dodgy hardware? (do a check_ntp from your host against another ntp server perhaps, then compare against warrane.connect.com.au). ntptrace show anything?
All ntp servers looked okay when I checked currently: config(3)# check_ntp -H warrane.connect.com.au NTP OK: Offset 0.4516801834 secs|offset=0.451680s;60.000000;120.000000; config(0)# check_ntp -H yarrina.connect.com.au NTP OK: Offset 0.5105582476 secs|offset=0.510558s;60.000000;120.000000; config(0)# check_ntp -H ns0.eftel.com NTP OK: Offset 0.5307137966 secs|offset=0.530714s;60.000000;120.000000; config(0)# ntptrace localhost: stratum 3, offset 0.122603, synch distance 0.038587 yarrina.connect.com.au: stratum 2, offset -0.000213, synch distance 0.033050 130.95.179.80: timed out, nothing received ***Request timed out
Writing a DVD just before? DMA accidentally turned itself off?
-- Tim Connors
See if the statistics show up anything - need to dig out the documentation to understand it. Thanks Andrew

On 16 January 2012 15:39, Andrew Worsley <amworsley@gmail.com> wrote:
On 16 January 2012 12:31, Tim Connors <tconnors@rather.puzzling.org> wrote:
On Sun, 15 Jan 2012, Andrew Worsley wrote:
.... See if the statistics show up anything - need to dig out the documentation to understand it.
ok - quick note on documentation info sudo apt-get install ntp-doc file:///usr/share/doc/ntp-doc/html/debug.html file:///usr/share/doc/ntp-doc/html/accopt.html#restrict

My problem appears to be solved. It's been nearly 24 hours and ntp is latched very well - 2-3ms offset all day! This is a brief summary of things in the hope of helping any one else who has trouble. I am not sure if my solution was exactly ideal : I noticed that my frequency offset was exactly 500 - which is mentioned as an upper limit for the ntp to fix the clock drift. It just said it didn't perform well with such large offsets not that it wouldn't work at all. I read this page: http://lists.debian.org/debian-user/2003/01/msg05870.html and stopped ntp, removed the drift file and tried to calculate the adjustments to correct the clock drift and fix up it up by using the calculator on this page: http://www.ep.ph.bham.ac.uk/general/support/adjtimex.html I stopped ntp from synchronising with clocks by using the noselect flag on the ntp server statements (as mentioned in it's debugging notes on large drift values) and then run adjtimex. I did that several times first getting a large +ve drifts and then next -ve drifts. Each time I zero-ed the clock by running ntpdate and started again. I think actually adjtimex may take a while to cause an effect so I am not sure if I am waiting long enough or undoing the effect of the previous one. After doing that a few times, I restarted ntp normally, having set the date with ntpdate and it set the frequency to -467 (previously it was +500) and then it has slowly settled down to -453 and the offset is around 2-3ms. Really good. It might have worked even simpler if I just followed the instructions, stopped ntpd, removed the drift file, and ran ntpdate every 10mins ntpdate -s -b ntpserver which will set the time instantly (rather than drift it and log it to syslog). Then use the size of the adjustments to calculate parameters for adjtimex using the above link. Then run ntpd Another few notes for other people: You can enable more logging to syslog via logconfig +allsync +allclock +allpeer Still doesn't log that much and as it is not losing sync I don't know if it would be helpful. /usr/share/doc/ntp-doc/html/monopt.html - using the debian documentation package ntp-doc explains the stats when you enable as mentioned in previous e-mail. Also the above link mentions some interesting issues about the clock source - it found the hpet clock source was 10x better than tsc. e.g. - cat /sys/devices/system/clocksource/clocksource0/available_clocksource echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource I haven't tried that because I am very happy with things currently. Hopefully this is useful to some one else having troubles with ntp which is really quite complex under the hood. Andrew

On 15 January 2012 16:09, Andrew Worsley <amworsley@gmail.com> wrote:
I am having periodic ntp synchronisation problems.
Apologies for digging up such an old thread, and thanks for reporting your eventual success to the list. This caught my interest whilst trawling the archive, and is unrelated to the recent leap second, but I've just got a few questions regarding the root-cause... <...>
Here's the output of commands when things are bad:
config(0)# check_ntp_peer -H 127.0.0.1 -w 1.0 -c 2.0 NTP WARNING: Server has the LI_ALARM bit set, Offset 0.210925 secs|offset=0.210925s;1.000000;2.000000;
LI_ALARM apparently means not in sync ???
This is actually the "leap indicator", also verified by the ntpq output below, (via "leap_alarm" and "leap=11"), designed to notify clients of an impending leap second; consequently most NTP clients will exclude that reference as a potential time source. I'm not even sure why this was set? As the previous leap second was on Dec 31 2008
config(1)# ntpq -c rl associd=0 status=c618 leap_alarm, sync_ntp, 1 event, no_sys_peer, version="ntpd 4.2.6p2@1.2194-o Sun Oct 17 13:35:13 UTC 2010 (1)", processor="x86_64", system="Linux/2.6.32-5-amd64", leap=11, stratum=3, precision=-23, rootdelay=95.696, rootdisp=263.117, refid=192.189.54.33, reftime=d2bccf2f.4854b34f Sun, Jan 15 2012 15:06:07.282, clock=d2bcd1d6.e9ffea4c Sun, Jan 15 2012 15:17:26.914, peer=16519, tc=10, mintc=3, offset=0.000, frequency=500.000, sys_jitter=35.804, clk_jitter=0.000, clk_wander=91.828
As for the frequency error greater than 500PPM (~ 43.2 sec / day), that's incredibly bad quality clock! (It also wanders pretty badly too!)
It thinks it's in error by 16s???
config(0)# ntpdc -c kerninfo
<...>
ntptime gives the same info
config(0)# ntptime ntp_gettime() returns code 5 (ERROR)
<...> No, it's just not synchronised!
Then mysertiously everything is okay:
config(0)# ntpdc -c kerninfo pll offset: 0.00998 s pll frequency: 500.000 ppm maximum error: 1.6291 s estimated error: 0.004771 s status: 0001 pll pll time constant: 10 precision: 1e-06 s frequency tolerance: 500 ppm
My leap becomes none (no leap_alarm) and things are ok?
Things are not OK, seeing as "no_sys_peer" flag is still set, you've still not synchronised... (albeit you seeing an offset at least)
config(0)# ntpq -c rl associd=0 status=0618 leap_none, sync_ntp, 1 event, no_sys_peer, version="ntpd 4.2.6p2@1.2194-o Sun Oct 17 13:35:13 UTC 2010 (1)", processor="x86_64", system="Linux/2.6.32-5-amd64", leap=00, stratum=3, precision=-23, rootdelay=95.272, rootdisp=983.007, refid=192.189.54.33, reftime=d2bcd33b.bbc10580 Sun, Jan 15 2012 15:23:23.733, clock=d2bcd852.ec6be1b9 Sun, Jan 15 2012 15:45:06.923, peer=16519, tc=10, mintc=3, offset=13.497, frequency=500.000, sys_jitter=7.251, clk_jitter=4.772, clk_wander=151.809
Aside from the bogus leap indicator, I'm curious as to what; a) hardware you're using (i.e. cat /proc/cpuinfo), as a modern chipset's TSC should be immune from CPU throttling b) kernel parameters you're passing at boot (if any) in particular any of; clock/clocksource/notsc, noapic/noalpci/acpi c) pool of ntp servers you have configured, and what options in particular, any of; burst, iburst, minpoll, maxpoll? On 18 January 2012 22:24, Andrew Worsley <amworsley@gmail.com> wrote:
My problem appears to be solved. It's been nearly 24 hours and ntp is latched very well - 2-3ms offset all day! <...>
Are you sure you're actually synchronised? Double-check the output of "ntpq -p"
I think actually adjtimex may take a while to cause an effect so I am not sure if I am waiting long enough or undoing the effect of the previous one.
Yup, for instance adjtime(2) on standard Linux will _slew_ 0.5ms per second <...>
It might have worked even simpler if I just followed the instructions, stopped ntpd, removed the drift file, and ran ntpdate every 10mins ntpdate -s -b ntpserver which will set the time instantly
Yup, will use settimeofday(2) to _step_ the clock instead <...>
Also the above link mentions some interesting issues about the clock source - it found the hpet clock source was 10x better than tsc. e.g. - cat /sys/devices/system/clocksource/clocksource0/available_clocksource echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
I haven't tried that because I am very happy with things currently.
<...> I would recommend using the most accurate system clock available to you, and if that fails, perhaps increasing the ntp poll interval... otherwise regulating via adjtimex seems like an unnecessary kludge (pretty sure I've seen this discouraged somewhere too, probably NTP doco) -- Joel Shea <jwshea@gmail.com>
participants (4)
-
Andrew Worsley
-
James Harper
-
Joel W Shea
-
Tim Connors