Re: Periodic ntp loss of synchronisation problems

3 Jul 2012

On 15 January 2012 16:09, Andrew Worsley <amworsley@gmail.com> wrote:
...
I am having periodic  ntp synchronisation problems.
Apologies for digging up such an old thread, and thanks for reporting
your eventual success to the list.

This caught my interest whilst trawling the archive, and is unrelated
to the recent leap second, but I've just got a few questions regarding
the root-cause...

<...>
...
Here's the output of commands when things are bad:
config(0)# check_ntp_peer  -H 127.0.0.1  -w 1.0 -c 2.0
NTP WARNING: Server has the LI_ALARM bit set, Offset 0.210925 secs|offset=0.210925s;1.000000;2.000000;
LI_ALARM apparently means not in sync ???
This is actually the "leap indicator", also verified by the ntpq
output below (via "leap_alarm" and "leap=11"). The field is normally
used to notify clients of an impending leap second (values 01 and 10),
but the value 11 is the alarm condition, meaning the clock is
unsynchronised; consequently most NTP clients will exclude that
reference as a potential time source.

I'm not even sure why this was set? As the previous leap second was on
Dec 31 2008
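
For reference, the leap bits can be read straight out of the status
word that ntpq reports (c618 when things are bad, 0618 later on). A
minimal sketch, assuming the usual ntpq system status layout of 2 leap
bits, 6 source bits, a 4-bit event count and a 4-bit event code; the
function name is my own:

```python
# Decode the 16-bit system status word printed by "ntpq -c rl".
# Assumed layout: 2 bits leap, 6 bits source, 4 bits event count,
# 4 bits last event code (most significant bits first).
LEAP_NAMES = {
    0: "leap_none",
    1: "leap_add_sec",
    2: "leap_del_sec",
    3: "leap_alarm",  # binary 11: clock not synchronised
}


def decode_system_status(status: int) -> dict:
    """Split the status word into its four fields."""
    return {
        "leap": (status >> 14) & 0x3,
        "source": (status >> 8) & 0x3F,
        "event_count": (status >> 4) & 0xF,
        "event_code": status & 0xF,
    }


for word in (0xC618, 0x0618):
    fields = decode_system_status(word)
    print(f"{word:04x}: {LEAP_NAMES[fields['leap']]} {fields}")
```

Both words differ only in the leap field, which matches the
"leap_alarm" versus "leap_none" text in the two ntpq outputs.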
...
config(1)# ntpq -c rl
associd=0 status=c618 leap_alarm, sync_ntp, 1 event, no_sys_peer,
version="ntpd 4.2.6p2@1.2194-o Sun Oct 17 13:35:13 UTC 2010 (1)",
processor="x86_64", system="Linux/2.6.32-5-amd64", leap=11, stratum=3,
precision=-23, rootdelay=95.696, rootdisp=263.117, refid=192.189.54.33,
reftime=d2bccf2f.4854b34f  Sun, Jan 15 2012 15:06:07.282,
clock=d2bcd1d6.e9ffea4c  Sun, Jan 15 2012 15:17:26.914, peer=16519,
tc=10, mintc=3, offset=0.000, frequency=500.000, sys_jitter=35.804,
clk_jitter=0.000, clk_wander=91.828
As for the frequency error greater than 500 PPM (~ 43.2 sec / day),
that's an incredibly poor quality clock! (It wanders pretty badly
too!)
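
The 43.2 sec / day figure is just the PPM value scaled to a day. A
quick sketch of the arithmetic (the function name is made up for
illustration):

```python
SECONDS_PER_DAY = 86_400


def drift_seconds_per_day(ppm: float) -> float:
    """Seconds gained or lost per day for a given frequency error in PPM."""
    return ppm * SECONDS_PER_DAY / 1_000_000


print(drift_seconds_per_day(500.0))
```

Note that the reported frequency=500.000 equals the 500 ppm frequency
tolerance shown by kerninfo, so the reading is probably pinned at the
limit and the real error may well be larger.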
...
It thinks it's in error by 16s???
config(0)# ntpdc -c kerninfo
<...>
...
ntptime gives the same info
config(0)# ntptime
ntp_gettime() returns code 5 (ERROR)
<...>

No, it's just not synchronised!
...
Then mysteriously everything is okay:
config(0)# ntpdc -c kerninfo
pll offset:           0.00998 s
pll frequency:        500.000 ppm
maximum error:        1.6291 s
estimated error:      0.004771 s
status:               0001  pll
pll time constant:    10
precision:            1e-06 s
frequency tolerance:  500 ppm
My leap becomes none (no leap_alarm) and things are ok?
Things are not OK: seeing as the "no_sys_peer" flag is still set, you're
still not synchronised... (although you are at least seeing an offset now)
...
config(0)# ntpq -c rl
associd=0 status=0618 leap_none, sync_ntp, 1 event, no_sys_peer,
version="ntpd 4.2.6p2@1.2194-o Sun Oct 17 13:35:13 UTC 2010 (1)",
processor="x86_64", system="Linux/2.6.32-5-amd64", leap=00, stratum=3,
precision=-23, rootdelay=95.272, rootdisp=983.007, refid=192.189.54.33,
reftime=d2bcd33b.bbc10580  Sun, Jan 15 2012 15:23:23.733,
clock=d2bcd852.ec6be1b9  Sun, Jan 15 2012 15:45:06.923, peer=16519,
tc=10, mintc=3, offset=13.497, frequency=500.000, sys_jitter=7.251,
clk_jitter=4.772, clk_wander=151.809
Aside from the bogus leap indicator, I'm curious as to what:
a) hardware you're using (i.e. cat /proc/cpuinfo),
    as a modern chipset's TSC should be immune from CPU throttling

b) kernel parameters you're passing at boot (if any)
    in particular any of: clock/clocksource/notsc, noapic/nolapic/acpi

c) pool of ntp servers you have configured, and what options
    in particular, any of: burst, iburst, minpoll, maxpoll?

On 18 January 2012 22:24, Andrew Worsley <amworsley@gmail.com> wrote:
...
My problem appears to be solved. It's been nearly 24 hours and ntp is
latched very well - 2-3ms offset all day!
<...>
Are you sure you're actually synchronised? Double-check the output of "ntpq -p"
...
I think actually adjtimex may take a while to cause an
effect so I am not sure if I am waiting long enough or undoing the
effect of the previous one.
Yup, for instance adjtime(2) on standard Linux will _slew_ 0.5ms per second
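
To put that rate in perspective, here is a rough estimate of how long
a pure slew takes. This is only a sketch: it assumes a constant
0.5 ms/s correction and ignores whatever the clock drifts in the
meantime; the function name is hypothetical.

```python
SLEW_RATE = 0.0005  # seconds corrected per elapsed second (0.5 ms/s)


def slew_duration(offset_s: float, rate: float = SLEW_RATE) -> float:
    """Seconds needed to slew away an offset at a constant rate."""
    return abs(offset_s) / rate


hours = slew_duration(13.497) / 3600
print(f"{hours:.1f} h")
```

For the 13.497 s offset reported earlier that works out to roughly
seven and a half hours, and a clock that is itself drifting by
500 PPM or more in the wrong direction would never converge at all.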

<...>
...
It might have worked even simpler if I just followed the instructions,
stopped ntpd, removed the drift file, and ran ntpdate every 10mins
 ntpdate -s -b ntpserver which will set the time instantly
Yup, will use settimeofday(2) to _step_ the clock instead

<...>
...
Also the above link mentions some interesting issues about the clock
source - it found the hpet clock source was 10x better than tsc.
e.g. -
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
I haven't tried that because I am very happy with things currently.
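
A small read-only sketch for checking the same sysfs files from a
script. It assumes the Linux paths quoted above exist on the machine;
the helper name and its "base" parameter are my own invention, and it
does not change the clock source.

```python
from pathlib import Path

# Directory taken from the cat/echo commands above.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base: Path = CLOCKSOURCE_DIR) -> tuple[str, list[str]]:
    """Return (current clock source, list of available clock sources)."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


try:
    current, available = read_clocksources()
    print(f"current: {current}, available: {', '.join(available)}")
except OSError as exc:
    # Not Linux, or sysfs is not mounted where we expect it.
    print(f"could not read clock source info: {exc}")
```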
<...>

I would recommend using the most accurate system clock available to
you, and if that fails, perhaps increasing the ntp poll interval...
otherwise regulating via adjtimex seems like an unnecessary kludge
(pretty sure I've seen this discouraged somewhere too, probably NTP
doco)

-- 
Joel Shea <jwshea@gmail.com>
