[wellylug] ubuntu load average

Sam Vilain sam at vilain.net
Fri Jan 16 15:15:53 NZDT 2009


Daniel Pittman wrote:
>> Some machines have irq issues with the linux kernel, this can manifest
>> as a high load average, clock drift etc.
>>     
>
> Really?  Your description sounds very much like a screaming interrupt of
> some sort, which is a very serious issue, but your description here
> doesn't quite match up.
>
> The only citations I can find for high load average on idle or clock
> drift don't suggest interrupt issues, but rather kernel or hardware
> faults in other areas.

Here's what's happening under the hood. Or, at least, an educational
montage which resembles what newer systems do, based on what the systems
simple enough for me to understand did. At some point during Linux
kernel initialisation it reprograms the 8253 timer chip, which counts
down at 1,193,180 Hz, from its maximum reload value of 65,536 (which
yields 18.2 interrupts per second) to something like 11932 (yielding
~100 ticks per second), 4773 (yielding ~250) or 1193 (yielding ~1000).
When the timer counts down to 0, IRQ 0 is raised on the 8259A chip,
which is traditionally delivered to the CPU as interrupt 8. This causes
the CPU to *immediately* drop whatever it's doing, saving its state on
the stack and jumping into the interrupt handler. *Unless* interrupts
are masked. Masking was originally something that user code could set,
but since the 386, AIUI, you need to be in ring 0 to really mask it off.
i.e. if you're not the kernel, then you can't stop it from running.
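
To put rough numbers on that, here's a tiny userspace sketch (not
kernel code - the real kernel writes the reload value to I/O ports
0x43/0x40; this only does the arithmetic, using the 1,193,180 Hz base
and the reload values mentioned above):

    /* pit_divisor.c - the 8253 reload-value arithmetic, nothing more. */
    #include <stdio.h>

    #define PIT_HZ 1193180UL        /* input clock of the 8253, in Hz */

    int main(void)
    {
        unsigned long reload[] = { 65536, 11932, 4773, 1193 };
        for (int i = 0; i < 4; i++) {
            double ticks = (double)PIT_HZ / reload[i];
            printf("reload %6lu -> %7.1f interrupts/sec (%.2f ms apart)\n",
                   reload[i], ticks, 1000.0 / ticks);
        }
        return 0;
    }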

Because the interrupt handler is very short, it just runs straight away
and then returns to whatever task was running (unless the scheduler
decided that the running process' time was up). Its principal job is to
increment a counter called "jiffies". It doesn't matter what the load
(number of runnable processes) is - it could be 10,000 and interrupts
would still be delivered.
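
As a loose userspace analogy (this is not the kernel's handler, just a
sketch of "a periodic interrupt whose only job is to bump a counter",
assuming HZ=100):

    /* jiffy_toy.c - a 100 Hz periodic signal whose handler does nothing
     * but increment a counter, much like jiffies. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    static volatile unsigned long jiffies;

    static void tick(int sig)
    {
        (void)sig;
        jiffies++;                     /* the handler's whole job */
    }

    int main(void)
    {
        struct sigaction sa = { 0 };
        sa.sa_handler = tick;
        sigaction(SIGALRM, &sa, NULL);

        /* 10 ms period = 100 "ticks" per second */
        struct itimerval itv = { { 0, 10000 }, { 0, 10000 } };
        setitimer(ITIMER_REAL, &itv, NULL);

        for (;;) {
            pause();                   /* wait for the next tick */
            if (jiffies % 100 == 0)
                printf("jiffies = %lu\n", jiffies);
        }
    }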

So anyway, the point of this is that Linux keeps time by counting these
interrupts. Accessing the CMOS RTC every time is far too slow
(milliseconds to read, lots of IO, etc). Even with dynticks
(CONFIG_NO_HZ), the timing of these interrupts is still relied upon for
reliable timekeeping (eg, set the timer for now + 200ms; when it fires
we know it's 200ms later).

If for some reason interrupts remain masked for the whole 1ms, 4ms or
10ms between timer interrupts (depending on HZ), then the kernel doesn't
know that one has been missed and you get a skew - always slowing. I've
seen this happen when the console is set to a serial port and something
is logging a lot of information from the kernel, eg a firewall logging
the packets it's dropping. The reason is that the kernel considers a
kernel thread writing to the console "important" and does not allow any
other task to pre-empt writing to that serial port; otherwise, some
critical Oops message might never be displayed before the kernel went
and killed itself. Another possibility is some driver which masks
interrupts and then starts waiting for a device state to change -
something device driver writers shouldn't do.
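
To put a size on that skew: each missed tick costs 1/HZ seconds and the
loss only accumulates. A quick sketch (the "ticks lost per second"
figure is an invented example, not a measurement):

    /* tick_skew.c - how much the clock slows if ticks are missed. */
    #include <stdio.h>

    int main(void)
    {
        int hz = 250;                 /* kernel tick rate */
        double lost_per_sec = 5.0;    /* made-up: ticks swallowed while IRQs masked */

        double slow_fraction = lost_per_sec / hz;     /* each tick is 1/HZ s */
        double drift_per_day = slow_fraction * 86400.0;

        printf("losing %.0f ticks/sec at HZ=%d slows the clock by %.1f%%\n",
               lost_per_sec, hz, 100.0 * slow_fraction);
        printf("that's about %.0f seconds (%.1f minutes) per day\n",
               drift_per_day, drift_per_day / 60.0);
        return 0;
    }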

But I don't think that this is what is happening here; it sounds more
like a stuck process or thread. It'll be in the D state (uninterruptible
sleep), which Linux counts towards the load average as if it were
runnable - but because it is never scheduled it doesn't actually use any
CPU time.
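
That's why a wedged D-state task pushes the number up: the load average
is, roughly, an exponentially-decaying average of running plus
uninterruptible tasks, sampled every few seconds. A floating-point
sketch of that calculation (the real kernel uses fixed-point arithmetic;
the constants here are the usual 1-minute ones - compile with -lm):

    /* loadavg_toy.c - rough model of the 1-minute load average.
     * "active" counts runnable tasks PLUS tasks in D state, which is
     * why a stuck D-state process raises the load without using CPU. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double load = 0.0;
        double decay = exp(-5.0 / 60.0);   /* sampled every 5s, 1-minute window */

        int nr_running = 0;                /* nothing actually wants the CPU... */
        int nr_uninterruptible = 1;        /* ...but one task is stuck in D */

        for (int t = 5; t <= 120; t += 5) {
            int active = nr_running + nr_uninterruptible;
            load = load * decay + active * (1.0 - decay);
            printf("after %3ds: load %.2f\n", t, load);
        }
        return 0;
    }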

Sam.


