[wellylug] ubuntu load average

Fri Jan 16 16:04:43 NZDT 2009

Sam Vilain <sam at vilain.net> writes:
> Daniel Pittman wrote:
>>> Some machines have irq issues with the linux kernel, this can manifest
>>> as a high load average, clock drift etc.
>>>
>>
>> Really?  Your description sounds very much like a screaming interrupt of
>> some sort, which is a very serious issue, but your description here
>> doesn't quite match up.
>>
>> The only citations I can find for high load average on idle or clock
>> drift don't suggest interrupt issues, but rather kernel or hardware
>> faults in other areas.
>
> Here's what's happening under the hood. Or, at least, an educational
> montage which resembles what newer systems do, based on what older
> systems that were simple enough for me to understand did. At some point
> during Linux kernel initialisation it reconfigures the 8253 timer chip
> to count

[... at HZ, which is somewhere between 100 and 1000 on Linux ...]

> When the timer counts to 0, IRQ 0 (interrupt vector 8 in the classic
> PC BIOS mapping) becomes ready on the 8259A chip. This causes the CPU
> to *immediately* drop whatever it's doing, saving its state on the
> stack and jumping into the interrupt handler. *Unless* interrupts are
> masked.

*nod*  More or less, yes, and probably close enough for practical
purposes.
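
For anyone who wants numbers to go with that, here is a minimal sketch of
how HZ relates to the value loaded into the timer. The 1193182 Hz input
clock is the standard PC figure for the 8253/8254; the function names are
made up for illustration and are not the kernel's own.

```python
# Input clock of the 8253/8254 timer on PC hardware, in Hz.
PIT_CLOCK_HZ = 1193182


def pit_reload_value(hz):
    """Counter value that makes the timer fire roughly `hz` times a second.

    Hypothetical helper: integer division, rounded to the nearest count.
    """
    return (PIT_CLOCK_HZ + hz // 2) // hz


def tick_period_ms(hz):
    """Nominal time between timer interrupts, in milliseconds."""
    return 1000.0 / hz


if __name__ == "__main__":
    for hz in (100, 250, 1000):
        print(hz, pit_reload_value(hz), tick_period_ms(hz))
```

The three HZ values give the 10ms, 4ms and 1ms tick periods mentioned
further down.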

[... interrupt increments jiffies, load makes no difference to this ...]

> So anyway the point of this is that Linux keeps time by counting these
> interrupts.

*nod*

> Accessing the CMOS RTC every time is far too slow (milliseconds to
> read, lots of IO, etc). Even with dynticks (CONFIG_NO_HZ), the timing
> of these interrupts is still relied upon for reliable timekeeping (eg,
> set the timer for now + 200ms, when it fires we know it's 200ms
> later).

*nod*

> If for some reason, interrupts remain masked for the whole 1ms, 4ms or
> 10ms between timer interrupts (depending on HZ), then the CPU doesn't
> know that one has been missed and you get a skew - always
> slowing.

Correct.  This happens with poorly written drivers, but the odds that
anyone here has hardware that nasty still in use are slim, in general.

Well, except that theoretically the binary blob video drivers could mess
things up, but there is no particular evidence that they are, at
present.
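
To make the lost-tick argument concrete, here is a toy simulation. It
assumes a deliberately simplified model: the timer line can hold only one
pending interrupt, so if several ticks expire while interrupts are masked,
only one is counted when they are unmasked. The function name and the
window list are invented for the example, and the windows are assumed not
to overlap.

```python
def counted_ticks(hz, duration_us, masked_windows_us):
    """Ticks a tick-counting clock actually sees over `duration_us`.

    `masked_windows_us` is a list of non-overlapping (start, end) times in
    microseconds during which interrupts are masked (hypothetical input).
    """
    period_us = 1_000_000 // hz
    jiffies = 0
    pending_in = None  # window that already holds one pending tick
    for t in range(period_us, duration_us + 1, period_us):
        window = next(
            (i for i, (start, end) in enumerate(masked_windows_us)
             if start <= t < end),
            None,
        )
        if window is None:
            jiffies += 1          # delivered on time
        elif window != pending_in:
            jiffies += 1          # first tick in the window, delivered late
            pending_in = window
        # any further tick in the same window is lost
    return jiffies


if __name__ == "__main__":
    one_second = 1_000_000
    stall = [(100_000, 103_500)]  # interrupts masked for 3.5 ms
    for hz in (100, 1000):
        seen = counted_ticks(hz, one_second, stall)
        print(hz, seen, seen / hz)
```

In this model a 3.5 ms stall costs nothing at HZ=100, where it is shorter
than one tick, but loses ticks at HZ=1000, and the clock can only ever
fall behind, never run ahead.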

> I've seen this happen when the console is set to a serial port, and
> something is logging a lot of information from the kernel.

I understood that had been significantly improved these days, but
I could be wrong.  Certainly, many years ago in the days of the dinosaur
and the dial-up modem you got benefits from unmasking IRQs during serial
I/O if your hardware supported it.

[...]

The one other common cause of this sort of nasty situation is SMM (System
Management Mode), entered via an SMI, in which the firmware can take
control of the CPU and do anything it wants.

This is often triggered during early processing by the "Legacy USB"
support setting, and it also turns up around audio hardware and sensors,
making life miserable for everyone involved.

> But I don't think that this is what is happening here; it sounds more
> like a stuck process or thread. It'll be in the D state
> (uninterruptible sleep), which Linux counts toward the load average
> just as it counts runnable tasks. But because it is never scheduled
> it doesn't actually use any CPU time.

Yeah, that is what it sounds like to me.  Notably, the problem described
on KernelTrap where load average shot up due to select not sleeping
*does* report CPU time used, because the application is spinning.
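
As a rough illustration of why a single stuck task shows up as load
without using any CPU, here is a floating-point sketch of a one-minute
load average. It assumes the load is sampled every 5 seconds and decayed
exponentially, and that tasks in state R or D are the ones counted; the
kernel does the equivalent in fixed point, and the names below are
invented for the example.

```python
import math


def load_contributors(states):
    """Count tasks in state R (runnable) or D (uninterruptible sleep)."""
    return sum(1 for s in states if s in ("R", "D"))


def update_load(load, active, interval_s=5.0, window_s=60.0):
    """One exponentially-damped update of the load average."""
    decay = math.exp(-interval_s / window_s)
    return load * decay + active * (1.0 - decay)


def simulate(states, seconds, interval_s=5.0):
    """Run the average from zero for `seconds` with a fixed set of tasks."""
    load = 0.0
    for _ in range(int(seconds / interval_s)):
        load = update_load(load, load_contributors(states), interval_s)
    return load


if __name__ == "__main__":
    # An otherwise idle box: everything asleep except one task stuck in D.
    print(simulate(["S", "S", "D"], 600))
    # The same box with nothing stuck.
    print(simulate(["S", "S", "S"], 600))
```

With one task parked in D the average creeps up towards 1 and stays
there, even though that task never runs.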

What I wondered about was the assertion that this could be connected to
IRQ activity, and especially "extra" IRQ activity, in any way.

None of what you describe really can be, even though these are all real
problems.

Regards,
        Daniel