[wellylug] ubuntu load average

Pete Black pete at marchingcubes.com
Fri Jan 16 15:44:08 NZDT 2009


Maybe this is the problem?

http://kerneltrap.org/Linux/High_Idle_Load_Average

-Pete

>>> Some machines have IRQ issues with the Linux kernel; this can manifest
>>> as a high load average, clock drift, etc.
>>>     
>>>       
>> Really?  That sounds very much like a screaming interrupt of some sort,
>> which is a very serious issue, but your description here doesn't quite
>> match up.
>>
>> The only citations I can find for high load average on idle or clock
>> drift don't suggest interrupt issues, but rather kernel or hardware
>> faults in other areas.
>>     
>
> Here's what's happening under the hood. Or, at least, an educational
> montage that resembles what newer systems do, based on the systems that
> were simple enough for me to understand. At some point during Linux
> kernel initialisation, the kernel reprograms the 8253 timer chip, which
> counts down at 1,193,180 Hz, from its BIOS-default reload value of 65,536
> (which yields the classic 18.2 interrupts per second) to something like
> 11932 (yielding ~100 ticks per second), 4773 (~250) or 1193 (~1000). Each
> time the timer counts down to 0, IRQ 0 is raised on the 8259A interrupt
> controller (the original PC mapped this to interrupt vector 8). This
> causes the CPU to *immediately* drop whatever it's doing, saving its
> state on the stack and jumping into the interrupt handler. *Unless*
> interrupts are masked. Masking was originally something user code could
> do, but since the 386, AIUI, you need to be in ring 0 to really mask it
> off, i.e. if you're not the kernel, you can't stop it from running.
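>
> A quick sanity check of those numbers (illustrative C only, not how the
> kernel actually programs the chip):
>
>   #include <stdio.h>
>
>   int main(void)
>   {
>       /* The 8253/8254 PIT is clocked at 1,193,180 Hz; the interrupt
>          rate is that clock divided by the reload value programmed
>          into the chip. */
>       unsigned long pit_hz = 1193180;
>       unsigned long reload[] = { 65536, 11932, 4773, 1193 };
>
>       for (int i = 0; i < 4; i++)
>           printf("reload %5lu -> %7.1f interrupts/sec\n",
>                  reload[i], (double)pit_hz / reload[i]);
>       return 0;
>   }
>
> which prints roughly 18.2, 100.0, 250.0 and 1000.2.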
>
> Because the interrupt handler is very short, it just runs straight away
> and then returns to whatever task was running (unless the scheduler
> decided that the running process' timeslice was up). Its principal job is
> to increment a counter called "jiffies". It doesn't matter what the load
> (number of runnable processes) is. It could be 10,000 and interrupts
> would still be delivered.
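>
> In rough C, the handler's whole job is about this big (a sketch only -
> the real timer interrupt also does process accounting, profiling and so
> on):
>
>   /* Incremented once per timer interrupt; this counter *is* the
>      kernel's sense of elapsed time. */
>   static volatile unsigned long jiffies;
>
>   /* Runs in interrupt context with further interrupts off, so it
>      has to be short. */
>   void timer_tick(void)
>   {
>       jiffies++;
>       /* ...then charge the tick to the current task and, if its
>          timeslice is used up, mark it for rescheduling on the way
>          out of the handler. */
>   }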
>
> So anyway the point of this is that Linux keeps time by counting these
> interrupts. Accessing the CMOS RTC every time is far too slow
> (milliseconds to read, lots of IO, etc). Even with dynticks
> (CONFIG_NO_HZ), the timing of these interrupts is still relied upon for
> reliable timekeeping (eg, set the timer for now + 200ms, when it fires
> we know it's 200ms later). If, for some reason, interrupts stay masked
> for the whole 1ms, 4ms or 10ms between timer interrupts (depending on
> HZ), then the kernel never learns that a tick has been missed and the
> clock skews - always slowing. I've seen this happen when the console is set to
> a serial port, and something is logging a lot of information from the
> kernel. eg, a firewall is logging the packets it's dropping. The reason
> is that the kernel considers a kernel thread writing to the console
> "important" and does not allow any other task to pre-empt writing to
> that serial port; otherwise, some critical Oops message might never be
> displayed before the kernel goes and kills itself. Another possibility
> is a driver that masks interrupts and then sits waiting for a
> device state to change - something device driver writers shouldn't do.
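>
> To put a number on the skew (the HZ value and masking time below are
> made up, purely to illustrate the arithmetic):
>
>   #include <stdio.h>
>
>   int main(void)
>   {
>       int hz = 250;                  /* 4 ms per tick */
>       double tick_ms = 1000.0 / hz;
>       double masked_ms = 30.0;       /* interrupts masked for 30 ms */
>
>       /* The PIT keeps firing, but only one interrupt stays pending,
>          so every tick beyond the first is simply never counted. */
>       int fired = (int)(masked_ms / tick_ms);
>       int lost = fired > 1 ? fired - 1 : 0;
>
>       printf("clock silently falls behind by about %.0f ms\n",
>              lost * tick_ms);
>       return 0;
>   }
>
> Each such episode costs the wall clock about 24 ms here, and nothing in
> the tick-counting scheme ever notices or makes it up.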
>
> But I don't think that this is what is happening here; it sounds more
> like a stuck process or thread. It'll be in the D state (uninterruptible
> sleep), which Linux counts towards the load average as if it were
> runnable. But because it is never actually scheduled, it doesn't use any
> CPU time.
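>
> The load-average bookkeeping itself is just an exponentially decaying
> average of how many tasks are runnable or in D state, sampled every few
> seconds. A floating-point sketch of the idea (the kernel uses fixed-point
> arithmetic and a 5-second sampling interval):
>
>   #include <math.h>
>   #include <stdio.h>
>
>   int main(void)
>   {
>       double load1 = 0.0;                    /* the "1 minute" figure */
>       double decay = exp(-5.0 / 60.0);       /* 5 s samples, 1 min window */
>       int running = 0, uninterruptible = 1;  /* one task stuck in D */
>
>       for (int i = 0; i < 120; i++)          /* ~10 simulated minutes */
>           load1 = load1 * decay
>                 + (running + uninterruptible) * (1.0 - decay);
>
>       printf("load average ~%.2f with the CPU completely idle\n", load1);
>       return 0;
>   }
>
> which settles at about 1.00 even though nothing is consuming any CPU -
> exactly the symptom described above.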
>
> Sam.
>
>
>   


