[wellylug] High load averages but no apparent cause

David Harrison david.harrison at stress-free.co.nz
Wed Mar 24 11:54:32 NZDT 2010


Just a follow-up: this thread describes my problem exactly:
http://centos.org/modules/newbb/viewtopic.php?viewmode=flat&order=ASC&topic_id=22554&forum=37

The short story is that even though smartctl reports no issues, there is
probably still a hardware issue.
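
(Worth noting: a clean self-test doesn't completely rule the drives out.
The raw SMART attributes are still worth a look, e.g. something along the
lines of:

    smartctl -A /dev/sda

per drive, keeping an eye on counters like Reallocated_Sector_Ct and
UDMA_CRC_Error_Count, which can climb even while the self-tests pass.)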


Below is the output from iostat on the server when the disk problem is
taking place.
The utilisation of sda and sdb is 100% even though they are hardly doing
anything.
This lockup remains for 15-20 seconds until something in the kernel/hardware
resets, and then it is happy again.
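
(For reference, these are the extended per-device stats, i.e. output along
the lines of:

    iostat -x 4

with the interesting columns being await, svctm and %util at the right-hand
end.)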

*Output of iostat while problem is taking place:*

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.38    0.00    0.00   99.62    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz    await    svctm   %util
sda               0.00     1.75    0.25    0.50     2.00    14.00     21.33     10.73  4300.00  1333.33  100.00
sdb               0.00     1.75    0.50    0.25     4.00    28.00     42.67     10.10  3340.00  1333.33  100.00
sdc               1.50     0.00    0.25    0.50    14.00     4.00     24.00      0.01    10.00    10.00    0.75
md0               0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00     0.00     0.00    0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00     0.00     0.00    0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00     0.00     0.00    0.00
md3               0.00     0.00    0.25    2.75     2.00    22.00      8.00      0.00     0.00     0.00    0.00


*Output of iostat when problem is not taking place:*

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.62    0.00    4.62    6.50    0.00   88.25

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz    await    svctm   %util
sda               4.50    14.25    3.75   13.00    66.00   234.00     17.91      0.08     4.93     4.63    7.75
sdb               6.50    12.25    5.25   12.00    94.00   210.00     17.62      0.12     7.25     6.09   10.50
sdc               7.25    11.00    4.75   12.25    96.00   202.00     17.53      0.10     5.88     4.85    8.25
md0               0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00     0.00     0.00    0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00     0.00     0.00    0.00
md2               0.00     0.00    0.00    1.50     0.00    12.00      8.00      0.00     0.00     0.00    0.00
md3               0.00     0.00    1.75   33.00    14.00   264.00      8.00      0.00     0.00     0.00    0.00


This suggests that either there's a problem with sda and sdb, or there's an
issue with the SATA controller which is leaving both hanging. Annoying
either way...
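
If it is the controller resetting, I'd expect traces of that in the kernel
log around the time of each stall; something like the following should catch
any libata errors or link resets (the grep pattern is just a guess):

    dmesg | grep -iE 'ata[0-9]|reset|error'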


David



On Wed, Mar 24, 2010 at 8:27 AM, David Harrison <
david.harrison at stress-free.co.nz> wrote:

> Hi,
> I ran the smartctl tests (both short and long) on all three physical drives
> overnight.
> It showed all drives were working 100% correctly.
>
> Overnight I also ran a number of read/write tests and monitored the i/o
> status in vmstat and iostat.
>
> It seems like performance falls through the floor as soon as the physical
> memory on the server is exhausted.
>
> The issue I am experiencing seems to be very similar to the issue which is
> documented here:
> http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html
>
> I've checked the kernel parameters mentioned in this article
> (dirty_ratio and dirty_background_ratio) and they are already set to the
> recommended values.
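>
> (For anyone following along, these can be checked in one go with
> something like:
>
>     sysctl vm.dirty_ratio vm.dirty_background_ratio
>
> or by reading the matching files under /proc/sys/vm/.)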
>
> Putting more RAM in the machine will certainly forestall the issue, but
> beyond that it may be a case of trying RAID1 instead of RAID5.
>
>
> David
>
>
>
> On Tue, Mar 23, 2010 at 9:42 AM, David Harrison <
> david.harrison at stress-free.co.nz> wrote:
>
>> Cheers Daniel.
>> Will do tonight out of hours.
>>
>> I hope it isn't one of the drives; they are brand new, and the server's in
>> Auckland whilst I am here in Wellington...
>>
>>
>> David
>>
>>
>> On Tue, Mar 23, 2010 at 9:20 AM, Daniel Reurich <daniel at centurion.net.nz> wrote:
>>
>>> On Tue, 2010-03-23 at 08:50 +1300, David Harrison wrote:
>>> > Thanks Daniel, that switch for vmstat is very handy, and I'd completely
>>> > missed the I/O wait value in top.
>>> > Googling based on your comments also brought up this page which is
>>> > very handy:
>>> > http://strugglers.net/wiki/Linux_performance_tuning
>>> >
>>> >
>>> > The problem is certainly looking like an intermittent I/O issue.
>>> >
>>> Probably caused by a disk issue - seriously.  I have seen this before.
>>> >
>>> > Has anyone experience with the performance boost of a dedicated PCI-X
>>> > SATA controller for software RAID?
>>>
>>> >
>>> > The server in question is a bog-standard HP ML110.
>>> > It isn't up to their needs, but it was a recent purchase by the
>>> > previous IT guys, so I'm afraid it is staying.
>>> >
>>> I don't think it should be an issue.
>>> >
>>> > For practical reasons I want to keep the software RAID-5 (3x1TB
>>> > drives), but would putting some (or all) of these disks onto a
>>> > dedicated controller alleviate the I/O issue?
>>> >
>>> I'd say it would provide minimal to no gain given your load stats if you
>>> are using software RAID, and only a small gain if you get hardware RAID.
>>> >
>>> > i.e. Is it worth recommending a $400 PCI-X SATA controller for the
>>> > box, or is that money better left on the table for a new server
>>> > (ML310/330) in twelve months time?
>>> > (My concern being that the card goes in and the problem remains the
>>> > same.)
>>>
>>> Save the money unless you can be sure it's the built-in controller (I
>>> doubt it is).
>>>
>>> I think your next port of call is to do a SMART check on your drives.
>>> Install smartmontools and do a smartctl -son /dev/sdX for each drive,
>>> and report that back if you don't understand the results.
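>>>
>>> (Roughly, assuming smartmontools is already installed, that amounts to
>>> something along the lines of:
>>>
>>>     smartctl -s on /dev/sda
>>>     smartctl -t long /dev/sda
>>>     smartctl -a /dev/sda
>>>
>>> per drive; the long self-test takes a while to run, and -a afterwards
>>> dumps the attribute table and self-test log.)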
>>>
>>> Regards,
>>>        Daniel.
>>>
>>>
>>>
>>>
>>>
>>
>>
>