[wellylug] High load averages but no apparent cause

David Harrison david.harrison at stress-free.co.nz
Sun Mar 28 20:33:19 NZDT 2010


Just an update: the replacement server arrived in Auckland on Friday, and all
data and services have been migrated.
The migration went without a hitch because the RAID5 lockup appears to occur
only during write operations; 500 GB of data was pulled from the device
without a single lock-up.


The migration did reveal something about Ubuntu's default RAID5
configuration: it is very poorly tuned :-)

For anyone using Ubuntu and RAID5, I recommend you check out the following
two pages and set your stripe_cache_size accordingly:

http://randomitblog.blogspot.com/2009/10/ubuntu-raid-tweak.html
http://peterkieser.com/2009/11/29/raid-mdraid-stripe_cache_size-vs-write-transfer/

By changing the stripe_cache_size parameter to 16meg, I saw a 10x improvement
in write performance.
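
As a rough sketch of what that setting costs in memory: the kernel's md
documentation gives the stripe cache's RAM use as page size x number of
member disks x stripe_cache_size, where the value is a count of cache entries
rather than bytes. The Python below assumes a three-disk array (sda, sdb,
sdc), 4 KiB pages, an array named md0, and a value of 16384. These are
illustrative assumptions, not details confirmed above, and the helper names
are made up for the example.

```python
from pathlib import Path


def stripe_cache_memory_bytes(entries, nr_disks, page_size=4096):
    """RAM used by the md stripe cache: page_size * nr_disks * entries."""
    return page_size * nr_disks * entries


def stripe_cache_path(md_name, sysfs_root="/sys/block"):
    """Location of the tunable for one md array, e.g. md0 (assumed name)."""
    return Path(sysfs_root) / md_name / "md" / "stripe_cache_size"


def read_stripe_cache_size(md_name, sysfs_root="/sys/block"):
    """Return the current number of stripe cache entries."""
    return int(stripe_cache_path(md_name, sysfs_root).read_text().strip())


def write_stripe_cache_size(md_name, entries, sysfs_root="/sys/block"):
    """Set the number of entries; needs root and does not survive a reboot."""
    stripe_cache_path(md_name, sysfs_root).write_text(f"{entries}\n")


if __name__ == "__main__":
    entries = 16384  # assumed value, counted in entries (not bytes)
    nr_disks = 3     # assumed from the sda/sdb/sdc members mentioned
    mib = stripe_cache_memory_bytes(entries, nr_disks) / (1024 * 1024)
    print(f"{entries} entries on {nr_disks} disks uses about {mib:.0f} MiB")
    # On a real system, as root:
    # write_stripe_cache_size("md0", entries)
```

Checking the memory figure before raising the value is worthwhile on a box
with little RAM, since the cache grows with both the entry count and the
number of disks in the array.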


David



On Thu, Mar 25, 2010 at 10:50 AM, David Harrison <
david.harrison at stress-free.co.nz> wrote:

> Yes that could be a very real possibility.
>
> A replacement server is being shipped up tomorrow, so by mid-next week it
> should be back here in Wellington where it can be better examined...
>
>
> On Thu, Mar 25, 2010 at 10:42 AM, Daniel Reurich <daniel at centurion.net.nz> wrote:
>
>> My guess is a power supply that is no longer coping (under-specced) or
>> popped mainboard capacitors.
>>
>>
>>
>> On Thu, 2010-03-25 at 07:51 +1300, David Harrison wrote:
>> > No, but now that you mention it: if the system is unable to write to the
>> > RAID5 that contains the log file, would anything even be logged?
>> >
>> >
>> > e.g. /var is the problematic RAID5 partition, and when it locks up it
>> > takes out one or more of the physical disks.
>> >
>> >
>> > An interesting observation is that when the problem occurs, it either
>> > locks up both sda & sdb, or sdc by itself.
>> > I am guessing that this is because sda & sdb are on the same channel,
>> > so either the channel itself is failing or one of the disks is, and it
>> > is taking the other down with it.
>> >
>> >
>> >
>> > David
>> >
>> >
>> >
>> >
>> > On Thu, Mar 25, 2010 at 12:14 AM, Daniel Reurich
>> > <daniel at centurion.net.nz> wrote:
>> >         Does anything show up in the syslog or dmesg that indicates SATA I/O
>> >         port resets or anything like that?
>> >
>> >         Daniel Reurich
>> >
>> >
>> >         On Wed, 2010-03-24 at 20:53 +1300, David Harrison wrote:
>> >         > On Wed, Mar 24, 2010 at 6:36 PM, Daniel Pittman <daniel at rimspace.net> wrote:
>> >         >         David Harrison <david.harrison at stress-free.co.nz> writes:
>> >         >
>> >         >
>> >         >         > I will try the deadline scheduler tonight and see if that makes a
>> >         >         > difference.
>> >         >
>> >         >
>> >         >         You should be able to make the change at run-time, through sysfs,
>> >         >         I believe. It is a property of the hardware devices, IIRC, in sysfs.
>> >         >
>> >         >
>> >         >
>> >         >
>> >         > I tried out a few of the schedulers and none of them helped the
>> >         > problem.
>> >         > If anything I'd have to say it got worse.
>> >         >
>> >         >
>> >         > As a final test I have switched to the kernel that was originally
>> >         > installed by Ubuntu (2.6.24-24-server).
>> >         > The problem still exists, and I know for sure it didn't when things
>> >         > were first set up.
>> >         > - There's just no way we could have migrated 400 GB of data onto the
>> >         > RAID if it was this flaky.
>> >         >
>> >         >
>> >         > Whatever it is, it is hardware related, and it seems to be getting
>> >         > worse over time...
>> >         >
>> >         >
>> >         >
>> >         >
>> >         > David
>> >         >
>> >         >
>> >
>> >         > --
>> >         > Wellington Linux Users Group Mailing List: wellylug at lists.wellylug.org.nz
>> >         > To Leave: http://lists.wellylug.org.nz/mailman/listinfo/wellylug
>> >
>> >
>> >
>> >         --
>> >         Daniel Reurich.
>> >
>> >         Centurion Computer Technology (2005) Ltd
>> >         Mobile 021 797 722
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>>
>>
>
>