
20040102: RAID on Linux



>From:  Gerry Creager N5JXS <address@hidden>
>Organization:  Texas A&M University -- AATLT
>Keywords:  200401021656.i02GuEp2026352

Hi Gerry,

Sorry I couldn't get to your note this past weekend, but I was
consumed by activities related to changing of the Unidata-Wisconsin
datastream.

>Tom, this is a follow-up to your inquiry about RAID and product queues.

OK.

>I've started seeing journal corruption and kernel panics on bigbird over 
>the last month, predating the ramp-up with CRAFT and CONDUIT.

Hmm...  Do you think that this has to do with the less than optimum
RAID support in Linux?

>Yesterday 
>I moved the product queue from the RAID partition to a system partition 
>to see if going to a different controller and disk will resolve anything.

I did the following on the system I am putting together:

- abandoned the TX2000 hardware RAID card and put in a TX2 UDMA-133
  IDE interface
- upgraded from RH 9 to Fedora Core 1; told the OS to create a software
  RAID using the two Samsung 160 GB HDs
- ran the system with the LDM queue on the software RAID while
  ingesting all IDD data.  The latencies dropped to essentially
  zero
- cranked up McIDAS decoding; the latencies remained at zero
- cranked up GEMPAK decoding; the latencies started to climb,
  so I moved the LDM queue back to the system disk.  From that
  point the latencies have stayed at/near zero
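
For anyone wanting to confirm that a Linux software RAID like the one
described above is healthy before putting a queue on it, here is a minimal
sketch in Python. It assumes the array is managed by the kernel md driver,
which reports its state in /proc/mdstat. The function name, the sample text,
the device names, and the RAID level (raid1) are illustrative assumptions
and are not taken from the system discussed in this thread.

```python
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that report a missing member disk.

    mdstat_text is the content of /proc/mdstat as a string.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # An array entry starts with a line such as "md0 : active raid1 ..."
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line ends with e.g. "[2/2] [UU]"; "_" marks a missing disk.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Hypothetical sample in the /proc/mdstat layout (not from a real system).
SAMPLE = """Personalities : [raid1]
md0 : active raid1 hdc1[1] hda1[0]
      156288256 blocks [2/2] [UU]

md1 : active raid1 hdc2[1]
      1048512 blocks [2/1] [_U]

unused devices: <none>
"""
```

On a live system the same function could be given the text read from
/proc/mdstat. An empty list would suggest that no array is reporting a
missing member, though it says nothing about filesystem or journal health.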

>It's my intent to rebuild bigbird, regardless of the results, using 
>either reiserfs or xfs, probably next week.

I have followed the responses to your inquiry with great interest.
I know Daryl has a lot of experience with Linux, so I would probably
go with XFS.

>However, if the system 
>stays stable as currently configured, I _may_ let it run another week.

Sounds prudent.

>I installed gempak 5.6L today.  I'm seeing a lot of write failures and 
>table failures.  Chiz and I (and I think, you and I) ran thru this once 
>before, but I slept since then...  Could we arrange to get on the phone 
>and hammer thru them again?

I will talk with Chiz about this tomorrow.

>I'm home today (last blissful day of 
>vacation) at 979.695.6878 if you're working.  If not, try the cellphone 
>(below) sometime when you've an interest, and I'll make for a terminal 
>and work on it.

Well, my weekend was not a vacation.  I worked about 12 hours yesterday,
but I did it from home so I got real stressed out with my terrible
internet connection (dial-up, ugh!).

>Hope y'all had a wonderful Holiday season!

You too!

Cheers,

Tom

>From address@hidden Tue Jan  6 08:31:01 2004

Morning!

Tom Yoksas wrote:
>>From:  Gerry Creager N5JXS <address@hidden>
>>Organization:  Texas A&M University -- AATLT
>>Keywords:  200401021656.i02GuEp2026352
> 
> 
> Hi Gerry,
> 
> Sorry I couldn't get to your note this past weekend, but I was
> consumed by activities related to changing of the Unidata-Wisconsin
> datastream.

I saw the announcements.

>>Tom, this is a follow-up to your inquiry about RAID and product queues.
> 
> 
> OK.
> 
> 
>>I've started seeing journal corruption and kernel panics on bigbird over 
>>the last month, predating the ramp-up with CRAFT and CONDUIT.
> 
> 
> Hmm...  Do you think that this has to do with the less than optimum
> RAID support in Linux?

Probably, and the fact that Promise and High Point, while releasing 
drivers, are not releasing all their info so other driver writers can 
derive better drivers...

>>Yesterday 
>>I moved the product queue from the RAID partition to a system partition 
>>to see if going to a different controller and disk will resolve anything.
> 
> 
> I did the following on the system I am putting together:
> 
> - abandoned the TX2000 hardware RAID card and put in a TX2 UDMA-133
>   IDE interface
> - upgraded from RH 9 to Fedora Core 1; told the OS to create a software
>   RAID using the two Samsung 160 GB HDs
> - ran the system with the LDM queue on the software RAID while
>   ingesting all IDD data.  The latencies dropped to essentially
>   zero
> - cranked up McIDAS decoding; the latencies remained at zero
> - cranked up GEMPAK decoding; the latencies started to climb,
>   so I moved the LDM queue back to the system disk.  From that
>   point the latencies have stayed at/near zero

We are bringing up an Opteron (uP) with a TX2000.  I will test it in 
ingest mode and see what we can do there.  We loaded the 2.6.0 kernel 
and recompiled (or at least it was compiling when I went to the "next" 
meeting, yesterday).  S/W RAID is incorporated therein.

Once we have a week or 2 of datapoints, we'll re-define the RAID to S/W 
(the TX2000 will allow "regular" IDE control, so this'll simply be a 
re-build of the arrays and flushing the RAID BIOS) and rerun the 
operations.  That should give us info on what we need to do.  At that 
point, between the 2 of us, I suspect we'll be comfortable answering 
these questions to others.

All that said, however, I want to get a 3Ware controller in and work 
with it.  They're supposed to have superior Linux support...

>>It's my intent to rebuild bigbird, regardless of the results, using 
>>either reiserfs or xfs, probably next week.
> 
> 
> I have followed the responses to your inquiry with great interest.
> I know Daryl has a lot of experience with Linux, so I would probably
> go with XFS.

In going back thru the Beowulf list info about RAID, I found the 
following links, which might be useful to both of us in the future.
http://www.linux-ide.org/chipsets.html
http://www.1u-raid5.net

I was looking for a response I got privately from Stonie Cooper, who was 
adamant that reiserfs was the right answer, based on his experience. 
The Opteron system is currently reiserfs.  I'm going to have bigbird 
reformatted as xfs.  We should be able to draw a few conclusions.

>>However, if the system 
>>stays stable as currently configured, I _may_ let it run another week.
> 
> 
> Sounds prudent.
> 
> 
>>I installed gempak 5.6L today.  I'm seeing a lot of write failures and 
>>table failures.  Chiz and I (and I think, you and I) ran thru this once 
>>before, but I slept since then...  Could we arrange to get on the phone 
>>and hammer thru them again?
> 
> 
> I will talk with Chiz about this tomorrow.

Thanks!

>>I'm home today (last blissful day of 
>>vacation) at 979.695.6878 if you're working.  If not, try the cellphone 
>>(below) sometime when you've an interest, and I'll make for a terminal 
>>and work on it.
> 
> 
> Well, my weekend was not a vacation.  I worked about 12 hours yesterday,
> but I did it from home so I got real stressed out with my terrible
> internet connection (dial-up, ugh!).

Sorry to hear about both.  I finally got DSL at home, and I'm looking at 
a wireless link back to the campus (it's good to be the research network 
engineer in my "spare time"...) that'll give me 25 Mb or so to play with 
unfettered.  It'll be IPv4/IPv6 for some testing, and may incorporate 
some mesh networking as a research sidelight!

>>Hope y'all had a wonderful Holiday season!
> 
> 
> You too!

Thanks again, Gerry
-- 
Gerry Creager -- address@hidden
Texas Mesonet -- AATLT, Texas A&M University    
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578
Page: 979.228.0173
Office: 903A Eller Bldg, TAMU, College Station, TX 77843