
[LDM #TIB-337303]: New IDD relay - ldm is hanging after some time



Hi Pete,

re:
> We just got a new machine to replace idd.aos.wisc.edu. The machine has 2
> 8-core opteron processors, 32 Gb of RAM, and two 300 Gb SAS disks. I
> have it running Scientific Linux (redhat recompile) version 6.

Do you know which Fedora release your version of Scientific Linux
corresponds to?  (This question will make more sense as you read
further.)

re:
> The ldm queue is on a software raid 0 spanning both disks.
> 
> The ldm built fine, and starts fine, but after ingesting data for some
> time (usually 45 min to an hour) one ldmd process will peg to 100%, and
> data stops being ingested and relayed.  I can use ldmadmin stop to stop
> the ldm, and it seems to stop and restart properly, but I don't see
> anything in the logs that indicate what went wrong.
> 
> Any ideas?

Quite some time ago, I experimented with hardware and software RAIDs on
Fedora Core 1 and 3 systems that I was putting together in my office.
I found that LDM performance was very poor when the queue was on the
RAID: so poor, in fact, that the latencies for all feeds would slowly
climb until they reached 3600 seconds.  This behavior occurred in more
or less the same way regardless of the RAID configuration (i.e.,
regardless of it being hardware or software RAID).  I found that LDM
performance improved dramatically (an understatement) when the LDM
queue was moved to a non-RAID file system.

The question that comes to my mind is what would happen if you moved your
LDM queue to a non-RAID file system?

Here are the tests I would try:

- first, try reducing the size of the queue to something like 2 GB

  Does the severe reduction in size eliminate the problem?

- if the problem persists, I would try putting the queue on a non-RAID
  file system

  Does the problem disappear?
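
As a rough aid for the second test, here is a minimal Python sketch (not
part of the LDM) for checking whether a given queue location sits on a
Linux software RAID.  It assumes the mount table is in /proc/mounts
format and that software RAID devices are named /dev/mdN; the function
names are hypothetical.  It will not recognize hardware RAID, or a
software RAID hidden underneath LVM.

```python
import os


def backing_device(path, mounts_text):
    """Return (device, mount_point) for the file system holding `path`.

    `mounts_text` is text in /proc/mounts format; the longest matching
    mount point wins.  Returns None if nothing matches.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        device, mount_point = fields[0], fields[1]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[1]):
                best = (device, mount_point)
    return best


def is_md_device(device):
    """True for names like /dev/md0 (assumed software RAID naming)."""
    name = os.path.basename(device)
    return name.startswith("md") and name[2:].isdigit()


def queue_on_software_raid(queue_path, mounts_text):
    """True if the queue's file system is mounted directly from /dev/mdN."""
    found = backing_device(queue_path, mounts_text)
    return found is not None and is_md_device(found[0])
```

On the machine itself, one could pass in the contents of /proc/mounts
along with whatever path the LDM queue file actually lives at, e.g.,
queue_on_software_raid(queue_path, open("/proc/mounts").read()).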

I must add, however, that we are running our idd.unidata.ucar.edu cluster
nodes with large (e.g., 12 or 20 GB) LDM queues on RAID-located file systems.
The difference might be that we are running reasonably current versions
(Fedora 12) of Linux on those nodes (hence the question about what Fedora
release your Scientific Linux distribution corresponds to).

re:
> I can get you an account on the machine if you want to poke
> around on it.

OK, thanks.  Just so you know, the problem could potentially come down to
redoing the file system(s), and this will not be doable remotely.

Cheers,

Tom
--
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: TIB-337303
Department: Support LDM
Priority: Normal
Status: Closed