
Re: 20010822: Unable to maintain connects from wsi (fwd)



>To: <address@hidden>
>From: "Arthur A. Person" <address@hidden>
>Subject: Unable to maintain connects from wsi
>Organization: UCAR/Unidata
>Keywords: 200108221513.f7MFDR125002

Art,

> ... I took your suggestion and remade the ldm queue
> and that fixed the connection problem to wsi.  This appears to me to be a
> bug somewhere... in the ldm queue management?  Or perhaps RedHat?  It
> appears that something caused the queue to become corrupt in some fashion
> during normal operation of the ldm such that the ldm didn't notice much
> and didn't prevent most of its operation.  However, wsi would never
> connect and I'm waiting to see if our NMC2 reception improves as that's
> been flaky as well.  Any ideas on what would cause the queue to corrupt?
> I'm concerned this may happen again. I still have the old queue if someone
> wants to look at it.

The ldm queue management library is filled with assertion checks that
are intended to catch queue corruption or queue data structure
inconsistency at the beginning of every operation on the queue and
often at the conclusion of a queue operation as well.  If the queue
was corrupted somehow, it seems more likely that one of the many
(fatal) assertion violation messages would appear in the log files
just before the ldm process exited, rather than the problem causing a
slowdown.  At least that was my experience during testing and
debugging of the pq library.

But if you still have the old queue available, could you possibly do
me a favor by sending me the output of a couple of additional checks
for queue corruption?

Assuming the old product queue is in a file named "old.pq", the first
test is getting the output from pqmon:

 $ pqmon -q old.pq
 Aug 24 17:32:08 pqmon: Starting Up (13705)
 Aug 24 17:32:08 pqmon: nprods nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext  age
 Aug 24 17:32:08 pqmon:   3314     1   21099    49698472      3314        1     21099  50309464 20560600
 Aug 24 17:32:08 pqmon: Exiting

As above, expect 4 lines of output, the 2nd and 3rd of which are long.
Please send these; they're explained in the pqmon man page, if you're
interested.
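
If the long lines are hard to read, something like the following sketch
pairs each pqmon column name with its value.  The function name
pqmon_fields is made up for illustration; it is not part of the LDM.
It assumes the header and value lines each arrive unwrapped on a single
line, in the layout of the sample output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the LDM): read pqmon log output on
# stdin and print one "name=value" line per column.
# Assumes the column-name line and the value line are each unwrapped.
pqmon_fields() {
    awk '
        # Locate the "pqmon:" token; the column data starts right after it.
        {
            start = 0
            for (i = 1; i <= NF; i++)
                if ($i == "pqmon:") { start = i + 1; break }
        }
        start == 0 { next }
        # Header line: remember the column names.
        $start == "nprods" {
            n = 0
            for (i = start; i <= NF; i++) name[++n] = $i
            next
        }
        # First numeric line after the header: print name=value pairs.
        n > 0 && $start ~ /^[0-9]+$/ {
            for (i = 1; i <= n; i++) print name[i] "=" $(start + i - 1)
            n = 0
        }
    '
}

# Demonstration on the sample output from this message.
pqmon_fields <<'EOF'
 Aug 24 17:32:08 pqmon: Starting Up (13705)
 Aug 24 17:32:08 pqmon: nprods nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext  age
 Aug 24 17:32:08 pqmon:   3314     1   21099    49698472      3314        1     21099  50309464 20560600
 Aug 24 17:32:08 pqmon: Exiting
EOF
```

The raw pqmon lines are still what I'd like to see; this is only a
reading aid.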

The second thing I'd like to see is pqcat's idea of how many products
are in the queue, and how long it takes to go through all these.  You
get this by running something like

 $ pqcat -q old.pq > /dev/null
 Aug 24 18:43:39 pqcat: Starting Up (13752)
 Aug 24 18:43:40 pqcat: Exiting
 Aug 24 18:43:40 pqcat: Number of products 3314
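
The two numbers of interest here (product count and elapsed time) can
be pulled out of saved pqcat output with a sketch along these lines.
The function name pqcat_summary is hypothetical, and the script assumes
the log layout shown above, with the time of day in the third field,
and a run that does not cross midnight.

```shell
#!/bin/sh
# Hypothetical helper (not part of the LDM): read pqcat log output on
# stdin, print the reported product count and the seconds elapsed
# between the "Starting Up" and "Exiting" lines.
# Assumes HH:MM:SS is the third field and the run stays within one day.
pqcat_summary() {
    awk '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,   p) {
            split(t, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        /Starting Up/        { t0 = secs($3) }
        /Exiting/            { t1 = secs($3) }
        /Number of products/ { count = $NF }
        END {
            print "products=" count
            print "elapsed_seconds=" (t1 - t0)
        }
    '
}

# Demonstration on the sample output from this message.
pqcat_summary <<'EOF'
 Aug 24 18:43:39 pqcat: Starting Up (13752)
 Aug 24 18:43:40 pqcat: Exiting
 Aug 24 18:43:40 pqcat: Number of products 3314
EOF
```

For the sample this prints products=3314 and elapsed_seconds=1; the
count agrees with the nprods value that pqmon reported.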

If either of these programs dies or emits an error message, then the
queue really was corrupt, and the problem requires further
investigation.  We've never seen a corrupted queue with versions
5.1.2, 5.1.3, or 5.1.4 on motherlode, which has been feeding dozens of
sites millions of products for months, so if there is a problem, it
may be Linux-specific ...
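
To make it obvious when either program dies, a small wrapper such as
the one below may help.  It is only a sketch: run_check is a made-up
name, and it assumes the utilities signal failure through a nonzero
exit status, which I have not stated above.  Error messages in the
output still need to be read by eye.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the LDM): run a command, let its
# output through untouched, and report on stderr whether it exited
# with a nonzero status.
# Assumption: a failed utility returns a nonzero exit status.
run_check() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "FAILED (exit status $status): $*" >&2
        return 1
    fi
    echo "ok: $*" >&2
}

# Intended use with the two checks from this message:
#   run_check pqmon -q old.pq
#   run_check pqcat -q old.pq > /dev/null
```

The report goes to stderr so that it stays visible when stdout is sent
to /dev/null, as in the pqcat check.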

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
address@hidden                     http://www.unidata.ucar.edu