
Re: 20001027: workshop prep (fwd)




===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================

---------- Forwarded message ----------
Date: Sun, 29 Oct 2000 06:55:02 -0700
From: Russ Rew <address@hidden>
To: Tom Yoksas <address@hidden>,
     address@hidden, address@hidden, address@hidden
Subject: Re: 20001027: workshop prep 

Tom,

> uni9 (dual 800 MHz Pentium III) appears to be ingesting IDD data smoothly,
> but its disk access is still very.

?

> shemp, on the other hand, is always a couple to three hours behind
> on products getting out of the LDM queue for decoding.  This is
> especially true for the image products from the Unidata-Wisconsin (LDM
> MCIDAS feed type) stream.  The reboot of shemp yesterday (Friday night
> late) did not solve the problems we were seeing.  I have to figure that
> the inability to get data out of the queue into the "hands" of decoders
> is intimately related to shemp's data disk showing continuous,
> unrelenting activity.  Anybody else have a theory?

I just spent some time trying to see if I could discover any symptoms
of problems with shemp, but it looks like the LDM is running fine.
pqmon shows everything as expected with the queue algorithms.  Looking
at shemp's pqbinstats files, the latencies for MCIDAS products all look
small, with average latencies of about 10 to 15 seconds and a worst-case
latency of 509 seconds.  I couldn't see any obvious problems in the
pnga2area decoder logs either (is that the decoder that seems to be
falling behind?).  The pnga2area decoder usually finished within a few
seconds of starting up.
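
In case it's useful, the kind of quick summary I mean could be done with a
small script along these lines (a Python sketch; the file name and the field
positions are assumptions for illustration, not the actual pqbinstats layout):

  # Summarize average and worst-case latency per feedtype.  The input file
  # name and the idea that field 1 is a feedtype and field 2 is a latency
  # in seconds are illustrative assumptions, not the real pqbinstats format.
  from collections import defaultdict

  latencies = defaultdict(list)
  with open('pqbinstats.txt') as stats:
      for line in stats:
          fields = line.split()
          if len(fields) < 3:
              continue
          feed, secs = fields[1], float(fields[2])   # assumed positions
          latencies[feed].append(secs)

  for feed, vals in sorted(latencies.items()):
      avg = sum(vals) / len(vals)
      print('%-8s avg %6.1f s  worst %6.1f s' % (feed, avg, max(vals)))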

The ldmd.log on shemp does show more of these sorts of messages than
usual:  

 Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]: pq_del_oldest: conflict on 1785472176
 Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]: comings: pqe_new: Resource temporarily unavailable
 Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]:        : 68c5b47e4b3a9c49959e63a570bb6127    21422 20001029131013.897    NMC2 174  /u/ftp/ga
 Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]: Connection reset by peer
 Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]: Disconnect
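
If it would help to see whether those conflicts cluster around the hours when
decoding falls behind, something along these lines would tally them per hour
(a Python sketch; the log file path is assumed, and it just keys on the
"Mon dd hh" prefix of the syslog-style lines above):

  # Count pq_del_oldest conflict messages per hour in ldmd.log.
  # The path is illustrative; the 9-character slice is the "Mon dd hh"
  # prefix of the timestamps shown in the excerpt above.
  from collections import Counter

  counts = Counter()
  with open('ldmd.log') as log:
      for line in log:
          if 'pq_del_oldest' in line:
              counts[line[:9]] += 1      # e.g. "Oct 29 13"

  for hour, n in sorted(counts.items()):
      print(hour, n)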

I don't recall seeing these at all during our earlier testing.  I
think what this means is that a receiver process can't get a lock on
the oldest product in the queue to delete it to make space for an
incoming product, because some other process, probably a sender
process, still has a lock on that region.  If a sender died while it
still had a lock on a product, it would never release it, so this
might be a symptom of that.  But later messages refer to a different
region, so the lock must have gotten released.  This may be a red
herring, but I'll try to look at it more carefully to see exactly why
it is occurring.
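
The failure mode I have in mind is just ordinary region-locking behavior: a
non-blocking lock request on a region that some other process still holds
comes back EAGAIN, which is the "Resource temporarily unavailable" in the log
above.  Here is a minimal illustration using kernel byte-range locks on a
scratch file (a Python sketch only; the real queue may manage its own region
locks internally, so this is an analogy, not the LDM code):

  # Not the LDM code, just the locking behavior: if another process already
  # holds an exclusive lock on this byte range, a non-blocking request
  # fails with EAGAIN ("Resource temporarily unavailable").
  import errno, fcntl, os

  fd = os.open('queue.demo', os.O_RDWR | os.O_CREAT, 0o644)
  os.ftruncate(fd, 4096)

  try:
      # try to lock the region holding the "oldest product" without blocking
      fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1024, 0)
      print('got the region lock; safe to reclaim the space')
  except OSError as e:
      if e.errno in (errno.EAGAIN, errno.EACCES):
          print('conflict: some other process still holds that region')
      else:
          raise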

--Russ