
Re: 20001029: LDM 5.1.2 on solarisx86 not letting products out of queue



Tom,

> >The ldmd.log on shemp does show more of these sorts of messages than
> >usual:  
> >
> >Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]: pq_del_oldest: conflict on 1785472176
> >Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]: comings: pqe_new: Resource temporarily unavailable
> >Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]:        : 68c5b47e4b3a9c49959e63a570bb6127    21422 20001029131013.897    NMC2 174  /u/ftp/ga
 ...
> >
> >I don't recall seeing these at all during our earlier testing.  I
> >think what this means is that a receiver process can't get a lock on
> >the oldest product in the queue to delete it to make space for an
> >incoming product, because some other process, probably a sender
> >process, still has a lock on that region.  If a sender died while it
> >still had a lock on a product, it would never release it, so this
> >might be a symptom of that.  But later messages refer to a different
> >region, so the lock must have gotten released.  This may be a red
> >herring, but I'll try to look at it more carefully to see exactly why
> >it is occurring.
>
> Going with the concept of a slow feedee, I can offer the following:
> I saw the machine navier from Penn State connecting to shemp.  I also
> recall seeing a message that navier was on a network that was having
> problems.  Perhaps the two go together?  If so, the next question is
> why is shemp feeding navier?  The question after that is can I shut it
> off and see if shemp returns to the land of the living?

Thinking about it some more, I'll bet what has the lock on these
oldest products is not a downstream sender but a McIDAS decoder.  That
would be consistent with the other symptoms, but it would mean that
the above message doesn't help much; it's just another indication that
it takes a long time for the McIDAS decoders to get the products to
decode.
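
For reference, the kind of conflict described above can be reproduced
with a few lines of C.  This is only a sketch, assuming the queue uses
POSIX fcntl() advisory record locks on byte regions of a file;
try_lock_region() and lock_conflict_demo() are made-up names for
illustration, not functions from the LDM source.  One process holds a
write lock on a region and a second process asks for the same region
without blocking, which fails with EAGAIN (or EACCES on some systems).
EAGAIN is the errno whose message text is typically "Resource
temporarily unavailable".

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: try to take a non-blocking exclusive lock on
 * bytes [offset, offset + extent) of fd.
 * Returns 0 on success, or the errno value on failure. */
int try_lock_region(int fd, off_t offset, off_t extent)
{
    struct flock fl;

    assert(extent > 0);          /* l_len == 0 would mean "to end of file" */
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = offset;
    fl.l_len = extent;
    fl.l_pid = 0;

    /* F_SETLK returns immediately instead of waiting for the lock. */
    if (fcntl(fd, F_SETLK, &fl) == -1)
        return errno;
    return 0;
}

/* Hypothetical demo: the parent locks a region, then a child process
 * tries to lock the same region.
 * Returns 1 if the child saw a lock conflict, 0 if the child got the
 * lock, 2 on some other child error, -1 on setup failure. */
int lock_conflict_demo(void)
{
    char path[] = "/tmp/pqlockXXXXXX";
    int fd = mkstemp(path);
    int status = 0;
    pid_t pid;

    if (fd == -1)
        return -1;
    unlink(path);                /* the open descriptor keeps the file alive */

    if (try_lock_region(fd, 0, 100) != 0) {
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid == -1) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* fcntl locks belong to a process and are not inherited
         * across fork(), so the child has to ask for its own. */
        int err = try_lock_region(fd, 0, 100);
        _exit(err == 0 ? 0 : (err == EAGAIN || err == EACCES) ? 1 : 2);
    }

    if (waitpid(pid, &status, 0) == -1) {
        close(fd);
        return -1;
    }
    close(fd);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the holder of the lock exits, or closes the file, the kernel drops
the lock, so a region that stays locked points to a process that is
still alive and still holding it.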

Right now the only indication of the region with a lock on it in
pq_del_oldest messages like

  Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]: pq_del_oldest: conflict on 1785472176

is the region offset, 1785472176, which is not very useful to mere
humans.  I think I can improve this message to provide the product ID,
so we would at least know which product was locked.  If these all
turned out to be McIDAS products, it might help in diagnosing the
problem.  To use the new debugging code, we would have to stop and
restart the LDM on shemp.
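
A rough sketch of what the improved message might look like, assuming
the product's identifier string is available at the point where the
conflict is detected.  format_conflict_msg() is a hypothetical helper,
not the actual change to pq_del_oldest, and the message layout is only
a suggestion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build a conflict message that carries both the
 * region offset and the product ID.
 * Returns the value of snprintf(), i.e. the length the full message
 * needs, which exceeds size - 1 if the output was truncated. */
int format_conflict_msg(char *buf, size_t size, long offset,
                        const char *ident)
{
    assert(buf != NULL && size > 0);

    /* Fall back to a placeholder if no product ID is known. */
    if (ident == NULL || strlen(ident) == 0)
        ident = "unknown product";

    return snprintf(buf, size, "pq_del_oldest: conflict on %ld (%s)",
                    offset, ident);
}
```

Called with the offset from the message above, this would give a line
of the form "pq_del_oldest: conflict on 1785472176 (<product ID>)".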

I'll see if this is practical by just changing the code in
pq_del_oldest on shemp for now ...

--Russ