Re: LDM Status Report Question



Hi Allan,

This is curious.  With problems like this it's always useful to look at
the logs on both the sending and receiving machines.  Do you still have
the logs for that time period?  Checking the system logs might also be
useful.

Other possibilities that could keep a host from getting data are:
        incorrect system time 
        a full disk
but these explain neither the failure of ldmping nor the recovery after
restarting.  If it's not the DNS, it seems like either the connection or
the rpc.ldmd got hung.

If this happens again, here are a few things to do:

- Definitely look at the ldm logs, and maybe the system logs.

- Try a traceroute to the receiving machine.

- See how many ldm processes are running.  Generally there will be X + Y
+ Z + 1 of them, where X is the number of upstream feeds, Y is the
number of downstream sinks that are actually requesting data, and Z is
for 'notifymes' and ldmpings that are being sent to your machine.

- If you can identify which rpc.ldmd has the problem, you can toggle its
verbosity by sending it a USR2 signal:
        kill -USR2 <rpc.ldmd PID>
This will cause more information to go to the log.  Sending the USR2
signal causes rpc.ldmd to cycle through three levels of verbosity:
quiet, verbose, and debug.
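
As a rough sketch of the process-count check above, something like the
following shell functions could be used.  The function names are made up
for illustration, and the sketch assumes that pgrep is available and
that the LDM processes show up under the name rpc.ldmd:

```shell
#!/bin/sh
# Hypothetical helpers for the X + Y + Z + 1 rule described above.

# Expected rpc.ldmd count.
#   $1 = X, number of upstream feeds
#   $2 = Y, number of downstream sinks actually requesting data
#   $3 = Z, number of notifyme/ldmping connections being served
expected_ldm_procs() {
    echo $(( $1 + $2 + $3 + 1 ))
}

# Count running processes named rpc.ldmd (assumes pgrep exists).
running_ldm_procs() {
    pgrep -x rpc.ldmd | wc -l | tr -d ' '
}

# Compare an observed count ($1) with the expected count ($2).
compare_ldm_procs() {
    if [ "$1" -eq "$2" ]; then
        echo "ok: $1 rpc.ldmd processes"
    else
        echo "mismatch: found $1 rpc.ldmd processes, expected $2"
    fi
}
```

For example, with made-up values of X=1, Y=3, and Z=0:
        compare_ldm_procs "$(running_ldm_procs)" "$(expected_ldm_procs 1 3 0)"
A mismatch is only a hint about where to look, since Z changes whenever
someone runs a notifyme or ldmping against your machine.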

I would be very interested to know what you find out about this.  Please
keep us informed.

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************

Russ Rew wrote:
> 
> > Date: Wed, 10 Jan 2001 09:17:49 -0500
> > From: "Allan Darling" <address@hidden>
> > Organization: DOC/NOAA/NWS - National Weather Service
> > To: address@hidden, James Fenix <address@hidden>,
> >    William Brockman <address@hidden>
> > Subject: LDM Status Report Question
> 
> Hi Allan,
> 
> > I've experienced a problem using LDM and was wondering if you might have
> > any insight or comments about how to resolve, or identify the problem.
> > I'm using LDM (v 5.0.8) to distribute files to three receivers within
> > the NWS WAN.  We recently experienced a problem where one of the three
> > systems stopped receiving files while the other two systems continued to
> > receive files.  The affected system could ping, but not ldmping, our
> > system.  We could ping and ldmping their system.  This condition
> > persisted even after they restarted their LDM. After we stopped and
> > started LDM at our end, the problem went away.  I ran an ldmadmin check,
> > the results of which are below.
>  ...
> > 'NULLPROC error' message occurred 28 time(s).
> >         Last one at:  Jan 10 06:02:49
> >         For 205.165.7.125 it happened 4 time(s).
> >         For maul.wrh.noaa.gov it happened 24 time(s).
> 
> First, I don't think the "ldmadmin check" output is very helpful in
> this case.  The ldmping sends a NULLPROC remote procedure call, and the
> above merely indicates something went wrong with trying to return a
> result acknowledging the NULLPROC call.
> 
> This sounds like a DNS (domain name service) problem, but there could
> be other causes.  It would help to see the actual ldmping output from
> maul.wrh.noaa.gov to see how far it got up the protocol stack.  That
> is, was the "State" it reported "NAMED" (in which case the DNS lookup
> failed) or was it "SVC_UNAVAIL" (in which case it was contacting port
> 388 but the LDM was running on a different port, possibly due to
> starting it up as some user other than "ldm" or not having run "make
> install_setuids" as root).  Do you remember what the ldmping "State"
> was when it failed?
> 
> > I'd like to know if you have seen this problem before and if so do you
> > have a resolution.  I'm hoping an upgrade to v 5.1.2 will address this.
> 
> I can't recall seeing this specific symptom, but DNS problems are
> fairly common (and there's nothing the LDM can do about them).
> Another possible problem would be the upstream host tgsv not having an
> "ALLOW" entry in its ldmd.conf to allow the downstream node
> maul.wrh.noaa.gov to ldmping it.  Or having such an ALLOW entry, but
> DNS not resolving that name to the same IP number that the ldmping
> request came from.  If someone recently changed the IP number of
> either host and the old number was still cached in the DNS server,
> that would also cause this symptom.
> 
> I'm CC:ing Anne Wilson also, in case she has a better idea about what
> might cause this problem.  In the future, you might want to send
> questions like this to address@hidden instead of me
> specifically, in case I'm away from my email.
> 
> > I'm also very interested in how we might monitor, from our end, the
> > successful transfer of files to the remote systems. Any assistance you
> > can provide would be very much appreciated.
> 
> You can monitor the successful transfer of files with the "notifyme"
> command running on the upstream host asking the downstream host to
> send notifications of each product.  Or you can set up a cron job to
> periodically run notifyme for a little while and send you mail if it
> doesn't produce any output.  The typical invocation is something
> like:
> 
>   notifyme -v -l- -h <downstream_host>
> 
> where sometimes you also add a "-o xxx" argument to look back xxx
> seconds in the downstream host's queue, in case it has fallen behind
> the data feed.
> 
> Please let us know if this helps resolve the problem or, if you see it
> again, what the ldmping output looks like.
> 
> --Russ
> 
> _____________________________________________________________________
> 
> Russ Rew                                         UCAR Unidata Program
> address@hidden                     http://www.unidata.ucar.edu