
Re: LDM blues



Chris Herbster wrote:
> 
> Anne, thanks for the efforts thus far.  I have some answers, while we'll have 
> to defer to our network folks for some of the other items....
> 
<snip>
> 
> Okay, I'm not sure about the network issues with "notifyme" (i.e. what ports 
> are used) .... but I can tell you that traceroute won't work.  Nor will ping.
> 
> Our network administrators have trapped all traceroute and ping packets at 
> the firewall.  When I asked for some traceroute stats quite a few months ago, 
> I was told that this was not possible.  Apparently no computer has been left 
> on the outside of the firewall to allow for this type of network test.  I've 
> included them in this email, so perhaps they can address this again.  (Ernst?)


In order to get timely data, we need to be able to sample the network.
There are choices as to who can feed you, but picking the best feed
source requires some data.  As you know, geographical proximity isn't
always the fastest or best choice.  If other sites can't traceroute to
you, then we can only guess blindly at which site to pick.

And, it looks like you can't traceroute out either:

[ldm@thermal ~]$ /usr/sbin/traceroute imogene.unidata.ucar.edu
traceroute to imogene.unidata.ucar.edu (128.117.140.28), 30 hops max, 38 byte packets
 1  node129-254.unnamed.db.erau.edu (155.31.129.254)  0.582 ms  0.507 ms  0.412 ms
 2  * * *

Although a traceroute from your site to another doesn't necessarily
follow the same path that traffic takes in the reverse direction, it is
probably somewhat reflective of the state of the connection between the
two.  It would be helpful if you could traceroute to a variety of sites
to sample the connectivity - see the sketch just below.
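
For example, a quick loop like this would gather a sample in one pass
(just a sketch, using the hosts already named in this thread;
substitute whatever candidate feed sites make sense):

  for host in imogene.unidata.ucar.edu pluto.met.fsu.edu striker.atmos.albany.edu
  do
      echo "=== $host ==="          # label each trace in the output
      /usr/sbin/traceroute $host
  done

Sending us the output of something like that would give us real numbers
to compare.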


One other point to mention.  I just checked your logs again.  In
addition to periods of 'skipped' messages, there are lots of lost
connections and reconnections to FSU.  To see these, do 'grep FEEDME
ldmd.log*' in the logs dir.  If you do this, you'll see that
reconnections to FSL and Albany are infrequent, unlike the many to FSU.
This also points to connectivity problems to FSU in particular.  (A
quick way to tally the reconnections is sketched below.)  It does,
however, look like you now have a good connection that's been up for
over a day and a half.
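
If you'd rather see counts than raw log lines, something like this
should work (a sketch - the awk field number assumes the log format
shown later in this message, where the upstream host appears in the
FEEDME(...) token of the sixth field):

  grep -c FEEDME ldmd.log*                                      # reconnections per log file
  grep -h FEEDME ldmd.log* | awk '{print $6}' | sort | uniq -c  # tally per upstream host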


> 
> The connectivity has been worse recently, but it always is bad in fall when 
> compared to summer ....
> 
> 

Yeah, the student blues.  But, listening to music over the computer is
important!! :(


> We also need to establish our failover site.  Jeff Weber had asked me to give 
> him some network stats to make the decision.  Since I was not able to provide 
> this info to Jeff we never went further to configure a failover site.
> 

Again, picking a good failover site requires some data about
connectivity.  Otherwise we'll just pick out of the blue. 


> 
> > There is one other thing you can do.  In your '~ldm/bin/ldmadmin'
> > script, you can change the default setting so that your LDM won't reject
> > products that are just over an hour late.  For example, you might want
> > to accept products that are two hours old.  I can make that change for
> > you if you want - I'll mark it in the file so you can see what I did.
> 
> Let's try to make our system a bit more tolerant of older data.  Perhaps 
> allowing for a lag of 75 or 90 minutes?  What ever you think would be best.
> 

I made the change to allow data to be 90 minutes old.  If you want to
see it, look in ~ldm/bin/ldmadmin and search on "Anne".  The change is
actually two lines below the comment that includes "Anne": I added two
flags to the call to rpc.ldmd.  I then stopped and restarted the LDM.
Here are the first lines in the log:

[ldm@thermal ~/logs]$ ldmadmin tail
Oct 05 16:24:37 thermal striker[12796]: run_requester: 20011005145437.192 TS_ENDT {{NLDN,  ".*"}} 
                                                               ^^^^^^
Oct 05 16:24:37 thermal pqact[12794]: Starting Up 
Oct 05 16:24:37 thermal ldm[12797]: run_requester: Starting Up: ldm.fsl.noaa.gov 
Oct 05 16:24:37 thermal ldm[12797]: run_requester: 20011005145437.193 TS_ENDT {{PCWS,  "^FSL\.NetCDF\.ACARS\.QC\..*"}} 
Oct 05 16:24:37 thermal pqbinstats[12793]: Starting Up (12792) 
Oct 05 16:24:37 thermal pluto[12795]: FEEDME(pluto.met.fsu.edu): OK 
Oct 05 16:24:38 thermal ldm[12797]: FEEDME(ldm.fsl.noaa.gov): OK 
Oct 05 16:24:39 thermal localhost[12804]: Connection from localhost.localdomain 
Oct 05 16:24:39 thermal localhost[12804]: Connection reset by peer 
Oct 05 16:24:39 thermal localhost[12804]: Exiting 
Oct 05 16:24:47 thermal striker[12796]: FEEDME(striker.atmos.albany.edu): OK 


The time in that timestamp, 14:54:37 (the digits marked above), is 90
minutes behind the time I restarted the LDM, 16:24:37.  We'll see if
this keeps you from losing products.
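
For the record, the edit amounts to adding the two latency-related
flags to the rpc.ldmd invocation in ldmadmin.  A sketch of what that
looks like (I'm writing the flags from memory, so treat the exact
spelling as an assumption; both values are in seconds, and 90 minutes
is 5400):

  # in ~ldm/bin/ldmadmin, added to the existing rpc.ldmd command line:
  #   -m 5400   accept products up to 5400 seconds (90 minutes) old
  #   -o 5400   request data starting 5400 seconds back from now
  rpc.ldmd ... -m 5400 -o 5400

The default maximum latency is one hour, which is why products just
over an hour late were being rejected before.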


> On a related note, if I do:
> [herbster@thermal ~]$ ldmadmin queuecheck
> [herbster@thermal ~]$ echo $status
> 1
> 
> I get a non-zero status from ldmadmin.  Does this mean the queue is corrupted 
> or not?  The man page suggests that status info should be printed to STDOUT, 
> but that isn't my experience with this option.
> 

Yes, I see what you mean.  In looking at the ldmadmin code, there is no
output - the man page is wrong.  (I'll put that on my list.)  But all
ldmadmin does is call pqcat, and you can do this yourself.  Do:
'pqcat -l - > /dev/null'.  pqcat extracts products from the queue.
Here, we are telling it to dump the products to /dev/null - we don't
really want the products themselves, we just want info about the
products.  On your machine I get this:

[ldm@thermal ~/data]$ pqcat -l - > /dev/null
Oct 05 16:42:54 pqcat: Starting Up (12919)
Oct 05 16:42:59 pqcat: Exiting
Oct 05 16:42:59 pqcat: Number of products 24422

If you really want to check that each product in the queue is ok, you
can use the -c option to pqcat, which recomputes the checksum for each
product, like this:
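
  [ldm@thermal ~]$ pqcat -c -l - > /dev/null

That's the same command as above with -c added; I'd expect pqcat to
complain in its log output about any product whose checksum doesn't
verify.  It will take somewhat longer than the plain run, since every
product now gets checksummed as well as read.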

> ** Update since I first wrote this last section.  I have moved the product 
> queue back to the default location.  (We had put it on the /var filesystem 
> due to disk space issues before.)  I think that the other programs/functions 
> weren't looking in the right place for the queue and that was giving 
> misleading information.
> 

Yes - when you use the ldmadmin script instead of calling the commands
directly, you can't supply your own arguments, so you're stuck whenever
the default values aren't applicable.  But you can always look to see
what ldmadmin is doing and simply run the command yourself from the
command line, like I described with pqcat above.
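
Since ldmadmin is just a script, a plain grep is enough to find the
underlying command (illustrated here with pqcat):

  grep -n pqcat ~ldm/bin/ldmadmin

That shows the line number and the full command line ldmadmin uses,
which you can then adapt and run by hand.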

This caused me to check your queue.  I see that you are running with a
very small queue, 100MB:

-rw-rw-r--    1 ldm      data     101662720 Oct  5 12:34 ldm.pq

The queue holds 100MB of data; the size beyond that is for
housekeeping purposes.

This is the default queue size that is currently set in the distribution
version of ldmadmin.  I'm going to change this in future distributions -
it's too small for today's needs.  But, we do expect that sites will
change this value to suit their own needs.

In your case, although this is a small queue, the size seems to be fine
for the amount of data you're currently getting.  I determined this
using pqmon:

[ldm@thermal ~/data]$ pqmon
Oct 05 16:34:34 pqmon: Starting Up (12885)
Oct 05 16:34:34 pqmon: nprods nfree  nempty      nbytes  maxprods maxfree  minempty    maxext    age
Oct 05 16:34:34 pqmon:  24404     6       4    43413888     24409      38         4  37762992  15808
Oct 05 16:34:34 pqmon: Exiting

In the last column you see the age, in seconds, of the oldest product
in your queue: 15808 seconds, so your queue is currently holding about
4.4 hours worth of data.  When you go back to requesting more data, I'm
guessing you'll want to increase your queue size - HDS data is much
larger than IDS|DDPLUS.  To do this, change the value of pq_size in
ldmadmin; it's relatively near the top, in the configuration section of
the script.  (A sketch of the edit is below.)  You have to remember to
make this change every time you upgrade.
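
The edit itself is a one-line change, something like this (a sketch -
the exact assignment syntax varies a bit between LDM versions, so match
whatever form your copy of ldmadmin uses; the 400MB figure is only an
illustration):

  # in the configuration section near the top of ~ldm/bin/ldmadmin
  $pq_size = 400000000;   # bytes of data the queue should hold (was the 100MB default)

After changing it, remake the queue with the LDM stopped ('ldmadmin
delqueue' then 'ldmadmin mkqueue') - an existing queue isn't resized in
place.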



> 
> Let's try to adjust our system to allow for slightly older data.  Meanwhile, 
> we'll have to wait for a reply from our network folks to see if we can 
> determine where the delays occur.  For the time being, I have reduced our 
> data request to just the DDPLUS data stream.  Perhaps that will reduce the 
> network load to allow our system to catch up to current data.
> 
> Anne, thanks again for the help!!!
> 
> Chris H.
> 

You're very welcome!  Btw, you might want to consider attending our LDM
workshop next year.  I'm biased (I teach it), but I feel it covers the
material pretty well.

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************