
Re: [Re: 20010622: Solaris 8 problems]



"Bryan G. White" wrote:
> 
> > I have a couple of things I'd like to check out.  I will start the LDM
> > and let it run for a while and watch it.  If you see activity, it's me.
> >
> > Have you changed what data you're requesting, or would the stream have
> > increased for any other reason?
> 
> We made some changes before I upgraded to new OS.  Everything seemed
> to go fine for a week or so.  I did remove a site after the upgrade.

Hi Bryan,

I found several things to report to you.

First, log messages are being written to /var/log/infolog, as per the
local0 entry in /etc/syslog.conf.  The good news is that I could see the
log messages which gave me a clue about the problem.  But, I recommend
that you change your syslog.conf file so that LDM log messages are
written to the log files in ~ldm/logs.  Maybe you already know how to do
this, but just in case, there are instructions for this in
http://www.unidata.ucar.edu/packages/ldm/ldmPreInstallList.html#s8 under
"Configuring the Operating System as root".

Second, your installation is nonstandard.  The home of user 'ldm' is
/home/ldm, but there's an extra subdirectory 'ldm', yielding a path of
/home/ldm/ldm to the standard files and directories.  But, some of those
directories are duplicated under /home/ldm.  For example, there's
/home/ldm/logs and /home/ldm/ldm/logs.  Similar for the 'data'
directory.  I would straighten that out.  The same web page as above has
instructions about the conventional structure - see:
http://www.unidata.ucar.edu/packages/ldm/ldmPreInstallList.html#s14.

But these aren't the _real_ problem.  In the logs I found the following:

Jul 10 13:20:22 met20.slc.noaa.gov cirp[17711]: comings: pqe_new: Not enough space
Jul 10 13:20:22 met20.slc.noaa.gov cirp[17711]:        : 57dc6d0af3b225993daf4335e38e7604 13379545 20010710130514.165     EXP 000  ens_010710_00_
Jul 10 13:20:22 met20.slc.noaa.gov cirp[17711]: Connection reset by peer
Jul 10 13:20:22 met20.slc.noaa.gov cirp[17711]: Disconnect
Jul 10 13:20:25 met20.slc.noaa.gov rpc.ldmd[17707]: child 17712 terminated by signal 11
Jul 10 13:20:25 met20.slc.noaa.gov rpc.ldmd[17707]: Killing (SIGINT) process group
Jul 10 13:20:25 met20.slc.noaa.gov rpc.ldmd[17707]: Interrupt
Jul 10 13:20:25 met20.slc.noaa.gov rpc.ldmd[17707]: Exiting
Jul 10 13:20:25 met20.slc.noaa.gov voyager(feed)[17857]: Interrupt
Jul 10 13:20:25 met20.slc.noaa.gov voyager(feed)[17857]: Exiting
Jul 10 13:20:25 met20.slc.noaa.gov cirp(feed)[17760]: Interrupt
Jul 10 13:20:25 met20.slc.noaa.gov cirp(feed)[17760]: Exiting
Jul 10 13:20:25 met20.slc.noaa.gov cirp[17711]: Interrupt
Jul 10 13:20:25 met20.slc.noaa.gov cirp[17711]: Exiting
Jul 10 13:20:25 met20.slc.noaa.gov pqact[17710]: Interrupt
Jul 10 13:20:25 met20.slc.noaa.gov pqact[17710]: Exiting
Jul 10 13:20:25 met20.slc.noaa.gov pqbinstats[17709]: Interrupt
Jul 10 13:20:25 met20.slc.noaa.gov pqbinstats[17709]: Exiting
Jul 10 13:20:25 met20.slc.noaa.gov pqexpire[17708]: Interrupt
Jul 10 13:20:25 met20.slc.noaa.gov pqexpire[17708]: Exiting
Jul 10 13:20:25 met20.slc.noaa.gov pqexpire[17708]: > Up since:      20010709161608.400
Jul 10 13:20:25 met20.slc.noaa.gov pqexpire[17708]: > Queue usage (bytes):10002432
Jul 10 13:20:25 met20.slc.noaa.gov pqexpire[17708]: >          (nregions):    2436
Jul 10 13:20:25 met20.slc.noaa.gov pqexpire[17708]: > nprods deleted 0

This "not enough space" problem caused everything to die.  From previous
log entries I see that you are getting lots of very small products.  I
think you ran out of "slots" in your queue.  The total number of slots
for products is the size of the queue divided by the average product
size, which by default is assumed to be 4096 bytes.  So with your 10 MB
queue (10000000 / 4096) you'll be able to handle at most 2441 products.
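
As a quick check of that arithmetic, here is a minimal sketch in Python.
The function and constant names are made up for illustration; only the
4096-byte default and the 10 MB queue size come from the description
above, and real queues may hold somewhat fewer products due to overhead.

```python
# Default average product size assumed when the queue is created (bytes).
DEFAULT_AVG_PRODUCT_BYTES = 4096


def estimated_slots(queue_bytes, avg_product_bytes=DEFAULT_AVG_PRODUCT_BYTES):
    """Return the theoretical number of product slots for a queue.

    This is just queue size divided by average product size, rounded down.
    """
    if queue_bytes <= 0 or avg_product_bytes <= 0:
        raise ValueError("sizes must be positive")
    return queue_bytes // avg_product_bytes


if __name__ == "__main__":
    print(estimated_slots(10_000_000))  # 2441
```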

I restarted your ldm and then started the pqmon program to monitor the
queue.  Here's what it said:   

48met20% pqmon -i5
Jul 11 19:16:17 pqmon: Starting Up (3051)
Jul 11 19:16:17 pqmon: nprods nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext  age
Jul 11 19:16:17 pqmon:   2431     6       4     2839544      2436       10         4   6632488  81
Jul 11 19:16:22 pqmon:   2431     6       4     2839544      2436       10         4   6632488  86
Jul 11 19:16:27 pqmon:   2431     6       4     2839544      2436       10         4   6632488  91
Jul 11 19:16:32 pqmon:   2432     5       4     2845944      2436       10         4   6632488  96
...

The 'nprods' column tells me that you had almost reached the limit on
the number of products you could store.  (That 2441 number is
theoretical and may actually never be reached due to overhead.)

Your LDM is running now - although I thought it might reach the limit
quickly, it didn't.  Instead space is being recycled and so far
everything's ok.  I suggest that you let it run, and at the same time
run pqmon and have it log to a file.  Then, if and when it crashes, you
may be able to correlate the crash with running out of slots as reported
by pqmon and confirm this diagnosis.  For information about pqmon and
how to have it log to a file see the pqmon man page.  It will create a
very large file - probably you want to clear it out or rotate it
regularly.
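
If it helps with the correlation, here is a rough Python sketch of how
such a pqmon log could be scanned for samples where 'nprods' is close
to the slot limit.  It assumes each sample is on a single line in the
same nine-column layout as the output above; the function names, the
slot_limit argument, and the 99% threshold are all my own choices for
illustration, not anything pqmon itself provides.

```python
# Column names as they appear in the pqmon header line.
FIELDS = ("nprods", "nfree", "nempty", "nbytes", "maxprods",
          "maxfree", "minempty", "maxext", "age")


def parse_pqmon_line(line):
    """Return a dict of the nine counters, or None for non-data lines."""
    _, sep, rest = line.partition("pqmon:")
    if not sep:
        return None
    parts = rest.split()
    # Header and "Starting Up" lines are skipped: wrong count or non-numeric.
    if len(parts) != len(FIELDS) or not all(p.isdigit() for p in parts):
        return None
    return dict(zip(FIELDS, (int(p) for p in parts)))


def near_limit(lines, slot_limit, fraction=0.99):
    """Return the parsed samples whose nprods is at least fraction * slot_limit."""
    samples = (parse_pqmon_line(line) for line in lines)
    return [s for s in samples
            if s is not None and s["nprods"] >= fraction * slot_limit]


if __name__ == "__main__":
    sample = [
        "Jul 11 19:16:17 pqmon: Starting Up (3051)",
        "Jul 11 19:16:17 pqmon:   2431     6       4     2839544      2436       10         4   6632488  81",
    ]
    for hit in near_limit(sample, slot_limit=2441):
        print(hit["nprods"], hit["age"])
```

Comparing the timestamps of any flagged samples with the time of a
crash in the LDM log should show whether the two line up.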

You can change the default number of slots using the -S option to
pqcreate.  To do this, I suggest you modify the ldmadmin script.  In
ldmadmin, in the subroutine make_pq, find the section that looks like
this:

# build the command line

    $cmd_line = "pqcreate";

    if ($verbose) {
        $cmd_line .= " -v";
    }

    if ($pq_clobber) {
        $cmd_line .= " -c";
    }

    if ($pq_fast) {
        $cmd_line .= " -f";
    }

    $cmd_line .= " -q $pq_path -s $pq_size";

and change that last line to 

    $cmd_line .= " -S <slots> -q $pq_path -s $pq_size";

where <slots> is an appropriate number.  What's an appropriate number?
In looking at your logs, I see lots of tiny products, many less than 100
bytes.  The best way would be to get the average product size and divide
that into 10000000.  Barring that, you could simply start with, say,
5000, which implies an average product size of about half the default.
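
To make the "average product size" approach concrete, here is a small
Python sketch.  It assumes you have already collected a list of product
sizes in bytes by some means (for example, from your own logs); the
function name and the sample sizes below are invented for illustration.

```python
def slots_from_sizes(queue_bytes, product_sizes):
    """Suggest a slot count: queue size divided by the mean product size."""
    if not product_sizes:
        raise ValueError("need at least one product size")
    if queue_bytes <= 0 or any(size <= 0 for size in product_sizes):
        raise ValueError("sizes must be positive")
    average = sum(product_sizes) / len(product_sizes)
    # Round down so the suggestion never exceeds what the queue could hold.
    return int(queue_bytes // average)


if __name__ == "__main__":
    # Made-up sample with a mean of 2000 bytes.
    sizes = [100, 60, 4000, 3840]
    print(slots_from_sizes(10_000_000, sizes))
```

A sample dominated by tiny products will push the suggestion well above
5000, so it is worth sampling over a representative stretch of time.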

There is another subroutine in ldmadmin, make_surf_pq, that also uses
pqcreate.  But, I think you can leave that one alone since it doesn't
look like you're using pqsurf.

The only potential problem with modifying ldmadmin is that whenever you
upgrade you must remember to duplicate this change.  This is something
that people often forget to do.

One last thing.  From the log entries above I see that you were running
pqexpire.  Starting with version 5.1 we recommend against running
pqexpire.  For that reason, I commented that line out of your ldmd.conf
file before starting up the ldm, so you won't see pqexpire running
anymore.

Please let me know if any of this is unclear or if you have any further
questions. 

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************