
[IDD #NOY-159685]: LDM on allegan is dying



Hi John,

re:
> Can we set up webcat.jql.usu.edu to bypass allegan.gis.usu.edu and get
> it's feed directly from one of the other servers, at least as a
> secondary?

Yes. This is not a problem.

> Allegan has been having issues lately.

I logged onto allegan to scope things out and found that all of the problems
you have been seeing stem from the McIDAS decoder processes being unable to
write to the /data/mcidas filesystem.  I don't recall /data/mcidas being
located on /webcat1 previously, but I could be mistaken.

In order to get allegan's LDM working again, I did the following:

- stop the LDM, delete and remake its queue
- comment out invocation of 'xcd_run MONITOR' from ~ldm/etc/ldmd.conf
- comment out invocation of 'pqact' from ~ldm/etc/ldmd.conf
- start the LDM

The reason I commented out the running of McIDAS-XCD processes
('xcd_run MONITOR') and all data processing ('pqact') is that
both invocations will attempt to write to /data/mcidas (i.e., /webcat1),
and writes to that file system are failing.
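
For reference, the ldmd.conf edit described above amounts to prefixing two
entries with '#'.  Here is a minimal Python sketch of that edit.  It assumes
the entries have the form exec "xcd_run MONITOR" and exec "pqact" (the actual
lines in allegan's file may carry extra arguments), and the function name is
hypothetical:

```python
import re

def comment_out_exec(conf_text, commands):
    """Return conf_text with matching 'exec' entries commented out.

    Assumes entries look like:  exec "xcd_run MONITOR"
    An entry matches when its quoted command starts with the same words
    as one of `commands`.  Lines that are already commented are untouched.
    """
    wanted = [c.split() for c in commands]
    out = []
    for line in conf_text.splitlines():
        m = re.match(r'\s*exec\s+"([^"]*)"', line, re.IGNORECASE)
        if m:
            words = m.group(1).split()
            if any(words[:len(w)] == w for w in wanted):
                line = "#" + line
        out.append(line)
    # Preserve a trailing newline if the input had one
    return "\n".join(out) + ("\n" if conf_text.endswith("\n") else "")
```

The LDM should be stopped before the file is changed and started again
afterwards; the sketch only transforms the text and does not touch the LDM
itself.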

With these changes, allegan's LDM simply relays data to webcat for its use.

> LDM on allegan is dying off after running for several hours.  The most
> recent errors that appeared in  /data/ldm/logs/XCD_START.LOG include:
> 
> Starting DDS at 06082.025103
> ingetext.k: Cannot make positive UC: could not create 384300-byte shared 
> memory segment
> Starting DDS at 06082.025104
> ld.so.1: ingetext.k: fatal: /opt/SUNWspro/lib/libfsumai.so.1: mmap failed: 
> Resource temporarily unavailable
> ld.so.1: ingetext.k: fatal: libfsumai.so.1: open failed: No such file or 
> directory

This indicates that the system has run out of shared memory (McIDAS processes
use shared memory).  Again, the system ran out of shared memory because of
the inability to write to /data/mcidas.  What happened is that the
'xcd_run DDS' and 'xcd_run HDS' invocations in ~ldm/etc/pqact.conf were
rerun after a certain amount of time because they had not started a
McIDAS-XCD ingest process (either ingebin.k or ingetext.k) that could
successfully read from STDIN (data sent by pqact) and write to the needed
output spool file (either /data/mcidas/HRS.SPL for ingebin.k or
/data/mcidas/*.XCD for ingetext.k).  When the LDM (pqact) has a backlog of
data to process, it will create a new instance of the "decoder" that is
supposed to do something with the data.  In your case, the "decoders" were
ingetext.k and ingebin.k.  The effect was that more and more instances of
ingetext.k were running concurrently, and your system ran out of shared
memory (each ingetext.k needs its own chunk of shared memory).
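
One way to watch for this pile-up is to count the running decoder instances.
The Python sketch below is only an illustration: the function names are
hypothetical, and the 384300-byte figure is taken from the error message
quoted above on the assumption that each ingetext.k requests one segment of
that size.

```python
import subprocess
from collections import Counter

# Segment size reported in the XCD_START.LOG error quoted above (assumed
# to be what every ingetext.k instance asks for).
SEGMENT_BYTES = 384300

def count_decoders(ps_output, names=("ingetext.k", "ingebin.k")):
    """Count decoder instances in the output of `ps -e -o comm=`."""
    counts = Counter()
    for line in ps_output.splitlines():
        # The command column may be a full path; keep only the basename
        base = line.strip().rsplit("/", 1)[-1]
        if base in names:
            counts[base] += 1
    return counts

def list_process_names():
    """Run POSIX `ps -e -o comm=` and return one command name per line."""
    result = subprocess.run(["ps", "-e", "-o", "comm="],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling count_decoders(list_process_names()) gives the instance counts, and
multiplying the ingetext.k count by SEGMENT_BYTES gives a rough estimate of
the shared memory those instances are requesting.  A count that keeps growing
suggests the decoders are still failing to write their output.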

In short, the problem lies entirely with the inability of 'mcidas' and
'ldm' processes to write to /data/mcidas (/webcat1).

I would check to see if some change was recently made that affected
the writability of /data/mcidas.
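
A quick way to test writability is to attempt a real write as each affected
user ('ldm' and 'mcidas').  The following Python sketch does that; the
function name and the probe file prefix are hypothetical:

```python
import os
import tempfile

def can_write(directory):
    """Try to create, write, and remove a small file in `directory`.

    Returns (True, None) on success or (False, error_message) on failure.
    A real write is attempted because permission bits alone can be
    misleading on NFS mounts.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory, prefix="wtest_") as f:
            f.write(b"probe")
            f.flush()
            os.fsync(f.fileno())  # push the data out to the file system
        return True, None
    except OSError as e:
        return False, str(e)
```

For example, can_write("/data/mcidas") run as 'ldm' and again as 'mcidas'
would show whether either account is blocked, and the returned error text
(permission denied, read-only file system, and so on) narrows down the cause.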

The one thing I am unsure of (since I didn't continue investigating
past what I wrote above) is whether or not the LDM and McIDAS installations
on webcat are working.

Questions:

- which machine (allegan or webcat) is _supposed_ to decode
  data into McIDAS-usable formats?

- is /webcat1 a filesystem used by both allegan and webcat for McIDAS
  decoding?  If yes, then decoding of UNIWISC, NLDN, and FSL2 data
  will not be working at the moment since the file ROUTE.SYS on
  /data/mcidas is empty (due to the inability to write to /webcat1
  from allegan).
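
The state of ROUTE.SYS can be checked from either machine.  This is a small
Python sketch with a hypothetical helper name:

```python
import os

def file_state(path):
    """Classify `path` as 'missing', 'empty', or 'has data'."""
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        return "missing"
    return "empty" if size == 0 else "has data"
```

For example, file_state("/data/mcidas/ROUTE.SYS") returning "empty" would
match what I saw on allegan.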

Synopsis:

The problems you are currently seeing are entirely due to the NFS-mounted
file system /webcat1 on allegan.  If you solve the problem of the
inability to write to this file system from allegan, and if this
file system is _NOT_ being written to by webcat McIDAS-XCD processes,
then you can safely turn on McIDAS processing in allegan's ~ldm/etc/ldmd.conf
file (i.e., uncomment the exec of xcd_run and pqact and then restart the LDM).

Please let me know if/when you have questions about what I found.

Cheers,

Tom
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: NOY-159685
Department: Support IDD
Priority: Normal
Status: Closed