
20030318: ldm on waldo at stc; McIDAS-XCD scouring moved to 'ldm' account



>From: "Anderson, Alan C. " <address@hidden>
>Organization: St. Cloud State
>Keywords: 200303181640.h2IGeXB2004042 LDM-6 McIDAS-XCD scour

Alan,

>Noticed that our ldm has stopped getting data from papagayo
>as of about 10Z on 17 Mar.  My log files seemed ok up to that 
>time,  then data log stopped.  I have checked with Clint, see
>his response below.
>Any suggestions.

OK.  The messages in Clint's log file confirm/demonstrate the inability
of his LDM to send you data.

>Have stopped and restarted my ldm this morning, but it is still 
>not ingesting.

I logged on and was able to run notifyme to papagayo to verify that
nothing has changed on Clint's side (allows, etc.):

<as 'ldm'>
notifyme -vxl- -f ANY -o 3600 -h papagayo.unl.edu

Data lists came back immediately, proving that Clint's machine is
correctly set up to allow feeds from waldo.

I then ran top and noticed that the load average on waldo was 44.
Since this is extremely unusual, I decided to shut down the LDM
and run some checks on the queue.

/usr/local/ldm% pqcat -s -q data/ldm.pq -l-
Mar 18 18:28:36 pqcat: Starting Up (9152)
Mar 18 18:28:36 pqcat: assertion "IsAlloc(rep)" failed: file "pq.c", line 1907
Abort (core dumped)

This looked as though the queue was corrupted, so I decided to
delete and remake it:

/usr/local/ldm% ldmadmin delqueue
/usr/local/ldm/data/ldm.pq: No such file or directory

After verifying that there was still a link between /var/data/ldm and
/usr/local/ldm/data, I looked for a queue:

/usr/local/ldm% cd data
/usr/local/ldm/data% ls -alt
total 22
drwxr-xr-x   5 ldm      data         512 Mar 18 18:28 ./
drwxr-xr-x   2 ldm      data        6656 Mar 18 16:08 logs/
drwxrwxr-x   4 ldm      data         512 Nov  6 21:01 gempak/
drwxrwxr-x   3 ldm      data         512 Sep 25 01:00 surface/
drwxrwxr-x   4 ldm      data         512 Nov 24  1999 ../

So, your problem was that your LDM queue somehow got deleted!

I remade the queue and then restarted your LDM:

/usr/local/ldm% ldmadmin mkqueue
/usr/local/ldm% ldmadmin start

Data is once again flowing into waldo.  Now, the question is how the
LDM queue got deleted!?
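
Until we know what removed the queue, a simple check run by hand or
from cron could flag the condition early.  The following is only a
sketch: queue_ok is a hypothetical helper name, not part of the LDM
distribution, and the path is the one from the session above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the LDM distribution): succeed only if
# the product queue file exists and is non-empty.
queue_ok() {
    # $1: path to the queue file, e.g. /usr/local/ldm/data/ldm.pq
    [ -f "$1" ] && [ -s "$1" ]
}

# Example use, by hand or from a cron job:
#   queue_ok /usr/local/ldm/data/ldm.pq || echo "LDM queue missing or empty"
```

Since /usr/local/ldm/data is a link to /var/data/ldm, the test follows
the link and reports on the real file.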

While I was on waldo, I decided to move the scouring of McIDAS-XCD
produced data files to the 'ldm' account:

<as 'ldm'>
cd util                   <- ~ldm/util is in the PATH for 'ldm'
cp ~mcidas/workdata/mcscour.sh .

<I looked at the contents of mcscour.sh to make sure that all the
environment variables are set correctly, and they are>

I changed the mcscour.sh logging from /home/mcidas/workdata/scour.log
to ~ldm/logs/mcscour.log.  This puts almost all of your LDM-related
log files into ~ldm/logs.  The only one that I didn't move/change
was /home/mcidas/workdata/ROUTEPP.LOG.  This can easily be moved
by editing the MCLOG setting in ~ldm/decoders/batch.k.

Next, I moved McIDAS ADDE server logging from ~mcidas/workdata to
~ldm/logs.  This required that I:

o set up a McIDAS REDIRECTion for SERVER.* in the 'mcidas' account
o change the permissions on /var/data/ldm/logs so that it was
  group writable (mcidas and mcadde are in the same group as ldm)
o move ~mcidas/workdata/SERVER.LOG to ~ldm/logs and change its
  permission to be writable by mcadde
o add a cron entry to 'ldm's crontab to rotate the SERVER.LOG* files
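
For reference, the kind of rotation the last step performs can be
sketched as below.  This is an illustration only, not the actual
crontab entry on waldo: rotate_log is a hypothetical function, and the
default of 7 kept copies is an assumption.  It copies and then
truncates the live file so that the permissions set up for mcadde are
preserved.

```shell
#!/bin/sh
# Hypothetical rotation helper (an assumption, not the command in the
# real crontab).  Keeps up to $2 old copies: LOG.1 (newest) .. LOG.N.
rotate_log() {
    log=$1
    keep=${2:-7}
    # Nothing to do if the live log does not exist.
    [ -f "$log" ] || return 0
    # Shift older copies up by one; the oldest is overwritten.
    i=$keep
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        if [ -f "$log.$prev" ]; then
            mv -f "$log.$prev" "$log.$i"
        fi
        i=$prev
    done
    # Copy then truncate, so the live file keeps its owner and mode.
    cp -p "$log" "$log.1"
    : > "$log"
}

# Example use (directory taken from the message above):
#   rotate_log /var/data/ldm/logs/SERVER.LOG 7
```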

Then, since the dostats action is commented out in 'ldm's crontab
file, I edited ~ldm/etc/ldmd.conf to stop pqbinstats from running.
This prevents the .stats files from being created in ~ldm/logs.
This is necessary because the bin/ldmadmin dostats action, normally
run from cron, is what scours the .stats files; without it they would
accumulate indefinitely.

The last thing I did was run ~ldm/util/mcscour.sh "by hand" as 'ldm'
to make sure that it worked.  It apparently does, since the March 16
.XCD file in /var/data/mcidas and its associated .IDX files were
scoured off.  This leaves that file system with about 3.5 GB of
free space:

% df -k
Filesystem            kbytes    used   avail capacity  Mounted on
/proc                      0       0       0     0%    /proc
/dev/dsk/c0d0s0      7396768 3681199 3641602    51%    /
fd                         0       0       0     0%    /dev/fd
swap                  802576     312  802264     1%    /tmp
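
If you want to keep an eye on that number, the 'avail' column can be
pulled out of the df -k listing with a small filter.  This is a sketch
only: avail_kb is a hypothetical helper, and it assumes the column
layout shown above (avail is the third field from the end, the mount
point is the last field).

```shell
#!/bin/sh
# Hypothetical helper: read 'df -k' output on stdin and print the
# available kbytes for the mount point given as $1.
avail_kb() {
    # Skip the header line; match on the last field (the mount point).
    awk -v mp="$1" 'NR > 1 && $NF == mp { print $(NF - 2) }'
}

# Example use:
#   df -k | avail_kb /
```

For the listing above this gives 3641602 kbytes for /, and
3641602 / 1048576 is roughly 3.47, which is the "about 3.5 GB" figure.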


Recap:

- the LDM was not receiving data because something had deleted the LDM
  queue even though the LDM was still running.  I remade the queue
  and restarted the LDM.  Data is being received and processed
  normally once again

- I moved the XCD scouring to an 'ldm' cron job and moved the log
  file to ~ldm/logs/mcscour.log

- I moved the McIDAS ADDE remote server logging to ~ldm/logs and set up
  a cron entry to rotate the log files

- I stopped pqbinstats from being run at LDM startup

We need to keep an eye on the McIDAS-XCD scouring done by mcscour.sh
to make sure that it continues to work.  

Please let me know if you see anything amiss on waldo.

Tom

>-----Original Message-----
>From: Clint Rowe [mailto:address@hidden]
>Sent: Tuesday, March 18, 2003 10:33 AM
>To: Anderson, Alan C. 
>Subject: Re: ldm at papagayo
>
>
>Alan,
>I seem to have all the data and papagayo's been chugging along without any
>problems.  There are some errors regarding waldo in yesterday's log file:
>
>Mar 17 10:10:08 papagayo waldo(feed)[4767]: up6.c:168: HEREIS: RPC: Unable to send; errno = Broken pipe
>Mar 17 10:10:08 papagayo waldo(feed)[4767]: up6.c:369: Product send failure: I/O error
>Mar 17 10:10:16 papagayo rpc.ldmd[28230]: child 4767 exited with status 6
>
>...
>
>Mar 17 10:21:58 papagayo waldo(feed)[28849]: up6.c:168: HEREIS: RPC: Unable to send; errno = Broken pipe
>Mar 17 10:21:58 papagayo waldo(feed)[28849]: up6.c:369: Product send failure: I/O error
>Mar 17 10:22:06 papagayo rpc.ldmd[28230]: child 28849 exited with status 6
>
>...
>
>Mar 17 10:35:22 papagayo waldo(feed)[28847]: up6.c:168: HEREIS: RPC: Unable to send; errno = Broken pipe
>Mar 17 10:35:22 papagayo waldo(feed)[28847]: up6.c:369: Product send failure: I/O error
>Mar 17 10:35:30 papagayo rpc.ldmd[28230]: child 28847 exited with status 6
>
>I think the problem is at your end, as I'm getting data and nobody else has 
>complained.
>
>Let me know if you can't get restarted.
>Clint
>
>
>>
>>Hi Clint
>>
>>We stopped getting data from papagayo yesterday,  Mar. 17 at about  10Z
>>
>>Is there a problem at unl ?
>>
>>Alan Anderson
>>St. Cloud State