Re: decoder not working on one LDM machine but it does on another LDM and Re: pbuf_flush

NOTE: The decoders mailing list is no longer active. The list archives are made available for historical reasons.

  • To: Gregory Grosshans <grosshans@xxxxxxxxxxxx>
  • Subject: Re: decoder not working on one LDM machine but it does on another LDM and Re: pbuf_flush
  • From: Robb Kambic <rkambic@xxxxxxxxxxxxxxxx>
  • Date: Thu, 6 Apr 2000 11:36:08 -0600 (MDT)
Greg,

After much discussion, I believe the problem is that the K220 computer is
too slow to support the dcmetr decoder with the extra FSL5 feed.  That's
why you are seeing all the pbuf_flush entries in the log files.  What we
believe is happening, the dcmetr decoder PIPE is getting filled, so the
LDM is spawning another dcmetr decoder.  SInce both decoders are writing
to the same file, it's getting a signal 11 and making a corrupt output
file. You should check the Gempak log file for muliple starts of the
dcmetr decoder without shutting down messages.  Also, if you are just
using the NOAAport stream the dcmetr like you said it work OK, ie it can
keep up.

Robb...

~On Thu, 6 Apr 2000, Gregory Grosshans wrote:

I recently added a NOAAPORT feed similar to the DDS/PPS feed (i.e. request FSL5
(^[A-FK-X]) ), to a machine running LDM (5.0.9) already receiving the following
feedtypes (AFOS, DDS, PPS).  Once the NOAAPORT feed started coming in the metar
decoder (i.e. dcmetr) began terminating frequently from signal 11 resulting in a
corrupted GEMPAK surface observation file.  This particular machine is a K220 (2
100 MHz CPUs with 1 GB RAM).

On a K370 (3 200 MHz CPUs and faster with 1.2 GB RAM) the same LDM (5.0.8) setup
exists and the resulting surface file is fine and the dcmetr doesn't terminate.

Do you have any ideas on why a problem exists on one machine but not on the
other?  Have you seen anything like this before?  I've provided additional
information about the two configurations and trouble shooting steps I've already
taken below.  The NOAAPORT ingest runs on a second K210 server with LDM (5.0.8)
running and the FSL acqserver code (modified by SPC/AWC to make the data look 
like
its from the FOS).  This second K210 then feeds NOAAPORT data (via the FSL5
feedtype) to the K210 and K370.

Thanks,
Gregg

Some specifics follow:

On the K370 (everything works fine) the following configuration follows:

---cut from ldmd.conf---
exec    "pqexpire -a .5"
request AFOS    .*      hostname-removed
request DDS     .*      hostname-removed
request PPS     .*      hostname-removed
request FSL5    (^[A-FK-X])     hostname-removed  This is the NOAAPORT stream
received in-house
request PCWS    "^FSL\.NetCDF\.ACARS\.QC\..*"   hostname-removed
---
$pq_size = 750000000;

The K370 uses dcuair and dcmetr decoders and the FEEDTYPE for each decoder is
FSL5|DDS.  This allows the receipt of data to be ASAP as well as redundant (in
case the FSL5 feed goes down).

On the K210 (everything except the metar GEMPAK file works fine) the following
configuration follows:

---cut from the ldmd.conf---
exec    "pqexpire -a .23333"
request AFOS    .*      hostname-removed
request DDS     .*      hostname-removed
request PPS     .*      hostname-removed
#request FSL5   (^[A-FK-X])     hostname-removed
# UA data and AVN MOS
#request FSL5   (^U[EGHKLPQSZ]|^FE)             hostname-removed
# SFC/UA data and AVN MOS
request FSL5    (^U[EGHKLPQSZ]|^FE|^S[AP])              hostname-removed
---
#$pq_size = 750000000;
$pq_size = 200000000;

The K210 uses dcuair and dcmetr decoders.  The dcuair output is fine.  The 
dcmetr
output becomes corrupted.

Before adding the entire NOAAPORT feed to the K210 it was receiving the UA and 
AVN
MOS data from NOAAPORT just fine.  The LDM at this time was 5.0.8.  After adding
the entire NOAAPORT stream (request ^[A-FK-X]) the dcmetr became corrupted.  I
then copied the
dcmetr decoder from the K370 to the K210 with the GEMPAK surface file still
becoming corrupted.  Next, I upgraded to LDM-5.0.9 on the K210 but still no
success.  Then I decreased the request statement to be just upperair, metar and
AVN MOS data.  However the surface file was still becoming corrupted.  The most
recent attempt was to lower the queue size from 750 MB to 200 MB (since i'm not
ingesting all the NOAAPORT data) but the surface file still becomes corrupted.

PBUF messages are still received  sporadically, usually soon after starting LDM
up:

ldmspock@spock> ll ldmd.*
-rw-r--r--   1 ldmspock   data       8522013 Apr  6 13:56 ldmd.log
-rw-r--r--   1 ldmspock   data       5645263 Apr  6 00:00 ldmd.log.1
-rw-r--r--   1 ldmspock   data       1262077 Apr  5 15:19 ldmd.log.2
-rw-r--r--   1 ldmspock   data       8324510 Apr  5 13:19 ldmd.log.3
-rw-r--r--   1 ldmspock   data       2973459 Apr  5 00:00 ldmd.log.4
-rw-r--r--   1 ldmspock   data       5841177 Apr  4 19:03 ldmd.log.5
-rw-r--r--   1 ldmspock   data          3835 Apr  4 16:25 ldmd.log.6
-rw-r--r--   1 ldmspock   data       26420210 Apr  4 16:23 ldmd.log.7
ldmspock@spock> grep pbuf ldmd.log ldmd.log.1
ldmd.log:Apr 06 06:10:12 spock pqact[17920]: pbuf_flush 35: time elapsed
3.339170
ldmd.log:Apr 06 06:10:23 spock pqact[17920]: pbuf_flush 35: time elapsed
6.926433
ldmd.log:Apr 06 06:10:25 spock pqact[17920]: pbuf_flush 35: time elapsed
2.288729
ldmd.log.1:Apr 05 15:25:05 spock pqact[17920]: pbuf_flush 35: time elapsed
4.764068
ldmd.log.1:Apr 05 15:25:14 spock pqact[17920]: pbuf_flush 35: time elapsed
5.619390
ldmd.log.1:Apr 05 15:25:20 spock pqact[17920]: pbuf_flush 35: time elapsed
3.642364
ldmd.log.1:Apr 05 15:25:35 spock pqact[17920]: pbuf_flush 35: time elapsed
11.463704
ldmd.log.1:Apr 05 15:25:46 spock pqact[17920]: pbuf_flush 35: time elapsed
7.433896
ldmd.log.1:Apr 05 15:25:53 spock pqact[17920]: pbuf_flush 35: time elapsed
5.602335
ldmd.log.1:Apr 05 15:26:12 spock pqact[17920]: pbuf_flush 35: time elapsed
6.995401
ldmd.log.1:Apr 05 15:26:17 spock pqact[17920]: pbuf_flush 35: time elapsed
2.613388
ldmd.log.1:Apr 05 15:26:22 spock pqact[17920]: pbuf_flush 35: time elapsed
3.780494
ldmd.log.1:Apr 05 15:26:34 spock pqact[17920]: pbuf_flush 35: time elapsed
2.924638
ldmd.log.1:Apr 05 15:26:44 spock pqact[17920]: pbuf_flush 35: time elapsed
2.035412
ldmd.log.1:Apr 05 15:30:16 spock pqact[17920]: pbuf_flush 35: time elapsed
3.280656
ldmd.log.1:Apr 05 21:10:08 spock pqact[17920]: pbuf_flush 35: time elapsed
5.369202
ldmd.log.1:Apr 05 23:10:09 spock pqact[17920]: pbuf_flush 35: time elapsed
2.265447

DCMETR processes killed follow.  The processes killed during 15Z in ldmd.log.1
were another decoder receiving test products.  As you can see there haven't been
many child processes killed today, but the file is corrupted.  Performing a 
sflist
on the surface obs file resulted in a segmentation fault will the 1Z output was
displaying.
grep child ldmd.log ldmd.log.1
ldmd.log:Apr 06 04:07:27 spock pqact[17920]: child 5137 terminated by signal 11
ldmd.log.1:Apr 05 15:28:49 spock pqact[17920]: child 20923 terminated by signal 
11

ldmd.log.1:Apr 05 15:28:50 spock pqact[17920]: child 20938 terminated by
signal 11

ldmd.log.1:Apr 05 15:28:57 spock pqact[17920]: child 21095 terminated by signal 
11

ldmd.log.1:Apr 05 15:31:52 spock pqact[17920]: child 24263 terminated by signal 
11

ldmd.log.1:Apr 05 15:31:54 spock pqact[17920]: child 24292 terminated by signal 
11

ldmd.log.1:Apr 05 15:33:17 spock pqact[17920]: child 25397 terminated by signal 
11

ldmd.log.1:Apr 05 16:07:30 spock pqact[17920]: child 5674 terminated by signal 
11
ldmd.log.1:Apr 05 16:13:49 spock pqact[17920]: child 15384 terminated by signal 
11

ldmd.log.1:Apr 05 16:31:30 spock pqact[17920]: child 24463 terminated by signal 
11

ldmd.log.1:Apr 05 16:37:15 spock pqact[17920]: child 27998 terminated by signal 
11

ldmd.log.1:Apr 05 16:57:56 spock pqact[17920]: child 9436 terminated by signal 
11
ldmd.log.1:Apr 05 17:02:03 spock pqact[17920]: child 10868 terminated by signal 
11

ldmd.log.1:Apr 05 17:17:29 spock pqact[17920]: child 20707 terminated by signal 
11

ldmd.log.1:Apr 05 17:20:54 spock pqact[17920]: child 20898 terminated by signal 
11

ldmd.log.1:Apr 05 17:21:53 spock pqact[17920]: child 22583 terminated by signal 
11

ldmd.log.1:Apr 05 17:23:35 spock pqact[17920]: child 22924 terminated by signal 
11

ldmd.log.1:Apr 05 17:27:07 spock pqact[17920]: child 23765 terminated by signal 
11

ldmd.log.1:Apr 05 17:30:04 spock pqact[17920]: child 25231 terminated by signal 
11

ldmd.log.1:Apr 05 17:41:46 spock pqact[17920]: child 741 terminated by signal 11
ldmd.log.1:Apr 05 17:46:38 spock pqact[17920]: child 1216 terminated by signal 
11
ldmd.log.1:Apr 05 17:50:08 spock pqact[17920]: child 3376 terminated by signal 
11
ldmd.log.1:Apr 05 18:01:35 spock pqact[17920]: child 8924 terminated by signal 
11
ldmd.log.1:Apr 05 18:01:55 spock pqact[17920]: child 9569 terminated by signal 
11
ldmd.log.1:Apr 05 18:02:08 spock pqact[17920]: child 9736 terminated by signal 
11
ldmd.log.1:Apr 05 18:04:20 spock pqact[17920]: child 9786 terminated by signal 
11
ldmd.log.1:Apr 05 18:05:06 spock pqact[17920]: child 10241 terminated by signal 
11

ldmd.log.1:Apr 05 18:08:47 spock pqact[17920]: child 15683 terminated by signal 
11

ldmd.log.1:Apr 05 18:14:30 spock pqact[17920]: child 16273 terminated by signal 
11

ldmd.log.1:Apr 05 18:15:40 spock pqact[17920]: child 17962 terminated by signal 
11

ldmd.log.1:Apr 05 18:15:44 spock pqact[17920]: child 18700 terminated by signal 
11

ldmd.log.1:Apr 05 18:16:07 spock pqact[17920]: child 18714 terminated by signal 
11

ldmd.log.1:Apr 05 18:16:08 spock pqact[17920]: child 18899 terminated by signal 
11

ldmd.log.1:Apr 05 18:26:21 spock pqact[17920]: child 21652 terminated by signal 
11

ldmd.log.1:Apr 05 18:26:41 spock pqact[17920]: child 23263 terminated by signal 
11

ldmd.log.1:Apr 05 18:46:58 spock pqact[17920]: child 28329 terminated by signal 
11

ldmd.log.1:Apr 05 18:57:50 spock pqact[17920]: child 6435 terminated by signal 
11
ldmd.log.1:Apr 05 18:58:56 spock pqact[17920]: child 6554 terminated by signal 
11
ldmd.log.1:Apr 05 19:02:36 spock pqact[17920]: child 7837 terminated by signal 
11
ldmd.log.1:Apr 05 19:03:07 spock pqact[17920]: child 8165 terminated by signal 
11
ldmd.log.1:Apr 05 19:03:50 spock pqact[17920]: child 8377 terminated by signal 
11
ldmd.log.1:Apr 05 19:08:03 spock pqact[17920]: child 13935 terminated by signal 
11

ldmd.log.1:Apr 05 19:09:06 spock pqact[17920]: child 14026 terminated by signal 
11





Robb Kambic wrote:

> On Thu, 30 Mar 2000, Gregory Grosshans wrote:
>
> > Can someone explain what a 'pbuf_flash 15' entry in the ldmd.log file
> > means?
> >
> > This LDM ingests the Gateway channel from NOAAPORT (as FSL5) and PPS,
> > DDS   (note request
> > patter below).  The ldm queue is 750 MB.
>
> Greg,
>
> The pbuf_flush 15 means that the ACTION ie ( FILE, STDIO, PIPE, etc )
> could not write to the destinations because the resource was not available
> at that time or it was blocked.  The log entry ususally occurs when disk
> activity is at it's peak for some reason.  It could be the LDM doing many
> disk writes or another process tying up the disk.
>
> time elapsed 5.690810
>
> means that it took 5.6 sec to complete the activity,
>
> If this occurs a bunch of time, one could start to loose data.  Then it's
> time to determine the reason for the pbuf_flush entries.
>
> Robb...
> >
> > Thanks,
> > Gregg Grosshans
> > SPC
> >
> > ---information from ldmd.conf---------
> > pqexpire -a .23333
> > request FSL5    (^[A-FK-X])     (hostname was here)
> >
> > ---information from ldmadmin---------
> > $pq_size = 750000000;
> >
> >
> > ------portion of ldmd.log-------------
> >
> > Mar 30 15:07:24 spock pqact[14333]: pbuf_flush 15: time elapsed
> > 5.690810
> > Mar 30 15:07:24 spock DCMETR[10991]: Normal Termination
> > Mar 30 15:07:24 spock DCMETR[11773]: Starting Up
> > Mar 30 15:07:31 spock pqact[14333]: pbuf_flush 15: time elapsed
> > 6.166595
> >
> >

==============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
rkambic@xxxxxxxxxxxxxxxx                   WWW: http://www.unidata.ucar.edu/
==============================================================================


  • 2000 messages navigation, sorted by:
    1. Thread
    2. Subject
    3. Author
    4. Date
    5. ↑ Table Of Contents
  • Search the decoders archives: