NOTE: The decoders
mailing list is no longer active. The list archives are made available for historical reasons.
Greg, After much discussion, I believe the problem is that the K220 computer is too slow to support the dcmetr decoder with the extra FSL5 feed. That's why you are seeing all the pbuf_flush entries in the log files. What we believe is happening, the dcmetr decoder PIPE is getting filled, so the LDM is spawning another dcmetr decoder. SInce both decoders are writing to the same file, it's getting a signal 11 and making a corrupt output file. You should check the Gempak log file for muliple starts of the dcmetr decoder without shutting down messages. Also, if you are just using the NOAAport stream the dcmetr like you said it work OK, ie it can keep up. Robb... ~On Thu, 6 Apr 2000, Gregory Grosshans wrote:
I recently added a NOAAPORT feed similar to the DDS/PPS feed (i.e. request FSL5 (^[A-FK-X]) ), to a machine running LDM (5.0.9) already receiving the following feedtypes (AFOS, DDS, PPS). Once the NOAAPORT feed started coming in the metar decoder (i.e. dcmetr) began terminating frequently from signal 11 resulting in a corrupted GEMPAK surface observation file. This particular machine is a K220 (2 100 MHz CPUs with 1 GB RAM). On a K370 (3 200 MHz CPUs and faster with 1.2 GB RAM) the same LDM (5.0.8) setup exists and the resulting surface file is fine and the dcmetr doesn't terminate. Do you have any ideas on why a problem exists on one machine but not on the other? Have you seen anything like this before? I've provided additional information about the two configurations and trouble shooting steps I've already taken below. The NOAAPORT ingest runs on a second K210 server with LDM (5.0.8) running and the FSL acqserver code (modified by SPC/AWC to make the data look like its from the FOS). This second K210 then feeds NOAAPORT data (via the FSL5 feedtype) to the K210 and K370. Thanks, Gregg Some specifics follow: On the K370 (everything works fine) the following configuration follows: ---cut from ldmd.conf--- exec "pqexpire -a .5" request AFOS .* hostname-removed request DDS .* hostname-removed request PPS .* hostname-removed request FSL5 (^[A-FK-X]) hostname-removed This is the NOAAPORT stream received in-house request PCWS "^FSL\.NetCDF\.ACARS\.QC\..*" hostname-removed --- $pq_size = 750000000; The K370 uses dcuair and dcmetr decoders and the FEEDTYPE for each decoder is FSL5|DDS. This allows the receipt of data to be ASAP as well as redundant (in case the FSL5 feed goes down). On the K210 (everything except the metar GEMPAK file works fine) the following configuration follows: ---cut from the ldmd.conf--- exec "pqexpire -a .23333" request AFOS .* hostname-removed request DDS .* hostname-removed request PPS .* hostname-removed #request FSL5 (^[A-FK-X]) hostname-removed # UA data and AVN MOS #request FSL5 (^U[EGHKLPQSZ]|^FE) hostname-removed # SFC/UA data and AVN MOS request FSL5 (^U[EGHKLPQSZ]|^FE|^S[AP]) hostname-removed --- #$pq_size = 750000000; $pq_size = 200000000; The K210 uses dcuair and dcmetr decoders. The dcuair output is fine. The dcmetr output becomes corrupted. Before adding the entire NOAAPORT feed to the K210 it was receiving the UA and AVN MOS data from NOAAPORT just fine. The LDM at this time was 5.0.8. After adding the entire NOAAPORT stream (request ^[A-FK-X]) the dcmetr became corrupted. I then copied the dcmetr decoder from the K370 to the K210 with the GEMPAK surface file still becoming corrupted. Next, I upgraded to LDM-5.0.9 on the K210 but still no success. Then I decreased the request statement to be just upperair, metar and AVN MOS data. However the surface file was still becoming corrupted. The most recent attempt was to lower the queue size from 750 MB to 200 MB (since i'm not ingesting all the NOAAPORT data) but the surface file still becomes corrupted. PBUF messages are still received sporadically, usually soon after starting LDM up: ldmspock@spock> ll ldmd.* -rw-r--r-- 1 ldmspock data 8522013 Apr 6 13:56 ldmd.log -rw-r--r-- 1 ldmspock data 5645263 Apr 6 00:00 ldmd.log.1 -rw-r--r-- 1 ldmspock data 1262077 Apr 5 15:19 ldmd.log.2 -rw-r--r-- 1 ldmspock data 8324510 Apr 5 13:19 ldmd.log.3 -rw-r--r-- 1 ldmspock data 2973459 Apr 5 00:00 ldmd.log.4 -rw-r--r-- 1 ldmspock data 5841177 Apr 4 19:03 ldmd.log.5 -rw-r--r-- 1 ldmspock data 3835 Apr 4 16:25 ldmd.log.6 -rw-r--r-- 1 ldmspock data 26420210 Apr 4 16:23 ldmd.log.7 ldmspock@spock> grep pbuf ldmd.log ldmd.log.1 ldmd.log:Apr 06 06:10:12 spock pqact[17920]: pbuf_flush 35: time elapsed 3.339170 ldmd.log:Apr 06 06:10:23 spock pqact[17920]: pbuf_flush 35: time elapsed 6.926433 ldmd.log:Apr 06 06:10:25 spock pqact[17920]: pbuf_flush 35: time elapsed 2.288729 ldmd.log.1:Apr 05 15:25:05 spock pqact[17920]: pbuf_flush 35: time elapsed 4.764068 ldmd.log.1:Apr 05 15:25:14 spock pqact[17920]: pbuf_flush 35: time elapsed 5.619390 ldmd.log.1:Apr 05 15:25:20 spock pqact[17920]: pbuf_flush 35: time elapsed 3.642364 ldmd.log.1:Apr 05 15:25:35 spock pqact[17920]: pbuf_flush 35: time elapsed 11.463704 ldmd.log.1:Apr 05 15:25:46 spock pqact[17920]: pbuf_flush 35: time elapsed 7.433896 ldmd.log.1:Apr 05 15:25:53 spock pqact[17920]: pbuf_flush 35: time elapsed 5.602335 ldmd.log.1:Apr 05 15:26:12 spock pqact[17920]: pbuf_flush 35: time elapsed 6.995401 ldmd.log.1:Apr 05 15:26:17 spock pqact[17920]: pbuf_flush 35: time elapsed 2.613388 ldmd.log.1:Apr 05 15:26:22 spock pqact[17920]: pbuf_flush 35: time elapsed 3.780494 ldmd.log.1:Apr 05 15:26:34 spock pqact[17920]: pbuf_flush 35: time elapsed 2.924638 ldmd.log.1:Apr 05 15:26:44 spock pqact[17920]: pbuf_flush 35: time elapsed 2.035412 ldmd.log.1:Apr 05 15:30:16 spock pqact[17920]: pbuf_flush 35: time elapsed 3.280656 ldmd.log.1:Apr 05 21:10:08 spock pqact[17920]: pbuf_flush 35: time elapsed 5.369202 ldmd.log.1:Apr 05 23:10:09 spock pqact[17920]: pbuf_flush 35: time elapsed 2.265447 DCMETR processes killed follow. The processes killed during 15Z in ldmd.log.1 were another decoder receiving test products. As you can see there haven't been many child processes killed today, but the file is corrupted. Performing a sflist on the surface obs file resulted in a segmentation fault will the 1Z output was displaying. grep child ldmd.log ldmd.log.1 ldmd.log:Apr 06 04:07:27 spock pqact[17920]: child 5137 terminated by signal 11 ldmd.log.1:Apr 05 15:28:49 spock pqact[17920]: child 20923 terminated by signal 11ldmd.log.1:Apr 05 15:28:50 spock pqact[17920]: child 20938 terminated by
signal 11
ldmd.log.1:Apr 05 15:28:57 spock pqact[17920]: child 21095 terminated by signal 11 ldmd.log.1:Apr 05 15:31:52 spock pqact[17920]: child 24263 terminated by signal 11 ldmd.log.1:Apr 05 15:31:54 spock pqact[17920]: child 24292 terminated by signal 11 ldmd.log.1:Apr 05 15:33:17 spock pqact[17920]: child 25397 terminated by signal 11 ldmd.log.1:Apr 05 16:07:30 spock pqact[17920]: child 5674 terminated by signal 11 ldmd.log.1:Apr 05 16:13:49 spock pqact[17920]: child 15384 terminated by signal 11 ldmd.log.1:Apr 05 16:31:30 spock pqact[17920]: child 24463 terminated by signal 11 ldmd.log.1:Apr 05 16:37:15 spock pqact[17920]: child 27998 terminated by signal 11 ldmd.log.1:Apr 05 16:57:56 spock pqact[17920]: child 9436 terminated by signal 11 ldmd.log.1:Apr 05 17:02:03 spock pqact[17920]: child 10868 terminated by signal 11 ldmd.log.1:Apr 05 17:17:29 spock pqact[17920]: child 20707 terminated by signal 11 ldmd.log.1:Apr 05 17:20:54 spock pqact[17920]: child 20898 terminated by signal 11 ldmd.log.1:Apr 05 17:21:53 spock pqact[17920]: child 22583 terminated by signal 11 ldmd.log.1:Apr 05 17:23:35 spock pqact[17920]: child 22924 terminated by signal 11 ldmd.log.1:Apr 05 17:27:07 spock pqact[17920]: child 23765 terminated by signal 11 ldmd.log.1:Apr 05 17:30:04 spock pqact[17920]: child 25231 terminated by signal 11 ldmd.log.1:Apr 05 17:41:46 spock pqact[17920]: child 741 terminated by signal 11 ldmd.log.1:Apr 05 17:46:38 spock pqact[17920]: child 1216 terminated by signal 11 ldmd.log.1:Apr 05 17:50:08 spock pqact[17920]: child 3376 terminated by signal 11 ldmd.log.1:Apr 05 18:01:35 spock pqact[17920]: child 8924 terminated by signal 11 ldmd.log.1:Apr 05 18:01:55 spock pqact[17920]: child 9569 terminated by signal 11 ldmd.log.1:Apr 05 18:02:08 spock pqact[17920]: child 9736 terminated by signal 11 ldmd.log.1:Apr 05 18:04:20 spock pqact[17920]: child 9786 terminated by signal 11 ldmd.log.1:Apr 05 18:05:06 spock pqact[17920]: child 10241 terminated by signal 11 ldmd.log.1:Apr 05 18:08:47 spock pqact[17920]: child 15683 terminated by signal 11 ldmd.log.1:Apr 05 18:14:30 spock pqact[17920]: child 16273 terminated by signal 11 ldmd.log.1:Apr 05 18:15:40 spock pqact[17920]: child 17962 terminated by signal 11 ldmd.log.1:Apr 05 18:15:44 spock pqact[17920]: child 18700 terminated by signal 11 ldmd.log.1:Apr 05 18:16:07 spock pqact[17920]: child 18714 terminated by signal 11 ldmd.log.1:Apr 05 18:16:08 spock pqact[17920]: child 18899 terminated by signal 11 ldmd.log.1:Apr 05 18:26:21 spock pqact[17920]: child 21652 terminated by signal 11 ldmd.log.1:Apr 05 18:26:41 spock pqact[17920]: child 23263 terminated by signal 11 ldmd.log.1:Apr 05 18:46:58 spock pqact[17920]: child 28329 terminated by signal 11 ldmd.log.1:Apr 05 18:57:50 spock pqact[17920]: child 6435 terminated by signal 11 ldmd.log.1:Apr 05 18:58:56 spock pqact[17920]: child 6554 terminated by signal 11 ldmd.log.1:Apr 05 19:02:36 spock pqact[17920]: child 7837 terminated by signal 11 ldmd.log.1:Apr 05 19:03:07 spock pqact[17920]: child 8165 terminated by signal 11 ldmd.log.1:Apr 05 19:03:50 spock pqact[17920]: child 8377 terminated by signal 11 ldmd.log.1:Apr 05 19:08:03 spock pqact[17920]: child 13935 terminated by signal 11 ldmd.log.1:Apr 05 19:09:06 spock pqact[17920]: child 14026 terminated by signal 11 Robb Kambic wrote: > On Thu, 30 Mar 2000, Gregory Grosshans wrote: > > > Can someone explain what a 'pbuf_flash 15' entry in the ldmd.log file > > means? > > > > This LDM ingests the Gateway channel from NOAAPORT (as FSL5) and PPS, > > DDS (note request > > patter below). The ldm queue is 750 MB. > > Greg, > > The pbuf_flush 15 means that the ACTION ie ( FILE, STDIO, PIPE, etc ) > could not write to the destinations because the resource was not available > at that time or it was blocked. The log entry ususally occurs when disk > activity is at it's peak for some reason. It could be the LDM doing many > disk writes or another process tying up the disk. > > time elapsed 5.690810 > > means that it took 5.6 sec to complete the activity, > > If this occurs a bunch of time, one could start to loose data. Then it's > time to determine the reason for the pbuf_flush entries. > > Robb... > > > > Thanks, > > Gregg Grosshans > > SPC > > > > ---information from ldmd.conf--------- > > pqexpire -a .23333 > > request FSL5 (^[A-FK-X]) (hostname was here) > > > > ---information from ldmadmin--------- > > $pq_size = 750000000; > > > > > > ------portion of ldmd.log------------- > > > > Mar 30 15:07:24 spock pqact[14333]: pbuf_flush 15: time elapsed > > 5.690810 > > Mar 30 15:07:24 spock DCMETR[10991]: Normal Termination > > Mar 30 15:07:24 spock DCMETR[11773]: Starting Up > > Mar 30 15:07:31 spock pqact[14333]: pbuf_flush 15: time elapsed > > 6.166595 > > > >
============================================================================== Robb Kambic Unidata Program Center Software Engineer III Univ. Corp for Atmospheric Research rkambic@xxxxxxxxxxxxxxxx WWW: http://www.unidata.ucar.edu/ ==============================================================================
decoders
archives: