[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: "pbuf_flush: time elapsed" problem



Justin,

>Date: Thu, 06 Oct 2005 09:23:50 -0400
>From: Justin Cooke <address@hidden>
>Organization: NOAA
>To: Steve Emmerson <address@hidden>
>Subject: Re: "pbuf_flush: time elapsed" problem

The above message contained the following:

> We completed the installation of LDM 6.4.2.4 yesterday (~15Z)  and 
> things were running well until 01:08Z today, that's when we again 
> stopped receiving NEXRAD2 data.  I did a restart of LDM once I got in 
> this morning and NEXRAD2 is coming in. 
> 
> The TOC Level II radar cluster feeds us the NEXRAD2 data (IP: 
> 140.90.85.102).
> 
> I have attached the log files from that system; they only seem to have 
> the standard logging enabled and an LDM restart is in their cron for 
> twice a day (this has to do with their cluster availability).
>
> I didn't see anything around the 01:08Z (10/06/05) time that would 
> indicate why we stopped receiving NEXRAD2 data.

Did the outage correspond to a restart of the LDM on 140.90.85.102?

If you send USR2 signals to the downstream LDM process that's receiving
the NEXRAD2 data-products from 140.90.85.102, then you can put the
process into debug mode and see where it's spending its time.  Just
don't leave it in that mode!  :-)

> Here is a snippet from our ldmd.log at the time we stopped receiving data:
> ---
> Oct 06 01:08:25 b2n1 140.90.85.102[1437928] INFO:    10539 
> 20051006010822.446 NEXRAD2 386060  L2-BZIP2/KMPX/20051006010419/386/60
> Oct 06 01:08:25 b2n1 pqact[815190] INFO:    10539 20051006010822.446 
> NEXRAD2 386060  L2-BZIP2/KMPX/20051006010419/386/60

ASIDE: The above entries indicate that a data-product that has the time
2005-10-06 01:04:19 encoded in it's product-identifier was created at
2005-10-06 01:08:22.446 -- a difference of over 4 minutes.  I hope this is
OK.

> Oct 06 01:08:25 b2n1 pqact[815190] INFO:                file: -close    
> /dcomdev/us007003/ldmdata/obs/upperair/nexrad_level2/KMPX/KMPX_20051006_010419.bz2
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:     3666 20051006010825.900 
> FNEXRAD 555  UJXX82 KWBC 060104
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/upperair/sonde/20051006.sonde
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                pipe: 
> bin/decod_dcusnd  -v 2  -t 600  -d 
> /dcomdev/us007003/decoder_logs/decod_dcusnd.log        
> tables/stns/sonde.land.tbl      tables/stns/sonde.ship.tbl      
> /dcomdev/us007003/bufrtab.002
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:       84 20051006010825.900 
> FNEXRAD 556  UANT01 KWBC 060105
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/upperair/aircraft/20051006.aircraft
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                pipe: 
> bin/decod_dcacft  -v 2  -t 300  -d 
> /dcomdev/us007003/decoder_logs/decod_dcacft.log        
> tables/stns/pirep.tbl   tables/stns/airep.tbl   
> /dcomdev/us007003/bufrtab.004
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:      194 20051006010825.900 
> FNEXRAD 557  UPPK01 OPKC 060000
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/upperair/sonde/20051006.sonde
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                pipe: 
> bin/decod_dcusnd  -v 2  -t 600  -d 
> /dcomdev/us007003/decoder_logs/decod_dcusnd.log        
> tables/stns/sonde.land.tbl      tables/stns/sonde.ship.tbl      
> /dcomdev/us007003/bufrtab.002
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:      144 20051006010825.900 
> FNEXRAD 558  SAGR33 LGAT 060100
> Oct 06 01:08:25 b2n1 140.90.85.102[1437928] INFO:    22130 
> 20051006010822.509 NEXRAD2 559026  L2-BZIP2/KLWX/20051006005943/559/26

ASIDE: The last entry indicates that a data-product that has the time
2005-10-06 00:59:43 encoded in it's product-identifier was created at
2005-10-06 01:08:22.509 -- a difference of almost 9 minutes.  I hope
this is OK.

> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/surface/metar/20051006.metar
> Oct 06 01:08:25 b2n1 pqact[889074] INFO:                pipe: 
> bin/decod_dcmetr  -v 2  -t 300  -d 
> /dcomdev/us007003/decoder_logs/decod_dcmetr.log        
> /dcomdev/us007003/bufrtab.000   tables/stns/metar.tbl
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:      556 20051006010825.964 
> FNEXRAD 579  SAUS70 KWBC 060100 RRE
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:      136 20051006010825.964 
> FNEXRAD 580  SAUS80 KWBC 060100 RRE
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:      140 20051006010825.968 
> FNEXRAD 581  SPUS80 KWBC 060104
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:      149 20051006010825.969 
> FNEXRAD 582  SASD31 OEJD 060100 RRA
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:       72 20051006010825.971 
> FNEXRAD 583  ULCM01 FKKD 060000
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:       82 20051006010825.972 
> FNEXRAD 584  SAAG00 SACO 060100
> Oct 06 01:08:25 b2n1 pqing[208898] INFO:     1177 20051006010825.974 
> FNEXRAD 585  SAIY40 LIIB 060100
> ----
> 
> The last entry related to NEXRAD2 is this entry into the queue: "Oct 06 
> 01:08:25 b2n1 140.90.85.102[1437928] INFO:    22130 20051006010822.509 
> NEXRAD2 559026  L2-BZIP2/KLWX/20051006005943/559/26", pqact never acts 
> on the entry...
> 
> I was poking around the email archives and saw where you suggested to a 
> user that 4 instances of pqact is sometimes a good idea when working 
> with the CRAFT data,
> 
> pqact -f CRAFT -p BZIP2/K[A-D] etc/pqact.craft
> pqact -f CRAFT -p BZIP2/K[E-K] etc/pqact.craft
> pqact -f CRAFT -p BZIP2/K[L-R] etc/pqact.craft
> pqact -f CRAFT -p BZIP2/K[S-Z] etc/pqact.craft
> 
> Do you think adding these entries to our ldmd.conf would make any 
> difference?

Processing data-products using multiple, disjoint pqact(1) processes can 
improve throughput over a single pqact(1) process if the single process
has reached the limit on the number of open file-descriptors and has to
constantly close file-descriptors in order to open a new ones.  You can
determine the limit on the number of open file descriptors via the
commands "ulimit -a" or "getconf OPEN_MAX".

Also, be sure that the pqact(1) configuration-file contains only
relevant entries because the pqact(1) program tests every data-product
against every entry.

Using a faster computer can also help -- as will having sufficient
memory so that the relevant processes remain in memory rather than being
swapped to disk (the top(1) utility can help determine this).

> Thanks again for the help,
> 
> Justin
[snip]

Regards,
Steve Emmerson