[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: "pbuf_flush: time elapsed" problem



Justin,

>Date: Wed, 12 Oct 2005 15:54:43 -0400
>From: Justin Cooke <address@hidden>
>Organization: NOAA/FSL
>To: Steve Emmerson <address@hidden>
>Subject: Re: "pbuf_flush: time elapsed" problem

The above message contained the following:

> Well over the holiday weekend we had two more instances where our 
> NEXRAD2 feed stopped, below are portions of the ldmd.log in DEBUG mode 
> from our (downstream) LDM.
> 
> The first one is from October 09:
> ----
> Oct 09 15:21:56 b2n1 pqact[1187888] INFO: c3fc16cf594ab8cf9d5cfeaf5c91941d    
>  9114 20051009152154.075 NEXRAD2 2030  L2-BZIP2/KSFX/20051009151924/2/30
> Oct 09 15:21:56 b2n1 pqact[1187888] INFO:                file: -close   
> /dcomdev/us007003/ldmdata/obs/upperair/nexrad_level2/KSFX/KSFX_20051009_151924.bz2
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_open: 5
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_prodput: 5 
> L2-BZIP2/KSFX/20051009151924/2/30
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_dbufput: 5
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_close: 5
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG: Delay: 0.0472 sec
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 1.9540 s
> Oct 09 15:21:56 b2n1 pqact[1187888] INFO: 29416a274c2aabe7a6c45009e0173dcf    
>  8349 20051009152154.233 NEXRAD2 523005  L2-BZIP2/KGGW/20051009152121/523/5
> Oct 09 15:21:56 b2n1 pqact[1187888] INFO:                file: -close   
> /dcomdev/us007003/ldmdata/obs/upperair/nexrad_level2/KGGW/KGGW_20051009_152121.bz2
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_open: 5
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_prodput: 5 
> L2-BZIP2/KGGW/20051009152121/523/5
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_dbufput: 5
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_close: 5
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG: Delay: 0.0420 sec
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 2.0309 s
> Oct 09 15:21:56 b2n1 pqact[1187888] INFO: bd768c57c6fbbac96075ef067052426c    
> 31240 20051009152154.163 NEXRAD2 240017  L2-BZIP2/KCBW/20051009151943/240/17
> Oct 09 15:21:56 b2n1 pqact[1187888] INFO:                file: -close   
> /dcomdev/us007003/ldmdata/obs/upperair/nexrad_level2/KCBW/KCBW_20051009_151943.bz2
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_open: 5
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_prodput: 5 
> L2-BZIP2/KCBW/20051009151943/240/17
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_dbufput: 5
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG:     unio_close: 5
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG: Delay: 0.0349 sec
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0035 s
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG: Delay: 0.0208 sec
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0010 s
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG: Delay: 0.0162 sec
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0008 s
> Oct 09 15:21:56 b2n1 pqact[1187888] DEBUG: End of Queue
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:      274 20051009152153.547 FNEXRAD 
> 054  SAAF13 LFPW 091500 RRA
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/surface/metar/20051009.metar
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                pipe: 
> bin/decod_dcmetr  -v 2  -t 300  -d 
> /dcomdev/us007003/decoder_logs/decod_dcmetr.log        
> /dcomdev/us007003/bufrtab.000       tables/stns/metar.tbl
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:      113 20051009152153.562 FNEXRAD 
> 060  UANT01 CWAO 091517
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/upperair/aircraft/20051009.aircraft
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                pipe: 
> bin/decod_dcacft  -v 2  -t 300  -d 
> /dcomdev/us007003/decoder_logs/decod_dcacft.log        tables/stns/pirep.tbl  
>      tables/stns/airep.tbl   /dcomdev/us007003/bufrtab.004
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:      327 20051009152153.563 FNEXRAD 
> 061  SICR20 GCLP 091500
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/surface/land/20051009.land
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                pipe: 
> bin/decod_dclsfc  -v 2  -t 300  -d 
> /dcomdev/us007003/decoder_logs/decod_dclsfc.log        
> /dcomdev/us007003/bufrtab.000       tables/stns/lsfc.tbl    
> tables/decod_WMO.Res40.headers
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:      164 20051009152153.566 FNEXRAD 
> 062  SSVX15 LFVW 091335
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/surface/drifter/20051009.drifter
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                pipe: 
> bin/decod_dcdrbu  -v 2  -t 365  -d 
> /dcomdev/us007003/decoder_logs/decod_dcdrbu.log        
> /dcomdev/us007003/bufrtab.001
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:      672 20051009152153.568 FNEXRAD 
> 063  SIYG20 LYBM 091500
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/surface/land/20051009.land
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                pipe: 
> bin/decod_dclsfc  -v 2  -t 300  -d 
> /dcomdev/us007003/decoder_logs/decod_dclsfc.log        
> /dcomdev/us007003/bufrtab.000       tables/stns/lsfc.tbl    
> tables/decod_WMO.Res40.headers
> ----
> 
> After  the entry at 15:21:56 the CRAFT specific pqact[1187888] only had 
> the following log entries:
> ----
> Oct 09 15:22:11 b2n1 pqact[1187888] DEBUG: Delay: 14.9890 sec
> Oct 09 15:22:11 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0001 s
> Oct 09 15:22:11 b2n1 pqact[1187888] DEBUG: Delay: 14.9866 sec
> Oct 09 15:22:11 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0000 s
> Oct 09 15:22:11 b2n1 pqact[1187888] DEBUG: Delay: 14.9481 sec
> Oct 09 15:22:11 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0091 s
> Oct 09 15:22:11 b2n1 pqact[1187888] DEBUG: Delay: 14.9480 sec
> Oct 09 15:22:11 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0000 s
> Oct 09 15:22:11 b2n1 pqact[1187888] DEBUG: Delay: 14.9377 sec

The above entries are consistent with the pqact(1) process 1187888
seeing five data-products that did not match its selection criteria.
The time interval from when the products were created until 
the time that they were inserted into the product-queue on host b2n2 was
about zero seconds.  The time interval from when the products were
created until they were seen by the pqact(1) process was, however, about
15 seconds.  The pqact(1) process must be running behind.

What was the selection-criteria for this pqact(1) process (i.e.,
feedtype and pattern)?  Would you please send me the EXEC entry of the
LDM configuration-file that started this process.

> ----
> 
> Even after several hours the pqact[1187888]  process entries look the 
> same, still with no new NEXRAD2 data:
> ----
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: Delay: 4.7625 sec
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0001 s
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: Delay: 4.7604 sec
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0001 s
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: Delay: 4.7585 sec
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0000 s
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: Delay: 4.7555 sec
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0001 s
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: Delay: 4.7536 sec
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0000 s
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: Delay: 4.7516 sec
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0000 s
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: Delay: 4.7482 sec
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0000 s
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: Delay: 4.7468 sec
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0000 s
> Oct 09 21:52:19 b2n1 pqact[1187888] DEBUG: Delay: 4.3895 sec

Again, this is consistent with the pqact(1) process seeing data-products
that don't match its selection-criteria.  Do you know for a fact that
data-products were being inserted into the product-queue that did match
the selection-criteria?  (I take it that is the case.)

> ----
> 
> Once we restarted LDM the NEXRAD2 feed resumed.
> 
> 
> The second is from October 10:
> ----
> Oct 10 11:15:10 b2n1 pqact[540898] INFO: 4f8ac5147e318902ed72b00ddb9783e9     
> 9937 20051010111508.716 NEXRAD2 397005  L2-BZIP2/KCRP/20051010111448/397/5
> Oct 10 11:15:10 b2n1 pqact[540898] INFO:                file: -close    
> /dcomdev/us007003/ldmdata/obs/upperair/nexrad_level2/KCRP/KCRP_20051010_111448.bz2
> Oct 10 11:15:10 b2n1 pqact[540898] DEBUG:     unio_open: 5
> Oct 10 11:15:10 b2n1 pqact[540898] DEBUG:     unio_prodput: 5 
> L2-BZIP2/KCRP/20051010111448/397/5
> Oct 10 11:15:10 b2n1 pqact[540898] DEBUG:     unio_dbufput: 5
> Oct 10 11:15:10 b2n1 pqact[540898] DEBUG:     unio_close: 5
> Oct 10 11:15:10 b2n1 pqact[540898] DEBUG: Delay: 0.0026 sec
> Oct 10 11:15:10 b2n1 pqact[540898] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0075 s
> Oct 10 11:15:10 b2n1 pqact[540898] DEBUG: Delay: 0.0015 sec
> Oct 10 11:15:10 b2n1 pqact[540898] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0001 s
> Oct 10 11:15:10 b2n1 pqact[540898] DEBUG: End of Queue
> Oct 10 11:15:10 b2n1 pqing[209050] INFO:      489 20051010111510.766     HDS 
> 720  IUPT43 KBOU 101052
> Oct 10 11:15:10 b2n1 pqing[209050] INFO:     1255 20051010111510.768     HDS 
> 721  IUPT42 KBOU 100953
> Oct 10 11:15:10 b2n1 pqing[209050] INFO:     1255 20051010111510.769     HDS 
> 722  IUPT42 KBOU 101052
> Oct 10 11:15:10 b2n1 pqing[209050] INFO:      217 20051010111510.772     HDS 
> 723  IUPD21 EHDB 101100
> Oct 10 11:15:10 b2n1 pqact[909472] INFO:      685 20051010111510.742 FNEXRAD 
> 890  SAZA31 FAPR 101100
> Oct 10 11:15:10 b2n1 pqact[909472] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/surface/metar/20051010.metar
> Oct 10 11:15:10 b2n1 pqact[909472] INFO:                pipe: 
> bin/decod_dcmetr  -v 2  -t 300  -d 
> /dcomdev/us007003/decoder_logs/decod_dcmetr.log        
> /dcomdev/us007003/bufrtab.000       tables/stns/metar.tbl
> ----
> 
> Again, after we stopped seeing NEXRAD2 data being entered into the 
> queue, the pqact[540898] process still has log entries:
> ----
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: Delay: 14.9999 sec
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0001 s
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: Delay: 14.9982 sec
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0001 s
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: Delay: 14.9957 sec
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0000 s
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: Delay: 14.9892 sec
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0001 s
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: Delay: 14.9566 sec
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0001 s
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: Delay: 14.9545 sec
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0000 s
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: Delay: 14.9513 sec
> Oct 10 11:15:25 b2n1 pqact[540898] DEBUG: pq_sequence(): 
> time(insert)-time(create): 0.0000 s
> ---
> 
> Restarting LDM resumed the feed.
> 
> When we upgraded to 6.3.0 from 6.0.15 the main reason was to fix the 
> same problem we are now experiencing, the NEXRAD2 feed was randomly 
> stopping and would only resume after an LDM restart.  After several 
> weeks of being on 6.3.0 the NEXRAD2 feed never stopped but that's when 
> we ran into the write error problem caused by pbuf_flush timeouts.  
> 6.4.2.4 has taken care of the pbuf_flush errors but brought back the 
> NEXRAD2 stoppages.  Are there any differences between 6.3.0 and 6.4.X 
> that could explain this? 

Nothing leaps to mind (but that means nothing).

It's as if the data-product selection-criteria was being overwritten or
corrupted, somehow, leading to the non-matching of data-products that
should match.

I'll investigate the changes from 6.3.0 to 6.4.2 in this area.

> We have saved the full ldmd.log files for both days and can make them 
> available to you if they may help.
> 
> Thanks again,
> 
> Justin
> 

Regards,
Steve Emmerson