[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: "pbuf_flush: time elapsed" problem



Justin,

>Date: Fri, 14 Oct 2005 08:39:44 -0400
>From: Justin Cooke <address@hidden>
>Organization: NOAA/FSL
>To: Steve Emmerson <address@hidden>
>Subject: Re: "pbuf_flush: time elapsed" problem

The above message contained the following:

> >How do you know that NEXRAD2 data-products stopped being inserted into
> >the product-queue around 15:21 UTC?
> >
> Oct 09 15:21:56 b2n1 140.90.85.102[1622180] INFO: 
> c3fc16cf594ab8cf9d5cfeaf5c91941d     9114 20051009152154.075 NEXRAD2 2030  
> L2-BZIP2/KSFX/20051009151924/2/30
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:     1147 20051009152153.496 FNEXRAD 
> 035  SNCN19 CWAO 091519
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                file: 
> /dcomdev/us007003/ldmdata/obs/surface/land/20051009.land
> Oct 09 15:21:56 b2n1 pqact[815108] INFO:                pipe: 
> bin/decod_dclsfc  -v 2  -t 300  -d 
> /dcomdev/us007003/decoder_logs/decod_dclsfc.log        
> /dcomdev/us007003/bufrtab.000   tables/stns/lsfc.tbl      
> tables/decod_WMO.Res40.headers
> Oct 09 15:21:56 b2n1 140.90.85.102[1622180] INFO: 
> 29416a274c2aabe7a6c45009e0173dcf     8349 20051009152154.233 NEXRAD2 523005  
> L2-BZIP2/KGGW/20051009152121/523/5
> 
> These are the last entries for the rpc.ldmd process [1622180] that is 
> putting NEXRAD2 products into the queue, there is nothing else in the 
> log for PID 1622180 after that. 

I expect, then, that an strace(1) of downstream-LDM process 1622180
would have shown that it's hung-up in making futex(2) FUTEX_WAIT calls
that never return.

> >Were data-products of feedtypes other than NEXRAD2 also being inserted
> >into the product-queue?
> >
> Yes, we were still receiving PCWS and FSL2 feedtypes:
>
> Oct 09 15:22:39 b2n1 eldm4[1507454] INFO:   727508 20051009152237.671    PCWS 
> 000  FSL.CompressedNetCDF.MADIS.acars.20051009_1300.gz
> Oct 09 15:26:46 b2n1 eldm4[1020044] INFO:    34605 20051009152616.747    FSL2 
> 000  FSL.CompressedNetCDF.MADIS.coop.20051009_1500.gz

That would explain the output of the pqact(1) process that indicated
some data-products didn't match its selection-criteria.

> As a test I have put the rpc.ldmd for NEXRAD2 into silent mode (it has 
> always been in verbose mode) to see if it has any effect.  When we were 
> in version 6.0.15 putting that feed into silent mode seemed to "fix" 
> whatever was causing the stoppage.

As far as I've been able to determine, the problem is caused by this:

    ... The output shows that the top-level LDM is calling the futex(2)
    system call using the FUTEX_WAIT operation (i.e., it's waiting on a
    mutex).  The wait is interrupted by a SIGCONT signal, which occurs
    everytime a data-product is inserted into the product-queue.  The
    thread then returns to whatever runtime function made the futex(2)
    call (the LDM code doesn't call that function directly) whereupon
    the runtime function again calls futex(2).

    The problem is that the futex(2) call never returns due to a
    FUTEX_WAKE operation on the given futex.

Unfortunately, we're not currently equiped to debug the Linux futex(2)
call -- which is what it would take to diagnose this problem.

What version of the kernel are you running?

Regards,
Steve Emmerson