
Re: 20040802: LDM pqact child Core Dump Problem




Kevin,

The broken pipe message occurs when the decoder dies,
so it is just a warning that your pqact wasn't able to
send the data to the decoder... but you knew that already.
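
To confirm which decoder is dying, the LDM log and the core file
are the quickest route, e.g. (the log path below assumes the usual
LDM account layout, so adjust for your setup):

    grep 'terminated by signal' ~ldm/logs/ldmd.log
    file core    # reports which program dumped the core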

This failure would be unrelated to the ETA and NEXRAD data losses
that you had mentioned previously. Certainly, you want to upgrade
to the most recent version of dcacft if you are seeing problems
with that decoder, though.

If you are losing MODEL and NEXRAD data, I'd suspect that your
pqact processes are backing up due to slow IO (especially
since the NEXRAD data doesn't require any PIPE actions,
just FILE actions).
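
For comparison, a FILE-only NEXRAD action writes each product straight
to disk with no decoder in the pipeline. A minimal sketch, with an
illustrative pattern and output path rather than anything from your
configuration (pqact requires tabs between the fields):

    NNEXRAD	^SDUS[2357]. (....) (......) /p(...)(...)
    	FILE	data/nexrad/\3/\4/\4_\2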

I split the pqact processing up into multiple invocations here,
as demonstrated in the $NAWIPS/ldm/etc/pqact_templates examples.
For NEXRAD and CRAFT in particular, which involve lots of FILE
actions writing to unique files, you probably want a separate
pqact process for each FEED.
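
The split itself goes in ldmd.conf, one exec line per pqact
invocation, each restricted to a feed with -f and reading its own
pattern file. Something along these lines (the pattern file names
here are just placeholders):

    exec	"pqact -f NNEXRAD etc/pqact.nexrad"
    exec	"pqact -f CRAFT etc/pqact.craft"
    exec	"pqact -f ANY-NNEXRAD-CRAFT etc/pqact.conf"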

Steve Chiswell




On Mon, 2004-08-02 at 15:16, Kevin W. Thomas wrote:
> >Kevin,
> >
> >The child of pqact is most probably a decoder or script that you
> >are piping to from pqact. If you have a core file, the command "file
> >core" will tell you what process created it, or check your decoder logs
> >and see if you can match the process ID with the child process. 
> >You can cycle pqact into verbose mode (using the -USR2 signal)
> >to get a line by line listing of pqact if you need to see each action
> >being processed to track down the child.
> >
> >One common problem with decoders is having more than one instance
> >writing to the output file, which can occur if your
> >system is falling behind. In this case, you would also probably notice
> >pbuf messages in your LDM logs. A second -USR2 signal to pqact will
> >cycle logging into debug mode. You would see "delay" messages there
> >indicating how long it takes to process products in the queue.
> >If the delay time is climbing, it also would signal that your
> >products are backing up.
> >
> >Steve Chiswell
> >Unidata User Support
> 
> Steve...
> 
> I've been able to follow the pid's to find the offending program, dcacft.
> This is the one that keeps failing.  My pqact.conf entry looks like:
> 
> DDS|IDS	(^U[ABDR].... ....|^XRXX84 KAWN|^YIXX84 KAWN) ([0-3][0-9])([0-2][0-9])
> 	PIPE	/usr/GEMPAK5.6/bin/linux/dcacft
> 	-e GEMTBL=/usr/GEMPAK5.6/gempak/tables
> 	/arpsdata/ldm1/ingest/gempak/pirep/YYMMDDHH_pirep.gem
> 
> It looks like our version of GEMPAK is rather old, so the first step is to
> get our local admin person to upgrade it.
> 
> You mention "pbuf" messages.  Are these the messages that say "pbuf_flush"?
> I've always had *LOTS* of them, so I assumed they were normal.
> 
> I just noticed that I have a bunch of
> 
>       pbuf_flush (##) write: Broken pipe
> 
> that seem to occur around the same time "dcacft" fails.
> 
> Thanks for your assist.
> 
>       == kwthomas ==
> 
> >
> >On Mon, 2004-08-02 at 12:02, Kevin W. Thomas wrote:
> >> Hi...
> >> 
> >> Recently, while looking over some ETA and NEXRAD files received via LDM, I
> >> noticed that there were periods when data was lost.  After doing lots of
> >> checking around, with the help of a local System Administrator, I discovered
> >> that the data gaps are strongly correlated with the message:
> >> 
> >>    pqact[pid]: child ##### terminated by signal 11
> >> 
> >> Signal 11 is "segmentation violation".
> >> 
> >> I have a second LDM system running a similar ingest configuration.  Checking
> >> its log files shows the same problem.
> >> 
> >> Both systems are Intel, though I don't know which CPUs.  Both run RedHat 9.x.
> >> The first has logged 29 seg faults today, with the second logging 16.  There
> >> are no common times in either log.
> >> 
> >> I checked another Intel machine, unknown version of RedHat, probably 7.x or
> >> 9.x, that had been running LDM a few months ago.  It had logged 30 seg faults
> >> on the last full day of operation at that system.
> >> 
> >> Everything is running LDM 6.0.14.
> >> 
> >> Any ideas would be greatly appreciated.
> >> 
> >>    Kevin W. Thomas
> >>    Center for Analysis and Prediction of Storms
> >>    University of Oklahoma
> >>    Norman, Oklahoma
> >>    Email:  address@hidden