
[LDM #WPM-702818]: pqact errors in LDM 6.7.0



Justin,

> We are currently running LDM 6.7.0 on an IBM P6 cluster supercomputer at
> OS AIX 5.3 and are encountering errors in our LDM after it has been
> running for approx 24 - 30 hours. Here is a clip from our ldmd.log:
> 
> Apr 30 22:01:38 c1n5 local0:warn|warning pqact[450700] WARN:
> write(956,,32768) to decoder timed-out (60 s):
> /nwprod/exec/decod_dcmetr-v2-t300-d/dcom/us007003/decoder_logs/decod_dcmetr.log/nwprod/fix/bufrtab.000/nwprod/dictionaries/metar.tbl
> Apr 30 22:03:09 c1n5 local0:warn|warning pqact[450700] WARN:
> write(956,,32768) to decoder took 58 s:
> /nwprod/exec/decod_dcmetr-v2-t300-d/dcom/us007003/decoder_logs/decod_dcmetr.log/nwprod/fix/bufrtab.000/nwprod/dictionaries/metar.tbl
> Apr 30 22:04:09 c1n5 local0:warn|warning pqact[450700] WARN:
> write(956,,32768) to decoder timed-out (60 s):
> /nwprod/exec/decod_dcmetr-v2-t300-d/dcom/us007003/decoder_logs/decod_dcmetr.log/nwprod/fix/bufrtab.000/nwprod/dictionaries/metar.tbl
> Apr 30 22:05:56 c1n5 local0:warn|warning pqact[450700] WARN:
> write(963,,32768) to decoder timed-out (60 s):
> /nwprod/exec/decod_dcmetr-v2-t300-d/dcom/us007003/decoder_logs/decod_dcmetr.log/nwprod/fix/bufrtab.000/nwprod/dictionaries/metar.tbl
> Apr 30 22:06:24 c1n5 local0:warn|warning pqact[450700] WARN:
> write(943,,32768) to decoder took 25 s:
> /nwprod/exec/decod_dcacft-v2-t300-d/dcom/us007003/decoder_logs/decod_dcacft.log/nwprod/dictionaries/pirep.tbl/nwprod/dictionaries/airep.tbl/nwprod/fix/bufrtab.004
> Apr 30 22:07:27 c1n5 local0:warn|warning pqact[450700] WARN:
> write(943,,32768) to decoder timed-out (60 s):
> /nwprod/exec/decod_dcacft-v2-t300-d/dcom/us007003/decoder_logs/decod_dcacft.log/nwprod/dictionaries/pirep.tbl/nwprod/dictionaries/airep.tbl/nwprod/fix/bufrtab.004
> Apr 30 22:08:27 c1n5 local0:warn|warning pqact[450700] WARN:
> write(970,,32768) to decoder took 36 s:
> /nwprod/exec/decod_dcmetr-v2-t300-d/dcom/us007003/decoder_logs/decod_dcmetr.log/nwprod/fix/bufrtab.000/nwprod/dictionaries/metar.tbl

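The warning messages indicate that the pqact(1) process took too long to write 
to the given decoders: each write either exceeded the 60-second timeout or 
completed only after a long delay (25 to 58 seconds in your excerpt).  This 
could be because the decoders are too slow or because the system is overloaded.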
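
To see which decoders are affected and how often, the warnings can be tallied
per decoder.  The following is only a rough sketch in Python: the function name
summarize() and the regular expression are my own assumptions, based solely on
the two message forms in your excerpt ("timed-out (N s)" and "took N s"), and
are not part of the LDM.

```python
import re
from collections import defaultdict

# Assumed pattern: covers only the two pqact warning forms seen in the excerpt.
WARN_RE = re.compile(
    r"write\((?P<fd>\d+),,(?P<nbytes>\d+)\) to decoder "
    r"(?:timed-out \((?P<timeout>\d+) s\)|took (?P<took>\d+) s):\s*(?P<cmd>\S+)"
)

def summarize(log_text):
    """Return {decoder command: {"timeouts": n, "slow": n, "max_s": n}}."""
    # Entries may be wrapped over several lines (and prefixed with "> " when
    # copied from email), so strip quote markers and join into one string.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    stats = defaultdict(lambda: {"timeouts": 0, "slow": 0, "max_s": 0})
    for m in WARN_RE.finditer(flat):
        entry = stats[m.group("cmd")]
        if m.group("timeout") is not None:
            # The write gave up after the timeout.
            entry["timeouts"] += 1
            entry["max_s"] = max(entry["max_s"], int(m.group("timeout")))
        else:
            # The write completed, but slowly enough to be logged.
            entry["slow"] += 1
            entry["max_s"] = max(entry["max_s"], int(m.group("took")))
    return dict(stats)
```

Calling summarize(open("ldmd.log").read()) (the path is whatever your log file
is actually called) returns one entry per decoder command line.  A count that
climbs for every decoder at once would point at the system rather than at one
slow decoder.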

What version of GEMPAK are you using?

Does the top(1) (or similar) utility indicate anything amiss with the load on 
the system when this happens?

Is there anything significant in the dcmetr log file 
(/dcom/us007003/decoder_logs/decod_dcmetr.log)?

> Many other processes running on the node slow down when these errors
> appear. Stopping the processes and letting them restart clears the
> slowdown until it builds up again in 24 - 30 hours. We are running two
> separate instances of LDM on two nodes of the supercomputer and both
> show these errors but at different times. IBM is investigating this
> issue as it has only recently started to occur in the last week even
> though we have had LDM running on these nodes for a couple of months. To
> assist in their troubleshooting I'm hoping that you can give some insight
> into possible system problems that would cause LDM/pqact to have these
> errors. Immediately after stopping and letting the LDM restart all the
> data that was having problems being acted on is piped to the decoders fine.

You're getting the messages because the system is overloaded rather than the 
messages causing the system to become overloaded.
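
To illustrate the direction of cause and effect, here is a minimal Python
sketch of a write to a pipe with a timeout.  It is not pqact's actual code;
timed_write() and its return values are hypothetical and exist only for this
illustration.  The point is that the writer can only make progress when the
reader (the decoder) drains the pipe, so a starved or stalled reader shows up
as a slow or timed-out write on the writer's side.

```python
import os
import select
import time

def timed_write(fd, data, timeout_s):
    """Try to write all of data to fd within timeout_s seconds.

    Returns (bytes_written, elapsed_seconds, timed_out).
    """
    os.set_blocking(fd, False)
    start = time.monotonic()
    written = 0
    while written < len(data):
        remaining = timeout_s - (time.monotonic() - start)
        if remaining <= 0:
            return written, time.monotonic() - start, True
        # Wait until the pipe has room, i.e. until the reader has drained it.
        _, writable, _ = select.select([], [fd], [], remaining)
        if not writable:
            return written, time.monotonic() - start, True
        try:
            written += os.write(fd, data[written:])
        except BlockingIOError:
            # The pipe filled up again before we could write; wait once more.
            continue
    return written, time.monotonic() - start, False
```

If nothing reads from the other end of the pipe, the call returns with
timed_out set once the pipe's buffer is full.  The writer is reporting the
stall, not causing it.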

> Thanks for any insight,
> 
> Justin Cooke
> NCEP Central Operations


Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: WPM-702818
Department: Support LDM
Priority: Normal
Status: Closed