[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: 20010128: Recurrent LDM disconnect problems



Unidata Support wrote:
> 
> ------- Forwarded Message
> 
> >From: "Arthur A. Person" <address@hidden>
> >Subject: Recurrent disconnect problems
> >Organization: Penn State
> >Keywords: 200101291535.f0TFZVX03461 LDM disconnect
> 
> Hi...
> 
> Over the weekend our ldm got into a recurrent disconnect problem with most
> or all of our upstream sites disconnecting and reconnecting every few
> minutes or seconds.  By stopping and restarting the ldm, the problem would
> disappear for awhile (maybe and hour or two) and then reappear.  Can you
> help me identify what is causing this problem?  I've place several of our
> recent ldmd.log files on anonymous ftp://navier.meteo.psu.edu/pub/ldm for
> you to reveiw.  A sample from the logs looks like this:
> 

Hi Art,

From your logs, I'm suspecting that the disconnects and reconnects are
due to sites not being able to write to your queue.  I suspect that the
reason they can't write to your queue is because pqact is having trouble
keeping up with the processing it's been told to perform, as evidenced
by the overly frequent "pbuf_flush ... time elapsed" entries.  You can
read more about this at:
http://www.unidata.ucar.edu/packages/ldm/troubleshooting/pq_del_oldest.html
and also at: 
http://www.unidata.ucar.edu/cgi-bin/mfs/65/3944?25#mfs.  These pages
will tell you how to confirm if pqact is falling behind and gives you
some possible reasons.  

I wonder if there's a problem with your disk.  Maybe it's just not fast
enough. Maybe some other unrelated process is tying up the disk.  The
'iostat' command will give you information including how much time the
system has spent waiting for I/O. The 'top' command might show you
processes that are hogging the disk.  Also, it's always good to confirm
that you have enough disk space - try 'df -k'.  If disk speed alone
really is your problem, possible solutions are: get a faster disk,
decrease the number of processes using the disk, or offload some
processing to another machine.

From the logs, it looks like the "pbuf_flush ... time elapsed" messages
have been increasing in frequency over the past few days.  Has anything
changed?  Are you requesting or processing more data?  Are you
requesting data that might have increased in volume?

I hope this helps.  If and when you discover the cause of your problem,
please let me know.

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************


> Jan 28 02:03:57 navier pqact[28053]: pbuf_flush 4: time elapsed   2.443661
> Jan 28 02:04:27 navier pqact[28052]: pbuf_flush 32: time elapsed
> 12.209329
> Jan 28 02:04:35 navier pqact[28053]: pbuf_flush 4: time elapsed   2.279346
> Jan 28 02:04:50 navier pqact[28053]: pbuf_flush 4: time elapsed   2.494542
> Jan 28 02:04:53 navier pqact[28052]: pbuf_flush 32: time elapsed
> 14.906033
> Jan 28 02:05:17 navier pqact[28053]: pbuf_flush 4: time elapsed  14.893457
> Jan 28 02:05:26 navier pqact[28052]: pbuf_flush 32: time elapsed
> 31.881709
> Jan 28 02:05:35 navier pqact[28053]: pbuf_flush 4: time elapsed   7.832572
> Jan 28 02:05:40 navier pqact[28053]: pbuf_flush 4: time elapsed   4.854681
> Jan 28 02:05:59 navier nids2area[21940]: NIDS2AREA -- BEGIN
> Jan 28 02:06:00 navier nids2area[21940]: PRODUCT CODE=RI          1027
> 223500
> Jan 28 02:06:00 navier pqact[28052]: pbuf_flush 32: time elapsed
> 2.930228
> Jan 28 02:06:00 navier nids2area[21994]: NIDS2AREA -- BEGIN
> Jan 28 02:06:00 navier nids2area[21994]: PRODUCT CODE=R1          1027
> 223500
> Jan 28 02:06:00 navier nids2area[21940]: NIDS2AREA -- DONE AREA 1158
> Jan 28 02:06:01 navier nids2area[21994]: NIDS2AREA -- DONE AREA 1098
> Jan 28 02:06:20 navier pqact[28053]: pbuf_flush 4: time elapsed   3.075678
> Jan 28 02:06:26 navier pqact[28053]: pbuf_flush 4: time elapsed   6.161774
> Jan 28 02:06:33 navier motherlode[28058]: pq_del_oldest: conflict on
> 242391760
> Jan 28 02:06:33 navier motherlode[28058]: comings: pqe_new: Resource
> temporarily unavailable
> Jan 28 02:06:33 navier motherlode[28058]:        :
> 9166f9f751ac3d1514ff96df99b37a9e    23958 20010128015404.7
> 72    NMC2 097  /u/ftp/g
> Jan 28 02:06:34 navier motherlode[28058]: Connection reset by peer
> Jan 28 02:06:34 navier motherlode[28058]: Disconnect
> Jan 28 02:06:34 navier 128.117.13.119[28055]: pq_del_oldest: conflict on
> 242391760
> Jan 28 02:06:34 navier 128.117.13.119[28055]: comings: pqe_new: Resource
> temporarily unavailable
> Jan 28 02:06:34 navier 128.117.13.119[28055]:        :
> 07ae99fd2c2b61d7c71f722d43ae5bea    26980 200101280206
> 23.647     HDS 439  ZTRB
> Jan 28 02:06:34 navier 128.117.13.119[28055]: Connection reset by peer
> Jan 28 02:06:34 navier 128.117.13.119[28055]: Disconnect
> Jan 28 02:06:34 navier sunshine[28056]: pq_del_oldest: conflict on
> 242391760
> Jan 28 02:06:34 navier sunshine[28056]: hereis: pq_insert failed: Resource
> temporarily unavailable: d05ea2b4c
> b764b3c56bdff717592e5d4
> Jan 28 02:06:34 navier sunshine[28056]: Connection reset by peer
> Jan 28 02:06:34 navier sunshine[28056]: Disconnect
> Jan 28 02:06:41 navier pqact[28053]: pbuf_flush 4: time elapsed   3.925051
> Jan 28 02:06:52 navier pqact[28053]: pbuf_flush 4: time elapsed   3.801298
> Jan 28 02:06:55 navier nids2area[22971]: NIDS2AREA -- BEGIN
> Jan 28 02:06:55 navier nids2area[22971]: PRODUCT CODE=R1          1027
> 224000
> Jan 28 02:06:55 navier nids2area[22971]: NIDS2AREA -- DONE AREA 1267
> Jan 28 02:06:57 navier nids2area[22972]: NIDS2AREA -- BEGIN
> Jan 28 02:06:57 navier nids2area[22972]: PRODUCT CODE=RI          1027
> 224000
> Jan 28 02:06:57 navier nids2area[22972]: NIDS2AREA -- DONE AREA 1257
> Jan 28 02:06:58 navier nids2area[22973]: NIDS2AREA -- BEGIN
> Jan 28 02:06:58 navier nids2area[22973]: PRODUCT CODE=R3          1027
> 223000
> Jan 28 02:06:59 navier nids2area[22973]: NIDS2AREA -- DONE AREA 1112
> Jan 28 02:07:04 navier motherlode[28058]: run_requester:
> 20010128015411.862 TS_ENDT {{SPARE,  ".*"},{NMC2,  "
> .*"}}
> Jan 28 02:07:04 navier 128.117.13.119[28055]: run_requester:
> 20010128020623.598 TS_ENDT {{FSL2|WMO,  ".*"}}
> Jan 28 02:07:04 navier motherlode[28058]: FEEDME(motherlode.ucar.edu): OK
> Jan 28 02:07:04 navier 128.117.13.119[28055]: FEEDME(128.117.13.119): OK
> Jan 28 02:07:04 navier sunshine[28056]: run_requester: 20010128020633.758
> TS_ENDT {{NNEXRAD,  ".*"}}
> Jan 28 02:07:04 navier sunshine[28056]: FEEDME(sunshine.ssec.wisc.edu): OK
> Jan 28 02:07:06 navier nids2area[23016]: NIDS2AREA -- BEGIN
> Jan 28 02:07:06 navier nids2area[23016]: PRODUCT CODE=R2          1027
> 223500
> Jan 28 02:07:06 navier nids2area[23016]: NIDS2AREA -- DONE AREA 1271
> Jan 28 02:07:06 navier pqact[28053]: pbuf_flush 4: time elapsed   3.693097
> Jan 28 02:07:09 navier nids2area[23017]: NIDS2AREA -- BEGIN
> Jan 28 02:07:09 navier nids2area[23018]: NIDS2AREA -- BEGIN
> Jan 28 02:07:09 navier nids2area[23018]: PRODUCT CODE=R6          1027
> 223000
> Jan 28 02:07:09 navier nids2area[23017]: PRODUCT CODE=R9          1027
> 223000
> Jan 28 02:07:09 navier nids2area[23018]: NIDS2AREA -- DONE AREA 1277
> Jan 28 02:07:09 navier nids2area[23017]: NIDS2AREA -- DONE AREA 1227
> Jan 28 02:07:10 navier nids2area[23022]: NIDS2AREA -- BEGIN
> Jan 28 02:07:10 navier nids2area[23022]: PRODUCT CODE=R7          1027
> 223000
> Jan 28 02:07:10 navier nids2area[23022]: NIDS2AREA -- DONE AREA 1212
> Jan 28 02:07:16 navier pqact[28053]: pbuf_flush 4: time elapsed   3.964320
> Jan 28 02:07:20 navier 128.117.13.119[28055]: pq_del_oldest: conflict on
> 57165712
> Jan 28 02:07:20 navier 128.117.13.119[28055]: comings: pqe_new: Resource
> temporarily unavailable
> Jan 28 02:07:20 navier 128.117.13.119[28055]:        :
> f8b031b07fdad0785c811bed83b25b82    26980 200101280206
> 34.686     HDS 520  ZURB
> Jan 28 02:07:20 navier motherlode[28058]: pq_del_oldest: conflict on
> 57165712
> Jan 28 02:07:20 navier motherlode[28058]: comings: pqe_new: Resource
> temporarily unavailable
> Jan 28 02:07:20 navier motherlode[28058]:        :
> a25a4b56c51d90f255cc85fb8020b550    23958 20010128015413.3
> 76    NMC2 428  /u/ftp/g
> Jan 28 02:07:20 navier 128.117.13.119[28055]: Connection reset by peer
> Jan 28 02:07:20 navier 128.117.13.119[28055]: Disconnect
> Jan 28 02:07:21 navier motherlode[28058]: Connection reset by peer
> Jan 28 02:07:21 navier motherlode[28058]: Disconnect
> Jan 28 02:07:27 navier nids2area[23030]: NIDS2AREA -- BEGIN
> Jan 28 02:07:27 navier nids2area[23030]: PRODUCT CODE=R8          1027
> 223000
> Jan 28 02:07:27 navier nids2area[23030]: NIDS2AREA -- DONE AREA 1192
> Jan 28 02:07:28 navier nids2area[23033]: NIDS2AREA -- BEGIN
> Jan 28 02:07:28 navier nids2area[23033]: PRODUCT CODE=RA          1027
> 223000
> Jan 28 02:07:28 navier nids2area[23033]: NIDS2AREA -- DONE AREA 1187
> Jan 28 02:07:48 navier pqact[28053]: pbuf_flush 4: time elapsed  20.578496
> Jan 28 02:07:51 navier 128.117.13.119[28055]: run_requester:
> 20010128020633.618 TS_ENDT {{FSL2|WMO,  ".*"}}
> Jan 28 02:07:51 navier motherlode[28058]: run_requester:
> 20010128015413.234 TS_ENDT {{SPARE,  ".*"},{NMC2,  "
> .*"}}
> Jan 28 02:07:51 navier 128.117.13.119[28055]: FEEDME(128.117.13.119): OK
> Jan 28 02:07:51 navier motherlode[28058]: FEEDME(motherlode.ucar.edu): OK
> Jan 28 02:07:55 navier pqact[28053]: pbuf_flush 4: time elapsed   3.921334
> Jan 28 02:07:59 navier pqact[28053]: pbuf_flush 4: time elapsed   2.049001
> Jan 28 02:08:07 navier 128.117.13.119[28055]: pq_del_oldest: conflict on
> 281979744
> Jan 28 02:08:07 navier 128.117.13.119[28055]: comings: pqe_new: Resource
> temporarily unavailable
> Jan 28 02:08:07 navier 128.117.13.119[28055]:        :
> edf039aaad1b0014ab819a2a5799d375    23996 200101280206
> 40.384     HDS 591  ZORB
> Jan 28 02:08:07 navier 128.117.13.119[28055]: Connection reset by peer
> Jan 28 02:08:07 navier 128.117.13.119[28055]: Disconnect
> Jan 28 02:08:07 navier motherlode[28058]: pq_del_oldest: conflict on
> 281979744
> Jan 28 02:08:07 navier motherlode[28058]: comings: pqe_new: Resource
> temporarily unavailable
> Jan 28 02:08:07 navier motherlode[28058]:        :
> 08aa8a8d165e5621529588507e80dd4e    38874 20010128015413.9
> 17    NMC2 461  /u/ftp/g
> Jan 28 02:08:07 navier motherlode[28058]: Connection reset by peer
> Jan 28 02:08:07 navier motherlode[28058]: Disconnect
> Jan 28 02:08:08 navier nids2area[23095]: NIDS2AREA -- BEGIN
> Jan 28 02:08:08 navier nids2area[23095]: PRODUCT CODE=RJ          1027
> 223500
> Jan 28 02:08:08 navier nids2area[23096]: NIDS2AREA -- BEGIN
> Jan 28 02:08:08 navier nids2area[23095]: NIDS2AREA -- DONE AREA 1373
> Jan 28 02:08:09 navier nids2area[23096]: PRODUCT CODE=RE          1027
> 223500
> Jan 28 02:08:09 navier nids2area[23097]: NIDS2AREA -- BEGIN
> Jan 28 02:08:09 navier nids2area[23096]: NIDS2AREA -- DONE AREA 1399
> 
>                                          Thanks.
> 
>                                            Art.
> 
> Arthur A. Person
> Research Assistant, System Administrator
> Penn State Department of Meteorology
> email:  address@hidden, phone:  814-863-1563
> 
> ------- End of Forwarded Message