
Re: 20010123: Strange LDM freezes



Pete Stamus wrote:
> 
> Hi Anne...
> 
> We had another freeze last night, around 0350z (like last time).  I have a
> script running that checks the size of the metar files, and if they're
> missing or too small it emails my pager, so I started looking at the system
> around 0415z.  The system seemed to be responding normally (wasn't sluggish
> or anything).  So, I started collecting stuff that you suggested...here are
> a few hundred lines of it :)
> 
> The first thing I did was an "iostat"
> 
>       tty          fd0           sd0          sd16          nfs1          cpu
>  tin tout kps tps serv  kps tps serv  kps tps serv  kps tps serv  us sy wt id
>    0    1   0   0    0  326  14  110    0   0    0    0   0    0   7  6  4 84
> 
> Then I did an "ldmadmin check", which just sat there.  After about 2 min with
> no messages or any signs of activity, I did a ^y out of it.  Then I did an
> "ldmadmin queuecheck", which returned without a message.  Next I did a
> "ps -elf"
> 
>  F S      UID   PID  PPID  C PRI NI     ADDR     SZ    WCHAN    STIME TTY     
>  TIME CMD
> 19 T     root     0     0  0   0 SY fec14660      0            Jan 10 ?       
>  0:05 sched
>  8 S     root     1     0  0  41 20 e0a5e728    462 e0a5e94c   Jan 10 ?       
>  6:05 /etc/init -
> 19 S     root     2     0  0   0 SY e0a5e008      0 fec2bc50   Jan 10 ?       
>  0:00 pageout
> 19 S     root     3     0  0   0 SY e0a61730      0 fec7b008   Jan 10 ?       
> 13:01 fsflush
>  8 S     root   248   237  0  40 20 e0aac738    509 e07d1af8   Jan 10 ?       
>  0:00 /opt/nport/bin/inge
>  8 S     root   233     1  0  41 20 e0d6c778    367 e0729216   Jan 10 ?       
>  0:01 /usr/lib/nfs/nfsd -a 16
>  8 S     root   240     1  0  41 20 e0a61010    336 e07d1d38   Jan 10 ?       
>  0:01 /usr/lib/saf/sac -t 300
>  8 S   daemon   147     1  0  67 20 e0aac018    495 e07296c6   Jan 10 ?       
>  0:00 /usr/lib/nfs/statd
>  8 S     root   197     1  0  41 20 e0ab6020    522 e0729df6   Jan 10 ?       
>  0:00 /usr/sbin/vold
>  8 S     root   107     1  0  40 20 e0c3d748    479 e0729996   Jan 10 ?       
>  1:24 /usr/sbin/rpcbind
>  8 S     root   109     1  0  46 20 e0c3d028    458 e07298f6   Jan 10 ?       
>  0:00 /usr/sbin/keyserv
>  8 S     root   151     1  0  51 20 e0c9f750    409 e07297b6   Jan 10 ?       
>  0:00 /usr/sbin/inetd -s
>  8 S     root   183     1  0  41 20 e0c9f030    681 e07294e6   Jan 10 ?       
>  0:00 /usr/lib/lpsched
>  8 S     root   154     1  0  41 20 e0cae758    816 e07295d6   Jan 10 ?       
>  2:45 /usr/sbin/syslogd
>  8 S     root   231     1  0  41 20 e0ab6740    609  8052dd0   Jan 10 ?       
>  0:02 /usr/lib/nfs/mountd
>  8 S     root   143     1  0  67 20 e0cd9760    370 e07298a6   Jan 10 ?       
>  0:00 /usr/lib/nfs/lockd
>  8 S     root   163     1  0  41 20 e0cd9040    364 e07d1eb8   Jan 10 ?       
>  0:00 /usr/sbin/cron
>  8 Z     root   244   237  0   0                                              
>  0:00 <defunct>
>  8 S     root   196     1  0  41 20 e0d41048    205 e0729536   Jan 10 ?       
>  0:01 /usr/lib/utmpd
>  8 S     root 16766     1  0  51 20 e0d41768    361 e0729bc6   Jan 29 console 
>  0:00 /usr/lib/saf/ttymon -g -h -p noaapo
>  8 S     root   225     1  0  40  8 e0d4c050    434 e0d4c274   Jan 10 ?       
>  0:08 /usr/lib/inet/xntpd
>  8 S     root   237     1  0  40 20 e0d6c058    685 e07d1a38   Jan 10 ?       
> 1996:04 /opt/nport/bin/inge
>  8 S     root   243   240  0  41 20 e0d6e780    358 e0d6e9a4   Jan 10 ?       
>  0:01 /usr/lib/saf/ttymon
>  8 S      ldm 26412 26410  0  41 20 e1235758  31534 e123597c   Jan 26 ?       
>  7:38 pqact
>  8 S      ldm 26413 26410  0  41 20 e0726108  31537 e0db2606   Jan 26 ?       
> 27:11 pqing -5 -f WMO -v -P 1501 noaaport
>  8 S      ldm 26410     1  0  40 20 e0bd6008  31507 e0dcd17e   Jan 26 ?       
>  0:00 rpc.ldmd -v -q /home/ldm/data/ldm.p
>  8 S   nobody  1277   151  0  41 20 e0d4c770    185 e07d1978   Jan 10 ?       
> 17:36 cat /tmp/jmb.fifo.2
>  8 R      ldm 13574 13572  0  41 20 e1b431c0    433          21:13:41 pts/0   
>  0:00 -csh
>  8 S   nobody 26414   151  0  41 20 e1b438e0    185 e0db2a26   Jan 26 ?       
>  0:35 cat /tmp/jmb.fifo.1
>  8 S     root 13747 13746  0   0 RT e1be48a8    233 e0ab05b8 21:22:07 ?       
>  0:00 /opt/nport/bin/inge
>  8 S      ldm 26428 26410  0  41 20 e0ffd028  31507 e0ffd24c   Jan 26 ?       
>  4:11 rpc.ldmd -v -q /home/ldm/data/ldm.p
>  8 O     root 13748 13574  0  41 20 e0726828    329          21:22:09 pts/0   
>  0:00 ps -elf
>  8 S     root 13746 13745  0   0 RT e17ae170    493 e072c74c 21:22:02 ?       
>  0:00 /opt/nport/bin/inge
>  8 S      ldm 26411 26410  0  41 20 e0d6e060  31572 e0d6e284   Jan 26 ?       
>  3:16 pqbinstats
>  8 S     root 13572   151  0  61 20 e0bd6728    410 e0dcddae 21:13:41 ?       
>  0:00 in.telnetd
>  8 S     root 13745     1  0   0 RT e16d2178    493 e0729064 21:21:57 ?       
>  0:00 /opt/nport/bin/inge
> 
> Then I did an "lsof -p <pid>" on the rpc.ldmd process
> 
> COMMAND    PID USER   FD   TYPE     DEVICE  SIZE/OFF   NODE NAME
> rpc.ldmd 26410  ldm  cwd   VDIR       63,3     10240  36160 /var/data/ldm/logs
> rpc.ldmd 26410  ldm  txt   VREG       63,6   1393856 392858 /home 
> (/dev/dsk/c1t0d0s6)
> rpc.ldmd 26410  ldm  txt   VREG       63,3 127057920  72262 
> /var/data/ldm/ldm.pq
> rpc.ldmd 26410  ldm  txt   VREG       63,4    227076   6194 
> /usr/lib/libresolv.so.2
> rpc.ldmd 26410  ldm  txt   VREG       63,4     11488   6212 
> /usr/lib/nss_dns.so.1
> rpc.ldmd 26410  ldm  txt   VREG       63,4     26392   6213 
> /usr/lib/nss_files.so.1
> rpc.ldmd 26410  ldm  txt   VREG       63,4      9940   6217 /usr 
> (/dev/dsk/c1t0d0s4)
> rpc.ldmd 26410  ldm  txt   VREG       63,4     17388   6183 
> /usr/lib/libmp.so.2
> rpc.ldmd 26410  ldm  txt   VREG       63,4    936736   6175 /usr/lib/libc.so.1
> rpc.ldmd 26410  ldm  txt   VREG       63,4     52988   6201 
> /usr/lib/libsocket.so.1
> rpc.ldmd 26410  ldm  txt   VREG       63,4    684112   6186 
> /usr/lib/libnsl.so.1
> rpc.ldmd 26410  ldm  txt   VREG       63,4     26256 156841 
> /usr/ucblib/librpcsoc.so.1
> rpc.ldmd 26410  ldm  txt   VREG       63,4      5636   6210 /usr 
> (/dev/dsk/c1t0d0s4)
> rpc.ldmd 26410  ldm  txt   VREG       63,4     59624   6223 /usr/lib/libm.so.1
> rpc.ldmd 26410  ldm  txt   VREG       63,4      4372   6167 
> /usr/lib/libdl.so.1
> rpc.ldmd 26410  ldm  txt   VREG       63,4    173272   6064 /usr/lib/ld.so.1
> rpc.ldmd 26410  ldm    0u  inet 0xe13f0018       0t0    TCP *:ldm (LISTEN)
> rpc.ldmd 26410  ldm    1u  VREG       63,3 127057920  72262 
> /var/data/ldm/ldm.pq
> rpc.ldmd 26410  ldm    2w  VCHR       21,0       0t0  60206 
> /devices/pseudo/log@0:conslog->LOG
> 
> and on the pqing process
> 
> COMMAND   PID USER   FD   TYPE     DEVICE   SIZE/OFF   NODE NAME
> pqing   26413  ldm  cwd   VDIR       63,3      10240  36160 /var/data/ldm/logs
> pqing   26413  ldm  txt   VREG       63,6    1074520 392842 /home 
> (/dev/dsk/c1t0d0s6)
> pqing   26413  ldm  txt   VREG       63,3  127057920  72262 
> /var/data/ldm/ldm.pq
> pqing   26413  ldm  txt   VREG       63,4      26392   6213 
> /usr/lib/nss_files.so.1
> pqing   26413  ldm  txt   VREG       63,4      17388   6183 
> /usr/lib/libmp.so.2
> pqing   26413  ldm  txt   VREG       63,4     936736   6175 /usr/lib/libc.so.1
> pqing   26413  ldm  txt   VREG       63,4      52988   6201 
> /usr/lib/libsocket.so.1
> pqing   26413  ldm  txt   VREG       63,4     684112   6186 
> /usr/lib/libnsl.so.1
> pqing   26413  ldm  txt   VREG       63,4      26256 156841 
> /usr/ucblib/librpcsoc.so.1
> pqing   26413  ldm  txt   VREG       63,4      59624   6223 /usr/lib/libm.so.1
> pqing   26413  ldm  txt   VREG       63,4       4372   6167 
> /usr/lib/libdl.so.1
> pqing   26413  ldm  txt   VREG       63,4     173272   6064 /usr/lib/ld.so.1
> pqing   26413  ldm    0u  inet 0xe13f0398 0x2531eaf2    TCP 
> noaaport.colorado-research.com:32951->noaaport.colorado-research.com:nporta 
> (ESTABLISHED)
> pqing   26413  ldm    1w  VCHR        0,0        0t0  60201 
> /devices/pseudo/cn@0:console
> pqing   26413  ldm    2w  VCHR       21,0        0t0  60206 
> /devices/pseudo/log@0:conslog->LOG
> pqing   26413  ldm    3u  VREG       63,3  127057920  72262 
> /var/data/ldm/ldm.pq
> 
> Here are the last system messages from /var/adm/messages
> 
> Jan 28 10:59:59 noaaport pqing[26413]: Not a WMO format message.             
> 145         976   @RU00 KWBC 281059
> Jan 28 11:03:43 noaaport pqing[26413]: Not a WMO format message.             
> 108         146   @RU00 KWBC 281103
> Jan 28 12:04:15 noaaport pqing[26413]: Not a WMO format message.             
> 108         719   @RU00 KWBC 281204
> Jan 28 13:05:03 noaaport pqing[26413]: Not a WMO format message.             
> 108         341   @RU00 KWBC 281305
> Jan 28 16:01:44 noaaport pqing[26413]: Not a WMO format message.             
> 108         469   @RU00 KWBC 281601
> Jan 28 16:56:10 noaaport pqing[26413]: Not a WMO format message.             
> 100         188   @RU00 KWBC 281656
> Jan 28 17:02:54 noaaport pqing[26413]: Not a WMO format message.             
> 100         004   @RU00 KWBC 281702
> Jan 28 17:04:04 noaaport pqing[26413]: Not a WMO format message.             
> 108         591   @RU00 KWBC 281704
> Jan 28 19:37:00 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 28 19:50:34 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 28 20:25:21 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 28 23:03:37 noaaport pqing[26413]: Not a WMO format message.             
> 114         557   @RU00 KWBC 282303
> Jan 28 23:03:59 noaaport pqing[26413]: Not a WMO format message.             
> 114         681   @RU00 KWBC 282303
> Jan 29 00:14:42 noaaport pqing[26413]: Not a WMO format message.             
> 114         665   @RU00 KWBC 290014
> Jan 29 01:08:52 noaaport pqing[26413]: Not a WMO format message.             
> 114         491   @RU00 KWBC 290108
> Jan 29 02:26:48 noaaport pqing[26413]: Not a WMO format message.             
> 114         321   @RU00 KWBC 290226
> Jan 29 03:09:45 noaaport pqing[26413]: Not a WMO format message.             
> 114         042   @RU00 KWBC 290309
> Jan 29 04:51:13 noaaport pqing[26413]: Not a WMO format message.              
> 99         850   @RU00 KWBC 290451
> Jan 29 08:44:21 noaaport pqing[26413]: Not a WMO format message.             
> 114         149   @RU00 KWBC 290844
> Jan 29 10:12:54 noaaport pqing[26413]: Not a WMO format message.             
> 108         179   @RU00 KWBC 291012
> Jan 29 10:47:01 noaaport pqing[26413]: Not a WMO format message.             
> 118         188   @RU00 KWBC 291047
> Jan 29 13:06:45 noaaport pqing[26413]: Not a WMO format message.             
> 108         140   @RU00 KWBC 291306
> Jan 29 15:10:01 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 29 15:10:39 noaaport last message repeated 1 time
> Jan 29 08:26:33 noaaport unix: WARNING: iprb0: no MII link detected
> Jan 29 08:26:38 noaaport unix: NOTICE: iprb0: 100 Mbps full-duplex link up
> Jan 29 20:10:38 noaaport pqing[26413]: Not a WMO format message.             
> 108         129   @RU00 KWBC 292010
> Jan 29 21:34:08 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 29 22:14:10 noaaport pqing[26413]: Not a WMO format message.             
> 122         428   @RU00 KWBC 292214
> Jan 29 22:46:17 noaaport pqing[26413]: Not a WMO format message.             
> 132         597   @RU00 KWBC 292246
> Jan 30 00:06:34 noaaport pqing[26413]: Not a WMO format message.             
> 117         509   @RU00 KWBC 300006
> Jan 30 01:35:48 noaaport pqing[26413]: Not a WMO format message.             
> 114         078   @RU00 KWBC 300135
> Jan 30 02:13:01 noaaport pqing[26413]: Not a WMO format message.             
> 114         294   @RU00 KWBC 300213
> Jan 30 02:20:46 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 03:04:17 noaaport pqing[26413]: Not a WMO format message.             
> 114         022   @RU00 KWBC 300304
> Jan 30 04:56:45 noaaport pqing[26413]: Not a WMO format message.             
> 132         954   @RU00 KWBC 300456
> Jan 30 07:38:52 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 07:40:15 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 08:23:34 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 14:07:48 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 16:54:50 noaaport pqing[26413]: Not a WMO format message.             
> 146         122   @RU00 KWBC 301654
> Jan 30 20:03:25 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 22:53:06 noaaport pqing[26413]: Not a WMO format message.             
> 152         626   @RU00 KWBC 302253
> Jan 31 00:06:00 noaaport pqing[26413]: Not a WMO format message.             
> 114         753   @RU00 KWBC 310006
> Jan 31 01:17:49 noaaport pqing[26413]: Not a WMO format message.             
> 114         829   @RU00 KWBC 310117
> Jan 31 02:07:20 noaaport pqing[26413]: Not a WMO format message.             
> 114         791   @RU00 KWBC 310207
> Jan 31 02:44:22 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 31 03:08:26 noaaport pqing[26413]: Not a WMO format message.             
> 114         117   @RU00 KWBC 310308
> 
> Then I did an "ldmadmin stop", and it said that it was stopping the ldm.
> When it was done, I did an "ldmadmin ps", and it said that there were no ldm
> processes running.  But I did a "ps -elf" and both the 'rpc.ldmd' and 'pqing'
> processes were still there
> 
>  F S      UID   PID  PPID  C PRI NI     ADDR     SZ    WCHAN    STIME TTY     
>  TIME CMD
> 19 T     root     0     0  0   0 SY fec14660      0            Jan 10 ?       
>  0:05 sched
>  8 S     root     1     0  0  41 20 e0a5e728    462 e0a5e94c   Jan 10 ?       
>  6:05 /etc/init -
> 19 S     root     2     0  0   0 SY e0a5e008      0 fec2bc50   Jan 10 ?       
>  0:00 pageout
> 19 S     root     3     0  0   0 SY e0a61730      0 fec7b008   Jan 10 ?       
> 13:01 fsflush
>  8 S     root   248   237  0  40 20 e0aac738    509 e07d1af8   Jan 10 ?       
>  0:00 /opt/nport/bin/inge
>  8 S     root   233     1  0  41 20 e0d6c778    367 e0729216   Jan 10 ?       
>  0:01 /usr/lib/nfs/nfsd -a 16
>  8 S     root   240     1  0  41 20 e0a61010    336 e07d1d38   Jan 10 ?       
>  0:01 /usr/lib/saf/sac -t 300
>  8 S   daemon   147     1  0  67 20 e0aac018    495 e07296c6   Jan 10 ?       
>  0:00 /usr/lib/nfs/statd
>  8 S     root   197     1  0  41 20 e0ab6020    522 e0729df6   Jan 10 ?       
>  0:00 /usr/sbin/vold
>  8 S     root   107     1  0  41 20 e0c3d748    479 e0729996   Jan 10 ?       
>  1:24 /usr/sbin/rpcbind
>  8 S     root   109     1  0  46 20 e0c3d028    458 e07298f6   Jan 10 ?       
>  0:00 /usr/sbin/keyserv
>  8 S     root   151     1  0  51 20 e0c9f750    409 e07297b6   Jan 10 ?       
>  0:00 /usr/sbin/inetd -s
>  8 S     root   183     1  0  41 20 e0c9f030    681 e07294e6   Jan 10 ?       
>  0:00 /usr/lib/lpsched
>  8 S     root   154     1  0  41 20 e0cae758    816 e07295d6   Jan 10 ?       
>  2:45 /usr/sbin/syslogd
>  8 S     root   231     1  0  41 20 e0ab6740    609  8052dd0   Jan 10 ?       
>  0:02 /usr/lib/nfs/mountd
>  8 S     root   143     1  0  67 20 e0cd9760    370 e07298a6   Jan 10 ?       
>  0:00 /usr/lib/nfs/lockd
>  8 S     root   163     1  0  41 20 e0cd9040    364 e07d1eb8   Jan 10 ?       
>  0:00 /usr/sbin/cron
>  8 Z     root   244   237  0   0                                              
>  0:00 <defunct>
>  8 S     root   196     1  0  41 20 e0d41048    205 e0729536   Jan 10 ?       
>  0:01 /usr/lib/utmpd
>  8 S     root 16766     1  0  51 20 e0d41768    361 e0729bc6   Jan 29 console 
>  0:00 /usr/lib/saf/ttymon -g -h -p noaapo
>  8 S     root   225     1  0  40  8 e0d4c050    434 e0d4c274   Jan 10 ?       
>  0:08 /usr/lib/inet/xntpd
>  8 S     root   237     1  0  40 20 e0d6c058    685 e07d1a38   Jan 10 ?       
> 1996:04 /opt/nport/bin/inge
>  8 S     root   243   240  0  41 20 e0d6e780    358 e0d6e9a4   Jan 10 ?       
>  0:01 /usr/lib/saf/ttymon
>  8 O     root 13857 13574  0  41 20 e0ffd028    329          21:28:15 pts/0   
>  0:00 ps -elf
>  8 S      ldm 26413 26410  0  41 20 e0726108  31537 e0db2606   Jan 26 ?       
> 27:11 pqing -5 -f WMO -v -P 1501 noaaport
>  8 S      ldm 26410     1  0  45 20 e0bd6008    487 e0bd6074   Jan 26 ?       
>  0:00 rpc.ldmd -v -q /home/ldm/data/ldm.p
>  8 S   nobody  1277   151  0  41 20 e0d4c770    185 e07d1978   Jan 10 ?       
> 17:36 cat /tmp/jmb.fifo.2
>  8 R      ldm 13574 13572  0  41 20 e1b431c0    433          21:13:41 pts/0   
>  0:00 -csh
>  8 S   nobody 26414   151  0  41 20 e1b438e0    185 e0db2a26   Jan 26 ?       
>  0:35 cat /tmp/jmb.fifo.1
>  8 S     root 13854     1  0   0 RT e1be4188    493 e0729064 21:28:03 ?       
>  0:00 /opt/nport/bin/inge
>  8 S     root 13572   151  0  61 20 e0bd6728    410 e0dcddae 21:13:41 ?       
>  0:00 in.telnetd
>  8 S     root 13856 13855  0   0 RT e1235758    233 e0ab05b8 21:28:13 ?       
>  0:00 /opt/nport/bin/inge
>  8 S     root 13855 13854  0   0 RT e0726828    493 e072c59c 21:28:08 ?       
>  0:00 /opt/nport/bin/inge
> 
> Then I tried a "kill -HUP" on the rpc.ldmd (no effect) and the pqing (no
> effect either) processes; when they didn't do anything I did a "kill -9" and
> they both went away (one at a time).  I did another "ldmadmin queuecheck",
> which returned with no comments, then did an "ldmadmin start" which started
> everything back up just fine.
> 
> I didn't see anything obvious in the log files; here's the last 25 lines
> 
> Jan 31 03:53:28 noaaport pqing[26413]:      128 20010131035328.881     WMO 
> 895  SAUS41 KLWX 310353 /pMTRDCA
> Jan 31 03:53:28 noaaport pqing[26413]:      127 20010131035328.888     WMO 
> 896  SAUS41 KOKX 310353 /pMTRTEB
> Jan 31 03:53:29 noaaport pqing[26413]:     1820 20010131035329.282     WMO 
> 903  FZHW50 PHFO 310353 /pCWFHI
> Jan 31 03:53:29 noaaport pqing[26413]:     1491 20010131035329.709     WMO 
> 911  FPUS71 KBGM 310353 /pNOWBGM
> Jan 31 03:53:36 noaaport pqing[26413]:      152 20010131035336.867     WMO 
> 960  SAUS41 KILN 310353 /pMTRCVG
> Jan 31 03:53:36 noaaport pqing[26413]:     1779 20010131035336.915     WMO 
> 961  FPUS71 KILN 310353 /pNOWILN
> Jan 31 03:53:36 noaaport pqing[26413]:      125 20010131035336.946     WMO 
> 962  SAUS41 KBOX 310353 /pMTRBDL
> Jan 31 03:53:37 noaaport pqing[26413]:      123 20010131035337.128     WMO 
> 963  SAUS41 KAKQ 310353 /pMTRORF
> Jan 31 03:53:39 noaaport pqing[26413]:      144 20010131035339.695     WMO 
> 975  SAUS43 KMKX 310353 /pMTRMKE
> Jan 31 03:53:39 noaaport pqing[26413]:      358 20010131035339.914     WMO 
> 982  SXUS70 KWAL 310351
> Jan 31 03:53:39 noaaport pqing[26413]: Product already in queue
> Jan 31 03:53:41 noaaport pqing[26413]:      142 20010131035341.038     WMO 
> 008  SAUS41 KRLX 310353 /pMTRHTS
> Jan 31 03:53:41 noaaport pqing[26413]:      581 20010131035341.080     WMO 
> 011  SXUS81 KCLE 310353 /pOMRCLE
> Jan 31 03:53:41 noaaport pqing[26413]:      139 20010131035341.109     WMO 
> 012  SAUS45 KPIH 310353 /pMTRSNT
> Jan 31 03:53:45 noaaport pqing[26413]:      128 20010131035345.372     WMO 
> 024  SAUS41 KOKX 310353 /pMTRLGA
> Jan 31 03:53:45 noaaport pqing[26413]:      123 20010131035345.420     WMO 
> 025  SAUS41 KBOX 310353 /pMTRPVD
> Jan 31 03:53:45 noaaport pqing[26413]:      136 20010131035345.443     WMO 
> 026  SAUS41 KPHI 310353 /pMTRABE
> Jan 31 03:53:45 noaaport pqing[26413]:      140 20010131035345.474     WMO 
> 028  SAUS41 KOKX 310353 /pMTRJFK
> Jan 31 03:53:50 noaaport pqing[26413]:      128 20010131035350.827     WMO 
> 079  SAUS45 KLKN 310353 /pMTRAWH
> Jan 31 04:26:08 noaaport rpc.ldmd[26410]: Exiting
> Jan 31 04:26:08 noaaport rpc.ldmd[26410]: Terminating process group
> Jan 31 04:26:08 noaaport fen00(feed)[26428]: Exiting
> Jan 31 04:26:08 noaaport pqbinstats[26411]: Exiting
> Jan 31 04:26:08 noaaport pqact[26412]: Exiting
> Jan 31 04:28:48 noaaport rpc.ldmd[26410]: _NOT_ ReReading configuration file 
> /home/ldm/etc/ldmd.conf
> 
> I have the complete set of logs for this cycle...from when I restarted things
> on the 26th.  Let me know just how much (if any) you'd like to see...even
> compressed they're a bit large...and I can get them to you.  Any suggestions
> what to look for??
> 
> Thanks for your help...drop me a line or give me a call if there's anything
> else I can provide.
> 
> ps
> -------------------------------------------------------------------------
> Pete Stamus                          | Phone: (303) 415-9701 x224
> Colorado Research Associates (CoRA)* | Fax:   (303) 415-9702
> 3380 Mitchell Lane                   | email: address@hidden
> Boulder, Colorado 80301  USA         | *( CoRA is a division of NWRA )
> -------------------------------------------------------------------------
>    You can't trust your eyes when your imagination is out of focus.
>                                                       -- Mark Twain
> -------------------------------------------------------------------------

Hi Pete,

I've been comparing this information from your site with the setup on
our ingest machine, desi.  Both machines use the SSEC ingest card and
software, but we have not had these problems on desi.

It's a bit of a long shot, but I suspect that your pqing is getting
a binary character when it's expecting only text.

First, do you have the most recent version of the software?  Our
/opt/nport/bin/inge binary is dated August 21, and our
/opt/nport/exceptions file is dated Feb. 2, 2000.  I'm not sure whether
they still use the exceptions file, but if they do it's probably best
to have the most recent version.  I just heard today that a new version
of the code (called SDI?) is available from SSEC.  Robb said he would
forward the announcement to me.  When I get it I'll forward it to you.

I'm also wondering about your invocation of pqing.  I know that WMO is a
feed type handled by pqing, but on desi we are running two separate
invocations of pqing, one for binary products and one for text:
        pqing -f HRS /tmp/jmb.fifo.2
        pqing -f IDS|DDPLUS /tmp/jmb.fifo.1

Since WMO includes HRS and IDS|DDPLUS, I'm not sure how pqing determines
whether a product is binary or not when it is given the combined feed
type.  I know that it's not uncommon for text products, many of which
are generated by hand, to have binary characters in them.

You'll see that we're reading from a FIFO rather than directly from a
port as you are.  I'm not sure how to tell inge to write to the fifo (I
assume that's what needs to happen), but I could find out if you want to
give that a try.  I see that you are 'cat'ing a fifo, but are you using
that elsewhere?

It's interesting that ldmadmin check did not return.  Maybe it's because
your log files are very large with all the verbose logging.  I suspect
it would return eventually.  It might be worthwhile to wait longer next
time.
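
If you'd rather not sit and watch it, here is a rough sketch of one way
to give a command like "ldmadmin check" a generous time limit and record
how long it actually took.  This is plain Python, not part of the LDM;
the function name and the ten-minute limit in the comment are just
assumptions for illustration.

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of arguments) with a time limit.

    Returns (status, elapsed_seconds, output).  status is the exit code,
    or None if the command was still running when the limit expired
    (subprocess.run kills the child in that case).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
        status, raw = proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # Whatever output was captured before the limit, possibly nothing.
        status, raw = None, exc.stdout or b""
    elapsed = time.monotonic() - start
    return status, elapsed, raw.decode(errors="replace")


# Hypothetical use, with an assumed ten-minute limit:
#   status, elapsed, out = run_with_timeout(["ldmadmin", "check"], 600)
#   status is None would mean it still had not returned after ten minutes.
```

Knowing whether the check returns after, say, five minutes or never
returns at all would help tell a slow scan of large log files apart from
a genuine hang.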

I don't have other ideas to propose at this point.  If it happens again
at a convenient time I would still like to take a look.  Do you build
the software from source or do you get a binary distribution?  If it's
from source and you have a debugger on your machine, I could attach to
the hung process and see where it's stuck.

By the way, sending a 'kill -HUP' won't stop rpc.ldmd.  For daemon
processes, like rpc.ldmd, SIGHUP is commonly used to tell them to
reread their configuration file.  (That's why you see the line in the
logs saying "_NOT_ ReReading configuration file" - rpc.ldmd won't reread
ldmd.conf on the fly anymore.)

Try sending both rpc.ldmd and pqing a plain 'kill', which sends SIGTERM,
the normal, non-brutal termination signal that lets a process clean up
and die gracefully (if it can, indeed, die).  If that doesn't work, try
'kill -9'.  A process cannot catch SIGKILL, so when you use 'kill -9' on
rpc.ldmd you run the risk of corrupting your queue: rpc.ldmd may be
killed before it can finish writing a product.
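
The order of signals described above can be sketched as a small helper:
SIGTERM first, a grace period, and SIGKILL only as a last resort.  This
is a generic illustration, not anything shipped with the LDM; the
function names and the 30-second grace period are assumptions.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # it exists, but belongs to another user
    return True


def terminate(pid, grace_s=30.0, poll_s=0.5):
    """Send SIGTERM, wait up to grace_s seconds, then fall back to SIGKILL.

    Returns "already gone", "SIGTERM" or "SIGKILL" to say what happened.
    Note: a child of the calling process that has exited but has not been
    waited for still counts as alive here.  That is not an issue for
    daemons started elsewhere, which is the case this sketch assumes.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "already gone"
    deadline = time.monotonic() + grace_s
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return "SIGTERM"
        time.sleep(poll_s)
    try:
        os.kill(pid, signal.SIGKILL)  # cannot be caught; no cleanup happens
    except ProcessLookupError:
        return "SIGTERM"  # it exited just after the last poll
    return "SIGKILL"
```

If this ever returns "SIGKILL" for rpc.ldmd, the queue corruption risk
mentioned above applies, so a queue check before restarting would be
prudent.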

I hope this is helpful.  Please let me know what transpires.

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************