[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[LDM #TIB-337303]: New IDD relay - ldm is hanging after some time



Pete,

> Finally got a backtrace from it. Process 8472 was stuck at 100% CPU when
> I started gdb.
> 
> [root@idd ldm-6.9.7]# gdb ldmd 8472
> GNU gdb (GDB) Red Hat Enterprise Linux (7.1-29.el6)
> Copyright (C) 2010 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /usr/local/ldm/bin/ldmd...done.
> Attaching to program: /usr/local/ldm/bin/ldmd, process 8472
> Reading symbols from /usr/local/ldm/lib/libldm.so.0...done.
> Loaded symbols for /usr/local/ldm/lib/libldm.so.0
> Reading symbols from /usr/local/ldm/lib/libxml2.so.2...done.
> Loaded symbols for /usr/local/ldm/lib/libxml2.so.2
> Reading symbols from /lib64/libdl.so.2...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libdl.so.2
> Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libm.so.6
> Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libnss_files.so.2
> Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libnss_dns.so.2
> Reading symbols from /lib64/libresolv.so.2...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libresolv.so.2
> 0x000000337088431b in _wordcopy_fwd_dest_aligned () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install
> glibc-2.12-1.7.el6_0.5.x86_64
> (gdb) bt
> #0  0x000000337088431b in _wordcopy_fwd_dest_aligned () from
> /lib64/libc.so.6
> #1  0x0000003370883856 in bcopy () from /lib64/libc.so.6

How could a process be stuck using 100% of the CPU in bcopy(3)? That's a 
rhetorical question. The answer is "It can't." I suspect, therefore, that 
something's wrong with either your runtime library or with the operating-system.

> #2  0x00007f3aa8030f20 in xdrmem_putbytes (xdrs=0x7fff1bd97df0,
> addr=<value optimized out>, len=404167) at xdr_mem.c:150
> #3  0x00007f3aa803078a in my_xdr_opaque (xdrs=0x7fff1bd97df0, cp=<value
> optimized out>, cnt=<value optimized out>) at xdr.c:463
> #4  0x00007f3aa800f73c in xproduct (pq=0x1be6160, prod=0x7fff1bd97e90)
> at pq.c:4541
> #5  pq_insertNoSig (pq=0x1be6160, prod=0x7fff1bd97e90) at pq.c:5538
> #6  0x00007f3aa800fb26 in pq_insert (pq=<value optimized out>,
> prod=<value optimized out>) at pq.c:5578

The process is trying to insert something into the product-queue.

> #7  0x00007f3aa8017543 in dh_saveDataProduct (pq=<value optimized out>,
> info=0x1be8910, data=<value optimized out>, wasHereis=1, notifyAutoShift=1)
> at DownHelp.c:159
> #8  0x00007f3aa801c65d in hereis_6_svc (prod=<value optimized out>,
> rqstp=0x7fff1bd984d0) at ldm6_server.c:661
> #9  0x00007f3aa801acf1 in ldmprog_6 (rqstp=0x7fff1bd984d0,
> transp=0x1cb12d0) at ldm6_svc.c:100
> #10 0x00007f3aa802e629 in svc_getreqsock (sock=<value optimized out>) at
> svc.c:543
> #11 0x00007f3aa801ffa2 in one_svc_run (xp_sock=4, inactive_timeo=<value
> optimized out>) at one_svc_run.c:91
> #12 0x00007f3aa8022aca in run_service (upName=0x1be6140
> "idd.unidata.ucar.edu", port=30085840, request=<value optimized out>,
> inactiveTimeout=60,
> pqPathname=0x7fff1bd987a4 "\377\177", pq=<value optimized out>,
> isPrimary=1) at requester6.c:229
> #13 req6_new (upName=0x1be6140 "idd.unidata.ucar.edu", port=30085840,
> request=<value optimized out>, inactiveTimeout=60,
> pqPathname=0x7fff1bd987a4 "\377\177", pq=<value optimized out>,
> isPrimary=1) at requester6.c:672
> #14 0x00007f3aa8014a62 in prog_requester (ldmPort=<value optimized out>)
> at acl.c:1631
> #15 run_requester (ldmPort=<value optimized out>) at acl.c:1809
> #16 new_requester (ldmPort=<value optimized out>) at acl.c:1865
> #17 requester_add (ldmPort=<value optimized out>) at acl.c:1908
> #18 invert_request_acl (ldmPort=<value optimized out>) at acl.c:1974
> #19 0x0000000000405980 in read_conf (conf_path=0x7f3aa82517a0
> "/usr/local/ldm/etc/ldmd.conf", doSomething=1, defaultPort=388) at
> parser.y:593
> #20 0x00000000004065b7 in main (ac=<value optimized out>, av=<value
> optimized out>) at ldmd.c:1049
> (gdb)
> 
> 
> ps -eflu ldm shows this while connected to gdb:
> 
> [ldm@idd ~]$ ps -flu ldm
> F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY
> TIME CMD
> 4 S ldm       2697  2696  0  80   0 - 28344 n_tty_ May24 pts/2
> 00:00:03 -tcsh
> 5 S ldm       3098  3093  0  80   0 - 24362 poll_s May24 ?
> 00:02:07 sshd: ldm@pts/3
> 0 S ldm       3099  3098  0  80   0 - 28270 rt_sig May24 pts/3
> 00:00:00 -tcsh
> 5 S ldm       8462     1  0  80   0 -  5277 poll_s 16:14 ?
> 00:00:00 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 0 S ldm       8464  8462  0  80   0 - 6024619 fcntl_ 16:14 ?
> 00:00:26 rtstats -h rtstats.unidata.ucar.edu
> 1 S ldm       8465  8462  0  80   0 - 6025590 fcntl_ 16:14 ?
> 00:00:19 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8466  8462  0  80   0 - 6025585 fcntl_ 16:14 ?
> 00:00:17 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8467  8462  0  80   0 - 6025672 fcntl_ 16:14 ?
> 00:00:20 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8468  8462  0  80   0 - 6025653 fcntl_ 16:14 ?
> 00:00:16 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8469  8462  0  80   0 - 6029770 fcntl_ 16:14 ?
> 00:00:30 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8470  8462  0  80   0 - 6029677 fcntl_ 16:14 ?
> 00:00:01 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8471  8462  0  80   0 - 6026399 fcntl_ 16:14 ?
> 00:00:26 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 T ldm       8472  8462 10  80   0 - 6025021 utrace 16:14 ?
> 00:12:43 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8473  8462  3  80   0 - 6025008 fcntl_ 16:14 ?
> 00:04:12 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8474  8462  8  80   0 - 6025041 fcntl_ 16:14 ?
> 00:09:54 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8475  8462  2  80   0 - 6025007 fcntl_ 16:14 ?
> 00:03:20 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8476  8462  0  80   0 - 6025039 fcntl_ 16:14 ?
> 00:00:09 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8477  8462  8  80   0 - 6025050 fcntl_ 16:14 ?
> 00:10:01 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 1 S ldm       8478  8462  0  80   0 - 6025061 poll_s 16:14 ?
> 00:00:02 ldmd -I 144.92.130.90 -P 388 -M 512 -m 3600 -o 3600 -q
> /usr/local/ldm/var/queues/ldm.pq /us
> 5 S ldm       8524  8520  0  80   0 - 24362 poll_s 16:15 ?
> 00:00:01 sshd: ldm@pts/5
> 0 S ldm       8525  8524  0  80   0 - 28354 rt_sig 16:15 pts/5
> 00:00:00 -tcsh
> 0 R ldm       8880  8525 95  80   0 - 26993 -      18:16 pts/5
> 00:00:13 ps -flu ldm
> 5 S ldm      12678 12673  0  80   0 - 24362 poll_s May26 ?
> 00:00:00 sshd: ldm@pts/4
> 0 S ldm      12679 12678  0  80   0 - 28251 n_tty_ May26 pts/4
> 00:00:00 -tcsh
> 4 S ldm      38352 38350  0  80   0 - 28261 rt_sig 16:04 pts/0
> 00:00:00 -tcsh
> 
> As soon as I detached gdb from the process it went back to 100% CPU.

I don't know what to tell you, Pete. I've never seen this happen and, as far as 
I can tell, it shouldn't happen. 100% of CPU in bcopy(3)!?

> --
> Pete Pokrandt - Systems Programmer
> UW-Madison Dept of Atmospheric and Oceanic Sciences
> 608-262-3086  - address@hidden
Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: TIB-337303
Department: Support LDM
Priority: Normal
Status: Closed