
20051209: need to upgrade LDM on sasquatch ?



>From: Unidata User Support <address@hidden>
>Organization: Unidata Program Center/UCAR
>Keywords: 200512081502.jB8F2Y7s029370 LDM 20051207 5Z assertion failure

Hi Gerry,

I got your voicemail...

I just logged onto sasquatch to take a look at the LDM exits you
noted in your voicemail.  The failure December 8:

ldmd.log.3:

 ...
Dec 08 05:32:11 sasquatch bigbird[6159] ERROR: assertion "n > 0" failed: file "pq.c", line 2259
 ...
Dec 08 05:34:17 sasquatch rpc.ldmd[6153] NOTE: child 6159 terminated by signal 6
Dec 08 05:34:17 sasquatch rpc.ldmd[6153] NOTE: Killing (SIGINT) process group
Dec 08 05:34:17 sasquatch rpc.ldmd[6153] NOTE: SIGINT
 ...

is the same one that was reported by four sites on December 7.  Steve
(who has been at the AGU meeting in San Francisco) commented on this
failure to another user:

  From address@hidden  Thu Dec  8 11:44:34 2005
  
  Amazing!
  
  The product-queue module contains a bad assert() that's activated
  if the first four bytes of the MD5 checksum of a data-product (the
  data-product's signature) are all zeros.  In order for this to occur,
  the LDM package must also have been compiled with assertions enabled
  (which is not the default).
  
  Apparently, this is the first such occurrence in eleven years.
  
  I've removed the offending assertion and will make a new release when I
  return.
  
  Regards,
  Steve Emmerson
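
To put the rarity in perspective: if MD5 output is treated as uniformly
distributed (an assumption), the chance that the first four bytes of a
signature are all zero is 1 in 2^32, roughly 1 in 4.3 billion products.
The sketch below only tests for that condition.  The function names are
hypothetical and are not taken from pq.c; how pq.c derives the "n" in the
failed assertion is not shown in this message.

```c
#include <string.h>

#define SIGNATURE_LEN 16  /* an MD5 digest is 16 bytes long */

/* Hypothetical helper: returns 1 if the first four bytes of the
 * signature are all zero, 0 otherwise. */
int leading_word_is_zero(const unsigned char sig[SIGNATURE_LEN])
{
    static const unsigned char zeros[4] = {0, 0, 0, 0};
    return memcmp(sig, zeros, sizeof zeros) == 0;
}

/* Hypothetical helper: the first four bytes read as a big-endian
 * unsigned value.  A check such as "value > 0" on this number would
 * fail exactly when leading_word_is_zero() returns 1. */
unsigned long leading_word(const unsigned char sig[SIGNATURE_LEN])
{
    return ((unsigned long)sig[0] << 24) |
           ((unsigned long)sig[1] << 16) |
           ((unsigned long)sig[2] << 8)  |
            (unsigned long)sig[3];
}
```

A signature can be perfectly valid and still have a zero leading word,
which is why a positivity assertion on a value taken from those bytes is
a bug rather than a real sanity check.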

I don't know how Steve built the version of the LDM you are using
(6.4.2), but I found the following in the ~ldm/ldm-6.4.2/src/macros.make
file:

CFLAGS  = -g -m64
CONFIGURE_CPPFLAGS      = -UNDEBUG
GDBMLIB =
CONFIGURE_LIBS  =

The -UNDEBUG setting of CONFIGURE_CPPFLAGS is most likely what activated
assertions: it undefines NDEBUG, so assert() calls are compiled in, and
it was an assertion that caused the LDM to exit.  Steve may have set this
to try to capture errant behavior on sasquatch (you and he were trying to
figure out the previous failures you were seeing).
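
For reference, here is a minimal sketch of how NDEBUG controls assert()
in standard C.  It is not LDM code, and all of the function names are
made up for illustration.  It relies on the fact that <assert.h> may be
included more than once and redefines assert() each time according to
whether NDEBUG is currently defined.

```c
/* Counts how often the asserted expression is actually evaluated. */
static int evaluations = 0;
static int touch(void) { evaluations++; return 1; }

/* With NDEBUG defined (the usual production setting, e.g. -DNDEBUG),
 * assert(expr) expands to ((void)0) and expr is never evaluated. */
#undef NDEBUG
#define NDEBUG 1
#include <assert.h>
static void checked_with_ndebug(void) { assert(touch()); }

/* With NDEBUG undefined (the effect of -UNDEBUG), assert(expr)
 * evaluates expr and calls abort() if it is false. */
#undef NDEBUG
#include <assert.h>
static void checked_without_ndebug(void) { assert(touch()); }

int evaluations_with_ndebug(void)
{
    evaluations = 0;
    checked_with_ndebug();
    return evaluations;
}

int evaluations_without_ndebug(void)
{
    evaluations = 0;
    checked_without_ndebug();
    return evaluations;
}
```

A failed assertion calls abort(), which raises SIGABRT.  SIGABRT is
signal 6 on Linux and most Unix systems, which matches the "terminated
by signal 6" note in the log above.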

The other LDM exits I saw recently were caused by someone stopping
the LDM.  In at least one case, this was due to fflush errors from
pqact because a file system was full:

from ~ldm/logs/ldmd.log.1:

Dec 09 16:23:55 sasquatch pqact[3625] ERROR: fflush: /usr/local/ldm/data/ddplus/marine/05120916.CGDUMP: No space left on device
Dec 09 16:23:57 sasquatch last message repeated 5 times
Dec 09 16:23:57 sasquatch rpc.ldmd[3622] NOTE: SIGTERM
Dec 09 16:23:57 sasquatch rpc.ldmd[3622] NOTE: Terminating process group

The question now is whether the LDM has been exiting in any other ways
that I did not see.  If not, then it may be a good time to (re)build either
the version you are already using (6.4.2) or the latest available
(6.4.4) with assertions turned off.

One last comment:  Steve noted that he has already turned off the
offending assertion and will make a new release when he gets back to
the office.  I am not sure if this will be today (he is not in yet) or
Monday, but I think that Monday is more likely.

So, what would you like to do at this point:

- rebuild your current LDM without assertions

- build the latest LDM without assertions

- wait for Steve's new release

Cheers,

Tom