
20030827: abnormal LDM termination: product-queue assertion failure



Alan,

> To: address@hidden
> From: "Alan Hall" <address@hidden>
> Subject: LDM Terminations
> Organization: NOAA/NCDC

The above message contained the following:

> Below is an excerpt (actually it is the whole logfile) from an ldmd.log that
> has terminated or crashed twice.  The LDM is version 6.0.13 and is running on
> an IBM AIX 5.1 machine.  Products are inserted into the ldm.pq via an
> acqserver C program that connects directly to a NRS.

What's an NRS?

Who wrote the "acqserver" program?  Does it *only* use the "pq" module
of the LDM package to access the product-queue or does it access the
product-queue directly?

> I can't think of any other details, so just let me know if you need any more
> info.
> 
> Alan.

> Aug 25 11:11:34 humboldt rpc.ldmd[69076]: Starting Up (version: 6.0.13; 
> built: Jun  9 2003 13:20:12)
> Aug 25 11:11:34 humboldt eldm[104262]: Starting Up(6.0.13): 
> eldm.fsl.noaa.gov: TS_ZERO TS_ENDT {{FSL5,  
> "^FSL\.CompressedNetCDF\.MADIS\.(mesonet|hydro).\..*"}}
> Aug 25 11:11:34 humboldt eldm[104262]: Desired product class: 
> 20030825101134.509 TS_ENDT {{FSL5,  
> "^FSL\.CompressedNetCDF\.MADIS\.(mesonet|hydro).\..*"}}
> Aug 25 11:11:35 humboldt eldm[104262]: NOTICE: requester6.c:421; 
> ldm_clnt.c:272: nullproc_6 failure to eldm.fsl.noaa.gov; ldm_clnt.c:141: RPC: 
> 1832-012 Program/version mismatch; low version = 4, high version = 5
> Aug 25 11:11:35 humboldt eldm[104262]: Connected to upstream LDM-5
> Aug 25 11:11:35 humboldt eldm[104262]: FEEDME(eldm.fsl.noaa.gov): OK
> Aug 25 11:11:37 humboldt reflect(feed)[52540]: up6.c:299: Starting 
> Up(6.0.13/6): 20030823193322.338 TS_ENDT {{WMO,  "^NOAAPORT.*"}}
> Aug 25 11:11:37 humboldt reflect(feed)[52540]: topo:  reflect.ncdc.noaa.gov 
> WMO
> Aug 25 12:06:26 humboldt nomads3(feed)[60360]: up6.c:299: Starting 
> Up(6.0.13/6): 20030825110624.029 TS_ENDT {{WMO, 
> "^NOAAPORT\.NWSTG\.GRID\.([YZ]).....\.KWB..*"}}
> Aug 25 12:06:26 humboldt nomads3(feed)[60360]: topo:  nomads3.ncdc.noaa.gov 
> WMO
> Aug 25 17:44:19 humboldt eldm[104262]: Connection closed by upstream LDM
> Aug 25 17:44:49 humboldt eldm[104262]: Desired product class: 
> 20030825174326.602 TS_ENDT {{FSL5,  
> "^FSL\.CompressedNetCDF\.MADIS\.(mesonet|hydro).\..*"}}
> Aug 25 17:44:53 humboldt eldm[104262]: NOTICE: requester6.c:421; 
> ldm_clnt.c:272: nullproc_6 failure to eldm.fsl.noaa.gov; ldm_clnt.c:141: RPC: 
> 1832-012 Program/version mismatch; low version = 4, high version = 5
> Aug 25 17:44:53 humboldt eldm[104262]: Connected to upstream LDM-5
> Aug 25 17:44:57 humboldt eldm[104262]: FEEDME(eldm.fsl.noaa.gov): OK
> Aug 25 18:11:42 humboldt eldm[104262]: Connection closed by upstream LDM
> Aug 25 18:12:12 humboldt eldm[104262]: Desired product class: 
> 20030825180946.305 TS_ENDT {{FSL5,  
> "^FSL\.CompressedNetCDF\.MADIS\.(mesonet|hydro).\..*"}}
> Aug 25 18:13:06 humboldt eldm[104262]: ERROR: requester6.c:426; 
> ldm_clnt.c:242: Couldn't connect to LDM 6 on eldm.fsl.noaa.gov using 
> portmapper; ldm_clnt.c:112: : RPC: 1832-019 Program not registered
> Aug 25 18:13:36 humboldt eldm[104262]: Desired product class: 
> 20030825180946.305 TS_ENDT {{FSL5,  
> "^FSL\.CompressedNetCDF\.MADIS\.(mesonet|hydro).\..*"}}
> Aug 25 18:13:36 humboldt eldm[104262]: NOTICE: requester6.c:421; 
> ldm_clnt.c:272: nullproc_6 failure to eldm.fsl.noaa.gov; ldm_clnt.c:141: RPC: 
> 1832-012 Program/version mismatch; low version = 4, high version = 5
> Aug 25 18:13:36 humboldt eldm[104262]: Connected to upstream LDM-5
> Aug 25 18:13:37 humboldt eldm[104262]: FEEDME(eldm.fsl.noaa.gov): OK
> Aug 26 21:21:49 humboldt eldm[104262]: assertion "rl->nelems + rl->nfree + 
> rl->nempty == rl->nalloc" failed: file "pq.c", line 2056

The above assertion failure should not have occurred.  If access to the
product-queue is only through the "pq" module's API, then the
product-queue should remain in a consistent state.  The only ways I know
to bypass this and leave a product-queue in an inconsistent state are
for the operating system to crash or for the LDM to be terminated with
extreme prejudice, such as by sending it a SIGKILL (9).  Did either of
these happen?
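
To make the failed check concrete, here is a minimal sketch of the kind
of bookkeeping invariant the assertion expresses.  It is not the code in
"pq.c": only the four counter names come from the assertion text in the
log, while the struct, the helper names, and the meanings given in the
comments are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the queue's bookkeeping.  Only the four
 * counter names are taken from the assertion message in the log. */
struct region_counts {
    size_t nalloc;  /* total slots allocated */
    size_t nelems;  /* slots in use (assumed meaning) */
    size_t nfree;   /* slots describing free space (assumed meaning) */
    size_t nempty;  /* unused slots (assumed meaning) */
};

/* Returns 1 if the counters satisfy the invariant from the log, else 0. */
static int counts_consistent(const struct region_counts *rl)
{
    return rl->nelems + rl->nfree + rl->nempty == rl->nalloc;
}

/* Moves one slot from "empty" to "in use".  Returns 0 on success and
 * -1 if no empty slot is available.  The two counters must change
 * together: a process stopped after the first update but before the
 * second would leave counts that no longer add up to nalloc. */
static int take_slot(struct region_counts *rl)
{
    assert(counts_consistent(rl));
    if (rl->nempty == 0)
        return -1;
    rl->nempty -= 1;
    rl->nelems += 1;
    assert(counts_consistent(rl));
    return 0;
}
```

The point of the sketch is that every update goes through a routine that
leaves the counts balanced, so an imbalance suggests that something
either changed the counts by another route or was interrupted partway
through an update.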

In any event, your product-queue is now corrupt and should be recreated:

    1.  Stop the LDM system (ldmadmin stop).

    2.  Rename the product-queue file to get it out of the way and,
        potentially, to allow us to examine it later.

    3.  Execute the command "ldmadmin mkqueue".

    4.  Start the LDM system (ldmadmin start).

> Aug 26 21:21:55 humboldt rpc.ldmd[69076]: child 104262 terminated by signal 6
> Aug 26 21:21:55 humboldt rpc.ldmd[69076]: Killing (SIGINT) process group
> Aug 26 21:21:55 humboldt rpc.ldmd[69076]: SIGINT
> Aug 26 21:21:55 humboldt reflect(feed)[52540]: SIGINT
> Aug 26 21:21:55 humboldt nomads3(feed)[60360]: SIGINT
> Aug 26 21:21:56 humboldt rpc.ldmd[69076]: Terminating process group

> begin:vcard 
> n:Hall;Alan
> tel;fax:(828)271-4022
> tel;work:(828)271-4071
> x-mozilla-html:TRUE
> url:www.ncdc.noaa.gov
> org:National Climatic Data Center;Ingest & Processing Branch
> version:2.1
> email;internet:address@hidden
> title:Team Leader/Computer Specialist
> adr;quoted-printable:;;151 Patton Ave.=0D=0A;Asheville;NC;28801-5001;USA
> fn:Alan D. Hall
> end:vcard

Regards,
Steve Emmerson