
19990505: IRIX assertion failures



Neil,

An "assertion failed" message, such as the
pq->ctlp->magic == PQ_MAGIC failure in your log, means that
your queue is corrupt. This generally happens when a machine
goes down while data is still being written.

The program that is detecting the corrupt queue is pqsurf,
so you should delete and remake the surf queue.

To do this, shut down any running ldm processes, then run:
ldmadmin delsurfqueue
ldmadmin mksurfqueue

Then try to restart the ldm.
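
The sequence above can be sketched as a small script. This is only an illustration: it assumes that "ldmadmin stop" and "ldmadmin start" are how the LDM is shut down and restarted at your site, and the names rebuild_surf_queue, SURF_QUEUE_STEPS and the run parameter are hypothetical, introduced here for the example.

```python
import subprocess

# Ordered steps for rebuilding the pqsurf queue. "stop" and "start" are
# assumptions about how the LDM is shut down and restarted at your site;
# delsurfqueue and mksurfqueue are the commands given in the message.
SURF_QUEUE_STEPS = [
    ["ldmadmin", "stop"],
    ["ldmadmin", "delsurfqueue"],
    ["ldmadmin", "mksurfqueue"],
    ["ldmadmin", "start"],
]


def rebuild_surf_queue(run=None):
    """Run each step in order, stopping at the first failure.

    `run` takes a command list and returns an exit status. By default it
    executes the command for real via subprocess.
    Returns the list of commands that were attempted.
    """
    if run is None:
        run = lambda cmd: subprocess.run(cmd).returncode
    attempted = []
    for cmd in SURF_QUEUE_STEPS:
        attempted.append(cmd)
        if run(cmd) != 0:
            # Do not remake or restart anything after a failed step.
            break
    return attempted
```

Passing a stand-in for run, for example run=lambda cmd: print(" ".join(cmd)) or 0, prints the steps without executing anything, which may be a safer first try on a production machine.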

If pqact or rpc.ldmd ever reports a corrupt product queue,
you would instead rebuild the main queue, ldm.pq, with:
ldmadmin delqueue
ldmadmin mkqueue
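
As a rough sketch of the rule above (which program reports the failure decides which queue to rebuild), the following scans log lines for assertion failures and lists the matching commands. The regular expression is an assumption based on the log format quoted below, and the names LOG_LINE, REBUILD and rebuild_commands are hypothetical.

```python
import re

# Matches syslog-style LDM entries such as:
#   May 04 18:56:40 3Q:coriolis pqsurf[3808]: assertion "..." failed
# The pattern is an assumption based on the log excerpt in this thread.
LOG_LINE = re.compile(r"\s(?P<prog>[\w.]+)\[(?P<pid>\d+)\]:\s+assertion\b")

# Which queue to delete and remake, keyed by the reporting program.
REBUILD = {
    "pqsurf": ["ldmadmin delsurfqueue", "ldmadmin mksurfqueue"],
    "pqact": ["ldmadmin delqueue", "ldmadmin mkqueue"],
    "rpc.ldmd": ["ldmadmin delqueue", "ldmadmin mkqueue"],
}


def rebuild_commands(log_lines):
    """Return the rebuild commands suggested by assertion failures.

    Commands are listed once each, in the order their reporting
    program first appears in the log.
    """
    commands = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue
        for cmd in REBUILD.get(match.group("prog"), []):
            if cmd not in commands:
                commands.append(cmd)
    return commands
```

Fed the log excerpt from this thread, the only assertion entry comes from pqsurf, so only the surf queue commands would be suggested. The LDM still has to be shut down before running them.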

Steve Chiswell
Unidata User Support




>From: "Neil R. Smith" <address@hidden>
>Organization: Dept. Meteorology, TAMU
>Keywords: 199905051811.MAA00030

>Hi,
>I've been having assertion failures and can't find anything
>relevant to our setup:
>Platform: SGI Indigo 2, IRIX 6.2, 
>LDM vers: LDM 5.0.5
>LDM pqsurf queue size config'd in ldmd.conf: 6000000 bytes
>LDM pqsurf.pq typical actual running size: 2330624
>
>Here is the status report ldm emailed:
>+++++++++++++++++++++++++++++++
>LDM status report from the logs for the last 11 hours.
>
>Currently coriolis is running  percent idle
>load average: 0.68, 0.77, 0.90
>Running version number 5.0.5.
>LDM was restarted 3 time(s)
>        Last LDM restart at May 04 18:10:11
>Max Queue usage is 92815352 bytes, it occurred at May 04 18:56:47
>
>Critical LDM problems that need immediate attention:
>assertion failure message occurred 3 time(s).
>        Last one at:  May 04 18:56:40
>        For pqsurf[2572] it happened 1 time(s).
>        For pqsurf[2801] it happened 1 time(s).
>        For pqsurf[3808] it happened 1 time(s).
>
>
>Potential LDM Problems:
>Non-zero Status message occurred 110 time(s).
>        Last one at:  May 05 12:33:47
>'NULLPROC error' message occurred 18 time(s).
>        Last one at:  May 05 12:08:42
>        For gergu3.gerg.tamu.edu it happened 18 time(s).
>
>Decoder LDM Problems:
>+++++++++++++++++++++++++++++++++
>
>ldm log entries referencing the last incidence, 3808, which
>is the PID of the starting pqsurf:
>+++++++++++++++++++++++++++++++++
>May 04 18:50:12 5Q:coriolis pqsurf[3808]: Starting Up (3803)
>May 04 18:50:31 3Q:coriolis pqsurf[3808]: surface_split: Can't handle
>MESSAGE_TYPE_UNKNOWN
>May 04 18:50:48 3Q:coriolis pqsurf[3808]: surface_split: Can't handle
>MESSAGE_TYPE_UNKNOWN
>May 04 18:51:24 3Q:coriolis pqsurf[3808]: surface_split: Can't handle
>MESSAGE_TYPE_UNKNOWN
>May 04 18:52:32 5Q:coriolis pqsurf[3808]: Growing index by 4096
>May 04 18:52:40 5Q:coriolis pqsurf[3808]: Growing index by 4096
>May 04 18:52:53 3Q:coriolis pqsurf[3808]: surface_split: Can't handle
>MESSAGE_TYPE_UNKNOWN
>May 04 18:53:04 5Q:coriolis pqsurf[3808]: Growing index by 4096
>May 04 18:53:38 3Q:coriolis pqsurf[3808]: surface_split: Can't handle
>MESSAGE_TYPE_UNKNOWN
>May 04 18:53:57 5Q:coriolis pqsurf[3808]: Growing index by 4096
>May 04 18:54:15 3Q:coriolis pqsurf[3808]: surface_split: Can't handle
>MESSAGE_TYPE_UNKNOWN
>May 04 18:54:17 3Q:coriolis pqsurf[3808]: surface_split: Can't handle
>MESSAGE_TYPE_UNKNOWN
>May 04 18:54:31 5Q:coriolis pqsurf[3808]: Growing index by 4096
>May 04 18:54:58 5Q:coriolis pqsurf[3808]: Growing index by 4096
>May 04 18:56:13 5Q:coriolis pqsurf[3808]: Growing index by 4096
>May 04 18:56:23 3Q:coriolis pqsurf[3808]: surface_split: Can't handle
>MESSAGE_TYPE_UNKNOWN
>May 04 18:56:23 3Q:coriolis pqsurf[3808]: surface_split: Can't handle
>MESSAGE_TYPE_UNKNOWN
>May 04 18:56:24 5Q:coriolis pqsurf[3808]: Growing index by 4096
>May 04 18:56:27 3Q:coriolis pqsurf[3808]: surface_split: Can't handle
>MESSAGE_TYPE_UNKNOWN
>May 04 18:56:40 5Q:coriolis pqsurf[3808]: Growing index by 4096
>May 04 18:56:40 5Q:coriolis pqsurf[3808]: Growing data by 397312
>May 04 18:56:40 3Q:coriolis pqsurf[3808]: assertion "pq->ctlp->magic ==
>PQ_MAGIC" failed: file "pq.c", line 2514
>May 04 18:56:41 5Q:coriolis pqsurf[3808]: Exiting
>May 04 18:56:41 3Q:coriolis pqsurf[3808]: waitpid: No child processes
>May 04 18:56:41 5Q:coriolis pqsurf[3808]:   Queue usage (bytes): 1475008
>May 04 18:56:41 5Q:coriolis pqsurf[3808]:            (nregions):    8254
>May 04 18:56:41 5Q:coriolis pqsurf[3808]: Number of products 704
>May 04 18:56:41 5Q:coriolis pqsurf[3808]: Number of observations 2381
>May 04 18:56:41 5Q:coriolis pqsurf[3808]: Number of dups 162
>May 04 18:56:47 5Q:coriolis rpc.ldmd[3803]: child 3808 terminated by
>signal 6
>++++++++++++++++++++++++++++++
>
>My search of the ldm-users and Unidata support archives finds only
>references to SUN problems in pq.c and possible small pqsurf.pq sizes.
>They don't seem to apply to our config.
>
>Can you shed any light?  I'll be glad to respond with further 
>info from logs or config., if needed.
>
>Thanks,
>-Neil
>-- 
>Neil R. Smith, Sys. Admin.              address@hidden
>Dept. Meteorology, Texas A&M Univ.      409/845-6272 FAX:409/862-4466
>