Okay, the whole story goes like this...we had an old Sun system that ran
these scripts. When I started taking care of it in September, it was
obvious that this system was too old and decrepit to continue. Luckily, I
was able to convince the powers that be that a new system was needed. The
new system came in December, and we transferred all scripts and updated
GEMPAK and LDM to the new versions. The system worked fine from about
Christmas until maybe a month ago. Around this time I added a couple
scripts (model differences). When the crashes started occurring, my
immediate thought was an error with one of the new scripts. Therefore, I
disabled all of them, to no avail. I then looked and found a java script
that had stopped working and thought that was the problem. I disabled it
and found no change in performance.
It appears to me that the crashes occur randomly, at different times of
the day and after very different uptimes. Sometimes we're up for a week,
sometimes (like yesterday) we crash 3 times in a single day. Therefore, I
conclude that if it is a single script, it's one that runs at least
hourly. I've made a list of all of these scripts and am now disabling
them one by one to see if I get any results.
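
A rough way to shorten the one-by-one search is to compare crash times
against a log of when each script started. The sketch below assumes such a
log exists as (script name, start time) pairs; the script names, the
times, and the 10-minute window are all made up for illustration and are
not taken from the real system.

```python
from datetime import datetime, timedelta


def scripts_before_every_crash(crash_times, runs, window_minutes=10):
    """Return the scripts that started within the window before EVERY crash.

    crash_times: list of datetime objects, one per crash.
    runs: list of (script_name, start_time) tuples (hypothetical log format).
    """
    window = timedelta(minutes=window_minutes)
    candidates = None
    for crash in crash_times:
        # Scripts that started at most `window` before this crash.
        recent = {
            name
            for name, started in runs
            if timedelta(0) <= crash - started <= window
        }
        # Keep only scripts that were also seen before all earlier crashes.
        candidates = recent if candidates is None else candidates & recent
    return sorted(candidates or [])


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    crashes = [datetime(2007, 3, 1, 10, 5), datetime(2007, 3, 1, 14, 5)]
    run_log = [
        ("hourly_plot", datetime(2007, 3, 1, 10, 0)),
        ("model_diff", datetime(2007, 3, 1, 10, 2)),
        ("hourly_plot", datetime(2007, 3, 1, 14, 0)),
        ("daily_archive", datetime(2007, 3, 1, 3, 0)),
    ]
    print(scripts_before_every_crash(crashes, run_log))
```

If the result comes back empty, no single script ran shortly before every
crash, which would suggest the cause is something other than one script.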

Gabe,
If we saw problems like that here, we would immediately suspect the
hardware. (I'm a little late to this discussion, so pardon me if it's
been talked about already.)
Now before I continue: I have to admit that this could very well be a
software/operating system problem, BUT it could also be hardware-related.
Three things immediately come to mind: 1) Cooling problem. 2) Bad power
supply. 3) Bad (or failed) memory.
For 1), check all fans and all cooling fins on all heatsinks. Clean
and/or replace as necessary. Make sure each CPU heatsink is properly
seated on top of its CPU.
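
Alongside the physical inspection, temperatures can be logged from
software if the kernel exposes them. The following is only a sketch: it
assumes a Linux kernel that provides thermal zones under
/sys/class/thermal (not every machine or kernel does), and the 80 C limit
is an arbitrary placeholder, not a vendor figure.

```python
from pathlib import Path


def read_temperatures(base="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for each readable thermal zone.

    The sysfs 'temp' files report millidegrees Celsius.
    """
    temps = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            raw = (zone / "temp").read_text().strip()
            temps[zone.name] = int(raw) / 1000.0
        except (OSError, ValueError):
            # Zone missing, unreadable, or reporting garbage: skip it.
            continue
    return temps


def too_hot(temps, limit_c=80.0):
    """Return only the zones at or above limit_c (assumed threshold)."""
    return {name: t for name, t in temps.items() if t >= limit_c}


if __name__ == "__main__":
    readings = read_temperatures()
    if not readings:
        print("No thermal zones found; this kernel may not expose them.")
    for name, temp in readings.items():
        print(f"{name}: {temp:.1f} C")
    for name, temp in too_hot(readings).items():
        print(f"WARNING: {name} is at {temp:.1f} C")
```

Run from cron every minute or so with output appended to a file, this
would show whether temperatures climb before a crash.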
For 2), I don't know of an easy way to test a power supply, so we always
swap out suspect power supplies to see if it eliminates the problem.
For 3), run memtest86 if you can afford the downtime. See this website
for more information: <http://www.memtest86.com/> (Linux packages are
available.)
Some other things:
o Could be a bad CPU. We have had a handful of failed AMD CPUs over the
past several years.
o Re-seat all expansion cards and cables.
o If you are using a 3ware RAID controller, you might consider doing a
volume verify.
o Gilbert mentioned buggy BIOSes in another post. You might consider
checking with the motherboard vendor for BIOS updates.
Hope this helps a bit.
- Bryan
University of North Dakota