20011009: Cron Jobs Failing on Blizzard (cont.)



>From: "Patrick O'Reilly" <address@hidden>
>Organization:  UNI
>Keywords:  200110031943.f93Jhm108665 cron auditing

Patrick,

>I am finally back from my travels.  I read your suggestion,
>
>> <as 'root'>
>> audit -t
>
>and the problem still exists.

OK.

>I should also correct myself, in the previous
>email, I said the cron jobs began failing between 6:30 and 7:00 pm on
>Saturday, September 29, it was actually between 6:30 and 7:00 pm on Monday,
>October 1.  It was the same time we were remotely working on
>blizzard.storm.uni.edu from your office.

The smoking gun, eh :o)

I don't think that the crontab entry that I put in for the user 'ldm'
would cause all cron jobs to stop.  Turning auditing on is the most likely
culprit according to my system administrator, who also recommended that
you reboot AFTER making sure that you don't have something set up to
start auditing at boot.

The other thing a reboot will do for you is restart syslogd.  syslogd
is known to get hosed on Solaris systems, and the fact that your
/var/adm/messages* files are empty suggests that syslogd is not working
correctly.  On the other hand, that is hard to square with the fact that
LDM logging is working (~ldm/logs/ldmd.log* are being updated and
rotated).

>I just thought that the
>coincidence stood out, as we did add a cron entry at around that time.  But,
>as I am not a UNIX system guru, it was just a suspicion.

I would be suspicious as well.

>Below is a clip from the cron log file around that time:
>
>>  CMD: /usr/lib/sa sa1
>>  sys 23528 c Mon Oct  1 18:20:00 2001
><  sys 23528 c Mon Oct  1 18:20:00 2001 rc=1
>>  CMD: /usr/local/ldm/bin/ldmadmin dostats
>>  ldm 23545 c Mon Oct  1 18:35:00 2001
><  ldm 23545 c Mon Oct  1 18:35:02 2001
>>  CMD: /usr/lib/sa sa1
>>  sys 23686 c Mon Oct  1 18:40:00 2001
><  sys 23686 c Mon Oct  1 18:40:00 2001 rc=1
>>  CMD: /usr/sbin/audit -n

Here is where the audit daemon was told to close the current audit file
and open a new one in the current audit directory.

>>  root 23918 c Mon Oct  1 19:00:00 2001
>>  CMD: /usr/lib/sa sa1
>>  sys 23919 c Mon Oct  1 19:00:00 2001
>>  CMD: /usr/local/ldm/bin/ldmadmin scour
>>  ldm 23920 c Mon Oct  1 19:00:00 2001
>! cron audit problem. job failed (/usr/local/ldm/bin/ldmadmin scour) for user ldm Mon Oct  1 19:00:00 2001
><  ldm 23920 c Mon Oct  1 19:00:00 2001 rc=1
><  sys 23919 c Mon Oct  1 19:00:00 2001 rc=1
><  root 23918 c Mon Oct  1 19:00:00 2001
>>  CMD: /usr/lib/sa sa1
>>  sys 24151 c Mon Oct  1 19:20:00 2001
><  sys 24151 c Mon Oct  1 19:20:00 2001 rc=1
>! cron audit problem. job failed (/usr/local/ldm/bin/ldmadmin dostats) for user ldm Mon Oct  1 19:35:00 2001
>
>As you can see the last ldm job that ran was dostats at 18:35:00 2001.

Right.
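
If you want to confirm that only 'ldm' jobs are hitting the audit
failure, something like the following could tally the "cron audit
problem" lines per user.  This is only a sketch: the function name is
made up, it assumes each failure sits on a single line in the log (the
wrapping in your clip looks like a mail artifact), and the log location
(commonly /var/cron/log on Solaris) is an assumption you should check.

```shell
#!/bin/sh
# Tally "cron audit problem" failures per user in a cron log.
# Assumes one failure per line, in the form:
#   ! cron audit problem. job failed (CMD) for user NAME DATE
audit_failures_by_user() {
    # $1: path to the cron log (e.g. /var/cron/log; an assumption)
    awk '/cron audit problem/ {
             # Find the words "for user" and count the name that follows.
             for (i = 1; i + 2 <= NF; i++)
                 if ($i == "for" && $(i + 1) == "user") { count[$(i + 2)]++; break }
         }
         END { for (u in count) print u, count[u] }' "$1" | sort
}

# Usage (as 'root', path is an assumption):
#   audit_failures_by_user /var/cron/log
```

Each output line is a user name followed by the number of failures
logged for that user.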

>The
>next, the scour at 19:00, failed, and all ldm cron jobs have been failing
>since.  It seems the root and sys cron stuff is still fine.  I am still
>trying to figure things out here, if you have any other suggestions, I would
>be glad to hear.  Thanks!

OK.  Since my system administrator is not here to guide me through this, I
will suggest that you:

1) make sure that auditing will not be on after a reboot
2) reboot
3) see if 'ldm's cron jobs still fail
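
For step 1, a rough check might look like the sketch below.  It rests
on an assumption you should verify against the documentation for your
Solaris release: that auditing enabled through bsmconv leaves a line
such as "set c2audit:audit_load = 1" in /etc/system.  The function name
is made up for illustration.

```shell
#!/bin/sh
# Succeed if a file still asks the kernel to load the audit module.
# Assumption: BSM auditing enabled via bsmconv leaves a line like
#   set c2audit:audit_load = 1
# in /etc/system.  Verify this for your Solaris release.
audit_enabled_at_boot() {
    # $1: file to inspect (defaults to /etc/system)
    file=${1:-/etc/system}
    # Skip comment lines, which begin with '*' in /etc/system,
    # then look for an active audit_load setting.
    grep -v '^\*' "$file" 2>/dev/null |
        grep 'c2audit:audit_load[[:space:]]*=[[:space:]]*1' >/dev/null
}

# Usage (as 'root'):
#   if audit_enabled_at_boot; then
#       echo "auditing looks set to start at boot"
#   else
#       echo "no active audit_load entry found"
#   fi
```

A negative result here only says that this one setting was not found;
it does not prove that nothing else starts auditing at boot.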

If 'ldm's cron jobs still fail, then copy the crontab entries to a file
for reference, remove all entries except the 'ldmadmin scour' one, and
see what happens then.

[ldm@blizzard]#crontab -l
35 * * * * /usr/local/ldm/bin/ldmadmin dostats
0 0 * * * /usr/local/ldm/bin/ldmadmin newlog
0 1,4,7,10,13,16,19,22 * * * /usr/local/ldm/bin/ldmadmin scour
0 21 * * * /usr/local/ldm/decoders/mcscour.sh

For reference, the only cron entry that I added when you were here in
my office was the last one, /usr/local/ldm/decoders/mcscour.sh.
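
Saving the crontab and cutting it down to the scour entry could be done
along these lines.  This is a sketch only: the helper name and the file
names under $HOME are made up, and the commands that actually change
the crontab are left as comments so that nothing is altered by
accident.

```shell
#!/bin/sh
# Keep only the 'ldmadmin scour' entry from a crontab listing on stdin.
keep_only_scour() {
    grep 'ldmadmin scour'
}

# Example session as user 'ldm' (file names are hypothetical):
#   crontab -l > $HOME/crontab.full        # save the full crontab for reference
#   keep_only_scour < $HOME/crontab.full > $HOME/crontab.scour
#   crontab $HOME/crontab.scour            # install the reduced crontab
#   crontab $HOME/crontab.full             # later: restore the original entries
```

Keeping the full copy around means the original four entries can be put
back with a single crontab command once the test is done.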

Please keep me informed about the outcome of a reboot and other tests.

Tom