[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

20020520: motherlode:/home disk sutuation



>From: "Mike Schmidt" <address@hidden>
>Organization: UCAR/Unidata
>Keywords: 200205202252.g4KMqUi1660870 McIDAS-XCD XCD_START.LOG

Mike,

>The /home partition on motherlode is still full and a little
>investigation reveals the following;
>
> - total disk usage of /home according to "du -sk" is about 4GB
> - actual disk usage from "df -kl" shows 15GB used
> - "lsof | grep home" show that several mcidas processes have what
>   appears to be an 11GB file open with inode 800062 that must have
>   had its directory entry removed, yet the file is still open
> - inode 800062 is likely to have resided in /home/mcidas/upcworkdata
>   given that inodes 800061 and 800063 reside there;
>
>(motherlode) 193 # ls -laiR | grep 80006
>    800060 -rw-rw-r--   1 ldm      ustaff       448 May 20 16:40 DCLSTIDX.PTR
>    800064 -rw-rw-r--   1 mcidas   ustaff      4020 May 20 16:39 ALLOC.WWW
>    800069 -rw-rw-r--   1 ldm      ustaff     22127 May 20 16:35 ROUTEPP.LOG
>    800068 -rw-r--r--   1 mcidas   ustaff   8901480 May 20 10:30 GRID9999
>    800063 -rw-rw-r--   1 ldm      ustaff       530 May 19 21:00 scour.log
>    800067 -rw-rw-r--   1 mcidas   ustaff     20820 Mar  2 16:53 PUBLIC.SRV
>    800061 -rw-rw-r--   1 mcidas   ustaff    292940 Feb  9 10:31 GROUPS.DAT
>    800065 -rw-rw-r--   1 mcidas   ustaff     13628 Jul  5  2000 RESOLV.BAK
>    800066 -rw-rw-r--   1 mcidas   ustaff       622 Jun 14  2000 SETUP.BAT
>
>do you have any thoughts on this issue?  Presumably, stopping all the
>mcidas processes will free 11GB.

I just saw your note.  I logged onto motherlode and did a 'df -k' that
yielded the following results:

/local/ldm% df -k        
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0        124330   71585   40312    64%    /
/dev/md/dsk/d6       1018382  480215  477065    51%    /usr
/proc                      0       0       0     0%    /proc
fd                         0       0       0     0%    /dev/fd
mnttab                     0       0       0     0%    /etc/mnttab
/dev/md/dsk/d3        493688  368986   75334    84%    /var
swap                 2919408      16 2919392     1%    /var/run
/dev/md/dsk/d1       17114522 15399537 1543840    91%    /scratch
/dev/md/dsk/d9       17114522 15762532 1180845    94%    /home
/dev/md/dsk/d8       17372586 4185845 13013016    25%    /opt
/dev/md/dsk/d7       608262842 536361370 65818844    90%    /data
/dev/md/dsk/d4       1018382  385138  572142    41%    /tmp

Note that /home is at 94% capacity.

I then stopped the LDM and reran 'df -k':

/local/ldm% ldmadmin stop
stopping the LDM server...
LDM server stopped
/local/ldm% df -k
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0        124330   71585   40312    64%    /
/dev/md/dsk/d6       1018382  480215  477065    51%    /usr
/proc                      0       0       0     0%    /proc
fd                         0       0       0     0%    /dev/fd
mnttab                     0       0       0     0%    /etc/mnttab
/dev/md/dsk/d3        493688  368988   75332    84%    /var
swap                 3501760      16 3501744     1%    /var/run
/dev/md/dsk/d1       17114522 15399537 1543840    91%    /scratch
/dev/md/dsk/d9       17114522 12626044 4317333    75%    /home
/dev/md/dsk/d8       17372586 4185665 13013196    25%    /opt
/dev/md/dsk/d7       608262842 536366334 65813880    90%    /data
/dev/md/dsk/d4       1018382  389035  568245    41%    /tmp

Now /home is 75% fill.

Shutting down the LDM stopped all McIDAS-XCD processes from running.

I cleaned out ~ldm/.mctmp (entries are created by McIDAS processes
running from the LDM; the subdirectories under ~ldm/.mctmp are small);
restarted the LDM; and then reran 'df -k' with the following results:

/local/ldm% ldmadmin start
starting the LDM server...
the LDM server has been started
/local/ldm% df -k
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0        124330   71585   40312    64%    /
/dev/md/dsk/d6       1018382  480215  477065    51%    /usr
/proc                      0       0       0     0%    /proc
fd                         0       0       0     0%    /dev/fd
mnttab                     0       0       0     0%    /etc/mnttab
/dev/md/dsk/d3        493688  369030   75290    84%    /var
swap                 3560624      16 3560608     1%    /var/run
/dev/md/dsk/d1       17114522 15967865  975512    95%    /scratch
/dev/md/dsk/d9       17114522 4155606 12787771    25%    /home
/dev/md/dsk/d8       17372586 4185488 13013373    25%    /opt
/dev/md/dsk/d7       608262842 536366262 65813952    90%    /data
/dev/md/dsk/d4       1018382  383522  573758    41%    /tmp

I guess the question for me now is exactly what process was holding
open the huge file in /home/mcidas/upcworkdata after it had been deleted!?

The only things that comes to mind is the file ROUTEPP.LOG.  This file
contains log output from McIDAS PostProcess BATCH files which are
run upon receipt of image files in the MCIDAS stream.  ROUTEPP.LOG
gets deleted by the cron-initiated mcscour.sh process run by 'ldm'.
The puzzling thing is that the PP BATCH processes fire up and log
to ROUTEPP.LOG and then exit.  They shouldn't hold open a write to
the file.  Again, I am puzzled by this...

Let's chat about what could be going on tomorrow.

Tom

>>From address@hidden Mon May 20 17:05:36 2002
>
>More info -- I forgot about /proc.  Look in /proc/8442/fd which is the
>list of open fd's for one of the mcidas pid's and see this;
>
>--w--w----   0 ldm      ustaff   11879671265 May 20 16:17 1
>--w--w----   0 ldm      ustaff   11879671265 May 20 16:17 2
>
>as I recall, fd 1 and 2 is stdout and stderr.  You can cat the file;
>
>(motherlode) 234 # more 1
>Starting DDS at 02125.205929
>Domestic Data Service
>205929
>High Resolution Data Service
>ingebin.k: Done
> at 02125.205930
>startxcd.k: m0monexe - first startup of :DMSFC
>                                   9350
>startxcd.k: m0monexe - first startup of :DMRAOB
>                                  9351
>startxcd.k: m0monexe - first startup of :DMSYN
>                                   9352
>startxcd.k: m0monexe - first startup of :DMMISC
>                                  9353
>startxcd.k: m0monexe - first startup of :DMGRID
>                                  9354
>...

This is the log output for the starting of XCD processes.  This logging
is supposed to be written to upcworkdata/XCD_START.LOG.  This file
does not get deleted until the LDM is stopped and restarted.

Each time an XCD process starts, log output gets written to XCD_START.LOG.
Admittedly, XCD processes do start/restart on a regular basis, but
I would never guess that there was enough log output to consume 11 GB
of disk space over the course of the number of days that the LDM is up
and running.

Also, I just took a look at /proc:

/home/mcidas/upcworkdata% cd /proc
/proc% dir */fd/*

I see LOTS of entries that are huge:

 ...
    640300 -rw-rw-r--   1 ustaff   89991520 May 21 01:24 27253/fd/14
 ...
    640300 -rw-rw-r--   1 ustaff   89991520 May 21 01:24 27254/fd/13
 ...
    640300 -rw-rw-r--   1 ustaff   89991520 May 21 01:24 27258/fd/10
 ...
    640300 -rw-rw-r--   1 ustaff   89991520 May 21 01:24 27259/fd/11
 ...
    640300 -rw-rw-r--   1 ustaff   89991520 May 21 01:24 27264/fd/18

 ...
   7463818 -rw-rw-r--   1 ustaff   100237236 May 21 01:23 27260/fd/13

 ...
  17728129 -rw-rw-r--   1 ustaff   378889216 May 21 01:23 27559/fd/3

 ...
      6637 -r--r--r--   1 ustaff   1980915712 May 17 11:54 27217/fd/4
      6637 -r--r--r--   1 ustaff   1980915712 May 17 11:54 27218/fd/0
      6637 -r--r--r--   1 ustaff   1980915712 May 17 11:54 27219/fd/0
      6637 -r--r--r--   1 ustaff   1980915712 May 17 11:54 27220/fd/0
      6637 -r--r--r--   1 ustaff   1980915712 May 17 11:54 27222/fd/4
      6637 -rw-rw-r--   1 ustaff   1980915712 May 17 11:54 27334/fd/4

What do these mean?

Could this be the same thing we saw on shemp (when stopping the LDM
resulted in 600 MB magically appearing)?

Tom