
Re: 20010531: ldm 5.1.3 with RH 7.1 thrashing



"Arthur A. Person" wrote:
> 
> On Thu, 31 May 2001, Anne Wilson wrote:
> 
> >  And, what's the history on the
> > queue size?
> 
> I believe I started running the ldm with a queue size of 2GB around May
> 18, but with only a couple of rpc's in test mode.  I then added my
> downstream sites the end of last week, and over the weekend (Sunday) the
> system choked with the thrashing.  I came in and power-cycled, re-made the
> queue at 300MB and restarted in hopes I would get through the rest of the
> long weekend okay, and did.  At this point, my swap space was a 1.5GB
> partition and I began thinking I perhaps needed swap space larger than my
> queue size if the queue is mapped, so I added a 2GB swap file to the
> system and then restarted the ldm on Tuesday with a re-made 2GB queue.
> This morning I noticed the system was thrashing again; I don't know
> exactly when it started.
> 
> >  Do you normally run with less than 300MB, and is that what
> > you're doing now?
> 
> I'm running with a 2GB queue now, which is what I want to run with.
> Actually, I'd make it bigger, but Linux (ldm?) doesn't seem to support it.
> 

Please see below for a comment about this.

> > How many rpc.ldmd processes are currently running?  (I hope it's
> > responsive enough to tell.)
> 
> Perhaps this is revealing... there's a bunch of rpc's running, I think
> more than there should be:

Yes, this doesn't look right.  Currently you have 78 of these processes
running (the count of 79 below most likely includes the grep itself).
That's five more than what you reported to me earlier, and four more
than when I first logged in.  The number seems to be growing.

[ldm@ldm ~/etc]$ ps -ef | grep rpc.ldmd | wc
     79     789    6380

The max number of rpc.ldmds you should have is the number of requests to
"unique" hosts plus the number of allows to "unique" hosts plus one.  (I
qualify "unique" because, as you know, the LDM will group
requests/allows to the same hosts unless you trick it by using the IP
address.)  You may have fewer rpc.ldmds if your upstream hosts are
unavailable or your downstream sites are not connected.  Anyway, you
have way more than you should, based on the entries in your ldmd.conf:

[ldm@ldm ~/etc]$ grep -E "^request" ldmd.conf | wc
      5      20     228
[ldm@ldm ~/etc]$ grep -E "^allow" ldmd.conf | wc
     36     107    1669
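
Even if every one of those lines named a different host, the ceiling
would be 5 + 36 + 1 = 42 rpc.ldmds, well short of the 78 you have now.
If you want to compute the bound per unique host, something like the
sketch below might do.  It is only a rough check: the function name is
made up, and it assumes each request line reads "request FEED PATTERN
HOST" and each allow line reads "allow FEED HOSTPATTERN", separated by
whitespace, so patterns containing spaces would throw it off.

```shell
# Rough upper bound on rpc.ldmd processes implied by an ldmd.conf file.
# Assumption: "request FEED PATTERN HOST" and "allow FEED HOSTPATTERN",
# whitespace-separated, with no spaces inside the patterns.
max_ldmds() {
    conf=$1
    # Unique upstream hosts named on request lines (4th field).
    requests=$(awk '$1 == "request" { print $4 }' "$conf" | sort -u | wc -l)
    # Unique host patterns named on allow lines (3rd field).
    allows=$(awk '$1 == "allow" { print $3 }' "$conf" | sort -u | wc -l)
    # One process per unique request host, one per unique allow host,
    # plus one more, as described above.
    echo $((requests + allows + 1))
}
```

Running "max_ldmds ldmd.conf" in ~/etc and comparing the result with the
process count would show how far over the limit the machine is.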

I'm developing a hypothesis:  In looking at the PIDs of the running
rpc.ldmds and comparing those with the PIDs listed in the log, it looks
like sysu1.wsicorp.com is connecting a lot more than it's exiting.  Take
a look at this:

[ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Connection from" | wc
    177    1416   12213
[ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Exiting" | wc
    121     726    5203
[ldm@ldm ~/logs]$ ^sysu1^windfall
grep windfall ldmd.log | grep -E "Exiting" | wc
     44     264    2024
[ldm@ldm ~/logs]$ ^Exiting^Connection from
grep windfall ldmd.log | grep -E "Connection from" | wc
     44     352    3564
[ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Connection from" | wc
     18     144    1170
[ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Exiting" | wc
     18     108     846

The stats for windfall and bob are for comparison.  You'll see that for
those two hosts the number of connects and exits are the same.  I'd
expect them to be the same plus or minus one.
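
For sysu1 that leaves 177 - 121 = 56 connections with no matching exit.
To repeat the comparison for other hosts, the greps above could be
wrapped up roughly as follows.  This is just a sketch: the function name
is made up, and it assumes, as the greps above do, that each connect is
logged with the text "Connection from" and each exit with "Exiting" on a
line that also mentions the host.

```shell
# Compare connects and exits logged for one host in an LDM log file.
# Assumption: connect lines contain "Connection from" and exit lines
# contain "Exiting", and both mention the host name.
conn_balance() {
    host=$1
    log=$2
    # grep -c prints 0 but returns non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    connects=$(grep -F -- "$host" "$log" | grep -c "Connection from" || true)
    exits=$(grep -F -- "$host" "$log" | grep -c "Exiting" || true)
    echo "$host connects=$connects exits=$exits diff=$((connects - exits))"
}
```

For example, "conn_balance sysu1 ldmd.log" run in ~/logs should report a
large positive diff, while windfall and bob should come out at or near
zero.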

I know WSI has their own unique version of the LDM based on a very early
version.  As an experiment, are you able to do without this data for a
bit?  I will conduct my own experiment here on our 7.1 machine, but it
may take me a little time, as I have to be away for a few hours starting
soon.

One other point.  With your 2GB queue, you have lots of data.  At the
time I ran the pqmon command below you had over 10 hours' worth of data,
and it was growing (see the 'age' field - it gives the age, in seconds,
of the oldest product in the queue).  The number of products in the
queue is also going up, so space is not yet being recycled:

[ldm@ldm ~/data]$ pqmon -i3
May 31 17:49:25 pqmon: Starting Up (17268)
May 31 17:49:26 pqmon: nprods nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext  age
May 31 17:49:26 pqmon: 185848    64  326088  2051968120    185848      471    326088   1549296 36372
May 31 17:49:29 pqmon: 185868    64  326068  2052103712    185868      471    326068   1549296 36376
May 31 17:49:33 pqmon: 185877    64  326059  2052139000    185877      471    326059   1549296 36379
May 31 17:49:36 pqmon: 185890    64  326046  2052203688    185890      471    326046   1549296 36382
May 31 17:49:39 pqmon: 185901    64  326035  2052238392    185901      471    326035   1549296 36386
May 31 17:49:42 pqmon: 185916    64  326020  2052322080    185916      471    326020   1549296 36389
May 31 17:49:46 pqmon: 185923    63  326014  2052353264    185923      471    326014   1549296 36392
May 31 17:49:49 pqmon: 185938    63  325999  2052437608    185938      471    325999   1549296 36395
May 31 17:49:52 pqmon: 185944    63  325993  2052463160    185944      471    325993   1549296 36398
May 31 17:49:55 pqmon: 185947    63  325990  2052480008    185947      471    325990   1549296 36402
May 31 17:49:59 pqmon: 185952    63  325985  2052525544    185952      471    325985   1549296 36405
May 31 17:50:03 pqmon: 185959    63  325978  2052588304    185959      471    325978   1549296 36409
May 31 17:50:06 pqmon: 185967    62  325971  2052651936    185967      471    325971   1549296 36412
May 31 17:50:09 pqmon: 185977    62  325961  2052717376    185977      471    325961   1549296 36416
May 31 17:50:12 pqmon: 185988    62  325950  2052812104    185988      471    325950   1549296 36419
May 31 17:50:16 pqmon: 185992    62  325946  2052852920    185992      471    325946   1549296 36422
May 31 17:50:19 pqmon: 186002    62  325936  2052912024    186002      471    325936   1549296 36425
May 31 17:50:22 pqmon: 186013    62  325925  2053009880    186013      471    325925   1549296 36428
May 31 17:50:25 pqmon: 186018    61  325921  2053029616    186018      471    325921   1549296 36432
May 31 17:50:29 pqmon: 186031    61  325908  2053061800    186031      471    325908   1549296 36435
May 31 17:50:32 pqmon: 186039    61  325900  2053099008    186039      471    325900   1549296 36439
May 31 17:50:35 pqmon: 186048    61  325891  2053150176    186048      471    325891   1549296 36442
May 31 17:50:39 pqmon: 186059    61  325880  2053246544    186059      471    325880   1549296 36445
May 31 17:50:42 pqmon: 186070    61  325869  2053333296    186070      471    325869   1549296 36448
May 31 17:50:45 pqmon: 186081    61  325858  2053422336    186081      471    325858   1549296 36452
May 31 17:50:49 pqmon: 186095    61  325844  2053506456    186095      471    325844   1549296 36455
May 31 17:50:52 pqmon: 186103    61  325836  2053532408    186103      471    325836   1549296 36459
May 31 17:50:56 pqmon: 186112    61  325827  2053643864    186112      471    325827   1549296 36462
May 31 17:50:59 pqmon: 186118    61  325821  2053755592    186118      471    325821   1549296 36465
May 31 17:51:02 pqmon: 186124    61  325815  2053858840    186124      471    325815   1549296 36469
May 31 17:51:05 pqmon: 186128    61  325811  2053906992    186128      471    325811   1549296 36472
May 31 17:51:09 pqmon: 186139    61  325800  2054017464    186139      471    325800   1549296 36475
May 31 17:51:12 pqmon: 186148    61  325791  2054157200    186148      471    325791   1549296 36478
May 31 17:51:15 pqmon: 186155    61  325784  2054262720    186155      471    325784   1549296 36481
May 31 17:51:19 pqmon: 186162    60  325778  2054333056    186162      471    325778   1549296 36485
May 31 17:51:22 pqmon: 186172    60  325768  2054454576    186172      471    325768   1549296 36488
May 31 17:51:26 pqmon: 186176    60  325764  2054533992    186176      471    325764   1549296 36492
May 31 17:51:29 pqmon: 186185    60  325755  2054675840    186185      471    325755   1549296 36495
May 31 17:51:32 pqmon: 186190    60  325750  2054758024    186190      471    325750   1549296 36498
May 31 17:51:35 pqmon: 186197    59  325744  2054844960    186197      471    325744   1549296 36501
May 31 17:51:36 pqmon: Interrupt
May 31 17:51:36 pqmon: Exiting

Do you really want to keep that much data?  If you have the space and
everything's working fine, I guess there's no reason not to...  This is
just an FYI.
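
In case it helps when reading pqmon output, the 'age' column is in
seconds, so the first sample above (36372) works out to about 10.1
hours.  Between the first and last samples the age went from 36372 to
36501, i.e. up by 129 seconds over 129 seconds of wall clock, which is
what you'd expect if the oldest product is never being removed.  A
small helper for the conversion might look like this (the function name
is just something I made up for illustration):

```shell
# Convert a pqmon 'age' value (seconds) to hours, one decimal place.
age_hours() {
    awk -v s="$1" 'BEGIN { printf "%.1f\n", s / 3600 }'
}
```

For example, "age_hours 36372" gives the age of the oldest product at
the start of the run above.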

Please let me know what you think about the WSI feed.  I will be leaving
here in about 15 minutes, but will give my own test a try later this
afternoon when I return.

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************