
Re: 20010531: ldm 5.1.3 with RH 7.1 thrashing



On Thu, 31 May 2001, Arthur A. Person wrote:

> On Thu, 31 May 2001, Anne Wilson wrote:
>
> > "Arthur A. Person" wrote:
> > >
> > > On Thu, 31 May 2001, Anne Wilson wrote:
> > >
> > > >  And, what's the history on the
> > > > queue size?
> > >
> > > I believe I started running the ldm with a queue size of 2GB around May
> > > 18, but with only a couple of rpc's in test mode.  I then added my
> > > downstream sites at the end of last week, and over the weekend (Sunday)
> > > the system choked with the thrashing.  I came in and power-cycled,
> > > re-made the queue at 300MB, and restarted in hopes I would get through
> > > the rest of the long weekend okay, and did.  At that point my swap space
> > > was a 1.5GB partition, and I began thinking I perhaps needed swap space
> > > larger than my queue size if the queue is memory-mapped, so I added a
> > > 2GB swap file to the system, re-made the queue at 2GB, and restarted
> > > the ldm on Tuesday.  This morning I noticed the system was thrashing
> > > again; I don't know exactly when it started.
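> > >
> > > For the record, roughly what I did was something like the following
> > > (paths and sizes are from memory, so treat them as illustrative):
> > >
> > >   ldmadmin stop
> > >   ldmadmin delqueue
> > >   ldmadmin mkqueue        # queue size set to 2GB in the ldm configuration
> > >   dd if=/dev/zero of=/swapfile bs=1024k count=2048   # 2GB swap file
> > >   mkswap /swapfile
> > >   swapon /swapfile
> > >   ldmadmin start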
> > >
> > > >  Do you normally run with less than 300MB, and is that what
> > > > you're doing now?
> > >
> > > I'm running with a 2GB queue now, which is what I want to run with.
> > > Actually, I'd make it bigger, but Linux (ldm?) doesn't seem to support it.
> > >
> >
> > Please see below for a comment about this.
> >
> > > > How many rpc.ldmd processes are currently running?  (I hope it's
> > > > responsive enough to tell.)
> > >
> > > Perhaps this is revealing... there's a bunch of rpc's running, I think
> > > more than there should be:
> >
> > Yes, this doesn't look right.  Currently you have 78 of these processes
> > running.  That's five more than you reported to me earlier, and four
> > more than when I first logged in.  The number seems to be growing.
> >
> > [ldm@ldm ~/etc]$ ps -ef | grep rpc.ldmd | wc
> >      79     789    6380
> >
> > The max number of rpc.ldmds you should have is the number of requests to
> > "unique" hosts plus the number of allows to "unique" hosts plus one.  (I
> > qualify "unique" because, as you know, the LDM will group
> > requests/allows to the same host unless you trick it by using the IP
> > address.)  You may have fewer rpc.ldmds if your upstream hosts are
> > unavailable or your downstream sites are not connected.   Anyway, you
> > have way more than you should, based on the entries in your ldmd.conf:
> >
> > [ldm@ldm ~/etc]$ grep -E "^request" ldmd.conf | wc
> >       5      20     228
> > [ldm@ldm ~/etc]$ grep -E "^allow" ldmd.conf | wc
> >      36     107    1669
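> >
> > A quick way to put those counts side by side might be something like
> > this (just a sketch -- it assumes the host is the last field of each
> > "request" and "allow" line, and the bracket trick keeps grep from
> > counting itself in the process list):
> >
> > req=$(grep -E "^request" ldmd.conf | awk '{print $NF}' | sort -u | wc -l)
> > alw=$(grep -E "^allow" ldmd.conf | awk '{print $NF}' | sort -u | wc -l)
> > run=$(ps -ef | grep '[r]pc.ldmd' | wc -l)
> > echo "expected max: $((req + alw + 1))   running now: $run"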
> >
> > I'm developing a hypothesis:  In looking at the PIDs of the running
> > rpc.ldmds and comparing those with the PIDs listed in the log, it looks
> > like sysu1.wsicorp.com is connecting a lot more than it's exiting.  Take
> > a look at this:
> >
> > [ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Connection from" | wc
> >     177    1416   12213
> > [ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Exiting" | wc
> >     121     726    5203
> > [ldm@ldm ~/logs]$ ^sysu1^windfall
> > grep windfall ldmd.log | grep -E "Exiting" | wc
> >      44     264    2024
> > [ldm@ldm ~/logs]$ ^Exiting^Connection from
> > grep windfall ldmd.log | grep -E "Connection from" | wc
> >      44     352    3564
> > [ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Connection from" | wc
> >      18     144    1170
> > [ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Exiting" | wc
> >      18     108     846
> >
> > The stats for windfall and bob are for comparison.  You'll see that for
> > those two hosts the number of connects and exits are the same.  I'd
> > expect them to be the same plus or minus one.
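> >
> > If you want to keep watching this, a small loop over the hosts does the
> > same comparison in one shot (just a convenience wrapper around the greps
> > above; the host strings are whatever appears in your log):
> >
> > for h in sysu1 windfall bob; do
> >     c=$(grep "$h" ldmd.log | grep -c "Connection from")
> >     e=$(grep "$h" ldmd.log | grep -c "Exiting")
> >     echo "$h: $c connects, $e exits"
> > done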
> >
> > I know WSI has their own unique version of the LDM based on a very early
> > version.  As an experiment, are you able to do without this data for a
> > bit?  I will conduct my own experiment here on our 7.1 machine, but it
> > may take me a little time, as I have to be away for a few hours starting
> > soon.
>
> I was starting to suspect the WSI feed as well.  I know they've had a lot
> of trouble staying connected here, and when I saw all the rpc's, I started
> thinking about WSI trying to connect.  If their software is old, they
> should probably update, since their feed has not been as reliable as it
> used to be, unless it's just a network bandwidth problem.  I guess I
> would have to take that up with them.  But the connect/reconnect thing
> shouldn't drag my system down either, should it?
>
> > One other point.  With your 2GB queue, you have lots of data.  At the
> > time I ran the pqmon command below, you had over 10 hours' worth of data,
> > and it was growing (see the 'age' field - it gives the age of the oldest
> > product in the queue).  The number of products in the queue is also
> > going up, so space is not yet being recycled:
> >
> > [ldm@ldm ~/data]$ pqmon -i3
> > May 31 17:49:25 pqmon: Starting Up (17268)
> > May 31 17:49:26 pqmon: nprods nfree  nempty      nbytes  maxprods maxfree  minempty    maxext  age
> > May 31 17:49:26 pqmon: 185848    64  326088  2051968120    185848 471    326088   1549296 36372
> > May 31 17:49:29 pqmon: 185868    64  326068  2052103712    185868 471    326068   1549296 36376
> > May 31 17:49:33 pqmon: 185877    64  326059  2052139000    185877 471    326059   1549296 36379
> > May 31 17:49:36 pqmon: 185890    64  326046  2052203688    185890 471    326046   1549296 36382
> > May 31 17:49:39 pqmon: 185901    64  326035  2052238392    185901 471    326035   1549296 36386
> > May 31 17:49:42 pqmon: 185916    64  326020  2052322080    185916 471    326020   1549296 36389
> > May 31 17:49:46 pqmon: 185923    63  326014  2052353264    185923 471    326014   1549296 36392
> > May 31 17:49:49 pqmon: 185938    63  325999  2052437608    185938 471    325999   1549296 36395
> > May 31 17:49:52 pqmon: 185944    63  325993  2052463160    185944 471    325993   1549296 36398
> > May 31 17:49:55 pqmon: 185947    63  325990  2052480008    185947 471    325990   1549296 36402
> > May 31 17:49:59 pqmon: 185952    63  325985  2052525544    185952 471    325985   1549296 36405
> > May 31 17:50:03 pqmon: 185959    63  325978  2052588304    185959 471    325978   1549296 36409
> > May 31 17:50:06 pqmon: 185967    62  325971  2052651936    185967 471    325971   1549296 36412
> > May 31 17:50:09 pqmon: 185977    62  325961  2052717376    185977 471    325961   1549296 36416
> > May 31 17:50:12 pqmon: 185988    62  325950  2052812104    185988 471    325950   1549296 36419
> > May 31 17:50:16 pqmon: 185992    62  325946  2052852920    185992 471    325946   1549296 36422
> > May 31 17:50:19 pqmon: 186002    62  325936  2052912024    186002 471    325936   1549296 36425
> > May 31 17:50:22 pqmon: 186013    62  325925  2053009880    186013 471    325925   1549296 36428
> > May 31 17:50:25 pqmon: 186018    61  325921  2053029616    186018 471    325921   1549296 36432
> > May 31 17:50:29 pqmon: 186031    61  325908  2053061800    186031 471    325908   1549296 36435
> > May 31 17:50:32 pqmon: 186039    61  325900  2053099008    186039 471    325900   1549296 36439
> > May 31 17:50:35 pqmon: 186048    61  325891  2053150176    186048 471    325891   1549296 36442
> > May 31 17:50:39 pqmon: 186059    61  325880  2053246544    186059 471    325880   1549296 36445
> > May 31 17:50:42 pqmon: 186070    61  325869  2053333296    186070 471    325869   1549296 36448
> > May 31 17:50:45 pqmon: 186081    61  325858  2053422336    186081 471    325858   1549296 36452
> > May 31 17:50:49 pqmon: 186095    61  325844  2053506456    186095 471    325844   1549296 36455
> > May 31 17:50:52 pqmon: 186103    61  325836  2053532408    186103 471    325836   1549296 36459
> > May 31 17:50:56 pqmon: 186112    61  325827  2053643864    186112 471    325827   1549296 36462
> > May 31 17:50:59 pqmon: 186118    61  325821  2053755592    186118 471    325821   1549296 36465
> > May 31 17:51:02 pqmon: 186124    61  325815  2053858840    186124 471    325815   1549296 36469
> > May 31 17:51:05 pqmon: 186128    61  325811  2053906992    186128 471    325811   1549296 36472
> > May 31 17:51:09 pqmon: 186139    61  325800  2054017464    186139 471    325800   1549296 36475
> > May 31 17:51:12 pqmon: 186148    61  325791  2054157200    186148 471    325791   1549296 36478
> > May 31 17:51:15 pqmon: 186155    61  325784  2054262720    186155 471    325784   1549296 36481
> > May 31 17:51:19 pqmon: 186162    60  325778  2054333056    186162 471    325778   1549296 36485
> > May 31 17:51:22 pqmon: 186172    60  325768  2054454576    186172 471    325768   1549296 36488
> > May 31 17:51:26 pqmon: 186176    60  325764  2054533992    186176 471    325764   1549296 36492
> > May 31 17:51:29 pqmon: 186185    60  325755  2054675840    186185 471    325755   1549296 36495
> > May 31 17:51:32 pqmon: 186190    60  325750  2054758024    186190 471    325750   1549296 36498
> > May 31 17:51:35 pqmon: 186197    59  325744  2054844960    186197 471    325744   1549296 36501
> > May 31 17:51:36 pqmon: Interrupt
> > May 31 17:51:36 pqmon: Exiting
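> >
> > (To put the 'age' column in hours: the first sample shows 36372 seconds,
> > which is 36372/3600, or a bit over 10 hours of data, and it climbs to
> > 36501 seconds by the last sample.)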
> >
> > Do you really want to keep that much data?  If you have the space and
> > everything's working fine, I guess there's no reason not to...  This is
> > just an FYI.
>
> Yeah, I know it will hold a lot, but I like lots of data :)  As I said, if
> I could make the queue even bigger, I would.  Space is cheap these days,
> and I figure that as a relay, if someone downstream is down for a bunch of
> hours, they can still catch up on the data.
>
> > Please let me know what you think about the WSI feed.  I will be leaving
> > here in about 15 minutes, but will give my own test a try later this
> > afternoon when I return.
>
> Maybe I'll try killing off some rpc.ldmd processes and see if things
> improve, assuming I don't jam the system.

Okay, I killed off most or all of the hung rpc's... not sure it helped
much, though.  The disk may be a little less busy, but it's still pretty
busy.  Maybe this is still some sort of queue issue...?
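
I'll keep an eye on the swapping in the meantime; something like

    vmstat 5

(watching the si/so columns) or plain

    free -m

should show whether it's really paging against the queue or just doing a
lot of normal I/O.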

                                       Art.

Arthur A. Person
Research Assistant, System Administrator
Penn State Department of Meteorology
email:  address@hidden, phone:  814-863-1563