Re: 20000515: pbuf_flush log messages (fwd)



>
> Date: Mon, 15 May 2000 08:56:21 -0600
> From: Unidata Support <address@hidden>
> Reply-To: Chris Herbster <address@hidden>
> To: address@hidden
> Subject: 20000515: pbuf_flush log messages
>
> >To: address@hidden,
> >To: "Herbster, Chris" <address@hidden>
> >From: Chris Herbster <address@hidden>
> >Subject: pbuf_flush log messages
> >Organization: UCAR/Unidata
> >Keywords: 200005121726.e4CHQ1413099
>
> I have almost re-established our system after a hard disk failure that
> resulted in the complete loss of our LDM account (no tape device yet,
> and the tar file that I thought was on another machine evaporated -
> bummer).
>
> We used a 6 GB drive that we had in another system that has a video card
> problem.  The server (opwx.db.erau.edu) is running Redhat 5.2.
>
> [herbster@opwx ~]$ uname -a
> Linux opwx.db.erau.edu 2.0.36 #8 Tue Aug 31 13:14:50 EDT 1999 i686
> unknown
>
> I am getting repeated pbuf_flush messages in the log:
>
> [ldm@opwx ~]$ head -30 logs/ldmd.log
> May 12 16:46:45 opwx rpc.ldmd[1004]: Starting Up (built: May  5 2000
> 16:28:13)
> May 12 16:46:45 opwx pluto[1008]: run_requester: Starting Up:
> pluto.met.fsu.edu
> May 12 16:46:45 opwx pluto[1008]: run_requester: 20000512154646.006
> TS_ENDT {{DDPLUS,  ".*"}}
> May 12 16:46:46 opwx striker[1009]: run_requester: Starting Up:
> striker.atmos.albany.edu
> May 12 16:46:46 opwx striker[1009]: run_requester: 20000512154646.026
> TS_ENDT {{NLDN,  ".*"}}
> May 12 16:46:46 opwx pqexpire[1005]: Starting Up
> May 12 16:46:46 opwx pqbinstats[1006]: Starting Up (1004)
> May 12 16:46:46 opwx pqact[1007]: Starting Up
> May 12 16:46:46 opwx striker[1009]: FEEDME(striker.atmos.albany.edu): OK
> May 12 16:46:46 opwx pluto[1008]: FEEDME(pluto.met.fsu.edu): OK
> May 12 16:46:48 opwx localhost[1015]: Connection from localhost
> May 12 16:46:48 opwx localhost[1015]: Connection reset by peer
> May 12 16:46:48 opwx localhost[1015]: Exiting
> May 12 16:47:15 opwx pqact[1007]: pbuf_flush 4: time elapsed   2.535485
> May 12 16:47:37 opwx pqact[1007]: pbuf_flush 4: time elapsed   9.433849
> May 12 16:47:49 opwx pqact[1007]: pbuf_flush 4: time elapsed  11.849885
> May 12 16:47:53 opwx pqact[1007]: pbuf_flush 4: time elapsed   3.429464
> May 12 16:47:58 opwx pqact[1007]: pbuf_flush 4: time elapsed   4.590907
> May 12 16:48:14 opwx pqact[1007]: pbuf_flush 4: time elapsed  15.620028
> May 12 16:48:22 opwx pqact[1007]: pbuf_flush 4: time elapsed   8.316928
> May 12 16:48:50 opwx pqact[1007]: pbuf_flush 4: time elapsed  26.827392
> May 12 16:49:04 opwx pqact[1007]: pbuf_flush 4: time elapsed  13.800704
> May 12 16:49:18 opwx pqact[1007]: pbuf_flush 4: time elapsed  13.448355
> May 12 16:49:34 opwx pqact[1007]: pbuf_flush 4: time elapsed  15.369003
> May 12 16:49:36 opwx pqact[1007]: pbuf_flush 4: time elapsed   2.562947
> May 12 16:49:49 opwx pqact[1007]: pbuf_flush 4: time elapsed  12.757056
> May 12 16:50:02 opwx pqact[1007]: pbuf_flush 4: time elapsed  13.213933
> May 12 16:50:16 opwx pqact[1007]: pbuf_flush 4: time elapsed  13.347354
> May 12 16:50:29 opwx pqact[1007]: pbuf_flush 4: time elapsed  13.105565
> May 12 16:51:00 opwx pqact[1007]: pbuf_flush 4: time elapsed  31.202759
> ...
> May 12 17:00:02 opwx pqact[1007]: pbuf_flush 4: time elapsed  17.271805
> May 12 17:00:11 opwx pqact[1007]: pbuf_flush 4: time elapsed   6.612698
> May 12 17:00:31 opwx pqact[1007]: pbuf_flush 4: time elapsed  20.289586
> May 12 17:00:44 opwx pqact[1007]: pbuf_flush 4: time elapsed  13.348882
> May 12 17:01:04 opwx pqact[1007]: pbuf_flush 4: time elapsed  18.993940
> May 12 17:01:22 opwx pqact[1007]: pbuf_flush 4: time elapsed  17.884479
> May 12 17:01:37 opwx pqact[1007]: pbuf_flush 4: time elapsed  15.760065
> May 12 17:01:56 opwx pqexpire[1005]: > Recycled   3467.727 kb/hr (
> 5661.799 prods per hour)
> May 12 17:02:02 opwx pqact[1007]: pbuf_flush 4: time elapsed  24.224632
>
> Here is our request from ldmd.conf:
>
> # LDM5 servers request data
> #
> #       request <feedset> <pattern> <hostname pattern>
> #
> request DDPLUS  ".*"    pluto.met.fsu.edu
> #request HDS    ".*"    pluto.met.fsu.edu
> #request IDS    ".*"    pluto.met.fsu.edu
> #request MCIDAS ".*"    pluto.met.fsu.edu
>
> request NLDN    ".*"    striker.atmos.albany.edu
>
> This data volume doesn't seem too ambitious for testing the
> system, does it?
>
> I've looked in the ldm-support archive and think that this is related to
> having a "slow hard disk" - is this true?  (It is an ATA/IDE disk.)
>
> hdc: WDC AC26400R, 6149MB w/512kB Cache, CHS=13328/15/63
>
> Please let me know what, if anything, I can do to fix this.
>
> Thanks!!
>
> Chris Herbster

Good Morning, Chris!

No, your data volume is not too ambitious.  Here's what's going on: pqact is
trying to send data down a pipe, presumably to a decoder.  The buffer got full
because the resource at the end of the pipe was not available at the time, so
the write had to block, causing it to take more than 1 second.  These messages
are informational - they may not be a problem if they occur infrequently and
you're not losing data.  But the frequency and timing of your messages seem to
indicate a significant problem.
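
If you want to see the kind of stall that's being timed, here's a rough
illustration using a named pipe (the file name and the 15-second delay are
made up, and this is not how pqact actually invokes a decoder):

  mkfifo /tmp/demo.pipe                              # stand-in for the pipe to a decoder
  ( sleep 15; cat /tmp/demo.pipe > /dev/null ) &     # a "decoder" that ignores the pipe for 15 seconds
  time sh -c 'echo "one product" > /tmp/demo.pipe'   # this write hangs until the reader shows up
  rm /tmp/demo.pipe

The write simply sits there until the reader starts draining the pipe, and
that wait is the "time elapsed" that pbuf_flush reports; pqact logs one of
these messages whenever a flush takes more than a second or so.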

The question is, why was the resource unavailable?  The problem could indeed
be a slow disk like you suggested, although it would be pretty darn slow if
it's struggling with your relatively small amount of data.  Or maybe the
decoder timed out or exited, although there would probably be other messages
if that were the case.  Either way, I wouldn't expect the messages to occur
with such frequency if the problem were one of those causes.  Maybe some other
unrelated process is tying up the disk.  Or maybe the write permissions aren't
set properly in your directory structure.

Another possibility is a "SCSI termination problem", but that sounds pretty
speculative to me.   Still, if you exhaust other possibilities, take a look at
http://www.unidata.ucar.edu/cgi-bin/mfs/65/2435?39#mfs for more on this topic.

So, check your permissions, ensure that any required decoders are indeed
running (although I would think you'd see a different message if they
weren't), use 'top' to see what processes might be hogging the disk, and make
sure there's plenty of space on the disk.  If disk speed alone really is your
problem, possible solutions are: get a faster disk, decrease the number of
processes using the disk, or offload some processing to another machine.
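
For what it's worth, these are the sorts of commands I'd start with (the data
directory below is just a placeholder - substitute wherever your pqact actions
actually write their output):

  df -k                          # is any filesystem (queue, data, /tmp) nearly full?
  ls -ld /your/data/directory    # placeholder path - can the 'ldm' user write there?
  top                            # is some other process hogging the CPU or the disk?
  ps -ef | grep pqact            # is pqact still running? (grep for your decoder names, too)

None of these will hurt anything, and they usually narrow the problem down
quickly.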

Good luck!  When you discover the cause of your problem, it is always helpful
to us and others in our community if you let us know.

Anne

--
***************************************************
Anne Wilson                     UCAR Unidata Program
address@hidden                  P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************