
[TIGGE #RST-559527]: processed oldest product in queue



Manuel,

> This is the output of top on tigge-ldm.ecmwf.int:
> 
> top - 16:28:01 up 105 days,  5:06, 25 users,  load average: 5.36, 5.26, 5.34
> Tasks: 318 total,   1 running, 317 sleeping,   0 stopped,   0 zombie
> Cpu(s): 11.4% us, 10.7% sy,  0.0% ni, 41.9% id, 34.2% wa,  0.2% hi,  1.5% si
> Mem:  16464552k total, 16431636k used,    32916k free,   529384k buffers
> Swap:  1052248k total,    10164k used,  1042084k free, 12866472k cached
> 
> We have 16GBytes physical memory, only 10 MBytes current swap.

The CPU is mostly either idle or waiting on disk I/O.
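
As a rough check of that reading, the idle and I/O-wait percentages can be pulled out of the Cpu(s) line quoted above and added together. This is only a sketch: the helper name parse_cpu_line is made up for illustration, and the parsing assumes the line keeps the "NN.N% label" layout shown in the quoted top output.

```python
import re

# The Cpu(s) line copied from the quoted top output.
CPU_LINE = ("Cpu(s): 11.4% us, 10.7% sy,  0.0% ni, 41.9% id, "
            "34.2% wa,  0.2% hi,  1.5% si")


def parse_cpu_line(line):
    """Map each state label (us, sy, ni, id, wa, hi, si) to its percentage.

    Hypothetical helper: assumes every field looks like '41.9% id'.
    """
    return {label: float(value)
            for value, label in re.findall(r"([\d.]+)%\s+(\w+)", line)}


cpu = parse_cpu_line(CPU_LINE)

# Time spent neither in user nor system code: idle plus waiting on I/O.
idle_or_waiting = cpu["id"] + cpu["wa"]
busy = cpu["us"] + cpu["sy"]

print(f"idle + I/O wait: {idle_or_waiting:.1f}%")
print(f"user + system:   {busy:.1f}%")
```

With the figures above, idle plus I/O wait comes to roughly three quarters of CPU time, against about 22% for user and system work.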

All of the physical memory is in use, plus about 10 MB of swap space.

We believe that the system is overloaded with I/O.  The "iostat"
utility indicates that one partition, in particular, is being very
heavily used.  It is not one that the LDM uses, however.  Indeed,
as far as we can tell, the LDM is not using the system as much as
other processes are.  If some of those processes could be
offloaded to other computers, then a larger product-queue might
be feasible.

> The output of pqmon you analysed was from when the queue size was 4
> GBytes. The following are two outputs of pqmon taken today, with a queue
> size of 10 GBytes:
> 
> ldm@tigge-ldm:~/logs> pqmon
> Apr 12 13:49:36 pqmon NOTE: Starting Up (18219)
> Apr 12 13:49:36 pqmon NOTE: nprods nfree  nempty      nbytes  maxprods maxfree  minempty    maxext  age
> Apr 12 13:49:36 pqmon NOTE:  20500    14 2420892  9995688200    144747 5595   2296244    429072 2878

I don't like the age of the oldest product in the queue
(2878 seconds) because that's less than one hour.

The product-queue has slots for 2441406 products but has only
held a maximum of 144747 products.  This means that the
product-queue is limited by the size of its data portion and
not by the number of product slots.  This fact is much less
important, however, than the amount of I/O that the system is
doing.
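
The slot count and the age check can be reproduced from the first pqmon report with a few lines of Python. This is a sketch rather than anything the LDM provides: it assumes the total number of slots is nprods + nfree + nempty, which is an assumption on my part that happens to reproduce the 2441406 figure, and it takes the field names straight from the pqmon header line.

```python
# Field names from the pqmon header line in the quoted output.
FIELDS = ["nprods", "nfree", "nempty", "nbytes", "maxprods",
          "maxfree", "minempty", "maxext", "age"]

# Values from the first pqmon report (Apr 12 13:49:36),
# with the mail-wrapped line rejoined.
VALUES = "20500 14 2420892 9995688200 144747 5595 2296244 429072 2878"

stats = dict(zip(FIELDS, map(int, VALUES.split())))

# Assumption: total product slots = nprods + nfree + nempty.
total_slots = stats["nprods"] + stats["nfree"] + stats["nempty"]

# Largest fraction of the slots that has ever been occupied.
peak_slot_use = stats["maxprods"] / total_slots

# True when the oldest product is younger than one hour.
too_young = stats["age"] < 3600

print(f"total slots:       {total_slots}")
print(f"peak slot use:     {peak_slot_use:.1%}")
print(f"oldest product:    {stats['age']} s (under one hour: {too_young})")
```

Since only a small fraction of the slots has ever been occupied while nbytes sits close to the 10 GB queue size, the data portion is what fills up first.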

> Apr 12 13:49:36 pqmon NOTE: Exiting
> ldm@tigge-ldm:~/logs> pqmon
> Apr 12 15:26:33 pqmon NOTE: Starting Up (20910)
> Apr 12 15:26:33 pqmon NOTE: nprods nfree  nempty      nbytes  maxprods maxfree  minempty    maxext  age
> Apr 12 15:26:33 pqmon NOTE:  31532   384 2409490  9996855672    144747 5595   2296244    469176 5601
> Apr 12 15:26:33 pqmon NOTE: Exiting
> 
> 
> Do you recommend an optimum queue size taking into account the
> characteristics of TIGGE ? Currently:
> * 169 GBytes pqinsert per day
> *   8 GBytes per day received from NCAR
> *  27 GBytes per day received from CMA (eventually)
> *  16 GBytes of physical memory in tigge-ldm.ecmwf.int
> Note that these figures correspond to daily production. We are currently
> 4 cycles behind, which means we have approximately 340 Gbytes
> outstanding to be inserted in LDM. We limit ourselves to a maximum
> insertion rate of 3.2 MBytes / second, which means we can potentially
> insert 270 GBytes into LDM per day.

My advice is for the queue to be able to hold at least one
hour of data, so the data portion of the queue should be
at least (270 GB/d + 8 GB/d + 27 GB/d) x (1 d / 24 h) x 1 h,
which is about 12.71 GB.  Unfortunately, a queue that large
would leave too little of the 16 GB of physical memory for
the other processes and force them to swap in and out of
memory, greatly reducing the performance of the system.
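
The sizing arithmetic can be written out as a short Python sketch. The function name queue_size_gb and its parameters are made up for illustration; the rates are the ones quoted above, with the 270 GB/d maximum insertion rate used in place of the 169 GB/d daily production because the backlog is still being worked off.

```python
def queue_size_gb(rates_gb_per_day, hours_to_hold=1.0):
    """Data-portion size (GB) needed to hold `hours_to_hold` hours of input.

    Hypothetical helper: sums the daily input rates and scales them
    from a 24-hour day down to the requested residence time.
    """
    total_per_day = sum(rates_gb_per_day.values())
    return total_per_day * hours_to_hold / 24.0


# Daily input rates in GB/day, taken from the figures quoted above.
rates = {
    "pqinsert (maximum rate)": 270.0,
    "NCAR": 8.0,
    "CMA (eventually)": 27.0,
}

needed_gb = queue_size_gb(rates, hours_to_hold=1.0)
physical_memory_gb = 16.0  # from the quoted top output

print(f"queue data portion needed: {needed_gb:.2f} GB")
print(f"share of physical memory:  {needed_gb / physical_memory_gb:.0%}")
```

A queue of that size would take roughly four fifths of the 16 GB of physical memory, which is why the other processes would end up swapping.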

I think you need to increase the amount of memory or
offload all those other I/O intensive processes, or
both.

I hope this helps.

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: RST-559527
Department: Support IDD TIGGE
Priority: Normal
Status: Closed