
[LDM #MMY-812358]: Possible memory leak in ldm



Bill,

> The National Weather Service uses the LDM as part of data transfers
> between offices on our LDAD servers.  We have two servers that have
> heartbeat failover capability.  I am very familiar with the LDM and
> configuration and have been using it successfully for almost 15 years
> now. Recently I added some new model data products to our LDM feed from
> the NWS office in Seattle to NWS Portland.

What version of the LDM is having problems?

> There are 5 large files
> that are transferred at different times twice a day.  Each of these
> files is 200-300MB.  There is also one file of around 900MB.

Are these files inserted into the product-queue all at once?  If so,
then the product-queue should be large enough to contain all of them
(e.g., 2.2 gigabytes).
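
As a rough check of that figure, here is a minimal sketch of the arithmetic,
assuming the sizes you quoted (five files of 200-300MB each plus one of about
900MB) and decimal megabytes.  The function name is only for illustration; it
is not part of the LDM.

```python
MB = 1_000_000  # decimal megabyte; an assumption about how the sizes were quoted


def queue_bytes_needed(file_sizes_mb):
    """Return the bytes needed to hold every listed file in the queue at once."""
    return sum(file_sizes_mb) * MB


# Sizes from the report: five files of 200-300MB, one of roughly 900MB.
low = queue_bytes_needed([200] * 5 + [900])
mid = queue_bytes_needed([250] * 5 + [900])
high = queue_bytes_needed([300] * 5 + [900])

for label, size in (("low", low), ("mid", mid), ("high", high)):
    print(f"{label}: {size / 1_000_000_000:.2f} GB")
```

The mid-range estimate is about 2.2 gigabytes, and the worst case is about
2.4 gigabytes, both of which exceed a 1.5GB queue.  This only matters if all
of the files are resident in the queue at the same time.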

> After
> these products were added to the LDM data transfer process we began
> experiencing serious problems here in Portland on the receiving end
> where the computer loads would go way up and the computer would slow and
> eventually fail over to the other server.   I have a queue size of
> 1.5GB  which should be more than enough to hold the files.

Your product-queue might need to be larger.  While pqact(1) processes
a data-product, the product can't be removed from the queue, which
might affect the reception of new products.

> The queue
> usage according to pqmon has never exceeded 1GB.   Initially there was
> only 1GB of main memory in these computers.
> 
> I monitored the memory usage as these products were being received and
> noticed that we were running out of main memory and using up to 1.5GB of
> swap space before the server would fail over to the other server.  This
> was a serious problem which was having impacts on our data collection.
> We added memory to the servers: 2GB to one server and 4GB to the other
> server to see if this would help since there is a note in the LDM
> documentation that says you should have enough memory to hold the entire
> queue.

You should have sufficient physical memory to hold the product-queue
and the LDM programs in memory.  That is only to improve performance, however.
Ignoring performance, the LDM doesn't depend on the amount of physical
memory for correct behaviour (unless you count missing a data-product due
to poor performance as incorrect behaviour).

> After some more tests and monitoring the main memory would
> continue to be used up and would go into large amounts of swap space
> before the server would fail over.
> 
> At any time, if I stop the LDM the swap space and almost all of the main
> memory are immediately released.   It appears that the LDM is not
> releasing the memory properly as products are expired from the queue.
> Using the pqmon utility I can see the amount of memory used by the
> queue.  As the files arrive the memory usage grows and is then released
> after it is expired in around an hour.  But the actual memory usage
> continues to grow and grow as the next file arrives.  The physical
> memory use does not agree with the queue usage. The queue usage has
> never exceeded 1GB.

This is very odd.  We've never seen this behaviour before.

What operating system is running on the affected machine?  What version?

> I have implemented a temporary workaround, but it is not the final
> solution.  I am restarting the LDM, without deleting the queue, from a
> cron job about an hour after the large files are received.  This releases
> the memory back down to almost nothing each time and allows the system
> to continue to operate.  The problem is that the files don't always
> arrive at the same time so the timing can be off.
> 
> It is also interesting to note that as far as I know there have been no
> problems inserting these files into the queue at WFO Seattle and they
> are not experiencing any problems even though they only have 1GB of main
> memory.   Thus, the problem seems to be how the LDM manages the memory
> as it is receiving these large files. Although I am familiar with LDM I
> can't rule out the possibility that I have something configured
> incorrectly.  I have attached the ldmd.conf and pqact files.

You should remove the "EXEC pqexpire" entry from the LDM configuration
file, ldmd.conf.  That program is deprecated (assuming you're using a
recent version of the LDM).

> The large
> files are the ones with "mm5" in the file name.
> 
> My workaround is not foolproof either.  I noticed that the system failed
> over to the second server today when the LDM was being restarted.  This
> may be because the queue was approaching 1 GB and was "flushing to disk",
> keeping the system very busy.  This appears to have caused delay in the
> heartbeat signals.
> 
> One other thing of note.  These files are being inserted into the queue
> in Seattle with the full path name (e.g.
> /data/ldad/mm5gfsd1/20090408_1200.mm5gfsd1.gz).  Of all the files we
> receive they are the only ones that contain the full path name.  Could
> that be contributing to the problem?

It shouldn't.

> I would appreciate any help or advice you could give us on fixing this
> problem.

Please send me the answers to the above questions.

> Thanks,
> Bill

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: MMY-812358
Department: Support LDM
Priority: Normal
Status: Closed