Re: [netcdf-java] NetCDF File and Variable Data Caching

  • To: James Gardner <james.gardner@xxxxxxxx>
  • Subject: Re: [netcdf-java] NetCDF File and Variable Data Caching
  • From: Christian Ward-Garrison <cwardgar@xxxxxxxx>
  • Date: Tue, 11 Jul 2017 16:58:03 -0600
Hi James,

> However I can't help wonder if this is best for the library.

That's fair. The best solution would probably have those values initialized
in their classes' <clinit> blocks (via system property or config file) and
immutable thereafter. At present, I don't think we do initialization like
that anywhere in the library. We have nj22Config.xml [1], but it must
explicitly be loaded by the application. There's no guarantee that the
config file will be read and applied at initialization time only.

> I imagine it would be useful to be able to change max size on a
> per-file basis.

I know you already touched on this above, but it seems to me that you could
use setSizeToCache in its current form to do exactly what you want, and it
wouldn't be all that cumbersome. After you've created the NetcdfFile, just
do something like:

// Apply a custom caching threshold to every variable in the file
for (Variable var : ncFile.getVariables()) {
    var.setSizeToCache(myCustomCacheSize);
}

Is that no good? Do you want to configure the value during object
construction only?
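If you wanted to vary the value by variable kind rather than use one number, here is a tiny heuristic you could feed into that loop. The class and constants are entirely hypothetical (not part of netcdf-java); they just mirror the 4,000/40,000-byte defaults mentioned elsewhere in this thread:

```java
// Hypothetical helper (not part of netcdf-java): pick a setSizeToCache
// threshold per variable kind, mirroring the library's 4,000-byte default
// for ordinary variables and 40,000 bytes for coordinate variables.
class CacheSizing {
    static final int DEFAULT_BYTES = 4_000;
    static final int COORD_MULTIPLIER = 10;

    // Threshold in bytes, given whether the variable is a coordinate variable.
    static int thresholdFor(boolean isCoordinateVariable) {
        return isCoordinateVariable ? DEFAULT_BYTES * COORD_MULTIPLIER
                                    : DEFAULT_BYTES;
    }
}
```

The loop would then call var.setSizeToCache(CacheSizing.thresholdFor(var.isCoordinateVariable())); I believe isCoordinateVariable() exists on Variable in 4.x, but treat that as an assumption.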

> Can you again validate that this is actually the situation we are facing
and that I'm not missing anything here?

Ordinarily, there will only ever be one copy of a NetcdfDataset object in
the cache for any single dataset on disk, and threads will share it.
However, we don't want multiple threads fiddling with the same NetcdfFile
object at the same time (they're mutable). So, if Thread B attempts to
acquire from the cache a dataset that is already opened by Thread A, that
acquisition will fail and we'll create a brand new NetcdfFile object for
Thread B to use instead. So the cache now contains 2 copies of the same
dataset.
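To make the copy-counting concrete, here is a toy model of the rule just described. It is not the real ucar FileCache, just a sketch: each cached copy is either free or in use; an acquire reuses a free copy if one exists, and otherwise a brand-new copy is created.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (NOT the real ucar FileCache): one copy per concurrent user.
class ToyDatasetCache {
    private final List<Boolean> inUse = new ArrayList<>(); // one flag per cached copy

    // Hand out a free copy if one exists; otherwise create a new one.
    // Returns the index of the copy given to the caller.
    int acquire() {
        for (int i = 0; i < inUse.size(); i++) {
            if (!inUse.get(i)) {
                inUse.set(i, true); // reuse a copy no thread is using
                return i;
            }
        }
        inUse.add(true); // all copies busy: the cache grows by one
        return inUse.size() - 1;
    }

    void release(int i) {
        inUse.set(i, false);
    }

    int copies() {
        return inUse.size();
    }
}
```

Ten threads acquiring before any release drives copies() to 10; once they release, later acquires reuse existing copies instead of growing the list.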

Taking this to its logical conclusion: if 10 threads are accessing the SAME
DATASET at the SAME TIME, then it is indeed possible for 10 copies of a
NetcdfDataset to be in the cache. This hasn't been a problem for us on our
production THREDDS server, but maybe you're expecting much more traffic
than we get. Also note that unnecessary cache elements are periodically
cleaned in the background, and you can configure how that is done. [2]
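For reference, that cleanup behavior is set when the cache is initialized. A minimal sketch, where the parameter meanings are my gloss on the 4.6.x initNetcdfFileCache signature:

```java
import ucar.nc2.dataset.NetcdfDataset;

public class CacheSetup {
    public static void main(String[] args) {
        // minElementsInMemory = 100: cleanup won't shrink the cache below this
        // maxElementsInMemory = 200: cleanup starts evicting above this
        // period = 900: seconds between background cleanup passes
        NetcdfDataset.initNetcdfFileCache(100, 200, 900);
    }
}
```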

I hope that clears things up for you.
Christian

[1]
http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/reference/RuntimeLoading.html#XML
[2]
https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/reference/Caching.html,
see "Object Caching" section

On Tue, Jul 11, 2017 at 12:59 PM, James Gardner <james.gardner@xxxxxxxx>
wrote:

>
> Hi Christian,
>
> 1) Regarding variable max cache size, thank you for the (speedy)
> accommodation; it would likely solve my problem. However I can't help
> wonder if this is best for the library. Use of variable (non-final) statics
> effectively means employing globals. That's going to make it much more
> difficult to reason about their state at a given moment. Before they were
> hardcoded and final, so their value could be relied upon. Now it will be
> subject to change by any thread at any time, and new values will affect all
> default-using Variable instances everywhere in the app, regardless of which
> file they are associated with.
> I imagine it would be useful to be able to change max size on a
> per-file basis. Is there a point in the code where Variable objects are
> being attached to the NetcdfFile object, where the limit could be set, to a
> value previously specified on that NetcdfFile object? That way one would
> have the opportunity to set a value appropriate to a particular netcdf
> file, without the above perils. The static default value could then remain
> constant, as-is.
> Let me know what you think.
>
> 2) Regarding each NetcdfDataset having its own set of cached Variable
> data: it seems a consequence of this would be that in a multi-threaded
> environment like a web server (as in our use-case), where many
> NetcdfDataset objects would (by necessity) be 'checked out' (in use by
> various threads) at any given time, there would then, after a period of use
> during which the caches would populate, end up being multiple copies of
> each cached Variable in memory.
> For example, at the moment we are putting each of our netcdf files
> entirely in memory, and as you might imagine that uses up an immense amount
> of memory and of course is the most simplistic 'caching' solution possible.
> As it turns out, we only actually use about 1/10th of the variables in each
> netcdf file in question. So a lazy caching solution, such as netcdf-java
> lib's, should be a distinct advantage, which is why I was pursuing it,
> allowing us to reduce the amount of memory required to very roughly 1/10 of
> current.
> However, since each NetcdfDataset has its own copy of the same cached
> Variable data, not only does it negate this advantage, it could actually
> require vastly more memory since the number of request-servicing threads
> would easily exceed 10.
> Can you again validate that this is actually the situation we are facing
> and that I'm not missing anything here?
>
> Many thanks!
> James
>
>
>
> On 07/10/2017 04:33 PM, Christian Ward-Garrison wrote:
>
> Hi James,
>
> > is there a central place where one can change the max size of variable
> to cache
>
> There are 3, each used for a different purpose [1][2][3]. Unfortunately,
> none of them could be modified by the user, so I pushed a commit [4] that
> fixes that. I also built a new SNAPSHOT version of NetCDF-Java that
> includes the fix [5]. If you'd like to change those values, you should grab
> that version. If you're using Maven or Gradle to manage the dependencies of
> your program, follow these instructions [6], but pull from the
> "unidata-snapshots" repository.
>
> > Can you verify whether or not this is the case?
>
> It is indeed the case. The data are cached on the Variable objects, which
> are not shared among NetcdfDatasets.
>
> Cheers,
> Christian
>
> [1] https://github.com/cwardgar/thredds/blob/
> a88db4af71bac2c29429540bc1e5387741be7d68/cdm/src/main/java/
> ucar/nc2/Variable.java#L69
> [2] https://github.com/cwardgar/thredds/blob/
> a88db4af71bac2c29429540bc1e5387741be7d68/cdm/src/main/java/
> ucar/nc2/Variable.java#L70
> [3] https://github.com/cwardgar/thredds/blob/
> a88db4af71bac2c29429540bc1e5387741be7d68/cdm/src/main/java/
> ucar/nc2/dataset/CoordinateAxis.java#L76
> [4] https://github.com/cwardgar/thredds/commit/
> a88db4af71bac2c29429540bc1e5387741be7d68
> [5] http://artifacts.unidata.ucar.edu/content/repositories/
> unidata-snapshots/edu/ucar/netcdfAll/4.6.11-SNAPSHOT/
> [6] https://www.unidata.ucar.edu/software/thredds/current/
> netcdf-java/reference/BuildDependencies.html
>
> On Thu, Jul 6, 2017 at 2:05 PM, James Gardner <james.gardner@xxxxxxxx>
> wrote:
>
>>
>> Hi,
>>
>> If I might pick up where my coworker left off; I need to know a couple of
>> additional details about how caching is implemented in the netcdf-java lib.
>>
>> First, is there a central place where one can change the max size of
>> variable to cache, for instance on the netcdfFile/Dataset object to which
>> the variable belongs (or even some master location), or must one use
>> setSizeToCache on each separate variable?
>>
>> My second question is regarding the way/location in which variable data
>> is cached.
>> If one is using netcdf file caching, it seems that all netcdfDataset
>> objects in the cache/pool which represent a given actual netcdf file, each
>> have their own cached variable data. In other words, cached variable data
>> is not shared between acquired netcdfDataset objects.
>> I believe I have seen this reflected in the output of (the cache status
>> section of) getDetailInfo() when called on various such netcdfDataset
>> objects during runtime; some seem to reflect the custom setSizeToCache
>> changes I have made, and have cached data, while others do not, indicating
>> that they have their own copies of variables, along with their own cache
>> settings/thresholds.
>> Can you verify whether or not this is the case?
>>
>> Cheers,
>> James
>>
>>
>> Hi Kevin,
>>>
>>> I've done a little research and can provide some answers to your
>>> questions.
>>>
>>> > Then, next time that NetcdfDataset.acquireDataset() is called it causes
>>> the
>>> > FileCache.acquireCacheOnly() to return null because the cached
>>> NetcdfDataset.raf
>>> > (RandomAccessFile) is null so it makes the lastModified = 0.
>>>
>>> Prior to v4.6.5, this is indeed how caching of NetcdfDataset worked. It
>>> was
>>> broken. However, the commit I referenced earlier should've fixed that.
>>>
>>> > What does NetcdfDataset.acquireDataset() actually cache?
>>>
>>> It caches the actual NetcdfDataset object, which is the result of
>>> parsing a
>>> dataset's metadata to form a hierarchical structure and then optionally
>>> "enhancing" that structure.  Typical enhancements include construction of
>>> coordinate systems. These objects are heavyweight and non-trivial to
>>> create, so only making them once is a huge performance win, especially if
>>> the dataset aggregates smaller datasets.
>>>
>>> > Can I avoid having to do a Variable.read() for every request?
>>> > Shouldn't this data be cached inside of the netcdf file.
>>>
>>> No, you can't avoid calling Variable.read(). However, if a variable is
>>> small enough its data will be cached automatically [1]. It looks like the
>>> limits are 4,000 bytes for normal variables and 40,000 bytes for
>>> coordinate
>>> variables, though you could set different limits by calling
>>> Variable.setSizeToCache(). Alternatively, you could just explicitly cache
>>> the data yourself by calling Variable.setCachedData().
>>>
>>> > Should I be using those caching options and just storing those Variable
>>> objects
>>> > in memory in my own cache instead.
>>>
>>> With the recent caching fix, you shouldn't need to hold on to the
>>> Variable
>>> objects yourself. NetcdfDatasets will be cached, including the Variables
>>> that they contain.
>>>
>>> > Would it be a better option to use NetcdfFile.openInMemory().
>>>
>>> You could try that, especially if hardware resources are no object. I'd
>>> be
>>> interested in the results. Actually, I'd be interested in any performance
>>> data you collect as you optimize your response times.
>>>
>>> Just be aware that opening a file using the static methods or
>>> constructors
>>> in NetcdfFile will mean that enhancements won't be applied to it. If you
>>> need coordinate systems to be built, or calculation of
>>> scale/offset/missing
>>> values, you need to open with NetcdfDataset.
>>>
>>> ----
>>>
>>> In your original message, you mentioned you're using
>>> NetcdfDataset.initNetcdfFileCache(), which caches NetcdfDataset objects.
>>> Another potential performance improvement may come from caching the
>>> underlying RandomAccessFiles, via setGlobalFileCache(). If a
>>> RandomAccessFile is acquired from the cache rather than recreated, this
>>> saves you from performing an open() system call, as well as potentially a
>>> seek() and fill of its buffer.
>>>
>>> Here [3] are the global caches we run in the TDS. You don't need to worry
>>> about GribCdmIndex unless you're working with GRIB files.
>>>
>>> Cheers,
>>> Christian
>>>
>>>
>>> [1]
>>> https://github.com/Unidata/thredds/blob/v4.6.6/cdm/src/main/
>>> java/ucar/nc2/Variable.java#L848
>>> [2]
>>> https://github.com/Unidata/thredds/blob/v4.6.6/cdm/src/main/
>>> java/ucar/nc2/Variable.java#L69
>>> [3]
>>> https://github.com/Unidata/thredds/blob/v4.6.6/tds/src/main/
>>> java/thredds/server/config/CdmInit.java#L263
>>>
>>> On Wed, Jun 15, 2016 at 4:31 PM, Christian Ward-Garrison
>>> <cwardgar@xxxxxxxx>
>>> wrote:
>>>
>>> > Hi Kevin,
>>> >
>>> > Sorry for the delay in responding (I was busy with the release of
>>> > 4.6.6), but I have some time to work on this issue now. A couple of
>>> > questions:
>>> >
>>> > 1. What does your webapp do? It sounds like it takes a user-defined
>>> subset
>>> > of the data in a NetCDF file and returns it in JSON format. How
>>> similar is
>>> > it to our NetCDF Subset Service (example
>>> > <http://thredds.ucar.edu/thredds/ncss/grib/NCEP/NAM/Alaska_1
>>> 1km/Best/dataset.html>
>>> > )?
>>> > 2. What version of NetCDF-Java are you using? I suspect that much of
>>> the
>>> > slowness you're encountering was already fixed
>>> > <https://github.com/cwardgar/thredds/commit/075e9a819ee10714
>>> d53b355481a7cccac88b1fb9#diff-99981060deed76f1a9ddedc4362acd7fL155>
>>> > in v4.6.5.
>>> >
>>> > Cheers,
>>> > Christian
>>> >
>>> > On Wed, Jun 8, 2016 at 4:17 PM, Kevin Off - NOAA Affiliate <
>>> > kevin.off@xxxxxxxx> wrote:
>>> >
>>> >> Hi all,
>>> >>
>>> >> I am trying to understand caching when it comes to the file and the
>>> >> actual data. The application that I am working on will provide data
>>> from
>>> >> 133 NetCDF files that range in size from 50 MB to 400 MB. These are
>>> weather
>>> >> forecast files that contain about 22 variables that we are interested
>>> in.
>>> >> Each variable has between 1 and 55 or so time steps as dimensions.
>>> >>
>>> >> This is a Spring web application running in an embedded tomcat
>>> instance.
>>> >> All of the files on disk amount to about 22GB of data.
>>> >>
>>> >> When I receive a request I:
>>> >>
>>> >>    1. Re-project the lat lon to the dataset's projection (Lambert
>>> Conformal)
>>> >>    2. Look up the index of the data from the coordinate variables
>>> >>    3. loop through every variable
>>> >>    4. Perform the Array a = var.read()
>>> >>    5. Loop through every time step and retrieve the value at the
>>> >>    specified point
>>> >>    6. Return it all in a JSON document.
>>> >>
>>> >> This application needs to be extremely fast. We will be serving
>>> thousands
>>> >> of requests per second (in production on a scaled system) depending on
>>> >> weather conditions.
>>> >>
>>> >> I have been told that hardware is not an obstacle and that I can use
>>> as
>>> >> much memory as I need.
>>> >> During my coding and debugging I have been able to achieve a response
>>> >> time of about 200ms - 400ms on average (this does not include any
>>> network
>>> >> time).
>>> >> As I add timers to every part of the application I find that most of
>>> the
>>> >> time is spent in the Variable.read() function.
>>> >>
>>> >> Here is a summary of the configuration of the app.
>>> >>
>>> >> NetcdfDataset.initNetcdfFileCache(100, 200, 0);
>>> >> NetcdfDataset nc = NetcdfDataset.acquireDataset(filename, null)
>>> >> for each coverage{
>>> >>   Variable v = nc.findVariable(name)
>>> >>   Array d = v.read()
>>> >>   for each time step {
>>> >>     value = d.read(time, y, x)
>>> >>   }
>>> >> }
>>> >> nc.close()
>>> >>
>>> >> I have several questions.
>>> >>
>>> >>    1. I noticed that when the NetcdfDataset.close() function is called
>>> >>    it detects that I am using caching and performs releases. This
>>> causes the
>>> >>    IOServiceProvider (AbstractIOServiceProvider).release() to be
>>> called which
>>> >>    closes and nulls the RandomAccessFile. Then, next time that
>>> >>    NetcdfDataset.acquireDataset() is called it causes the
>>> >>    FileCache.acquireCacheOnly() to return null because the cached
>>> >>    NetcdfDataset.raf (RandomAccessFile) is null so it makes the
>>> >>    lastModified = 0. Am I missing something or is there no way to reuse the
>>> NetcdfDataset
>>> >>    after you call close()?
>>> >>    2. What does NetcdfDataset.acquireDataset() actually cache? Is it
>>> >>    just the metadata or does it actually read in the data to all of
>>> the
>>> >>    variables?
>>> >>    3. Can I avoid having to do a Variable.read() for every request?
>>> >>    Shouldn't this data be cached inside of the netcdf file.
>>> >>    4. I see that there are caching functions on the Variable object.
>>> >>    Should I be using those caching options and just storing those
>>> Variable
>>> >>    objects in memory in my own cache instead.
>>> >>    5. Would it be a better option to use NetcdfFile.openInMemory().
>>> >>
>>> >> I know this is a bit long winded but I just want to make sure to
>>> explore
>>> >> all of my options. I have spent a lot of time stepping through the
>>> ucar
>>> >> library and have already learned a lot. I just need a little guidance
>>> >> regarding some of the more abstract caching functionality. Thanks for
>>> your
>>> >> help.
>>> >>
>>> >> --
>>> >> Kevin Off
>>> >> Internet Dissemination Group, Kansas City
>>> >> Shared Infrastructure Services Branch
>>> >> National Weather Service
>>> >> Software Engineer / Ace Info Solutions, Inc.
>>> >> <http://www.aceinfosolutions.com>
>>> >>
>>> >> _______________________________________________
>>> >> NOTE: All exchanges posted to Unidata maintained email lists are
>>> >> recorded in the Unidata inquiry tracking system and made publicly
>>> >> available through the web.  Users who post to any of the lists we
>>> >> maintain are reminded to remove any personal information that they
>>> >> do not want to be made public.
>>> >>
>>> >>
>>> >> netcdf-java mailing list
>>> >> netcdf-java@xxxxxxxxxxxxxxxx
>>> >> For list information or to unsubscribe, visit:
>>> >> http://www.unidata.ucar.edu/mailing_lists/
>>> >>
>>> >
>>> >
>>>
>>
>
>
>
>