Re: [netcdf-java] NetCDF File and Variable Data Caching

  • To: Kevin Off - NOAA Affiliate <kevin.off@xxxxxxxx>
  • Subject: Re: [netcdf-java] NetCDF File and Variable Data Caching
  • From: Christian Ward-Garrison <cwardgar@xxxxxxxx>
  • Date: Thu, 16 Jun 2016 14:35:35 -0600
Hi Kevin,

I've done a little research and can provide some answers to your questions.

> Then, next time that NetcdfDataset.acquireDataset() is called it causes the
> FileCache.acquireCacheOnly() to return null because the cached
> NetcdfDataset.raf (RandomAccessFile) is null so it makes the lastModified = 0.

Prior to v4.6.5, this is indeed how caching of NetcdfDataset worked. It was
broken. However, the commit I referenced earlier should've fixed that.

> What does NetcdfDataset.acquireDataset() actually cache?

It caches the actual NetcdfDataset object, which is the result of parsing a
dataset's metadata to form a hierarchical structure and then optionally
"enhancing" that structure.  Typical enhancements include construction of
coordinate systems. These objects are heavyweight and non-trivial to
create, so only making them once is a huge performance win, especially if
the dataset aggregates smaller datasets.
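As a sketch of the acquire/close cycle with the fixed cache (the file path is hypothetical, and this assumes v4.6.5 or later):

```java
import ucar.nc2.dataset.NetcdfDataset;

public class AcquireSketch {
  public static void main(String[] args) throws Exception {
    // Enable the NetcdfDataset cache: keep at least 100 and at most 200
    // entries, with no periodic cleanup thread.
    NetcdfDataset.initNetcdfFileCache(100, 200, 0);

    String path = "/data/forecast.nc";  // hypothetical file

    // First acquire: parses metadata, builds coordinate systems (expensive).
    NetcdfDataset ds = NetcdfDataset.acquireDataset(path, null);
    ds.close();  // with the cache enabled, this releases back to the cache

    // Second acquire: should hand back the cached, already-enhanced object.
    ds = NetcdfDataset.acquireDataset(path, null);
    ds.close();
  }
}
```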

> Can I avoid having to do a Variable.read() for every request?
> Shouldn't this data be cached inside of the netcdf file.

No, you can't avoid calling Variable.read(). However, if a variable is
small enough its data will be cached automatically [1]. It looks like the
limits are 4,000 bytes for normal variables and 40,000 bytes for coordinate
variables, though you could set different limits by calling
Variable.setSizeToCache(). Alternatively, you could just explicitly cache
the data yourself by calling Variable.setCachedData().
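To make those limits concrete, here is a minimal pure-Java sketch of the size check; the constant names and helper are mine, not the library's (the real logic lives in Variable [1]):

```java
public class CacheSizeCheck {
  // Default thresholds, per the discussion above (bytes).
  static final long DEFAULT_SIZE_TO_CACHE = 4_000;   // normal variables
  static final long COORD_SIZE_TO_CACHE = 40_000;    // coordinate variables

  // Hypothetical helper: would this variable's data be cached automatically?
  static boolean willCache(long nElems, int elemSizeBytes, boolean isCoordinate) {
    long bytes = nElems * elemSizeBytes;
    return bytes <= (isCoordinate ? COORD_SIZE_TO_CACHE : DEFAULT_SIZE_TO_CACHE);
  }

  public static void main(String[] args) {
    // A 55-step double coordinate variable: 55 * 8 = 440 bytes -> cached.
    System.out.println(willCache(55, 8, true));          // true
    // A 500x500 float field: 250,000 * 4 = 1,000,000 bytes -> not cached.
    System.out.println(willCache(500 * 500, 4, false));  // false
  }
}
```

So your coordinate variables will almost certainly be cached automatically, but the big data variables won't be unless you raise the limit or call Variable.setCachedData() yourself.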

> Should I be using those caching options and just storing those Variable
> objects in memory in my own cache instead.

With the recent caching fix, you shouldn't need to hold on to the Variable
objects yourself. NetcdfDatasets will be cached, including the Variables
that they contain.

> Would it be a better option to use NetcdfFile.openInMemory().

You could try that, especially if hardware resources are no object. I'd be
interested in the results. Actually, I'd be interested in any performance
data you collect as you optimize your response times.

Just be aware that opening a file using the static methods or constructors
in NetcdfFile will mean that no enhancements are applied to it. If you need
coordinate systems to be built, or scale/offset/missing values applied, you
need to open with NetcdfDataset.
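For example (file path hypothetical), the two open styles differ like this:

```java
import ucar.nc2.NetcdfFile;
import ucar.nc2.dataset.NetcdfDataset;

public class OpenStyles {
  public static void main(String[] args) throws Exception {
    // Raw access: no enhancement, so no coordinate systems and
    // packed values come back un-scaled.
    NetcdfFile raw = NetcdfFile.open("/data/forecast.nc");
    raw.close();

    // Enhanced access: coordinate systems built, scale/offset/missing
    // values applied, and the result participates in the dataset cache.
    NetcdfDataset enhanced = NetcdfDataset.acquireDataset("/data/forecast.nc", null);
    enhanced.close();
  }
}
```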

----

In your original message, you mentioned you're using
NetcdfDataset.initNetcdfFileCache(), which caches NetcdfDataset objects.
Another potential performance improvement may come from caching the
underlying RandomAccessFiles, via setGlobalFileCache(). If a
RandomAccessFile is acquired from the cache rather than recreated, this
saves you from performing an open() system call, as well as potentially a
seek() and fill of its buffer.
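A sketch of what enabling that might look like; the cache sizes here are made up, and the FileCache constructor arguments are from memory, so check [3] for the values the TDS actually uses:

```java
import ucar.nc2.util.cache.FileCache;
import ucar.unidata.io.RandomAccessFile;

public class RafCacheInit {
  public static void main(String[] args) {
    // Cache open RandomAccessFiles globally: min 200 / max 400 entries,
    // no hard limit, scour every 600 seconds. Sizes are illustrative only.
    RandomAccessFile.setGlobalFileCache(
        new FileCache("RandomAccessFile", 200, 400, -1, 600));
  }
}
```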

Here [3] are the global caches we run in the TDS. You don't need to worry
about GribCdmIndex unless you're working with GRIB files.

Cheers,
Christian


[1]
https://github.com/Unidata/thredds/blob/v4.6.6/cdm/src/main/java/ucar/nc2/Variable.java#L848
[2]
https://github.com/Unidata/thredds/blob/v4.6.6/cdm/src/main/java/ucar/nc2/Variable.java#L69
[3]
https://github.com/Unidata/thredds/blob/v4.6.6/tds/src/main/java/thredds/server/config/CdmInit.java#L263

On Wed, Jun 15, 2016 at 4:31 PM, Christian Ward-Garrison <cwardgar@xxxxxxxx>
wrote:

> Hi Kevin,
>
> Sorry for the delay in responding (I was busy with the release of 4.6.6), but
> I have some time to work on this issue now. A couple questions:
>
> 1. What does your webapp do? It sounds like it takes a user-defined subset
> of the data in a NetCDF file and returns it in JSON format. How similar is
> it to our NetCDF Subset Service (example
> <http://thredds.ucar.edu/thredds/ncss/grib/NCEP/NAM/Alaska_11km/Best/dataset.html>
> )?
> 2. What version of NetCDF-Java are you using? I suspect that much of the
> slowness you're encountering was already fixed
> <https://github.com/cwardgar/thredds/commit/075e9a819ee10714d53b355481a7cccac88b1fb9#diff-99981060deed76f1a9ddedc4362acd7fL155>
> in v4.6.5.
>
> Cheers,
> Christian
>
> On Wed, Jun 8, 2016 at 4:17 PM, Kevin Off - NOAA Affiliate <
> kevin.off@xxxxxxxx> wrote:
>
>> Hi all,
>>
>> I am trying to understand caching when it comes to the file and the
>> actual data. The application that I am working on will provide data from
>> 133 NetCDF files that range in size from 50 MB to 400 MB. These are weather
>> forecast files that contain about 22 variables that we are interested in.
>> Each variable has between 1 and 55 or so time steps as dimensions.
>>
>> This is a Spring web application running in an embedded tomcat instance.
>> All of the files on disk amount to about 22GB of data.
>>
>> When I receive a request I:
>>
>>    1. Re-project the lat/lon to the dataset's projection (Lambert
>>    Conformal)
>>    2. Look up the index of the data from the coordinate variables
>>    3. Loop through every variable
>>    4. Perform the Array a = var.read()
>>    5. Loop through every time step and retrieve the value at the
>>    specified point
>>    6. Return it all in a JSON document.
>>
>> This application needs to be extremely fast. We will be serving thousands
>> of requests per second (in production on a scaled system) depending on
>> weather conditions.
>>
>> I have been told that hardware is not an obstacle and that I can use as
>> much memory as I need.
>> During my coding and debugging I have been able to achieve a response
>> time of about 200ms - 400ms on average (this does not include any network
>> time).
>> As I add timers to every part of the application I find that most of the
>> time is spent in the Variable.read() function.
>>
>> Here is a summary of the configuration of the app.
>>
>> NetcdfDataset.initNetcdfFileCache(100, 200, 0);
>> NetcdfDataset nc = NetcdfDataset.acquireDataset(filename, null);
>> for each coverage {
>>   Variable v = nc.findVariable(name);
>>   Array d = v.read();
>>   for each time step {
>>     value = d.getDouble(d.getIndex().set(time, y, x));
>>   }
>> }
>> nc.close();
>>
>> I have several questions.
>>
>>    1. I noticed that when the NetcdfDataset.close() function is called
>>    it detects that I am using caching and performs releases. This causes the
>>    IOServiceProvider (AbstractIOServiceProvider).release() to be called which
>>    closes and nulls the RandomAccessFile. Then, next time that
>>    NetcdfDataset.acquireDataset() is called it causes the
>>    FileCache.acquireCacheOnly() to return null because the cached
>>    NetcdfDataset.raf (RandomAccessFile) is null so it makes the
>>    lastModified = 0. Am I missing something or is there no way to reuse
>>    the NetcdfDataset after you call close()?
>>    2. What does NetcdfDataset.acquireDataset() actually cache? Is it
>>    just the metadata or does it actually read in the data to all of the
>>    variables?
>>    3. Can I avoid having to do a Variable.read() for every request?
>>    Shouldn't this data be cached inside of the netcdf file.
>>    4. I see that there are caching functions on the Variable object.
>>    Should I be using those caching options and just storing those Variable
>>    objects in memory in my own cache instead.
>>    5. Would it be a better option to use NetcdfFile.openInMemory().
>>
>> I know this is a bit long winded but I just want to make sure to explore
>> all of my options. I have spent a lot of time stepping through the ucar
>> library and have already learned a lot. I just need a little guidance
>> regarding some of the more abstract caching functionality. Thanks for your
>> help.
>>
>> --
>> Kevin Off
>> Internet Dissemination Group, Kansas City
>> Shared Infrastructure Services Branch
>> National Weather Service
>> Software Engineer / Ace Info Solutions, Inc.
>> <http://www.aceinfosolutions.com>
>>
>> _______________________________________________
>> NOTE: All exchanges posted to Unidata maintained email lists are
>> recorded in the Unidata inquiry tracking system and made publicly
>> available through the web.  Users who post to any of the lists we
>> maintain are reminded to remove any personal information that they
>> do not want to be made public.
>>
>>
>> netcdf-java mailing list
>> netcdf-java@xxxxxxxxxxxxxxxx
>> For list information or to unsubscribe, visit:
>> http://www.unidata.ucar.edu/mailing_lists/
>>
>
>