
[netCDF #IDT-559068]: Efficiency of reading HDF with netcdf 4



Hi Benno,

> I have a question for you, something inadequately tested, I am afraid.
> 
> I am reading MODIS tiles from the original HDF files using netcdf 4.1.2
> 32bit. There is a square array (18x36) of tiles, of which 317 are present
> in the 1km resolution data, and somewhat fewer in the 250m resolution.
> 
> I have numbers for the 1km dataset. When I roughly profiled my data
> processing program, I got
> 
> Run             open    DOmultiple   PScolor
> profile2        21.0s   22.5s        18.6s
> profile2a        7.4s   20.7s        18.5s
> profile2ms      64.8s   20.4s        21.3s
> profile2ams      7.2s   18.2s        19.6s
> profile2cache    0       0           17.4s
> 
> The runs are in pairs, with the 2a runs immediately after the 2 runs, to
> separate out disk access time (1 variable with all the 1km tiles
> decompressed is about 1.7 GB, which is also roughly the total file size,
> compressed with 317/648 tiles). There are multiple fields in the files, and
> I am only reading one field.
> 
> the "open" routine runs netcdf open, the DOmultiple routine reads the data
> into arrays, and PScolor does the processing, irrelevant to this question.
> 
> 1) "opening the files" takes 21s (directly attached disk) and 65s (NFS
> mounted) the first time, 7s the second time, i.e. from memory.
> "DOmultiple" always takes the same amount of time, indicating that there are
> no disk accesses during the reads, only during the "opens". So it looks
> like the open hits all the disk blocks needed to read the first field at
> least, if not more. I assume DOmultiple is doing the decompression, which
> on my machine is slow.
> 
> The last line is reading the uncompressed data with tiles sequentially
> arranged in a pair of binary files -- it takes about 6s the first time, 2s
> from memory. Much faster, though obviously much more disk space.
> 
> I presume 250m will be 16x slower, though I have not run the numbers yet.
> 
> So my questions are
> 
> 1) is the open reading the whole file, or at least bytes from all the blocks
> in the file? Clearly this is a weird case for netcdf, with so many files
> and only reading a record from each.

No, but a netCDF open does read all the metadata (the file schema,
with dimension, variable, and attribute definitions and attribute
values) in a file.  The metadata is kept in memory as long as the file
is open.  This is in contrast to the HDF5 library, which reads
metadata only as required when data is accessed ("lazy" evaluation).
As a result, netCDF takes significantly longer to initially open files
that have a lot of metadata than HDF5 does, but netCDF is often faster
than HDF5 at subsequent accesses to metadata and to data that requires
associated metadata.
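
For example, here is a minimal C sketch of that behavior (the file and
variable names are just placeholders); the nc_open call is where all of
the metadata I/O happens, and the later inquiries are answered from the
in-memory schema:

    #include <stdio.h>
    #include <netcdf.h>

    int main(void) {
        int ncid, varid, ndims, status;

        /* all of the file's metadata is read and cached here */
        status = nc_open("MYD11A2.tile.hdf", NC_NOWRITE, &ncid);
        if (status != NC_NOERR) {
            fprintf(stderr, "%s\n", nc_strerror(status));
            return 1;
        }

        /* these inquiries are satisfied from memory, with no
           further disk access */
        nc_inq_varid(ncid, "LST_Day_1km", &varid);
        nc_inq_varndims(ncid, varid, &ndims);

        nc_close(ncid);
        return 0;
    }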

> 2) are there linking choices to improve the decompression performance?

Decompression happens the first time a compressed variable is read.
If the variable is stored using chunking, only the needed chunks are
decompressed.  There is a chunk cache that can prevent the same data
chunk from being decompressed multiple times when multiple reads are
made to data in the same chunk, but the default chunk cache may not be
adequate.  The size of the chunk cache can be determined and set with
calls to nc_get_var_chunk_cache() and nc_set_var_chunk_cache().  The
guidance in our documentation is currently inadequate on how to
configure the chunk cache when the defaults don't work well.  Some
experimentation may be required, since the best settings depend on how
much memory you can devote to the chunk cache.
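
To illustrate, here is a rough sketch in C of querying and enlarging the
per-variable chunk cache before reading; the file name, variable name,
and 64 MB cache size are only placeholders to experiment with:

    #include <netcdf.h>

    int main(void) {
        int ncid, varid;
        size_t size, nelems;
        float preemption;

        nc_open("MYD11A2.tile.hdf", NC_NOWRITE, &ncid);
        nc_inq_varid(ncid, "LST_Day_1km", &varid);

        /* see what the current cache settings are for this variable */
        nc_get_var_chunk_cache(ncid, varid, &size, &nelems, &preemption);

        /* try a 64 MB cache with room for 1009 chunks, keeping the
           current preemption policy; set this before the first read */
        nc_set_var_chunk_cache(ncid, varid, (size_t)64 * 1024 * 1024,
                               1009, preemption);

        /* ... nc_get_vara_short() or similar calls to read the data ... */

        nc_close(ncid);
        return 0;
    }

Note that the cache only helps when several reads touch the same chunks,
so if each tile's field is read exactly once in full, a larger cache may
not change the timings much.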

Chunk sizes and shapes were determined when a file was written, but
data can be "rechunked" to improve access using new parameters in the
nccopy utility, if copying such a large file is practical just to
improve read performance.  
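
For example, a rechunking copy looks something like this (the dimension
names and chunk sizes are placeholders; use the dimension names reported
by ncdump and chunk shapes that match your access pattern):

    nccopy -c YDim/400,XDim/400 input.hdf rechunked.nc

nccopy can also turn off compression while copying (-d 0), which trades
disk space for the decompression time you are seeing.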

> 3) is the slowness inherent in the process, or am I just doing something
> really inefficiently?

It's hard to tell from the brief description you've given.  We recently
worked with another user to greatly improve access to data stored in a
bad order for access on a server by rechunking the data to match the
most common pattern of access.  I'm planning to eventually distill that 
effort into better guidance for improving access performance by 
rechunking.

> Some 1km MODIS tiles are at
> 
> ftp://e4ft101.cr.usgs.gov/MOLA/NYD11A2.005/
> 
> if that helps.

That may help, thanks.  That at least would make it possible to see
the current sizes and shapes of the chunks, though you could also see
this by using "ncdump -h -s" on one of the MODIS tile files.

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: IDT-559068
Department: Support netCDF
Priority: Normal
Status: Closed