Hi Ed,

> Quincey et al.,
>
> Given an n-dimensional dataspace, with only one unlimited
> (i.e. extendable) dimension, tell me how to select the chunk size for
> each dimension to get good read performance for large data files.
>
> Would you care to suggest any smart algorithms to yield better
> performance for various situations?

Unfortunately there aren't generic instructions for this sort of thing;
it's very dependent on the application's I/O pattern.

A general heuristic is to pick lower and upper bounds on the size of a
chunk (in bytes) and try to make the chunks "squarish" (in n-D). One
thing to keep in mind is that the default chunk cache in HDF5 is 1 MB,
so it's probably worthwhile to keep chunks under half of that. A
reasonable lower limit is a small multiple of the block size of a disk
(usually 4 KB).

Generally, you are trying to avoid the situation below.

Dataset with 10 chunks (dimension sizes don't really matter):

+----+----+----+----+----+
|    |    |    |    |    |
|    |    |    |    |    |
| A  | B  | C  | D  | E  |
+----+----+----+----+----+
|    |    |    |    |    |
|    |    |    |    |    |
| F  | G  | H  | I  | J  |
+----+----+----+----+----+

If you are writing hyperslabs to part of each chunk like this
(hyperslab 1 is in chunk A, hyperslab 2 is in chunk B, etc.):

+----+----+----+----+----+
|1111|2222|3333|4444|5555|
|6666|7777|8888|9999|0000|
| A  | B  | C  | D  | E  |
+----+----+----+----+----+
|    |    |    |    |    |
|    |    |    |    |    |
| F  | G  | H  | I  | J  |
+----+----+----+----+----+

and the chunk cache is only large enough to hold 4 chunks, then chunk A
will be preempted from the cache to make room for chunk E (when
hyperslab 5 is written), but will immediately be re-loaded to write
hyperslab 6 out.

Unfortunately, our general-purpose software can't predict the I/O
pattern that users will access the data in, so it is a tough problem.
On the one hand, you want to keep the chunks small enough that they
will stay in the cache until they are finished being written/read; on
the other hand, you want the chunks to be larger so that the I/O on
them is more efficient. :-/

    Quincey
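[Editor's note: a minimal sketch of the heuristic described above, not
part of the original message. It assumes a 3-D double-precision dataset
with one unlimited dimension, a ~512 KB chunk budget, a 4 MB raw-data
chunk cache, and the file name "chunked.h5"; all of these values are
illustrative, and the code uses the HDF5 1.8+ C API.]

/* Pick a "squarish" chunk under a byte budget and enlarge the
 * raw-data chunk cache above the 1 MB default, so that a whole row
 * of chunks (A..E in the picture above) can stay cached at once.
 * All sizes here are illustrative assumptions, not recommendations
 * from the original post. */
#include <hdf5.h>
#include <math.h>
#include <stdio.h>

#define RANK 3

int main(void)
{
    /* Assumed dataset: one unlimited dimension, two fixed ones. */
    hsize_t dims[RANK]    = {4, 1024, 1024};
    hsize_t maxdims[RANK] = {H5S_UNLIMITED, 1024, 1024};
    size_t  elem_size     = sizeof(double);

    /* Heuristic: ~512 KB per chunk (half the default 1 MB cache),
     * a small extent (4) along the unlimited dimension, and a
     * square cross-section in the two fixed dimensions. */
    size_t  target_bytes    = 512 * 1024;
    hsize_t unlim_extent    = 4;
    double  elems_per_chunk = (double)target_bytes / elem_size;
    hsize_t side = (hsize_t)sqrt(elems_per_chunk / unlim_extent);

    hsize_t chunk[RANK] = {unlim_extent, side, side};
    if (chunk[1] > dims[1]) chunk[1] = dims[1];
    if (chunk[2] > dims[2]) chunk[2] = dims[2];

    /* Raise the raw-data chunk cache to 4 MB (521 hash slots,
     * default 0.75 preemption policy); the mdc_nelmts argument is
     * ignored by HDF5 1.8+. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_cache(fapl, 0, 521, 4 * 1024 * 1024, 0.75);

    hid_t file  = H5Fcreate("chunked.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    hid_t space = H5Screate_simple(RANK, dims, maxdims);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, RANK, chunk);

    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);

    printf("chunk = %llu x %llu x %llu (~%.0f KB)\n",
           (unsigned long long)chunk[0], (unsigned long long)chunk[1],
           (unsigned long long)chunk[2],
           (double)(chunk[0] * chunk[1] * chunk[2] * elem_size) / 1024.0);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    H5Pclose(fapl);
    return 0;
}

With a 4 MB cache and ~512 KB chunks, all five chunks of a row fit in
the cache at once, so chunk A is not evicted to make room for chunk E
in the write pattern shown above.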