Oops -- that should have been a "reply all".
Minor rant: all mailing lists should be set to reply to the list by default!
(yes, I know there are arguments otherwise -- carry on)
---------- Forwarded message ----------
From: Chris Barker - NOAA Federal <chris.barker@xxxxxxxx>
Date: Thu, Oct 23, 2014 at 8:40 AM
Subject: Re: [netcdfgroup] nccopy should use 1 as default-chunksize for
unlimited dimension
To: Ed Hartnett <edwardjameshartnett@xxxxxxxxx>
On Oct 23, 2014, at 7:23 AM, Ed Hartnett <edwardjameshartnett@xxxxxxxxx>
wrote:
This gives very poor performance when the number of timesteps in the file
is large.
Well, that's the trick with chunking -- appropriate chunk sizes depend on the
shape/size of the array, hardware specs, and access patterns.
The code that determines defaults can only know about the array shape. But
we need to make sure it at least accounts for that.
Using a small chunk on the time dimension is fine IF the other dimensions
are large, AND you want most of a chunk's worth of data at each access.
A common use case is a 3- or 4-d array (say T x X x Y) where the user needs
to access all of X and Y, one time step at a time. In this case, a chunk
size of one in the T dimension makes sense, regardless of how large the T
dimension is.
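For concreteness, here's a minimal sketch of setting up that layout through
the netCDF C API; the file name, dimension sizes, and variable name are
placeholders I made up, and error checking is omitted for brevity:

/* Chunk size 1 along the unlimited time dimension, full X/Y slabs. */
#include <netcdf.h>

#define NY 1024
#define NX 1024

int main(void) {
    int ncid, dimids[3], varid;
    size_t chunks[3] = {1, NY, NX};  /* one time step per chunk */

    nc_create("chunked.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
    nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
    nc_def_dim(ncid, "Y", NY, &dimids[1]);
    nc_def_dim(ncid, "X", NX, &dimids[2]);
    nc_def_var(ncid, "var", NC_FLOAT, 3, dimids, &varid);
    nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
    nc_close(ncid);
    return 0;
}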
In my (pretty limited) experimentation, I found that performance is not
very sensitive to chunk sizes within "reasonable" bounds. Very small chunks
(on the order of 10 bytes) give horrible performance in both file size and
speed, and really large chunks (maybe > 10s of MB) can also give bad
performance, depending a bit on access patterns.
The goals are to a) not have very small chunks, and b) have most of a chunk
used for each access. But without knowing the access patterns, there is no
way to optimize for (b) in defaults.
For example, for a t,x,y array, a 1x1024x1024 chunk size would work well if
the user typically wanted the entire x,y domain for each time step. But if
they wanted the entire time series at one point, it would be pretty bad
(accessing 1 MB in order to get a single value).
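To put rough numbers on that, here's a back-of-the-envelope sketch; the
4-byte floats (so ~4 MB per chunk rather than 1 MB) and the 1000 time steps
are assumptions, not from any real file:

/* Cost of a 1 x 1024 x 1024 chunk layout under two access patterns. */
#include <stdio.h>

int main(void) {
    const long long nt = 1000;                      /* assumed time steps  */
    const long long chunk_bytes = 1024 * 1024 * 4;  /* 4-byte float chunk  */

    /* Whole x,y slab at one time step: one chunk read, all of it used. */
    printf("slab per step: read %lld bytes, use %lld bytes\n",
           chunk_bytes, chunk_bytes);

    /* Time series at a single point: every chunk touched, 4 bytes used. */
    printf("point series:  read %lld bytes, use %lld bytes\n",
           nt * chunk_bytes, nt * 4);
    return 0;
}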
It seems reasonable to me to assume that the unlimited dimension is the one
least likely to be accessed all at once, and thus should get the smallest
chunk size. But another approach would be to make no assumptions about
access patterns, and create "square" chunks by default. This would yield
equally good (and bad) performance for any access pattern.
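Here's a sketch of how such "square" defaults might be computed -- the 4 MiB
target and the cube-root split are my assumptions, not what netCDF actually
does:

/* Pick a roughly cubic chunk shape for a 3-d float variable, aiming at a
   fixed byte budget per chunk. Compile with -lm for cbrt(). */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double target_bytes = 4.0 * 1024 * 1024;  /* ~4 MiB per chunk */
    const double value_bytes = 4.0;                  /* 4-byte floats    */

    /* Split the value budget evenly across the three dimensions. */
    size_t edge = (size_t)(cbrt(target_bytes / value_bytes) + 0.5);
    printf("chunk shape: %zu x %zu x %zu\n", edge, edge, edge);
    return 0;
}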
In either case, default chunks should never be tiny or huge.
-Chris
On Thu, Oct 23, 2014 at 5:25 AM, Heiko Klein <Heiko.Klein@xxxxxx> wrote:
> Hi,
>
> when chunking files with an unlimited dimension, the chunk size of the
> unlimited dimension must be given explicitly in nccopy, and will usually
> be set to one.
>
> In a file with time as unlimited dimension, and X and Y as dimensions, it
> is currently required to use
>
> $ nccopy -k 4 -c "time/1,X/100,Y/100" in.nc out.nc
>
>
> When run without the time chunk size, it does not work:
>
> $ nccopy -k 4 -c "X/100,Y/100" in.nc out.nc
> NetCDF: Invalid argument
> Location: file nccopy.c; line 637
>
>
> Only for unlimited dimensions does one need to give the chunk size
> explicitly; for all other dimensions a useful default (the full dimension
> size) is used. I think a useful default for unlimited dimensions is 1.
>
>
> Heiko
>
>
>
> --
> Dr. Heiko Klein Tel. + 47 22 96 32 58
> Development Section / IT Department Fax. + 47 22 69 63 55
> Norwegian Meteorological Institute http://www.met.no
> P.O. Box 43 Blindern 0313 Oslo NORWAY
>
_______________________________________________
netcdfgroup mailing list
netcdfgroup@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit:
http://www.unidata.ucar.edu/mailing_lists/
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@xxxxxxxx