
[netCDF #UAU-670796]: Rechunking of a huge NetCDF file



Hi Henri,

> >> The “-r” flag didn’t make a difference for a small test file, but I’ll 
> >> have to try it with a bigger one.
> >
> > I never found a case in which the "-r" flag saved time, but was hoping
> > it might work for your extreme rechunking case.
> 
> I would actually expect it to be very useful. It also seems that I don’t have 
> that flag on all platforms; maybe it depends on some flags set during compilation?

It just depends on the version of netCDF from which you built nccopy:
you need netCDF-C 4.2.1 (June 2012) or later for nccopy to support
diskless access with "-w" and "-r".
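
For example (just a sketch, with placeholder file names and a chunk
shape like the ones you have been using), you can combine the two flags
so the input is read into memory and the output is built in memory and
written to disk only when it is closed:

  nccopy -r -w -c time/10000,lon/10,lat/10 input.nc rechunked.nc

Whether that actually helps will depend on both files fitting in memory.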

> > Thanks, I downloaded the file and see the problem.  In small.nc, you
> > have lon a dimension of size 3, but you are specifying rechunking
> > along that dimension with lon/4, specifying a chunk size larger than
> > the fixed dimension size.  That's apparently something we should check
> > in nccopy.
> >
> > As a workaround, if you change this to lon/3, the nccopy completes
> > without error.  This should be an easy fix, which will be in the next
> > release.
> 
> Ok, that works. It seems then that I have misunderstood chunking. Like if 
> it’s directly dependent on the lengths of dimensions, what does it then mean 
> to chunk unevenly (like lon/2 in this case)? (No need to explain if it’s just 
> a technical detail.)

Chunking unevenly works fine, and just results in some remainder in the 
last chunk that has no data in it (yet).
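
For example, since lon has size 3 in small.nc, something like

  nccopy -c time/99351,lat/1,lon/2 small.nc uneven.nc

(the output name is just a placeholder) stores lon in chunks of length 2:
the first chunk holds lon indices 0 and 1, and the last chunk has room
for two values but only contains the final one.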

> > Small chunks cause lots of overhead in HDF5, but I'm not sure whether
> > that's the problem.  I'll have to look at this more closely and
> > respond when I've had a chance to see what's going on.
> 
> I did make some interesting observations. I had previously overlooked the 
> “-u” flag (its documentation is somewhat confusing…?). The time coordinate 
> has been unlimited in my files. On my MacBook Air:
> 
> nccopy -w -c time/99351,lat/1,lon/1 small.nc test1.nc
>   11.59s user 0.07s system 99% cpu 11.723 total
> 
> nccopy -u small.nc small_u.nc
> 
> nccopy -w -c time/99351,lon/1,lat/1 small_u.nc test2.nc
>   0.07s user 0.04s system 84% cpu 0.127 total
> 
> That’s amazing!

It's because we use the same default chunk length of 1 as HDF5
does for unlimited dimensions.  But when you use -u, it makes 
all dimensions fixed, and then the default chunk length is larger.
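
If you want to check what chunking a netCDF-4 output actually ended up
with, "ncdump -hs" prints it as the special "_ChunkSizes" attribute, for
example:

  ncdump -hs test2.nc

so you can compare the chunking of the two copies directly.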

> However, when I ran a similar test with a bigger (11GB) subset of my actual 
> data, this time on a cluster (under SLURM), there was no difference between 
> the two files. Maybe my small.nc is simply too small to reveal actual 
> differences and everything is hidden behind overheads?

That's possible, but you also need to take cache effects into account.
Sometimes when you run a timing test, a small file is read into memory
buffers, and subsequent timings are faster because the data is just
read from memory instead of disk, and similarly for writing.  With 11GB
files, you might not see any in-memory caching, because the system disk
caches aren't large enough to hold the file, or even consecutive chunks
of a variable.
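
If you want to factor that out when timing on Linux, one option (it
needs root, so it may not be possible on your cluster) is to drop the
page cache between runs:

  sync; echo 3 > /proc/sys/vm/drop_caches

Otherwise a repeated run over the same small file may mostly be
measuring memory speed rather than disk speed.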

> Anyways, I was able to rechunk the bigger test file to 
> time/10000,lon/10,lat/10 in 5 hours, which is still quite long but doable if 
> I go variable at a time. And you were right: reading in data chunked like 
> this is definitely fast enough. Maybe I will still try with bigger lengths 
> for lon/lat to see if I can do this in less than an hour.

That's good to hear, thanks for reporting back.  I imagine if we looked
carefully at where the time is being spent, that 5 hour rechunking could 
be reduced significantly, but it might require a smarter nccopy.
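
If you do go variable at a time and your nccopy has the "-V" option, it
will copy just the named variables (along with the dimensions and
attributes they need), for example (the variable name here is only a
placeholder):

  nccopy -V tas -w -c time/10000,lon/10,lat/10 big.nc big_tas.nc

giving you one rechunked file per variable.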

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: UAU-670796
Department: Support netCDF
Priority: High
Status: Closed