
[netCDF #TSI-527912]: nccopy advice - rechunking very large files



Dan,

> Try this:
> http://nomads.ncdc.noaa.gov/data/offlinestage/custom/narr-physaggs/
> If you encounter permission problems, let me know.
> the *.grb.grb2.nc4 are the input netcdf4 files in question.
> And their *.grb & *.grb.grb2 cousins are the conversion steps.
> All data in this directory is volatile and not fit for public use, but
> should allow you to get hold of it.

Thanks, I got the 10.14 GB netCDF-4 file with the compressed TMP_850mb
variable that you were using, and got it rechunked as you wanted, supporting
fast time-series access, in under 36 minutes with nccopy.  I used the
following nccopy invocation:

  $ nccopy -ctime/98128,x/8,y/6 -e 102000 -m 40M -h 40G -d0 tmp.nc4 tmp-rechunked.nc4
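
For the record, here's my reading of what each of those options does (worth
double-checking against the nccopy man page for your version):

  # -ctime/98128,x/8,y/6  chunk shape for the output: the whole time series per 6x8 (y by x) tile
  # -e 102000             number of elements in the chunk cache
  # -m 40M                size of the copy buffer
  # -h 40G                size in bytes of the chunk cache
  # -d0                   deflate level 0, i.e. no compression applied on output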

This was on a Linux desktop machine with 80 GB of memory, but I only used
about 40 GB for nccopy.  I think the most important of the nccopy parameters
above was "-e 102000", which specifies the number of chunk cache elements.  I
got that number by adding the number of chunks in the input file (98128, the
number of time steps) to the number of chunks in the output file (2068, the
number of 6x8 tiles that fit in a 277x349 horizontal slab), and adding a
little extra just in case.
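
If it helps to see the arithmetic, here's the same calculation as a quick
shell sketch (rounding the tile counts up, since partial tiles at the grid
edges still get chunks of their own):

  # output chunks per horizontal slab: ceil(277/6) * ceil(349/8) = 47 * 44
  $ echo $(( (277 + 5) / 6 * ((349 + 7) / 8) ))
  2068
  # input chunks (one per time step) plus output chunks; "-e 102000" adds slack
  $ echo $(( 98128 + 2068 ))
  100196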

The "-h 40G" is also important, because it makes the chunk cache big enough to 
fit all the 
chunks in the output file, uncompressed, in memory at once, plus at least one 
input file chunk.
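
For a rough sense of scale (assuming the TMP_850mb values are 4-byte floats,
which the size of the uncompressed output below suggests), the output chunks
alone come to nearly 39 GB:

  # bytes per output chunk: 98128 times * 6 y * 8 x * 4 bytes per value
  $ echo $(( 98128 * 6 * 8 * 4 ))
  18840576
  # times the 2068 chunks covering the horizontal grid
  $ echo $(( 98128 * 6 * 8 * 4 * 2068 ))
  38962311168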

If you're curious, /usr/bin/time provided more information on what the
machine was doing during those 36 minutes:

  $ /usr/bin/time nccopy -ctime/98128,x/8,y/6 -e 102000 -m 40M -h 40G -d0 tmp.nc4 tmp-rechunked.nc4
  1862.11user 79.31system 35:24.38elapsed 91%CPU (0avgtext+0avgdata 38167408maxresident)k
  456inputs+76115152outputs (3major+12842807minor)pagefaults 0swaps

Using "-w" and the diskless mode for output, where the output file is kept in 
memory until close
and written out all at once, didn't provide any improvement of the above, which 
surprised me 
until I figured out what was going on.  It took took 522.23 seconds to copy the 
uncompressed
"file" from memory to disk on close.  Total times:
  $ /usr/bin/time nccopy -w -ctime/98128,x/8,y/6 -e 128000 -m 40M -h 39G -d0 
tmp.nc4 tmp-rechunked.nc4
  1830.35user 293.04system 46:20.51elapsed 76%CPU (0avgtext+0avgdata 
76223884maxresident)k
  40480inputs+76114840outputs (530major+13965570minor)pagefaults 0swaps
  $ ls -l tmp-rechunked.nc4 
  -rw-rw-r-- 1 russ ustaff 38970737448 Oct  7 12:36 tmp-rechunked.nc4

If you want to have the output data compressed as the input was, that adds to
the time for nccopy, but it's probably less time in total than doing that as a
separate step:

   2844.23user 75.61system 52:29.25elapsed 92%CPU (0avgtext+0avgdata 48861228maxresident)k
   24inputs+21389952outputs (0major+14236182minor)pagefaults 0swaps
   $ ls -l tmp-rechunked.nc4
   -rw-rw-r-- 1 russ ustaff 10951640022 Oct  7 18:55 tmp-rechunked.nc4
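
A compressing variant of the same command would look something like this;
the deflate level here is only an illustration (nccopy's -d option takes
levels 0 through 9, and adding -s to enable shuffling sometimes improves
compression of floating-point data):

  # -d1 is an illustrative deflate level, not necessarily the one used above
  $ nccopy -ctime/98128,x/8,y/6 -e 102000 -m 40M -h 40G -d1 tmp.nc4 tmp-rechunked.nc4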

In all these cases, most of the output file doesn't get written to disk until
the last 9 minutes or so, so it's hard to gauge progress if you're expecting
the output to grow linearly with the running time ...
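
If you want to double-check the chunking in the result, ncdump's -s option
prints the special virtual attributes along with the header, so something
like this should show the new chunk shape:

  $ ncdump -hs tmp-rechunked.nc4 | grep _ChunkSizes

which ought to report "98128, 6, 8" for the TMP_850mb variable.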

--Russ
 
> On Thu, Oct 4, 2012 at 5:30 PM, Unidata netCDF Support
> <address@hidden> wrote:
> > Hi Dan,
> >
> >> I could use some advice.
> >>
> >> I am trying to rechunk about 30x or so 8-30 GB netcdf4 files
> >> for the North American Regional Reanalysis physical aggregations
> >> created from a wgrib2 convert process -- for eventual use on our
> >> THREDDS server.
> >>
> >> I am using source compiled binaries from netCDF 4.2.1.1
> >>
> >> The inputs are chunked as:
> >>
> >> chunkspec (t y x)
> >> 1, 277, 349
> >>
> >> Into a new file chunked to optimize read access to time series
> >> 98128,6,8
> >>
> >> These files are 1 parameter for 1 z-level, so z is excluded here.
> >>
> >> using the command:
> >>
> >> $ /san5102/netcdf4/nccopy -m 4000000000 -h 1000000000
> >> -ctime/98128,x/8,y/6
> >> /san5102/nexus/narr-physaggs/narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4
> >> /raid/nomads/testing/data/narraggs/narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4.ts
> >>
> >> Issue is, this is unreasonably slow.   At the beginning I will get a burst 
> >> of
> >> about 350-500 KB/sec output (which is reasonable for the server hardware),
> >> then after a few minutes it falls to < 10 KB/sec
> >> ~ for a 10+ GB file, this will take more than 10 days
> >> just to rechunk one file.  Adjusting -m and -h options gives only
> >> a minor improvement, the initial write burst lasts longer, but still
> >> eventually floors to <10 KB/sec.
> >>
> >> Do you think this is the best way to optimize for a time
> >> series read access?  And what do you suggest to make
> >> the process finish in a reasonable time?    Are files of this
> >> size just too much?   The output format of the file doesn't
> >> matter to me as long as it's netcdf4 and max compression can be
> >> applied later.
> >
> > How much memory do you have that you can dedicate to nccopy when it is
> > rechunking the data?  If you have enough memory, use of the -w option
> > may speed things up significantly.  Since available memory can make a
> > big difference in how long rechunking takes, is a possible solution
> > just doing the rechunking on a different system with lots of memory,
> > e.g. 64 GB?  Memory is pretty cheap compared to programmer time these
> > days, so I'm wondering if that's a possibility ...
> >
> > Another approach that might work is using more than one pass over the
> > data by writing an intermediate file that's rechunked in a way
> > intermediate between the current input and the desired output.
> >
> > This problem is very interesting to me, and I'd like to be able to
> > test approaches to optimizing access for time series using a real data
> > file rather than some artificial test data.  Could you either make
> > available one of those input files (but not as an email attachment!
> > :-) or tell me how to get one?  Especially when dealing with questions
> > that may ultimately involve compression as well as chunking, it's
> > important to deal with real-world data.
> >
> > If that's not practical, I'd like to get the CDL from ncdump -h (or
> > -c) for the input netCDF file as well as CDL for the desired output,
> > so I know exactly what you're trying to do.
> >
> > --Russ
> >
> > Russ Rew                                         UCAR Unidata Program
> > address@hidden                      http://www.unidata.ucar.edu
> >
> >
> >
> > Ticket Details
> > ===================
> > Ticket ID: TSI-527912
> > Department: Support netCDF
> > Priority: Normal
> > Status: Closed
> >
> 
> 
> 
> --
> =======================================
> Dan Swank
> STG, Incorporated - Government Contractor
> NCDC-NOMADS Project:  Software & Data Management
> Data Access Branch
> National Climatic Data Center
> Veach-Baley Federal Building
> 151 Patton Avenue
> Asheville, NC 28801-5001
> Email: address@hidden
> Phone: 828-271-4007
> =======================================
> Any opinions expressed in this message are mine personally and do not
> necessarily reflect any position of STG Inc or NOAA.
> =======================================
> 
> 
Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: TSI-527912
Department: Support netCDF
Priority: Normal
Status: Closed