
[netCDF #TSI-527912]: nccopy advice - rechunking very large files



Dan,

> On the subject of compression:
> The compression has finished for 3 different rates -d 0/5/9,
> and here are the results:
> 
> 15 269 652 694 Aug 12 00:34 narr-TMP-850mb_221_yyyymmdd_hh00_000.grb
>  8 522 977 797 Sep 30 15:31 narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2
> 10 143 354 510 Sep 30 15:50 narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4 (x/y -d4)
> 38 970 737 399 Oct 9  11:07 narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4.ts.d0
> 18 693 041 241 Oct 9  14:51 narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4.ts.d5
> 18 548 053 490 Oct 9  15:06 narr-TMP-850mb_221_yyyymmdd_hh00_000.grb.grb2.nc4.ts.d9
> 
> Curious that chunking to time series creates more variability in the data,
> probably messing with the zlib algorithm and resulting in a larger file size
> overall for this chunked data.

Right, the horizontal values at a particular time are probably fairly
uniform in temperature (for example, all within the same season), so they
compress better than a time series of temperatures over all seasons,
which spans a greater range from the coldest minimum to the hottest
maximum.

Also, the chunking you specified inserts some extra missing values in
the chunks along the edges, due to "overhang".  This happens because
the 6x8 horizontal chunk dimensions don't divide evenly into the
277x349 horizontal slabs, but all the output chunks must be the same
size, so those edge chunks padded with missing values "pollute" the
compression.  The overhang adds more than 1 GB of extra data that has
to be compressed:

  2068 * 18840576 - 98128*386692 = 1016998592 bytes
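
To spell out where those factors come from (assuming 4-byte values and
a time dimension of length 98128, which is what the numbers above
imply; I'm inferring that rather than reading it from your file), here
is the same calculation as a small Python sketch:

  # Overhang: padded chunk storage minus the actual data size.
  # Assumes 4-byte values and a time dimension of 98128 (inferred).
  import math

  ny, nx = 277, 349          # horizontal grid dimensions
  cy, cx = 6, 8              # horizontal chunk dimensions
  nt = 98128                 # assumed length of the time dimension
  value_size = 4             # bytes per value

  nchunks = math.ceil(ny / cy) * math.ceil(nx / cx)  # 47 * 44 = 2068
  chunk_bytes = cy * cx * nt * value_size            # 18840576
  data_bytes = ny * nx * nt * value_size             # 98128 * 386692
  print(nchunks * chunk_bytes - data_bytes)          # 1016998592

Horizontal chunk sizes that fit the 277x349 grid more evenly would
reduce the amount of padding that gets compressed.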

> At the 2011 Unidata workshop on netCDF, the old GRIB vs. netCDF discussion
> occurred and I recall hearing (but not from whom) that the
> same JPEG2000 (jasper) compression used in GRIB2 was to be implemented
> with netCDF, and also some wavelet compression that was supposedly superior
> to JPEG in some cases... any news on this?

Yes, Sengcom, the RAL project to implement such a compression
algorithm, could not use the patented JPEG2000 algorithm due to
licensing issues.

  ... JPEG2000 used patented technology for the 2-dimensional wavelets
  (EZW or SPIHT), which RAL could not use without implementing the
  entire JPEG2000 spec.

Also, to reduce implementation complexity, Sengcom tried only
1-dimensional wavelets and provided strict control over the maximum
absolute error, rather than mean error or other less stringent error
measures that might allow better compression.

Tests on real data with a supposedly superior wavelet algorithm and 25
other wavelet types, all 1-dimensional, showed that JPEG2000 was
superior in all but a small percentage of cases, and often provided 2
or 3 times better compression.  If you're interested, the final report
on that project should be available later this week.

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: TSI-527912
Department: Support netCDF
Priority: Normal
Status: Closed