In the last post in this series, we saw that NCEP GRIB-2 files stored with JPEG-2000 wavelet compression, when converted to netCDF-4 files with deflate compression, would be more than twice the size of the original GRIB-2 files. Deflate compression (aka ZIP) comes standard with the NetCDF-4/HDF5 libraries. However, other compression schemes can be "plugged in" by third parties, and of course Unidata could choose to bundle another compression filter. So there is good reason to investigate other compression schemes that could be used to convert GRIB to netCDF-4.
To proceed, I developed a testing framework within the NetCDF-Java library. All of the results presented here were obtained in Java, using open-source compression libraries. In production, the actual compression would of course be done by C code linked into the NetCDF-4/HDF5 C library. However, I expect the resulting file sizes to be roughly the same, and the timings to be at least relatively valid (i.e., valid for comparing two algorithms with each other). But until we actually have it working on the C side, all of these results must be considered preliminary.
The methodology is as follows. For each GRIB record, extract the data as a single-precision floating-point array. Pass that array to each compressor and record the compressed size and the time it takes to compress. Then take the compressed array, decompress it, and record the time it takes to decompress. The timings include no I/O; they are just the wall-clock time to compress or decompress. The tests were run serially on a 64-bit Windows 7 machine with an Intel Xeon E5640 at 2.67 GHz. No attempt was made at optimization.
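For illustration only (this is not the actual NetCDF-Java test framework), the per-record measurement can be sketched as follows, where `Compressor` is a hypothetical interface with one implementation per library:

```java
// Hypothetical interface wrapping each library (deflate, bzip2, LZMA); not actual NetCDF-Java code.
interface Compressor {
  byte[] compress(float[] data) throws Exception;
  float[] decompress(byte[] compressed, int nValues) throws Exception;
}

class RecordTimer {
  // Time one record with one compressor; no I/O inside the timed sections.
  static void timeOneRecord(Compressor c, float[] data) throws Exception {
    long t0 = System.nanoTime();
    byte[] compressed = c.compress(data);
    long compressNanos = System.nanoTime() - t0;

    long t1 = System.nanoTime();
    c.decompress(compressed, data.length);   // round trip; result not used here
    long decompressNanos = System.nanoTime() - t1;

    System.out.printf("size=%d bytes, compress=%.1f ms, decompress=%.1f ms%n",
        compressed.length, compressNanos / 1e6, decompressNanos / 1e6);
  }
}
```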
I used these compression libraries:
- deflate: java.util.zip.Deflater is part of the standard JRE. As previously discussed, I am using level 3 (a usage sketch follows this list).
- bzip2 (Itadaki): org.itadaki.bzip2.BZip2OutputStream is an independent implementation with better compression times than Apache Commons Compress.
- LZMA (7zip): the 7-Zip package provides a Java implementation of the LZMA algorithm.

There may be other good algorithms, but these were the ones with well-known implementations and widespread use.
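For concreteness, here is a minimal sketch of the deflate case (my own illustration, not the actual test code): convert the float array to bytes, compress with java.util.zip.Deflater at level 3, then inflate it back.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateExample {

  // Compress a float array with deflate at level 3.
  static byte[] compress(float[] data) {
    ByteBuffer bb = ByteBuffer.allocate(4 * data.length);
    bb.asFloatBuffer().put(data);                 // float[] -> raw bytes

    Deflater deflater = new Deflater(3);          // compression level 3
    deflater.setInput(bb.array());
    deflater.finish();

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    while (!deflater.finished()) {
      int n = deflater.deflate(buf);
      out.write(buf, 0, n);
    }
    deflater.end();
    return out.toByteArray();
  }

  // Decompress back to floats; nValues is the original array length.
  static float[] decompress(byte[] compressed, int nValues) throws Exception {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);

    ByteBuffer bb = ByteBuffer.allocate(4 * nValues);
    byte[] buf = new byte[8192];
    while (!inflater.finished()) {
      int n = inflater.inflate(buf);
      bb.put(buf, 0, n);
    }
    inflater.end();

    bb.flip();
    float[] result = new float[nValues];
    bb.asFloatBuffer().get(result);
    return result;
  }
}
```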
Here are the results for 38 NCEP model files using wavelet compression:
There is a strict ordering of compressed sizes for all files: bzip2 < LZMA < deflate. The average ratio of compressed size to the original GRIB-2 size is 2.10 for deflate, 1.14 for bzip2, and 1.31 for LZMA.
It occurred to me that we are giving GRIB-2 an unfair advantage. GRIB-2 works by first reducing the floating-point values to n-bit integers using a scale and offset, and then compressing that integer array. In the tests above, however, all the other algorithms worked on the floating-point arrays, i.e., after the scale and offset had been applied to the integer data. What would happen if the other compression algorithms were given the same integer arrays that the GRIB-2 wavelet compression is given? It's easy enough to extract the integer arrays from the JPEG-2000 decompression, so I ran all the tests on the integer arrays before they were converted to floating point.
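For readers unfamiliar with scale/offset packing, here is a simplified sketch of the idea (a hypothetical helper, not the actual GRIB-2 encoding, which uses a reference value plus binary and decimal scale factors): each float is quantized to an n-bit integer, and the reader reverses the mapping.

```java
class ScaleOffset {

  // Simplified scale/offset packing: quantize floats to n-bit integers.
  // The reader also needs the offset (min) and scale to unpack.
  static int[] pack(float[] data, int nbits) {
    float min = Float.MAX_VALUE, max = -Float.MAX_VALUE;
    for (float v : data) {
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    double scale = (max - min) / ((1L << nbits) - 1);
    if (scale == 0.0) scale = 1.0;                 // constant field: avoid divide by zero

    int[] packed = new int[data.length];
    for (int i = 0; i < data.length; i++)
      packed[i] = (int) Math.round((data[i] - min) / scale);   // 0 .. 2^nbits - 1
    return packed;
  }

  // Unpacking reverses the mapping: value ≈ offset + packed * scale.
  static float unpack(int packed, double offset, double scale) {
    return (float) (offset + packed * scale);
  }
}
```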
The other technique is bit shaving, as detailed in a previous blog post. Bit shaving bounds the relative precision instead of the absolute precision. I ran the tests on shaved floats as well.
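A minimal sketch of the bit-shaving idea (my own illustration, not necessarily the exact routine used in these tests): zero out the low-order mantissa bits of each IEEE-754 float, which bounds the relative error and leaves runs of identical bits for the general-purpose compressors to exploit.

```java
class BitShaver {
  // Zero out the low-order mantissa bits of an IEEE-754 single-precision float.
  // Keeping k mantissa bits bounds the relative error at roughly one part in 2^k.
  static float shave(float value, int mantissaBitsToKeep) {
    int keep = Math.max(0, Math.min(23, mantissaBitsToKeep)); // a float has 23 mantissa bits
    int mask = 0xFFFFFFFF << (23 - keep);                     // keep sign, exponent, top mantissa bits
    return Float.intBitsToFloat(Float.floatToRawIntBits(value) & mask);
  }
}
```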
Here are the results on the 38 NCEP files, a total of 385K records in 17 GB of GRIB-2 files, for the extracted float arrays, the bit-shaved float arrays, and the integer arrays before the scale/offset was applied:
Here are the actual file sizes:
| size (GB) | deflate | bzip2 | lzma | grib | floats (uncompressed) |
|---|---|---|---|---|---|
| compressed floats | 35.90 | 19.50 | 22.40 | | 135.00 |
| floatShaved | 34.38 | 18.18 | 20.64 | | |
| scale/offset ints | 33.98 | 18.32 | 19.05 | 17.12 | |
And the ratio of the file sizes to GRIB-2:
| size/grib | deflate | bzip2 | lzma | grib |
|---|---|---|---|---|
| floats | 2.10 | 1.14 | 1.31 | |
| floatShaved | 2.01 | 1.06 | 1.21 | |
| scale/offset ints | 1.98 | 1.07 | 1.11 | 1.00 |
Finally, we can look at the time to compress and uncompress these records. The most important is the uncompress time, assuming the files will be written once and read many times:
And the actual numbers:
| | size (GB) | uncompress | compress |
|---|---|---|---|
| deflate floats | 35.90 | 2.28 | 14.71 |
| deflate floatShaved | 34.38 | 1.98 | 13.59 |
| deflate ints | 33.98 | 1.89 | 11.96 |
| bzip2 floats | 19.50 | 17.80 | 55.84 |
| bzip2 floatShaved | 18.18 | 15.20 | 48.86 |
| bzip2 ints | 18.32 | 14.17 | 43.09 |
| lzma floats | 22.40 | 14.50 | 473.19 |
| lzma floatShaved | 20.64 | 12.31 | 454.08 |
| lzma ints | 19.05 | 12.94 | 482.02 |
| grib | 17.12 | 23.53 | |
This plot shows the tradeoff between uncompress time and file size. Deflate gives larger files but is fast, and bzip2 is the smaller but slower option. LZMA was deliberately designed to be asymmetric: very slow compression but fast uncompression. As we see, it does beat bzip2 uncompression by a bit, but with bigger file sizes. GRIB-2 remains the best for file size, with somewhat larger uncompress times. I do not yet have an estimate of GRIB-2 compress time.
In conclusion, from these tests in Java it appears that bzip2 compression can get us to within 15% of the size of GRIB2 wavelet compression. Another 7-8% can be gained by using shaved floats or scale/offset integer arrays. Adding bzip2 as a standard filter may be a good "slower but smaller" alternative to deflate. Other compression options may give different tradeoffs.
Next up: let's give bzip2 a try in the netCDF-C library.
I corrected the last plot with better timings on scale/offset ints. As I suspected, the previous numbers were skewed by being run on a slower machine. The numbers are now correct to the best of my knowledge.
Posted by John Caron on September 14, 2014 at 01:21 PM MDT