On Apr 25, 2011, at 6:06 PM, John Caron wrote:
> On 4/25/2011 2:04 PM, Peter Cornillon wrote:
>>
>> On Apr 25, 2011, at 3:51 PM, John Caron wrote:
>>
>>> On 4/25/2011 1:46 PM, Peter Cornillon wrote:
>>>>
>>>> On Apr 25, 2011, at 3:42 PM, John Caron wrote:
>>>>
>>>>> On 4/25/2011 1:37 PM, Roy Mendelssohn wrote:
>>>>>> yes, internal compression. All the files were made from netcdf3 files
>>>>>> using NCO with the options:
>>>>>>
>>>>>> ncks -4 -L 1
>>>>>>
>>>>>> The results so far show compressed file sizes ranging from 40% of the
>>>>>> original down to 1/100th of the original. If requests for internally
>>>>>> compressed data are cached differently than requests to netcdf3 files,
>>>>>> we want to take that into account when we do the tests, so that we do
>>>>>> not just see the effect of differential caching.
>>>>>>
>>>>>> When we have done tests on just local files, the reads were about 8
>>>>>> times slower from a compressed file. But Rich Signell has found that
>>>>>> the combination of serialization/bandwidth is the bottleneck, and you
>>>>>> hardly notice the difference in a remote access situation. That is what
>>>>>> we want to find out, because we run on very little money and with
>>>>>> compression as mentioned above our RAIDs would go a lot farther, as long
>>>>>> as the hit to the access time is not too great.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Roy
>>>>>
>>>>> In netcdf4/hdf5, compression is tied to the chunking. Each chunk is
>>>>> individually compressed, and must be completely decompressed to retrieve
>>>>> even one value from that chunk. So the trick is to make your chunks
>>>>> correspond to your "common cases" of data access. If that's possible, you
>>>>> should find that compressed access is faster than non-compressed access,
>>>>> because IO is smaller. But it will be highly dependent on that.
>>>>
>>>> John, is there a loss of efficiency when compressing chunks compared to
>>>> compressing the entire file? I vaguely recall that for some compression
>>>> algorithms, compression efficiency is a function of the volume of data
>>>> compressed.
>>>>
>>>> Peter
>>>>
>>>
>>> Hi Peter:
>>>
>>> I think dictionary methods such as deflate get better as the file size goes
>>> up, but the tradeoff here is to try to decompress only the data you
>>> actually want. Decompressing very large files can be very costly.
>>
>> Yes, this is why I chunk. The reason that I asked the question is that this
>> might influence the chunk size that one chooses.
>
> yup!
>
Ouch,
I *just* ran into this. We've converted some large data sets from NetCDF-3 to
NetCDF-4 w/ DeflateLevel = 1 and double to short (w/ scaling). I wrote some
code using NetCDF-Java to validate the conversion and naively accessed the data
one x slice at a time. The default chunking had chunked xy. The conversion
test was taking 20 minutes until I switched my slices to xy; now the test takes
40 seconds. You have to pay attention to those chunk sizes when determining how
to access the data!
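
For anyone who wants to see the effect in isolation, here is a rough sketch.
It is not the NetCDF-Java code I ran. It is a toy model in plain Python that
uses zlib as a stand-in for the HDF5 deflate filter, and the grid size, the
ChunkedStore class and its method names are all made up for illustration. Each
xy plane is stored as one compressed chunk, and the store counts how many
chunks get decompressed for each access pattern.

```python
import zlib
from array import array

# Hypothetical small grid (time, y, x); real data sets are far larger.
NT, NY, NX = 6, 4, 5


def make_chunks():
    """Store each (y, x) plane as one deflate-compressed chunk of 16-bit ints."""
    chunks = []
    for t in range(NT):
        plane = array("h", (t * 100 + y * NX + x
                            for y in range(NY) for x in range(NX)))
        chunks.append(zlib.compress(plane.tobytes(), 1))  # deflate level 1
    return chunks


class ChunkedStore:
    """Toy stand-in for a chunked, compressed variable (an assumption,
    not the NetCDF-Java API). No chunk cache is modeled."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.decompressions = 0

    def _load_plane(self, t):
        # A whole chunk must be inflated even if only one value is wanted.
        self.decompressions += 1
        plane = array("h")
        plane.frombytes(zlib.decompress(self.chunks[t]))
        return plane

    def read_xy(self, t):
        """One xy plane: exactly one chunk is decompressed."""
        return self._load_plane(t)

    def read_x_slice(self, x):
        """All (t, y) values at a fixed x: every chunk is decompressed."""
        out = []
        for t in range(NT):
            plane = self._load_plane(t)
            out.extend(plane[y * NX + x] for y in range(NY))
        return out


def count_decompressions():
    """Read the full data set both ways and return the chunk counts."""
    chunks = make_chunks()
    by_x = ChunkedStore(chunks)
    for x in range(NX):
        by_x.read_x_slice(x)
    by_xy = ChunkedStore(chunks)
    for t in range(NT):
        by_xy.read_xy(t)
    return by_x.decompressions, by_xy.decompressions
```

Reading everything one x slice at a time costs NX * NT chunk decompressions in
this model, while reading one xy plane at a time costs only NT. A real library
may cache chunks, so the ratio you see in practice will differ.
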
Tom Kunicki
Center for Integrated Data Analytics
U.S. Geological Survey
8505 Research Way
Middleton, WI 53562