Re: [netcdfgroup] Unexpectedly large netCDF4 files from python

On Tue, Apr 5, 2016 at 12:56 PM, Val Schmidt <vschmidt@xxxxxxxxxxxx> wrote:

> I was able to change the chunk size and get a file size that makes much
> more sense. With a chunk size of 1024, I get a file of 166kBytes.
>
> What are the units of chunk size by the way?
>

I think it is in units of elements of the variable (not bytes) -- not sure
what it does with a VLType, though.

This should probably be reported as a bug -- that was a huge default chunk
size.

Chunk size (and shape!) is tricky: the appropriate size depends on access
patterns and on the hardware itself -- disk cache sizes, etc. But when I
was experimenting with chunk sizes on large 1-D arrays, I found it made a
huge difference going from very small chunks (the default had been 1!) up
to about 1024, a small further difference up to 1 MB, and then not much
difference at all.

So default chunks should probably be no larger than 1024 elements (or maybe
1 MB), but certainly not as huge as this seems to be.

-CHB




> -Val
>
> On Apr 5, 2016, at 3:53 PM, Chris Barker <chris.barker@xxxxxxxx> wrote:
>
> oh, and I've enclosed my code -- yours didn't actually run -- missing
> imports?
>
>
>
>
> On Tue, Apr 5, 2016 at 12:52 PM, Chris Barker <chris.barker@xxxxxxxx>
> wrote:
>
>>
>>
>> On Tue, Apr 5, 2016 at 12:13 PM, Ted Mansell <ted.mansell@xxxxxxxx>
>> wrote:
>>
>>> You might check the ChunkSizes attribute with 'ncdump -hs'. The newer
>>> netcdf sets larger default chunks than it used to. I had this issue with
>>> 1-d variables that used an unlimited dimension. Even if the dimension only
>>> held a small number of entries, the default chunk made the file much bigger.
>>
>>
>> I had the same issue -- a 1-d variable had a chunksize of 1, which was
>> really, really bad!
>>
>> But that doesn't seem to be the issue here -- I ran the same code, got
>> the same results, and here is the dump:
>>
>> netcdf text3 {
>> types:
>>   ubyte(*) variable_data_t ;
>> dimensions:
>>     timestamp_dim = UNLIMITED ; // (1 currently)
>>     data_dim = UNLIMITED ; // (1 currently)
>>     item_len = 100 ;
>> variables:
>>     double timestamp(timestamp_dim) ;
>>         timestamp:_Storage = "chunked" ;
>>         timestamp:_ChunkSizes = 524288 ;
>>     variable_data_t data(data_dim) ;
>>         data:_Storage = "chunked" ;
>>         data:_ChunkSizes = 4194304 ;
>>         data:_NoFill = "true" ;
>>
>> // global attributes:
>>         :_Format = "netCDF-4" ;
>> }
>>
>> If I read that right, those are nice big chunks.
>>
>> Note that if I don't use a VLType variable, I still get a 4 MB file --
>> though that could just be netCDF-4 overhead:
>>
>> netcdf text3 {
>> types:
>>   ubyte(*) variable_data_t ;
>> dimensions:
>>     timestamp_dim = UNLIMITED ; // (1 currently)
>>     data_dim = UNLIMITED ; // (1 currently)
>>     item_len = 100 ;
>> variables:
>>     double timestamp(timestamp_dim) ;
>>         timestamp:_Storage = "chunked" ;
>>         timestamp:_ChunkSizes = 524288 ;
>>     ubyte data(data_dim, item_len) ;
>>         data:_Storage = "chunked" ;
>>         data:_ChunkSizes = 1, 100 ;
>>
>> // global attributes:
>>         :_Format = "netCDF-4" ;
>> }
>>
>> Something is up with the VLen type...
>>
>> -CHB
>>
>>
>>
>>
>>
>>> (Assuming the variable is not compressed.)
>>>
>>> -- Ted
>>>
>>> __________________________________________________________
>>> | Edward Mansell <ted.mansell@xxxxxxxx>
>>> | National Severe Storms Laboratory
>>> |--------------------------------------------------------------
>>> | "The contents of this message are mine personally and
>>> | do not reflect any position of the U.S. Government or NOAA."
>>> |--------------------------------------------------------------
>>>
>>> On Apr 5, 2016, at 1:44 PM, Val Schmidt <vschmidt@xxxxxxxxxxxx> wrote:
>>>
>>> > Hello netcdf folks,
>>> >
>>> > I'm testing some python code for writing sets of timestamps and
>>> variable length binary blobs to a netcdf file and the resulting file size
>>> is perplexing to me.
>>> >
>>> > The following segment of python code creates a file with just two
>>> variables, 'timestamp' and 'data', populates the first entry of the
>>> timestamp variable with a float and the corresponding first entry of the
>>> data variable with an array of 100 unsigned 8-bit integers. The total
>>> amount of data is 108 bytes.
>>> >
>>> > But the resulting file is over 73 MB in size. Does anyone know why
>>> this might be so large and what I might be doing to cause it?
>>> >
>>> > Thanks,
>>> >
>>> > Val
>>> >
>>> >
>>> > from netCDF4 import Dataset
>>> > import numpy
>>> >
>>> > f = Dataset('scratch/text3.nc','w')
>>> >
>>> > dim = f.createDimension('timestamp_dim',None)
>>> > data_dim = f.createDimension('data_dim',None)
>>> >
>>> > data_t = f.createVLType('u1','variable_data_t')
>>> >
>>> > timestamp = f.createVariable('timestamp','d','timestamp_dim')
>>> > data = f.createVariable('data',data_t,'data_dim')
>>> >
>>> > timestamp[0] = time.time()
>>> > data[0] = uint8( numpy.ones(1,100))
>>> >
>>> > f.close()
>>> >
>>> > ------------------------------------------------------
>>> > Val Schmidt
>>> > CCOM/JHC
>>> > University of New Hampshire
>>> > Chase Ocean Engineering Lab
>>> > 24 Colovos Road
>>> > Durham, NH 03824
>>> > e: vschmidt [AT] ccom.unh.edu
>>> > m: 614.286.3726
>>> >
>>> >
>>> > _______________________________________________
>>> > netcdfgroup mailing list
>>> > netcdfgroup@xxxxxxxxxxxxxxxx
>>> > For list information or to unsubscribe,  visit:
>>> http://www.unidata.ucar.edu/mailing_lists/
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>>
>> Christopher Barker, Ph.D.
>> Oceanographer
>>
>> Emergency Response Division
>> NOAA/NOS/OR&R            (206) 526-6959   voice
>> 7600 Sand Point Way NE   (206) 526-6329   fax
>> Seattle, WA  98115       (206) 526-6317   main reception
>>
>> Chris.Barker@xxxxxxxx
>>
>
>
>
> <huge_nc_file.py>
>
>
>
>
>

