Hi Ross:
from
http://www.unidata.ucar.edu/software/netcdf/docs/netcdf-c/nc_005fdef_005fvar_005fchunking.html#nc_005fdef_005fvar_005fchunking
"Variables that make use of one or more unlimited dimensions,
compression, or checksums must use chunking. Such variables are created
with default chunk sizes of 1 for each unlimited dimension and the
dimension length for other dimensions, except that if the resulting
chunks are too large, the default chunk sizes for non-record dimensions
are reduced."
So you are putting each value of each variable in its own chunk. That's
got a lot of overhead.
You want to set the chunk size explicitly, using the call:
int nc_def_var_chunking(int ncid, int varid, int storage, size_t *chunksizesp);
(note that you must do that on each variable). Experiment with the size.
100 might solve the problem but 900 might be better.
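A minimal sketch of that call for one of the 1-D variables in your file, assuming the file is still in define mode (the call must come before the first write to the variable). The chunk length of 900 and the ncid/varid handles are illustrative; error handling is abbreviated:

```c
/* Set an explicit chunk size along the unlimited dimension for a
   1-D variable such as uint status(slow_reg). */
#include <netcdf.h>
#include <stdio.h>

static void set_chunking(int ncid, int varid)
{
    /* One chunk size per dimension of the variable; this variable
       is 1-D, so one entry. */
    size_t chunksizes[1] = { 900 };

    int status = nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunksizes);
    if (status != NC_NOERR)
        fprintf(stderr, "nc_def_var_chunking: %s\n", nc_strerror(status));
}
```

Repeat for each variable (and for 2-D variables pass two chunk sizes, e.g. {900, 4} for dvm_volts).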
Also, you might consider using Compound types instead of Groups.
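For illustration, a compound type lets a whole record of related fields be stored in one variable instead of many parallel 1-D variables, which also cuts the number of chunked datasets. A hedged sketch, with hypothetical type and field names modeled on the "frame" group (requires a netCDF-4 file):

```c
/* Sketch: a single compound-typed variable in place of several
   parallel 1-D variables. */
#include <netcdf.h>
#include <stddef.h>   /* offsetof */

struct frame_rec {
    unsigned int status;
    unsigned char received;
    double utc;
};

static int def_frame_var(int ncid, int slow_dimid, int *varidp)
{
    nc_type typeid;

    /* Define the compound type and register each field by its
       offset within the C struct. */
    nc_def_compound(ncid, sizeof(struct frame_rec), "frame_rec", &typeid);
    nc_insert_compound(ncid, typeid, "status",
                       offsetof(struct frame_rec, status), NC_UINT);
    nc_insert_compound(ncid, typeid, "received",
                       offsetof(struct frame_rec, received), NC_UBYTE);
    nc_insert_compound(ncid, typeid, "utc",
                       offsetof(struct frame_rec, utc), NC_DOUBLE);

    /* One 1-D variable of compound type along slow_reg. */
    return nc_def_var(ncid, "frame", typeid, 1, &slow_dimid, varidp);
}
```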
Also, I don't see the fast_reg dimension used below.
John
On 1/9/2012 10:15 AM, Ross Williamson wrote:
Hi Ted,
Thanks - I guess I was under the impression that netCDF-4 should always
be preferred over netCDF-3 (i.e., that it is simply better). One test I
did do was to remove all unlimited dimensions and fix them to a size of
900, and the file size is then basically what one expects (350 KB vs
5 MB), so it really is the unlimited dimensions that are causing the
large file size. I've included the header of the netCDF file below in
case anyone is interested - I would really like to keep the unlimited
dimensions option available for data logging.
I do use quite a few 2-D variables and also two unlimited dimensions
(fast and slow), where fast has 100 samples for each slow sample. Once
fully implemented I expect to be dumping about 2 MB/s to the netCDF
file.
Any advice much appreciated.
Ross
dimensions:
slow_reg = UNLIMITED ; // (900 currently)
fast_reg = UNLIMITED ; // (0 currently)
group: array {
group: frame {
variables:
uint status(slow_reg) ;
ubyte received(slow_reg) ;
uint nsnap(slow_reg) ;
uint record(slow_reg) ;
double utc(slow_reg) ;
uint features(slow_reg) ;
uint markSeq(slow_reg) ;
} // group frame
group: pt415 {
variables:
uint status(slow_reg) ;
uint record(slow_reg) ;
... (quite a few more in here)
float error_code(slow_reg) ;
} // group pt415
group: sim900 {
dimensions:
dvm_volts_dim2 = 4 ;
dvm_gnd_dim2 = 4 ;
dvm_ref_dim2 = 4 ;
therm_volts_dim2 = 4 ;
therm_temperature_dim2 = 4 ;
variables:
uint status(slow_reg) ;
uint record(slow_reg) ;
double utc(slow_reg) ;
float main_volt_monitor(slow_reg) ;
float main_current_monitor(slow_reg) ;
float main_power_monitor(slow_reg) ;
float main_undervoltage(slow_reg) ;
uint main_tick(slow_reg) ;
float dvm_volts(slow_reg, dvm_volts_dim2) ;
float dvm_gnd(slow_reg, dvm_gnd_dim2) ;
float dvm_ref(slow_reg, dvm_ref_dim2) ;
float therm_volts(slow_reg, therm_volts_dim2) ;
float therm_temperature(slow_reg, therm_temperature_dim2) ;
... (Few more in here)
float bridge_output_value(slow_reg) ;
} // group sim900
} // group array
group: antenna0 {
group: frame {
variables:
uint status(slow_reg) ;
ubyte received(slow_reg) ;
uint nsnap(slow_reg) ;
uint record(slow_reg) ;
double utc(slow_reg) ;
uint features(slow_reg) ;
uint markSeq(slow_reg) ;
} // group frame
group: acu {
variables:
uint status(slow_reg) ;
uint new_mode(slow_reg) ;
...
uint px_checksum_error_count(slow_reg) ;
uint px_resyncing(slow_reg) ;
} // group acu
group: gpsTime {
variables:
uint status(slow_reg) ;
...
uint serialNumber(slow_reg) ;
} // group gpsTime
} // group antenna0
group: receiver {
group: frame {
variables:
uint status(slow_reg) ;
ubyte received(slow_reg) ;
uint nsnap(slow_reg) ;
uint record(slow_reg) ;
double utc(slow_reg) ;
uint features(slow_reg) ;
uint markSeq(slow_reg) ;
} // group frame
group: bolometers {
variables:
uint status(slow_reg) ;
} // group bolometers
} // group receiver
}
On Mon, Jan 9, 2012 at 4:56 PM, Ted Mansell<ted.mansell@xxxxxxxx> wrote:
I don't think you can chunk an unlimited dimension by more than 1. What are
the variable dimensions? Your formula makes it sound like they are 1-D and
only sized by the unlimited dimension. If that is the case, compression won't
help. You might be better off with a netcdf-3 file?
-- Ted
On Jan 9, 2012, at 8:15 AM, Ross Williamson wrote:
I'm trying to get my head around the filesize of my netcdf-4 file -
Some background.
1) I'm using the netcdf_c++4 API
2) I have an unlimited dimension which I write data to about every second
3) There are a set of nested groups
4) I'm using compression on each variable
5) I'm using the default chunk size which I think is 1 for the
unlimited dimensions and sizeof(type) for other dimensions
6) I take data for 900 samples - There are about 100 variables, so I
would expect (given 4-byte values) a file size of 900x100x4 = 360 KB.
Now I fully expect some level of overhead, but my file sizes are 5 MB,
which is incredibly large.
Now compression doesn't make much difference (5 MB vs 5.3 MB). I'm
assuming here the thing that is screwing me over is that I haven't got
my chunking set right. The issue is that I'm rather confused: it
appears that you set the chunk size for each variable rather than for
the whole file, which doesn't make sense to me. Would I just multiply
each chunk size by, say, 100, so have 100 for the unlimited dimension
and sizeof(type)*100 for other dimensions?
I'd really like to fix this as netcdf-4 seems ideal for my project but
I can't deal with a size overhead of an order of magnitude.
I can attach the header of the netcdf file if it helps.
Ross
--
Ross Williamson
Associate Research Scientist
Columbia Astrophysics Laboratory
212-851-9379 (office)
212-854-4653 (Lab)
312-504-3051 (Cell)
_______________________________________________
netcdfgroup mailing list
netcdfgroup@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit:
http://www.unidata.ucar.edu/mailing_lists/