Re: [netcdfgroup] File size, unlimited dimensions, compression and chunks



On 1/9/2012 9:15 AM, Ross Williamson wrote:
Hi Ted,

Thanks - I guess I'm under the impression that netcdf4 should always
be used over netcdf3 (i.e. *better*).

"Better" for whom? If you have an interest in inter-operability -- i.e. a belief that there may be users for your data beyond the group creating it -- then you should think very carefully about the netCDF3 / netCDF4 trade-off. Also, developing custom applications for every analysis and visualization task can be a productivity killer.

  One test I did do is to replace
all unlimited dimensions with fixed dimensions of size 900, and the file
size is basically what one expects (350K vs 5MB), so it really is the
unlimited dimensions that are causing the large file size.  I've pasted
the header of the netCDF file below in case anyone is interested - I
would really like to keep the unlimited dimensions option available
for data logging.

I do use quite a few 2D variables and also two unlimited dimensions
(fast and slow) where the fast has 100 samples for each slow. Once
fully implemented I expect to be dumping about 2Mb/s to the netCDF
file.

Any advice much appreciated.

Ross

dimensions:
        slow_reg = UNLIMITED ; // (900 currently)
        fast_reg = UNLIMITED ; // (0 currently)

group: array {

   group: frame {
     variables:
        uint status(slow_reg) ;
        ubyte received(slow_reg) ;
        uint nsnap(slow_reg) ;
        uint record(slow_reg) ;
        double utc(slow_reg) ;
        uint features(slow_reg) ;
        uint markSeq(slow_reg) ;
     } // group frame

   group: pt415 {
     variables:
        uint status(slow_reg) ;
        uint record(slow_reg) ;
        ... (quite a few more in here)
        float error_code(slow_reg) ;
     } // group pt415

   group: sim900 {
     dimensions:
        dvm_volts_dim2 = 4 ;
        dvm_gnd_dim2 = 4 ;
        dvm_ref_dim2 = 4 ;
        therm_volts_dim2 = 4 ;
        therm_temperature_dim2 = 4 ;
     variables:
        uint status(slow_reg) ;
        uint record(slow_reg) ;
        double utc(slow_reg) ;
        float main_volt_monitor(slow_reg) ;
        float main_current_monitor(slow_reg) ;
        float main_power_monitor(slow_reg) ;
        float main_undervoltage(slow_reg) ;
        uint main_tick(slow_reg) ;
        float dvm_volts(slow_reg, dvm_volts_dim2) ;
        float dvm_gnd(slow_reg, dvm_gnd_dim2) ;
        float dvm_ref(slow_reg, dvm_ref_dim2) ;
        float therm_volts(slow_reg, therm_volts_dim2) ;
        float therm_temperature(slow_reg, therm_temperature_dim2) ;
        ... (Few more in here)
        float bridge_output_value(slow_reg) ;
     } // group sim900
   } // group array

group: antenna0 {

   group: frame {
     variables:
        uint status(slow_reg) ;
        ubyte received(slow_reg) ;
        uint nsnap(slow_reg) ;
        uint record(slow_reg) ;
        double utc(slow_reg) ;
        uint features(slow_reg) ;
        uint markSeq(slow_reg) ;
     } // group frame

   group: acu {
     variables:
        uint status(slow_reg) ;
        uint new_mode(slow_reg) ;
        ...
        uint px_checksum_error_count(slow_reg) ;
        uint px_resyncing(slow_reg) ;
     } // group acu

   group: gpsTime {
     variables:
        uint status(slow_reg) ;
        ...
        uint serialNumber(slow_reg) ;
     } // group gpsTime
   } // group antenna0

group: receiver {

   group: frame {
     variables:
        uint status(slow_reg) ;
        ubyte received(slow_reg) ;
        uint nsnap(slow_reg) ;
        uint record(slow_reg) ;
        double utc(slow_reg) ;
        uint features(slow_reg) ;
        uint markSeq(slow_reg) ;
     } // group frame

   group: bolometers {
     variables:
        uint status(slow_reg) ;
     } // group bolometers
   } // group receiver
}

On Mon, Jan 9, 2012 at 4:56 PM, Ted Mansell <ted.mansell@xxxxxxxx> wrote:
I don't think you can chunk an unlimited dimension by more than 1.  What are 
the variable dimensions?  Your formula makes it sound like they are 1-D and 
only sized by the unlimited dimension.  If that is the case, compression won't 
help.  You might be better off with a netcdf-3 file?

-- Ted

On Jan 9, 2012, at 8:15 AM, Ross Williamson wrote:

I'm trying to get my head around the filesize of my netcdf-4 file -
Some background.

1) I'm using the netcdf_c++4 API
2) I have an unlimited dimension to which I write data about every second
3) There are a set of nested groups
4) I'm using compression on each variable
5) I'm using the default chunk size which I think is 1 for the
unlimited dimensions and sizeof(type) for other dimensions
6) I take data for 900 samples - There are about 100 variables so I
would expect (given 4-byte values) a file size of 900x100x4 = 360K. Now I
fully expect some level of overhead but my file sizes are 5MB which is
incredibly large.
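
The estimate in item 6 can be written out as a short sketch. It assumes 4 bytes per value, as the 900x100x4 figure does (this matches float and uint; a double takes 8 bytes), treats "about 100 variables" as exactly 100 one-dimensional variables, and reads "5MB" as 5 * 10**6 bytes. The function name is made up for illustration.

```python
def raw_payload_bytes(n_records, n_variables, bytes_per_value):
    """Uncompressed data size, ignoring all file-format overhead."""
    return n_records * n_variables * bytes_per_value

four_byte = raw_payload_bytes(900, 100, 4)    # float / uint values
eight_byte = raw_payload_bytes(900, 100, 8)   # double values
observed = 5_000_000  # "5MB" from the message, read as 5 * 10**6 bytes (assumption)

print(four_byte, eight_byte)            # 360000 720000
print(round(observed / four_byte, 1))   # 13.9
```

Even if every variable were a double, the raw data would come to well under 1MB, so most of the observed size has to be something other than the data itself.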

Now compression doesn't make much difference (5MB vs 5.3MB).  I'm
assuming the thing that is screwing me over is that I haven't got
my chunking set right. The issue is that I'm rather confused.  It
appears that you set the chunk size for each variable rather than for
the whole file, which doesn't make sense to me.  Would I just multiply
each chunk size by, say, 100, so use 100 for the unlimited dimension and
sizeof(type)*100 for the other dimensions?
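
To make the chunking question concrete, here is a rough sketch of how the chunk length along the unlimited dimension changes the number of chunks in the file. It assumes 100 one-dimensional variables of 4-byte values and 900 records. Chunk lengths in netCDF-4 are counted in elements, not bytes, so the sketch uses element counts. The helper name is hypothetical, and the sketch says nothing about how much bookkeeping the library actually spends per chunk.

```python
import math

def chunk_layout(n_records, chunk_len, bytes_per_value=4):
    """Return (chunks per 1-D variable, data bytes held by each full chunk)."""
    n_chunks = math.ceil(n_records / chunk_len)
    return n_chunks, chunk_len * bytes_per_value

N_RECORDS = 900     # samples written so far, from the message
N_VARIABLES = 100   # "about 100 variables", rounded (assumption)

# Columns: chunk length, total chunks in the file, data bytes per chunk
for chunk_len in (1, 100, 900):
    per_var, chunk_bytes = chunk_layout(N_RECORDS, chunk_len)
    print(chunk_len, per_var * N_VARIABLES, chunk_bytes)
```

With a chunk length of 1, the file holds 90000 chunks of 4 data bytes each, so any fixed per-chunk cost easily outweighs the data. With a chunk length of 100 it holds 900 chunks of 400 bytes each. Compression is also applied per chunk, which may be why it made so little difference here.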

I'd really like to fix this as netcdf-4 seems ideal for my project but
I can't deal with a size overhead of an order of magnitude.

I can attach the header of the netcdf file if it helps.

Ross

--
Ross Williamson
Associate Research Scientist
Columbia Astrophysics Laboratory
212-851-9379 (office)
212-854-4653 (Lab)
312-504-3051 (Cell)

_______________________________________________
netcdfgroup mailing list
netcdfgroup@xxxxxxxxxxxxxxxx
For list information or to unsubscribe,  visit: 
http://www.unidata.ucar.edu/mailing_lists/
