Hi Dave,
My data consists of atmospheric profiles (pressure, temperature, H2O,
O3, trace gas absorbers, etc.). Each profile can have a different
number of atmospheric levels, and the absorber data can be for a
different number or set of absorbers. A single profile is about 25 KB.
So one group would look like:
group: atmprofile-1 {
  dimensions:
    n_Levels = 101 ;
    n_Layers = 100 ;
    n_Absorbers = 28 ;
  variables:
    double Level_Pressure(n_Levels) ;
    double Level_Temperature(n_Levels) ;
    double Level_Absorber(n_Absorbers, n_Levels) ;
    ...etc...
} // group atmprofile-1
and another like:
group: atmprofile-2 {
  dimensions:
    n_Levels = 91 ;
    n_Layers = 90 ;
    n_Absorbers = 2 ;
  variables:
    ...etc...
} // group atmprofile-2
The dimensions within a group are what I meant by "base" dimensions:
each group has the same set of dimensions, but with different values.
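For concreteness, writing one such group via the netCDF4-python API might
look something like the sketch below (the file name and data values are
made up purely for illustration; this just shows the shape of the per-group
define/write sequence, not my actual code):

from netCDF4 import Dataset
import numpy as np

ds = Dataset("AtmProfile.nc", "w", format="NETCDF4")

# Each profile gets its own group, with its own "base" dimension values.
grp = ds.createGroup("atmprofile-1")
grp.createDimension("n_Levels", 101)
grp.createDimension("n_Layers", 100)
grp.createDimension("n_Absorbers", 28)

p = grp.createVariable("Level_Pressure",    "f8", ("n_Levels",))
t = grp.createVariable("Level_Temperature", "f8", ("n_Levels",))
a = grp.createVariable("Level_Absorber",    "f8", ("n_Absorbers", "n_Levels"))

# Dummy data, just to show the write step.
p[:] = np.linspace(1100.0, 0.005, 101)
t[:] = np.full(101, 250.0)
a[:] = np.zeros((28, 101))

ds.close()

That define-then-write sequence is repeated for every profile in the file.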
With netCDF3, I always had to ensure the profile data were on the same
set (or, at least, the same number) of pressure levels, and had the same
number of gaseous absorbers, so that all the data could be packed into
arrays, e.g.
netcdf ECMWF52.AtmProfile {
  dimensions:
    n_levels = 101 ;
    n_layers = 100 ;
    n_absorbers = 2 ;
    n_profiles = UNLIMITED ; // (52 currently)
  variables:
    double level_pressure(n_profiles, n_levels) ;
    double level_temperature(n_profiles, n_levels) ;
    double level_absorber(n_profiles, n_absorbers, n_levels) ;
    ...etc...
}
Adding individual profiles as separate groups gives me more freedom
(and requires less preprocessing) to use profiles as they are delivered,
but at the cost of long I/O times for large(ish) datasets.
I guess I've fundamentally misinterpreted how groups in netCDF4 should
be used. Your point about the per-group setup time multiplying up makes
sense. It just seemed to me that, since the data content is effectively
the same (for my tests it is identical), the I/O time should be too.
But I guess not. The overhead of reading lots of little groups of data
(as in my dataset) is dominant. Bummer. :o(
Is there a way of storing this type of dataset in netCDF4 in, e.g.,
ragged arrays?
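I can imagine something like the sketch below (netCDF4-python again; the
variable names and the two-profile example are mine, and I don't know if
either layout is the recommended approach), using either a variable-length
(VLEN) type or a single concatenated array plus a per-profile count
variable (a "contiguous ragged array"):

from netCDF4 import Dataset
import numpy as np

ds = Dataset("AtmProfile_ragged.nc", "w", format="NETCDF4")
ds.createDimension("n_profiles", None)

# Option 1: a VLEN type -- one variable-length vector per profile.
level_vec = ds.createVLType(np.float64, "level_vector")
vp = ds.createVariable("level_pressure_vlen", level_vec, ("n_profiles",))
vp[0] = np.linspace(1100.0, 0.005, 101)  # a 101-level profile
vp[1] = np.linspace(1013.0, 0.01, 91)    # a 91-level profile

# Option 2: a contiguous ragged array -- all levels concatenated into one
# 1-D array, plus a count variable giving each profile's length.
ds.createDimension("n_obs", None)
count = ds.createVariable("profile_n_levels", "i4", ("n_profiles",))
pres  = ds.createVariable("level_pressure",   "f8", ("n_obs",))
count[0:2]    = [101, 91]
pres[0:101]   = np.linspace(1100.0, 0.005, 101)
pres[101:192] = np.linspace(1013.0, 0.01, 91)

ds.close()

The multi-dimensional absorber data would need flattening (or a second
count variable) in either layout, which is part of what I'm unsure about.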
cheers,
paulv
On 10/30/13 14:28, David W. Pierce wrote:
Hi Paul,
Well, you don't say what the size of each timestep is, but as the size
of each timestep becomes small (< 50 MB maybe?) I would think that
doing each timestep as a separate group (if that's what you're doing)
would, for a 5000-timestep array, take ~5000 times as long. That's
because the setup time is very considerable, and the incremental time
for a second timestep, once you've set up for the first, is small
(unless each timestep is quite large).
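As a rough illustration of what I mean (a netCDF4-python sketch with
made-up sizes; substitute whatever interface you are actually using),
the define step is paid once per group in the first layout but only
once in total in the second:

from netCDF4 import Dataset
import numpy as np

nprof, nlev = 5000, 101

# Group-per-timestep layout: dimensions and variables are defined 5000 times.
ds = Dataset("per_group.nc", "w", format="NETCDF4")
for i in range(nprof):
    g = ds.createGroup("profile-%d" % i)
    g.createDimension("n_Levels", nlev)
    v = g.createVariable("Level_Pressure", "f8", ("n_Levels",))
    v[:] = np.zeros(nlev)
ds.close()

# Packed layout: one define step, then one large write.
ds = Dataset("packed.nc", "w", format="NETCDF4")
ds.createDimension("n_profiles", nprof)
ds.createDimension("n_levels", nlev)
v = ds.createVariable("level_pressure", "f8", ("n_profiles", "n_levels"))
v[:] = np.zeros((nprof, nlev))
ds.close()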
For someone who doesn't know just what you're doing, this part is
pretty hard to parse:
"I did this so each group can have different "base" dimensions for the
data arrays."
Maybe you could give a specific example? Not knowing the details,
it's hard to see why it would be desirable to take the multi-group
approach, or to think about alternative approaches that would accomplish
your goal but might be more efficient.
Regards,
--Dave
On Wed, Oct 30, 2013 at 8:00 AM, Paul van Delst
<paul.vandelst@xxxxxxxx> wrote:
Hello,
I've just converted some of my netCDF writing code to write/read
multiple groups rather than use an unlimited dimension. I did this
so each group can have different "base" dimensions for the data
arrays.
I have one data set where the unlimited dimension is 5000. The
read/write of this data in netCDF3 format is almost instantaneous.
When I use the netCDF4 approach (reading and writing 5000 separate
groups) the reads and writes can take upwards of 10 minutes (I
started the program at 10:33am; it is now 10:51am and the read of
the created file is still going on).
I realise there's going to be additional overhead using the
"groups" approach (defining dimensions and variables for each
group), but I presume I'm doing something very wrong/stupid to
cause the I/O to be as slow as it is. Before I start posting code
snippets, does anyone have any hints, from experience, as to what
could be causing this supa slow I/O?
Thanks for any info.
cheers,
paulv
p.s. It's now 11:00am and the dataset reading is still going on...
--
David W. Pierce
Division of Climate, Atmospheric Science, and Physical Oceanography
Scripps Institution of Oceanography, La Jolla, California, USA
(858) 534-8276 (voice) / (858) 534-8561 (fax) dpierce@xxxxxxxx
_______________________________________________
netcdfgroup mailing list
netcdfgroup@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit:
http://www.unidata.ucar.edu/mailing_lists/