Re: [netcdf-java] point data


Roland and John,

I have the GSOD data written two ways: one file for all stations (5609), and one 
station per file, using the CF contiguous ragged array representation, written 
with C netCDF 4.1 using NC_CLASSIC_FORMAT.  I was initially confused since (for 
some strange reason) I thought CF and CDM were mutually exclusive; this is 
definitely not the case.  The large file comes in at 1.2 GB, and performance in 
ToolsUI seems reasonable.  ToolsUI reads data for one station at a time; I don't 
know how performance will scale when reading multiple stations simultaneously 
(e.g. with a LatLonRect in NetCDF-Java).  If there is a performance hit, it can 
be worked around by reading all the data for a single station at once, so I'm 
not too worried about it.
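
For what it's worth, here is a minimal netcdf-java sketch of that workaround 
against the large multi-station file (the file name and station index are 
hypothetical, and this reads the low-level variables directly rather than going 
through the point feature API):

import ucar.ma2.Array;
import ucar.nc2.NetcdfFile;

public class ReadStationBlock {
  public static void main(String[] args) throws Exception {
    NetcdfFile nc = NetcdfFile.open("gsod_all_stations.nc"); // hypothetical name
    try {
      int station = 42; // hypothetical station index

      // ragged_row_size(station) holds the observation count for each station;
      // summing the counts of the preceding stations gives this station's
      // offset into the observation dimension.
      Array rowSizes = nc.findVariable("ragged_row_size").read();
      int offset = 0;
      for (int i = 0; i < station; i++)
        offset += rowSizes.getInt(i);
      int count = rowSizes.getInt(station);

      // One contiguous read then pulls the station's entire temp record.
      Array tempVals = nc.findVariable("temp")
          .read(new int[] {offset}, new int[] {count});
      System.out.println("read " + tempVals.getSize() + " packed temp values");
    } finally {
      nc.close();
    }
  }
}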

The data packing/unpacking usage is standardized, so this isn't an issue 
(http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#Packed%20Data%20Values).
Our current GSOD netCDFs store data as packed shorts, and ToolsUI has no 
problem converting these values.  It adds some responsibility to client code 
reading the data, but we can generalize the implementation based on the 
reference standards.
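
The convention boils down to unpacked = packed * scale_factor + add_offset, and 
netcdf-java can apply it for us.  A minimal sketch, assuming the single-station 
file from the CDL below (NetcdfDataset's enhanced mode applies 
scale_factor/add_offset on read, so clients get floats back):

import ucar.ma2.Array;
import ucar.nc2.Variable;
import ucar.nc2.dataset.NetcdfDataset;

public class UnpackTemp {
  public static void main(String[] args) throws Exception {
    // openDataset() returns an "enhanced" dataset: packed shorts are
    // unpacked as they are read.
    NetcdfDataset ds = NetcdfDataset.openDataset("690020-93218.nc");
    try {
      Variable temp = ds.findVariable("temp");
      Array vals = temp.read(); // floats: packed * 0.1f + 0.0f
      System.out.println("first temp = " + vals.getFloat(0) + " degF");
    } finally {
      ds.close();
    }
  }
}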

For stationTimeSeries data, is it possible by convention to specify the 
observation date ranges in the station pseudo-structure?  Currently we have to 
read the entire sequence of station data before 
StationTimeSeriesFeature.getDateRange() will return non-null.
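
To illustrate the behavior (a sketch against the point feature API as I 
understand it, using the single-station file from the CDL below):

import java.util.Formatter;
import ucar.nc2.constants.FeatureType;
import ucar.nc2.ft.FeatureDatasetFactoryManager;
import ucar.nc2.ft.FeatureDatasetPoint;
import ucar.nc2.ft.PointFeatureIterator;
import ucar.nc2.ft.StationTimeSeriesFeature;
import ucar.nc2.ft.StationTimeSeriesFeatureCollection;

public class DateRangeDemo {
  public static void main(String[] args) throws Exception {
    Formatter errlog = new Formatter();
    FeatureDatasetPoint fdp = (FeatureDatasetPoint) FeatureDatasetFactoryManager
        .open(FeatureType.STATION, "690020-93218.nc", null, errlog);
    StationTimeSeriesFeatureCollection stations =
        (StationTimeSeriesFeatureCollection) fdp.getPointFeatureCollectionList().get(0);
    StationTimeSeriesFeature sf =
        stations.getStationFeature(stations.getStations().get(0));

    System.out.println(sf.getDateRange()); // null: nothing has been read yet

    PointFeatureIterator it = sf.getPointFeatureIterator(-1);
    while (it.hasNext()) it.next(); // forced to read the entire series
    it.finish();

    System.out.println(sf.getDateRange()); // non-null only after the full read
  }
}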

I did try to aggregate the individual station files with NcML, but it's not 
obvious how to do this with station time series data (all files share the same 
CDL).  There also seems to be a limit on the number of files one can aggregate: 
ToolsUI had an issue processing one of my attempts, giving an error akin to 
"too many open files" (do all the files need to be kept open?).  Having one 
station per file is advantageous, so it would be nice to get this working.  The 
CDL is below; any idea how to merge multiple station files with NcML?  I did 
try an aggregation type of "union", but it only showed data from the first 
station file, independent of the number of referenced station files.  It's 
currently not clear to me how to do this with an aggregation type of 
joinExisting or joinNew.
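
For reference, the union attempt looked roughly like this (the second file name 
is hypothetical).  As I understand it, a union aggregation merges the variables 
of the component files, with the first occurrence of a name winning, rather 
than concatenating along a dimension; that would explain why only the first 
station's data showed up:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="union">
    <!-- one file per station; all share the same CDL -->
    <netcdf location="690020-93218.nc"/>
    <netcdf location="724080-13739.nc"/> <!-- hypothetical second station -->
  </aggregation>
</netcdf>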

### START
netcdf \690020-93218 {
dimensions:
        station = 1 ;
        stationid_len = 32 ;
        observation = UNLIMITED ; // (4144 currently)
variables:
        float lon(station) ;
                lon:units = "degrees_east" ;
                lon:_FillValue = -999.999f ;
        float lat(station) ;
                lat:units = "degrees_north" ;
                lat:_FillValue = -99.999f ;
        float elev(station) ;
                elev:units = "ft" ;
                elev:positive = "up" ;
                elev:_FillValue = -99.999f ;
        int wmo(station) ;
                wmo:standard_name = "station_WMO_id" ;
        int wban(station) ;
        char stationid(station, stationid_len) ;
                stationid:standard_name = "station_id" ;
        int ragged_row_size(station) ;
                ragged_row_size:standard_name = "ragged_row_size" ;
        short time(observation) ;
                time:units = "days since 1929-01-01 00:00:00" ;
        short temp(observation) ;
                temp:units = "degF" ;
                temp:_FillValue = -32768s ;
                temp:scale_factor = 0.1f ;
                temp:add_offset = 0.f ;
                temp:coordinates = "time lon lat elev" ;
        short dewp(observation) ;
                dewp:units = "degF" ;
                dewp:_FillValue = -32768s ;
                dewp:scale_factor = 0.1f ;
                dewp:add_offset = 0.f ;
                dewp:coordinates = "time lon lat elev" ;
        short slp(observation) ;
                slp:units = "mbar" ;
                slp:_FillValue = -32768s ;
                slp:scale_factor = 0.1f ;
                slp:add_offset = 0.f ;
                slp:coordinates = "time lon lat elev" ;
        short stp(observation) ;
                stp:units = "mbar" ;
                stp:_FillValue = -32768s ;
                stp:scale_factor = 0.1f ;
                stp:add_offset = 0.f ;
                stp:coordinates = "time lon lat elev" ;
        short visib(observation) ;
                visib:units = "miles" ;
                visib:_FillValue = -32768s ;
                visib:scale_factor = 0.1f ;
                visib:add_offset = 0.f ;
                visib:coordinates = "time lon lat elev" ;
        short wdsp(observation) ;
                wdsp:units = "knots" ;
                wdsp:_FillValue = -32768s ;
                wdsp:scale_factor = 0.1f ;
                wdsp:add_offset = 0.f ;
                wdsp:coordinates = "time lon lat elev" ;
        short mxspd(observation) ;
                mxspd:units = "knots" ;
                mxspd:_FillValue = -32768s ;
                mxspd:scale_factor = 0.1f ;
                mxspd:add_offset = 0.f ;
                mxspd:coordinates = "time lon lat elev" ;
        short gust(observation) ;
                gust:units = "knots" ;
                gust:_FillValue = -32768s ;
                gust:scale_factor = 0.1f ;
                gust:add_offset = 0.f ;
                gust:coordinates = "time lon lat elev" ;
        short max(observation) ;
                max:units = "degF" ;
                max:_FillValue = -32768s ;
                max:scale_factor = 0.1f ;
                max:add_offset = 0.f ;
                max:coordinates = "time lon lat elev" ;
        short min(observation) ;
                min:units = "degF" ;
                min:_FillValue = -32768s ;
                min:scale_factor = 0.1f ;
                min:add_offset = 0.f ;
                min:coordinates = "time lon lat elev" ;
        short prcp(observation) ;
                prcp:units = "inches" ;
                prcp:_FillValue = -32768s ;
                prcp:scale_factor = 0.1f ;
                prcp:add_offset = 0.f ;
                prcp:coordinates = "time lon lat elev" ;
        short sndp(observation) ;
                sndp:units = "inches" ;
                sndp:_FillValue = -32768s ;
                sndp:scale_factor = 0.1f ;
                sndp:add_offset = 0.f ;
                sndp:coordinates = "time lon lat elev" ;
        byte frshtt(observation) ;
                frshtt:_FillValue = 0b ;
                frshtt:coordinates = "time lon lat elev" ;

// global attributes:
                :Conventions = "CF-1.5" ;
                :CF\:featureType = "stationTimeSeries" ;
}
### END

Tom Kunicki
Center for Integrated Data Analytics
U.S. Geological Survey
8505 Research Way
Middleton, WI  53562





On Jan 26, 2010, at 1:19 PM, Roland Viger wrote:

> 
> Hi John, 
> 
> I'll try to add a bit to Lauren's response. Hopefully the others will make 
> sure I'm not mangling the technology or vision on this. So, yes on all three 
> types of queries (including Lauren's additional one), but it might be the 
> case that Lauren's case #3 (with a time period) is the only one that needs to 
> be supported if we use a metadatabase to answer the rest of the questions 
> (location, period of record, data quality, etc) before querying the actual 
> data store. We might need to think about this on our end a little more. 
> 
> As far as web services, we're expecting to serve all this through THREDDS or 
> direct NetCDF reads. As far as clients, our first focus will be stuff we make 
> ourselves. Homemade web page interfaces and Java applications are the most 
> important for the short term. Access to the data from other data servers 
> (other instances of THREDDS, RAMADDA, or non-Unidata products like ERDDAP) is 
> also on the horizon, but not really in the initial development cycle (Nate, 
> Steve, do you agree with this?).  These other data servers may or may not be 
> local to the data. We're open to suggestions--as Lauren said, we're just 
> expecting to use OPeNDAP and/or direct reads with Java-NetCDF. If there are 
> libraries or classes that we should be lifting out of IDV or otherwise 
> leveraging, we would be very interested to hear about this. We have not 
> really investigated the Unidata display and analysis offerings at all. 
> 
> I think the integer w/float offset plan sounds good as long as we can return 
> the data in the original floating point form. Doesn't seem like carrying that 
> transformation out is a big deal. Could it be embedded in the 
> creation/streaming of the original NetCDF file that gets returned? Would be 
> nice to avoid writing one NetCDF file, reading it, and then writing 
> out/streaming the real result. 
> 
> Separating current and archived data might be a help, although our data set 
> is updated only every couple of months. The "current" thing is not all that 
> dynamic for us. Using this idea to break the history into conveniently sized 
> blocks optimized for access should probably be our focus. Chunking might be 
> good since some of our data sets go back a lot of years. I take it that NcML 
> would be used to stitch the chunks together as a single, temporally continuous 
> virtual file. 
> 
> Part of our question about the arrangements of files is that we've normally 
> had the full history of each station in a separate file. We weren't sure yet how 
> to use NcML to stitch these together. Rich says you've figured out how to do 
> this spatial kind of stitching. We didn't know if this was the most efficient 
> or whether to simply regenerate the NetCDF files according to other 
> dimensions/variables. Not sure if we're closer to answering your question on 
> this. Please let us know. 
> 
> Roland 
> 
> 
> From: Lauren E Hay/WRD/USGS/DOI
> To:   John Caron <caron@xxxxxxxxxxxxxxxx>
> Cc:   Steven Markstrom <markstro@xxxxxxxx>, netcdf-java 
> <netcdf-java@xxxxxxxxxxxxxxxx>, Nate Booth <nlbooth@xxxxxxxx>, Rich Signell 
> <rsignell@xxxxxxxx>, Roland Viger <rviger@xxxxxxxx>
> Date: 01/25/2010 02:35 PM
> Subject:      Re: [netcdf-java] point data
> 
> 
> 
> 
> 
> John 
> Below are the answers to your questions -- let me know if it's not enough 
> info. 
> Lauren 
> ======================================
> Lauren E. Hay, Ph.D.            Tel:    (303) 236-7279
> U.S. Geological Survey          Fax:  (303) 236-5034
> Box 25046, MS 412, DFC      Email: lhay@xxxxxxxx
> Lakewood, CO 80225
> ====================================== 
> 
> From: John Caron <caron@xxxxxxxxxxxxxxxx>
> To:   Rich Signell <rsignell@xxxxxxxx>
> Cc:   netcdf-java <netcdf-java@xxxxxxxxxxxxxxxx>, Roland Viger 
> <rviger@xxxxxxxx>, Steven Markstrom <markstro@xxxxxxxx>, Lauren E Hay 
> <lhay@xxxxxxxx>, Nate Booth <nlbooth@xxxxxxxx>
> Date: 01/25/2010 10:21 AM
> Subject:      Re: [netcdf-java] point data
> 
> 
> 
> 
> 
> 
> Hi Rich and all:
> 
> This is an interesting challenge on such a large dataset to get good read 
> response. 
> 
> First, you have to decide what kinds of queries you want to support and what 
> kind of response time is needed.  I have generally used the assumption that 
> the common queries that you want to optimize are:
> 1) get data over a time range for all stations in a lat/lon box.
> 2) get data for a single station over a time range, or for all time. 
> 3) get data for a specified list of stations.
> 
> 
> Usually I would break the data into multiple files based on time range, 
> aiming for a file size of 50-500 MB. I also use a different format for 
> current vs archived data, so that the current dataset can be added to 
> dynamically, while the archived data is rewritten (once) for speed of 
> retrieval. 
> 
> Again, all depends on what queries you want to optimize so I'll wait for your 
> thoughts on that. 
> We ran into this problem in the past so we made a separate file for each 
> station and each variable. Is there a problem with having too many files? Can 
> we have a file by year that only contains stations with data for that year? 
> Or -- if we don't care how many files -- 1 file for each station for each 
> variable for each year. It does not matter to me. The current project will 
> have data that has a set time period. We hope to use this structure for other 
> projects that will have file updates as new data is collected. 
> 
> Another question is what clients need to access this data. Are you writing 
> your own web service, do you just want remote access from IDV, or ?? 
> We anticipate that our web services will use the OPeNDAP API. I'm not the 
> person to answer this one. 
> 
> 
> I would think that if we're careful, we can get netcdf-4 sizes that are 
> similar to compressed text, but we'll have to experiment. The data appears to 
> be integer or float with a fixed dynamic range, which is amenable to storing 
> as an integer with scale/offset. Integer data compresses much better than 
> floating point due to the noise in the low bits of the mantissa. So one task 
> you should get started on is to examine each field and decide its data type. 
> If floating point, decide on its range and the number of significant bits.
> 
> 
> 
> _______________________________________________
> netcdf-java mailing list
> netcdf-java@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe, visit: 
> http://www.unidata.ucar.edu/mailing_lists/


