Rob Weingruber wrote:
Hi John Again ;-)
John Caron wrote:
Rob Weingruber wrote:
Hi Again ;-)
Looks like the NCML is the way to go. Thanks for the suggestion ;-)
I created a super simple NCML that aggregates several files into a single
virtual dataset, joined by an existing 'Time' variable. Piece of cake. And
the GeoGrid API was nice enough to then give me all of the 'Valid Times'
for that virtual dataset. This 'Time' variable is the semantic equivalent
of the 'valid time' for the file. However, there is also the issue of the
'Generated Time' for a file (i.e., generated at 12:00 Z, but valid for
15:00 Z; this would be used in requests such as 'give me the 15:00 Z
forecast gen'ed at 12:00 Z'). I see that there might be 2 ways to glue on
the generated time information: a) as an attribute in each of the files
that make up the dataset, or b) join on a new gen-time variable. Which
would be best and most performant, in your opinion? Would the latter even
be possible, considering that we would still need to join on the existing
'valid time' variable? Or would we just join on the 'valid time', and then
attach a new gen-time variable (and value) to each of the files (within
the NCML for that virtual dataset)?
I am currently working on a new kind of NcML aggregation called
"forecastModelRunCollection", which deals with a 2D time, "valid" and
"generated". I hope to have an alpha version in the next week or two.
There is some partially completed code in the 2.2.17 snapshot.
I will probably make some UML diagrams, and I'll send them along to
you for your feedback when I do.
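To make the 2D time idea concrete, here is a minimal sketch of a lookup keyed on both a "generated" and a "valid" time, answering a request like 'give me the 15:00 Z forecast gen'ed at 12:00 Z'. It only illustrates the data model; it is not how "forecastModelRunCollection" is implemented, and the helper names and file paths are hypothetical.

```python
from datetime import datetime, timezone


def utc(year, month, day, hour):
    """Build a timezone-aware UTC datetime (hypothetical helper)."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


# Hypothetical index: (generated time, valid time) -> file location.
# The paths are made up for illustration only.
runs = {
    (utc(2006, 7, 6, 12), utc(2006, 7, 6, 15)): "run12/f03.nc",
    (utc(2006, 7, 6, 12), utc(2006, 7, 6, 18)): "run12/f06.nc",
    (utc(2006, 7, 6, 6), utc(2006, 7, 6, 15)): "run06/f09.nc",
}


def find_forecast(index, generated, valid):
    """Return the file for one (generated, valid) pair, or None."""
    return index.get((generated, valid))


def valid_times_for_run(index, generated):
    """Return the sorted valid times available for one model run."""
    return sorted(v for (g, v) in index if g == generated)
```

Note that the same valid time (15:00 Z) appears under two different runs, which is why a single 'Time' join dimension is not enough to tell the files apart.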
Gladly will take a look at these/this whenever you're ready for me to.
This sounds like exactly what we might need...
Make sense?
Also, I recall that we agreed the performance would be fine for, say,
10,000 files within a virtual dataset defined by NCML. Did I misinterpret,
or is that reasonable?
I think there will be some optimizations needed to scale up to that
size. It will probably work, given enough memory (I forget if JVMs
are still restricted to 2 GB heaps). I'd like to measure its memory
use, so perhaps you could help me test and debug datasets of this size?
Glad to help here too. I think I have an old JBuilder around that has an
OptimizeIt license too....
One thing I thought of recently is: does NCML allow datetime coordValues
to be placed *into* the NCML (thereby avoiding a file.open when those
coordValues are queried via the API)?
I tried the following, to no avail**:
<aggregation dimName="Time" type="joinExisting">
  <netcdf location="file:/d2/www/data/ncmlTest/DPG/2006070611/wrfout_d01_2006-07-06_080000.DPG_F.nc"
          coordValue="2006-07-06 08:00:00Z"/>...
The reasoning behind this is that I would like to place the valid (and
gen) times into the NCML, where each coordValue would theoretically match
the value in the file for that specific netcdf file. If the API could then
*use the values directly from the NCML*, then there might be no need to
open the file(s) when geoGrid.getTimes() is called. The point being that
if we can avoid opening files for valid and gen time information, then
we'd improve the performance for datasets with lots and lots of files.
What do you think?
** "To no avail" means that it worked, but I tried moving the netcdf files
out of the way to see if they would be opened for a geoGrid.getTimes()
call, and exceptions were thrown. It all worked when I left the files
where they were supposed to be, but that wasn't the point ;-)
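The idea of taking coordinate values straight from the NCML, without touching the data files, can be sketched as below. This is only an illustration using Python's standard XML parser, not the netCDF-Java implementation; the function name, the example document, and the file locations in it are assumptions made up for the sketch.

```python
import xml.etree.ElementTree as ET

# NcML 2.2 namespace; elements in a namespaced NcML document live here.
NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

# Hypothetical NcML document; the locations are placeholders and the
# files are never opened by this sketch.
EXAMPLE_NCML = """<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="Time" type="joinExisting">
    <netcdf location="file:/data/a.nc" coordValue="2006-07-06T08:00:00Z"/>
    <netcdf location="file:/data/b.nc" coordValue="2006-07-06T09:00:00Z"/>
  </aggregation>
</netcdf>"""


def coord_values_from_ncml(ncml_text):
    """Return (location, coordValue) pairs read only from the NcML text."""
    root = ET.fromstring(ncml_text)
    ns = {"nc": NCML_NS}
    pairs = []
    for aggregation in root.findall("nc:aggregation", ns):
        for member in aggregation.findall("nc:netcdf", ns):
            value = member.get("coordValue")
            if value is not None:  # skip members with no declared value
                pairs.append((member.get("location"), value))
    return pairs
```

Calling `coord_values_from_ncml(EXAMPLE_NCML)` reads the two declared times from the XML alone, which is the behavior being asked for above.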
I've just been working on some of that in 2.2.17; see the new section
"Defining coordinates on a JoinExisting aggregation" in
http://www.unidata.ucar.edu/software/netcdf/ncml/v2.2/Aggregation.html.
It looks like what you were doing above is mostly correct (assuming your
files have an existing coordinate variable called "Time" with length 1),
but the current version is not handling it. I would recommend that you
use the form "2006-07-06T08:00:00Z" so that we can use space delimiters
when there's more than one coord value.
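A small sketch of why the "T" form matters when several values share one space-delimited attribute. It is not the library's parser, only an illustration of the ambiguity; the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_coord_values(attr):
    """Split a space-delimited list of ISO-style times and parse each one."""
    values = []
    for token in attr.split():
        parsed = datetime.strptime(token, "%Y-%m-%dT%H:%M:%SZ")
        values.append(parsed.replace(tzinfo=timezone.utc))
    return values


def to_iso_form(spaced):
    """Turn 'YYYY-MM-DD hh:mm:ssZ' into 'YYYY-MM-DDThh:mm:ssZ'."""
    date_part, time_part = spaced.split()
    return f"{date_part}T{time_part}"
```

With the "T" form, `"2006-07-06T08:00:00Z 2006-07-06T09:00:00Z".split()` yields two tokens, one per time. With the spaced form, the same two times split into four tokens, and the date and time halves can no longer be told apart from separate values.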
Also, the coord values can be cached (you have to enable this, see the
last section "Aggregation Caching") if you want to let the library read
the values the first time.
This code is so new I'm not sure I have even done a release with it. I'm
working at home today, I'll check when I'm in tomorrow...
This refers to the joinExisting aggregations. You probably really want
to use the new "Forecast Model Run" aggregation that I'm working on now.
It will be similar, but take into account the 2D time coordinate.