Re: [netcdf-java] Bug (concurrency issue?) when reading NCML aggregations

To: John Caron <jcaron1129@xxxxxxxxx>
Subject: Re: [netcdf-java] Bug (concurrency issue?) when reading NCML aggregations
From: Clifford Harms <clifford.harms@xxxxxxxxx>
Date: Thu, 19 Nov 2015 23:52:19 -0600

Neither the file names nor the directories have the runtime in them. Some of
the data have the runtime as a global attribute, and I have to figure the
runtime out via other mechanisms for other datasets (e.g. file creation time
in some cases, using the first available time in others).  This is one of the
reasons why programmatically creating the NCML (or maybe feature collection)
rather than using the scan feature is ideal for me - I can accommodate some
production realities as needed.  I've also prototyped 'fixing' the single
netcdfs using NCML (filling in missing CF variables and attributes), but
this basically doubles the number of files on an already near-capacity file
system wrt file descriptors.
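
To make the "programmatically creating the NCML" idea concrete, here is a
minimal sketch in plain Java (no netcdf-java calls) that writes a joinNew
aggregation with one coordValue per member file. Everything in it is an
assumption for illustration: the class and method names, the file names, the
variable name "water_u", the use of hours as the runtime unit, and the order
in which the runtime sources are tried. It is not the code attached to the
bug report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.OptionalLong;

/** Sketch: build a joinNew NcML document as a string. All names are hypothetical. */
final class NcmlJoinNewWriter {

    /** One member file plus the run time (assumed to be in hours) assigned to it. */
    static final class Member {
        final String location;
        final long runtimeHours;

        Member(String location, long runtimeHours) {
            this.location = location;
            this.runtimeHours = runtimeHours;
        }
    }

    /** Assumed fallback order: global attribute, then file creation time, then first time value. */
    static long resolveRuntime(OptionalLong fromGlobalAttribute,
                               OptionalLong fromFileCreationTime,
                               long firstTimeValue) {
        if (fromGlobalAttribute.isPresent()) return fromGlobalAttribute.getAsLong();
        if (fromFileCreationTime.isPresent()) return fromFileCreationTime.getAsLong();
        return firstTimeValue;
    }

    /** Escape characters that are not allowed inside a double-quoted XML attribute value. */
    static String escapeAttr(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Write the NcML text; members are sorted by run time so the output is deterministic. */
    static String buildNcml(String dimName, List<String> aggVariables, List<Member> members) {
        List<Member> sorted = new ArrayList<>(members);
        sorted.sort(Comparator.comparingLong((Member m) -> m.runtimeHours));

        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<netcdf xmlns=\"http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2\">\n");
        sb.append("  <aggregation type=\"joinNew\" dimName=\"")
          .append(escapeAttr(dimName)).append("\">\n");
        // joinNew needs to be told which variables receive the new dimension.
        for (String name : aggVariables) {
            sb.append("    <variableAgg name=\"").append(escapeAttr(name)).append("\"/>\n");
        }
        for (Member m : sorted) {
            sb.append("    <netcdf coordValue=\"").append(m.runtimeHours)
              .append("\" location=\"").append(escapeAttr(m.location)).append("\"/>\n");
        }
        sb.append("  </aggregation>\n");
        sb.append("</netcdf>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical inputs: one file with a runtime attribute, one dated by creation time.
        long oldRun = resolveRuntime(OptionalLong.of(0), OptionalLong.empty(), 0);
        long newRun = resolveRuntime(OptionalLong.empty(), OptionalLong.of(24), 0);
        String ncml = buildNcml("runtime", Arrays.asList("water_u"),
                Arrays.asList(new Member("new-run.nc", newRun), new Member("old-run.nc", oldRun)));
        System.out.print(ncml);
    }
}
```

Keeping the runtime lookup separate from the NcML writing means that each
dataset can use whichever source it actually has, while the rewrite that runs
on every file change stays a cheap string operation.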

Changing filenames or directory structure is, unfortunately, out of my
control and was stated as off limits at the beginning of this particular
requirement, as it is very expensive and time consuming to alter heavily
validated model code.

I have read that the 'FMRC' aggregation is deprecated, while the FMRC
feature collection type is the preferred method, so that is going to be my
next attempt.

If my expectation for the way the aggregation is supposed to work is
invalid in the original bug report, I will happily close it.

On Thu, Nov 19, 2015 at 5:08 PM, John Caron <jcaron1129@xxxxxxxxx> wrote:

> Hi Clifford:
>
> Yes, the FMRC may do the job, though it will need testing.
>
> More questions:
>
> 1) does the filename or directory have the run time in it?
>
> 2) are you at liberty to change the filename or directory structure? The
> best scenario would be to put each run in its own directory, and for the
> filename or directory to have the runtime in it.
>
> John
>
> PS I'm cc'ing the community so others can contribute experiences.
>
>
>
> On Wed, Nov 18, 2015 at 2:26 PM, Clifford Harms <clifford.harms@xxxxxxxxx>
> wrote:
>
>> So I have just now run across the new(ish) 'feature collection'
>> capability in the netcdf libraries.  I am starting to get the indication
>> that this is the tree I should be barking up (with type=FMRC) for my use
>> case (it is stated as handling file differences a bit better than the 'old
>> style' aggregations).
>>
>>
>>
>> On Wed, Nov 18, 2015 at 1:01 PM, Clifford Harms <clifford.harms@xxxxxxxxx
>> > wrote:
>>
>>> A single directory has multiple 'regions', and a directory can consist
>>> of 300-1000 files.  Each region gets updated with a new production run
>>> daily, and is overwritten by the newer data - eventually.  Until the update
>>> is complete (again, this can take hours), there is a mixture of files that
>>> belong to the same region but have different production times.  These files
>>> must still aggregate as a single logical dataset with runtime coordinates,
>>> and must still be available for customers.
>>>
>>> In addition, new regions, or different resolutions of an existing
>>> region, may be added without notice, and also removed without notice. File
>>> names are also subject to change without notification. Lastly, it is
>>> important that the system respond to changes in the data on the file system
>>> as immediately as possible.
>>>
>>> These requirements, along with the amount of data involved, have forced
>>> me to exclude some of the more routine solutions to the problem (i.e.
>>> scanning all of the directories at regular intervals using file name
>>> patterns to determine aggregations).
>>>
>>> The solution I've developed so far basically observes file system events
>>> and maintains an index of the files using aggregation criteria, so that
>>> when a change occurs the likely aggregation candidates can be processed
>>> without having to evaluate every file in a directory, or rely on file name
>>> conventions.  This essentially means that each time a file is changed, the
>>> NCML of the aggregation that the file belongs to is rewritten to accurately
>>> and immediately reflect the disposition of the data on the file system.
>>>
>>> This solution has, in testing, worked fantastically (the ucar netcdf
>>> library performs very well in general), with the exception of the problem
>>> I've outlined in the (potential) bug report.  The number of 'time'
>>> coordinates in the datasets that I am attempting to 'joinNew' on will
>>> rarely be equal, so if this is flat out unsupported by the 'joinNew'
>>> aggregation, then I will have to find another way.
>>>
>>>
>>> On Tue, Nov 17, 2015 at 4:48 PM, John Caron <jcaron1129@xxxxxxxxx>
>>> wrote:
>>>
>>>> so you are running a model which outputs 50-70 files that belong to a
>>>> single "run".
>>>>
>>>> do you put each run in a separate directory?
>>>>
>>>> are you overwriting the files?
>>>>
>>>> On Tue, Nov 17, 2015 at 1:34 PM, Clifford Harms <
>>>> clifford.harms@xxxxxxxxx> wrote:
>>>>
>>>>> The data I attached is for a test case in a scenario I am trying to
>>>>> handle. I have several thousand netcdfs (some CF, some not), most of which
>>>>> are the same logical dataset broken up via a time or Z axis into datasets
>>>>> consisting of 30-50 files, which I must aggregate into a single 'logical'
>>>>> dataset (I believe this is a fairly common use case). These files are
>>>>> updated daily, but due to the amount of data involved as well as other
>>>>> environmental factors, these updates happen sporadically over a span of
>>>>> about 24 hours.
>>>>>
>>>>> So what I am trying to do here is, as the files of an aggregated
>>>>> dataset are slowly updated with newer versions of the same file, add those
>>>>> new versions to the aggregated datasets that they belong to while ensuring
>>>>> that the new data can be differentiated within the aggregation via its
>>>>> data creation time (be it a model run time or production time or
>>>>> whatever). This is where the joining of files with the joinNew dimension
>>>>> comes in (in this example, 'runtime'), as the data creation time does not
>>>>> exist in the datasets as a coordinate variable, and in some cases is not
>>>>> even indicated in the global attributes.
>>>>>
>>>>> Ultimately, once all of the files for an aggregated dataset have been
>>>>> updated, the aggregation contains files that all have the same data
>>>>> creation or run time, until the next update starts.
>>>>>
>>>>> You seem to be indicating that I cannot perform a 'joinNew'
>>>>> aggregation between datasets that have coordinate variables with different
>>>>> sizes? If that is the case, and I missed it in the documentation
>>>>> somewhere, then what about aggregating the files with a joinNew first, and
>>>>> then aggregating those aggregations as 'joinExisting' along time/Z axis?
>>>>>
>>>>> There still is the issue, though, of the random behavior (an exception
>>>>> for some reads, for other reads an array of values) which indicates a
>>>>> concurrency problem. If the read worked consistently, instead of only half
>>>>> of the time, that would still be useful to me as my code could easily
>>>>> determine which values in the returned array were valid.
>>>>> At any rate, thanks for responding so quickly
>>>>>
>>>>> On Sat, Nov 14, 2015 at 5:35 PM, John Caron <jcaron1129@xxxxxxxxx>
>>>>> wrote:
>>>>>
>>>>>> Hi Clifford:
>>>>>>
>>>>>>   <aggregation type="joinNew" dimName="runtime">
>>>>>>     <netcdf coordValue="0" location="ncom-relo-mayport_u_miw-t000.nc"/>
>>>>>>     <netcdf coordValue="24">
>>>>>>       <aggregation type="joinExisting" dimName="time">
>>>>>>         <netcdf location="ncom-relo-mayport_26_u_miw-t001.nc"/>
>>>>>>         <netcdf location="ncom-relo-mayport_26_u_miw-t000.nc"/>
>>>>>>       </aggregation>
>>>>>>     </netcdf>
>>>>>>   </aggregation>
>>>>>>
>>>>>> ncom-relo-mayport_u_miw-t000.nc only has 1 time coordinate, but the
>>>>>> inner aggregation has 2, so these are not homogeneous in the sense that
>>>>>> Ncml aggregation requires.
>>>>>>
>>>>>> could you explain more what you are trying to do?
>>>>>>
>>>>>> John
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 13, 2015 at 11:24 PM, Clifford Harms <
>>>>>> clifford.harms@xxxxxxxxx> wrote:
>>>>>>
>>>>>>> I've posted the report, sample data, sample xml, and sample code on
>>>>>>> github -> https://github.com/Unidata/thredds/issues/276
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Clifford M. Harms
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Clifford M. Harms
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Clifford M. Harms
>>>
>>
>>
>>
>> --
>> Clifford M. Harms
>>
>
>


-- 
Clifford M. Harms
