Re: [netcdf-java] Bug (concurrency issue?) when reading NCML aggregations

Hi Clifford:

Yes, the FMRC may do the job, though it will need testing.

More questions:

1) does the filename or directory have the run time in it?

2) are you at liberty to change the filename or directory structure? The
best scenario would be to put each run in its own directory, and for the
filename or directory to have the runtime in it.
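
For example (a sketch only; the directory layout, collection name, and date pattern below are assumptions, not taken from your setup), a TDS feature collection with the run time encoded in the file name might look like:

```xml
<featureCollection name="NCOM Mayport" featureType="FMRC" path="fmrc/mayport">
  <!-- the #...# section is a dateFormatMark: it extracts the run time
       from the file name; adjust to the actual naming scheme -->
  <collection spec="/data/mayport/**/ncom-relo-mayport_run_#yyyyMMddHH#\.nc$"
              olderThan="5 min"/>
  <!-- rescan periodically so new runs are picked up without a restart -->
  <update startup="true" rescan="0 0/15 * * * ? *"/>
</featureCollection>
```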

John

PS: I'm cc'ing the community list so others can contribute their experiences.



On Wed, Nov 18, 2015 at 2:26 PM, Clifford Harms <clifford.harms@xxxxxxxxx>
wrote:

> So I have just now run across the new(ish) 'feature collection'
> capability in the netcdf libraries.  I am starting to get the indication
> that this is the tree I should be barking up (with type=FMRC) for my use
> case (it is stated as handling file differences a bit better than the 'old
> style' aggregations).
>
>
>
> On Wed, Nov 18, 2015 at 1:01 PM, Clifford Harms <clifford.harms@xxxxxxxxx>
> wrote:
>
>> A single directory has multiple 'regions', and a directory can consist of
>> 300-1000 files.  Each region gets updated with a new production run daily,
>> and is overwritten by the newer data - eventually.  Until the update is
>> complete (again, this can take hours), there are a mixture of files that
>> belong to the same region but have different production times.  These files
>> must still aggregate as a single logical dataset with runtime coordinates,
>> and must still be available for customers.
>>
>> In addition, new regions, or different resolutions of an existing region,
>> may be added or removed without notice, and file names are also subject to
>> change without notification. Lastly, it is important that the system
>> respond to changes in the data on the file system as immediately as
>> possible.
>>
>> These requirements, along with the amount of data involved, have forced
>> me to exclude some of the more routine solutions to the problem (i.e.
>> scanning all of the directories at regular intervals and using file name
>> patterns to determine aggregations).
>>
>> The solution I've developed so far basically observes file system events
>> and maintains an index of the files using aggregation criteria, so that
>> when a change occurs the likely aggregation candidates can be processed
>> without having to evaluate every file in a directory, or rely on file name
>> conventions.  This essentially means that each time a file is changed, the
>> NCML of the aggregation that the file belongs to is rewritten to accurately
>> and immediately reflect the disposition of the data on the file system.
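
The event-driven approach described above can be sketched with the JDK's WatchService; this is a minimal, hypothetical illustration (the class and method names are mine, not from the actual system), where the real implementation would map each changed file through its aggregation index and rewrite that aggregation's NCML:

```java
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;

// Sketch of "observe file system events, process only the affected
// aggregation" rather than rescanning whole directories.
public class AggregationWatcher {

    // Register a watcher for create/modify/delete events in a data directory.
    public static WatchService register(Path dataDir) throws java.io.IOException {
        WatchService watcher = dataDir.getFileSystem().newWatchService();
        dataDir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
        return watcher;
    }

    // Collect the .nc files changed within the timeout; the caller would
    // look up each file's aggregation in the index and rewrite its NCML.
    public static List<Path> drainNcChanges(WatchService watcher, Path dataDir,
                                            long timeoutMs) throws InterruptedException {
        List<Path> changed = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutMs, java.util.concurrent.TimeUnit.MILLISECONDS);
        while (key != null) {
            for (WatchEvent<?> event : key.pollEvents()) {
                Path rel = (Path) event.context();
                if (rel != null && rel.toString().endsWith(".nc")) {
                    changed.add(dataDir.resolve(rel));
                }
            }
            key.reset();              // re-arm the key for further events
            key = watcher.poll();     // drain any other pending keys
        }
        return changed;
    }
}
```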
>>
>> This solution has, in testing, worked fantastically (the ucar netcdf
>> library performs very well in general), with the exception of the problem
>> I've outlined in the (potential) bug report. The number of 'time'
>> coordinates in the datasets that I am attempting to 'joinNew' on will
>> rarely be equal, so if this is flat-out unsupported by the 'joinNew'
>> aggregation, then I will have to find another way.
>>
>>
>> On Tue, Nov 17, 2015 at 4:48 PM, John Caron <jcaron1129@xxxxxxxxx> wrote:
>>
>>> so you are running a model which outputs 50-70 files that belong to a
>>> single "run".
>>>
>>> do you put each run in a separate directory?
>>>
>>> are you overwriting the files?
>>>
>>> On Tue, Nov 17, 2015 at 1:34 PM, Clifford Harms <
>>> clifford.harms@xxxxxxxxx> wrote:
>>>
>>>> The data I attached is for a test case in a scenario I am trying to
>>>> handle. I have several thousand netcdfs (some CF, some not), most of which
>>>> are the same logical dataset broken up via a time or Z axis into datasets
>>>> consisting of 30-50 files, which I must aggregate into a single 'logical'
>>>> dataset (I believe this is a fairly common use case). These files are
>>>> updated daily, but due to the amount of data involved as well as other
>>>> environmental factors, these updates happen sporadically over a span of
>>>> about 24 hours.
>>>>
>>>> So what I am trying to do here is, as the files of an aggregated
>>>> dataset are slowly updated with newer versions of the same file, add those
>>>> new versions to the aggregated datasets that they belong to, while ensuring
>>>> that the new data can be differentiated within the aggregation via its data
>>>> creation time (be it a model run time or production time or whatever). This
>>>> is where the joining of files with the joinNew dimension comes in (in this
>>>> example, 'runtime'), as the data creation time does not exist in the
>>>> datasets as a coordinate variable, and in some cases is not even indicated
>>>> in the global attributes.
>>>>
>>>> Ultimately, once all of the files for an aggregated dataset have been
>>>> updated, the aggregation contains files that all have the same data
>>>> creation or run time, until the next update starts.
>>>>
>>>> You seem to be indicating that I cannot perform a 'joinNew' aggregation
>>>> between datasets that have coordinate variables with different sizes? If
>>>> that is the case, and I missed it in the documentation somewhere, then what
>>>> about aggregating the files with a joinNew first, and then aggregating
>>>> those aggregations as 'joinExisting' along time/Z axis?
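
For reference, a joinNew whose members are homogeneous (each file supplying the same number of 'time' coordinates) would look like the sketch below, built from the sample filenames in this thread; the variable name water_u is an assumption, and joinNew requires each aggregation variable to be listed with variableAgg:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="joinNew" dimName="runtime">
    <variableAgg name="water_u"/>
    <!-- joinNew adds a new outermost 'runtime' dimension; every member
         must have identical shape along the existing dimensions -->
    <netcdf coordValue="0"  location="ncom-relo-mayport_u_miw-t000.nc"/>
    <netcdf coordValue="24" location="ncom-relo-mayport_26_u_miw-t000.nc"/>
  </aggregation>
</netcdf>
```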
>>>>
>>>> There still is the issue, though, of the random behavior (an exception
>>>> for some reads, for other reads an array of values) which indicates a
>>>> concurrency problem. If the read worked consistently, instead of only half
>>>> of the time, that would still be useful to me as my code could easily
>>>> determine which values in the returned array were valid.
>>>> At any rate, thanks for responding so quickly.
>>>>
>>>> On Sat, Nov 14, 2015 at 5:35 PM, John Caron <jcaron1129@xxxxxxxxx>
>>>> wrote:
>>>>
>>>>> Hi Clifford:
>>>>>
>>>>>   <aggregation type="joinNew" dimName="runtime">
>>>>>     <netcdf coordValue="0" location="ncom-relo-mayport_u_miw-t000.nc"/>
>>>>>     <netcdf coordValue="24">
>>>>>       <aggregation type="joinExisting" dimName="time">
>>>>>         <netcdf location="ncom-relo-mayport_26_u_miw-t001.nc"/>
>>>>>         <netcdf location="ncom-relo-mayport_26_u_miw-t000.nc"/>
>>>>>       </aggregation>
>>>>>     </netcdf>
>>>>>   </aggregation>
>>>>>
>>>>> ncom-relo-mayport_u_miw-t000.nc only has 1 time coordinate, but the
>>>>> inner aggregation has 2, so these are not homogeneous in the sense that
>>>>> NcML aggregation requires.
>>>>>
>>>>> could you explain more what you are trying to do?
>>>>>
>>>>> John
>>>>>
>>>>>
>>>>> On Fri, Nov 13, 2015 at 11:24 PM, Clifford Harms <
>>>>> clifford.harms@xxxxxxxxx> wrote:
>>>>>
>>>>>> I've posted the report, sample data, sample xml, and sample code on
>>>>>> github -> https://github.com/Unidata/thredds/issues/276
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Clifford M. Harms
>>>>>>
>>>>>> _______________________________________________
>>>>>> netcdf-java mailing list
>>>>>> netcdf-java@xxxxxxxxxxxxxxxx
>>>>>> For list information or to unsubscribe, visit:
>>>>>> http://www.unidata.ucar.edu/mailing_lists/
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
>
>