Re: [thredds] joinExisting and FMRC aggregation performance

  • To: tom cook <tmcook@xxxxxxxx>
  • Subject: Re: [thredds] joinExisting and FMRC aggregation performance
  • From: "Signell, Richard" <rsignell@xxxxxxxx>
  • Date: Mon, 16 Mar 2015 13:35:15 -0400
Tom,
I think you are absolutely right: you can't argue with success. Although
one would think that the right aggregation technique for your HF radar
data would be the joinExisting aggregation, the featureCollection FMRC
seems to be handling your situation much more nicely.

If you stick with the FMRC (and it definitely seems like the right
choice here), I don't think you need Roy's Netcheck method, since that
simply reduces the number of times the "first user" hits an aggregation
whose full metadata needs to be regenerated. The FMRC doesn't have this
"first user" problem -- it just updates the metadata every time a new
dataset comes in, and the response time is consistently fast.
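For reference, a minimal sketch of what such a featureCollection entry might look like in a TDS catalog. This is hypothetical -- the name, path, file-matching spec, and rescan schedule below are placeholders, not your actual configuration:

```xml
<!-- Hypothetical catalog entry; name, path, spec, and schedule are placeholders -->
<featureCollection name="HFRADAR Aggregation" featureType="FMRC"
                   harvest="true" path="hfradar/agg">
  <!-- collect all matching NetCDF files under the data directory -->
  <collection spec="/data/hfradar/.*\.nc$" />
  <!-- build the dataset at startup and rescan every 15 minutes -->
  <update startup="true" rescan="0 0/15 * * * ? *" trigger="allow" />
  <protoDataset choice="Penultimate" />
</featureCollection>
```

The `<update>` element is what gives the FMRC its background-refresh behavior, so no user request has to pay the rebuild cost.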

-Rich


On Mon, Mar 16, 2015 at 1:24 PM, tom cook <tmcook@xxxxxxxx> wrote:
> Thanks to everyone who contributed, and to Rich for advocating for our
> cause. It seems like in the long run, the grid aggregations (when
> complete) will be the preferred way for us to serve the data. When
> I've told people I can serve the data through the FMRC collections
> much faster than through NcML, the consensus is that I shouldn't be
> able to do it that way. My question is: if it works, why can't I? I'm
> still learning how our users access the data, so I'm not sure whether
> the FMRC collections are accessed differently than the NcML ones.
>
> Another question: I'm going to try Roy's Netcheck method (I'll just
> use wget). Is there a trigger from the TDS that lets me know the
> aggregation is done, so I can schedule this appropriately?
> Thanks again!
> Tom
>
> On Sat, Mar 14, 2015 at 11:06 AM, Roy Mendelssohn - NOAA Federal
> <roy.mendelssohn@xxxxxxxx> wrote:
>> We have some aggregations with more files than what you mention.  The 
>> problem probably arises when a new file or files are added, so that the 
>> aggregation has to be updated.  That doesn't happen until there is a data 
>> request, so the first data request takes a long time (some of ours take 
>> minutes to reaggregate).  So the trick is to have the HF radar folks make the 
>> first data request after the update.
>>
>> We sort of do this by running Netcheck (freely available at 
>> http://coastwatch.pfeg.noaa.gov/coastwatch/NetCheck.html).  Files that 
>> update frequently get checked frequently.  Sometimes a request comes in 
>> before Netcheck does its check, but that is rare on our servers.  Or you 
>> can have whatever script produces the updated file then make a request.  As I 
>> said, if you make the request on the local machine using a "localhost" 
>> address, the request will not time out and the (re)aggregation will be 
>> completed.
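As a concrete sketch of that localhost-trigger approach (a config fragment, not a tested setup -- the dataset path, port, and schedule are placeholders, and it assumes wget is installed on the TDS host), a crontab entry could make the "first" request shortly after each hourly update:

```shell
# Hypothetical crontab entry on the TDS host: at 10 minutes past each hour,
# after the new hourly file lands, request a small metadata response (.das)
# via localhost so the (re)aggregation cost is paid by this script rather
# than by the first real user. Replace the path with your aggregation's URL.
10 * * * * wget -q -O /dev/null "http://localhost:8080/thredds/dodsC/path/to/aggregation.das"
```

A `.das` request is enough to force the aggregation to rebuild; the response itself is discarded.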
>>
>> For example, 
>> http://oceanview.pfeg.noaa.gov/thredds/dodsC/Model/FNMOC/6hr_pressure.html 
>> has over 70,000 files. Of course, now that I have linked to it, the response 
>> will take forever as soon as you go to it -- but I just tested it and it was 
>> pretty quick.
>>
>> The FMRC, as it is being implemented, speeds this all up by creating index 
>> files and then running a separate process (the TDM) to update the aggregations 
>> without waiting for a data request.
>>
>> -Roy
>>
>>
>>
>> On Mar 14, 2015, at 10:39 AM, John Caron <caron@xxxxxxxx> wrote:
>>
>>> I find it amazing that things work on that large an NcML or even FMRC 
>>> collection. Just goes to show what I know. Anyway, I'm about to embark on 
>>> studying where the bottlenecks are.
>>>
>>> The code isn't so much poorly written as it simply wasn't designed with high 
>>> scalability in mind. The solution is to write persistent "index" files so 
>>> that, once indexed, the logical "collection datasets" can be accessed very 
>>> quickly. I'm going to take what I have been doing for GRIB and apply it to 
>>> netCDF, and to GRID data in general.
>>>
>>> An NcML aggregation like a joinExisting may be specified inside the catalog 
>>> config, or outside in a separate NcML file that is referenced from a dataset 
>>> or datasetScan. In both cases, nothing is done until the dataset is requested 
>>> by a user. At that point, if the dataset has already been constructed, is in 
>>> the TDS cache, and doesn't need updating, then it's fast.
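As an illustration of the standalone-file form, a joinExisting aggregation written as a separate NcML file might look like the following sketch (the data directory, suffix, and recheck interval are placeholders):

```xml
<!-- Hypothetical standalone NcML file; location and suffix are placeholders -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <!-- join the files along their existing "time" dimension -->
  <aggregation dimName="time" type="joinExisting" recheckEvery="15 min">
    <!-- scan the directory for hourly NetCDF files -->
    <scan location="/data/hfradar/" suffix=".nc" subdirs="false" />
  </aggregation>
</netcdf>
```

Because nothing is built until a request arrives, the first request after the `recheckEvery` interval expires is the one that pays the rescan cost.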
>>>
>>> A featureCollection has new functionality to update the dataset in 
>>> the background. The FMRC does some extra "persistent caching" (it makes some 
>>> of the info persist between TDS restarts).  Still not enough, but better than 
>>> NcML. GRIB collections now do this well. However, if the collection is 
>>> changing, a separate process (the TDM) will handle updating and notifying the 
>>> TDS. That keeps the code from getting too complex and greatly simplifies 
>>> getting the object caching right.
>>>
>>> Read-optimized netcdf-4 files are an elegant solution indeed. Dave, maybe 
>>> sometime you could share your workflow in some place we could link to in 
>>> our documentation?
>>>
>>>
>>>
>>> On Sat, Mar 14, 2015 at 10:47 AM, Signell, Richard <rsignell@xxxxxxxx> 
>>> wrote:
>>> John,
>>>
>>> > NcML Aggregations should only be used for small collections of files (a 
>>> > few dozen?), because they are created on the fly.
>>>
>>> The HFRADAR data is using a joinExisting aggregation in a THREDDS
>>> catalog.  Is that what you are calling an NcML aggregation?
>>> I was thinking that "NcML aggregation" referred to the practice of
>>> writing an NcML file and dropping it into a folder along with the
>>> data files, where it can be picked up by a datasetScan.
>>>
>>> > FMRC does a better job of
>>> > caching information so things go quicker. It handles the case of a single
>>> > time dimension as a special case of a Forecast model collection. However,
>>> > they too are limited in how much they will scale up, (< 100 ?)
>>> >
>>> > So how many files and variables are in the HF Radar collection?
>>>
>>> There are currently 27,986 NetCDF files in the aggregation, each with
>>> a single time record containing the HF radar data for that hour.  It
>>> seems that the FMRC is handling this just fine, with reliable WMS
>>> response times of about one second.
>>>
>>> As Dave Blodgett points out, a better approach here might be to
>>> periodically combine a bunch of these hourly files into, say, monthly
>>> files, which would result in higher performance, less utilization of
>>> disk space, and quicker aggregation.
>>>
>>> I still don't understand what is happening with the joinExisting
>>> aggregation, however -- why it periodically (but not regularly) takes
>>> 50 seconds or more to respond.
>>>
>>> --
>>> Dr. Richard P. Signell   (508) 457-2229
>>> USGS, 384 Woods Hole Rd.
>>> Woods Hole, MA 02543-1598
>>>
>>> _______________________________________________
>>> thredds mailing list
>>> thredds@xxxxxxxxxxxxxxxx
>>> For list information or to unsubscribe,  visit: 
>>> http://www.unidata.ucar.edu/mailing_lists/
>>
>> **********************
>> "The contents of this message do not reflect any position of the U.S. 
>> Government or NOAA."
>> **********************
>> Roy Mendelssohn
>> Supervisory Operations Research Analyst
>> NOAA/NMFS
>> Environmental Research Division
>> Southwest Fisheries Science Center
>> ***Note new address and phone***
>> 110 Shaffer Road
>> Santa Cruz, CA 95060
>> Phone: (831)-420-3666
>> Fax: (831) 420-3980
>> e-mail: Roy.Mendelssohn@xxxxxxxx www: http://www.pfeg.noaa.gov/
>>
>> "Old age and treachery will overcome youth and skill."
>> "From those who have been given much, much will be expected"
>> "the arc of the moral universe is long, but it bends toward justice" -MLK Jr.
>>



-- 
Dr. Richard P. Signell   (508) 457-2229
USGS, 384 Woods Hole Rd.
Woods Hole, MA 02543-1598


