Re: [thredds] joinExisting and FMRC aggregation performance

  • To: Roy Mendelssohn - NOAA Federal <roy.mendelssohn@xxxxxxxx>
  • Subject: Re: [thredds] joinExisting and FMRC aggregation performance
  • From: tom cook <tmcook@xxxxxxxx>
  • Date: Mon, 16 Mar 2015 10:24:12 -0700
Thanks to everyone who contributed, and to Rich for advocating for our
cause. It seems like in the long run, the grid aggregations (when
complete) will be the preferred way for us to serve the data. When I've
told people I can serve the data through the FMRC collections much
faster than NcML, the consensus is that I shouldn't be able to do it
that way. My question is: if it works, why can't I? I'm still learning
how our users access the data, so I'm not sure whether the FMRC
collections are accessed differently than NcML.

Another question: I'm going to try Roy's NetCheck method (I'll just
use wget). Is there a trigger from TDS that lets me know the
aggregation is done, so I can schedule this appropriately?
Thanks again!
Tom

On Sat, Mar 14, 2015 at 11:06 AM, Roy Mendelssohn - NOAA Federal
<roy.mendelssohn@xxxxxxxx> wrote:
> We have some aggregations with more files than what you mention.  The problem 
> is probably when they add a new file or files, so that the aggregation has to 
> be updated.  That doesn’t happen until there is a data request, so the first 
> data request takes a long time (some of ours take minutes on reaggregation). 
> So the trick is to have the HF radar folks make the first data request after 
> the update.
>
> We sort of do this by running NetCheck (freely available at 
> http://coastwatch.pfeg.noaa.gov/coastwatch/NetCheck.html).  Files that 
> update frequently get checked frequently.  Sometimes a request comes in 
> before NetCheck does the check, but that is rare on our servers.  Or you can 
> have whatever script produces the updated file then make a request.  As I 
> said, if you make the request on the local machine using a “localhost” 
> address, the request will not time out and the (re)aggregation will be 
> completed.
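
[Inline sketch of the warm-up request described above, as an alternative
to wget. The port (8080), the dataset path, and the ".dds" suffix are
assumptions, not details from this thread. Whether a metadata-only
request is enough to trigger re-aggregation on a given TDS version should
be checked; if not, substitute a small data request.]

```python
import time
import urllib.request


def warmup_url(dataset_path, host="localhost", port=8080, suffix=".dds"):
    """Build an OPeNDAP URL on the local TDS (host, port, suffix are assumed defaults)."""
    return f"http://{host}:{port}/thredds/dodsC/{dataset_path.lstrip('/')}{suffix}"


def warm_up(url, opener=urllib.request.urlopen):
    """Fetch url with no client-side timeout; return (HTTP status, seconds elapsed)."""
    start = time.monotonic()
    # urlopen is called without a timeout argument, so the client waits
    # as long as the server needs to finish the (re)aggregation.
    with opener(url) as response:
        response.read()
        status = response.status
    return status, time.monotonic() - start


# Hypothetical dataset path; replace it with the aggregation's own URL path.
print(warmup_url("hfradar/aggregation.nc"))
```

[The script that delivers a new file could call warm_up(warmup_url(...))
as its last step, so the slow first request is made locally rather than
by a user.]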
>
> For example, 
> http://oceanview.pfeg.noaa.gov/thredds/dodsC/Model/FNMOC/6hr_pressure.html 
> has over 70,000 files. Of course, now that I have linked it, as soon as you go 
> to it the response will take forever! But I just tested it and it was 
> pretty quick.
>
> The FMRC as it is being implemented speeds this all up by creating index files 
> and then running a separate process (TDM) to update the aggregations without 
> waiting for a data request.
>
> -Roy
>
>
>
> On Mar 14, 2015, at 10:39 AM, John Caron <caron@xxxxxxxx> wrote:
>
>> I find it amazing that things work on that large of an NcML or even FMRC 
>> collection. Just goes to show what I know. Anyway, I'm about to embark on 
>> studying where the bottlenecks are.
>>
>> The code isn't so much poorly written, as it simply wasn't designed with high 
>> scalability in mind. The solution is to write persistent "index" files so 
>> that, once indexed, the logical "collection datasets" can be very quickly 
>> accessed. I'm going to take what I have been doing in GRIB and apply it to 
>> netCDF, and GRID data in general.
>>
>> An NcML aggregation like a joinExisting may be specified inside the catalog 
>> config or outside in a separate NcML file and referenced in a dataset or 
>> datasetScan. In both cases, nothing is done until it is requested by a user. 
>> At that point, if the dataset has already been constructed and is in the TDS 
>> cache, and doesn't need updating, then it's fast.
>>
>> A featureCollection adds new functionality to update the dataset in 
>> the background. FMRC does some extra "persistent caching" (making some of the 
>> info persist between TDS restarts).  Still not enough, but better than NcML. 
>> GRIB collections now do this well. However, if the collection is changing, a 
>> separate process (TDM) will handle updating and notifying the TDS. That 
>> keeps the code from getting too complex and greatly simplifies getting the 
>> object caching right.
>>
>> Read-optimized netCDF-4 files are an elegant solution indeed. Dave, maybe 
>> sometime you could share your workflow somewhere we could link to in our 
>> documentation?
>>
>>
>>
>> On Sat, Mar 14, 2015 at 10:47 AM, Signell, Richard <rsignell@xxxxxxxx> wrote:
>> John,
>>
>> > NcML Aggregations should only be used for small collections of files (a few
>> > dozen?), because they are created on-the-fly.
>>
>> The HFRADAR data is using a joinExisting aggregation in a THREDDS
>> catalog.   Is that what you are calling NcML aggregation?
>> I was thinking that NcML aggregation referred to the practice of
>> writing an NcML file and dropping that into a folder along with the
>> data files where it can be picked up by a DatasetScan.
>>
>> > FMRC does a better job of
>> > caching information so things go quicker. It handles the case of a single
>> > time dimension as a special case of a Forecast model collection. However,
>> > it too is limited in how far it will scale up (< 100?).
>> >
>> > So how many files and variables are in the HF Radar collection?
>>
>> There are currently 27,986 NetCDF files in the aggregation, each with
>> a single time record containing the HF radar data for the hour.    It
>> seems that the FMRC is handling this just fine, with reliable WMS
>> response times of about one second.
>>
>> As Dave Blodgett points out, a better approach here might be to
>> periodically combine a bunch of these hourly files into, say, monthly
>> files, which would result in higher performance, less utilization of
>> disk space, and quicker aggregation.
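
[Inline sketch of the first step of such a consolidation: bucketing the
hourly files by month. The file-name pattern (a 10-digit yyyyMMddHH
timestamp) and the example names are assumptions, not the actual HF radar
naming. The concatenation along time itself is not shown; it would be
done with a separate tool such as NCO's ncrcat or a netCDF library.]

```python
import re
from collections import defaultdict

# Assumed naming convention: a 10-digit yyyyMMddHH timestamp in the file name.
TIMESTAMP = re.compile(r"(\d{4})(\d{2})(\d{2})(\d{2})")


def group_by_month(filenames):
    """Map 'yyyy-MM' to the sorted list of hourly file names for that month."""
    groups = defaultdict(list)
    for name in filenames:
        match = TIMESTAMP.search(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        year, month = match.group(1), match.group(2)
        groups[f"{year}-{month}"].append(name)
    return {key: sorted(names) for key, names in sorted(groups.items())}


# Hypothetical file names, for illustration only.
files = ["hfr_2015031400.nc", "hfr_2015022823.nc", "hfr_2015031401.nc", "README.txt"]
for month, names in group_by_month(files).items():
    print(month, len(names))
```

[Each month's list could then be handed to the concatenation step,
leaving only the current, still-growing month as hourly files.]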
>>
>> I still don't understand what is happening with the joinExisting
>> aggregation, however -- why it periodically (but not regularly) takes
>> 50 seconds or more to respond.
>>
>> --
>> Dr. Richard P. Signell   (508) 457-2229
>> USGS, 384 Woods Hole Rd.
>> Woods Hole, MA 02543-1598
>>
>> _______________________________________________
>> thredds mailing list
>> thredds@xxxxxxxxxxxxxxxx
>> For list information or to unsubscribe,  visit: 
>> http://www.unidata.ucar.edu/mailing_lists/
>
> **********************
> "The contents of this message do not reflect any position of the U.S. 
> Government or NOAA."
> **********************
> Roy Mendelssohn
> Supervisory Operations Research Analyst
> NOAA/NMFS
> Environmental Research Division
> Southwest Fisheries Science Center
> ***Note new address and phone***
> 110 Shaffer Road
> Santa Cruz, CA 95060
> Phone: (831)-420-3666
> Fax: (831) 420-3980
> e-mail: Roy.Mendelssohn@xxxxxxxx www: http://www.pfeg.noaa.gov/
>
> "Old age and treachery will overcome youth and skill."
> "From those who have been given much, much will be expected"
> "the arc of the moral universe is long, but it bends toward justice" -MLK Jr.
>


