We have also had success with some very large aggregations, but it’s
interesting to hear that NcML aggregations should only be used for small
collections. We’ve worked around that issue in more ways than I care to recall…

The latest technique I’ve settled on is to use NcML aggregation on a
high-performance ‘development’ server to aggregate files and add or remove
metadata, then run nccopy against the OPeNDAP endpoint of the aggregations to
create read-optimized NetCDF4 archive file(s). How much goes in each file is
based on a reasonable maximum file size for transferability (currently I cap it
at about 5GB). Rich, I’m not sure this approach is an option for you, but it is
a good way to avoid the rescan hit of big ‘archive’ aggregations. We maintain
one service pair that we archive annually using this kind of approach, so we
have an ‘archive’ dataset and a ‘real time’ dataset. The archive is optimized
so the rescan is a trivial processing hit.

Finally, if your data has an unlimited dimension… stop using unlimited
dimensions for files that don’t change length. The time variable is interleaved
with the data, so the scan to aggregate is super duper slow. nccopy has a flag
(-u) to rewrite the data with fixed-size dimensions.
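
A rough sketch of that nccopy step in Python, for anyone who wants to try it.
This is not the actual script used here, just an illustration under stated
assumptions: the endpoint URL, output file name, deflate level, and both
function names are hypothetical. The nccopy options used (-k, -d, -u) are
standard ones; -u is the flag that converts unlimited dimensions to fixed size.
The slice planner only does the arithmetic for keeping each archive file under
a size cap; how the time ranges are then subset is left open.

```python
import shlex


def plan_time_slices(n_steps, bytes_per_step, max_bytes=5 * 1024**3):
    """Split n_steps time steps into (start, stop) index ranges (stop exclusive)
    so that each range holds at most max_bytes of uncompressed data.
    A single step larger than max_bytes still gets a range of its own."""
    if n_steps < 0 or bytes_per_step <= 0:
        raise ValueError("n_steps must be >= 0 and bytes_per_step must be > 0")
    steps_per_file = max(1, max_bytes // bytes_per_step)
    return [(start, min(start + steps_per_file, n_steps))
            for start in range(0, n_steps, steps_per_file)]


def nccopy_command(source, target, deflate=1, fix_unlimited=True):
    """Build an nccopy argument list that writes a netCDF-4 file.
    -k netCDF-4 : output format
    -d N        : deflate (compression) level
    -u          : convert unlimited dimensions to fixed size"""
    cmd = ["nccopy", "-k", "netCDF-4", "-d", str(deflate)]
    if fix_unlimited:
        cmd.append("-u")
    cmd += [source, target]
    return cmd


if __name__ == "__main__":
    # Hypothetical OPeNDAP endpoint of an NcML aggregation on a dev server.
    url = "http://localhost:8080/thredds/dodsC/hypothetical/aggregation.ncml"
    cmd = nccopy_command(url, "archive_2014.nc")
    # Print the command rather than running it; pass cmd to subprocess.run()
    # once the URL and output path are real.
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # Example: 8760 hourly steps at an assumed 2 MB each, capped at about 5GB.
    print(plan_time_slices(8760, 2 * 1024**2))
```

The deflate level and the 5GB default are tuning choices, not recommendations;
chunk shapes (nccopy -c) would also matter for a read-optimized file and are
left out here.
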

I know we’ve brought it up before, but the sooner grid feature collections can
get done, the better! Then again, I’m not sure providing band-aids for poorly
written archive files is a good thing.
- Dave
On Mar 13, 2015, at 9:37 PM, Roy Mendelssohn - NOAA Federal
<roy.mendelssohn@xxxxxxxx> wrote:
> We have aggregations in the 10,000s that work okay. The frequency of
> updates and the time each update takes affect the response. We have
> something that polls the datasets, so the updated aggregations usually get
> done before a user makes a request. Perhaps the same code that does
> the update should then make a request. Moreover, we have found that if the
> request comes from the local machine using localhost: it doesn’t time out and
> will complete the request no matter how long it takes.
>
> -Roy
>
> On Mar 13, 2015, at 7:22 PM, John Caron <caron@xxxxxxxx> wrote:
>
>> Hi Rich:
>>
>> NcML Aggregations should only be used for small collections of files (a few
>> dozen?), because they are created on the fly. FMRC does a better job of
>> caching information, so things go quicker. It handles the case of a single
>> time dimension as a special case of a Forecast model collection. However,
>> FMRCs too are limited in how far they will scale up (< 100?).
>>
>> GRIB collections in v4.6.0 are finally almost ready for large-scale
>> collections (> 10K files). I will be improving the FMRC to use some of the
>> techniques in GRIB collections, slated for version 4.6.1. Not yet sure how
>> far that will get in scaling up, but I think we can do much better than now.
>>
>> So how many files and variables are in the HF Radar collection?
>>
>> John
>>
>> On Fri, Mar 13, 2015 at 2:50 PM, Signell, Richard <rsignell@xxxxxxxx> wrote:
>> Thredds community,
>>
>> The largest archive of HF Radar ocean surface current data is being
>> served by THREDDS at
>> http://hfrnet.ucsd.edu/thredds/catalog.html, but the erratic
>> performance of the joinExisting aggregations has made them difficult
>> to use. The folks at UCSD discovered that if they use FMRC
>> aggregations they work much better than the joinExisting, as borne out
>> by this IPython Notebook, where we just request WMS services from the
>> two aggregations every minute for one hour:
>>
>> http://nbviewer.ipython.org/gist/rsignell-usgs/139d5481d74a1181e576
>>
>> I don't understand this behavior. The joinExisting was designed for
>> this type of aggregation (simply joining netcdf files along the time
>> dimension) and the FMRC was instead designed for files with
>> overlapping forecast times. But there is no arguing with the results
>> of this test: FMRC is clearly working better.
>>
>> Anyone have insight into why we are getting these results?
>>
>> Are there settings that could be changed to improve the performance of
>> the joinExisting aggregation?
>>
>> Thanks,
>> Rich
>>
>> P.S. the existing aggregation catalog and threddsConfig.xml settings
>> are shown at the end of the notebook
>>
>> --
>> Dr. Richard P. Signell (508) 457-2229
>> USGS, 384 Woods Hole Rd.
>> Woods Hole, MA 02543-1598
>>
>> _______________________________________________
>> thredds mailing list
>> thredds@xxxxxxxxxxxxxxxx
>> For list information or to unsubscribe, visit:
>> http://www.unidata.ucar.edu/mailing_lists/
>
> **********************
> "The contents of this message do not reflect any position of the U.S.
> Government or NOAA."
> **********************
> Roy Mendelssohn
> Supervisory Operations Research Analyst
> NOAA/NMFS
> Environmental Research Division
> Southwest Fisheries Science Center
> ***Note new address and phone***
> 110 Shaffer Road
> Santa Cruz, CA 95060
> Phone: (831)-420-3666
> Fax: (831) 420-3980
> e-mail: Roy.Mendelssohn@xxxxxxxx www: http://www.pfeg.noaa.gov/
>
> "Old age and treachery will overcome youth and skill."
> "From those who have been given much, much will be expected"
> "the arc of the moral universe is long, but it bends toward justice" -MLK Jr.