This is great info, Mike!
I've just been testing the Union vs joinExisting and I really appreciate
hearing about your strategy for large datasets. Fortunately, my largest
netCDF datasets are static, so I plan on using your suggestions and
getting things set up that way.
Thanks again!!
-kevin.
On 10/28/13 11:48 AM, Michael McDonald wrote:
Kevin,
> I have been triggering this initial scan by clicking on the services for the
> aggregated dataset. Is there another way to perform the initial indexing of
> netCDF aggregations (like is done with GRIB Collections) besides clicking on
> a service link?
We trigger all of our initial catalog scans via ongoing Nagios
(http://www.nagios.org/) queries that check the most frequently
accessed datasets (we really only need to query the datasets that change,
i.e., forecast datasets, and the large aggregations). We set the
Nagios queries to extremely high timeout values (5-10 minutes) and then
just let them run normally. We occasionally get false positives from
this when the Tomcat server is reset/synchronized on a daily basis.
All of the other misc datasets will be triggered by the users when
requested. However, these misc/smaller datasets are usually quick to
scan/generate on the fly. All of your static datasets should have the
"recheckEvery" value *excluded* from their catalog files. Therefore, once
the cache/agg file is created, it will only be removed when the
NetcdfFileCache scour value elapses. This is a tricky balance to get
right. We are still trying to fine-tune this on our servers.
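For sites without Nagios, the same warm-up idea can be sketched as a small script that requests each dataset once with a generous timeout. This is only an illustration under stated assumptions: the URL in DATASET_URLS is hypothetical, the 600-second default simply mirrors the 5-10 minute timeouts mentioned above, and warm_up/fetch are made-up helper names, not part of THREDDS or Nagios.

```python
import time
import urllib.request

# Hypothetical service URL; replace with datasets on your own server.
DATASET_URLS = [
    "http://localhost:8080/thredds/dodsC/example/forecast_agg.dds",
]

def fetch(url, timeout):
    """Request a URL and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status

def warm_up(urls, timeout=600, fetch_fn=fetch):
    """Request each URL once; return a list of (url, ok, seconds, detail)."""
    results = []
    for url in urls:
        start = time.monotonic()
        try:
            status = fetch_fn(url, timeout)
            ok, detail = status == 200, str(status)
        except Exception as exc:  # timeouts, refused connections, HTTP errors
            ok, detail = False, repr(exc)
        results.append((url, ok, time.monotonic() - start, detail))
    return results
```

Running warm_up(DATASET_URLS) from cron shortly after the daily server reset would then play the role of the first user click. A failure right after a restart may just be the kind of false positive described above.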
> Also, I assume that the scouring of NetcdfFileCache would not remove this
> index file from cache/agg, correct? Otherwise users would be in for a long
> wait each time they click on an aggregated service. According to
> http://www.unidata.ucar.edu/software/thredds/current/tds/reference/ThreddsConfigXMLFile.html,
> the cache/agg dir is only for joinExisting. I'm trying to use Union right
> now.
Assume that anything in the cache/agg folder is fair game for
removal/scrubbing. Everything in cache/agg older than the scour
value will be deleted! We were testing out a btsync between our two
THREDDS servers and this Tomcat scour was deleting dot-files/folders
unrelated to THREDDS. So we now do our sync one directory level higher,
at "cache", and exclude all directories but the "agg" folder.
If your dataset does not change and you want it to stay cached for a
while (avoiding the initial scan), then you need to set the
NetcdfFileCache scour value to multiple days. Make sure you have
plenty of disk space for the cache/agg folder, since all other
datasets will now be cached for much longer. However, all of our
catalogs in cache/agg typically occupy less than 25MB of space. The
real cache consumer is NCSS (a separate scour value/schedule)!
I don't think unions are stored in cache/agg. The best test is to look in
this folder for a file resembling the dataset name. Inspect the file
and note its size, timestamp, and contents. Nearly all of our
aggregations are nested: joinExisting (like variables) + union (top). I see
all of the joinExisting cache files in this cache/agg folder, but zero
files with the "union" type.
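The check described above (look for a file resembling the dataset name, then note its size and timestamp) could be scripted roughly as follows. This is a minimal sketch: find_cache_files is a hypothetical helper, the cache directory path has to be supplied by you, and matching on a name fragment is an assumption about how the cache files are named.

```python
import os
import time

def find_cache_files(cache_dir, name_fragment):
    """Return (filename, size_bytes, age_hours) for matching files in cache_dir."""
    now = time.time()
    matches = []
    for entry in os.scandir(cache_dir):
        # Case-insensitive match on part of the dataset name.
        if entry.is_file() and name_fragment.lower() in entry.name.lower():
            info = entry.stat()
            age_hours = (now - info.st_mtime) / 3600.0
            matches.append((entry.name, info.st_size, age_hours))
    return sorted(matches)
```

An empty result for a union dataset, alongside hits for its joinExisting components, would be consistent with unions not being written to cache/agg. Comparing age_hours against your scour setting also shows which files are close to being removed.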
Are you sure you should be performing a union on this dataset and not
using joinExisting (time series data) instead? What we do is many
small/manageable joinExisting scans of like data. Then we do a union
at the top level of these netCDF datasets. This way all of the
components get cached, and then the top-level union is simply a
combination of the cached data (see latest.xml attachment). This idea
was in one of the advanced THREDDS examples (or on the forum) and it
has helped significantly reduce our initial scan times.
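The latest.xml attachment is not reproduced here, so the following is only a guess at the general shape of such a nested aggregation, not Mike's actual file. The sketch generates NcML with one joinExisting aggregation per directory, wrapped in a top-level union. The directory paths, the "time" dimension name, the ".nc" suffix, and the nested_aggregation helper are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

def nested_aggregation(components, dim_name="time", suffix=".nc"):
    """Build NcML text: a union of joinExisting aggregations, one per directory."""
    root = ET.Element("netcdf", {"xmlns": NCML_NS})
    union = ET.SubElement(root, "aggregation", {"type": "union"})
    for location in components:
        inner = ET.SubElement(union, "netcdf")
        # No recheckEvery attribute: the data are assumed to be static.
        join = ET.SubElement(inner, "aggregation",
                             {"type": "joinExisting", "dimName": dim_name})
        ET.SubElement(join, "scan", {"location": location, "suffix": suffix})
    return ET.tostring(root, encoding="unicode")
```

For example, nested_aggregation(["/data/temperature/", "/data/salinity/"]) yields a union whose two members are each a joinExisting scan over one directory of like variables.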
/mike
--
Kevin Manross
NCAR/CISL/Data Support Section
Phone: (303)-497-1218
Email: manross@xxxxxxxx
Web: http://rda.ucar.edu