Dear Antonio and Michael,
Thank you for your comments and tips. It is nice to see that we are not the
only ones facing these challenges.
I will try some of these recommendations and see if they work.
Kind regards,
Jordi
On Fri, 30 Jun 2023 at 23:39, Michael McDonald <mcdonald@xxxxxxxxxxxxx>
wrote:
> Jordi,
> We use Nagios (the free "Core" release that comes with most Linux
> distros) to continuously check/trigger the "1st access" penalty of our
> HYCOM.org datasets that change or have expired caches for our THREDDS
> servers (https://tds.hycom.org/thredds). We check/touch each dataset's
> OPENDAP access/form pages, and the NCSS forms, i.e., the point at
> which the dataset has been successfully scanned and is ready for
> access. To account for some datasets that take longer than others to
> index/scan we allow for up to 1200 seconds to complete. The advantage
> of Nagios is that it will keep trying if there is a
> critical/warning/timeout. When the dataset is finally indexed and
> available for use the Nagios response will be "Green across the board"
> and will take milliseconds to respond. Nagios is also useful for
> checking a lot of other server services (tomcat, memory, CPU, etc).
>
> e.g.,
> nagios custom command name: long_check_http
> $USER1$/check_http -H $HOSTNAME$ -w 300 -c 600 -t 1200 $ARG1$
>
> service name:
> long_check_http!-u /thredds/dodsC/GLBy0.08/expt_93.0/ssh.html
>
> When Nagios requests this "OPeNDAP Dataset Access Form" page, it
> triggers the scan, waits for it to complete, and then returns a status
> of OK. If there are datasets that still return CRITICAL or WARNING
> after a few minutes, that is a sign of some other system issue, e.g.,
> disk/NFS I/O, misconfiguration, etc.
>
> $ /usr/lib64/nagios/plugins/check_http -H tds.hycom.org -w 300 -c 600 \
>     -t 1200 -u /thredds/dodsC/GLBy0.08/expt_93.0/ssh.html
> HTTP OK: HTTP/1.1 200 200 - 19245 bytes in 0.053 second response time |time=0.052748s;300.000000;600.000000;0.000000 size=19245B;;;0
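For sites without Nagios, the same "touch the access form and keep trying" idea could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names, the retry count, and the pause between attempts are made up here and are not part of the HYCOM setup; only the 1200-second timeout and the example URL come from the message above.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout):
    """Return the HTTP status code for url, or None on a network error/timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 500 or 503).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection failure or timeout: no status available.
        return None


def warm_dataset(url, timeout=1200, retries=3, pause=60,
                 fetch=fetch_status, sleep=time.sleep):
    """Request url until it answers 200.

    Returns the number of attempts needed, or None if every attempt failed.
    retries and pause are illustrative defaults, not recommended values.
    """
    for attempt in range(1, retries + 1):
        if fetch(url, timeout) == 200:
            return attempt
        if attempt < retries:
            sleep(pause)
    return None


# Example call (performs a real HTTP request):
# warm_dataset("https://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0/ssh.html")
```

Unlike Nagios, this sketch gives up after a fixed number of attempts, so it would need to be run from cron or similar to keep checking.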
>
>
> https://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0/ssh.html
>
>
> Regarding your large dataset issue, I'd advise you give the FMRC
> feature of THREDDS a try but only use this on the "incoming"
> (new/changing) parts of the dataset. We do this for the incoming
> forecast data (a separate folder) and then flatten/keep only the parts
> needed to add/extend to our existing time series, e.g., we have daily
> forecast runs that go from hour t000 to t180, which the FMRC can
> easily handle, combine, and merge, but we only copy (keep/save) the
> t000~t023 files so that we do not have time index duplicates/overlap
> in the main/growing dataset aggregations.
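The "keep only t000~t023" step could look something like the Python sketch below. It assumes a hypothetical file naming convention in which the forecast hour appears as "_tNNN" right before ".nc"; the real HYCOM file names may differ, so the regular expression would need adjusting.

```python
import re

# Hypothetical naming convention: forecast hour encoded as "_tNNN" before ".nc".
FORECAST_HOUR = re.compile(r"_t(\d{3})\.nc$")


def files_to_keep(filenames, first_hour=0, last_hour=23):
    """Return the files whose forecast hour lies in [first_hour, last_hour]."""
    kept = []
    for name in filenames:
        match = FORECAST_HOUR.search(name)
        if match and first_hour <= int(match.group(1)) <= last_hour:
            kept.append(name)
    return kept
```

Keeping exactly 24 hourly files per daily run means consecutive runs do not overlap in time, which is what a joinExisting aggregation on the time dimension needs.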
>
> The trick is to have THREDDS only scan/index the parts that are
> changing. The parts that do not change should be scanned once and the
> cache file preserved until it expires (you specify this in the
> config). You could do this a number of ways. We typically do this "by
> year": the datasetScan touches only the current/active 2023 data, and
> its joinExisting aggregation has recheckEvery="60 min", which THREDDS
> keys on to determine whether enough time has passed since the last
> index for a re-index to be needed on a Tomcat restart (note: a quick
> Tomcat restart/bounce is the only way we can "reliably" trigger a
> dataset update when new data arrives). The years older than 2023 are
> not getting updated, so their joinExisting aggregations do not have
> "recheckEvery" set.
>
> Side note: do not use backslashes or other special chars in your
> dataset "ID" values, as this will produce weird cache issues (learned
> this the hard way). The dataset ID is global (no duplicates), and it
> is the key (the filename) used for creating/keeping the agg cache
> files under /var/lib/tomcat/content/thredds/cache/agg. The "urlPath"
> is the part used to access the dataset from the web UI and for
> "combining" multiple aggregations into a top "dataset" aggregation,
> e.g., each "/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/YYYY"
> is an independently scanned/cached dataset (an OPENDAP object) that
> can be reused in another joinExisting or union. NOTE that union
> operations do not produce an agg cache file.
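A quick pre-deployment check for the ID advice above might be sketched like this in Python. The set of "safe" characters is an assumption chosen here (letters, digits, dot, underscore, hyphen); THREDDS itself does not publish this list, and the function name is illustrative.

```python
import re

# Assumption: these characters are safe to use in a cache file name.
SAFE_ID = re.compile(r"[A-Za-z0-9._-]+")


def check_dataset_ids(ids):
    """Return (unsafe_ids, duplicate_ids) for a list of dataset ID values."""
    unsafe = [i for i in ids if not SAFE_ID.fullmatch(i)]
    seen = set()
    duplicates = []
    for i in ids:
        if i in seen and i not in duplicates:
            duplicates.append(i)
        seen.add(i)
    return unsafe, duplicates
```

Running it over all ID values extracted from the catalogs would flag both problems mentioned above: special characters and IDs that are not globally unique.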
>
> Below is a hybrid example of multiple aggregations for one of our
> datasets. We also have our catalogs coded in Puppet templates for easy
> updates/deployments across our multiple load-balanced Tomcat+THREDDS
> servers.
>
> <dataset name="* ALL DATA/YEARS *"
>          ID="GOMu0.04-expt_90.1m000" urlPath="GOMu0.04/expt_90.1m000">
>   <serviceName>all</serviceName>
>   <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
>     <aggregation dimName="time" type="joinExisting">
>       <netcdf location="dods://tds.hycom.org/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/2019"/>
>       <netcdf location="dods://tds.hycom.org/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/2020"/>
>       <netcdf location="dods://tds.hycom.org/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/2021"/>
>       <netcdf location="dods://tds.hycom.org/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/2022"/>
>       <netcdf location="dods://tds.hycom.org/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/2023"/>
>     </aggregation>
>   </netcdf>
> </dataset>
>
> <dataset name="(2023) Hindcast Data (1-hrly)"
>          ID="GOMu0.04-expt_90.1m000-2023"
>          urlPath="GOMu0.04/expt_90.1m000/data/hindcasts/2023">
>   <serviceName>all</serviceName>
>   <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
>     <aggregation dimName="time" type="joinExisting" recheckEvery="60 min">
>       <scan location="/hycom/ftp/datasets/GOMu0.04/expt_90.1m000/data/hindcasts/2023/"
>             suffix="*.nc" subdirs="false" />
>     </aggregation>
>   </netcdf>
> </dataset>
>
> <dataset name="(2022) Hindcast Data (1-hrly)"
>          ID="GOMu0.04-expt_90.1m000-2022"
>          urlPath="GOMu0.04/expt_90.1m000/data/hindcasts/2022">
>   <serviceName>all</serviceName>
>   <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
>     <aggregation dimName="time" type="joinExisting">
>       <scan location="/hycom/ftp/datasets/GOMu0.04/expt_90.1m000/data/hindcasts/2022/"
>             suffix="*.nc" subdirs="false" />
>     </aggregation>
>   </netcdf>
> </dataset>
>
> ...
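The per-year pattern above lends itself to templating. As a rough Python stand-in for the Puppet templates mentioned earlier (whose actual contents are not shown in this thread), the yearly dataset blocks could be generated as below, with recheckEvery set only on the active year. The function and template names are illustrative; the paths and IDs are copied from the example catalog.

```python
# Template mirroring the per-year dataset blocks in the example catalog.
TEMPLATE = """<dataset name="({year}) Hindcast Data (1-hrly)"
         ID="GOMu0.04-expt_90.1m000-{year}"
         urlPath="GOMu0.04/expt_90.1m000/data/hindcasts/{year}">
  <serviceName>all</serviceName>
  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <aggregation dimName="time" type="joinExisting"{recheck}>
      <scan location="/hycom/ftp/datasets/GOMu0.04/expt_90.1m000/data/hindcasts/{year}/"
            suffix="*.nc" subdirs="false" />
    </aggregation>
  </netcdf>
</dataset>"""


def yearly_datasets(years, active_year, recheck="60 min"):
    """Render one dataset block per year; only active_year gets recheckEvery."""
    blocks = []
    for year in years:
        attr = f' recheckEvery="{recheck}"' if year == active_year else ""
        blocks.append(TEMPLATE.format(year=year, recheck=attr))
    return "\n\n".join(blocks)
```

Calling yearly_datasets(range(2019, 2024), active_year=2023) would produce five blocks, of which only the 2023 one is rechecked.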
>
>
>
> We break down our datasets into chunks "by year" and sometimes, within
> a year, "by variable", e.g., we have surface data for each time value
> in one file (2d) and variables with a depth component in another file
> (3z). We union these. Running a datasetScan for all the *2d.nc files
> in a specific year and another datasetScan for all the *3z.nc files in
> a dir will create an independent cache record for each. If you tell
> THREDDS not to recheck a dataset, it should honor this, depending on
> your AggregationCache settings in threddsConfig.xml and on whether the
> joinExisting aggregation has "recheckEvery=___" defined.
>
> We also enforce a longer cache retention period via this override in
> our "threddsConfig.xml". This might not be applicable in your case as
> you need to monitor and pay closer attention to these cache files.
>
> <AggregationCache>
> <dir>/var/lib/tomcat/content/thredds/cache/agg/</dir>
> <scour>-1 sec</scour>
> <maxAge>999 days</maxAge>
> <cachePathPolicy>oneDirectory</cachePathPolicy>
> </AggregationCache>
>
>
> After the July holiday break (after July 17th) I'd be happy to Zoom
> with you one on one to help out further if needed.
>
>
> On Mon, Jun 19, 2023 at 10:52 AM Jordi Domingo Ballesta
> <jordi.domingo@lobelia.earth> wrote:
> >
> > Dear TDS team,
> >
> > I would like to know if it is possible to (pre-)create the
> > aggregation cache and make thredds load it, in order to speed up the
> > first time a dataset is requested.
> >
> > To give a bit of context, our situation is the following:
> > - We have a big archive of 265TB of data and 5 million files,
> >   distributed in 1000 datasets (approx.).
> > - These datasets are in NetCDF format (mostly v4, some v3).
> > - We run TDS version 5.4.
> > - We configured thredds to provide access to them via "http" and
> >   "odap" services, both directly (with "datasetScan") and as
> >   aggregated datasets.
> > - The configuration needs to be updated regularly (at least every
> >   day) as new files come while others are deleted.
> > - We have serious performance issues regarding the access of
> >   aggregated datasets, especially the first time they are accessed.
> >
> > In order to improve that, we tried configuring the catalogs with the
> > explicit list of files for each dataset, including the "ncoords"
> > field, or even the "coordValue" field with the time value of each
> > file (they are joinExisting aggregations based on time dimension).
> > That substantially improved the performance of the first access, but
> > the duration is still not "acceptable" to the users.
> >
> > I tried to pre-create the cache files in thredds/cache/aggNew/
> > directory with the same content as when they are created by thredds,
> > but it seems that thredds is ignoring them when loading, and just
> > recreating its own version again. I also noticed that the cache
> > database in thredds/cache/catalog/ directory plays a role as well,
> > but I do not understand the relation between that and the
> > aggregation cache files.
> >
> > Anyway, do you recommend any practice in order to improve the
> > performance of thredds for the first time a dataset is accessed?
> > Maybe throwing a 1-time request for the time variables to each
> > dataset in order to force thredds to create and load the cache?
> >
> > Your help is much appreciated. Many thanks!
> >
> > Kind regards,
> >
> > Jordi Domingo
> > Senior software engineer
> > Lobelia Earth, S.L.
>
>
>
> --
> Michael McDonald
> Florida State University
>