Hi Chiara,
On 05/07/18 08:54, Chiara Scaini wrote:
Hi Antonio, and thanks for answering. I'm using version 4.6.11.
Here's an example of two folders of the filesystem containing data from
two different models. I'm currently creating two different datasetScan
elements, so the data are located in different folders of my THREDDS catalog.
/filesystem/operative/data/wrfop/outputs/2018
/filesystem/operative/data//ariareg/farm/air_quality_forecast/2018
Each folder contains several daily folders with files that I can
filter by name (e.g. <include wildcard="*myfile_*"/>):
/filesystem/operative/data//ariareg/farm/air_quality_forecast/2018/20180620_00/myfile.nc
But my aim is to harvest each file from THREDDS to GeoNetwork only once.
Since GeoNetwork can take the 'harvest' attribute into account, I
would like to _set the harvest attribute to 'false' for all data but
the newly created_. Do you think that's possible with the current
functions?
I don't have experience with GeoNetwork, but looking at its documentation,
what you want is to harvest THREDDS catalogs *from* GeoNetwork. From my
POV the "problem" is that GeoNetwork is harvesting datasets (atomic or
collections) that were already harvested, and you don't want to do it
again. In fact, in the GeoNetwork example for harvesting a remote
THREDDS catalog, it ignores the "harvest" attribute of the
dataset. It may be beyond my knowledge, but it would be very easy
for GeoNetwork to "cache" already harvested metadata based on the ID
attribute of the THREDDS datasets (collections and atomic ones). The
harvest attribute is only applied to the root of the datasetScan
collection, and it is not inherited because it's not a metadata element.
A workaround would be to create a temporary folder (and catalog) to be
used for harvesting. A crontab job creates the new data in the
filesystem every day, and it can create the link too. The catalog would
contain symbolic links and the attribute "harvest=true". The links
would be deleted and replaced daily from crontab. Once the data are
imported into GeoNetwork, I would of course modify the THREDDS links to
point to the main catalog and not to a 404.
That could be a solution, like the latest dataset in:
http://motherlode.ucar.edu/thredds/catalog/satellite/3.9/WEST-CONUS_4km/catalog.html
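If you go the symlink route, the temporary catalog could simply be
another datasetScan pointed at the directory of links, with
harvest="true" on it. A minimal sketch (untested; the location and names
are hypothetical, and your crontab job would maintain the links there):

   <!-- hypothetical temporary scan over the symlink directory maintained by cron -->
   <datasetScan name="LATEST" ID="testLATEST" path="latest"
                location="/filesystem/operative/latest_links"
                harvest="true">
     <metadata inherited="true">
       <serviceName>all</serviceName>
     </metadata>
     <filter>
       <include wildcard="*wrfout_*"/>
     </filter>
   </datasetScan>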
Maybe you could use the latest or proxy dataset feature described in
[1]. This allows you to generate a proxy dataset which points to the
latest "added" dataset, and to provide the URL of that latest dataset to
the GeoNetwork harvester. I'm not sure if this solves your issue, but
it's worth a try. Here is an example:
http://motherlode.ucar.edu/thredds/catalog/nws/metar/ncdecoded/files/catalog.html
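In catalog terms (see [1]) this means declaring a Resolver service and
adding a proxy element inside the datasetScan. A minimal sketch, with
illustrative names only:

   <service name="latest" serviceType="Resolver" base=""/>
   <datasetScan name="AUXILIARY" ID="testAUXILIARY" path="AUXILIARY"
                location="content/testdata/auxiliary-aux">
     <!-- adds a proxy dataset that always resolves to the newest file in the scan -->
     <addProxies>
       <simpleLatest name="latest.xml" top="true" serviceName="latest"/>
     </addProxies>
   </datasetScan>

GeoNetwork could then be pointed at the proxy dataset, whose URL stays
fixed while the dataset it resolves to changes.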
The other option could be to "regenerate" the catalogs dynamically and
trigger a catalog reload on the TDS instance. This is quite similar to
your option, but more dynamic, although it requires more machinery to
complete.
Here's what I got so far with the 'harvest' attribute set at the
datasetScan level. I did what you suggested about the filter:
<filter>
  <include wildcard="*wrfout_*"/>
</filter>
<filter>
  <include collection="true"/>
</filter>
<filter>
  <include atomic="true"/>
</filter>
The harvest attribute is not set for the inner dataset nodes, but only
for the dataset parent. Is that what I should expect?
Yes, the harvest attribute is only set on the parent collection dataset.
<catalog version="1.0.1">
  <service name="all" serviceType="Compound" base="">
    <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
    <service name="http" serviceType="HTTPServer" base="/thredds/fileServer/"/>
    <service name="wms" serviceType="WMS" base="/thredds/wms/"/>
    <service name="ncss" serviceType="NetcdfSubset" base="/thredds/ncss/"/>
  </service>
  <dataset name="AUXILIARY" harvest="true" ID="testAUXILIARY">
    <metadata inherited="true">
      <serviceName>all</serviceName>
      <dataType>GRID</dataType>
      <documentation type="summary">This is a summary for my test ARPA catalog for WRF runs. Runs are made at 12Z and 00Z, with analysis and forecasts every 6 hours out to 60 hours. Horizontal = 93 by 65 points, resolution 81.27 km, LambertConformal projection. Vertical = 1000 to 100 hPa pressure levels.</documentation>
      <keyword>WRF outputs</keyword>
      <geospatialCoverage>
        <northsouth><start>25.0</start><size>35.0</size><units>degrees_north</units></northsouth>
        <eastwest><start>-20.0</start><size>50.0</size><units>degrees_east</units></eastwest>
        <updown><start>0.0</start><size>0.0</size><units>km</units></updown>
      </geospatialCoverage>
      <timeCoverage>
        <end>present</end>
        <duration>5 years</duration>
      </timeCoverage>
      <variables vocabulary="GRIB-1"/>
      <variables vocabulary="">
        <variable name="Z_sfc" vocabulary_name="Geopotential H" units="gp m">Geopotential height, gpm</variable>
      </variables>
    </metadata>
    <dataset name="wrfout_d03_test7" ID="testAUXILIARY/wrfout_d03_test7" urlPath="AUXILIARY/wrfout_d03_test7">
      <dataSize units="Mbytes">137.2</dataSize>
      <date type="modified">2018-06-28T10:19:28Z</date>
    </dataset>
    <dataset name="wrfout_d03_test6" ID="testAUXILIARY/wrfout_d03_test6" urlPath="AUXILIARY/wrfout_d03_test6">
      <dataSize units="Mbytes">137.2</dataSize>
      <date type="modified">2018-06-28T10:19:28Z</date>
    </dataset>
  </dataset>
</catalog>
Thanks for your time,
Chiara
Hope this helps
Regards
Antonio
[1]
https://www.unidata.ucar.edu/software/thredds/current/tds/reference/DatasetScan.html#Adding_Proxy_Datasets
On 4 July 2018 at 19:46, Antonio S. Cofiño <cofinoa@xxxxxxxxx> wrote:
Hi Chiara,
I'm answering inline.
On 04/07/18 18:23, Chiara Scaini wrote:
Hi all, I'm setting up a geospatial data and metadata portal
based on a THREDDS catalog and the GeoNetwork engine and web
application. I am working on Linux CentOS, and my applications are
deployed with Tomcat 8.
Which TDS version are you using?
I am populating a THREDDS catalog based on a filesystem
containing meteorological data. GeoNetwork then harvests the
catalog and populates the application. However, given that
I'm updating the data on the web side, I would like to harvest
the data only once.
I tried to set the 'harvest' attribute from the catalog, but
without success. Here's an excerpt of my catalog.xml file:
The "harvest" it's been only defined as attribute for dataset (and
datasetScan) elements, but IMO it's no the purpose you are looking
for (see [1])
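For reference, [1] defines harvest as a plain attribute of any dataset
element, so it can also be set per dataset, e.g. (a made-up example):

   <dataset name="Sample" ID="sample" urlPath="sample.nc" harvest="true"/>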
<datasetScan name="AUXILIARY" ID="testAUXILIARY"
path="AUXILIARY"
location="content/testdata/auxiliary-aux" harvest="true">
This harvest is correct.
  <metadata inherited="true">
    <serviceName>all</serviceName>
    <dataType>Grid</dataType>
    <dataFormatType>NetCDF</dataFormatType>
    <DatasetType harvest="true"></DatasetType>
    <harvest>true</harvest>
This harvest element is not defined in the THREDDS Client Catalog
Specification (see [1]).
    <keyword>WRF outputs</keyword>
    <documentation type="summary">This is a summary for my test ARPA catalog for WRF runs. Runs are made at 12Z and 00Z, with analysis and forecasts every 6 hours out to 60 hours. Horizontal = 93 by 65 points, resolution 81.27 km, LambertConformal projection. Vertical = 1000 to 100 hPa pressure levels.</documentation>
    <timeCoverage>
      <end>present</end>
      <duration>5 years</duration>
    </timeCoverage>
    <variables vocabulary="GRIB-1" />
    <variables vocabulary="">
      <variable name="Z_sfc" vocabulary_name="Geopotential H" units="gp m">Geopotential height, gpm</variable>
    </variables>
  </metadata>
  <filter>
    <include wildcard="*wrfout_*"/>
  </filter>
How are the files distributed on disk? Are they under directories? If
so, then you need to add an include filter with the attribute
collection="true" (see [2] and [3]).
  <addDatasetSize/>
  <addTimeCoverage
      datasetNameMatchPattern="([0-9]{2})_([0-9]{4})-([0-9]{2})-([0-9]{2})_([0-9]{2}):([0-9]{2}):([0-9]{2})"
      startTimeSubstitutionPattern="$2-$3-$4T$5:00:00"
      duration="6 hours" />
  <namer>
    <regExpOnName regExp="([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})"
                  replaceString="WRF $1-$2-$3T$4:00:00" />
    <regExpOnName regExp="([0-9]{2})_([0-9]{4})-([0-9]{2})-([0-9]{2})_([0-9]{2}):([0-9]{2}):([0-9]{2})"
                  replaceString="WRF Domain-$1 $2-$3-$4T$5:00:00" />
  </namer>
</datasetScan>
Even if I set the harvest="true" attribute, it is not inherited
by the datasets, and thus the harvester does not get the files. I
can also ignore the 'harvest' attribute while harvesting, but my
aim is to harvest only new files, using an auxiliary catalog that
contains symbolic links (and updating the THREDDS path after
harvesting).
Am I missing something? How would you systematically add the
harvest attribute to all inner datasets in a nested filesystem?
Or, would it make sense to create two catalogs using the time
filter options (e.g. all up to yesterday in one catalog, and
today's files in another)? Can you show me an example of usage of
those filters in a datasetScan?
Many thanks,
Chiara
Hope this helps
Regards
Antonio
[1]
https://www.unidata.ucar.edu/software/thredds/current/tds/catalog/InvCatalogSpec.html#dataset
[2]
https://www.unidata.ucar.edu/software/thredds/current/tds/catalog/InvCatalogServerSpec.html#datasetScan_Element#filter_Element
[3]
https://www.unidata.ucar.edu/software/thredds/current/tds/reference/DatasetScan.html#Including_Only_the_Desired_Files
--
Antonio S. Cofiño
Dep. de Matemática Aplicada y
Ciencias de la Computación
Universidad de Cantabria
http://www.meteo.unican.es
--
Chiara Scaini