Re: [thredds] Set harvest attribute using datasetScan

Hi Chiara,
On 05/07/18 08:54, Chiara Scaini wrote:
Hi Antonio, and thanks for answering. I'm using version 4.6.11.

Here's an example of 2 filesystem folders containing data from 2 different models. I'm currently creating 2 different datasetScan elements, so the data end up in different folders in my THREDDS catalog.

/filesystem/operative/data/wrfop/outputs/2018
/filesystem/operative/data/ariareg/farm/air_quality_forecast/2018

Each folder contains several daily folders with files that I can filter by name (e.g. <include wildcard="*myfile_*"/>): /filesystem/operative/data/ariareg/farm/air_quality_forecast/2018/20180620_00/myfile.nc

But my aim is to harvest data from Thredds to Geonetwork only once per file. Since Geonetwork can account for the 'harvest' attribute, I would like to _set the harvest attribute to 'false' for all data but the newly created_. Do you think that's possible with the current functions?

I don't have experience with GeoNetwork, but looking at its documentation, what you want is to harvest THREDDS catalogs *from* GeoNetwork. From my point of view the "problem" is that GeoNetwork is harvesting datasets (atomic or collections) that were already harvested, and you don't want to do that again. In fact, in the GeoNetwork example for harvesting a remote THREDDS catalog, it ignores the "harvest" attribute of the dataset. It may be beyond my knowledge, but it would be very easy for GeoNetwork to "cache" already-harvested metadata based on the ID attribute of the THREDDS datasets (collections and atomic ones). The harvest attribute is only applied to the root of the datasetScan collection, and it is not inherited because it is not a metadata element.


A workaround would be to create a temporary folder (and catalog) to be used for harvesting. A crontab job creates the new data in the filesystem every day and can create the links too. The catalog would contain symbolic links and the attribute harvest="true". The links would be deleted and replaced daily by crontab. Once imported into GeoNetwork, I would of course modify the THREDDS links to point to the main catalog and not to a 404. A sketch of such a temporary catalog is below.
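For illustration only, a minimal version of that temporary catalog could look like this (the name, ID, path and location are invented placeholders):

  <!-- Hypothetical catalog scanning a folder of daily symbolic links -->
  <datasetScan name="DAILY_HARVEST" ID="dailyHarvest"
               path="daily-harvest"
               location="/filesystem/operative/data/harvest_links"
               harvest="true">
    <metadata inherited="true">
      <serviceName>all</serviceName>
    </metadata>
    <filter>
      <include wildcard="*myfile_*"/>
    </filter>
  </datasetScan>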

That could be a solution, like the latest dataset in:
http://motherlode.ucar.edu/thredds/catalog/satellite/3.9/WEST-CONUS_4km/catalog.html

Maybe you could use the latest (proxy) dataset feature described in [1]. It generates a proxy dataset which points to the latest "added" dataset, so you could provide the URL of that proxy to the GeoNetwork harvester. I'm not sure this solves your issue, but it's worth a try.
Here's an example:
http://motherlode.ucar.edu/thredds/catalog/nws/metar/ncdecoded/files/catalog.html
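Untested, but following [1] the configuration would look roughly like this (the service name and scan names are only examples); note that the proxy dataset requires a Resolver service:

  <!-- The "latest" proxy dataset is served through a Resolver service -->
  <service name="latest" serviceType="Resolver" base=""/>

  <datasetScan name="WRF outputs" path="wrfop"
               location="/filesystem/operative/data/wrfop/outputs">
    <metadata inherited="true">
      <serviceName>all</serviceName>
    </metadata>
    <!-- Adds a proxy dataset that always points at the most recent file -->
    <addLatest/>
  </datasetScan>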

The other option could be to "regenerate" the catalogs dynamically and trigger a catalog reload in the TDS instance. This is quite similar to your option, but more dynamic, although it requires more machinery to complete.
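As an illustration only (the dataset names are invented), a regenerated catalog could mark just the newest file for harvesting and leave the older ones alone:

  <!-- Regenerated daily: only today's dataset carries harvest="true" -->
  <dataset name="wrfout_d03_20180705" ID="testAUXILIARY/wrfout_d03_20180705"
           urlPath="AUXILIARY/wrfout_d03_20180705" harvest="true"/>
  <dataset name="wrfout_d03_20180704" ID="testAUXILIARY/wrfout_d03_20180704"
           urlPath="AUXILIARY/wrfout_d03_20180704"/>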

Here's what I got so far with the 'harvest' attribute set at the datasetScan level. I did what you suggested about the filter (everything in a single filter element):

  <filter>
    <include wildcard="*wrfout_*"/>
    <include wildcard="*" atomic="false" collection="true"/>
  </filter>

The harvest attribute is not set for the inner dataset nodes, but only for the dataset parent. Is that what I should expect?
Yes, the harvest attribute applies only to the parent collection dataset.

<catalog version="1.0.1">
  <service name="all" serviceType="Compound" base="">
    <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
    <service name="http" serviceType="HTTPServer" base="/thredds/fileServer/"/>
    <service name="wms" serviceType="WMS" base="/thredds/wms/"/>
    <service name="ncss" serviceType="NetcdfSubset" base="/thredds/ncss/"/>
  </service>
  <dataset name="AUXILIARY" harvest="true" ID="testAUXILIARY">
    <metadata inherited="true">
      <serviceName>all</serviceName>
      <dataType>GRID</dataType>
      <documentation type="summary">This is a summary for my test ARPA catalog for WRF runs. Runs are made at 12Z and 00Z, with analysis and forecasts every 6 hours out to 60 hours. Horizontal = 93 by 65 points, resolution 81.27 km, LambertConformal projection. Vertical = 1000 to 100 hPa pressure levels.</documentation>
      <keyword>WRF outputs</keyword>
      <geospatialCoverage>
        <northsouth><start>25.0</start><size>35.0</size><units>degrees_north</units></northsouth>
        <eastwest><start>-20.0</start><size>50.0</size><units>degrees_east</units></eastwest>
        <updown><start>0.0</start><size>0.0</size><units>km</units></updown>
      </geospatialCoverage>
      <timeCoverage><end>present</end><duration>5 years</duration></timeCoverage>
      <variables vocabulary="GRIB-1"/>
      <variables vocabulary="">
        <variable name="Z_sfc" vocabulary_name="Geopotential H" units="gp m">Geopotential height, gpm</variable>
      </variables>
    </metadata>
    <dataset name="wrfout_d03_test7" ID="testAUXILIARY/wrfout_d03_test7" urlPath="AUXILIARY/wrfout_d03_test7">
      <dataSize units="Mbytes">137.2</dataSize>
      <date type="modified">2018-06-28T10:19:28Z</date>
    </dataset>
    <dataset name="wrfout_d03_test6" ID="testAUXILIARY/wrfout_d03_test6" urlPath="AUXILIARY/wrfout_d03_test6">
      <dataSize units="Mbytes">137.2</dataSize>
      <date type="modified">2018-06-28T10:19:28Z</date>
    </dataset>
  </dataset>
</catalog>

Thanks for your time,
Chiara


Hope this helps

Regards

Antonio

[1] https://www.unidata.ucar.edu/software/thredds/current/tds/reference/DatasetScan.html#Adding_Proxy_Datasets


On 4 July 2018 at 19:46, Antonio S. Cofiño <cofinoa@xxxxxxxxx> wrote:

    Hi Chiara,

    I'm answering inline.



    On 04/07/18 18:23, Chiara Scaini wrote:
    Hi all, I'm setting up a geospatial data and metadata portal
    based on a THREDDS catalog and the GeoNetwork engine and web
    application. I am working on Linux CentOS and my applications are
    deployed with Tomcat 8.

    Which TDS version are you using?
    I am populating a THREDDS catalog based on a filesystem
    containing meteorological data. GeoNetwork then harvests the
    catalog and populates the application. However, given that
    I'm updating the data on the web side, I would like to harvest
    the data only once.

    I tried to set the 'harvest' attribute from the catalog, but
    without success. Here's an excerpt of my catalog.xml file:
    The "harvest" it's been only defined as attribute for dataset (and
    datasetScan) elements, but IMO it's no the purpose you are looking
    for (see [1])

      <datasetScan name="AUXILIARY" ID="testAUXILIARY"
                   path="AUXILIARY"
    location="content/testdata/auxiliary-aux" harvest="true">
    This harvest is correct.
        <metadata inherited="true">
          <serviceName>all</serviceName>
          <dataType>Grid</dataType>
          <dataFormatType>NetCDF</dataFormatType>
            <DatasetType harvest="true"></DatasetType>
            <harvest>true</harvest>
    This harvest element is not defined in the THREDDS Client Catalog
    Specification (see [1]).
          <keyword>WRF outputs</keyword>
            <documentation type="summary">This is a summary for my
    test ARPA catalog for WRF runs. Runs are made at 12Z and 00Z,
    with analysis an
            d forecasts every 6 hours out to 60 hours. Horizontal =
    93 by 65 points, resolution 81.27 km, LambertConformal
    projection. Vertical = 1000 to
             100 hPa pressure levels.</documentation>
           <timeCoverage>
             <end>present</end>
             <duration>5 years</duration>
           </timeCoverage>
           <variables vocabulary="GRIB-1" />
           <variables vocabulary="">
             <variable name="Z_sfc" vocabulary_name="Geopotential H"
    units="gp m">Geopotential height, gpm</variable>
           </variables>
        </metadata>

        <filter>
          <include wildcard="*wrfout_*"/>
        </filter>

    How are the files distributed on disk? Are they under directories?
    If so, then you need to add an include filter with the attribute
    collection="true" (see [2] and [3]).



        <addDatasetSize/>
        <addTimeCoverage
            datasetNameMatchPattern="([0-9]{2})_([0-9]{4})-([0-9]{2})-([0-9]{2})_([0-9]{2}):([0-9]{2}):([0-9]{2})"
            startTimeSubstitutionPattern="$2-$3-$4T$5:00:00"
            duration="6 hours" />

        <namer>
          <regExpOnName regExp="([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})"
                        replaceString="WRF $1-$2-$3T$4:00:00" />
          <regExpOnName regExp="([0-9]{2})_([0-9]{4})-([0-9]{2})-([0-9]{2})_([0-9]{2}):([0-9]{2}):([0-9]{2})"
                        replaceString="WRF Domain-$1 $2-$3-$4T$5:00:00" />
        </namer>

      </datasetScan>


    Even if I set the harvest="true" attribute, it is not inherited
    by the datasets, and thus the harvester does not get the files. I
    can also ignore the 'harvest' attribute while harvesting, but my
    aim is to harvest only new files using an auxiliary catalog that
    contains symbolic links (and update the THREDDS path after
    harvesting).

    Am I missing something? How would you systematically add the
    harvest attribute to all inner datasets in a nested filesystem?
    Or, would it make sense to create two catalogs using the time
    filter options (e.g. everything up to yesterday in one catalog,
    and today's files in another)? Can you show me an example of usage
    of those filters in a datasetScan?

    Many thanks,
    Chiara


    Hope this helps
    Regards

    Antonio


    [1] https://www.unidata.ucar.edu/software/thredds/current/tds/catalog/InvCatalogSpec.html#dataset
    [2] https://www.unidata.ucar.edu/software/thredds/current/tds/catalog/InvCatalogServerSpec.html#datasetScan_Element#filter_Element
    [3] https://www.unidata.ucar.edu/software/thredds/current/tds/reference/DatasetScan.html#Including_Only_the_Desired_Files

    --
    Antonio S. Cofiño
    Dep. de Matemática Aplicada y
             Ciencias de la Computación
    Universidad de Cantabria
    http://www.meteo.unican.es



-- Chiara Scaini






--
Chiara Scaini
