[thredds] Aggregation - of joinexisting (on time) - extremely slow compared to featurecollection

To: thredds@xxxxxxxxxxxxxxxx
Subject: [thredds] Aggregation - of joinexisting (on time) - extremely slow compared to featurecollection
From: Aleksander Vines <aleksander.vines@xxxxxxxx>
Date: Fri, 27 Nov 2015 15:57:09 +0100

 Dear All,


We have a dataset with gridded data and multiple parameters of in-situ 
measurements (temperature, salinity, oxygen, etc). The issue with this dataset 
is that there are very few timesteps, if any, where we have all the 
parameters together.


We want the main end product to be the entire dataset as a thredds dataset, 
mostly for the opendap and wms capabilities.


We're trying to decide on how to best store these data, for both efficient 
storage and retrieval of the data.


Would the most efficient option, for reading in Thredds, be one huge netcdf file? That 
seems inefficient storage-wise, and it certainly is inefficient if we want to 
add new data (both "new" historical data, and new-new data).


Could we make it efficient (for thredds to retrieve the data) if we split the 
netcdf up into smaller files, either by parameter, or by timestep, or both, 
and then omit variables on the timesteps where there is no data for them? 
The optimal would be a union of joinExisting aggregations, where each file has 
only one variable and one timestep.
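
A minimal sketch of what such a union of joinExisting aggregations could look like in NcML. The per-variable directories (temperature/, salinity/) are hypothetical and not part of our test setup. Note that a union takes each dimension and variable from the first nested dataset that defines it, so this sketch assumes all variables share the same time axis, which is exactly what our data does not guarantee:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="union">
    <!-- One nested joinExisting aggregation per variable. -->
    <netcdf>
      <aggregation dimName="time" type="joinExisting">
        <!-- Hypothetical directory holding only temperature files -->
        <scan location="/vagrant/shared/thredds/temperature/" suffix=".nc" subdirs="false"/>
      </aggregation>
    </netcdf>
    <netcdf>
      <aggregation dimName="time" type="joinExisting">
        <!-- Hypothetical directory holding only salinity files -->
        <scan location="/vagrant/shared/thredds/salinity/" suffix=".nc" subdirs="false"/>
      </aggregation>
    </netcdf>
  </aggregation>
</netcdf>
```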


From what we can see from the documentation and our testing, this seems hard. If 
we were using a featurecollection, it could be solved by setting the 
"correct" prototype, but the featurecollection has no "grid" featureType - 
why? :(

The FMRC featuretype seems to process the files much more efficiently than an 
ncml aggregation. Using FMRC is not an option, however, as we cannot modify 
basic things like the name/summary/id of the "best" dataset or add other 
variables to it (and we would prefer to remove the time_run variable). To me, 
it sounds easy enough to provide this functionality in TDS.


The first test files we've used here contain one timestep, with all variables 
(even where all values for some are nan - this seems superfluous, and should be 
possible to omit).


We have these two test datasets:



 <dataset name="Aggregation_ncml" ID="aggr_ncml" serviceName="all"
          dataType="Grid" urlPath="aggr_ncml">
   <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
     <aggregation dimName="time" type="joinExisting" recheckEvery="15 min">
       <scan location="/vagrant/shared/thredds/Test-1/" suffix=".nc" subdirs="false"/>
     </aggregation>
   </netcdf>
 </dataset>


 <featureCollection name="test-fmrc" featureType="FMRC" harvest="true"
                    path="fmrc/test">
   <collection spec="/vagrant/shared/thredds/Test-1/UL-5-#yyyy-MM-dd#.nc"
               recheckAfter="10 sec"
               olderThan="1 min"/>
   <update startup="true" rescan="0 5 3 * * ? *"/>
   <protoDataset choice="Penultimate" change="0 2 3 * * ? *"/>
   <fmrcConfig regularize="true"
               datasetTypes="TwoD Best Files Runs ConstantForecasts ConstantOffsets"/>
 </featureCollection>


The ncml aggregation is SLOW (it takes 1-5 minutes to produce one single wms 
layer in godiva)! The fmrc collection, by contrast, is quite fast (e.g. "Best" 
takes under a minute to process a wms animation at yearly resolution over 139 
years). Individual wms requests (outside of godiva) are processed much faster 
(5-10 seconds for ncml), but the ncml aggregation still takes 10-100 times 
longer than fmrc.


I assume this has to do with the way the two methods index the files/data. We 
also tried to use dateFormatMark="UL-5-#yyyy-MM-dd" on the ncml aggregation in 
the hope that it would improve the indexing, but the result was the same.
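
For reference, a sketch of where dateFormatMark goes in NcML: it is an attribute of the scan element, not of the aggregation element itself. The location and suffix are the ones from our ncml test dataset above; whether this placement changes how the files are indexed is an assumption we have not been able to confirm:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting" recheckEvery="15 min">
    <!-- The text after '#' is the date pattern matched against each filename,
         e.g. UL-5-yyyy-MM-dd.nc -->
    <scan location="/vagrant/shared/thredds/Test-1/" suffix=".nc" subdirs="false"
          dateFormatMark="UL-5-#yyyy-MM-dd"/>
  </aggregation>
</netcdf>
```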


If anyone has any advice on how to optimize our datasets for Thredds, that 
would be fantastic.


Many thanks,
Aleksander Vines

