[First Draft: 9/15/2016]
[Last updated: 9/16/2016]
The THREDDS Data Server (TDS) was designed to operate in a client-server architecture. Recently, Unidata has moved TDS into the cloud using its existing architecture.
There seems to be agreement inside Unidata that we need to begin rethinking that architecture to adapt to the realities of the cloud.
Proposal 1
This (first) proposal makes an assumption about the nature of the cloud, especially as it is likely to be in the near future.
The assumption is that, rather than having large quantities of data behind a (TDS) server, all data will be stored in cloud storage such as Amazon S3 or Azure blobs.
Secondarily, in such an environment, TDS cannot be aware of all data, because the set of all data is likely to be growing at a fast rate and to be produced by organizations not known to a given TDS server.
In this environment, the role of TDS becomes more that of a locator and transformer of data. That is, TDS must be made aware of some datasets, and then it must apply various computations to that data to produce new derived data and publish it into cloud storage.
Some consequences:
- Unidata may have to get into the data discovery business, something it has tended to avoid so far.
- The new TDS must be organized so that others can extend its capabilities by providing new kinds of computation models.
- It is not clear if protocols such as DAP2, DAP4, CDMremote, etc. will be needed any longer, because clients will be able to access the computed products using the S3 or Blob interfaces. In effect, streaming is replaced by the reification of computations into a file in S3/Blob.
- Asynchronous computations more or less fall out of this proposed architecture if it is possible for a client to poll S3/Blob for some dataset or to get an event notification from the cloud.
- Standardized file formats now become more important than ever. The primary such formats for atmospheric science are, I believe, netcdf3 and netcdf4. The HDF5 format is likely to also become more important, although its complexity vis-a-vis netcdf-4 will IMO hold it back.
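The polling variant of the asynchronous case above might look roughly like the sketch below. Everything in it is illustrative: `wait_for_object`, the `exists` callback, and the key name are hypothetical, and a real client would replace `exists` with a call to its cloud provider's SDK, or subscribe to an event notification instead of polling.

```python
import time
from typing import Callable


def wait_for_object(exists: Callable[[str], bool], key: str,
                    timeout: float = 60.0, interval: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until exists(key) is true or `timeout` seconds have elapsed.

    `exists` is a hypothetical stand-in for a provider-specific
    "does this object exist" call against S3/Blob storage.
    """
    deadline = time.monotonic() + timeout
    while True:
        if exists(key):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Usage with an in-memory stand-in for cloud storage.
published = set()
calls = {"n": 0}


def fake_exists(key: str) -> bool:
    # Pretend the server publishes the derived product on the third poll.
    calls["n"] += 1
    if calls["n"] == 3:
        published.add(key)
    return key in published


found = wait_for_object(fake_exists, "derived/mean_temp.nc",
                        timeout=5.0, interval=0.0, sleep=lambda s: None)
```

The client only needs to know the key under which the derived product will appear; it does not need a streaming connection to the server that performs the computation.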
Some questions:
- Is there room for another (or several) standard file formats?
- Is it possible to define a wrapper API for S3 and Azure blobs and whatever Google and other cloud companies provide? This API would help clients avoid having to lock in to a single provider.
- What is the relation between this proposal and, say, Amazon Lambda, or microservices?
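To make the wrapper API question a bit more concrete, here is a minimal sketch of what a provider-neutral interface could look like. The names `ObjectStore`, `InMemoryStore`, and `publish_derived` are invented for illustration and do not correspond to any existing API; a real version would need one adapter class per provider, plus decisions about authentication, large objects, and partial reads that are ignored here.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class ObjectStore(ABC):
    """Hypothetical provider-neutral interface to cloud object storage."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def exists(self, key: str) -> bool: ...

    @abstractmethod
    def list_keys(self, prefix: str = "") -> List[str]: ...


class InMemoryStore(ObjectStore):
    """Stand-in backend; an S3 or Azure adapter would implement the same methods."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def exists(self, key: str) -> bool:
        return key in self._objects

    def list_keys(self, prefix: str = "") -> List[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def publish_derived(store: ObjectStore, source_key: str, dest_key: str,
                    transform: Callable[[bytes], bytes]) -> None:
    """Read a dataset, apply a computation, and publish the result as a new object."""
    store.put(dest_key, transform(store.get(source_key)))


store = InMemoryStore()
store.put("raw/obs.txt", b"1 2 3")
publish_derived(store, "raw/obs.txt", "derived/obs_sum.txt",
                lambda raw: str(sum(int(x) for x in raw.decode().split())).encode())
```

Code written against `ObjectStore` never names a provider, which is the lock-in protection the question is after. `publish_derived` also shows the "locator and transformer" role in miniature: the derived product is reified as a new object rather than streamed.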
[9/16/2016]
Notes on Services to be Provided
Catalog
Our current catalog system assumes that there is some set of datasets over which we have control and knowledge. As a rule, that set is the set of datasets on the THREDDS server machine.
Under this proposal, this becomes less true. There may be no such set. Let us propose instead that we provide an umbrella catalog in which others can ask to have their datasets included. Additionally, others might ask to have their catalogs grafted onto our catalog tree. In any case, we are effectively talking about a federated catalog.
The value added is that we become the place to go to locate datasets. A consequence is that it becomes incumbent on us to:
- Make searching our catalogs easy and support sophisticated searches.
- Provide our catalog in a variety of formats, such as in the form of a set of relational tables.
- Provide the ability to crack datasets to obtain additional information for our catalogs.
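As a rough sketch of the umbrella catalog idea, the following toy model supports registering individual datasets, grafting someone else's catalog onto the tree, searching by metadata, and flattening the tree into rows suitable for a relational table. The class names, the metadata keys, and the example URLs are all made up for illustration; this is not the existing THREDDS catalog schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class DatasetEntry:
    name: str
    url: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Catalog:
    name: str
    datasets: List[DatasetEntry] = field(default_factory=list)
    children: List["Catalog"] = field(default_factory=list)

    def register(self, entry: DatasetEntry) -> None:
        """Include a single externally owned dataset in this catalog."""
        self.datasets.append(entry)

    def graft(self, other: "Catalog") -> None:
        """Attach another organization's catalog as a subtree."""
        self.children.append(other)

    def walk(self, path: str = "") -> Iterator[Tuple[str, DatasetEntry]]:
        """Yield (catalog path, dataset) pairs for the whole tree."""
        here = f"{path}/{self.name}"
        for entry in self.datasets:
            yield here, entry
        for child in self.children:
            yield from child.walk(here)

    def search(self, **criteria: str) -> List[str]:
        """Return full paths of datasets whose metadata matches every criterion."""
        return [f"{path}/{entry.name}" for path, entry in self.walk()
                if all(entry.metadata.get(k) == v for k, v in criteria.items())]

    def to_rows(self) -> List[Tuple[str, str, str]]:
        """Flatten the tree into (path, name, url) rows, e.g. for a relational table."""
        return [(path, entry.name, entry.url) for path, entry in self.walk()]


umbrella = Catalog("unidata")
umbrella.register(DatasetEntry("gfs", "s3://example-bucket/gfs.nc",
                               {"format": "netcdf4"}))
partner = Catalog("partner")
partner.register(DatasetEntry("radar", "https://example.org/radar.nc",
                              {"format": "netcdf3"}))
umbrella.graft(partner)
hits = umbrella.search(format="netcdf3")
```

Note that the umbrella catalog holds only names, locations, and metadata; the datasets themselves stay wherever their owners put them.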
CDM
We also need to think about the role of CDM in this proposal. Currently, CDM is our UNCOL (historical reference) in that CDM is the common model that allows us to separate the dataset format from the users of that dataset. That is, an IOSP maps some data format to CDM, and then tools can be defined in terms of CDM to avoid having to know about all the actual data formats. This is a very powerful approach and we should not discard it.
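The separation described above can be illustrated with a deliberately tiny model. In this sketch, `CommonDataset` stands in for CDM and the two reader functions stand in for IOSPs; both text formats are invented for the example and have nothing to do with the real netCDF or CDM APIs. The point is only that the `mean` tool is written once, against the common model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CommonDataset:
    """Toy stand-in for a CDM dataset: named variables holding numeric values."""
    variables: Dict[str, List[float]]


def read_csv_like(text: str) -> CommonDataset:
    """Hypothetical reader: a header line of names, then comma-separated rows."""
    lines = [ln for ln in text.strip().splitlines() if ln]
    names = lines[0].split(",")
    columns: Dict[str, List[float]] = {n: [] for n in names}
    for ln in lines[1:]:
        for name, value in zip(names, ln.split(",")):
            columns[name].append(float(value))
    return CommonDataset(columns)


def read_key_value(text: str) -> CommonDataset:
    """Hypothetical reader: one 'name: v1 v2 ...' line per variable."""
    columns: Dict[str, List[float]] = {}
    for ln in text.strip().splitlines():
        name, values = ln.split(":", 1)
        columns[name.strip()] = [float(v) for v in values.split()]
    return CommonDataset(columns)


# Registry mapping a format name to the reader that understands it.
READERS: Dict[str, Callable[[str], CommonDataset]] = {
    "csv": read_csv_like,
    "kv": read_key_value,
}


def open_dataset(fmt: str, text: str) -> CommonDataset:
    return READERS[fmt](text)


def mean(ds: CommonDataset, var: str) -> float:
    """A tool defined purely in terms of the common model."""
    values = ds.variables[var]
    return sum(values) / len(values)


a = open_dataset("csv", "temp,pressure\n10,1000\n20,1010\n")
b = open_dataset("kv", "temp: 10 20\npressure: 1000 1010\n")
```

Adding a new format means adding one reader to `READERS`; no tool has to change.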
Subset Services
Data subsetting, in the form of NCSS and the DAP (2, 4) constraint languages, is an additional service we provide that will continue to be important in any new architecture. In fact, I think this architecture would make it easier to pull subsetting out as a separate set of services. [Needs more thought].
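As a placeholder for what a stand-alone subsetting service might do at its core, here is a toy sketch that selects named variables and an index range from a dataset held as a dictionary of lists. The `subset` function and its parameters are hypothetical; it does not implement NCSS or the DAP constraint syntax, only the general idea of projecting variables and slicing along an index.

```python
from typing import Dict, List, Optional, Sequence


def subset(variables: Dict[str, List[float]], names: Sequence[str],
           start: int = 0, stop: Optional[int] = None) -> Dict[str, List[float]]:
    """Return only the requested variables, restricted to indices [start, stop)."""
    missing = [n for n in names if n not in variables]
    if missing:
        raise KeyError(f"unknown variables: {missing}")
    return {n: variables[n][start:stop] for n in names}


data = {
    "time": [0.0, 1.0, 2.0, 3.0],
    "temp": [10.0, 11.0, 12.0, 13.0],
    "pressure": [1000.0, 1001.0, 1002.0, 1003.0],
}
result = subset(data, ["time", "temp"], start=1, stop=3)
```

Under this proposal, the result of such a request would presumably be written to cloud storage as a new file rather than streamed back to the client.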
[More thoughts will be added as they occur to me]