
Re: THREDDS/DLESE Connections slides



Steve Hankin wrote:
> 
> Brief comment on the obvious:  It is less important what the agreed
> definition of a "data set" (etc. for "collection", "catalog",
> "directory", etc.) is than that there BE an agreed definition. I suggest
> that someone should circulate an authoritative DODS glossary before the
> meeting.

Sounds like a good idea to me. I've added Paul to the e-mail list so that he
can get this started. I assume that we should not address subsequent messages
re a DODS glossary to the thredds mailing list. Anyone on thredds who thinks
differently, please let me know. If we haven't heard from anyone by noon on 
the 19th we'll go ahead on DODS tech only.


> It could save hours of definitional confusion.
> (Personally I like the simple definition "In a DODS server, a dataset is
> something you can get a DAS and DAP from."  Maybe this should be the def'n
> of a "DODS data set".)

Steve, are you including a file server and/or an Aggregation server here? 
Even if you are, from the perspective of the DODS/NVODS dataset list, this 
definition is quite restrictive. 

> John:  Any thoughts you'd care to share prior to the meeting about the
> potential for a DODS web crawler ("harvester", "scanner", ... more
> glossary issues) automatically to produce a single giant thematic "DODS
> collection" in the THREDDS framework?
> 
>     - steve
> 
> ===========================================
> 
> Peter Cornillon wrote:
> 
> > John Caron wrote:
> > >
> > > Peter Cornillon wrote:
> > >
> > > >>Just to make sure i understand your terminology:
> > > >>
> > > >>files = physical files
> > > >>
> > > >
> > > > YUP
> > > >
> > > >
> > > >>datasets = logical files we want the user to see
> > > >>
> > > >
> > > > I don't think about datasets in a file concept. It could be a group of
> > > > files, a single file,... I guess that the reason that I don't think
> > > > about it that way is that the data need not be in digital form to be
> > > > grouped in a data set. Beach profiles that have been collected over
> > > > the past 50 years and consist of pages of numbers - monthly values of
> > > > depth below mean low water at specified distances from a marker in a
> > > > given direction would qualify. I suppose that your definition is
> > > > correct from a computer perspective, I just don't think of it that way.
> > >
> > > ok, i didnt really mean to use the word "file". how about:
> > >
> > > "a dataset is a logical grouping of data, associated in some
> > > meaningful way from the user's perspective."
> >
> > Yup.
> >
> > > In a DODS server, a dataset is something you can get a DAS and DAP from.
> >
> > Well not really. You can only get a DDS and DAS from a data set IF it is
> > either a single file or has a description in a file server or now in the
> > Aggregation Server.
> >
> > > in THREDDS, a "collection" is a collection of datasets, for which the
> > > above definition also works just fine. so whats the difference between
> > > a dataset and a collection?
> >
> > At URI we have a half dozen SST datasets derived from the AVHRR sensors:
> > one for the area off of Cape Hatteras, another for the Great Lakes, ...
> > Each has on the order of 15,000 passes in it. I assume that you would
> > call the ensemble of these a collection?
> >
> > > this is the same issue that Benno has pointed out: in his DODS
> > > server, there is no distinction between collections and datasets,
> > > because the server seamlessly moves between collections, physical
> > > files, and the fields in the files, presenting a uniform API of
> > > datasets with their DAP and DAS.
> >
> > But, you would be hard pressed to aggregate the things that I call datasets
> > at URI (the Hatteras one with the Great Lakes one) with your Aggregation
> > Server. As I noted in my previous e-mail the actual grouping of data into a
> > dataset is arbitrary, so one could call the collection of datasets at URI a
> > dataset or one could refer to each one as a dataset. One could call all data
> > at a site a data set, or in the extreme, all earth science data accessible
> > via DODS as a dataset.
> >
> > > (I am not going to try to answer the question of what's the difference
> > > between a catalog and a collection yet; hopefully others might have
> > > some ideas)
> > >
> > > in THREDDS, a dataset has a URI, and is the smallest choosable thing in
> > > the catalog.
> >
> > I think that this is pretty much what we refer to as a directory, although
> > we are still working on making a single URL for each dataset described in
> > the various directories.
> >
> > > our goal as middleware is to present the list of dataset choices to the
> > > user very quickly, without having to actually contact the server. once
> > > the user selects a dataset, then the user can expect some delay while a
> > > connection is made to the server, and the "real" dataset metadata is
> > > collected. This implies that the catalog metadata may not be exactly
> > > right at all times (eg the list of available times of the dataset),
> > > which makes life easier for implementors.
> > >
> > > >
> > > >
> > > >>inventory = listing of datasets
> > > >>
> > > >
> > > > No, a listing of datasets is what I refer to as a directory (not a
> > > > directory on a computer). The GCMD is an example of same. An
> > > > inventory is a listing of elements in a data set, it could be a
> > > > list of times for satellite images in an archive along with the
> > > > physical location of the data (tape C18341 on a rack, or
> > > > N861230147.hat in a computer directory on my machine) or a list
> > > > of times and locations of each XBT in an XBT archive.
> > >
> > > so is an inventory an internal thing that the server uses to construct the
> > > datasets that are visible to the outside world?
> >
> > I don't think so. First, it need not be internal. For a long time
> > we maintained inventories of the data sets at JPL. The inventory
> > is simply a list of the contents of a dataset. A dataset can
> > exist without an inventory, in that the dataset is a logical
> > grouping of the data. The GCMD identifies a lot of datasets
> > that to the best of my knowledge do not have inventories. Well,
> > in a sense they do in that they might often comprise all of the
> > files in a directory on a computer, so the directory listing is
> > to some extent an inventory of the data in the dataset.
> >
> > > >>question:
> > > >>what does it mean to "group files into data sets"? like the agg server?
> > > >>
> > > >
> > > > > One might say that all images in this projection, from this satellite,
> > > > > processed this way form a data set. Or one could say that all images in
> > > > this projection, from this suite of satellites processed this way
> > > > form a data set. Or... This is the trouble with data sets, different
> > > > people call different groupings of the data a data set. This caused
> > > > a lot of blood letting between NASA and NOAA a number of years back.
> > > > The idea is NOT to call every granule or every file in the system a
> > > > data set, you know the difference between lumpers and splitters. In
> > > > order for us to make progress, we have to back off a bit and look at
> > > > the big picture, grouping things into data sets allows us to do that.
> > > > This is exactly the problem that the DODS crawler has. When it crawls
> > > > a site such as our satellite archive, it ends up with thousands of
> > > > entries and the system or the person viewing the results struggles
> > > > > with a data overload, more information than s/he/it (humm... have
> > > > to be careful with these gender neutral versions) wants or needs to
> > > > locate the group of files that define the object of interest. Given
> > > > that there is no precise definition for how to group files into a
> > > > data set, I think that we can reduce the amount of information that
> > > > > we have to deal with to a reasonable view of all the data on the
> > > > system without losing much if anything. The crawler is likely to group
> > > > the files slightly differently in some cases than the human would, but
> > > > one could probably discover this pretty quickly and steer the crawler
> > > > if necessary.
> > >
> > > ok, this seems to be similar to the "collections" vs "datasets" issue
> > > above. I think i need to hear Steve's tech presentation before I can
> > > understand this any deeper.
> > >
> > > >
> > > >>Generating "inventories of granules in data sets" makes sense in the
> > > >>context of an agg server, but is there also meaning to it in the
> > > >>context of a normal DODS server?
> > > >>
> > > >
> > > > Not sure exactly what you mean here. We have file servers which are
> > > > inventories of granules in data sets. Actually the terminology is a
> > > > bit loose here also. The server in this case is a DODS FreeForm server.
> > > > It serves a table that contains a list of URLs with the
> > > > characteristic(s) that differentiate one URI from another, time in the
> > > > case of our satellite archives.
> > >
> > > i think some of the problem is that i think of DODS narrowly as a specific
> > > client/server protocol, and you include services and extensions that have
> > > been built with or use that protocol.
> >
> > Yes! The DODS DAP is the thing that defines the low level data access
> > protocol. To use it effectively one needs to add higher level constructs
> > such as the file server.
> >
> > Peter
> > --
> >  Peter Cornillon
> >   Graduate School of Oceanography  - Telephone: (401) 874-6283
> >    University of Rhode Island      -       FAX: (401) 874-6728
> >     Narragansett RI 02882  USA     -  Internet: address@hidden

-- 
 Peter Cornillon                                                       
  Graduate School of Oceanography  - Telephone: (401) 874-6283         
   University of Rhode Island      -       FAX: (401) 874-6728         
    Narragansett RI 02882  USA     -  Internet: address@hidden