Brief comment on the obvious: It is less important what the agreed definition
of a "data set" (and likewise of a "collection", "catalog", "directory", etc.)
is than that there BE an agreed definition. I suggest that someone should
circulate an authoritative DODS glossary before the meeting. It could save
hours of definitional confusion. (Personally I like the simple definition "In
a DODS server, a dataset is something you can get a DAS and DAP from." Maybe
this should be the def'n of a "DODS data set".)
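To make that working definition concrete, here is a minimal sketch. It assumes the usual DODS convention that a dataset's attribute (DAS) and structure (DDS) responses are requested by appending `.das` and `.dds` to the dataset URL. The function name `dods_metadata_urls` and the example URL are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def dods_metadata_urls(dataset_url):
    """Build the DAS and DDS request URLs for a DODS dataset URL.

    Assumption: the server follows the suffix convention (.das / .dds).
    Any constraint expression (query string) on the input is dropped.
    """
    parts = urlsplit(dataset_url)
    return {
        "das": urlunsplit((parts.scheme, parts.netloc, parts.path + ".das", "", "")),
        "dds": urlunsplit((parts.scheme, parts.netloc, parts.path + ".dds", "", "")),
    }


# Hypothetical dataset URL, for illustration only.
urls = dods_metadata_urls("http://example.org/dods/sst/hatteras.nc")
print(urls["das"])
print(urls["dds"])
```

Under this reading, a "DODS data set" is anything for which both of these requests succeed on the server.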
John: Any thoughts you'd care to share prior to the meeting about the
potential for a DODS web crawler ("harvester", "scanner", ... more glossary
issues) to automatically produce a single giant thematic "DODS collection" in
the THREDDS framework?
- steve
==========================================
Peter Cornillon wrote:
> John Caron wrote:
> >
> > Peter Cornillon wrote:
> >
> > >>Just to make sure i understand your terminology:
> > >>
> > >>files = physical files
> > >>
> > >
> > > YUP
> > >
> > >
> > >>datasets = logical files we want the user to see
> > >>
> > >
> > > I don't think about datasets in a file concept. It could be a group of
> > > files, a single file,... I guess that the reason that I don't think
> > > about it that way is that the data need not be in digital form to be
> > > grouped in a data set. Beach profiles that have been collected over
> > > the past 50 years and consist of pages of numbers - monthly values of
> > > depth below mean low water at specified distances from a marker in a
> > > given direction would qualify. I suppose that your definition is
> > > correct from a computer perspective, I just don't think of it that way.
> >
> > ok, i didn't really mean to use the word "file". how about:
> >
> > "a dataset is a logical grouping of data, associated in some meaningful way
> > from the user's perspective."
>
> Yup.
>
> > In a DODS server, a dataset is something you can get a DAS and DAP from.
>
> Well not really. You can only get a DDS and DAS from a data set IF it is
> either a single file or has a description in a file server or now in the
> Aggregation Server.
>
> > in THREDDS, a "collection" is a collection of datasets, for which the above
> > definition also works just fine. so what's the difference between a dataset
> > and a collection?
>
> At URI we have a half dozen SST datasets derived from the AVHRR sensors:
> one for the area off of Cape Hatteras, another for the Great Lakes, ...
> Each has on the order of 15,000 passes in it. I assume that you would
> call the ensemble of these a collection?
>
> > this is the same issue that Benno has pointed out: in his DODS
> > server, there is no distinction between collections and datasets, because
> > the server seamlessly moves between collections, physical files, and the
> > fields in the files, presenting a uniform API of datasets with their DAP
> > and DAS.
>
> But, you would be hard pressed to aggregate the things that I call datasets
> at URI (the Hatteras one with the Great Lakes one) with your Aggregation
> Server.
> As I noted in my previous e-mail, the actual grouping of data into a dataset
> is arbitrary, so one could call the collection of datasets at URI a dataset
> or one could refer to each one as a dataset. One could call all data at a
> site a data set, or in the extreme, all earth science data accessible via
> DODS a dataset.
>
> > (I am not going to try to answer the question of what's the difference
> > between a catalog and a collection yet; hopefully others might have some
> > ideas)
> >
> > in THREDDS, a dataset has a URI, and is the smallest choosable thing in the
> > catalog.
>
> I think that this is pretty much what we refer to as a directory, although
> we are still working on making a single URL for each dataset described in
> the various directories.
>
> > our goal as middleware is to present the list of dataset choices to the
> > user very quickly, without having to actually contact the server. once the
> > user selects a dataset, then the user can expect some delay while a
> > connection is made to the server, and the "real" dataset metadata is
> > collected. This implies that the catalog metadata may not be exactly right
> > at all times (eg the list of available times of the dataset), which makes
> > life easier for implementors.
> >
> > >
> > >
> > >>inventory = listing of datasets
> > >>
> > >
> > > No, a listing of datasets is what I refer to as a directory (not a
> > > directory on a computer). The GCMD is an example of same. An
> > > inventory is a listing of elements in a data set, it could be a
> > > list of times for satellite images in an archive along with the
> > > physical location of the data (tape C18341 on a rack, or
> > > N861230147.hat in a computer directory on my machine) or a list
> > > of times and locations of each XBT in an XBT archive.
> >
> > so is an inventory an internal thing that the server uses to construct the
> > datasets that are visible to the outside world?
>
> I don't think so. First, it need not be internal. For a long time
> we maintained inventories of the data sets at JPL. The inventory
> is simply a list of the contents of a dataset. A dataset can
> exist without an inventory, in that the dataset is a logical
> grouping of the data. The GCMD identifies a lot of datasets
> that to the best of my knowledge do not have inventories. Well,
> in a sense they do in that they might often comprise all of the
> files in a directory on a computer, so the directory listing is
> to some extent an inventory of the data in the dataset.
>
> > >>question:
> > >>what does it mean to "group files into data sets"? like the agg server?
> > >>
> > >
> > > One might say that all images in this projection, from this satellite,
> > > processed this way form a data set. Or one could say that all images in
> > > this projection, from this suite of satellites processed this way
> > > form a data set. Or... This is the trouble with data sets, different
> > > people call different groupings of the data a data set. This caused
> > > a lot of blood letting between NASA and NOAA a number of years back.
> > > The idea is NOT to call every granule or every file in the system a
> > > data set, you know the difference between lumpers and splitters. In
> > > order for us to make progress, we have to back off a bit and look at
> > > the big picture, grouping things into data sets allows us to do that.
> > > This is exactly the problem that the DODS crawler has. When it crawls
> > > a site such as our satellite archive, it ends up with thousands of
> > > entries and the system or the person viewing the results struggles
> > > with a data overload, more information than s/he/it (hmm... have
> > > to be careful with these gender neutral versions) wants or needs to
> > > locate the group of files that define the object of interest. Given
> > > that there is no precise definition for how to group files into a
> > > data set, I think that we can reduce the amount of information that
> > > we have to deal with to a reasonable view of all the data on the
> > > system without losing much if anything. The crawler is likely to group
> > > the files slightly differently in some cases than the human would, but
> > > one could probably discover this pretty quickly and steer the crawler
> > > if necessary.
> >
> > ok, this seems to be similar to the "collections" vs "datasets" issue
> > above. I think i need to hear Steve's tech presentation before I can
> > understand this any deeper.
> >
> > >
> > >>Generating "inventories of granules in data sets" makes sense in the
> > >>context of an agg server, but is there also meaning to it in the context
> > >>of a normal DODS server?
> > >>
> > >
> > > Not sure exactly what you mean here. We have file servers which are
> > > inventories of granules in data sets. Actually the terminology is a
> > > bit loose here also. The server in this case is a DODS FreeForm server.
> > > It serves a table that contains a list of URLs with the characteristic(s)
> > > that differentiate one URI from another, time in the case of our satellite
> > > archives.
> >
> > i think some of the problem is that i think of DODS narrowly as a specific
> > client/server protocol, and you include services and extensions that have
> > been built with or use that protocol.
>
> Yes! The DODS DAP is the thing that defines the low level data access
> protocol. To use it effectively one needs to add higher level constructs
> such as the file server.
>
> Peter
> --
> Peter Cornillon
> Graduate School of Oceanography - Telephone: (401) 874-6283
> University of Rhode Island - FAX: (401) 874-6728
> Narragansett RI 02882 USA - Internet: pcornillon@xxxxxxxxxxx