John Caron wrote:
>
> Peter Cornillon wrote:
>
> >>Just to make sure i understand your terminology:
> >>
> >>files = physical files
> >>
> >
> > YUP
> >
> >
> >>datasets = logical files we want the user to see
> >>
> >
> > I don't think about datasets in a file concept. It could be a group of
> > files, a single file,... I guess that the reason that I don't think
> > about it that way is that the data need not be in digital form to be
> > grouped in a data set. Beach profiles that have been collected over
> > the past 50 years and consist of pages of numbers - monthly values of
> > depth below mean low water at specified distances from a marker in a
> > given direction would qualify. I suppose that your definition is
> > correct from a computer perspective, I just don't think of it that way.
>
> ok, i didn't really mean to use the word "file". how about:
>
> "a dataset is a logical grouping of data, associated in some meaningful
> way from the user's perspective."
Yup.
> In a DODS server, a dataset is something you can get a DDS and DAS from.
Well, not really. You can only get a DDS and DAS from a data set IF it is
either a single file or has a description in a file server or, now, in the
Aggregation Server.
> in THREDDS, a "collection" is a collection of datasets, for which the above
> definition also works just fine. so what's the difference between a dataset
> and a collection?
At URI we have a half dozen SST datasets derived from the AVHRR sensors:
one for the area off of Cape Hatteras, another for the Great Lakes, ...
Each has on the order of 15,000 passes in it. I assume that you would
call the ensemble of these a collection?
> this is the same issue that Benno has pointed out: in his DODS
> server, there is no distinction between collections and datasets, because the
> server seamlessly moves between collections, physical files, and the fields in
> the files, presenting a uniform API of datasets with their DDS and DAS.
But, you would be hard pressed to aggregate the things that I call datasets
at URI (the Hatteras one with the Great Lakes one) with your Aggregation
Server.
As I noted in my previous e-mail, the actual grouping of data into a dataset
is arbitrary, so one could call the collection of datasets at URI a dataset or
one could refer to each one as a dataset. One could call all data at a site a
dataset or, in the extreme, all earth science data accessible via DODS a
dataset.
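
A minimal sketch of that point, using made-up granule records (the names
and field values below are hypothetical, not taken from the URI archive):
the same granules end up as two datasets or as one, depending only on the
grouping key that is chosen.

```python
from collections import defaultdict

# Hypothetical granule records; names and attributes are illustrative only.
granules = [
    {"name": "pass_0001", "region": "hatteras", "sensor": "AVHRR"},
    {"name": "pass_0002", "region": "hatteras", "sensor": "AVHRR"},
    {"name": "pass_0003", "region": "great_lakes", "sensor": "AVHRR"},
]


def group_into_datasets(records, key):
    """Group granule names into datasets; `key` decides how finely to split."""
    datasets = defaultdict(list)
    for record in records:
        datasets[key(record)].append(record["name"])
    return dict(datasets)


# A "splitter" groups by region, a "lumper" groups by sensor.
by_region = group_into_datasets(granules, key=lambda r: r["region"])
by_sensor = group_into_datasets(granules, key=lambda r: r["sensor"])
```

Neither grouping is more correct than the other; which one is useful
depends on what the user is looking for.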
> (I am not going to try to answer the question of what's the difference
> between a catalog and a collection yet; hopefully others might have some
> ideas)
>
> in THREDDS, a dataset has a URI, and is the smallest choosable thing in the
> catalog.
I think that this is pretty much what we refer to as a directory, although
we are still working on making a single URL for each dataset described in
the various directories.
> our goal as middleware is to present the list of dataset choices to the
> user very quickly, without having to actually contact the server. once the
> user selects a dataset, then the user can expect some delay while a
> connection is made to the server, and the "real" dataset metadata is
> collected. This implies that the catalog metadata may not be exactly right
> at all times (eg the list of available times of the dataset), which makes
> life easier for implementors.
>
> >
> >
> >>inventory = listing of datasets
> >>
> >
> > No, a listing of datasets is what I refer to as a directory (not a
> > directory on a computer). The GCMD is an example of same. An
> > inventory is a listing of elements in a data set, it could be a
> > list of times for satellite images in an archive along with the
> > physical location of the data (tape C18341 on a rack, or
> > N861230147.hat in a computer directory on my machine) or a list
> > of times and locations of each XBT in an XBT archive.
>
> so is an inventory an internal thing that the server uses to construct the
> datasets that are visible to the outside world?
I don't think so. First, it need not be internal. For a long time
we maintained inventories of the data sets at JPL. The inventory
is simply a list of the contents of a dataset. A dataset can
exist without an inventory, in that the dataset is a logical
grouping of the data. The GCMD identifies a lot of datasets
that to the best of my knowledge do not have inventories. Well,
in a sense they do in that they might often comprise all of the
files in a directory on a computer, so the directory listing is
to some extent an inventory of the data in the dataset.
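
To make the three terms concrete, here is a rough sketch of how they relate.
The class and field names are my own assumptions for illustration, not any
DODS or THREDDS API, and the dataset titles and times are invented: a
directory lists datasets, a dataset is a logical grouping, and an inventory
is an optional list of what is in one dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InventoryEntry:
    time: str      # what distinguishes this element, e.g. an image time
    location: str  # where it physically lives: a tape label or a file name


@dataclass
class Dataset:
    title: str
    # A dataset can exist without an inventory, so this may be None.
    inventory: Optional[List[InventoryEntry]] = None


@dataclass
class Directory:
    """A listing of datasets (not a directory on a computer)."""
    datasets: List[Dataset] = field(default_factory=list)

    def titles(self) -> List[str]:
        return [d.title for d in self.datasets]


# Hypothetical entries; the times are placeholders, not real acquisition times.
sst = Dataset("Hatteras SST", inventory=[
    InventoryEntry(time="time-1", location="tape C18341"),
    InventoryEntry(time="time-2", location="N861230147.hat"),
])
profiles = Dataset("Beach profiles")  # a dataset with no inventory
directory = Directory([sst, profiles])
```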
> >>question:
> >>what does it mean to "group files into data sets"? like the agg server?
> >>
> >
> > One might say that all images in this projection, from this satellite,
> > processed this way form a data set. Or one could say that all images in
> > this projection, from this suite of satellites processed this way
> > form a data set. Or... This is the trouble with data sets, different
> > people call different groupings of the data a data set. This caused
> > a lot of blood letting between NASA and NOAA a number of years back.
> > The idea is NOT to call every granule or every file in the system a
> > data set, you know the difference between lumpers and splitters. In
> > order for us to make progress, we have to back off a bit and look at
> > the big picture, grouping things into data sets allows us to do that.
> > This is exactly the problem that the DODS crawler has. When it crawls
> > a site such as our satellite archive, it ends up with thousands of
> > entries and the system or the person viewing the results struggles
> > with a data overload, more information than s/he/it (humm... have
> > to be careful with these gender neutral versions) wants or needs to
> > locate the group of files that define the object of interest. Given
> > that there is no precise definition for how to group files into a
> > data set, I think that we can reduce the amount of information that
> > we have to deal with to a reasonable view of all the data on the
> > system without losing much if anything. The crawler is likely to group
> > the files slightly differently in some cases than the human would, but
> > one could probably discover this pretty quickly and steer the crawler
> > if necessary.
>
> ok, this seems to be similar to the "collections" vs "datasets" issue above.
> I think i need to hear Steve's tech presentation before I can understand
> this any deeper.
>
> >
> >>Generating "inventories of granules in data sets" makes sense in the
> >>context of an agg server, but is there also meaning to it in the context
> >>of a normal DODS server?
> >>
> >
> > Not sure exactly what you mean here. We have file servers which are
> > inventories of granules in data sets. Actually the terminology is a
> > bit loose here also. The server in this case is a DODS FreeForm server.
> > It serves a table that contains a list of URLs with the characteristic(s)
> > that differentiate one URL from another, time in the case of our satellite
> > archives.
>
> i think some of the problem is that i think of DODS narrowly as a specific
> client/server protocol, and you include services and extensions that have been
> built with or use that protocol.
Yes! The DODS DAP is the thing that defines the low level data access
protocol. To use it effectively one needs to add higher level constructs
such as the file server.
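
As a rough illustration of what such a file server provides, the sketch
below models it as a table of (time, URL) rows and selects the URLs that
fall within a time range. The table contents, the example.org URLs and the
function name are all assumptions made up for the example; the real file
server is a DODS FreeForm server, not this code.

```python
from datetime import datetime

# Hypothetical file-server table: one row per granule, keyed by time.
FILE_SERVER_TABLE = [
    (datetime(2001, 1, 1, 6, 0), "http://example.org/dods/sst/granule_a"),
    (datetime(2001, 1, 1, 18, 0), "http://example.org/dods/sst/granule_b"),
    (datetime(2001, 1, 2, 6, 0), "http://example.org/dods/sst/granule_c"),
]


def urls_between(table, start, end):
    """Return the URLs of all granules whose time lies in [start, end]."""
    return [url for time, url in table if start <= time <= end]


# Select every granule from 1 January 2001.
selected = urls_between(
    FILE_SERVER_TABLE,
    datetime(2001, 1, 1, 0, 0),
    datetime(2001, 1, 1, 23, 59),
)
```

A client would then open each selected URL with the low-level protocol to
fetch the data itself.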
Peter
--
Peter Cornillon
Graduate School of Oceanography - Telephone: (401) 874-6283
University of Rhode Island - FAX: (401) 874-6728
Narragansett RI 02882 USA - Internet: pcornillon@xxxxxxxxxxx