Some 15 or so years ago I worked at NCAR's Atmospheric Chemistry Division (ACD), integrating chemistry models into climate models from NCAR's Climate and Global Dynamics Division (CGD). At the time all the models ran exclusively on vector supercomputers; at NCAR these were CRAY machines, and the output was stored in CRAY-specific files written directly from the Fortran model code. Everything was (and still is) about performance, because that's what limits our ability to do science in global modeling. CGD's main model was the Community Climate System Model (CCSM), and I learned how to read those files on non-CRAY machines like IBM AIX and SGI workstations.
Someone higher up the food chain understood that the non-portability of those output files was a problem, and a big "model data issues" meeting was held to get input from various groups. After some days of the usual presentations and discussions, a bewildered old-timer finally stood up and asked the right question: "What's wrong with Fortran unformatted writes?"
It's the right question, because unless you really understand the answer, you will repeat some version of the mistake of using Fortran unformatted writes for files that other people would like to use. Fortran unformatted writes are not portable between machines because they use the machine's native representation of the numbers being written out. These days that boils down to just big- vs. little-endianness, but in those days CRAY had its own floating-point representation that was not IEEE. The format was also Operating System (OS) dependent (in those days, hardware and OS were not really separate); e.g., CRAY added something called "COS blocking" that you had to know about. The encoding of Fortran unformatted writes could also depend on which Fortran compiler you were using, and the real kicker was that there was no foolproof way to interpret the bytes on disk (was that one I4 value or two I2 values?). Best practice was to obtain the subroutine that wrote the file and convert the Fortran write statement into a read statement. So the file was hardware, OS, compiler, and program dependent.
The old-timer knew all of that; it wasn't a naive question, rather a question about the relative priorities of portability vs. performance. NetCDF (and many other formats) had already solved the portability issues, but it was slower than unformatted writes. More subtly, his question was also a statement along the lines of "our tools already know how to read the files that we write, so why should we do something different?" The short answer is: because important data needs to be shared. These days the entire climate community uses netCDF as an exchange format to compare models with each other.
NetCDF creates portable data files: data files that can be read by different programs, in different languages, on different OSes and hardware architectures. But wait, there's more! As quoted in my previous blog entry, Jim Gray and colleagues claimed that data independence is needed for the next generation of scientific data management:
Physical data independence comes in many different forms. However, in all cases the goal is to be able to change the underlying physical data organization without breaking any application programs that depend on the old data format. ... Modern database systems also provide logical data independence that insulates programs from changes to the logical database design – allowing designers to add or delete relationships and to add information to the database. While physical data independence is used to hide changes in the physical data organizations, logical data independence hides changes in the logical organization of the data.
So exactly how far does netCDF go with physical and logical data independence? It's fair to say that netCDF provides physical data independence, in the sense that applications using the netCDF API do not need to know how the file is organized on disk. The netCDF-4 file format allows one to reorganize the data on disk (e.g., changing the chunking and compression), and the application gets exactly the same results.
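To make that concrete, here is a minimal sketch using the netCDF-Java API; the file and variable names are hypothetical. The same code returns the same values whether the underlying netCDF-4 file was written contiguously or later rechunked and compressed (for example with the nccopy utility):

    import ucar.ma2.Array;
    import ucar.nc2.NetcdfFile;
    import ucar.nc2.Variable;

    public class ReadByName {
      public static void main(String[] args) throws Exception {
        // Open the file and read a variable by name; the application never sees
        // how the bytes are chunked, compressed, or laid out on disk.
        NetcdfFile ncfile = NetcdfFile.open("model_output.nc");
        try {
          Variable v = ncfile.findVariable("temperature"); // hypothetical variable name
          Array data = v.read();
          System.out.println("read " + data.getSize() + " values");
        } finally {
          ncfile.close();
        }
      }
    }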
Aggregation is the term used by the netCDF-Java library for combining multiple files into a single logical dataset. In this case, the user no longer needs to know the physical filenames, or in which file any section of the data resides. This takes physical data independence another step.
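For example, a "joinExisting" aggregation in NcML presents several files as one logical dataset along the time dimension; the filenames below are hypothetical, and the reading application simply opens the NcML as if it were a single file:

    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <!-- the three files appear as one dataset with a single, combined time dimension -->
      <aggregation dimName="time" type="joinExisting">
        <netcdf location="model_output_2009.nc"/>
        <netcdf location="model_output_2010.nc"/>
        <netcdf location="model_output_2011.nc"/>
      </aggregation>
    </netcdf>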
Logical data independence, however, is another story. Database users think of database views when they talk about logical independence. Let's define the logical schema as the user-visible view of the data, which determines which queries and method calls are valid and what they should return. In databases, the logical schema is the set of tables and their columns. Database administrators can change the tables, combining or breaking them up for performance reasons or to eliminate redundancy. By creating virtual tables with views, they can prevent legacy applications from breaking.
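As a sketch of what that looks like in SQL (the table and column names are made up), a view lets old queries keep working after the physical tables change:

    -- The physical data was split into two tables, but legacy applications
    -- can still query "station_obs" exactly as they did before.
    CREATE VIEW station_obs AS
      SELECT s.name, s.lat, s.lon, o.obs_time, o.temperature
      FROM stations s
      JOIN obs o ON o.station_id = s.id;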
In netCDF, the logical schema is the variables, dimensions, and attributes (along with groups in netCDF-4). The low-level netCDF API doesn't allow you to change these without breaking legacy applications, unless those applications were written in a smart way. NcML provides a modest amount of machinery that has some similarities to logical views (see the sketch below). But index-based data access limits how far we can achieve logical data independence; one example of this is detailed here. Generally, the netCDF API exposes the data arrays in index space, and you can't rearrange these without breaking an application that uses this index-based API. (A coordinate-based API, I think, is the way to get logical data independence. More on that another day.)
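As an illustration of that modest machinery, NcML can wrap a file and rename a variable or add attributes without rewriting the data on disk; applications that open the NcML see the new logical schema (the names below are hypothetical):

    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
            location="model_output.nc">
      <!-- present the underlying variable "T" under a new name, and add a units attribute -->
      <variable name="temperature" orgName="T">
        <attribute name="units" value="K"/>
      </variable>
    </netcdf>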
One way to state the problem with Fortran unformatted files is that the application "just has to know" exactly how the data is stored on disk (i.e., that information is hardcoded into the reader). Today everybody understands that's not tenable for scientific data that needs to be shared, and portability and physical data independence are solved problems. So why is data still so hard to use? In part because of the attitude among data producers that "if we can read our own data, then our job is done". Just as one can write spaghetti code in any language, one can write unshareable files in any format.
Stay tuned for a future installment where the guilty shall be named, and the mighty shall tremble in their shame. Or not.
NetCDF provides logical and physical data independence, in that programs that use documented interfaces are insulated from the addition of new dimensions, variables, and attributes, a desirable property I like to call obliviousness.
Unlike programs that read unformatted binary data, netCDF client programs use the names of variables, attributes, and dimensions rather than their position or sequential order, which means such programs are also oblivious to changes in the order in which named objects are defined.
This is not so different from databases, in which new fields may be added to relations, and new relations may be added to a schema without affecting existing programs. Immunity from schema additions is an important aspect of logical data independence. It means that schemas can safely evolve, and information from two sources can sometimes be merged without breaking programs written before the merger.
Sort of like 3D glasses, for a loose enough interpretation of "sort of" ...
Posted by Russ Rew on June 02, 2011 at 02:33 PM MDT