LEAD Status Report

LEAD at Unidata

Overview

LEAD is an NSF Large ITR project in which nine institutions are working to create an integrated, scalable framework in which meteorological analysis tools, forecast models, and data repositories can operate as dynamically adaptive, on-demand, grid-enabled systems. For more information see http://portal.leadproject.org/.

The two key goals for LEAD are:

1) To democratize the availability of advanced weather technologies for research and education, lowering the barrier to entry, empowering application in a grid context, increasing the realism of how technologies are applied and facilitating rapid understanding, experiment design and execution of complex end-to-end weather analysis and prediction systems.

2) To improve our understanding of and ability to detect, analyze and predict mesoscale atmospheric phenomena by interacting with the weather in a dynamically adaptive manner.

The LEAD effort at Unidata includes:

Development, deployment and maintenance of a Grid and Web services test bed for LEAD technologies, as well as data storage. This includes:

Running automatically steered WRF jobs
Provision of a four-month data archive of the seven LEAD canonical datasets (see Data Description for LEAD 7 Datasets)
Storage for data and other information generated via LEAD orchestrations

Development of the THREDDS Data Repository (TDR), a storage archive that integrates with the THREDDS Data Server (TDS) to provide easy access to data, which includes:

Data movement into and out of an archive
Support for a variety of storage media, including mass storage
Generation and/or enhancement of metadata

Development and maintenance of a crosswalk that translates THREDDS metadata into LEAD metadata
Installation and testing of existing assimilation packages, forecast models and associated tools on the Unidata LEAD test bed as well as hosts at other institutions such as supercomputers at NCSA
Ensuring integration across relevant Unidata technologies: especially LDM, TDS, and IDV
Providing an interface between the Unidata community and LEAD and leveraging our community building skills to help LEAD to develop its own community
Providing expertise in successful software development and deployment to help LEAD succeed

Status Update March 12th, 2007

Unidata Policy Committee Meeting

We anticipate that the future direction of the LEAD effort at Unidata will be discussed at the Unidata Policy Committee Meeting, March 12-13, 2007. To facilitate that discussion, this report takes a broader, "beyond Unidata" view of the current state of the LEAD effort than is typical.

Near Term LEAD Goals

LEAD has identified two aggressive goals for the spring of 2007. The first goal is to support the WxChallenge Collegiate Forecast Contest, a collegiate weather forecasting competition. The second goal is to support the CAPS Spring Experiment, which itself has three primary thrusts. The first thrust involves launching ensemble forecasts to study areas of deep convection. Ensemble forecasts make it possible to represent uncertainty in model initial conditions and to quantify uncertainty in model output. The second thrust provides for dynamic forecasts that are triggered by the receipt of tornado watches and warnings. In the third thrust, forecasters will use the LEAD portal to define forecast domains and launch forecasts on demand.
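
As a rough illustration of how an ensemble quantifies uncertainty in model output, the sketch below summarizes one forecast quantity across ensemble members by its mean and spread. The function name and the temperature values are invented for illustration and do not come from any LEAD or CAPS run.

```python
import statistics


def ensemble_summary(member_values):
    """Return (mean, spread) for one forecast quantity across ensemble members.

    Spread is the sample standard deviation, a simple and common measure of
    how much the members disagree.
    """
    mean = statistics.mean(member_values)
    spread = statistics.stdev(member_values)
    return mean, spread


# Made-up 2 m temperatures (K) from three hypothetical ensemble members.
mean, spread = ensemble_summary([281.0, 283.0, 285.0])
print(f"mean={mean:.1f} K, spread={spread:.1f} K")
```

A small spread suggests the members agree; a large spread flags a forecast that is sensitive to its initial conditions.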

Beta Users Program

The UPC is playing a leading role in the deployment of the LEAD system in the atmospheric science community. Following on the success of last summer’s Users Workshop, we have continued to identify additional faculty and students interested in joining the beta users program. Unidata continues to spearhead this program, which is intended to give participants access to the software and support in return for testing the system and providing feedback. Testing has been expanded beyond the internal LEAD team to include groups of students from Millersville and Howard universities. These students have been testing LEAD capabilities intended to support the WxChallenge Collegiate Forecast Contest. Using our existing infrastructure, we have set up a LEAD support venue (support@unidata.ucar.edu) and a users' e-mail list (leadusers@unidata.ucar.edu), both of which have seen significant traffic. Thus far, 50 support questions have been received and answered over a two-month period, resulting in many bug reports and feature requests being brought to the LEAD team. Unidata helps users directly and also filters and vets bug reports.

Unidata LEAD Test Bed Status

We are maintaining a rolling archive of at least 120 days of each of the seven LEAD canonical datasets, and in some cases more (see Data Description for LEAD 7 Datasets). The archive also holds at least 120 days of the remaining IDD feeds. In addition, we are maintaining a smaller archive of ADAS and steered WRF model output. The current data volume is close to 24 terabytes. This volume is impractical to back up, so the data are at risk of loss. Data volumes continue to create technical problems, and the UPC LEAD team is working with the THREDDS group to investigate them and identify the best solution. Given the data volumes and complexities, the UPC LEAD test bed has been an excellent test case for studying the scalability of many of Unidata's technologies and strategies.
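
The retention rule behind a rolling archive can be sketched as follows. This is a minimal illustration that assumes each file carries a known date; the function and variable names are hypothetical and this is not the test bed's actual scouring mechanism.

```python
from datetime import date, timedelta

RETENTION_DAYS = 120  # minimum retention stated in this report


def split_by_retention(file_dates, today, retention_days=RETENTION_DAYS):
    """Partition {filename: date} into (keep, expire) lists of filenames.

    A file is kept if its date falls within the last `retention_days` days,
    counting the cutoff day itself as still inside the window.
    """
    cutoff = today - timedelta(days=retention_days)
    keep = sorted(name for name, d in file_dates.items() if d >= cutoff)
    expire = sorted(name for name, d in file_dates.items() if d < cutoff)
    return keep, expire


# Hypothetical file names, dated relative to the report date.
today = date(2007, 3, 12)
files = {
    "recent.grib": today - timedelta(days=10),
    "old.grib": today - timedelta(days=200),
}
keep, expire = split_by_retention(files, today)
print("keep:", keep, "expire:", expire)
```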

The test bed is using the latest TDS technologies, which include the ability to work with native GRIB files as well as radar Level II and Level III formats. This capability facilitates comparisons of WRF model predictions with radar observations. The TDS can also serve a file directly via http, subset a dataset and return the result as a CF convention netCDF file (a feature we are encouraging our colleagues in LEAD to make greater use of), catalog gridftp availability of files, and provide a WCS interface to gridded data files. All these capabilities are useful and desirable in the LEAD context. The UPC test bed is integrated into the LEAD workflow system to provide initial and boundary conditions for real-time and retrospective steered WRF predictions. Until recently it was also being used to store model output. Our partners at Indiana University have now set up a TDS on their system to store model output as well as intermediate products used by the workflow system.

There are two TDS catalogs that provide access to the data. The primary catalog provides complete access to all the IDD data. The other catalog is the operational LEAD top catalog. This catalog does not yet include radar data because the volumes are too great for the current LEAD software; the LEAD team is working on a strategy for handling them.

Recently, all of the test bed nodes were upgraded to a gigabit internet connection. This addresses a gridftp overload problem that was revealed at the LEAD Lab day at the Unidata workshop last summer.

TDR

Synopsis

The THREDDS Data Repository (TDR) is a repository space to store data and other items and their associated metadata. Users can upload data and metadata to the repository. The TDR will locate space to put the data, move the data into the repository, and generate catalogs containing both externally provided and internally generated metadata.

The TDR is integrated with the THREDDS Data Server (TDS), so all TDS functionality for serving data is available for items stored in the repository. The TDR complements TDS technology by providing a means to populate a repository of data that can be served via the TDS. TDR requirements are influencing TDS design and development by providing new use cases involving dynamically generated catalogs and catalog "editing" capabilities to support maintenance of catalogs and metadata.

As with the TDS, although the UPC provides a TDR for use by designated projects, the TDR is intended to be deployed by other institutions so that they can create and administer their own repositories.

TDR Use Cases

At this time TDR development is being steered by two use cases. The Next Generation Case Study project is a case study repository in which archive designers interactively arrange and store items related to a case study, such as data, notes, images, IDV bundles, etc., and make these studies available to their community. Also, the LEAD project needs storage and access for items relevant to a user's experiment. The latter includes items involved in running an orchestration, such as input, output, and intermediate files, but also includes items that a user wants to publish.

These use cases have in common the need for a repository space that: provides data storage, can be structured by the client, provides integrated metadata management, and can serve the data.

The Current State of the TDR

In the Unidata TDR deployment, the repository is subdivided on a per project basis, currently the Next Generation Case Study project and the LEAD project. Each project has a separate partition in the repository space. Each project also has a different front end to the repository in the form of a servlet interface. Within a project, clients can store data and create catalogs in a hierarchically structured manner of their own design. A client can add or remove nodes within their space.
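
The client-defined hierarchy described above can be pictured as a simple tree in which nodes are added and removed by path. The sketch below is illustrative only; the class name, method names, and paths are hypothetical and are not taken from the TDR code.

```python
class RepoNode:
    """A node in a hypothetical per-project repository tree."""

    def __init__(self, name):
        self.name = name
        self.children = {}

    def add(self, path):
        """Create any missing nodes along a '/'-separated path; return the last one."""
        node = self
        for part in path.strip("/").split("/"):
            node = node.children.setdefault(part, RepoNode(part))
        return node

    def remove(self, path):
        """Remove the node at `path` and its subtree; return True if it existed."""
        *parents, leaf = path.strip("/").split("/")
        node = self
        for part in parents:
            node = node.children.get(part)
            if node is None:
                return False
        return node.children.pop(leaf, None) is not None

    def paths(self, prefix=""):
        """List every node path below this node, parents before children."""
        out = []
        for name in sorted(self.children):
            full = f"{prefix}/{name}"
            out.append(full)
            out.extend(self.children[name].paths(full))
        return out


# One tree per project partition; the node names here are invented.
lead = RepoNode("lead")
lead.add("experiment1/input")
lead.add("experiment1/output")
print(lead.paths())
```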

The Next Generation Case Study project requires an interactive interface. Users communicate with the server via a web input form. This form provides a means to specify a data source, enter metadata, and also provide information about structuring the storage space. Users can add or delete nodes in this space via this input form. Once stored, the data is browsable and retrievable via the TDS.

The LEAD orchestration system is based on a Service Oriented Architecture (SOA). Thus the LEAD interface to the TDR must provide a Web API. Early versions will provide a simple http interface, but later versions will likely need to provide a SOAP interface and a WSDL service description. This interface is under development. More details about its design are given below.

Features of the TDR include:

Incorporation of externally provided metadata
Generation of new metadata. In particular, using the Common Data Model, the TDR generates access URLs based on what it can determine about a file's format. For example, if the file is known to be a grid and it can be served by OPeNDAP, then it can also be served via the TDS WCS server, and the appropriate access URLs are generated.
Storage of both individual files and collections (currently in the form of a tar.gz file). Collections are unpacked and an associated catalog structure is generated. This allows movement of groups of files into the repository via a single operation.
Data movement into the repository. Currently http and gridftp file movement protocols are supported.
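
The access URL rule given in the feature list above (a grid that can be served by OPeNDAP can also be served via WCS) can be sketched as follows. This is a minimal illustration, not the TDR implementation: the function name, the base URL, and the URL path segments are assumptions made for the example.

```python
def access_urls(base_url, path, is_grid, opendap_ok):
    """Return {service: url} for a stored file under the rule described above.

    All URL layouts here are assumed for illustration.
    """
    # Plain http download is assumed to be available for every stored file.
    urls = {"http": f"{base_url}/fileServer/{path}"}
    if opendap_ok:
        urls["opendap"] = f"{base_url}/dodsC/{path}"
        # WCS is offered only for grids that OPeNDAP can already serve.
        if is_grid:
            urls["wcs"] = f"{base_url}/wcs/{path}"
    return urls


# Hypothetical server and file path.
for service, url in access_urls(
    "http://example.org/thredds", "lead/wrf/run1.nc", is_grid=True, opendap_ok=True
).items():
    print(service, url)
```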

A prototype of this server is available on the LEAD test bed.

TDR Next Tasks

Tomcat security will be implemented to authenticate and authorize repository writers. This will prevent unauthorized users from uploading material to the server. There are no current plans to authenticate readers.

The LEAD interface to the TDR will be expanded. A prototype client will be built in order to explore and test this programmatic interface. LEAD inputs to the server will include metadata generated by the orchestration system, plus user information such as certificates required to perform Grid operations. The TDR may generate additional metadata. The TDR will return a handle to the data that the orchestration system can use for later access. The orchestration system could then query the TDR for data access options, for example choosing gridFTP retrieval.
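
Because this interface is still under development, the following is only a sketch of the interaction pattern described above: the client stores an item with its metadata, receives a handle, and later asks which access options exist for that handle. The class, method names, and metadata fields are all hypothetical, and an in-memory dictionary stands in for the real server.

```python
import uuid


class RepositorySketch:
    """In-memory stand-in for a repository that hands out handles."""

    def __init__(self):
        self._items = {}

    def store(self, metadata, protocols):
        """Record an item and return an opaque handle for later access."""
        handle = uuid.uuid4().hex
        self._items[handle] = {
            "metadata": dict(metadata),
            "protocols": list(protocols),
        }
        return handle

    def access_options(self, handle):
        """Return the access protocols available for a stored item."""
        return list(self._items[handle]["protocols"])


def choose_protocol(options, preferred=("gridftp", "http")):
    """Pick the first preferred protocol that the repository offers, else None."""
    for protocol in preferred:
        if protocol in options:
            return protocol
    return None


repo = RepositorySketch()
handle = repo.store({"experiment": "example-run"}, ["http", "gridftp"])
print(choose_protocol(repo.access_options(handle)))
```

Keeping the handle opaque would let the repository relocate data without changing what the orchestration system has to remember.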

Crosswalk

The THREDDS to LEAD crosswalk has been updated to generate valid LEAD metadata and continues to provide LEAD metadata for the community datasets offered by and used in LEAD.

We have discussed updating the crosswalk to handle large data volumes such as radar data. A simple data- and host-specific solution has been outlined that will provide updates to a continuously maintained list of available radar data. The development of this software is a necessary step in the integration of radar data into LEAD.

LEAD Target Audience

In October, the LEAD team met with its External Advisory Panel (EAP) to present the current state of the LEAD effort and obtain the Panel's feedback on the direction and capabilities represented. The panel offered many excellent suggestions for the LEAD team as the project entered its fourth year. One of these centered on the observation that the LEAD effort targets a very broad audience (from 6th grade to operational forecasters). The EAP suggested that, for now, LEAD focus its efforts on a smaller target audience, with emphasis on providing them with high quality capabilities.

The LEAD team held its semi-annual All Hands meeting in San Antonio, TX this January, in conjunction with the AMS Annual Meeting. At that meeting, the team agreed that its best target audience will initially be undergraduate and early graduate students in meteorology and their professors. This is the community that has the most to gain from the capabilities LEAD is creating at this time and that has caught on to the promise of LEAD, as demonstrated by the comments received at the Triennial Unidata Workshop.

Presentations

Demonstrations of LEAD capabilities were given in the UCAR Office of Programs booth and two papers from Unidata were presented by UPC staff at the American Geophysical Union (AGU) Fall meeting in San Francisco, CA. These are:

Linked Environments for Atmospheric Discovery (LEAD) at the Unidata Triennial Workshop by Tom Baltzer, Anne Wilson, Suresh Marru, Al Rossi, Marcus Christie, Shawn Hampton, Dennis Gannon, Jay Alameda, Mohan Ramamurthy and Kelvin Droegemeier
TDR: A Repository for Long Term Storage of Geophysical Data and Metadata, by Anne Wilson, Tom Baltzer, and John Caron.

Demonstrations of LEAD capabilities were given in the Unidata booth and two papers were given by UPC staff at the AMS meeting in San Antonio, TX. These are:

“LEAD at the Unidata workshop: demonstrating democratization of NWP capabilities”, by Tom Baltzer, Anne Wilson, Suresh Marru, Albert Rossi, Marcus Christie, Shawn Hampton, Dennis Gannon, Jay Alameda, Mohan Ramamurthy, and Kelvin Droegemeier.

“The THREDDS data repository (TDR) for storage of LEAD data and metadata”, by Anne Wilson, John Caron, and Tom Baltzer.

Visitors

Mark Govette of NOAA's Earth Systems Research Laboratory visited Unidata in advance of the External Advisory Panel meeting on which he sat in place of Steven Koch.

Valentine Anantharaj from the GeoResources Institute at Mississippi State visited Unidata for discussions about LEAD and Grid computing initiatives as well as to attend the Unidata Workshop on TDS and NetCDF-Java.