Extensible NcML for AI/ML Ready Data on the THREDDS Data Server

Leo Matak

I would like to begin by saying that this internship has definitely been one of the highlights of my Ph.D. journey. Coming to Boulder from Houston made me realize how different climates and cities can be. I loved Boulder from the moment I arrived. It is such a pedestrian-friendly city, and everyone seems to be jogging, biking, walking, or doing some other kind of exercise. Following the example of Boulder's residents, I activated my BCycle account and rode an electric bike to work and back every day, and on the weekends I explored and experienced Boulder. What a fun way of commuting that was!

As a Ph.D. student researching Boundary Layer Dynamics, UCAR has always had a special place in my life because of all the data, tools, and knowledge it has provided. These resources aided me in my research and helped deepen my passion and skills in Earth Systems Sciences. Being accepted into an internship at NSF Unidata was truly a remarkable experience, as I used NSF Unidata's products (TDS, NetCDF, MetPy) on a daily basis. Now, I had the opportunity to go there, meet the people, and work with them on these software packages.

In today's technological world, where the volume of collected and measured data keeps expanding, Machine Learning (ML) and Artificial Intelligence (AI) have never been better suited to analyzing data, exploring it, and drawing conclusions from it. However, many ML applications are challenged by raw datasets that contain outliers, which can significantly affect results. These outliers might simply be instrument errors or other kinds of errors. Data cleaning, such as handling missing data, is another common issue. With problems like these, having data transformation and manipulation methods readily available is crucial.

Figure 1: Data processed locally

I spent most of my working hours implementing the idea of server-side virtual data processing. This means that data on a THREDDS Data Server (TDS) can be processed virtually, without modifying the stored data. The integrity of the original data remains intact, while the data delivered to users can be optimized for ML/AI.
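To make the idea of "virtual" processing concrete, here is a minimal Java sketch, not the actual NetCDF-Java or TDS code. The class name `VirtualView` and its methods are hypothetical; the point is only that the transformation runs when a value is read, so the stored values are never overwritten.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical illustration: a read-only view that transforms values on access.
final class VirtualView {
    private final double[] stored;               // stands in for data on disk
    private final DoubleUnaryOperator transform; // applied at read time only

    VirtualView(double[] stored, DoubleUnaryOperator transform) {
        this.stored = stored;
        this.transform = transform;
    }

    // Returns the transformed value; the stored value is left untouched.
    double read(int index) {
        return transform.applyAsDouble(stored[index]);
    }

    int size() {
        return stored.length;
    }
}
```

A client reading through such a view sees processed values, while anyone reading the underlying array still sees the raw ones.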

To make this work, I implemented what is known as a Service Provider in Java, where a concrete implementation of a service interface can be loaded at runtime without hardcoding it or modifying the existing code. Before that, to get familiar with the large NetCDF-Java directory tree and the overall structure of the code, I created a new class called Classifier that sorts data into arbitrary categories. After successfully writing this class and a couple of tests for it, I was ready to move on to the next step.
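For readers unfamiliar with the Service Provider pattern, the following is a rough sketch of how it can look with the standard `java.util.ServiceLoader`. The names `TransformProvider`, `DoubleItProvider`, and `TransformRegistry` are made up for illustration and are not the interfaces used in NetCDF-Java.

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical service interface: one provider per NcML attribute name.
interface TransformProvider {
    String attributeName();      // e.g. "classify"
    double apply(double value);  // transformation applied to a single value
}

// Hypothetical provider; in practice it would be a public class in its own jar.
final class DoubleItProvider implements TransformProvider {
    public String attributeName() { return "doubleIt"; }
    public double apply(double value) { return 2.0 * value; }
}

final class TransformRegistry {
    // Returns the first provider that handles the given attribute name.
    static Optional<TransformProvider> find(Iterable<TransformProvider> providers,
                                            String attributeName) {
        for (TransformProvider p : providers) {
            if (p.attributeName().equals(attributeName)) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }

    // Runtime discovery: ServiceLoader reads provider class names from a
    // META-INF/services/<interface name> file found on the classpath.
    static Optional<TransformProvider> discover(String attributeName) {
        return find(ServiceLoader.load(TransformProvider.class), attributeName);
    }
}
```

With this arrangement, adding a new transformation means dropping a jar containing a provider and its `META-INF/services` entry onto the classpath; the lookup code itself does not change.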

The implemented Classifier is defined with the following NcML code:

<variable name="Temperature_height_above_ground">
  <attribute name="classify" value="0 65 0; 65 85 1; 85 inf 2;" />
</variable>

which will perform the following classification:

Temperature [F]   [0, 65)        [65, 85)      [85, inf)
Assigned class    0 (bearable)   1 (comfort)   2 (not good)

The classification is applied locally and the results are shown in Figure 1.
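The rule string in the `classify` attribute can be read as a list of `low high label` triples, each covering the half-open interval [low, high). The sketch below shows one way such a string could be parsed and applied in Java. It is not the Classifier class from NetCDF-Java; the class name `ClassifySketch` and the choice to return -1 for values that match no rule are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class ClassifySketch {
    // One rule: values in [low, high) are assigned the given label.
    static final class Rule {
        final double low;
        final double high;
        final int label;

        Rule(double low, double high, int label) {
            this.low = low;
            this.high = high;
            this.label = label;
        }
    }

    // Parses a string such as "0 65 0; 65 85 1; 85 inf 2;" into rules.
    static List<Rule> parse(String spec) {
        List<Rule> rules = new ArrayList<>();
        for (String part : spec.split(";")) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing ';'
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Expected 'low high label': " + trimmed);
            }
            rules.add(new Rule(bound(fields[0]), bound(fields[1]),
                               Integer.parseInt(fields[2])));
        }
        return rules;
    }

    // Accepts "inf" and "-inf" in addition to ordinary numbers.
    private static double bound(String s) {
        if (s.equalsIgnoreCase("inf")) return Double.POSITIVE_INFINITY;
        if (s.equalsIgnoreCase("-inf")) return Double.NEGATIVE_INFINITY;
        return Double.parseDouble(s);
    }

    // Returns the label of the first matching rule, or -1 (an assumed
    // sentinel) when no rule matches, which includes NaN inputs.
    static int classify(List<Rule> rules, double value) {
        for (Rule r : rules) {
            if (value >= r.low && value < r.high) {
                return r.label;
            }
        }
        return -1;
    }
}
```

With the temperature rules above, 64.9 falls in class 0, 65 in class 1, and anything from 85 upward in class 2.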

We wanted to allow data processing to be specified via the NetCDF Markup Language (NcML). NcML is a simple and intuitive way of stating which transformation is wanted and how it should be applied. Making it work with the Java code required refactoring existing code and adding new code. After some more thinking, brainstorming, and a couple of meetings, I arrived at working code that allows TDS administrators to implement custom data transformations that are applied server-side.

Figures 2 & 3: Data processed directly on TDS

After we got that extensible NcML mechanism working, we used my Classifier transformation as an example on NSF Unidata’s own TDS. We chose the GFS 20km CONUS dataset variables for relative humidity and temperature as the inputs: https://tds.scigw.unidata.ucar.edu/thredds/catalog/grib/NCEP/GFS/CONUS_20km/catalog.html (input data is from NSF Unidata's Science Gateway THREDDS Data Server). The corresponding output (classified) data variables can be viewed on NSF Unidata's thredds-test data server, where the transformation took place: https://thredds-test.unidata.ucar.edu/thredds/catalog/classified/grib/NCEP/GFS/CONUS_20km/catalog.html.

The Classifier can be used on raw temperature and relative humidity datasets directly on the TDS using the following NcML:

<variable name="Relative_humidity_height_above_ground">
  <attribute name="classify" value="0 45 0; 45 75 1; 75 100 2;" />
</variable>

The results are shown in Figures 2 and 3.

In conclusion, this summer has been one of my greatest experiences so far. Not only have I learned a great deal and refined my skills, but I’ve also met many wonderful people. I had the opportunity to spend quality time with the team at NSF Unidata and with fellow students interning at UCAR. The skills I’ve honed, such as Java and Git, will undoubtedly benefit my future career and prepare me for upcoming challenges. I highly recommend applying to Unidata's internship programs, as they provide a lasting positive experience.

Unidata Developer's Blog
A weblog about software development by Unidata developers*