Your idea of what's entailed in setting up a supervised Machine Learning (ML) project as an Earth Systems scientist is probably not as fanciful as what an image generation algorithm came up with (see image at left!). But there are many small decisions ML practitioners make along the way when starting an Earth Systems Science (ESS) ML project. This article provides some tips and ideas to consider as you're getting started. These tips are not in any particular order, and, like all things related to ML projects, they depend on the specific types of data and project goals. (If you have any questions about your particular project, feel free to book a meeting with me — my contact details are at the end of this article.)
Try a Few Models
Even if you're sure that you need a deep learning model for your project, it's always recommended to use some ‘shallow’ (scikit-learn) models, either as a baseline or to aid with interpretation of input features. This is one thing I look for when I review applied ML papers. The two links below compare different classification and regression models on different datasets, and a minimal comparison sketch follows the list.
- My own scikit-learn regressor comparison
- A high-level comparison of different scikit-learn classifiers
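To make that concrete, here is a minimal sketch of what a shallow-model baseline comparison might look like. The random `X` and `y` arrays are stand-ins for your own feature matrix and target; swap in your data and whatever models make sense for your problem.

```python
# A minimal sketch: compare a few "shallow" scikit-learn regressors as baselines.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))               # stand-in for your input features
y = X[:, 0] * 2.0 + rng.normal(size=500)    # stand-in for your target variable

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Cross-validated score for each model gives a quick, honest baseline comparison.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Even when a deep learning model ends up winning, a table like the one this loop prints is a useful baseline to report alongside it.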
Scale Your Data
Most ML models (not all!) require pre-processing and normalization of input features. If you are using a decision-tree type of model, scaling isn't strictly required, but it may still be a good idea for your particular dataset and use case. Scikit-learn has a great suite of pre-processors, and these are useful even for non-ML use cases. Lately I have been using the quantile transformer for many of my workflows, but this choice is very much dataset and model dependent.
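Here is a minimal sketch of the quantile-transformer approach, assuming a skewed, synthetic feature matrix as a stand-in for real data. The key habit is fitting the transformer on the training split only, then applying it to the test split, so no information leaks from test to train.

```python
# A minimal sketch: scale features with scikit-learn's QuantileTransformer.
import numpy as np
from sklearn.preprocessing import QuantileTransformer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 4))   # skewed, stand-in feature matrix

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

qt = QuantileTransformer(n_quantiles=200, output_distribution="normal",
                         random_state=0)
X_train_scaled = qt.fit_transform(X_train)   # fit on training data only
X_test_scaled = qt.transform(X_test)         # reuse the fitted transformer
```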
Testing, Training, and Validation Datasets
While training and testing are crucial, the often-overlooked key to robust analysis lies in a third, independent validation dataset. This independent set serves as a critical reality check, ensuring your model generalizes well beyond the training data and isn't simply overfitting. However, for environmental and geoscience data, blindly applying random sampling for validation can be a recipe for disaster. Spatial and temporal correlations inherent in these data can lead to misleading results if not accounted for. For a deeper dive into best practices for well-based geoscience data validation, you're welcome to read this paper I wrote as part of my doctoral work: Digitalization of Legacy Datasets and Machine Learning Regression Yields Insights for Reservoir Property Prediction and Submarine-Fan Evolution: A Subsurface Example From the Lewis Shale, Wyoming
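One way to account for those correlations is to split by location rather than by row. The sketch below uses scikit-learn's GroupKFold with a hypothetical `well_id` grouping variable, so that measurements from the same well never end up on both sides of a split; substitute whatever spatial or temporal identifier fits your data.

```python
# A minimal sketch: group samples by well so spatially correlated measurements
# from the same location never appear in both training and validation folds.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n = 600
well_id = rng.integers(0, 30, size=n)            # 30 hypothetical wells
X = rng.normal(size=(n, 5))                      # stand-in features
y = X[:, 0] + 0.1 * well_id + rng.normal(size=n) # stand-in target

gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=well_id):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    print(f"fold MAE: {mean_absolute_error(y[val_idx], preds):.3f}")
```

Fold-to-fold spread in these scores is often a more honest picture of generalization than a single random train/test split.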
Drop Unnecessary Data
If, after you've done some exploratory data analysis and some training and testing of various models, there are a few input features that do not seem to improve performance, it's best practice to remove (or drop) them before doing your final analysis. Within the scikit-learn ecosystem, you can do this automatically using Recursive Feature Elimination (RFE), depending on the model. RFE not only simplifies and speeds up model training, but also identifies the features that are potentially most impactful, giving you better insight into your data.
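A minimal RFE sketch might look like the following; the feature names are hypothetical placeholders for your own inputs, and the synthetic target depends on only two of them so the selector has something to find.

```python
# A minimal sketch: Recursive Feature Elimination (RFE) with a random forest.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
feature_names = ["porosity", "depth", "gamma_ray", "noise_1", "noise_2"]
X = rng.normal(size=(400, len(feature_names)))
y = X[:, 0] * 3.0 + X[:, 1] + rng.normal(size=400)  # only two features matter

selector = RFE(RandomForestRegressor(n_estimators=100, random_state=0),
               n_features_to_select=3)
selector.fit(X, y)

# support_ flags which features survived the elimination rounds.
for name, keep in zip(feature_names, selector.support_):
    print(f"{name}: {'kept' if keep else 'dropped'}")
```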
Your Performance Metric Matters
In a previous post, I discussed the potential overuse of R2 as a metric for regression problems. Accuracy can likewise be misleading for classification on unbalanced, multi-class datasets, which are common in ESS. It's worth experimenting with a couple of different performance metrics, and reporting more than one! Within the scikit-learn ecosystem, there are many options besides R2 and accuracy. This is especially true for ML models that are trying to predict relatively rare events. (See "Metrics and scoring: quantifying the quality of predictions")
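As an illustration, the sketch below scores a classifier on a synthetic, imbalanced dataset and reports plain accuracy alongside balanced accuracy and macro-averaged F1. The class weights are made up for the example; the point is that accuracy alone can look flattering when one class dominates.

```python
# A minimal sketch: report more than one metric on an imbalanced problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Three classes with a heavy majority class (90% / 7% / 3%).
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=5,
                           weights=[0.9, 0.07, 0.03], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy:         ", accuracy_score(y_test, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("macro F1:         ", f1_score(y_test, y_pred, average="macro"))
```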
Visualize Your Data
Visualizing your data is not just something to do at the end of a project; it's a critical sanity check throughout any quantitative analysis. Datasets like Anscombe's quartet and the Datasaurus Dozen have shown how summary statistics alone do not tell the whole story. As a data scientist, I find that visualizing data at every step of the ML workflow, even if the plots don't make it into the final report, helps me identify potential issues and refine my workflow. Don't underestimate the power of a simple visualization. (See the previous post for some visual examples.)
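If it helps to see what such a quick check can look like, here is a minimal sketch using matplotlib on synthetic data; a histogram and a scatter plot are often enough to reveal structure (here, a nonlinear relationship) that summary statistics would hide.

```python
# A minimal sketch: a quick visual sanity check before (and during) modeling.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=300)                  # stand-in input feature
y = np.sin(x) + rng.normal(scale=0.2, size=300)   # clearly nonlinear target

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(x, bins=30)
axes[0].set_title("Feature distribution")
axes[1].scatter(x, y, s=10)
axes[1].set_title("Feature vs. target")
plt.tight_layout()
plt.show()
```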
Thomas Martin is an AI/ML Software Engineer at the NSF Unidata Program Center. Have questions? Contact support-ml@unidata.ucar.edu or book an office hours meeting with Thomas on his Calendar.