K Nearest Neighbors

[Image: Fred Rogers, famous for asking people to be his neighbor]

K Nearest Neighbors (KNN) is a supervised machine learning method that "memorizes" (stores) an entire dataset, then relies on the concepts of proximity and similarity to make predictions about new data. The basic idea is that if a new data point is in some sense "close" to existing data points, its value is likely to be similar to the values of its neighbors. In the Earth Systems Sciences, such techniques can be useful for small- to moderate-scale classification and regression problems; one example uses KNN techniques to derive local-scale information about precipitation and temperature from regional- or global-scale numerical weather prediction model output.

When using a KNN algorithm, you select the number of "neighbors" to consider (K), and potentially a way of calculating the "distance" between data points. KNN algorithms can be used for both classification and regression problems. For regression problems, KNN predicts the target variable by averaging the values of the nearest neighbors. For classification problems it takes the mode (most common class) of the nearest neighbors; as a result, an odd value of K is generally recommended to avoid ties. Effective use of KNN often requires some experimentation to determine the best value for K.
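To make that concrete, here is a minimal sketch of the classification case in plain NumPy: measure the distance from a new point to every stored point, keep the K closest, and take the mode of their labels. The feature names and sample values below are made up purely for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Mode (most common label) of those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Made-up example: classify "rain" (1) vs "no rain" (0) from two features
# (say, relative humidity fraction and a pressure anomaly)
X_train = np.array([[0.9, -2.0], [0.8, -1.5], [0.2, 1.0], [0.3, 0.5]])
y_train = np.array([1, 1, 0, 0])

print(knn_predict(X_train, y_train, np.array([0.85, -1.0]), k=3))  # -> 1
```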

[Image: Comparing the decision boundary between using 1 neighbor vs 20, from Kevin Zakka’s blog]

KNN is sometimes called a "lazy learning" method because it does not build an explicit model during training; it simply memorizes the dataset in its entirety. While the scikit-learn implementation provides a .fit() method, that call largely just stores the training data so the estimator matches the rest of the scikit-learn API.
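As a rough illustration of that "lazy" behavior, here is a regression sketch using scikit-learn (assuming it is installed); the data are synthetic, and .fit() essentially just stores them, with the real work happening at prediction time.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic example: predict temperature (degC) from elevation (m) and latitude (deg)
# In practice you would scale the features so distance isn't dominated by elevation.
X = np.array([[100.0, 40.0], [1500.0, 39.0], [300.0, 45.0], [2500.0, 38.5]])
y = np.array([22.0, 12.0, 18.0, 7.0])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X, y)                         # "lazy": this mostly just stores X and y
print(knn.predict([[800.0, 41.0]]))   # averages the targets of the 3 nearest points
```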

Why you might use KNN for your ML project

  1. It's simple. Because KNN is a lazy learner, there is no complex model and only limited math is needed to understand the inner workings.
  2. It's adaptable to different data distributions. KNN is non-parametric, so it makes no assumptions about how the data are distributed and can handle oddly shaped distributions.
  3. It's good for smaller datasets. Because no model is being constructed, KNNs can be a good choice for smaller datasets.

Some Downsides to KNN

  1. It's sensitive to outliers and poor feature selection. KNN does not perform any automatic feature selection the way decision tree models do, so it can struggle in high-dimensional spaces, both with a large number of input features and with outliers within those features.
  2. It has a relatively high computational cost. While the analog/sample-matching behavior of KNN is great from an explainability point of view (model-free ML is great!), for large datasets the memory and prediction-time cost of comparing each new point against the entire stored dataset can be enormous.
  3. It needs a complete dataset. Like many other ML models, KNN does not handle missing data or NaN (Not a Number) values. If your dataset is not complete, you'll need to impute the missing values before using a KNN, as in the sketch after this list.
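One simple way to handle that last point is to impute missing values before fitting, for example with scikit-learn's SimpleImputer. The snippet below is a minimal sketch on made-up data; other imputation strategies may suit your dataset better.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical feature matrix with a missing value (NaN)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 1.0], [5.0, 2.5]])
y = np.array([0, 0, 1, 1])

# Fill each missing value with its column mean before handing the data to KNN
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_filled, y)
print(knn.predict([[2.0, 2.0]]))
```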

KNNs have been discussed previously on MetPy Mondays here: MetPy Mondays #183 - Predicting Rain with Machine Learning - Using KNN

KNNs are a great supervised ML model to try out if your dataset is on the smaller side. Happy modeling! What ML model should I cover in an upcoming blog?

More reading and resources

Thomas Martin is an AI/ML Software Engineer at the NSF Unidata Program Center. Have questions? Contact support-ml@unidata.ucar.edu or book an office hours meeting with Thomas on his Calendar.
