Choosing the Right Number of Neighbors (k) for the K-Nearest Neighbors (KNN) Algorithm


In machine learning, KNN (K-Nearest Neighbors) plays an important role in classification and regression tasks.

The major challenge when using KNN is choosing the right (best) value for k, which is the number of neighboring instances considered when classifying a new instance.

In technical terms, k is a hyperparameter in the KNN algorithm. The user needs to define its best value, as the algorithm can't learn it from the input data.

from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier(n_neighbors=???) # which value should we choose?

In Scikit-learn's KNN classifier, k is specified using the n_neighbors argument. Scikit-learn provides a default value of 5, but the default is rarely optimal because the best k value depends on the dataset and many other factors.

The theoretical maximum for k is the total number of observations in the dataset, and the minimum is 1. We never use these two extremes; the best value lies somewhere in between.

Today, we will discuss six effective methods of choosing the right k value. We will also discuss the effect of the k value on KNN model performance by plotting decision boundaries.

KNN can perform both regression and classification tasks, but here we will only consider building classification models.

In machine learning terms, KNN is a supervised learning algorithm: it requires labeled data. When fitting the model, we need to provide both a data matrix and a label (target) vector as X and y. More technically, KNN falls under the category of instance-based learning, which is also called lazy learning.

In the training phase, an instance-based model such as KNN does not learn anything from the data; it simply stores the training instances. No parameters are learned from the data, which is why instance-based methods are also known as non-parametric methods.

In the testing phase (when predicting the class of a new instance), the algorithm calculates the Euclidean distance between the new instance and every stored instance. The Euclidean distance between two data points x and y is calculated using the following formula [Citation¹]. Here, n is the dimension of the space, which can be any positive integer.

Euclidean distance formula (Image by author)
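
In plain text, the formula shown in the figure is the standard Euclidean distance:

d(x, y) = √( (x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)² )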

Then, the algorithm arranges the calculated distance values in ascending order and finds the k nearest neighbors to the new instance based on those distances. Let's assume k is 5. The algorithm considers the nearest 5 data points to the new instance, and the new instance is assigned the majority class among those neighbors. For example, if there are three classes (red, green, blue) in the dataset and 3 of the 5 nearest neighbors are red, the new instance will be assigned to the red class. That's how KNN works behind the scenes.

KNN new instance classification (Image by author)
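
To make the mechanics concrete, here is a minimal sketch of that distance-and-vote logic using NumPy. It is purely illustrative (not scikit-learn's implementation), and the function and variable names are hypothetical.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
  # Euclidean distance from the new instance to every stored instance
  distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
  # Indices of the k nearest neighbors (smallest distances first)
  nearest_idx = np.argsort(distances)[:k]
  # Majority vote among the labels of those k neighbors
  return Counter(y_train[nearest_idx]).most_common(1)[0][0]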

A simple and bad way to create a KNN model

Most people prefer to use the following traditional method to create a KNN model.

from sklearn.datasets import load_iris

# Load the Iris data matrix (X) and target vector (y)
X = load_iris().data
y = load_iris().target

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20)

from sklearn.preprocessing import StandardScaler

# Standardize features using statistics computed on the training set only
sc = StandardScaler()
sc.fit(X_train)
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier

# Fit KNN with the default number of neighbors (k=5) and predict on the test set
KNN = KNeighborsClassifier()
KNN.fit(X_train_scaled, y_train)
y_pred = KNN.predict(X_test_scaled)

from sklearn.metrics import accuracy_score

print("Test accuracy: ", accuracy_score(y_test, y_pred))

There are two drawbacks to this method.

  • The model uses the default value for the number of neighbors (k) because we haven't specified one. The default will not be the optimal value in most cases, so we need to find the right k value before building the final model.
  • The model is evaluated on a single test set obtained by splitting the entire dataset into train and test sets. That test set differs significantly each time we execute the code due to the randomness involved in splitting, so the evaluation score is not reliable. To avoid this, we use 5-fold cross-validation: instead of relying on a single test set, we change the test set in each of the 5 iterations and report the average evaluation score at the end, as sketched below.
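
Here is a minimal sketch of that idea using scikit-learn's cross_val_score (purely illustrative; it reuses the X and y arrays loaded above and skips feature scaling for brevity):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 5-fold CV: five different train/validation splits, one accuracy score per fold
cv_scores = cross_val_score(KNeighborsClassifier(), X, y,
                            scoring='accuracy', cv=5)

print("Mean CV accuracy: ", cv_scores.mean())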

Choosing the right number of neighbors (k) for KNN

Method 1: Visualizing classification performance on different k values (visual inspection)

The best way to find the right k value for a given dataset is to train the KNN model several times with different k values and then visualize the classification performance for each one. Instead of using a single test set created by splitting, here we use 5-fold cross-validation to mitigate the variance caused by the randomness involved in splitting.

from sklearn.datasets import load_iris

X = load_iris().data
y = load_iris().target

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_scaled = sc.fit_transform(X)

The KNN algorithm uses a distance function such as Euclidean distance to measure the similarity between two instances. This function is highly sensitive to the relative scale of the input features. If the features are not measured on a similar scale, we need to perform feature scaling to bring them onto the same scale. That's why we standardize the features in the Iris dataset.

Also note that we only need to scale the input features (X), and it is not necessary to scale the label column (y).
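
As a quick, purely illustrative example of why scale matters, consider two points described by a height in metres and a weight in grams; the large-scale feature completely dominates the Euclidean distance:

import numpy as np

# Height in metres, weight in grams (deliberately mismatched scales)
a = np.array([1.70, 70000.0])
b = np.array([1.80, 65000.0])

# The distance is driven almost entirely by the weight feature
print(np.linalg.norm(a - b)) # ≈ 5000.0; the 0.1 m height difference barely matters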

Then, we train the KNN algorithm 30 times with k values ranging from 1 to 30! Each time, we evaluate the model using 5-fold cross-validation with 'accuracy' as the evaluation metric. We store the accuracy for each k and use those values to plot performance against the k value.

k_values = list(range(1, 31)) # From 1 to 30
acc_scores = []

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

for k in k_values:
  KNN = KNeighborsClassifier(n_neighbors=k)

  scores = cross_val_score(KNN, X_scaled, y,
                           scoring='accuracy', cv=5)

  acc_scores.append(np.mean(scores))

import matplotlib.pyplot as plt
plt.style.use("ggplot")

plt.plot(k_values, acc_scores, marker='o', color='blue')
plt.xlabel("Neighbors - K Value")
plt.ylabel("Accuracy")
plt.title("Effect of K")
Effect of k on classification accuracy (Image by author)

We need to choose the k value that delivers the best classification performance. Here, k=6 and k=8 give the best classification performance.


Method 2: Finding the best k value using Grid Search hyperparameter optimization

Another way to find the best k value for KNN is Grid Search. Because k is a hyperparameter, we can tune it with Grid Search. We do the same thing as in the previous method, but without an explicit for loop, since Grid Search does the looping for us! Cross-validation also happens behind the scenes, so we don't need to run it manually. Here too, we train KNN 30 times with k values ranging from 1 to 30!

k_values = list(range(1, 31)) # From 1 to 30

from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier()

hyperparameter_space = {'n_neighbors':k_values}

from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(KNN, param_grid=hyperparameter_space,
                  scoring='accuracy', cv=5)

gs.fit(X_scaled, y)

print("Best value of K: ", gs.best_params_)
print("Mean CV accuracy of best K-value: ", gs.best_score_)
(Image by author)

The Grid Search function automatically selects the best k value that delivers the best classification performance. The selected k value is 6.

Grid Search does everything automatically, behind the scenes! So, this method is easier than the previous one, but I prefer the previous method because it gives a whole picture of how the model behaves across different k values. Visual inspection is always valuable!


Method 3: Finding the best k value using NumPy argmax() function

Instead of relying on visual inspection as in method 1, here we extract the best k value programmatically using the NumPy argmax() function.

k_values = list(range(1, 31)) # From 1 to 30
acc_scores = []

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

for k in k_values:
  KNN = KNeighborsClassifier(n_neighbors=k)

  scores = cross_val_score(KNN, X_scaled, y,
                           scoring='accuracy', cv=5)

  acc_scores.append(np.mean(scores))

# Using NumPy argmax()
best_k_index = np.argmax(acc_scores)
best_k = k_values[best_k_index]

print("Best k:", best_k)
print("Score:", acc_scores[best_k_index])
(Image by author)

The NumPy argmax() function returns the index of the maximum value in the acc_scores list. Then, we use that index to pick out the best k value and its corresponding classification score.


Method 4: Using the validation curve to determine overfitting and underfitting conditions

The validation curve plots the influence of a single hyperparameter on the training and validation scores [Citation²]. So, it is a very useful tool for detecting overfitting and underfitting as the k value (our single hyperparameter) changes. You can then choose a k value that gives a model which neither overfits nor underfits the data.

# Importing Yellowbrick's validation curve visualizer
from yellowbrick.model_selection import ValidationCurve

# Importing KNN classifier
from sklearn.neighbors import KNeighborsClassifier

k_values = list(range(1, 31)) # From 1 to 30

# Creating the validation curve
viz = ValidationCurve(KNeighborsClassifier(), 
                      param_name="n_neighbors",
                      param_range=k_values, 
                      cv=5, scoring="accuracy")

viz.fit(X_scaled, y)

# Saving plot in PNG format
viz.show(outpath="Validation_Curve.png", dpi=80)
Validation curve (Image by author)

The model fits well at k=6 and k=8. It will perform well on the training data and will also generalize well to new, unseen data.

The model begins to overfit the data between k=11 and k=14, because the validation score begins to decrease while the training score is still increasing. When overfitting, the model performs well on the training data but fails to perform well on new, unseen data.

After k=19, both training and validation scores decrease and the model underfits the training data. It will perform poorly on both training and unseen data.

That's how we read the validation curve.


Method 5: Plotting the decision boundaries for different k values

In KNN, decision boundaries separate the regions of the feature space that the model assigns to each class, based on the majority class among the k nearest neighbors.

Here, we will see how the decision boundaries change with different values of k. Plotting decision boundaries manually is time-consuming and tedious, involving many lines of code. Therefore, we use the plot_decision_regions() function from Python's mlxtend (machine learning extensions) library.

You can install it by running the following command.

pip install mlxtend

To get help with the plot_decision_regions() function, run the following command.

from mlxtend.plotting import plot_decision_regions

help(plot_decision_regions)

This function creates 2D plots, so we need to provide two-dimensional input data. If the data has more than two dimensions, we need to apply PCA to reduce it to two.

Here is the general code for visualizing KNN decision boundaries using the mlxtend library.

from mlxtend.plotting import plot_decision_regions
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Perform PCA with 2 components to make 2D visualizations
# when the number of features > 2
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled) # The first two PCs retain about 95%
                                    # of the variance in the Iris data

def knn_decision_boundaries(data, y, k):
  KNN = KNeighborsClassifier(n_neighbors=k)
  KNN.fit(data, y)

  plot_decision_regions(data, y, clf=KNN)
  plt.xlabel("PC1")
  plt.ylabel("PC2")
  plt.title("KNN with k=" + str(k), fontsize=13)
  plt.savefig('knn_k' + str(k) + '.png', dpi=80)
  plt.show()

knn_decision_boundaries(X_pca, y, k=5) # e.g. k=5; change k as needed

You may create multiple plots by changing the value of k as needed.
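
For instance, you could loop over a few k values and save one plot per value (the values below are just examples):

# One decision-boundary plot per k value (example values only)
for k in [1, 5, 15, 30]:
  knn_decision_boundaries(X_pca, y, k)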

KNN decision boundaries with different k values (Image by author)

Smaller k values tend to create more complex decision boundaries, but they minimize the number of misclassified training points.


Method 6: Choosing a larger k using weighted distances

By default, the KNN algorithm gives all k neighbors uniform weight in the vote. Scikit-learn's KNN classifier provides a weights argument to change this behavior.

from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier(n_neighbors=k, weights='distance')

With weights='distance', each neighbor's vote is weighted by the inverse of its distance, so closer neighbors of a data point have a greater influence on the prediction than neighbors that are further away.
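
For intuition, here is a minimal, purely illustrative sketch of an inverse-distance weighted vote (not scikit-learn's internal code; the function name is hypothetical):

import numpy as np

def weighted_vote(distances, labels):
  # Inverse-distance weight for each of the k neighbors
  # (a tiny epsilon avoids division by zero for exact matches)
  weights = 1.0 / (distances + 1e-12)
  classes = np.unique(labels)
  # The class whose neighbors carry the largest total weight wins
  totals = [weights[labels == c].sum() for c in classes]
  return classes[int(np.argmax(totals))]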

We need to combine this method with method 1.

k_values = list(range(1, 31)) # From 1 to 30
acc_scores = []

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

for k in k_values:
  KNN = KNeighborsClassifier(n_neighbors=k, weights='distance')

  scores = cross_val_score(KNN, X_scaled, y,
                           scoring='accuracy', cv=5)

  acc_scores.append(np.mean(scores))

import matplotlib.pyplot as plt
plt.style.use("ggplot")

plt.plot(k_values, acc_scores, marker='o', color='blue')
plt.xlabel("Neighbors - K Value")
plt.ylabel("Accuracy")
plt.title("Effect of k with weighted distances")
plt.savefig('v.png', dpi=80)
Effect of k with weighted distances (Image by author)

When comparing this plot with the plot obtained in method 1, many (larger) k values deliver the best classification performance. It is reasonable to choose a larger k value when using the weighted distance option.


Which method should I use?

It depends. The best strategy is to go with visual inspection (method 1) and confirm the result using one or two of the other methods. Creating the validation curve (method 4) is always a good idea as well.


Useful tips for choosing the right k-value

  • You need to maintain the balance between overfitting and underfitting. You can plot the validation curve to achieve this.
  • Larger k values are robust to variance caused by noisy data, but they may ignore important patterns (larger k values have low variance and high bias).
  • When k is very small, noise and outliers will influence the model performance.
  • Smaller k values tend to create more complex decision boundaries, but they minimize the number of misclassified training points.
  • You may choose the k value that delivers the best classification performance.
  • When the dataset is small, you must choose the k value carefully. However, when the dataset is large, some variations between different k values will not substantially affect the model performance. So, pay attention to the size of the dataset.
  • Using an odd number for k helps avoid ties, especially in binary classification!

This is the end of today's article.

Please let me know if you have any questions or feedback.

Recommended articles as prerequisites

Kindly read the following articles written by me. They provide prerequisite content for this article.

k-fold cross-validation explained in plain English

Parameters Vs Hyperparameters: What is the difference?

Why do we set a random state in machine learning models?

Why is Feature Scaling Important in Machine Learning? Discussing 6 Feature Scaling Techniques

Python Implementation of Grid Search and Random Search for Hyperparameter Optimization

Validation Curve Explained – Plot the influence of a single hyperparameter

3 Easy Steps to Perform Dimensionality Reduction Using Principal Component Analysis (PCA)

KNNImputer for Filling Missing Data in Data Preprocessing

How about an AI course?

Join my private list of emails

Never miss a great story from me again. By subscribing to my email list, you will directly receive my stories as soon as I publish them.

Thank you so much for your continuous support! See you in the next article. Happy learning to everyone!

Iris dataset info

  • Citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  • Source: https://archive.ics.uci.edu/ml/datasets/iris
  • License: R.A. Fisher holds the copyright of this dataset. Michael Marshall donated this dataset to the public under the Creative Commons Public Domain Dedication License (CC0).

Designed and written by: Rukshan Pramoditha

2024–02–27

Tags: Artificial Intelligence Classification Data Science Knn Algorithm Machine Learning
