K-Nearest Neighbors from First Principles
Author: Yusuf Brima
Introduction
In this post, I will walk through how to implement the K-Nearest Neighbors (K-NN) machine learning algorithm from scratch in Python. K-NN is one of the most fundamental ML algorithms used for classification and regression.
Understanding K-NN
The intuition behind K-NN is quite simple. Given a new data point, let us call it $X_\textrm{new}$, we look at its closest $K$ neighbors in the training data and predict the label based on those neighbors. For classification, we take a majority vote of the neighbor labels. For regression, we take the average of the neighbor values.
Here are the steps to follow to apply K-NN:
- Choose a value for $K$ - the number of neighbors to examine.
- Calculate the distance between the new data point and all points in the training set using a distance metric like Euclidean or Manhattan distance.
- Sort the distances and take the top $K$ closest points.
- For classification, take the majority class label of the $K$ points. For regression, take the average value of the $K$ points.
An illustration of the K-NN algorithm for a sample 2D dataset is shown below:
The key steps are computing distances, finding neighbors, and aggregating labels. The effectiveness of K-NN depends heavily on the choice of distance metric and value of $K$.
Implementing K-NN Classifier in Python
I will now walk you through a simple example of implementing a K-NN classifier in Python for classifying 2D points.
First, we generate some sample 2D data:
import numpy as np
from sklearn.datasets import make_blobs
# Generate 500 two-dimensional points grouped into two clusters (classes)
X, y = make_blobs(n_samples=500, centers=2, random_state=42)
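The helper functions later in this post assume that each training example is stored as a single row with its class label as the last element. Here is a minimal sketch of one way to arrange the make_blobs output in that form (the name training_set is my own choice for this post, not part of the dataset API):
# Each row has the form [x1, x2, label]; the label sits in the last position
training_set = np.column_stack((X, y)).tolist()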
For 2D data, I use the Euclidean distance metric.
The Euclidean distance between two points $a$ and $b$ is:
\[d(a,b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}\]
Where:
- $a = (a_1, a_2, \ldots, a_n)$
- $b = (b_1, b_2, \ldots, b_n)$
- $a_i, b_i$ are the $i$-th components of vectors $a$ and $b$
- $n$ is the dimensionality of the space
Using summation notation makes the equation more compact and generalizable to any number of dimensions.
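As a quick concrete check of the formula in two dimensions: for $a = (0, 0)$ and $b = (3, 4)$, the distance is $d(a,b) = \sqrt{(0-3)^2 + (0-4)^2} = \sqrt{9 + 16} = 5$.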
The implementation is straightforward using a loop. Vectorized linear algebra routines are more efficient for high-dimensional data, but I'll skip them for simplicity since we are in 2D and the number of samples is relatively small:
import math
def euclidean_distance(a, b):
    # Sum the squared differences over each coordinate, then take the square root
    distance = 0
    for i in range(len(a)):
        distance += (a[i] - b[i])**2
    return math.sqrt(distance)
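As a quick sanity check, and to illustrate the vectorized alternative mentioned above, here is a small sketch comparing the loop-based function against NumPy's built-in norm on a 3-4-5 right triangle:
# Loop-based implementation: distance between (0, 0) and (3, 4) is 5
print(euclidean_distance([0, 0], [3, 4]))                    # 5.0
# Equivalent vectorized computation with NumPy
print(np.linalg.norm(np.array([0, 0]) - np.array([3, 4])))   # 5.0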
To find the $K$ nearest neighbors, we calculate distances to all points, sort them, and take the first $K$:
import operator
def get_neighbors(training_set, new_point, K):
    # Compute the distance from the new point to every training row,
    # using only the feature columns (the last element of each row is the label)
    distances = []
    for row in training_set:
        dist = euclidean_distance(row[:-1], new_point)
        distances.append((row, dist))
    # Sort by distance and keep the K closest rows
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for i in range(K):
        neighbors.append(distances[i][0])
    return neighbors
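Before wiring up the vote, we can peek at what this returns; the query point and value of $K$ below are arbitrary choices for illustration:
# Returns the 5 closest training rows, each of the form [x1, x2, label]
nearest = get_neighbors(training_set, [0.0, 0.0], K=5)
for row in nearest:
    print(row)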
For classification, we take a majority vote among the $K$ neighbors:
import collections
def predict_classification(training_set, new_point, K):
    # Find the K nearest rows, collect their labels (the last element of each row),
    # and return the most common label as the prediction
    neighbors = get_neighbors(training_set, new_point, K)
    output_values = [row[-1] for row in neighbors]
    prediction = collections.Counter(output_values).most_common(1)[0][0]
    return prediction
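Putting the pieces together on the make_blobs data from earlier, a minimal end-to-end check might look like this (the query point is again just an illustration):
# Predict the class of an arbitrary query point using its 5 nearest neighbors
predicted_label = predict_classification(training_set, [0.0, 0.0], K=5)
print(f"Predicted class: {predicted_label}")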
And we have a basic K-NN classifier in Python! With just a few lines of code, we have implemented the core ideas behind K-NN: computing distances, finding neighbors, and aggregating their labels. Next you may wonder, how do we choose $K$ and the distance metric? Stay with me; we are delving into this right away.
Choosing $K$ and Distance Metrics
The two most important design choices with K-NN are the value of $K$ and the distance metric. Finding good values requires experimenting on a validation set; a short sketch of how to do this follows the guidelines below.
Here are some guidelines on choosing $K$ and distance metrics:
- $K$ controls the tradeoff between bias and variance. Larger $K$ reduces variance but increases bias. Hyperparameter optimization can help choose the ideal $K$.
- Distance metrics depend on the data type. Euclidean distance works well for low-dimensional continuous features, Manhattan distance is often more robust in higher dimensions, and Hamming distance suits binary or categorical features.
- Feature scaling is important before computing distances.
- For correlated features or features on different scales, a metric that accounts for covariance, such as the Mahalanobis distance, can be used.
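As a sketch of how such experiments might look, the snippet below uses scikit-learn's KNeighborsClassifier together with feature scaling and 5-fold cross-validation to compare a few candidate values of $K$ on the make_blobs data from earlier; the candidate list and fold count are arbitrary choices for illustration:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Compare candidate K values using cross-validated accuracy;
# StandardScaler puts all features on the same scale before distances are computed
for k in (1, 3, 5, 7, 9, 11):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"K={k}: mean cross-validated accuracy = {scores.mean():.3f}")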
When does K-NN shine?
K-NN tends to perform very well when:
- The data has low inherent dimensionality, meaning the number of independent variables is relatively small, as in the example shown here.
- The decision boundaries between classes are irregular.
- There is plentiful training data relative to the dimensionality, meaning the number of training samples is much greater than the number of independent variables, for example tabular customer data.
When does K-NN struggle?
- Data has very high dimensionality (the curse of dimensionality)
- Classes are well-separated
- Training data is sparse
- Need to extrapolate beyond training data
In many cases, other algorithms like SVM, random forests, and neural nets outperform K-NN, especially on complex modern datasets. But K-NN is still widely used for its simplicity, interpretability, and surprisingly good performance.
Summary
In this post, I walked through implementing K-Nearest Neighbors from scratch in Python based on first principles. K-NN exemplifies the powerful “learning by analogy” paradigm underlying many ML algorithms. With just distances and neighbors, K-NN can deliver excellent results on many problems while being simple to understand and implement. While more complex techniques like deep nets are popular today, starting with fundamentals like K-NN is key for any practitioner looking to really grok machine learning.