Feb 27 - k nearest neighbors (KNN)

This week we're doing our first ML (machine learning) example, looking at probably the simplest approach : find the data points closest to the unknown, and guess from those.

Your mission is to :

Read about the k nearest neighbors algorithm in places like

the "from Scratch" textbook, chapter 12
wikipedia: k-nearest neigbors
Onel Harrison's Machine Learning Basics with the K-Nearest Neighbors Algorithm
scikit learn KNN classification | regression

By a week from today, Fri March 6, make a jupyter notebook that uses this algorithm on one of these datasets and come to class ready to describe what you did and how it worked.

I'll discuss the basic ideas today, and show an example and/or answer questions on Tuesday.

concepts

classification (labels) vs regression (numbers) tasks
distance between two sets of data
- euclidean distance : collections of (probably normalized) numbers
- hamming distance : count of how many labels are the same
- ... or perhaps some weighted combination of the two? (Hmmm.)
- neighbors : objects with the smallest distance away
weighting : choose numeric factors to emphasize some parts over others
k : a small integer
- k = 1 : just use nearest neighbor
- k = big : average_number or most_common_label using many neighbors

the problem

Suppose our data looks like this :

 x1    x2   x3    y        z
 --------------   --       --
 1.2   2.3   10   red      50 
 2.1   3.2   11   red      80
 20    20   500   green    90
 30    30   600   green    95

The problem is to guess either y or z given some (x1,x2,x3) e.g. (5,4,20).

Guessing y is a "categorization" problem, e.g. "Is this book a romance, western, or science_fiction?"

Guessing z is a "regression" problem, e.g. "What will the temperature be on July 11?"

algorithm

Divide the data into a training set and a test set.
Pick a model to evaluate :
- a value for k
- a method for calculating distance
- a rule (mean or voting, perhaps with weights) for guessing given k points.
Measure how well (y or z) can be guessed in the test set :
- loop over each member of the test set
- find the distance to each member of the training set
- sort to find the k closest ones
- guess mean(weighted closest z) or most_common(label y)
- accumulate total error (guessed - actual) across the test set.
Repeat for other versions of the model (change k or guessing rule)
Choose best model.
Use it to tell Netflix users which movie they will like.
Profit.

TLDR

Which movie will you like? Well, you liked Terminator. So you'll probably like Terminator II.

discussion

Can work well for numeric problems with fairly small number of problems.
Since "distance" between labels makes less sense, is less useful for non-numeric data.
As the number of features increases, distances tend to get larger and have less variation ... and the method doesn't work as well.

https://cs.marlboro.college /cours /spring2020 /data /notes /feb27
last modified Fri January 24 2025 7:49 pm

Data
Science

course

site

Feb 27 - k nearest neighbors (KNN)

concepts

the problem

algorithm

TLDR

related

discussion

DataScience

course

site

Feb 27 - k nearest neighbors (KNN)

concepts

the problem

algorithm

TLDR

related

discussion

Data
Science