Spring 2020

Feb 27 - k nearest neighbors (KNN)

This week we're doing our first ML (machine learning) example, looking at probably the simplest approach : find the data points closest to the unknown, and guess from those.

By a week from today, Fri March 6, make a jupyter notebook that uses this algorithm on one of these datasets and come to class ready to describe what you did and how it worked.

I'll discuss the basic ideas today, and show an example and/or answer questions on Tuesday.

concepts

• classification (labels) vs regression (numbers) tasks
• distance between two sets of data
• euclidean distance : collections of (probably normalized) numbers
• hamming distance : count of how many labels differ
• ... or perhaps some weighted combination of the two? (Hmmm.)
• neighbors : objects with the smallest distance away
• weighting : choose numeric factors to emphasize some parts over others
• k : a small integer
• k = 1 : just use nearest neighbor
• k = big : average_number or most_common_label using many neighbors
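The two distance notions above can be sketched in a few lines of Python (a minimal sketch; the function names are my own):

```python
import math

def euclidean(a, b):
    # square root of the summed squared differences between numeric features
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def hamming(a, b):
    # count of positions where the labels differ
    return sum(1 for ai, bi in zip(a, b) if ai != bi)

print(euclidean((1.2, 2.3), (2.1, 3.2)))          # about 1.27
print(hamming(("red", "big"), ("red", "small")))  # 1
```

A weighted combination would just be something like `w1 * euclidean(...) + w2 * hamming(...)`, with the weights chosen as part of the model.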

the problem

Suppose our data looks like this :

 x1    x2    x3   y        z
----------------  --       --
 1.2   2.3    10  red      50
 2.1   3.2    11  red      80
20    20     500  green    90
30    30     600  green    95


The problem is to guess either y or z given some (x1,x2,x3) e.g. (5,4,20).

Guessing y is a "classification" problem, e.g. "Is this book a romance, western, or science_fiction?"

Guessing z is a "regression" problem, e.g. "What will the temperature be on July 11?"
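As a concrete sketch of the k = 1 case on the toy table above (the Python data layout here is my own choice, not part of the notes):

```python
import math

# the toy table: features (x1, x2, x3), label y, number z
data = [
    ((1.2, 2.3, 10),  "red",   50),
    ((2.1, 3.2, 11),  "red",   80),
    ((20,  20,  500), "green", 90),
    ((30,  30,  600), "green", 95),
]

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

query = (5, 4, 20)
# with k = 1 the guess is just the y and z of the single closest row
nearest = min(data, key=lambda row: euclidean(row[0], query))
print(nearest[1], nearest[2])   # red 80
```

Here the second row is closest to (5, 4, 20), so a 1-nearest-neighbor model would guess y = "red" and z = 80.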

algorithm

• Divide the data into a training set and a test set.
• Pick a model to evaluate :
• a value for k
• a method for calculating distance
• a rule (mean or voting, perhaps with weights) for guessing given k points.
• Measure how well (y or z) can be guessed in the test set :
• loop over each member of the test set
• find the distance to each member of the training set
• sort to find the k closest ones
• guess mean(weighted closest z) or most_common(label y)
• accumulate total error (e.g. |guessed - actual|) across the test set.
• Repeat for other versions of the model (change k or guessing rule)
• Choose best model.
• Use it to tell Netflix users which movie they will like.
• Profit.
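The evaluation loop above might be sketched like this for the regression case (the dataset and the helper names are made up for illustration, not a definitive implementation):

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_guess(train, query, k, task="regression"):
    # sort the training rows by distance to the query; keep the k closest
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], query))[:k]
    values = [y for (x, y) in neighbors]
    if task == "regression":
        return sum(values) / len(values)          # mean of the k neighbors
    return max(set(values), key=values.count)     # most common label

def total_error(train, test, k):
    # accumulate |guessed - actual| across the test set
    return sum(abs(knn_guess(train, x, k) - y) for (x, y) in test)

# tiny made-up numeric dataset: z = x1 + x2 plus a little noise
random.seed(0)
rows = [((a, b), a + b + random.uniform(-1, 1))
        for a in range(10) for b in range(10)]
random.shuffle(rows)
train, test = rows[:80], rows[80:]   # 80/20 train/test split

for k in (1, 3, 5):
    print(k, total_error(train, test, k))
```

Trying several k values (and, more generally, several distance and guessing rules) and keeping the one with the smallest test error is the "choose best model" step.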

TLDR

Which movie will you like? Well, you liked Terminator. So you'll probably like Terminator II.

discussion

• Can work well for numeric problems with a fairly small number of features.
• Since "distance" between labels makes less sense, it is less useful for non-numeric data.
• As the number of features increases, distances tend to get larger and have less variation ... and the method doesn't work as well.
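A quick experiment (my own sketch) illustrates this last point, often called the "curse of dimensionality" : for random points in a unit cube, the spread of the distances relative to their mean shrinks as the number of dimensions grows, so "nearest" becomes less meaningful.

```python
import math
import random
import statistics

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def relative_spread(d, n=200, seed=0):
    # stdev / mean of the distances from one random point to the others,
    # for n random points in the d-dimensional unit cube
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(d)] for _ in range(n)]
    dists = [dist(points[0], p) for p in points[1:]]
    return statistics.stdev(dists) / statistics.mean(dists)

for d in (2, 10, 100):
    print(d, round(relative_spread(d), 3))   # the ratio shrinks as d grows
```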
https://cs.marlboro.college/cours/spring2020/data/notes/feb27