Feb 27 - k nearest neighbors (KNN)
This week we're doing our first ML (machine learning) example,
looking at probably the simplest approach : find the data points
closest to the unknown, and guess from those.
Your mission is to :
Read about the k nearest neighbors algorithm in places like
By a week from today, Fri March 6, make a Jupyter notebook
that uses this algorithm on one of these datasets
and come to class ready to describe what you did and how it worked.
I'll discuss the basic ideas today, and show an example and/or answer questions on Tuesday.
concepts
- classification (labels) vs regression (numbers) tasks
- distance between two data points (see the code sketch below)
- euclidean distance : for collections of (probably normalized) numbers
- hamming distance : count of how many labels differ
- ... or perhaps some weighted combination of the two? (Hmmm.)
- neighbors : the data points at the smallest distance from the unknown
- weighting : choose numeric factors to emphasize some parts over others
- k : a small integer
- k = 1 : just use nearest neighbor
- k = big : average_number or most_common_label using many neighbors
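Here's a minimal sketch of those two distances in plain Python (the function names and toy values are mine, just for illustration) :

    # A sketch of the two distance ideas above, in plain Python.
    # Function names and toy values are mine, just for illustration.

    def euclidean(a, b):
        """Distance between two points given as lists of (probably normalized) numbers."""
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def hamming(a, b):
        """Count how many label positions differ between two equal-length label lists."""
        return sum(1 for ai, bi in zip(a, b) if ai != bi)

    print(euclidean([1.2, 2.3, 10], [2.1, 3.2, 11]))    # about 1.62
    print(hamming(['red', 'small'], ['red', 'large']))  # 1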
the problem
Suppose our data looks like this :
      x1    x2    x3   y      z
      ---   ---   ---  -----  --
      1.2   2.3    10  red    50
      2.1   3.2    11  red    80
      20    20    500  green  90
      30    30    600  green  95
The problem is to guess either y or z given some (x1,x2,x3), e.g. (5,4,20).
Guessing y is a "classification" problem, e.g. "Is this book a romance, western, or science_fiction?"
Guessing z is a "regression" problem, e.g. "What will the temperature be on July 11?"
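For concreteness, here's that toy table as a plain Python structure, plus the unknown point. The (features, label, number) layout is just one possible choice, not a required format.

    # The toy table above as a plain Python structure: (features, y label, z number).
    data = [
        ((1.2, 2.3,  10), 'red',   50),
        ((2.1, 3.2,  11), 'red',   80),
        ((20,  20,  500), 'green', 90),
        ((30,  30,  600), 'green', 95),
    ]

    query = (5, 4, 20)   # the unknown (x1, x2, x3): guess its y (a label) or z (a number)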
algorithm
- Divide the data into a training set and a test set.
- Pick a model to evaluate :
- a value for k
- a method for calculating distance
- a rule (mean or voting, perhaps with weights) for guessing given k points.
- Measure how well (y or z) can be guessed in the test set (sketched in code after this list) :
- loop over each member of the test set
- find the distance to each member of the training set
- sort to find the k closest ones
- guess mean(weighted closest z) or most_common(label y)
- accumulate total error across the test set (e.g. |guessed - actual| for z, or count of wrong labels for y).
- Repeat for other versions of the model (change k or guessing rule)
- Choose best model.
- Use it to tell Netflix users which movie they will like.
- Profit.
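Here's a rough sketch of that whole loop in Python, assuming the data list and euclidean function from the snippets above. The names and error measures are my own choices, not a prescribed recipe.

    # A rough sketch of the whole loop, assuming `data` and `euclidean` from the
    # snippets above. Names and error measures are my own choices.
    from collections import Counter

    def knn_guess(query, training, k, target='z'):
        """Guess a label (most common y) or a number (mean z) from the k nearest rows."""
        nearest = sorted(training, key=lambda row: euclidean(query, row[0]))[:k]
        if target == 'y':                                   # classification: vote
            return Counter(y for (_, y, _) in nearest).most_common(1)[0][0]
        return sum(z for (_, _, z) in nearest) / k          # regression: average

    def evaluate(test, training, k, target='z'):
        """Total error over the test set: |guess - actual| for z, count of misses for y."""
        total = 0
        for features, y, z in test:
            guess = knn_guess(features, training, k, target)
            if target == 'y':
                total += (guess != y)                       # count wrong labels
            else:
                total += abs(guess - z)                     # absolute error
        return total

    # Try a few models (values of k) and keep the one with the smallest error.
    training, test = data[:3], data[3:]                     # a tiny split, just to show the shape
    for k in (1, 2, 3):
        print(k, evaluate(test, training, k, target='z'))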
TLDR
Which movie will you like? Well, you liked Terminator. So you'll probably like Terminator II.
related
discussion
- Can work well for numeric problems with a fairly small number of features.
- Since "distance" between labels makes less sense, the method is less useful for non-numeric data.
- As the number of features increases, distances tend to get larger and have less variation (the "curse of dimensionality") ... and the method doesn't work as well (illustrated in the sketch below).
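A quick way to convince yourself of that last point : generate pairs of random points and watch the relative spread of their distances shrink as the number of features grows. This little experiment is mine, just for illustration.

    # A quick experiment: pairs of random points in [0,1]^dims.
    # As dims grows, the average distance rises and the relative spread shrinks.
    import random

    def mean_and_spread(dims, trials=1000):
        dists = []
        for _ in range(trials):
            a = [random.random() for _ in range(dims)]
            b = [random.random() for _ in range(dims)]
            dists.append(sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5)
        mean = sum(dists) / len(dists)
        spread = (sum((d - mean) ** 2 for d in dists) / len(dists)) ** 0.5
        return mean, spread

    for dims in (2, 10, 100):
        mean, spread = mean_and_spread(dims)
        print(dims, round(mean, 2), round(spread / mean, 2))   # spread/mean shrinks as dims grows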