This week we're doing our first ML (machine learning) example, looking at probably the simplest approach : find the data points closest to the unknown, and guess from those.

Your mission is to :

Read about the k nearest neighbors algorithm in places like

- the "from Scratch" textbook, chapter 12
- wikipedia: k-nearest neighbors
- Onel Harrison's Machine Learning Basics with the K-Nearest Neighbors Algorithm
- scikit learn KNN classification | regression

By a week from today, Fri March 6, make a jupyter notebook that uses this algorithm on one of these datasets and come to class ready to describe what you did and how it worked.

I'll discuss the basic ideas today, and show an example and/or answer questions on Tuesday.

- classification (labels) vs regression (numbers) tasks
- distance between two sets of data
- euclidean distance : collections of (probably normalized) numbers
- hamming distance : count of how many labels differ
- ... or perhaps some weighted combination of the two? (Hmmm.)
- neighbors : objects with the smallest distance away

- weighting : choose numeric factors to emphasize some parts over others
- k : a small integer
- k = 1 : just use nearest neighbor
- k = big : average_number or most_common_label using many neighbors
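The two distance functions above can each be written in a few lines of Python. (These helper names are mine, just for illustration.)

```python
import math

def euclidean(p, q):
    # square root of the summed squared differences, coordinate by coordinate
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hamming(p, q):
    # count of the positions where the two label sequences disagree
    return sum(1 for a, b in zip(p, q) if a != b)

print(euclidean((0, 0), (3, 4)))                    # 5.0
print(hamming(('red', 'big'), ('red', 'small')))    # 1
```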

Suppose our data looks like this :

```
x1    x2    x3    y      z
----  ----  ----  -----  ---
1.2   2.3   10    red    50
2.1   3.2   11    red    80
20    20    500   green  90
30    30    600   green  95
```

The problem is to guess either y or z given some (x1,x2,x3) e.g. (5,4,20).

Guessing y is a "classification" problem, e.g. "Is this book a romance, western, or science_fiction?"

Guessing z is a "regression" problem, e.g. "What will the temperature be on July 11?"
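Using the little table above as the training data, a bare-bones version of the whole guessing step might look like this. (k=3 and plain euclidean distance are arbitrary choices here, not the only options.)

```python
import math
from collections import Counter

# the sample table above, as (features, y_label, z_value) rows
data = [((1.2, 2.3, 10), 'red', 50),
        ((2.1, 3.2, 11), 'red', 80),
        ((20, 20, 500), 'green', 90),
        ((30, 30, 600), 'green', 95)]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_guess(x, k=3):
    # the k training rows closest to x
    closest = sorted(data, key=lambda row: euclidean(x, row[0]))[:k]
    y_guess = Counter(row[1] for row in closest).most_common(1)[0][0]  # vote
    z_guess = sum(row[2] for row in closest) / k                       # mean
    return y_guess, z_guess

print(knn_guess((5, 4, 20)))
```

For (5,4,20) the two "red" rows and the nearer "green" row are the three closest, so the vote gives "red" and the mean gives (80+50+90)/3 for z.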

- Divide the data into a training set and a test set.
- Pick a model to evaluate :
- a value for k
- a method for calculating distance
- a rule (mean or voting, perhaps with weights) for guessing given k points.

- Measure how well (y or z) can be guessed in the test set :
- loop over each member of the test set
- find the distance to each member of the training set
- sort to find the k closest ones
- guess mean(weighted closest z) or most_common(label y)
- accumulate total error across the test set : |guessed - actual| for z, or a count of wrong guesses for y.
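The evaluation loop above, sketched on a made-up one-feature regression problem. (The straight-line data and even/odd split are invented just for illustration.)

```python
# made-up data: z = 2x + 1, split into train and test halves
points = [(i, 2 * i + 1) for i in range(20)]
train, test = points[::2], points[1::2]

def knn_regress(x, k=3):
    # mean z of the k training points whose x is closest
    closest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(z for _, z in closest) / k

# accumulate the total absolute error across the test set
total_error = sum(abs(knn_regress(x) - z) for x, z in test)
print(total_error)
```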

- Repeat for other versions of the model (change k or guessing rule)
- Choose best model.
- Use it to tell Netflix users which movie they will like.
- Profit.
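"Repeat and choose the best model" can be made concrete too; here's a sketch that tries several values of k on a made-up curve and keeps the one with the smallest total test error. (Again, the data and split are invented for illustration.)

```python
# made-up data: z = (x - 10)^2, split into train and test halves
points = [(i, (i - 10) ** 2) for i in range(30)]
train, test = points[::2], points[1::2]

def knn_regress(x, k):
    # mean z of the k training points whose x is closest
    closest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(z for _, z in closest) / k

def total_error(k):
    return sum(abs(knn_regress(x, k) - z) for x, z in test)

best_k = min(range(1, 8), key=total_error)
print(best_k, total_error(best_k))
```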

Which movie will you like? Well, you liked Terminator. So you'll probably like Terminator II.

- Can work well for numeric problems with a fairly small number of features.
- Since "distance" between labels makes less sense, is less useful for non-numeric data.
- As the number of features increases, distances tend to get larger and have less variation ... and the method doesn't work as well.
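One partial remedy, hinted at by "probably normalized" above: in the sample table, x3 runs into the hundreds while x1 and x2 stay small, so x3 dominates any euclidean distance. Rescaling each feature to the range 0 to 1 puts them on equal footing. A sketch (the helper name is mine):

```python
def normalize_columns(rows):
    # rescale each column so its minimum maps to 0 and its maximum to 1
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) or 1 for c in cols]   # avoid divide-by-zero
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans))
            for row in rows]

rows = [(1.2, 2.3, 10), (2.1, 3.2, 11), (20, 20, 500), (30, 30, 600)]
print(normalize_columns(rows))
```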

https://cs.marlboro.college/cours/spring2020/data/notes/feb27

last modified Tue October 15 2024 6:09 am
