Spring 2020

# April 24

The assignment for Tuesday is posted - please come ready to share what you did this week on your project.

Last time I said I'd show a decision tree example ... but after some consideration decided to do something else instead.

While you're working on your projects, I think it would be more valuable to take a brief look at some of the other topics in our textbook - starting with clustering.

## clustering

The goal here is to partition your data into clumps of similar points.

While the other machine learning algorithms we've looked at were all "supervised" (i.e. they're given examples of the right answers and build a model to reproduce them), clustering is an "unsupervised" algorithm: one that tries to find structure in the data on its own, without supplied answers.

There are a number of different approaches to finding clusters, which give somewhat different results depending on what you think "close together" means.

### k-means

One popular approach is "k-means clustering", which is fairly straightforward once you see the idea. This is one of the methods implemented in *Data Science from Scratch*.

It's related to k-nearest-neighbors, but not the same.

The value of k is an input that you supply to the algorithm. It's common to try different k's to see which gives results that seem better in some way.

The algorithm:

- pick k
- choose k arbitrary starting cluster positions
- loop:
    - assign each point to the closest cluster position
    - if those assignments stop changing, stop
    - otherwise, replace each cluster position with the mean of the points assigned to it
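The loop above can be sketched in a few lines of Python. This is my own from-scratch sketch (not the textbook's code verbatim), using squared euclidean distance and random data points as the starting positions:

```python
import random

def squared_distance(p, q):
    """squared euclidean distance between two points (lists of numbers)"""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100):
    """Partition points into k clusters; returns (means, assignments)."""
    # choose k arbitrary starting cluster positions (here: k of the points)
    means = random.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # assign each point to the closest cluster position
        new_assignments = [min(range(k), key=lambda i: squared_distance(p, means[i]))
                           for p in points]
        # if those assignments stop changing, stop
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # otherwise, replace each cluster position with the mean of its points
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                means[i] = [sum(coords) / len(members) for coords in zip(*members)]
    return means, assignments
```

On well-separated data this converges quickly, but the result can depend on the random starting positions, which is another reason people run it several times or with several k's.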


### bottom-up hierarchical

The text also describes and implements one other approach that's simple to understand.

- assign each point to its own cluster of size 1
- loop while there is more than one cluster:
    - join the two closest clusters together


By keeping track of the history, this can be unwound to find any number of k clusters.

The results vary depending on how you define "closest" for two clusters. The text discusses three choices: minimum distance, maximum distance, and average distance.
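Here's a rough from-scratch sketch of the bottom-up approach (again my own version, not the textbook's code), where the `linkage` argument chooses the definition of "closest" - `min` gives minimum distance, `max` gives maximum distance:

```python
def point_distance(p, q):
    """euclidean distance between two points (tuples of numbers)"""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def cluster_distance(c1, c2, linkage=min):
    """distance between clusters: linkage=min, max, or an averaging function"""
    return linkage(point_distance(p, q) for p in c1 for q in c2)

def bottom_up(points, linkage=min):
    """Merge the two closest clusters repeatedly, keeping the whole history."""
    clusters = [[p] for p in points]   # each point starts in its own cluster
    history = [list(clusters)]         # history[t] has len(points) - t clusters
    while len(clusters) > 1:
        # find the indices of the two closest clusters ...
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]], linkage))
        # ... and join them together
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
        history.append(list(clusters))
    return history
```

Since each step merges exactly two clusters, `history[len(points) - k]` unwinds the process to give exactly k clusters.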

### "correct" number of clusters

One way to see if there's a "best" number of clusters is to pick an error function (such as the sum of the squared distances from each cluster position to the points assigned to it), plot it for different numbers of clusters k, and look for a "knee" in the plot where the rate of improvement changes abruptly.
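That error function is easy to compute given the cluster positions from any of the algorithms - here's a sketch, where `points` and `means` are both lists of coordinate lists:

```python
def squared_clustering_error(points, means):
    """total error: sum over points of squared distance to the nearest mean"""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, m)) for m in means)
               for p in points)
```

Then (assuming some `kmeans(points, k)` function that returns the cluster means) the idea would be to loop over `k = 1, 2, 3, ...`, record `squared_clustering_error` for each, and eyeball the plot for the knee.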

### iris data

Consider again the iris data that we've looked at several times.

Can a cluster algorithm find the species on its own?

We'd want to

- get the data
- normalize the numeric columns
- be able to assign a new "cluster" label 1 ... k
- be able to plot (using various axes) points with clusters & cluster centers
- apply one of the cluster algorithms
- analyze the results for different numbers of clusters k

... but I haven't done that myself yet. ;) Maybe for next week.

Googling "cluster iris jupyter" finds many examples ... though most are black-box-y.

For your projects, some sort of clustering could well be an interesting thing to try.

## next week

The chapters left in the book include

- natural language processing
- network analysis
- recommender systems
- databases and SQL (in which he codes his own NotQuiteABase and uses it to illustrate joins, indexes, and all that)
- MapReduce
- data ethics

We'll discuss what we want to look at next.

https://cs.marlboro.college/cours/spring2020/data/notes/apr24