Data Science

Spring 2020
course site

April 21

projects

First order of business: show us what you've done on your project.

decision trees

I'll walk through what a "decision tree" is (another machine learning model), using the material in chapter 17 of the textbook.

Decision trees are like the game of twenty questions ... the trick is figuring out which questions to ask, and in which order.

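One mathy way to choose the questions is to minimize the "average partition entropy": at each step, pick the question whose partition of the data leaves the least uncertainty about the labels. This is a "greedy" algorithm, doing the best we can with each question in turn rather than searching over whole trees.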
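To make the entropy idea concrete, here is a minimal sketch in plain Python. The function names, the "label" key, and the toy weather rows are all made up for illustration; this is not the textbook's code, just the same idea in a few lines.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def partition_entropy(subsets):
    """Average entropy of the subsets, weighted by subset size."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets)

def split_by(rows, attribute):
    """Group the labels of the rows by their value for one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row["label"])
    return list(groups.values())

def best_attribute(rows, attributes):
    """Greedy step: the attribute whose split has the lowest partition entropy."""
    return min(attributes,
               key=lambda a: partition_entropy(split_by(rows, a)))

# Hypothetical toy data: "outlook" predicts the label perfectly, "windy" not at all.
rows = [
    {"outlook": "sunny", "windy": "no",  "label": "play"},
    {"outlook": "sunny", "windy": "yes", "label": "play"},
    {"outlook": "rainy", "windy": "no",  "label": "stay"},
    {"outlook": "rainy", "windy": "yes", "label": "stay"},
]

print(best_attribute(rows, ["outlook", "windy"]))
```

Splitting on "outlook" gives two pure groups (entropy 0), while splitting on "windy" leaves two 50/50 groups (entropy 1 bit each), so the greedy step asks about "outlook" first.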

So the model is a tree of questions, each one splitting the data on a category label or on a range of numeric values.
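One possible way to represent such a tree in Python is sketched below. The class names (Leaf, Question), the attribute names, and the 30 degree threshold are assumptions invented for this example, and the tree is written by hand rather than learned from data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

@dataclass
class Leaf:
    prediction: str                  # the answer once the questions run out

@dataclass
class Question:
    ask: Callable[[dict], str]       # maps a data row to one of the branch names
    branches: Dict[str, "Tree"]      # one subtree per possible answer

Tree = Union[Leaf, Question]

def classify(tree, row):
    """Follow the answers down the tree until a leaf is reached."""
    while isinstance(tree, Question):
        tree = tree.branches[tree.ask(row)]
    return tree.prediction

# A hand-built tree: one categorical question, then one numeric-range question.
tree = Question(
    ask=lambda row: row["outlook"],
    branches={
        "rainy": Leaf("stay"),
        "sunny": Question(
            ask=lambda row: "hot" if row["temperature"] >= 30 else "mild",
            branches={"hot": Leaf("stay"), "mild": Leaf("play")},
        ),
    },
)

print(classify(tree, {"outlook": "sunny", "temperature": 22}))
```

Note that a rainy row never gets asked about temperature at all, which is the twenty-questions flavor: which question comes next depends on the answers so far.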

Here's a (very) short illustration.

The "from scratch" text uses that idea to generate this model for this data.

These models tend to overfit. One way to reduce this is to train many trees and combine their predictions (by averaging or voting) ... the "random forest" approach.
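Here is a rough sketch of the "many models" idea, under some loud simplifications: each "tree" is only a one-question stump chosen by training errors (not by entropy), and all the names, parameters, and toy rows are hypothetical. What it does share with a random forest is that each model sees a bootstrap sample of the rows and a random subset of the attributes, and the final answer is a majority vote.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, attributes):
    """Fit a one-question tree using the attribute with the fewest training errors."""
    def fit(attribute):
        groups = {}
        for row in rows:
            groups.setdefault(row[attribute], []).append(row["label"])
        answers = {value: majority(labels) for value, labels in groups.items()}
        errors = sum(answers[row[attribute]] != row["label"] for row in rows)
        return errors, attribute, answers

    _, attribute, answers = min((fit(a) for a in attributes), key=lambda t: t[0])
    default = majority([row["label"] for row in rows])   # for values never seen
    return lambda row: answers.get(row[attribute], default)

def train_forest(rows, attributes, n_trees=25, n_features=1, seed=0):
    """Train each stump on a bootstrap sample and a random attribute subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]        # sample with replacement
        features = rng.sample(attributes, n_features)    # random attribute subset
        forest.append(train_stump(sample, features))
    return forest

def forest_predict(forest, row):
    """Majority vote over all of the models in the forest."""
    return majority([model(row) for model in forest])

# Hypothetical toy data, made up for this example.
rows = [
    {"outlook": "sunny", "windy": "no",  "label": "play"},
    {"outlook": "sunny", "windy": "yes", "label": "play"},
    {"outlook": "rainy", "windy": "no",  "label": "stay"},
    {"outlook": "rainy", "windy": "yes", "label": "stay"},
]
forest = train_forest(rows, ["outlook", "windy"])
print(forest_predict(forest, {"outlook": "rainy", "windy": "no"}))
```

Any single stump trained on an unlucky sample can be wrong, but the vote over many of them is more stable, which is the point of the approach. A real implementation would use full trees rather than stumps.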

Something like this would be entirely possible to put in place for your projects, and might be a good alternative to nearest neighbors for datasets that are of manageable size.

next ?

Discuss what to do for Thursday ... perhaps look at this example of Kaggle Titanic data using decision trees?

https://cs.marlboro.college/cours/spring2020/data/notes/apr21
last modified Fri April 26 2024 8:11 am