Spring 2020

# assignments

## 1. getting started due Tue Jan 28

• Log into jupyter.marlboro.college. On that system, create a "hello world" program in two ways: in a jupyter notebook hello.ipynb (which includes a markdown title cell), and in a python file hello.py. Create the file with the built-in text editor (notice how it does color formatting once you give the file a .py name), and run it from a terminal.
• Using the CSV data file vernon_1850.csv, write a python program (in a jupyter notebook) that calculates and prints the average age of the people listed. (A sketch follows this list.)
• Read chapter 1 in the textbook, and come to class ready to discuss.
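
One way to do the average-age calculation with just the standard library is sketched below; the "age" column name is a guess, so check it against the CSV's actual header row.

```python
# A minimal sketch, assuming vernon_1850.csv has a header row with an
# "age" column - adjust the column name to match the actual file.
import csv

ages = []
with open("vernon_1850.csv") as f:
    for row in csv.DictReader(f):
        age = row["age"].strip()
        if age:                              # skip blank entries
            ages.append(float(age))

print("average age:", sum(ages) / len(ages))
```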

## 2. visualization due Fri Feb 7

• Read chapters 3 and 4 in the text, and explore the book's code and examples.
• Your mission this week is to start working with some data, massaging it and graphing it.
• See my Jan 31 notes for the details :
• Grab some csv data from kaggle or elsewhere.
• Play with it : put it into buckets or combine some columns.
• Make some plots to visualize what it's all about.
• Do this in a jupyter notebook, explaining what you've done.
• If time allows, do this twice, with two different datasets. (A sketch of the workflow follows this list.)
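
If you haven't used pandas before, the overall bucket-and-plot workflow looks roughly like this; the file name and column names here are placeholders for whatever dataset you grab.

```python
# A sketch of the massage-and-plot workflow with pandas and matplotlib.
# "my_kaggle_data.csv" and the "value" column are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("my_kaggle_data.csv")

# "bucket" a numeric column into ten equal-width ranges ...
df["bucket"] = pd.cut(df["value"], bins=10)

# ... then count the rows in each bucket and plot the counts.
counts = df.groupby("bucket").size()
counts.plot(kind="bar")
plt.title("rows per bucket")
plt.tight_layout()
plt.show()
```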

## 3. statistics due Fri Feb 14

• Read chapters 5 and 6 in the text (statistics and probability).
• Using the Iris data and doing something like what's in my jupyter notebook, make a histogram of the virginica sepal lengths.
• Find their mean and standard deviation.
• Superimpose a plot of the normal distribution which has the same mean and standard deviation. Is it a reasonable fit?
• Find the probability that one of these flowers has a length greater than 8 cm.
• Confirm with a scatter plot and correlation coefficient that there is not much of a relation between the versicolor and virginica sepal length data.
• Check to see if there is a correlation between the versicolor sepal and petal lengths, with a scatter plot and the coefficient. What do you find? (A sketch of these steps follows this list.)
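
Here is a sketch of those steps, assuming the seaborn copy of the iris data (species names "virginica" and "versicolor", lengths in cm in the "sepal_length" and "petal_length" columns).

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

iris = sns.load_dataset("iris")
virginica = iris[iris.species == "virginica"]["sepal_length"].values
versicolor = iris[iris.species == "versicolor"]

mu, sigma = virginica.mean(), virginica.std()
print(f"mean = {mu:.2f} cm, std dev = {sigma:.2f} cm")

# histogram (normalized to area 1) with the matching normal curve on top
plt.hist(virginica, bins=10, density=True, alpha=0.5)
xs = np.linspace(virginica.min() - 1, virginica.max() + 1, 200)
plt.plot(xs, stats.norm.pdf(xs, mu, sigma))
plt.xlabel("virginica sepal length (cm)")
plt.show()

# P(length > 8 cm) under that normal model - the survival function
print("P(length > 8) =", stats.norm.sf(8, mu, sigma))

# correlation coefficients for the two comparisons
print(np.corrcoef(versicolor["sepal_length"], virginica)[0, 1])
print(np.corrcoef(versicolor["sepal_length"], versicolor["petal_length"])[0, 1])
```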

## 4. probability and inference due Fri Feb 21

• Choose a few pages (at least) of text from a well-known book.
• Write a python program to find
• the probability P(w) of its words,
• the conditional probabilities P(2nd=word_j | 1st=word_i) for consecutive (1st_word, 2nd_word) pairs.
• (You may find this code to count the words in Moby Dick to be helpful.)
• What are the most common words? Given one of those, what are the most common words that follow it?
• Show by direct calculation that Bayes' theorem holds for one pair : P(1st|2nd) = P(2nd|1st) P(1st) / P(2nd).
• Is this coin fair? (It gives a random 't' or 'h' with each page load.) Make and discuss an explicit hypothesis test to decide, in two cases : with 10 coin flips, and with 5000 coin flips. (Sketches of the word counts and the coin test follow this list.)
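
For the word counts, collections.Counter does most of the work; this sketch assumes the text is already saved as a plain-text file (the file name is a placeholder).

```python
# A sketch of the word and pair probabilities with collections.Counter.
from collections import Counter

words = open("book.txt").read().lower().split()   # placeholder file name
n = len(words)

word_counts = Counter(words)
pair_counts = Counter(zip(words, words[1:]))      # consecutive pairs

def p(w):
    """P(w): probability that a randomly chosen word is w."""
    return word_counts[w] / n

def p_second_given_first(second, first):
    """P(2nd=second | 1st=first) estimated from the pair counts."""
    return pair_counts[(first, second)] / word_counts[first]

print(word_counts.most_common(10))                # the most common words
```

And for the coin, one "from scratch" hypothesis test uses the normal approximation to the binomial: under the null hypothesis of a fair coin, the number of heads in n flips has mean n/2 and standard deviation sqrt(n/4). The flip counts below are made up for illustration.

```python
import math

def two_sided_p(heads, n):
    """Two-sided p-value for H0: the coin is fair, by normal approximation."""
    z = (heads - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))

print(two_sided_p(7, 10))        # ~0.21 : 70% heads in 10 flips proves little
print(two_sided_p(2600, 5000))   # ~0.005 : 52% heads in 5000 flips is suspicious
```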

## 5. k-nearest neighbors due Fri Mar 6

• Create a jupyter notebook that uses the k-nearest neighbors algorithm, as described in our textbook and in class, on one of these datasets. (A minimal sketch follows this list.)
• Come to class Friday ready to describe what you did and how it worked.
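
If it helps to see the shape of the algorithm, here is a minimal from-scratch version in the textbook's spirit: classify a new point by majority vote among its k nearest labeled neighbors.

```python
import math
from collections import Counter

def distance(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(k, labeled_points, new_point):
    """labeled_points is a list of (point, label) pairs."""
    by_distance = sorted(labeled_points,
                         key=lambda pair: distance(pair[0], new_point))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]

# toy usage:
data = [((1, 1), "a"), ((1, 2), "a"), ((5, 5), "b"), ((6, 5), "b")]
print(knn_classify(3, data, (2, 2)))   # -> "a"
```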

## 6. naive bayes due Fri Mar 13

• Apply the naive bayes text classification method to either
• a tiny example of your own, like the one I worked through in class on Tuesday
• a text classification dataset like this one
• Please don't use the black-box routines from scikit-learn; the point here is to work through the calculation "from scratch". (A toy sketch follows this list.)
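
As a reference point, here is a toy from-scratch sketch (not the textbook's exact code): count words per class, then score a new document with log probabilities and Laplace smoothing.

```python
import math
from collections import Counter

def train(examples):
    """examples is a list of (list_of_words, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for words, label in examples:
        word_counts[label].update(words)
    return class_counts, word_counts

def classify(words, class_counts, word_counts, vocab_size):
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # log P(label) + sum of log P(word | label), Laplace-smoothed
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + vocab_size
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [(["free", "money"], "spam"),
            (["meeting", "notes"], "ham"),
            (["free", "meeting"], "ham")]
vocab = {w for words, _ in examples for w in words}
cc, wc = train(examples)
print(classify(["free", "money", "money"], cc, wc, len(vocab)))  # -> "spam"
```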

## 7. ⮕ spring break check-in ⬅ due Mon Mar 30

• As we gear up to do this online, please drop me a note here to let me know how you're doing.
• What timezone are you in?
• Do you have questions or concerns?

## 9. neural nets due Tue Apr 14

• Read chapter 18 in the text, "neural nets", and play with that code. (A tiny forward-pass sketch follows this list.)
• Check out the related blog post by the same author, Fizz Buzz in Tensorflow.
• ... a more specific coding piece for this may be coming ...
• Decide what data you want to work with for your final project, and what sorts of investigations you want to do on it. (Presentations will be in a month, on Tue May 5. Expect a "how is it going" update due in about two weeks.)
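
A tiny piece of that chapter's code to play with: a sigmoid neuron and the classic two-layer XOR network, with each neuron's bias folded in as its last weight.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def neuron_output(weights, inputs):
    # dot product of the weights with the inputs plus a constant 1 for the bias
    return sigmoid(sum(w * x for w, x in zip(weights, inputs + [1])))

def feed_forward(network, inputs):
    """Run inputs through each layer in turn; return the final layer's outputs."""
    for layer in network:
        inputs = [neuron_output(weights, inputs) for weights in layer]
    return inputs

xor_network = [  # hidden layer: an AND-like and an OR-like neuron, then output
    [[20, 20, -30], [20, 20, -10]],
    [[-60, 60, -30]],
]
for x in [0, 1]:
    for y in [0, 1]:
        print(x, y, round(feed_forward(xor_network, [x, y])[0]))
```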

## 10. deep learning due Tue Apr 21

• Work on your projects. Describe what you've done.
• Come to class Tuesday with something to show - data loaded into a jupyter notebook and a plot, for example.
• Read chapter 19, on deep learning, and/or check out some of the articles I posted in the class notes. In a jupyter notebook, try running one of the "from scratch" examples or a tutorial from tensorflow or pytorch. (Both should work on jupyter.marlboro.college.) A minimal tensorflow sketch follows this list.
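
If you go the tensorflow route, the standard beginner tutorial is a reasonable first run; this is roughly that example, not a definitive recipe.

```python
# A minimal sketch of the tf.keras beginner example on MNIST.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3)
model.evaluate(x_test, y_test)
```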

## 11. decision trees due Tue Apr 28

• Continue work on your projects.
• Be ready to give another project status update in class on Tuesday.
• Read chapter 17 on decision trees, my notes from Tuesday, and/or explore the "for further exploration" section at the end of the chapter.
• Also check out chapter 20, on clustering.
• Optional : try a decision tree model on any data you choose, using scikit-learn's decision tree, the textbook code, or another library. Or try a clustering algorithm from the text or elsewhere (scikit-learn has several). A short scikit-learn sketch follows this list.
• Tell me here what you did this week.
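
For the optional piece, the scikit-learn version is only a few lines; here is a sketch on the built-in iris data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3)   # shallow tree to limit overfitting
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```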

## 12. final project presentation due Tue May 5

• Present your data analysis projects to the rest of the class on our last meeting.

## 13. final project submission due Fri May 8

• Turn in a jupyter notebook of your final project data analysis.
• Include :
• a bibliography of other similar or related work that helped you along
• a description of your exploratory investigation, with plots