1. getting started
due Tue Jan 28
- Log into jupyter.marlboro.college. On that system, create a "hello world" program in two ways: in a jupyter notebook hello.ipynb (which includes a markdown title cell), and in a python file hello.py. Create the file with the built-in text editor (notice how it does color formatting once you give the file a .py name), and run it from a terminal.
- Using the CSV data file vernon_1850.csv, write a python program (in a jupyter notebook) that calculates and prints the average age of the people listed.
- Read chapter 1 in the textbook, and come to class ready to discuss.
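The age-averaging step can be sketched with the standard library's csv module. Note that the column name "age" below is a guess, so match it to the file's real header:

```python
# A minimal sketch using only the standard library; the column name
# "age" is an assumption -- check vernon_1850.csv's actual header.
import csv

def average_age(path, column="age"):
    """Mean of one numeric column in a CSV file with a header row."""
    with open(path, newline="") as f:
        values = [float(row[column]) for row in csv.DictReader(f) if row[column]]
    return sum(values) / len(values)

# print(average_age("vernon_1850.csv"))
```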
2. working with data
due Fri Feb 7
- Read chapters 3 and 4 in the text, and explore the book's code and examples.
- Your mission this week is to start working with some data, massaging it and graphing it.
- See my Jan 31 notes for the details :
- Grab some csv data from kaggle or elsewhere.
- Play with it : put it into buckets or combine some columns.
- Make some plots to visualize what it's all about.
- Do this in a jupyter notebook, explaining what you've done.
- If time allows, do this twice, with two different datasets.
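One way to sketch the "buckets" step, assuming a numeric column; the decade width and sample values are placeholders, and the actual plotting (e.g. a matplotlib bar chart over the counts) is left as a comment since it depends on your dataset:

```python
# A minimal bucketing sketch; the width and the sample values
# below stand in for whatever CSV column you grabbed.
from collections import Counter

def bucket(values, width):
    """Group numeric values into bins of the given width,
    keyed by each bin's lower edge."""
    return Counter(width * int(v // width) for v in values)

counts = bucket([3, 17, 21, 25, 38], 10)   # {0: 1, 10: 1, 20: 2, 30: 1}

# To visualize, something like:
#   import matplotlib.pyplot as plt
#   plt.bar(counts.keys(), counts.values(), width=9)
```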
3. statistics
due Fri Feb 14
- Read chapters 5 and 6 in the text (statistics and probability).
- Using the Iris data and doing something like what's in my jupyter notebook, make a histogram of the virginica sepal lengths.
- Find their mean and standard deviation.
- Superimpose a plot of the normal distribution which has the same mean and standard deviation. Is it a reasonable fit?
- Find the probability that one of these flowers has a length greater than 8 cm.
- Confirm with a scatter plot and correlation coefficient that there is not much of a relation between the versicolor and virginica sepal length data.
- Check to see if there is a correlation between the versicolor sepal and petal lengths with a scatter plot and the coefficient. What do you find?
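The mean, standard deviation, normal overlay, and tail probability can all be done from scratch with the math module. A sketch (the histogram and overlay themselves would be matplotlib plots built on top of normal_pdf):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std_dev(xs):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def normal_pdf(x, mu, sigma):
    """Density of the normal(mu, sigma) distribution, for the overlay plot."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_tail(x, mu, sigma):
    """P(X > x) for X ~ normal(mu, sigma), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))
```

With the virginica sepal lengths in a list xs, normal_tail(8, mean(xs), std_dev(xs)) estimates the probability of a length over 8 cm, under the normal assumption.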
4. probability and inference
due Fri Feb 21
- Choose a few pages (at least) of text from a well known book.
- Write a python program to find
- the probability P(w) of its words,
- the conditional probabilities P(2nd=word_j | 1st=word_i) for consecutive (1st_word, 2nd_word) pairs.
- (You may find this code to count the words in Moby Dick to be helpful.)
- What are the most common words? Given one of those, what are the most common words that follow it?
- Show by direct calculation that Bayes' theorem holds for one pair P(1st|2nd).
- Is this coin fair? (It gives a random 't' or 'h' with each page load.) Make and discuss an explicit hypothesis test to decide, in two cases : with 10 coin flips, and with 5000 coin flips.
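The word and pair probabilities reduce to counting; a sketch, where the six-word sentence stands in for your chosen pages:

```python
from collections import Counter

def word_probs(words):
    """P(w) for each word in the list."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def conditional_probs(words):
    """P(2nd | 1st) for each consecutive (1st, 2nd) pair."""
    pair_counts = Counter(zip(words, words[1:]))
    first_counts = Counter(words[:-1])
    return {pair: c / first_counts[pair[0]] for pair, c in pair_counts.items()}

words = "the cat sat on the mat".split()
# word_probs(words)["the"] is 2/6; conditional_probs(words)[("the", "cat")] is 1/2
```

For the Bayes check, P(1st=a | 2nd=b) = P(2nd=b | 1st=a) * P(1st=a) / P(2nd=b), where the 1st and 2nd probabilities are counted over the pairs rather than over all words.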
5. k-nearest neighbors
due Fri Mar 6
- Create a jupyter notebook that uses the k-nearest neighbors algorithm as described in our textbook and in class on one of these datasets.
- Come to class Friday ready to describe what you did and how it worked.
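In case it helps to see the shape of the algorithm, here is a bare-bones k-nearest-neighbors sketch in the spirit of the textbook's from-scratch code; the toy points are invented for illustration:

```python
import math
from collections import Counter

def distance(a, b):
    """Euclidean distance between two points (tuples of coordinates)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(k, labeled_points, new_point):
    """labeled_points is a list of (point, label) pairs; returns the
    majority label among the k points nearest to new_point."""
    by_distance = sorted(labeled_points, key=lambda pl: distance(pl[0], new_point))
    k_nearest = [label for _, label in by_distance[:k]]
    return Counter(k_nearest).most_common(1)[0][0]

points = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b")]
# knn_classify(3, points, (1, 1)) returns "a"
```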
6. naive bayes
due Fri Mar 13
- Apply the naive bayes text classification method to either
- a tiny example of your own, like the one I worked through in class on Tuesday, or
- a text classification dataset like this one.
- Please don't use the black-box routines from scikit-learn - the point here is to work through the calculation "from scratch".
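Since the point is the from-scratch calculation, here is one way the bookkeeping can look, with Laplace (add-one) smoothing; the two-label "spam/ham" toy examples are invented:

```python
# A from-scratch naive Bayes sketch (no scikit-learn).
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (list_of_words, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(model, words):
    """Pick the label maximizing log P(label) + sum of log P(word | label),
    with add-one smoothing so unseen words don't zero out a class."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_logp = None, float("-inf")
    for label, count in label_counts.items():
        logp = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            logp += math.log((word_counts[label][w] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```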
7. ⮕ spring break check-in ⬅
due Mon Mar 30
- As we gear up to do this online, please drop me a note here to let me know how you're doing.
- What timezone are you in?
- How is your access to the internet?
- Do you have questions or concerns?
8. linear regression 1
due Tue Apr 7
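The heart of simple linear regression is a least-squares fit for y = alpha + beta * x, which can be written from scratch in a few lines; a sketch in the spirit of the textbook's code (the sample points in the comment are made up):

```python
def least_squares_fit(xs, ys):
    """Return (alpha, beta) minimizing squared error for y = alpha + beta * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x
    return alpha, beta

# least_squares_fit([1, 2, 3], [2, 4, 6]) returns (0.0, 2.0)
```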
9. neural nets
due Tue Apr 14
- Read chapter 18 in the text, "neural nets", and play with that code.
- Check out the related blog post by the same author, Fizz Buzz in TensorFlow.
- ... a more specific coding piece for this may be coming ...
- Decide what data you want to work with for your final project, and what sorts of investigations you want to do on it. (Presentations will be in a month, Tue May 5. Expect a "how is it going" update due in about two weeks.)
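The chapter's feed-forward idea can be sketched with a hand-built XOR network. The weights below are chosen by hand so each sigmoid neuron saturates near 0 or 1 (they follow the chapter's construction, but treat them as illustrative):

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def neuron_output(weights, inputs):
    # the last weight pairs with the trailing 1 appended to the inputs,
    # so it acts as a bias term
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

def feed_forward(network, input_vector):
    """network is a list of layers; each layer is a list of weight vectors.
    Returns the outputs of every layer; the last entry is the final output."""
    outputs = []
    for layer in network:
        input_with_bias = input_vector + [1]
        output = [neuron_output(neuron, input_with_bias) for neuron in layer]
        outputs.append(output)
        input_vector = output
    return outputs

xor_network = [  # hidden layer: an AND neuron and an OR neuron
    [[20, 20, -30],   # fires only when both inputs are 1 (AND)
     [20, 20, -10]],  # fires when either input is 1 (OR)
    # output layer: OR but not AND, which is XOR
    [[-60, 60, -30]],
]
# feed_forward(xor_network, [1, 0])[-1][0] is close to 1
```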
10. deep learning
due Tue Apr 21
- Work on your projects. Describe what you've done.
- Come to class Tuesday with something to show - data loaded into a jupyter notebook and a plot, for example.
- Read chapter 19, on deep learning, and/or check out some of the articles I posted in the class notes. In a jupyter notebook, try running one of the "from scratch" examples or a tutorial from tensorflow or pytorch. (Both should work on jupyter.marlboro.)
11. decision trees
due Tue Apr 28
- Continue work on your projects.
- Be ready to give another project status update in class on Tuesday.
- Read chapter 17 on decision trees, my notes from Tuesday, and/or explore the "for further exploration" at the end of the chapter.
- Also check out chapter 20, on clustering.
- Optional : try a decision tree model on any data you choose, using scikit-learn's decision tree, the textbook code, or another library. Or try a clustering algorithm from the text or elsewhere. (scikit-learn has a bunch.)
- Tell me here what you did this week.
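The core of the chapter's decision-tree construction is entropy; here is a short from-scratch sketch of the two quantities used to pick splits:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-count / total * math.log2(count / total)
               for count in Counter(labels).values())

def partition_entropy(subsets):
    """Weighted average entropy of a partition into subsets of labels;
    a candidate split is better when this is lower."""
    total = sum(len(subset) for subset in subsets)
    return sum(entropy(subset) * len(subset) / total for subset in subsets)
```

Building the tree is then a matter of repeatedly choosing the attribute whose partition has the lowest partition_entropy.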
12. final project presentation
due Tue May 5
- Present your data analysis projects to the rest of the class on our last meeting.
13. final project submission
due Fri May 8
- Turn in a jupyter notebook of your final project data analysis, including:
- your data sources
- a bibliography of other similar or related work that helped you along
- a description of your exploratory investigation, with plots
- questions that your work tries to answer
- any machine learning models that you developed and applied
- whatever conclusions or thoughts you ended with
14. semester grade
due Mon May 11
- a place for Jim to leave end of term comments
15. your feedback
due Wed May 13
- Please give me any feedback you have about how the class went - what you liked and didn't like.
- What would have improved it?
- What worked for you?
- How did the last month online work? What would have helped it go better?