Spring 2020

# assignments

## 1. getting started due Tue Jan 28

• Log into jupyter.marlboro.college. On that system, create a "hello world" program in two ways: in a jupyter notebook hello.ipynb (which includes a markdown title cell), and in a python file hello.py. Create the file with the built-in text editor (notice how it does color formatting once you give the file a .py name), and run it from a terminal.
• Using the CSV data file vernon_1850.csv, write a python program (in a jupyter notebook) that calculates and prints the average age of the people listed. (A sketch follows this list.)
• Read chapter 1 in the textbook, and come to class ready to discuss.
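
One way to do the average-age calculation with just the standard library is sketched below; the "age" column name is a guess, so check it against the CSV's actual header row.

```python
# A minimal sketch, assuming vernon_1850.csv has a header row with an
# "age" column - adjust the column name to match the actual file.
import csv

ages = []
with open("vernon_1850.csv") as f:
    for row in csv.DictReader(f):
        age = row["age"].strip()
        if age:                              # skip blank entries
            ages.append(float(age))

print("average age:", sum(ages) / len(ages))
```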

## 2. visualization due Fri Feb 7

• Read chapters 3 and 4 in the text, and explore the book's code and examples.
• Your mission this week is to start working with some data, massaging it and graphing it.
• See my Jan 31 notes for the details :
• Grab some csv data from kaggle or elsewhere.
• Play with it : put it into buckets or combine some columns.
• Make some plots to visualize what it's all about.
• Do this in a jupyter notebook, explaining what you've done.
• If time allows, do this twice, with two different datasets. (A sketch of the workflow follows this list.)
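
If you haven't used pandas before, the overall bucket-and-plot workflow looks roughly like this; the file name and column names here are placeholders for whatever dataset you grab.

```python
# A sketch of the massage-and-plot workflow with pandas and matplotlib.
# "my_kaggle_data.csv" and the "value" column are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("my_kaggle_data.csv")

# "bucket" a numeric column into ten equal-width ranges ...
df["bucket"] = pd.cut(df["value"], bins=10)

# ... then count the rows in each bucket and plot the counts.
counts = df.groupby("bucket").size()
counts.plot(kind="bar")
plt.title("rows per bucket")
plt.tight_layout()
plt.show()
```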

## 3. statistics due Fri Feb 14

• Read chapters 5 and 6 in the text (statistics and probability).
• Using the Iris data and doing something like what's in my jupyter notebook, make a histogram of the virginica sepal lengths.
• Find their mean and standard deviation.
• Superimpose a plot of the normal distribution which has the same mean and standard deviation. Is it a reasonable fit?
• Find the probability that one of these flowers has a length greater than 8 cm.
• Confirm with a scatter plot and correlation coefficient that there is not much of a relation between the versicolor and virginica sepal length data.
• Check to see if there is a correlation between the versicolor sepal and petal lengths, with a scatter plot and the coefficient. What do you find? (A sketch of these steps follows this list.)
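
Here is a sketch of those steps, assuming the seaborn copy of the iris data (species names "virginica" and "versicolor", lengths in cm in the "sepal_length" and "petal_length" columns).

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

iris = sns.load_dataset("iris")
virginica = iris[iris.species == "virginica"]["sepal_length"].values
versicolor = iris[iris.species == "versicolor"]

mu, sigma = virginica.mean(), virginica.std()
print(f"mean = {mu:.2f} cm, std dev = {sigma:.2f} cm")

# histogram (normalized to area 1) with the matching normal curve on top
plt.hist(virginica, bins=10, density=True, alpha=0.5)
xs = np.linspace(virginica.min() - 1, virginica.max() + 1, 200)
plt.plot(xs, stats.norm.pdf(xs, mu, sigma))
plt.xlabel("virginica sepal length (cm)")
plt.show()

# P(length > 8 cm) under that normal model - the survival function
print("P(length > 8) =", stats.norm.sf(8, mu, sigma))

# correlation coefficients for the two comparisons
print(np.corrcoef(versicolor["sepal_length"], virginica)[0, 1])
print(np.corrcoef(versicolor["sepal_length"], versicolor["petal_length"])[0, 1])
```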

## 4. probability and inference due Fri Feb 21

• Choose a few pages (at least) of text from a well-known book.
• Write a python program to find
• the probability P(w) of its words,
• the conditional probabilities P(2nd=word_j | 1st=word_i) for consecutive (1st_word, 2nd_word) pairs.
• (You may find this code to count the words in Moby Dick to be helpful.)
• What are the most common words? Given one of those, what are the most common words that follow it?
• Show by direct calculation that Bayes' theorem holds for one pair : P(1st|2nd) = P(2nd|1st) P(1st) / P(2nd).
• Is this coin fair? (It gives a random 't' or 'h' with each page load.) Make and discuss an explicit hypothesis test to decide, in two cases : with 10 coin flips, and with 5000 coin flips. (Sketches of the word counts and the coin test follow this list.)
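
For the word counts, collections.Counter does most of the work; this sketch assumes the text is already saved as a plain-text file (the file name is a placeholder).

```python
# A sketch of the word and pair probabilities with collections.Counter.
from collections import Counter

words = open("book.txt").read().lower().split()   # placeholder file name
n = len(words)

word_counts = Counter(words)
pair_counts = Counter(zip(words, words[1:]))      # consecutive pairs

def p(w):
    """P(w): probability that a randomly chosen word is w."""
    return word_counts[w] / n

def p_second_given_first(second, first):
    """P(2nd=second | 1st=first) estimated from the pair counts."""
    return pair_counts[(first, second)] / word_counts[first]

print(word_counts.most_common(10))                # the most common words
```

And for the coin, one "from scratch" hypothesis test uses the normal approximation to the binomial: under the null hypothesis of a fair coin, the number of heads in n flips has mean n/2 and standard deviation sqrt(n/4). The flip counts below are made up for illustration.

```python
import math

def two_sided_p(heads, n):
    """Two-sided p-value for H0: the coin is fair, by normal approximation."""
    z = (heads - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))

print(two_sided_p(7, 10))        # ~0.21 : 70% heads in 10 flips proves little
print(two_sided_p(2600, 5000))   # ~0.005 : 52% heads in 5000 flips is suspicious
```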

## 5. k-nearest neighbors due Fri Mar 6

• Create a jupyter notebook that uses the k-nearest neighbors algorithm, as described in our textbook and in class, on one of these datasets. (A minimal sketch follows this list.)
• Come to class Friday ready to describe what you did and how it worked.
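
If it helps to see the shape of the algorithm, here is a minimal from-scratch version in the textbook's spirit: classify a new point by majority vote among its k nearest labeled neighbors.

```python
import math
from collections import Counter

def distance(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(k, labeled_points, new_point):
    """labeled_points is a list of (point, label) pairs."""
    by_distance = sorted(labeled_points,
                         key=lambda pair: distance(pair[0], new_point))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]

# toy usage:
data = [((1, 1), "a"), ((1, 2), "a"), ((5, 5), "b"), ((6, 5), "b")]
print(knn_classify(3, data, (2, 2)))   # -> "a"
```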

## 6. naive bayes due Fri Mar 13

• Apply the naive bayes text classification method to either
• a tiny example of your own, like the one I worked through in class on Tuesday
• a text classification dataset like this one
• Please don't use the black-box routines from scikit-learn; the point here is to work through the calculation "from scratch". (A toy sketch follows this list.)
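
As a reference point, here is a toy from-scratch sketch (not the textbook's exact code): count words per class, then score a new document with log probabilities and Laplace smoothing.

```python
import math
from collections import Counter

def train(examples):
    """examples is a list of (list_of_words, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for words, label in examples:
        word_counts[label].update(words)
    return class_counts, word_counts

def classify(words, class_counts, word_counts, vocab_size):
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # log P(label) + sum of log P(word | label), Laplace-smoothed
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + vocab_size
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [(["free", "money"], "spam"),
            (["meeting", "notes"], "ham"),
            (["free", "meeting"], "ham")]
vocab = {w for words, _ in examples for w in words}
cc, wc = train(examples)
print(classify(["free", "money", "money"], cc, wc, len(vocab)))  # -> "spam"
```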

## 7. ⮕ spring break check-in ⬅ due Mon Mar 30

• As we gear up to do this online, please drop me a note here to let me know how you're doing.
• What timezone are you in?
• Do you have questions or concerns?

## 9. neural nets due Tue Apr 14

• Read chapter 18 in the text, "neural nets", and play with that code. (A tiny forward-pass sketch follows this list.)
• Check out the related blog post by the same author, Fizz Buzz in Tensorflow.
• ... a more specific coding piece for this may be coming ...
• Decide what data you want to work with for your final project, and what sorts of investigations you want to do on it. (Presentations will be in a month, on Tue May 5. Expect a "how is it going" update due in about two weeks.)
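
A tiny piece of that chapter's code to play with: a sigmoid neuron and the classic two-layer XOR network, with each neuron's bias folded in as its last weight.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def neuron_output(weights, inputs):
    # dot product of the weights with the inputs plus a constant 1 for the bias
    return sigmoid(sum(w * x for w, x in zip(weights, inputs + [1])))

def feed_forward(network, inputs):
    """Run inputs through each layer in turn; return the final layer's outputs."""
    for layer in network:
        inputs = [neuron_output(weights, inputs) for weights in layer]
    return inputs

xor_network = [  # hidden layer: an AND-like and an OR-like neuron, then output
    [[20, 20, -30], [20, 20, -10]],
    [[-60, 60, -30]],
]
for x in [0, 1]:
    for y in [0, 1]:
        print(x, y, round(feed_forward(xor_network, [x, y])[0]))
```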

## 10. deep learning due Tue Apr 21

• Work on your projects. Describe what you've done.
• Come to class Tuesday with something to show - data loaded into a jupyter notebook and a plot, for example.
• Read chapter 19, on deep learning, and/or check out some of the articles I posted in the class notes. In a jupyter notebook, try running one of the "from scratch" examples or a tutorial from tensorflow or pytorch. (Both should work on jupyter.marlboro.college.) A minimal tensorflow sketch follows this list.
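
If you go the tensorflow route, the standard beginner tutorial is a reasonable first run; this is roughly that example, not a definitive recipe.

```python
# A minimal sketch of the tf.keras beginner example on MNIST.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3)
model.evaluate(x_test, y_test)
```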

## 11. decision trees due Tue Apr 28

• Continue work on your projects.
• Be ready to give another project status update in class on Tuesday.
• Read chapter 17 on decision trees, my notes from Tuesday, and/or explore the "for further exploration" section at the end of the chapter.
• Also check out chapter 20, on clustering.
• Optional : try a decision tree model on any data you choose, using scikit-learn's decision tree, the textbook code, or another library. Or try a clustering algorithm from the text or elsewhere (scikit-learn has several). A short scikit-learn sketch follows this list.
• Tell me here what you did this week.
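
For the optional piece, the scikit-learn version is only a few lines; here is a sketch on the built-in iris data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3)   # shallow tree to limit overfitting
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```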

## 12. final project presentation due Tue May 5

• Present your data analysis projects to the rest of the class on our last meeting.

## 13. final project submission due Fri May 8

• Turn in a jupyter notebook of your final project data analysis.
• Include :
• a bibliography of other similar or related work that helped you along
• a description of your exploratory investigation, with plots