Playing around with ideas from Hands-On Machine Learning.
The GitHub repo that goes along with the book has a number of examples showing numpy, matplotlib, and pandas in Jupyter notebooks.
Jim Mahoney | MIT License | Summer 2018 | cs.marlboro.college
import matplotlib.pyplot as plt
import numpy as np
from numpy import *            # for the bare pi, sin, sqrt, sum, mean used below
import pandas as pd
import hashlib
%matplotlib inline
def myplot(x, y, pointstyle=False,
           x2=False, y2=False, pointstyle2=False,
           title='', xlabel='', ylabel='',
           xlimits=False, ylimits=False,
           grid=False, width=4, height=2, dpi=218,
           savefile=False):
    """ An API for making a plot with the matplotlib library. """
    # other options that I might want in the future :
    # semilogx(x,y), semilogy(x,y), loglog(x,y)
    figure = plt.figure(dpi=dpi, figsize=(width, height))   # size in inches
    axis = figure.add_subplot(111)   # 111 means (rows, cols, which) i.e. 1x1 #1
    if pointstyle:
        axis.plot(x, y, pointstyle)  # e.g. 'ro' for red circles
    else:
        axis.plot(x, y)
    if pointstyle2:
        axis.plot(x2, y2, pointstyle2)
    axis.set(xlabel=xlabel, ylabel=ylabel, title=title)
    if grid:
        axis.grid()
    if xlimits:
        axis.set_xlim(xlimits)       # e.g. xlimits=(0, 10)
    if ylimits:
        axis.set_ylim(ylimits)
    if savefile:
        figure.savefig(savefile)     # e.g. savefile='myplot.png' ; saved after limits are set
    plt.show()
# Make the random routines behave the same each time
# this notebook is loaded.
np.random.seed(42)
x = np.linspace(0, 2*pi, 200) # 200 points evenly spaced from 0 to 2*pi
myplot(x, sin(x), grid=True, title=r'$y=\sin(x)$ from 0 to $2\pi$')
This is a classic set of data, just to illustrate a few data manipulations. Google "iris csv" for sources. I downloaded iris.csv manually and put it in the datasets/ folder.
iris = pd.read_csv('datasets/iris.csv')
iris.head()
iris.columns
iris[2:5] # rows
iris['sepal_length'][2:5] # column, rows
mean(iris.sepal_length) # or use column name as an attribute
virginica = iris[ iris.species=='virginica' ] # select one species.
myplot(x=virginica.sepal_length, y=virginica.petal_length, pointstyle='b.',
xlabel='sepal length', ylabel='petal length', title='species virginica')
Now I'll do an example of the learning protocol described in the "Hands on ..." book, without yet using the scikit-learn pipelines but instead using numpy to implement the basic idea.
We're looking only at the virginica species, at the relation between petal length and sepal length - one piece of the data chosen arbitrarily.
The plot above shows that this looks roughly linear and so I'll use that for the model, namely
$$ petal = a * sepal + b $$
where (a, b) are the slope and intercept of the linear model. Numpy has (polyfit, polyval) functions which find and apply such a model. I'll treat them as a black box here, though in this simple case the least-squares math for a linear fit is well understood and could be done explicitly.
# The API looks like this (illustration only; x and y stand for arrays of data values):
#   fit = np.polyfit(x, y, deg=1)    # fit=[a,b], a linear model of the arrays of values
#   y_model = np.polyval(fit, x)     # find values y_model = a * x + b where fit=[a,b]
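As an aside (my addition, not from the book), the deg=1 least-squares fit that polyfit computes has a simple closed form; here's a sketch of it, with linear_fit as a hypothetical helper name:
def linear_fit(x, y):
    """ Closed-form least-squares linear fit returning [a, b] with y ~ a*x + b.
        This is (up to numerical details) what np.polyfit(x, y, deg=1) computes;
        shown here only to open up the black box a little. """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b = y.mean() - a * x.mean()
    return np.array([a, b])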
def performance(x, y, fit):
    """ Return the performance measure of the fit=[a,b] linear
        model on some x, y data values.
    """
    n = len(x)
    assert n == len(y)
    y_model = np.polyval(fit, x)
    return sqrt( sum( (y_model - y)**2 ) / n )
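In other words, the performance measure used throughout is the root-mean-square (RMS) error of the model's predictions:
$$ \textrm{RMS} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_{\textrm{model},i} - y_i \right)^2 } $$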
def perfs(training, testing, fit):
    """ Return the performances of the fit on the training and testing data sets. """
    x_train = training.sepal_length
    y_train = training.petal_length
    x_test = testing.sepal_length
    y_test = testing.petal_length
    return (performance(x_train, y_train, fit),
            performance(x_test, y_test, fit))
The first step is to divide the data into training and testing sets, about 80% and 20%, which I'll do with integer indices.
n_virginica = len(virginica)
index_training = sorted(np.random.choice(range(n_virginica), int(0.8*n_virginica), replace=False)) # 80% training
index_testing = sorted(list(set(range(len(virginica))) - set(index_training))) # 20% testing
virginica_training = virginica.iloc[ index_training ]
virginica_testing = virginica.iloc[ index_testing ]
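A quick sanity check (my addition) that the split really came out close to 80/20:
# If iris.csv has the usual 50 virginica rows, this should print 50 40 10.
print(n_virginica, len(virginica_training), len(virginica_testing))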
Here's what those two sets look like.
myplot(x=virginica_training.sepal_length, y=virginica_training.petal_length, pointstyle='r.',
x2=virginica_testing.sepal_length, y2=virginica_testing.petal_length, pointstyle2='g+',
xlabel='sepal length', ylabel='petal length', title='red_dot=training ; green_plus=testing')
I'll try a bad fit first, just for fun. The size ranges of x and y are about the same, so I'll try
$$ y = 1 x + 0 $$
on the training data, and see what the performance measure is.
badfit = np.array([1.0, 0.0])
xline = np.linspace(min(virginica.sepal_length), max(virginica.sepal_length), 100)
yline = np.polyval(badfit, xline)
myplot(x=virginica_training.sepal_length, y=virginica_training.petal_length, pointstyle='r.',
x2=xline, y2=yline, pointstyle2='b',
xlabel='sepal length', ylabel='petal length', title='bad fit')
The performance (i.e. RMS error) of this bad fit on the training data is
performance(virginica_training.sepal_length, virginica_training.petal_length, badfit)
which looks about right, namely that the blue line is on average about 1 away from the red dots.
On the testing data it gives
performance(virginica_testing.sepal_length, virginica_testing.petal_length, badfit)
about the same, which is good: the testing and training data are similar, and since the RMS error is a per-point average it doesn't depend on the sample size.
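As a rough cross-check of that "about 1 away" reading (my addition), the mean absolute residual of the bad fit on the training data should land in the same ballpark as its RMS error:
# Mean absolute distance between the bad-fit line and the training points.
residuals = (virginica_training.petal_length
             - np.polyval(badfit, virginica_training.sepal_length))
np.mean(np.abs(residuals))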
Now for a better fit.
fit1 = np.polyfit(virginica_training.sepal_length, virginica_training.petal_length, deg=1)
fit1
myplot(x=virginica_training.sepal_length, y=virginica_training.petal_length, pointstyle='r.',
x2=xline, y2=np.polyval(fit1, xline), pointstyle2='b',
xlabel='sepal length', ylabel='petal length', title='linear fit')
The performances on the training and testing sets are
perfs(virginica_training, virginica_testing, fit1)
The performance on both the training and testing sets is better. A visualization of the fit against the testing data is
myplot(x=virginica_testing.sepal_length, y=virginica_testing.petal_length, pointstyle='g+',
x2=xline, y2=np.polyval(fit1, xline), pointstyle2='b',
xlabel='sepal length', ylabel='petal length', title='linear fit - testing data')
Now for an overfit. Let's do the same but with a degree 10 polynomial.
fit10 = np.polyfit(virginica_training.sepal_length, virginica_training.petal_length, deg=10)
fit10
myplot(x=virginica_training.sepal_length, y=virginica_training.petal_length, pointstyle='r.',
x2=xline, y2=np.polyval(fit10, xline), pointstyle2='b',
xlabel='sepal length', ylabel='petal length', title='poly fit degree 10, training')
perfs(virginica_training, virginica_testing, fit10)
myplot(x=virginica_testing.sepal_length, y=virginica_testing.petal_length, pointstyle='g+',
x2=xline, y2=np.polyval(fit10, xline), pointstyle2='b',
xlabel='sepal length', ylabel='petal length', title='poly degree 10 - testing data')
This 10th degree poly gives marginally better performance on both the training and testing data.
However, the funny bump near x=5 makes it look pretty suspicious ... the linear fit is more believable.
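To make that more quantitative, here's a sketch (my addition, reusing the perfs() helper and the data split from above) that sweeps the polynomial degree and compares training vs testing RMS error; typically the training error keeps shrinking as the degree grows while the testing error eventually stops improving or gets worse.
# Train/test RMS error as a function of polynomial degree, using the same
# virginica training/testing split and the perfs() helper defined earlier.
for degree in range(1, 11):
    fit_d = np.polyfit(virginica_training.sepal_length,
                       virginica_training.petal_length, deg=degree)
    train_err, test_err = perfs(virginica_training, virginica_testing, fit_d)
    print(f"degree {degree:2d} : train RMS {train_err:.3f} , test RMS {test_err:.3f}")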