Nov 17
aside
More Stanford online classes:
nlp-class.com (natural language processing)
pgm-class.com (probabilistic graphical models)
& others
decision trees
I assigned 18.6 as a way to get you to look at the
basic idea, without getting too far into the math details.
The basic idea is that one way to draw a conclusion
from data is to make a series of sequential choices.
This works particularly well when all the variables are discrete, e.g.
in1  in2  in3  output
1    0    1    1
2    0    3    1
...
In a machine learning context, each row is one training example,
and the decision tree is the machine we're going to build
from the examples.
First point: any ordering of the inputs gives you
a possible tree which can give those outputs
(as long as no two examples have the same inputs but different outputs).
Second point: depending on how well the outputs
match the inputs, some trees will be simpler (i.e. better)
than others, and a simple tree gives some confidence that it is
a good "model" for that data. Complicated trees
are overtrained (overfit): they fit that specific set of
data but do not represent general trends.
Third point: since we want a simple tree,
we want to find an order for the choices
in which each choice splits the examples into groups
whose outputs are as uniform as possible.
With that in mind, discuss the assigned problem AIMA 18.6 :
A1  A2  A3  Output
------------------
1   0   0   0
1   0   1   0
0   1   0   0
1   1   1   1
1   1   0   1
Look at what happens intuitively for various
choices of using A1, A2, A3 first to divide
things up, and what the good choices are after that.
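One way to check the intuition is to let Python do the bookkeeping.
This is only a sketch (Python 3); split_outputs is a name made up
for these notes, not something from the book. It groups the Output
column by the value of one attribute, first for the whole table and
then for the examples left over after choosing A2 = 1.

```python
# AIMA 18.6 examples, one tuple per row: (A1, A2, A3, Output)
EXAMPLES = [
    (1, 0, 0, 0),
    (1, 0, 1, 0),
    (0, 1, 0, 0),
    (1, 1, 1, 1),
    (1, 1, 0, 1),
]

def split_outputs(examples, attribute):
    """Group the outputs by the value of one attribute (0-based column index).

    split_outputs is a hypothetical helper name used only in this sketch.
    """
    groups = {}
    for row in examples:
        groups.setdefault(row[attribute], []).append(row[-1])
    return groups

# First choice: what does each attribute do to the whole table?
for index, name in enumerate(["A1", "A2", "A3"]):
    print("split on", name, "->", split_outputs(EXAMPLES, index))

# Second choice: keep only the rows with A2 = 1 and try A1 or A3 next.
a2_is_1 = [row for row in EXAMPLES if row[1] == 1]
for index, name in ((0, "A1"), (2, "A3")):
    print("A2 = 1, then split on", name, "->", split_outputs(a2_is_1, index))
```

Splitting on A2 first leaves the A2 = 0 group with nothing but 0's,
and inside the A2 = 1 group a split on A1 separates the remaining
outputs cleanly, so "A2, then A1" is a good order for this data.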
The math details of the best way to do
this head into information theory, which
I was going to gloss over.
But Sam asked how the "importance"
function works, which is at the heart of it,
so here's how it works:
Discuss (briefly) the idea of information entropy,
bits per symbol. If p1, p2, p3, ... are the
probabilities of each symbol, then
H(p1, p2, p3, ...) = - sum_i( p[i] * log2(p[i]) )
For a boolean with only two probabilities (q, 1-q)
and following the book's notation, this is
B(q) = - q log2(q) - (1-q) log2(1-q)
which is how many bits of info there are.
(Discuss briefly; draw the sketch, which looks like an upside down parabola:
0 bits at q = 0 and q = 1, peaking at 1 bit at q = 1/2.)
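The two formulas above are short enough to try directly. A minimal
sketch in Python 3, using the same names H and B as the notes; the
one added assumption is the usual convention that a symbol with
probability 0 contributes 0 bits (otherwise log2(0) blows up).

```python
from math import log2

def H(probabilities):
    """Entropy in bits per symbol of a discrete distribution.

    Assumes the convention 0 * log2(0) = 0, so zero-probability
    symbols are simply skipped.
    """
    return sum(-p * log2(p) for p in probabilities if p > 0)

def B(q):
    """Entropy in bits of a boolean that is true with probability q."""
    return H([q, 1 - q])

# Tabulate a few points of the curve from the sketch.
for q in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print("B(%.2f) = %.3f bits" % (q, B(q)))
```

A fair coin (q = 0.5) carries exactly 1 bit, a sure thing (q = 0 or
q = 1) carries none, and everything else falls in between.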
Still following the book's notation,
in one "clump" of examples
p = number of positive examples
n = number of negative examples
B(p/(n+p)) = bits of info per example in that clump
When we use one variable to split the data
into a partition of clumps, the best split
gives the biggest information gain (in bits per symbol).
So the technique is to look at B
before and after the split :
importance = B(p/(n+p)) - sum_k( (pk+nk)/(p+n) * B(pk/(nk+pk)) )
where the first term is B before the split,
the sum is the weighted B after the split, and
pk = number of positive in k'th clump of the partition
nk = number of negative in k'th clump of the partition
(pk+nk)/(p+n) = weight, the fraction of all the examples that land in that clump
Apply these numbers to 18.6, and compare with intuition.
I put the solution in
this folder.
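For comparison with a hand calculation, here is a small Python 3
sketch of the importance formula applied to the 18.6 table. It is
not the solution file mentioned above, just one way to code the
formula; clump_bits and importance are names chosen for this sketch.

```python
from math import log2

# AIMA 18.6 examples, one tuple per row: (A1, A2, A3, Output)
EXAMPLES = [
    (1, 0, 0, 0),
    (1, 0, 1, 0),
    (0, 1, 0, 0),
    (1, 1, 1, 1),
    (1, 1, 0, 1),
]

def B(q):
    """Bits of information in a boolean that is true with probability q."""
    if q in (0, 1):
        return 0.0  # a sure thing carries no information
    return -q * log2(q) - (1 - q) * log2(1 - q)

def clump_bits(outputs):
    """B(p/(n+p)) for one clump, given its list of 0/1 outputs."""
    return B(sum(outputs) / len(outputs))

def importance(examples, attribute):
    """B before the split minus the weighted sum of B over the clumps after it."""
    before = clump_bits([row[-1] for row in examples])
    clumps = {}
    for row in examples:
        clumps.setdefault(row[attribute], []).append(row[-1])
    after = sum(len(c) / len(examples) * clump_bits(c) for c in clumps.values())
    return before - after

for index, name in enumerate(["A1", "A2", "A3"]):
    print("%s importance = %.3f bits" % (name, importance(EXAMPLES, index)))
```

For this table the three values should come out near 0.171, 0.420
and 0.020 bits for A1, A2 and A3, so A2 is the best first split,
which agrees with the intuitive answer.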
more computer vision
OpenCV and processing.org
The examples I tried crashed on my Mac.
I did get OpenCV + Python working:
$ sudo port install opencv +python26
$ sudo port select --set python python26 # as opposed to python26-apple
$ python
>>> import cv
>>> # works!
Then on to
This worked :
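# Python 2, using the legacy "cv" bindings installed above.
import cv
# Load the image as a single-channel grayscale matrix.
img = cv.LoadImageM("dime_building.jpeg", cv.CV_LOAD_IMAGE_GRAYSCALE)
# Scratch matrices that GoodFeaturesToTrack needs, same size as the image.
eig_image = cv.CreateMat(img.rows, img.cols, cv.CV_32FC1)
temp_image = cv.CreateMat(img.rows, img.cols, cv.CV_32FC1)
# Up to 10 corners, quality level 0.04, minimum distance 1.0, Harris detector.
for (x,y) in cv.GoodFeaturesToTrack(img, eig_image, temp_image, 10, 0.04, 1.0, useHarris = True):
    print "good feature at", x,y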
including facedetect.py : see the attached screenshot for an example.
The heart of the Python code is a call to HaarDetectObjects(),
which uses an XML description of a trained "frontal face detection"
specification: a "cascade" of "Haar-like features".
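To give a feel for what one "Haar-like feature" is: it is a
difference between the summed brightness of adjacent rectangles,
which can be computed with a few lookups into a running-sum
("integral") image. The sketch below is Python 3 with no OpenCV at
all; it is not the code inside HaarDetectObjects(), and the function
names and the tiny 4x4 patch are made up for illustration. A real
cascade thresholds and combines many such features at many positions
and scales.

```python
def integral_image(pixels):
    """Summed-area table with an extra zero row and column.

    table[y][x] holds the sum of all pixels above and to the left of (x, y).
    """
    height, width = len(pixels), len(pixels[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += pixels[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table

def rect_sum(table, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left pixel (x, y), via four lookups."""
    return table[y + h][x + w] - table[y][x + w] - table[y + h][x] + table[y][x]

def two_rect_feature(table, x, y, w, h):
    """Left half minus right half of a (2*w)-by-h window: an 'edge' feature."""
    return rect_sum(table, x, y, w, h) - rect_sum(table, x + w, y, w, h)

# Hypothetical 4x4 grayscale patch: bright on the left, dark on the right.
PATCH = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]

table = integral_image(PATCH)
print("edge feature:", two_rect_feature(table, 0, 0, 2, 4))
```

A large value means a strong vertical edge inside the window; a flat
patch would give 0.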