""" Artificial Neural Network example in python Jim Mahoney, Nov 2011, GPL --- some neural network definitions ---------------------------------- Here is one specific neural net model, using exactly the notation from AIMA 3rd textbook, section 18.7 : the supervised learning via gradient descent back-propogation neural network, with the logistic function as the activation threshold. (Yeah, there are *lots* of neural net variations.) Define the logistic function g(x) = 1/(1+exp(-x)). And write its derivative as g'(x). The network has n nodes, numbered 1 through n, and each possibly connected to any other. Each node j has an "activation" a[j], which is the value it is sending out to other nodes. Define a dummy "offset" activation a[0] = 1. The connections from node i into node j are defined by weights w[i][j] with 0 <= i <= n and 1 <= i < n, which gives a net input to node j of in[j] = sum_i { w[i][j] * a[i] } where the a[i] are the activations that the other nodes are sending to this one. Then the activation of node j is a[j] = g(in[j]) = g( sum_i { w[i][j] * a[i] } ) **** That's the "forward-propogate" formula for how information travels through the network. (I note in passing that this looks like a matrix-vector multiplication, but is using i,j rather than the tradtional j,i labeling conventions. Don't blame me; I'm just following the textbook.) The input for the whole network comes from sending information into a few specific nodes by fiat, in[j_in] = x[j_in] = inputs for some subset of j j_in Similarly the output for the whole network comes from pulling information from a few specific nodes, a[j_out] = y[j_out] = outputs for some subjset of j j_out In terms of in[j_in] and a[j_out], the whole network is a vector-valued (y's) function of the vector x's, which depends on all the n*(n+1) w[i][j] weights, which I'll notate all together as W outputs[j_out] = h_W[j_out](inputs[j_in]) A training set for supervised learning is given by values X=x[j_in] and Y=y[j_out] of desired inputs and outputs. Define the "loss" or "error" function as the squared difference between what we actually get and what we want to get with the training data loss(w, X, Y) = sum over j_out { (Y[j_out] - h_W[j_out](X[j_in]))**2 } Finally, we can "back-propogate" the training data to modify the weights in a way that lowers the loss, which is a gradient-descent search for a learned set of W's that let the network turn inputs similar to the X training inputs to something like the Y training outputs. The first thing to understand about modifying these things is what gradient descent is all about. Given any function of many variables, say f(x,y), the uphill direction in which phi increases the fastest in the direction (Vx, Vy) = (df/dx, df/dy). So moving a little the downhill direction is a change in x and y with (dx, dy) = - epsilon * (df/dx , df/dy). That's what "gradient descent" is all about. In this case, the loss function is what we're trying to minimize, and we're going to modify all the weights to get there. So the gradient descent is the update rule (which is just gradient descent) for the weights is w[j][k] = w[j][k] - epsilon * d(loss)/d(w[j][k]) and it turns into a multi-variable calculus problem with lots of subscripts with several applications of the chain rule. See section 18.7.4 in AIMA 3rd for the math details. 
 Starting at the output terms, we look at the errors that the
 training data gives us, and use those to define these quantities :

     err[k] = y[k] - h_W[k](X)                    at the output nodes k = j_out
     delta[k] = g'(in[k]) * err[k]                also at the output nodes
     delta[j] = g'(in[j]) * sum_k( w[j][k] * delta[k] )    at the other nodes
     w[j][k] = w[j][k] + alpha * a[j] * delta[k]           ****

 That gives a way to propagate the errors *backwards* from the
 outputs to the inputs, giving an "error" for all nodes, and to use
 those errors to push the weights in a direction that lessens all
 the errors.

 As a practical matter, we'll typically want to allow only some
 connections between the nodes of the network, which is to say only
 some of the weights. The rest will effectively be zero.

 --- an example network --------------------------------------

 (I) So let's do a tiny example to start, with each input connected
 to each output :

     inputs    outputs    desired output
       1 ------>  3         boolean OR
          \   /
           \ /
            X
           / \
          \   /
       2 ------>  4         boolean AND

 (II) Here's a second one to try, with a hidden layer in between :

     inputs    hidden    outputs    desired output
       1         3          6
                 4          7          xor
       2         5          8

 questions:
     Is either of these a linear network? If so, why; if not, why not?
     Is boolean OR linearly separable?
     Is boolean AND linearly separable?
     Is boolean XOR linearly separable?

"""

from math import exp
from random import uniform, choice
from doctest import testmod

def g(x):
    """ logistic function : 0 <= g(x) <= 1
                        ______
                       /
                 _____/
        see e.g. wikipedia
    """
    return 1.0/(1.0 + exp(-x))

def dgdx(x):
    """ derivative of the logistic function : g'(x) = g(x) * (1 - g(x)) """
    # Take the derivative analytically to see this.
    y = g(x)
    return y * (1.0 - y)
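# A quick numeric sanity check of dgdx, added for illustration (this
# helper isn't part of the original program) : compare the analytic
# derivative against a centered finite-difference estimate of g's slope.
def check_dgdx(x=0.7, h=1e-6):
    """ True if dgdx(x) agrees with (g(x+h) - g(x-h)) / (2h) """
    numeric = (g(x + h) - g(x - h)) / (2 * h)
    return abs(numeric - dgdx(x)) < 1e-8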
class NeuralNet:
    """ The network for case (I) above :
        inputs 1,2 each connected directly to outputs 3,4. """
    # Since g(x) never really gets to 0 or 1,
    # I'll use 0.1 for "false" and 0.9 for "true" here.
    # Node pairs that aren't connected will have weight None.
    # Nodes are numbered 1..n ; the 0th component is the offset.
    #                 in1  in2  out3=OR out4=AND
    training_data = [[None, 0.1, 0.1, 0.1, 0.1],
                     [None, 0.1, 0.9, 0.9, 0.1],
                     [None, 0.9, 0.1, 0.9, 0.1],
                     [None, 0.9, 0.9, 0.9, 0.9]]

    def __init__(self):
        self.n = n = 4        # number of nodes
        self.alpha = 0.3      # learning rate
        self.loss = None
        # a[0] is the offset activation, 1 by definition; see comments at top.
        self.a = [1.0] + [0.0]*n       # activations
        self.in_ = [None]*(n+1)        # net input for each node
        # w[i][j] is the weight for the connection from node i into node j ;
        # start with no connections, an (n+1) x (n+1) grid of None.
        self.w = [[None for j in range(n+1)] for i in range(n+1)]
        # inputs are connected to outputs only
        self.inputs = (1, 2)
        self.outputs = (3, 4)
        self.connections = ((1,3), (1,4), (2,3), (2,4))
        for (i,j) in self.connections:
            self.w[i][j] = uniform(0, 1)
        # random initial offset weights w[0][j] into the output nodes
        for j in self.outputs:
            self.w[0][j] = uniform(0, 1)

    def propagate(self, inputs):
        """ propagate inputs through the net; store a[], in_[], in & out data """
        self.input_data = inputs
        self.in_[1:3] = inputs
        for j in self.inputs:
            self.a[j] = g(self.in_[j])
        for j in self.outputs:
            # net input includes the offset term w[0][j] * a[0] ; see the top.
            self.in_[j] = self.w[0][j] * self.a[0] + \
                          self.w[1][j] * self.a[1] + \
                          self.w[2][j] * self.a[2]
            self.a[j] = g(self.in_[j])
        self.output_data = (self.a[3], self.a[4])

    def back_prop(self, training, check=False):
        """ store the loss & update the weights given one training sample """
        inputs = training[1:3]
        self.propagate(inputs)
        err = [None]*(1+self.n)
        delta = [None]*(1+self.n)
        for k in self.outputs:
            err[k] = training[k] - self.a[k]
            delta[k] = dgdx(self.in_[k]) * err[k]
        self.loss = err[3]**2 + err[4]**2
        # deltas at the input nodes, from the formula in the docstring.
        # (Not actually needed here, since no weights lead into nodes 1,2.)
        for j in self.inputs:
            delta[j] = dgdx(self.in_[j]) * \
                       (self.w[j][3]*delta[3] + self.w[j][4]*delta[4])
        # the update rule : w[i][j] += alpha * a[i] * delta[j]
        for (i,j) in self.connections:
            self.w[i][j] += self.alpha * self.a[i] * delta[j]
        for j in self.outputs:
            self.w[0][j] += self.alpha * self.a[0] * delta[j]
        if check:
            # see whether that update made the loss smaller :
            self.propagate(inputs)
            err = [None]*(1+self.n)
            for k in self.outputs:
                err[k] = training[k] - self.a[k]
            loss = err[3]**2 + err[4]**2
            print " loss1=%5.3f, loss2=%5.3f, change=%5.3f " \
                % (self.loss, loss, loss - self.loss)

    def run_training(self, n_iterations=1e5, verbose=True):
        n_iterations = int(n_iterations)
        count = 0
        while count < n_iterations:
            count += 1
            self.back_prop(choice(NeuralNet.training_data))
            if verbose and count % 1000 == 0:
                print " loss: %f %s" % (self.loss, str(self))

    def test(self, n_iterations=1e3):
        """ print how often the trained net gets the right answers,
            counting an output on the correct side of 0.5 as right """
        n_iterations = int(n_iterations)
        count = 0
        correct = 0
        while count < n_iterations:
            count += 1
            training = choice(NeuralNet.training_data)
            self.propagate(training[1:3])
            if (training[3] > 0.5, training[4] > 0.5) == \
               (self.output_data[0] > 0.5, self.output_data[1] > 0.5):
                correct += 1
        print "%i of %i correct" % (correct, count)

    def __str__(self):
        # a summary of the four connection weights
        return "<NeuralNet w13=%5.3f w14=%5.3f w23=%5.3f w24=%5.3f>" % \
               (self.w[1][3], self.w[1][4], self.w[2][3], self.w[2][4])

def main():
    print "-- neural net example --"
    nn = NeuralNet()
    print nn
    nn.run_training()
    nn.test()
    print nn

if __name__ == "__main__":
    testmod()
    main()
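# --- a sketch toward case (II) ----------------------------------------
# A hypothetical helper, added for illustration and not used by the code
# above : one way to declare the topology of the case (II) network from
# the docstring, using the same (i,j) connection convention as
# NeuralNet.__init__ . It assumes full connectivity between adjacent
# layers, which the schematic diagram only hints at.
def layer_connections(inputs=(1,2), hidden=(3,4,5), outputs=(6,7,8)):
    """ connection pairs (i,j) for an input -> hidden -> output layered net """
    return tuple([(i, j) for i in inputs for j in hidden] +
                 [(j, k) for j in hidden for k in outputs])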