""" Artificial Neural Network example in python Jim Mahoney, Nov 2011, GPL --- some neural network definitions ---------------------------------- Here is one specific neural net model, using exactly the notation from AIMA 3rd textbook, section 18.7 : the supervised learning via gradient descent back-propogation neural network, with the logistic function as the activation threshold. (Yeah, there are *lots* of neural net variations.) Define the logistic function g(x) = 1/(1+exp(-x)). And write its derivative as g'(x). The network has n nodes, numbered 1 through n, and each possibly connected to any other. Each node j has an "activation" a[j], which is the value it is sending out to other nodes. Define a dummy "offset" activation a[0] = 1. The connections from node i into node j are defined by weights w[i][j] with 0 <= i <= n and 1 <= i < n, which gives a net input to node j of in[j] = sum_i { w[i][j] * a[i] } where the a[i] are the activations that the other nodes are sending to this one. Then the activation of node j is a[j] = g(in[j]) = g( sum_i { w[i][j] * a[i] } ) **** That's the "forward-propogate" formula for how information travels through the network. (I note in passing that this looks like a matrix-vector multiplication, but is using i,j rather than the tradtional j,i labeling conventions. Don't blame me; I'm just following the textbook.) The input for the whole network comes from sending information into a few specific nodes by fiat, in[j_in] = x[j_in] = inputs for some subset of j j_in Similarly the output for the whole network comes from pulling information from a few specific nodes, a[j_out] = y[j_out] = outputs for some subjset of j j_out In terms of in[j_in] and a[j_out], the whole network is a vector-valued (y's) function of the vector x's, which depends on all the n*(n+1) w[i][j] weights, which I'll notate all together as W outputs[j_out] = h_W[j_out](inputs[j_in]) A training set for supervised learning is given by values X=x[j_in] and Y=y[j_out] of desired inputs and outputs. Define the "loss" or "error" function as the squared difference between what we actually get and what we want to get with the training data loss(w, X, Y) = sum over j_out { (Y[j_out] - h_W[j_out](X[j_in]))**2 } Finally, we can "back-propogate" the training data to modify the weights in a way that lowers the loss, which is a gradient-descent search for a learned set of W's that let the network turn inputs similar to the X training inputs to something like the Y training outputs. The first thing to understand about modifying these things is what gradient descent is all about. Given any function of many variables, say f(x,y), the uphill direction in which phi increases the fastest in the direction (Vx, Vy) = (df/dx, df/dy). So moving a little the downhill direction is a change in x and y with (dx, dy) = - epsilon * (df/dx , df/dy). That's what "gradient descent" is all about. In this case, the loss function is what we're trying to minimize, and we're going to modify all the weights to get there. So the gradient descent is the update rule (which is just gradient descent) for the weights is w[j][k] = w[j][k] - epsilon * d(loss)/d(w[j][k]) and it turns into a multi-variable calculus problem with lots of subscripts with several applications of the chain rule. See section 18.7.4 in AIMA 3rd for the math details. 
 Starting at the output terms, we look at the errors that the
 training data gives us, and use those to define these quantities :

     err[k] = y[k] - h_W[k](X)                    at the output nodes k = j_out
     delta[k] = g'(in[k]) * err[k]                also at the output nodes
     delta[j] = g'(in[j]) * sum_k( w[j][k] * delta[k] )    at the other nodes
     w[j][k] = w[j][k] + alpha * a[j] * delta[k]           ****

 That gives a way to propagate the errors *backwards* from the
 outputs to the inputs, giving an "error" for all nodes, and to use
 those errors to push the weights in a direction that lessens all
 the errors.

 As a practical matter, we'll typically want to allow only some
 connections between the nodes of the network, which is to say only
 some of the weights. The rest will effectively be zero.

 --- an example network --------------------------------------

 (I) So let's do a tiny example to start, with each input connected
 to each output :

     inputs    outputs    desired output
       1 ------>  3         boolean OR
          \   /
           \ /
            X
           / \
          \   /
       2 ------>  4         boolean AND

 (II) Here's a second one to try, with a hidden layer in between :

     inputs    hidden    outputs    desired output
       1         3          6
                 4          7          xor
       2         5          8

 questions:
     Is either of these a linear network? If so, why; if not, why not?
     Is boolean OR linearly separable?
     Is boolean AND linearly separable?
     Is boolean XOR linearly separable?

"""

from math import exp
from random import uniform, choice
from doctest import testmod

def g(x):
    """ logistic function : 0 <= g(x) <= 1
                        ______
                       /
                 _____/
        see e.g. wikipedia
    """
    return 1.0/(1.0 + exp(-x))

def dgdx(x):
    """ derivative of the logistic function : g'(x) = g(x) * (1 - g(x)) """
    # Take the derivative analytically to see this.
    y = g(x)
    return y * (1.0 - y)
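# A quick numeric sanity check of dgdx, added for illustration (this
# helper isn't part of the original program) : compare the analytic
# derivative against a centered finite-difference estimate of g's slope.
def check_dgdx(x=0.7, h=1e-6):
    """ True if dgdx(x) agrees with (g(x+h) - g(x-h)) / (2h) """
    numeric = (g(x + h) - g(x - h)) / (2 * h)
    return abs(numeric - dgdx(x)) < 1e-8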
class NeuralNet:
    """ The network for case (I) above :
        inputs 1,2 each connected directly to outputs 3,4. """
    # Since g(x) never really gets to 0 or 1,
    # I'll use 0.1 for "false" and 0.9 for "true" here.
    # Node pairs that aren't connected will have weight None.
    # Nodes are numbered 1..n ; the 0th component is the offset.
    #                 in1  in2  out3=OR out4=AND
    training_data = [[None, 0.1, 0.1, 0.1, 0.1],
                     [None, 0.1, 0.9, 0.9, 0.1],
                     [None, 0.9, 0.1, 0.9, 0.1],
                     [None, 0.9, 0.9, 0.9, 0.9]]

    def __init__(self):
        self.n = n = 4        # number of nodes
        self.alpha = 0.3      # learning rate
        self.loss = None
        # a[0] is the offset activation, 1 by definition; see comments at top.
        self.a = [1.0] + [0.0]*n       # activations
        self.in_ = [None]*(n+1)        # net input for each node
        # w[i][j] is the weight for the connection from node i into node j ;
        # start with no connections, an (n+1) x (n+1) grid of None.
        self.w = [[None for j in range(n+1)] for i in range(n+1)]
        # inputs are connected to outputs only
        self.inputs = (1, 2)
        self.outputs = (3, 4)
        self.connections = ((1,3), (1,4), (2,3), (2,4))
        for (i,j) in self.connections:
            self.w[i][j] = uniform(0, 1)
        # random initial offset weights w[0][j] into the output nodes
        for j in self.outputs:
            self.w[0][j] = uniform(0, 1)

    def propagate(self, inputs):
        """ propagate inputs through the net; store a[], in_[], in & out data """
        self.input_data = inputs
        self.in_[1:3] = inputs
        for j in self.inputs:
            self.a[j] = g(self.in_[j])
        for j in self.outputs:
            # net input includes the offset term w[0][j] * a[0] ; see the top.
            self.in_[j] = self.w[0][j] * self.a[0] + \
                          self.w[1][j] * self.a[1] + \
                          self.w[2][j] * self.a[2]
            self.a[j] = g(self.in_[j])
        self.output_data = (self.a[3], self.a[4])

    def back_prop(self, training, check=False):
        """ store the loss & update the weights given one training sample """
        inputs = training[1:3]
        self.propagate(inputs)
        err = [None]*(1+self.n)
        delta = [None]*(1+self.n)
        for k in self.outputs:
            err[k] = training[k] - self.a[k]
            delta[k] = dgdx(self.in_[k]) * err[k]
        self.loss = err[3]**2 + err[4]**2
        # deltas at the input nodes, from the formula in the docstring.
        # (Not actually needed here, since no weights lead into nodes 1,2.)
        for j in self.inputs:
            delta[j] = dgdx(self.in_[j]) * \
                       (self.w[j][3]*delta[3] + self.w[j][4]*delta[4])
        # the update rule : w[i][j] += alpha * a[i] * delta[j]
        for (i,j) in self.connections:
            self.w[i][j] += self.alpha * self.a[i] * delta[j]
        for j in self.outputs:
            self.w[0][j] += self.alpha * self.a[0] * delta[j]
        if check:
            # see whether that update made the loss smaller :
            self.propagate(inputs)
            err = [None]*(1+self.n)
            for k in self.outputs:
                err[k] = training[k] - self.a[k]
            loss = err[3]**2 + err[4]**2
            print " loss1=%5.3f, loss2=%5.3f, change=%5.3f " \
                % (self.loss, loss, loss - self.loss)

    def run_training(self, n_iterations=1e5, verbose=True):
        n_iterations = int(n_iterations)
        count = 0
        while count < n_iterations:
            count += 1
            self.back_prop(choice(NeuralNet.training_data))
            if verbose and count % 1000 == 0:
                print " loss: %f %s" % (self.loss, str(self))

    def test(self, n_iterations=1e3):
        """ print how often the trained net gets the right answers,
            counting an output on the correct side of 0.5 as right """
        n_iterations = int(n_iterations)
        count = 0
        correct = 0
        while count < n_iterations:
            count += 1
            training = choice(NeuralNet.training_data)
            self.propagate(training[1:3])
            if (training[3] > 0.5, training[4] > 0.5) == \
               (self.output_data[0] > 0.5, self.output_data[1] > 0.5):
                correct += 1
        print "%i of %i correct" % (correct, count)

    def __str__(self):
        # a summary of the four connection weights
        return "<NeuralNet w13=%5.3f w14=%5.3f w23=%5.3f w24=%5.3f>" % \
               (self.w[1][3], self.w[1][4], self.w[2][3], self.w[2][4])

def main():
    print "-- neural net example --"
    nn = NeuralNet()
    print nn
    nn.run_training()
    nn.test()
    print nn

if __name__ == "__main__":
    testmod()
    main()
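# --- a sketch toward case (II) ----------------------------------------
# A hypothetical helper, added for illustration and not used by the code
# above : one way to declare the topology of the case (II) network from
# the docstring, using the same (i,j) connection convention as
# NeuralNet.__init__ . It assumes full connectivity between adjacent
# layers, which the schematic diagram only hints at.
def layer_connections(inputs=(1,2), hidden=(3,4,5), outputs=(6,7,8)):
    """ connection pairs (i,j) for an input -> hidden -> output layered net """
    return tuple([(i, j) for i in inputs for j in hidden] +
                 [(j, k) for j in hidden for k in outputs])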