Basics 
A neural network is a mathematical function used to solve problems that are almost impossible to solve with conventional programming.
A neural network consists of neurons and weights. Each neuron applies an activation function and holds the output of that function, called the activation value. Weights are values that connect one neuron to another. Neurons are grouped into layers, and each layer is connected to the next layer by weights. There are three kinds of layers: the input layer, hidden layers and the output layer. The kind of network discussed here contains at least one hidden layer, and more than one hidden layer is possible. The input layer accepts the input, and its values are used to calculate the activation values of the next (hidden) layer; that layer repeats the same process and passes its result to the layer after it. This goes on until the output layer is reached.
Each layer calculates Z values from its weights and activations. The activations of the next layer are calculated from those Z values. Once we have the activations of the output layer, we have the end result.
Each layer except the output layer contains a bias neuron. A bias neuron always stays at value 1.
The propagation of activation values from the input layer to the output layer is called forward propagation.
(A superscript denotes the layer index and a subscript denotes the neuron index. Weights have a two-digit subscript: the first digit is the index of the neuron in the next layer that the weight connects to, and the second digit is the index of the neuron in the current layer that the weight connects from.)
 
Figure 1 
Forward propagation
We have two neurons in the input layer, three neurons in the hidden layer and two in the output layer. Each layer except the output layer has one bias neuron.
Suppose we have already set all weights to proper values; then, when we pass an input, we will get the correct output. The input and activation values should be between 0 and 1 inclusive.
If we look closely, we can see that every neuron in a layer is connected to each neuron of the next layer by a weight. Only the bias neuron of the next layer is unaffected by the previous layer's values, because we want it to stay at 1. We can find the Z value of the first neuron of the next layer as below.
 
Equation 1 
We know that the activation of the bias neuron is always 1, and using this we can find the other Z values as below.
Equations 2 
The Z values can be large positive or negative numbers, but we need values between 0 and 1 as activations for the next layer. So we want a function that returns a value between 0 and 0.5 when we pass a negative value, and a value between 0.5 and 1 when we pass a positive value. This kind of function is called an activation function.
Activation function
There are many types of activation functions, but here we will use the sigmoid function.
 
 Equation 3 
The activation function accepts a Z value of any size and returns a smaller activation value. In other words, a positive value is mapped to a value between 0.5 and 1, a negative value is mapped to a value between 0 and 0.5, and 0 is mapped to exactly 0.5.
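As a minimal sketch of this behaviour, assuming Equation 3 is the standard sigmoid 1 / (1 + e^(-z)), the function can be written in plain Python. The name sigmoid and the two-branch form (which avoids overflow for very negative z) are choices made for this illustration, not something taken from the original equations.

```python
import math

def sigmoid(z):
    """Map any real z to a value between 0 and 1 (assumed form: 1 / (1 + e^-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z) so exp() cannot overflow.
    e = math.exp(z)
    return e / (1.0 + e)

print(sigmoid(0.0))    # 0.5
print(sigmoid(6.0))    # a value between 0.5 and 1
print(sigmoid(-6.0))   # a value between 0 and 0.5
```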
Now we can get all activation values of the hidden layer from its Z values.
Equations 4 
We can write the same equations in matrix form as below.
and
Equations 5 and 6
In the same way we can find the activation values of the output layer as below.
Equation 7
 Equation 8 
We now have output values based on the input we passed.
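The forward pass for the 2-3-2 network described above can be sketched as follows. This is only an illustration under stated assumptions: the weight values are made up, the helper name forward_layer is hypothetical, and the bias neuron is assumed to sit at index 0 of each layer, so column 0 of every weight row is the bias weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_layer(weights, activations):
    """Return (Z values, activations) of the next layer.

    weights[j][k] connects current-layer neuron k to next-layer neuron j;
    k = 0 is assumed to be the bias neuron, whose activation is always 1.
    """
    with_bias = [1.0] + activations
    z = [sum(w * a for w, a in zip(row, with_bias)) for row in weights]
    return z, [sigmoid(v) for v in z]

# Made-up weights: 3 hidden neurons x (bias + 2 inputs), 2 outputs x (bias + 3 hidden).
W1 = [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5], [0.6, -0.1, 0.3]]
W2 = [[0.2, -0.4, 0.1, 0.3], [-0.5, 0.3, 0.2, -0.1]]

a1 = [0.5, 0.9]                 # input activations
z2, a2 = forward_layer(W1, a1)  # hidden layer
z3, a3 = forward_layer(W2, a2)  # output layer
print(a3)
```

Each row of a weight list plays the role of one row of the weight matrix in the matrix form, so one call to forward_layer corresponds to one matrix-vector product followed by the activation function.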
So far so good, but we have assumed that we already have proper weights to process the input and get the correct output. We can call the neural network trained once it has proper weight values. The real work of machine learning is training the model. So we need to somehow train the neural network and come up with proper weights. To find proper weights we are going to use "backward propagation".
Backward propagation
The process starts from the output layer and works backward to compute how each weight should change. We need to iterate the process multiple times to arrive at proper weights.
 
 Figure 2
 
 Figure 3
To train the network, we first need to initialize all weights with random values. We then pass an input to the network. We know we will not get the correct output because of the wrong (random) weight values. But at least we will have some output values that we can compare with the correct output.
Suppose we get the values a31 and a32 shown in figures 2 and 3. We can then measure their deviation from the correct values. We call this deviation the cost. Once we have the cost, we can figure out how to minimize it and get closer to the correct values.
Cost function
Equation 9 
We can write the cost function for each output value as below.
Equations 10 
Our goal is to minimize the cost. As an output value depends on the previous layer's activations and weights, it is not under our direct control. We need to change the previous layer's activations and weights to change the output accordingly. The previous layer's values in turn depend on the layer before it (the input layer in our example). So we need to start from the output layer and work backward toward the input layer, finding how the cost depends on each neuron along the way. This is why the process is called backward propagation.
We need to find ∂C/∂a for each output neuron in order to reduce the cost. ∂C/∂a is the slope of the curve of C as a function of a. The cost is lowest where that slope is 0, and the sign and size of the slope tell us in which direction, and how strongly, a has to change to reduce C.
 
Equation 11 
That gives us the values below for each output neuron.
Equations 12 
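The exact cost function of Equation 9 is not recoverable from the text, so the sketch below assumes a squared-error cost C = (a - y)^2 for each output neuron, where y is the correct value. Under that assumption the slope is ∂C/∂a = 2(a - y). The function names are hypothetical, and a finite-difference check is included to show that the formula really is the slope of the cost curve.

```python
def cost(a, y):
    """Assumed cost of one output neuron: squared deviation from the correct value y."""
    return (a - y) ** 2

def dcost_da(a, y):
    """Slope of the assumed cost with respect to the activation a."""
    return 2.0 * (a - y)

a, y = 0.8, 1.0
h = 1e-6
numeric_slope = (cost(a + h, y) - cost(a - h, y)) / (2 * h)
print(dcost_da(a, y), numeric_slope)  # both are negative: increasing a reduces the cost
print(dcost_da(y, y))                 # slope is 0 where the cost is lowest
```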
We now have the derivative of the cost with respect to each output value, and we need to find out how we can use it to reduce the cost. We need to find the cost derivatives for the previous layer too, since the output layer depends on it.
Each of the output layer's neurons a31 and a32 depends on four activations and four weights from the previous layer, as shown in figures 2 and 3. So, to change the output layer's values, we need to find out how much change is needed in all these values of the previous layer.
We need to find ∂C/∂w and ∂C/∂a for the weights and activations of the hidden layer, respectively.
If we look closely, we can see that the weights and activations of the hidden layer affect the Z values of the output layer. Each Z value affects the activation of its output neuron, and the activations of the output layer affect C. In other words, C depends on the activations of the output layer, those activations depend on the Z values of the output layer, and the Z values depend on the weights and activations of the hidden layer. So, to find out how much the cost changes when a weight of the hidden layer changes (∂C/∂w), we need to find how much Z of the next layer changes with the weight, how much the activation of the next layer changes with Z, and how much the cost (C) changes with the activation. The chain rule lets us express the derivative between distantly related values as a product of derivatives between closely related values.
Equations 13
Now we need to evaluate each expression and combine them into the final value.
From equation 7 we can derive the following.
We know that
So, to derive ∂a/∂z, we proceed as follows.
If we use the chain rule here,
and the rule below,
we come up with
If we use the chain rule again
 
and using the rule below
we can get this
So, we have
So,
Equation 14
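Assuming the activation function is the standard sigmoid, the derivation above ends in the well-known closed form ∂a/∂z = a(1 - a). The short check below compares that closed form with a numerical slope; the function names and the test points are chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Closed-form derivative of the sigmoid: a * (1 - a), with a = sigmoid(z)."""
    a = sigmoid(z)
    return a * (1.0 - a)

h = 1e-6
for z in (-2.0, 0.0, 3.0):
    # Central difference approximates the slope of the sigmoid at z.
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    print(z, sigmoid_prime(z), numeric)
```

The practical benefit of this form is that the derivative is computed from the activation value that the forward pass already produced.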
Now we have the values of all three derivatives, so we can combine them and write the equations as below.
Equations 15
We can write these equations in matrix form as below.
Equation 16
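A possible sketch of this gradient matrix in code is shown below. It multiplies the three chain-rule factors for every weight between the hidden and output layers. It assumes the squared-error cost C = (a - y)^2, the sigmoid activation, and a bias neuron at index 0; the function name and the sample numbers are hypothetical.

```python
def output_weight_gradients(a_hidden, a_out, targets):
    """Return (gradient matrix, delta) for the weights feeding the output layer.

    gradient[j][k] = dC/da_j * da_j/dz_j * dz_j/dw_jk
                   = 2 * (a_j - y_j) * a_j * (1 - a_j) * a_hidden_k
    Column k = 0 belongs to the bias neuron, whose activation is 1.
    """
    with_bias = [1.0] + a_hidden
    # delta[j] collects the two factors that do not depend on the weight index k.
    delta = [2.0 * (a - y) * a * (1.0 - a) for a, y in zip(a_out, targets)]
    gradient = [[d * a_k for a_k in with_bias] for d in delta]
    return gradient, delta

grad, delta = output_weight_gradients([0.5, 0.5, 0.5], [0.5, 0.5], [1.0, 0.0])
for row in grad:
    print(row)
```

The result has one row per output neuron and one column per hidden-layer neuron (including the bias), which is the same shape as the weight matrix it will later be subtracted from.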
We have the cost derivatives with respect to the weights of the hidden layer; now we need to find the cost derivatives with respect to the activations of the hidden layer.
As discussed previously, an activation of the hidden layer affects the Z values of the output layer, each Z value affects the activation of its output neuron, and the activations affect the cost (C). But there is one difference. Each weight contributes to only one Z value, because there is a single weight from each neuron to each neuron of the next layer. One activation of the hidden layer, however, simultaneously affects every neuron of the next layer. So the change in average cost with respect to a hidden activation depends on the Z values of all output neurons and, through them, on all of the output layer's activations. Please check figures 2 and 3 for more clarity. So, to find the change in cost caused by a change in a hidden layer activation, we need to consider all neurons of the output layer, because they all take part in suggesting new values for the hidden layer's activations.
So, ∂C/∂a can be described as
 
Equations 17
(Here, * indicates average cost.)
We need to simplify each expression the same way as we did previously.
Equation 18 
We already know the other two expressions, so if we combine them all we get
Equations 19 
We can write the above equations in matrix form
Equation 20 
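The following sketch shows one way to compute this vector. For each hidden neuron it adds up the contributions that flow through every output neuron. It assumes that delta[j] already holds ∂C/∂a multiplied by ∂a/∂z for output neuron j, and that column 0 of the weight matrix is the bias column. It sums the contributions; if the original equations average them instead, the result only differs by a constant factor. All names and numbers are hypothetical.

```python
def hidden_activation_gradients(W2, delta):
    """Return dC/da for each non-bias hidden neuron.

    W2[j][k] connects hidden neuron k to output neuron j (k = 0 is the bias).
    delta[j] is dC/da_j * da_j/dz_j for output neuron j.
    dz_j/da_k is simply the weight W2[j][k], so each hidden neuron collects
    delta[j] * W2[j][k] from every output neuron j.
    """
    n_hidden = len(W2[0]) - 1  # the bias has no activation to adjust
    return [
        sum(delta[j] * W2[j][k] for j in range(len(W2)))
        for k in range(1, n_hidden + 1)
    ]

W2 = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
print(hidden_activation_gradients(W2, [1.0, 0.0]))   # only output neuron 1 contributes
print(hidden_activation_gradients(W2, [1.0, -1.0]))  # both output neurons contribute
```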
We now have both types of cost derivatives: those with respect to weights and those with respect to activations. One thing to notice here is that we are using partial derivatives, not total derivatives. Since the output layer's values depend on two independent sets of values from the previous layer (weights and activations), we have to use partial derivatives, keeping one variable constant and treating the other as the only variable. This is the main reason why we need to repeat the entire backward propagation multiple times. We assume one value is correct (constant), so that we do not need to change it while we change the other value to reduce the cost. The constant value is not actually correct, so the cost we reach is still far from the actual minimum. First we kept the activations constant and differentiated C with respect to the weights; then we kept the weights constant and differentiated C with respect to the activations.
Through this process we found the gradient matrix shown in equation 16 and the ∂C/∂a vector shown in equation 20. We will find the same kind of gradient matrix for the input layer too, and the ∂C/∂a vector will help us find it. We will subtract the gradient matrices from the weight matrices that we initialized with random values. That brings us one step closer to proper weights. We then use the new weights to compute the output again with forward propagation, and repeat the whole backward propagation to improve the weights further. We repeat this many times until we get the desired weights.
Let's do the same process for the input layer now, but without going into as much detail as we did for the hidden layer.
 
Figure 4
Figure 5
Figure 6
(The * is omitted from here on, as every C would carry it.)
Equations 21 
We already know the values of ∂C/∂a for the hidden layer from equations 19, so we are not going to expand them here; otherwise our equations would become huge.
We evaluate the other two expressions and come up with this
Equations 22 
We can write the above equations in matrix form as
Equation 23 
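The gradient matrix for the weights between the input and hidden layers can be sketched the same way as for the output layer. The only difference is that ∂C/∂a for the hidden neurons is passed in (it comes from the previous step) instead of being computed from the cost directly. The sigmoid activation and a bias neuron at index 0 are assumed, and the names and numbers are hypothetical.

```python
def input_weight_gradients(a_input, a_hidden, dC_da_hidden):
    """Return the gradient matrix for the weights feeding the hidden layer.

    gradient[j][k] = dC/da_j * a_j * (1 - a_j) * a_input_k
    where j runs over hidden neurons and k over input neurons (k = 0 is the bias).
    """
    with_bias = [1.0] + a_input
    delta = [g * a * (1.0 - a) for g, a in zip(dC_da_hidden, a_hidden)]
    return [[d * a_k for a_k in with_bias] for d in delta]

grad1 = input_weight_gradients([0.5, 1.0], [0.5, 0.5, 0.5], [1.0, 0.0, -2.0])
for row in grad1:
    print(row)
```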
Now we have two gradient matrices, one for the input layer and one for the hidden layer, as below.
We need to subtract the gradient matrices from the initial random weight matrices to get the new weight matrices, which we will use for the next iteration.
Equation 24
and
 
Equation 25 
Now our initial values are no longer random; they are a little closer to the proper weight matrices. As we iterate this process over and over, we approach the proper weight matrices, and we can then say that our neural network is trained and ready to take input and produce reliable output.
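Putting the pieces together, the whole procedure can be sketched as a small training loop for the 2-3-2 network. This is an illustration under several assumptions rather than a definitive implementation: a squared-error cost C = (a - y)^2, sigmoid activations, a bias neuron at index 0, a single made-up training example, and gradients subtracted from the weights directly as described above (in practice the gradient is often scaled by a small factor before subtracting). All names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, a):
    """Activations of the next layer; the bias activation 1 is prepended to a."""
    padded = [1.0] + a
    return [sigmoid(sum(w * x for w, x in zip(row, padded))) for row in W]

def train_step(W1, W2, x, y):
    """One forward and one backward pass; updates W1 and W2 in place, returns the cost."""
    a2 = forward(W1, x)
    a3 = forward(W2, a2)
    cost = sum((a - t) ** 2 for a, t in zip(a3, y))

    # Output layer: dC/da * da/dz for each output neuron.
    delta3 = [2.0 * (a - t) * a * (1.0 - a) for a, t in zip(a3, y)]
    # dC/da of each hidden neuron, collected from all output neurons
    # (k + 1 skips the bias column of W2). Uses W2 before it is updated.
    dC_da2 = [sum(delta3[j] * W2[j][k + 1] for j in range(len(W2)))
              for k in range(len(a2))]
    delta2 = [g * a * (1.0 - a) for g, a in zip(dC_da2, a2)]

    # Subtract each gradient matrix from its weight matrix.
    for W, delta, prev in ((W2, delta3, [1.0] + a2), (W1, delta2, [1.0] + x)):
        for j, d in enumerate(delta):
            for k, a_prev in enumerate(prev):
                W[j][k] -= d * a_prev
    return cost

random.seed(0)  # fixed seed so the random initial weights are reproducible
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
W2 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
x, y = [0.2, 0.9], [1.0, 0.0]  # made-up input and correct output

costs = [train_step(W1, W2, x, y) for _ in range(500)]
print(costs[0], costs[-1])  # cost before the first update and before the last one
```

A real training run would loop over many input and output pairs instead of one, but the structure of each iteration stays the same.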