
A deep understanding of neural networks

Basics

A neural network is a mathematical function used to solve problems that are almost impossible to solve with conventional programming.

A neural network consists of neurons and weights. A neuron applies an activation function and holds the output of that function, called its activation value. Weights are simply values that connect one neuron to another. Neurons are grouped into layers, and each layer is connected to the next by weights. There are three kinds of layers: the input layer, hidden layers, and the output layer. Any neural network contains at least one hidden layer, though there may be more than one. The input layer accepts the input and calculates activation values for the next (hidden) layer; that layer does the same and passes its result to the layer after it. This goes on until the output layer is reached.

Each layer calculates Z values from its incoming weights and activations. From these Z values, the activations of the next layer are calculated. Once we have the activations of the output layer, we have the end result.

Each layer except the output layer contains a bias neuron. The bias neuron always stays at the value 1.

The propagation of activation values from the input layer to the output layer is called forward propagation.

(A superscript denotes the layer index and a subscript denotes the neuron index. For weights, the subscript has two digits: the first digit is the index of the neuron in the next layer that the weight connects to, and the second digit is the index of the neuron in the current layer that it connects from.)
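For example, under this notation a_1^2 would denote the activation of neuron 1 in layer 2 (the hidden layer), and w_{12}^1 the weight from neuron 2 of layer 1 to neuron 1 of layer 2. (The exact symbols appear only in the figures, so this rendering, including using the source layer's index as the weight's superscript, is an assumption based on the description above.)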

 

Figure 1: the example network, with two input neurons, three hidden neurons, two output neurons, and a bias neuron in the input and hidden layers.

Forward propagation

We have two neurons in the input layer, three neurons in the hidden layer, and two in the output layer. Each layer except the output layer has one bias neuron.

Suppose we have already set all weights to proper values; then, if we pass an input, we will get the correct output. The input and activation values should be between 0 and 1, inclusive.

If we look closely, we can see that every neuron in a layer is connected to each neuron of the next layer through weights. Only the bias neuron of the next layer stays unaffected by the previous layer's values, because we want it to stay at 1. We can find the Z value of the neuron at index 1 of the next layer as below.

 

Equation 1
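Following the notation above, Equation 1 is presumably of the form below for the first hidden neuron (writing the bias neuron as neuron 0 is an assumption; the original figure may index it differently):

z_1^2 = w_{10}^1 a_0^1 + w_{11}^1 a_1^1 + w_{12}^1 a_2^1, where a_0^1 = 1 is the bias neuron.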

We know that the activation of the bias neuron is always 1, and using this we can find the other Z values as below.

Equations 2

We will get large positive or negative numbers as Z values, but we need values between 0 and 1 as activations for the next layer. We need a function that gives us a value between 0 and 0.5 for inputs between a large negative number and 0, and a value between 0.5 and 1 for inputs between 0 and a large positive number. This kind of function is called an activation function.

Activation function

There are many types of activation functions, but here we will use the sigmoid function as the activation function.

Equation 3
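For reference, the sigmoid function that Equation 3 presumably shows is

\sigma(z) = \frac{1}{1 + e^{-z}}

so the activation of a neuron is a = \sigma(z).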

The activation function accepts a Z value of any magnitude and returns a small activation value. In other words, it maps positive values to values between 0.5 and 1, and negative values to values between 0 and 0.5.

Now we can compute all the activation values of the next (hidden) layer from the Z values.

Equations 4

We can write the same equations in matrix form as below.

and

Equations 5 and 6
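Equations 5 and 6 presumably express the same computation in matrix form, roughly as below (the matrix and vector names here are assumptions):

Z^2 = W^1 A^1 and A^2 = \sigma(Z^2)

where W^1 is the matrix of weights between the input layer and the hidden layer, A^1 is the vector of input-layer activations including the bias, and \sigma is applied element-wise.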

In the same way, we can find the activation values for the output layer, as below.

Equation 7

Equation 8

We now have output values based on the input we passed.
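To make the forward pass concrete, here is a minimal sketch in Python with NumPy for the 2-3-2 network of Figure 1. It is an illustration only: the bias is handled by appending a constant 1 to each activation vector, and the variable names and array shapes are assumptions rather than anything taken from the figures.

import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    # x:  input vector of length 2 (values between 0 and 1)
    # W1: weights from input layer to hidden layer, shape (3, 3) = (3 hidden, 2 inputs + bias)
    # W2: weights from hidden layer to output layer, shape (2, 4) = (2 outputs, 3 hidden + bias)
    a1 = np.append(x, 1.0)            # input activations plus bias neuron (always 1)
    z2 = W1 @ a1                      # Z values of the hidden layer
    a2 = np.append(sigmoid(z2), 1.0)  # hidden activations plus bias neuron
    z3 = W2 @ a2                      # Z values of the output layer
    a3 = sigmoid(z3)                  # output activations
    return a1, a2, a3

# Example: an untrained network with random weights.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 3))
W2 = rng.standard_normal((2, 4))
print(forward(np.array([0.3, 0.7]), W1, W2)[2])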

So far so good, but we have assumed that we already have proper weights to process the input and get the correct output. We can call the neural network trained once we have proper weight values. The real part of machine learning is training the model. So we need to somehow train the neural network and come up with proper weights. To find the proper weights we are going to use "backward propagation".

Backward propagation

The process starts from the output layer and works out how the weights should change for each input. We need to iterate the process multiple times to arrive at proper weights.

Figure 2

Figure 3

To train the network, we first initialize all weights with random values. We then pass an input to the network. We know we will not get the correct output because of the wrong (random) weight values, but at least we will have some output values which we can compare with the correct output.

Suppose we get the values a31 and a32 shown in figures 2 and 3. We can then measure the deviation from the correct values; we call this the cost. Once we have the cost, we can figure out how to minimize it and get closer to the correct values.

Cost function


Equation 9

We can describe the cost function for each output value as below.

Equations 10
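Equations 9 and 10 are presumably a squared-error style cost; a form consistent with the derivatives used later would be (this exact form, and the absence of a 1/2 factor, are assumptions):

C_1 = (a_1^3 - y_1)^2 and C_2 = (a_2^3 - y_2)^2

where y_1 and y_2 are the correct (target) output values.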

Our goal is to minimize the cost. As the output value relies on the previous layer's activations and weights, it is not in our direct control. We need to change the previous layer's activations and weights to change the output accordingly. The previous layer's values in turn depend on the layer before it (the input layer in our example). So we need to start from the output layer and work out how to reduce the cost for each neuron, going backward all the way to the input layer. This is why the process is called backward propagation.

We need to find ∂C/∂a for each output neuron in order to minimize the cost. ∂C/∂a is the slope of the curve of the function a -> C at a particular activation; where this slope is 0, C sits at its lowest value on the curve, and the sign and size of the slope tell us in which direction and by how much to change a to move toward that minimum.

Equation 11

That way, we have the values below for each output neuron.

Equations 12
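Assuming the squared-error cost sketched above, Equations 11 and 12 would give derivatives of the form

\frac{\partial C_1}{\partial a_1^3} = 2 (a_1^3 - y_1), \qquad \frac{\partial C_2}{\partial a_2^3} = 2 (a_2^3 - y_2)

(the constant factor depends on the exact cost function used in the original equations).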

We now know, for each output value, how the cost changes with that value, and we need to find out how to reduce the cost further. We need to do the same for the previous layer too, since the output layer depends on it.

The output layer's neurons a31 and a32 each depend on four activations and four weights from the previous layer, as shown in figures 2 and 3. So to achieve the desired change at the output layer, we need to find out how much change is needed in each of these values of the previous layer.

We need to find ∂C/∂w and ∂C/∂a for the weights and activations, respectively, of the hidden layer.

If we look closely, we can see that the weights and activations of the previous layer affect the Z values of the next layer. The Z values affect the activations of the output layer, and the activations of the output layer affect C. In other words, C depends on the activations of the output layer, those activations depend on the Z values, and the Z values depend on the weights and activations of the previous layer. So to find how much the cost changes with a change in a hidden-layer weight (∂C/∂w), we need to find how much the next layer's Z changes with that weight, how much the next layer's activation changes with Z, and how much the cost C changes with that activation. These indirectly related quantities can be expressed with the chain rule in terms of the directly related ones.

Equations 13
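Equations 13 presumably spell out the chain rule described above, one equation per weight between the hidden layer and the output layer, roughly of the form

\frac{\partial C}{\partial w^2} = \frac{\partial z^3}{\partial w^2} \cdot \frac{\partial a^3}{\partial z^3} \cdot \frac{\partial C}{\partial a^3}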

Now we need to find values for each expression and evaluate the final result.

From Equation 7 we can derive ∂z/∂w directly, since Z is a simple weighted sum of the hidden activations. We also know the definition of the sigmoid function, so to derive ∂a/∂z we rewrite the sigmoid as a power of (1 + e^{-z}) and apply the chain rule together with the rule for differentiating a power; applying the chain rule once more to the e^{-z} term, with the rule for differentiating an exponential, we arrive at the result below.

Equation 14
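A compact reconstruction of this derivation, using the sigmoid defined earlier, is

a = \sigma(z) = (1 + e^{-z})^{-1}

\frac{\partial a}{\partial z} = -(1 + e^{-z})^{-2} \cdot \frac{d}{dz}(1 + e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2} = \sigma(z)\,(1 - \sigma(z))

so Equation 14 is presumably \partial a / \partial z = a (1 - a).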

Now we have values for all three derivatives, so we can combine them and write the equations as below.


Equations 15

We can write these equations in matrix form as below.

Equation 16
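Under the assumptions above (squared-error cost and sigmoid activation), Equations 15 and their matrix form in Equation 16 are presumably along these lines, with i indexing output neurons and j indexing hidden neurons:

\frac{\partial C_i}{\partial w_{ij}^2} = a_j^2 \cdot a_i^3 (1 - a_i^3) \cdot \frac{\partial C_i}{\partial a_i^3}

In matrix form, this is the outer product of the column vector of a_i^3 (1 - a_i^3) \, \partial C_i / \partial a_i^3 terms with the row vector of hidden-layer activations.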

We now have the cost derivatives with respect to the weights of the hidden layer; next we need to find the cost derivatives with respect to the activations of the hidden layer.

As discussed previously, an activation of the hidden layer affects the Z values of the output layer, the Z values affect the activations of the output layer, and the activations affect the cost (C). But there is a difference here: each weight contributes to only one Z value, because there is a single weight from each neuron to each neuron of the next layer, whereas one activation of the hidden layer simultaneously affects every neuron of the next layer. So the change in cost with respect to a hidden activation depends on all the Z values of the output layer, and those Z values determine all the output activations. Please check figures 2 and 3 for more clarity. Therefore, to find the change in cost based on a change in a hidden-layer activation, we need to consider all neurons of the output layer, because they all take part in suggesting the new value of that activation.

So ∂C/∂a can be described as

Equations 17

(Here, * indicates average cost.) 
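With the same notation, Equations 17 would sum the chain-rule contribution of every output neuron; a possible form (the averaging factor of 1/2 over the two output neurons is an assumption based on the note about the average cost) is

\frac{\partial C^*}{\partial a_j^2} = \frac{1}{2} \sum_{i=1}^{2} \frac{\partial z_i^3}{\partial a_j^2} \cdot \frac{\partial a_i^3}{\partial z_i^3} \cdot \frac{\partial C_i}{\partial a_i^3}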

We need to simplify each expression the same way as we did previously.

Equation 18

We already know the other two expressions, so if we combine them all we get

Equations 19

We can write the above equations in matrix form.

Equation 20

We now have both types of cost derivatives: with respect to the weights and with respect to the activations. One thing to notice here is that we are using partial derivatives, not total derivatives. Since the output layer's values depend on two independent quantities of the previous layer (weights and activations), we have to use partial derivatives, keeping one variable constant and treating the other as the only variable. This is the main reason we need to repeat the entire backward propagation multiple times: we assume one value is correct (constant) so that we do not need to change it while we change the other to reduce the cost. The constant value is not actually correct, so our result is still far from the actual minimum.

First we held the activations constant and derived C with respect to the weights; then we held the weights constant and derived C with respect to the activations. Through this process we found the gradient matrix shown in Equation 16 and the ∂C/∂a vector shown in Equation 20. We will find the same kind of gradient matrix for the input layer too, and that ∂C/∂a vector will help us find it. We will subtract the gradient matrices from the weight matrices which we initialized with random values; that way we are one step closer to the proper weights. We then use the new weights to compute the output with forward propagation again and repeat the whole backward propagation to find better weights. We repeat this many times until we get the proper, desired weights.

Let's do the same process for the input layer now, without going into as much detail as we did for the hidden layer.


 Figure 4

Figure 5

Figure 6

(We omit the * from here on, as every C would carry it.)

Equations 21

We already know the values of the cost derivatives with respect to the hidden-layer activations from Equations 19, so we will not substitute them here; otherwise our equations would become huge.

We evaluate the other two expressions and come up with this.

Equations 22

We can write the above equations in matrix form as

Equation 23
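Equations 22 and 23 presumably have the same structure as Equations 15 and 16, one layer earlier; a sketch (with j indexing hidden neurons and k indexing input neurons, which is an assumed labeling) is

\frac{\partial C^*}{\partial w_{jk}^1} = a_k^1 \cdot a_j^2 (1 - a_j^2) \cdot \frac{\partial C^*}{\partial a_j^2}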

Now we have two gradient matrices, one for the input layer's weights and one for the hidden layer's weights, as below.

We subtract the gradient matrices from the initial random weight matrices to get the new weight matrices, which we will use for the next iteration.

Equation 24

and


Equation 25
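Equations 24 and 25 are presumably plain weight updates of the form below; in practice the gradient is usually scaled by a learning rate before subtracting, but whether a learning rate is included in the original equations is not stated here:

W^1_{new} = W^1 - \nabla W^1, \qquad W^2_{new} = W^2 - \nabla W^2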

Now our initial values are no longer random; they are a little closer to the proper weight matrices. As we iterate this process over and over, we approach the proper weight matrices, and we can say that our neural network is trained and ready to take input and produce reliable output.
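Putting the whole procedure together, here is a minimal training sketch in Python with NumPy, reusing the sigmoid() and forward() helpers and the W1, W2 matrices from the earlier sketch. It assumes the squared-error cost and sigmoid derivative discussed above; the learning rate and iteration count are illustrative choices, not values from the article.

def backward(x, y, W1, W2):
    # One forward pass followed by one backward pass; returns the gradient matrices.
    a1, a2, a3 = forward(x, W1, W2)

    # Output layer: dC/da for each output neuron (squared-error cost),
    # multiplied by the sigmoid derivative a(1 - a) to get dC/dz.
    delta3 = 2.0 * (a3 - y) * a3 * (1.0 - a3)

    # Gradient w.r.t. the hidden-to-output weights (the role of Equation 16).
    grad_W2 = np.outer(delta3, a2)

    # Propagate back to the hidden activations (the role of Equation 20,
    # here without any averaging factor), dropping the bias entry,
    # then to the input-to-hidden weights (the role of Equation 23).
    dC_da2 = (W2.T @ delta3)[:-1]
    delta2 = dC_da2 * a2[:-1] * (1.0 - a2[:-1])
    grad_W1 = np.outer(delta2, a1)
    return grad_W1, grad_W2

# Illustrative training loop on a single input/target pair.
x = np.array([0.3, 0.7])
y = np.array([0.1, 0.9])
lr = 0.5   # learning rate (an assumption; the article subtracts the raw gradient)
for _ in range(5000):
    grad_W1, grad_W2 = backward(x, y, W1, W2)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
print(forward(x, W1, W2)[2])   # now close to the target y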
