
Evaluation and Backpropagation

The main feature of backpropagation in comparison with other gradient descent methods is that, provided all netto input functions are linear, the weight update $\Delta {\bf W}_j$ of the neurone $N_j$ can be found using only local information, i.e. information passed through the incoming and outgoing transitions of the neurone. The process consists of an evaluation step, in which the error is calculated, and the backpropagation of the error in the inverse direction, from the output back to the input neurones.

The Network Function

Due to the linearity of the netto input function, the overall network function $f$ consists merely of additions, scalar multiplications and compositions of the activation functions. The partial derivatives are thus calculated as follows:


\begin{displaymath}
{\partial{f_1(x)+f_2(x)}\over\partial{x}}={\partial{f_1(x)}\over\partial{x}}+{\partial{f_2(x)}\over\partial{x}}, \quad
{\partial{k\,f(x)}\over\partial{x}}=k\,{\partial{f(x)}\over\partial{x}}
\end{displaymath}

\begin{displaymath}
{\partial{f_2\left( f_1(x) \right)}\over\partial{x}}=
{\partial{f_1(x)}\over\partial{x}}\left[{\partial{f_2(y)}\over\partial{y}}\right]_{y=f_1(x)}
\end{displaymath}
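As a worked illustration (assuming, as in the text, a linear netto input $x=\sum_{i=1}^p w_i I_i$ and an activation function $g$), applying the sum, scaling and composition rules to the output of a single neurone gives the partial derivative with respect to one weight:

\begin{displaymath}
{\partial\over\partial{w_j}}\,g\!\left(\sum_{i=1}^p w_i I_i\right)=
\left[{\partial{g(y)}\over\partial{y}}\right]_{y=x}{\partial{x}\over\partial{w_j}}=
g'(x)\,I_j
\end{displaymath}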

Calculating the Error Gradient

During the evaluation step, not only the value of the activation function $g(x)$ but also the value of its derivative $g'(x)$ is calculated for the netto input $x$. If $g=\sigma_1=\sigma$, the derivative has a very simple form.


\begin{displaymath}
\sigma(x) = {1 \over 1+{\rm e}^{-x}}, \quad
{\partial{\sigma(x)}\over\partial{x}} = {{\rm e}^{-x} \over (1+{\rm e}^{-x})^2}=
\sigma(x)\,(1-\sigma(x))
\end{displaymath}
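This identity can be checked numerically. The following Python sketch is an illustration only (the names sigma and sigma_prime are not taken from the text); it compares the closed form $\sigma(x)(1-\sigma(x))$ with a central finite-difference approximation of the derivative:

import math

def sigma(x):
    # logistic activation function sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def sigma_prime(x):
    # closed-form derivative sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigma(x)
    return s * (1.0 - s)

h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (sigma(x + h) - sigma(x - h)) / (2.0 * h)  # central difference
    print(x, sigma_prime(x), numeric)  # both values agree up to the discretisation error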

Since $E_k$ depends on the output vector $O$ (calculated by the network function $f$) and only indirectly on the weights, $\nabla E_k$ can be written as


\begin{displaymath}
\nabla E_k={\partial{E_k}\over\partial{{\bf W}}}=
{\partial{E_k}\over\partial{O}}\,{\partial{O}\over\partial{{\bf W}}},
\quad O=f(I^{(k)})
\end{displaymath}


\begin{displaymath}
\mbox{and} \quad
{\partial{E_k}\over\partial{O_i}}=
{1\over 2}\,{\partial\over\partial{O_i}}\,\sum_{i=1}^q (O^{(k)}_i-O_i)^2=
O_i-O^{(k)}_i = \Delta O_i
\end{displaymath}
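As a concrete check (the numbers are chosen here purely for illustration), consider $q=2$ output neurones with target output $O^{(k)}=(1,0)$ and actual output $O=(0.8,\,0.3)$:

\begin{displaymath}
E_k={1\over 2}\left[(1-0.8)^2+(0-0.3)^2\right]=0.065, \quad
\Delta O=(0.8-1,\;0.3-0)=(-0.2,\;0.3)
\end{displaymath}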

To calculate the partial derivative for each element of the weight vector of each node, the output nodes are set to $\Delta O_i$ and ${\partial{O}\over\partial{{\bf W}}}$ is calculated by successively stepping backward, in the opposite direction of the transitions in ${\bf T}$, and applying the derivative rules listed above. Composition is handled by multiplying the stored outer derivative $g'(x)$ with the sum of the inner derivatives $\delta_j$ received via the $q$ inverted output transitions.


\begin{displaymath}
\mbox{input} \: \delta_j, \quad
\delta={\partial{E_k}\over\partial{x}}=g'(x)\sum_{j=1}^q \delta_j
\end{displaymath}

The value $\delta$ is then propagated to the $p$ input nodes by multiplying it with the corresponding weight $w_i$, and the weight itself is updated.


\begin{displaymath}\delta'_i= w_i \delta, \quad \mbox{output} \: \delta'_i, \quad
\Delta w_i = I_i \delta \end{displaymath}
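The last two formulas can be combined into a compact sketch of the backward step for a single neurone. The Python function below is an illustration only (its name and data layout are assumptions, not part of the simulator described in the text): given the stored derivative $g'(x)$, the inputs $I_i$, the weights $w_i$ and the deltas $\delta_j$ received via the $q$ inverted output transitions, it returns the propagated deltas $\delta'_i$ and the weight-update components $\Delta w_i = I_i\,\delta$:

def backward_step(g_prime_x, inputs, weights, incoming_deltas):
    # delta = g'(x) * sum of the deltas received via the q inverted output transitions
    delta = g_prime_x * sum(incoming_deltas)
    # deltas passed on to the p input nodes: delta'_i = w_i * delta
    propagated_deltas = [w_i * delta for w_i in weights]
    # components of the weight update: Delta w_i = I_i * delta
    weight_updates = [I_i * delta for I_i in inputs]
    return propagated_deltas, weight_updates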

