

The Error Gradient

As in Section 3.4.3, the error function of the training pattern $(I^{(k)},O^{(k)})$ is once again defined as


\begin{displaymath}E_k={1\over 2} \sum_{i=1}^q (O^{(k)}_i-O_i)^2 \quad \mbox{with} \quad
O=f(I^{(k)}), \quad f: {\bf R}^p \rightarrow {\bf R}^q \end{displaymath}
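As an illustration only (not part of the original text), a minimal Python sketch of this per-pattern error, assuming the network function is available as a callable f mapping an input vector in ${\bf R}^p$ to an output vector in ${\bf R}^q$:

\begin{verbatim}
import numpy as np

def pattern_error(f, I_k, O_k):
    """E_k = 1/2 * sum_i (O_k[i] - O[i])^2 with O = f(I_k).

    f is assumed to be a callable implementing the network function."""
    O = f(np.asarray(I_k, dtype=float))          # network output for input pattern I_k
    return 0.5 * np.sum((np.asarray(O_k, dtype=float) - O) ** 2)
\end{verbatim}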

Since the network function $f$ is itself a function of the network parameters ${\bf P}$, and ${\bf P}$ is the set ${\bf W}$ of the weight vectors $W_j$ of the $n$ neurones, the error function $E_k$ is also a function of ${\bf W}$, and a gradient can be defined as


\begin{displaymath}\nabla E_k=\nabla E_k(W_1,W_2, \ldots W_n)=
\left( {\partial{E_k}\over\partial{W_1}}, \ldots {\partial{E_k}\over\partial{W_n}} \right) \end{displaymath}


\begin{displaymath}{\partial{E_k}\over\partial{W_j}}=\left( {\partial{E_k}\over\partial{w_{j1}}},
\ldots {\partial{E_k}\over\partial{w_{jp_j}}} \right), \quad p_j=\dim W_j \end{displaymath}
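The following section derives this gradient analytically via backpropagation. Purely as an illustration, the sketch below approximates the same partial derivatives numerically, assuming the error is given as a function E of a flat weight vector (both names are placeholders):

\begin{verbatim}
import numpy as np

def numerical_gradient(E, W, eps=1e-6):
    """Central-difference approximation of dE/dw for every weight w in W.

    Only a numerical check; backpropagation computes this gradient exactly."""
    grad = np.zeros_like(W, dtype=float)
    for j in range(W.size):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[j]  += eps
        W_minus[j] -= eps
        grad[j] = (E(W_plus) - E(W_minus)) / (2 * eps)
    return grad
\end{verbatim}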

The backpropagation algorithm is a gradient descent method; the weights are therefore updated along the negative gradient of the error function.


Online and Batch Learning

The weights can be updated immediately after $\Delta {\bf W}$ is determined for a pattern. This is called online learning.


\begin{displaymath}{\bf W}^{(i+1)} = {\bf W}^{(i)}+\Delta {\bf W}^{(i)}, \quad
\Delta {\bf W}^{(i)}= - \gamma \nabla E_k({\bf W}^{(i)}) \end{displaymath}
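A minimal sketch of one online-learning pass, assuming grad_E_k(W, pattern) returns $\nabla E_k({\bf W})$ for the given pattern (these names are placeholders, not defined in the text):

\begin{verbatim}
def online_epoch(grad_E_k, W, patterns, gamma):
    """Online learning: apply the correction immediately after each pattern."""
    for pattern in patterns:
        W = W - gamma * grad_E_k(W, pattern)   # Delta W = -gamma * grad E_k(W)
    return W
\end{verbatim}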

Batch learning updates the weights with the arithmetic mean of the corrections over all $t$ patterns. This can lead to better results for small and very heterogeneous training sets.


\begin{displaymath}\Delta {\bf W}^{(i)}= - {\gamma \over t}
\sum_{k=1}^t \nabla E_k({\bf W}^{(i)}) \end{displaymath}
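Correspondingly, a sketch of one batch step under the same placeholder assumptions; the per-pattern gradients are averaged before a single update is applied:

\begin{verbatim}
def batch_step(grad_E_k, W, patterns, gamma):
    """Batch learning: update with the mean gradient over all t patterns."""
    t = len(patterns)
    mean_grad = sum(grad_E_k(W, p) for p in patterns) / t
    return W - gamma * mean_grad
\end{verbatim}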

The constant $\gamma$ is called the learning rate. A high value of $\gamma$ leads to larger learning steps at the cost of lower accuracy.


Learning with Impulse

In regions where the error function is very flat, the gradient vector will be very short and lead to very small learning steps. A solution to this problem is the introduction of an impulse (momentum) term, which is added to the update $\Delta {\bf W}$ and grows steadily as long as the direction of $\Delta {\bf W}$ remains stable.


\begin{displaymath}\Delta {\bf W}^{(i)}= - \gamma \nabla E({\bf W}^{(i)})
+ \alpha \Delta {\bf W}^{(i-1)}, \quad \alpha \in [0,1) \end{displaymath}
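A sketch of the update with an impulse term; the previous correction must be carried along from step to step (grad_E is again a placeholder for the gradient evaluated at the current weights):

\begin{verbatim}
def impulse_step(grad_E, W, delta_prev, gamma, alpha):
    """Gradient step plus impulse term alpha * previous correction."""
    delta = -gamma * grad_E + alpha * delta_prev
    return W + delta, delta                    # keep delta for the next step
\end{verbatim}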

The impulse constant $\alpha$ reflects the ``acceleration'' a point gains while descending the error function. If $\nabla E$ is assumed to be constant, the maximum acceleration factor $a$ is given by


\begin{displaymath}a={\Delta {\bf W}^{(\infty)} \over \Delta {\bf W}^{(0)}}=
\sum_{i=0}^\infty \alpha^i = {1\over 1-\alpha} \end{displaymath}
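For example, $\alpha=0.9$, as used in Fig. 1, gives a maximum acceleration factor of $a=1/(1-0.9)=10$.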

Fig. 1 shows a training process for the XOR problem (Section 6.2.2) with $\alpha=0$ and $\alpha=0.9$.

Figure 1: Error Graph for the XOR Problem
\begin{figure}
\centerline {\fbox{\epsffile{backerrw12.ps}}} \small\end{figure}


