On-line training updates all the weights in every backward pass, i.e. after each item of the training set. The parallelization is therefore very fine-grained: the individual vector-matrix operations have to be computed in parallel, which requires a lot of communication.
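To fix the operations that have to be parallelized, the following is a minimal sequential sketch of one on-line training step for a network with a single hidden layer and logistic activations; the names (online_backprop_step, W1, W2, lr) are our own and not taken from the paper.

    import numpy as np

    def online_backprop_step(x, t, W1, W2, lr=0.1):
        """One on-line step: forward pass, backward pass and an immediate
        weight update for a single training item (x, t)."""
        sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

        # forward propagation
        h = sigma(W1 @ x)          # hidden activations
        y = sigma(W2 @ h)          # output activations

        # backward propagation of the error (logistic derivative a*(1-a))
        delta_out = (y - t) * y * (1.0 - y)
        delta_hid = (W2.T @ delta_out) * h * (1.0 - h)

        # the weights are changed after every single training item
        W2 -= lr * np.outer(delta_out, h)
        W1 -= lr * np.outer(delta_hid, x)
        return W1, W2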
The responsibility for computing the activations and outputs of the neurons is distributed among the processors: the hidden and output layers are partitioned into p disjoint sets, and each set is mapped onto one of the p processors. Consequently, the weight matrices are split row-wise, and the rows are distributed among the processors.
When a split is necessary, the neuron and its weights are broadcast around the ring of processors, and responsibility for this neuron is given to the first processor with the lowest load.
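The following sketch illustrates this mapping under our own assumptions: a simple block partition of the neurons (rows of the weight matrix) onto p processors, and assignment of a split neuron to the first processor with the lowest load. The helper names and the block-partition choice are ours.

    import numpy as np

    def partition_rows(n_neurons, p):
        """Map the neurons of a layer (= rows of its weight matrix) onto
        p disjoint sets, one set per processor (block partition)."""
        return np.array_split(np.arange(n_neurons), p)

    def assign_split_neuron(loads):
        """Give responsibility for a (re)assigned neuron to the first
        processor with the lowest load."""
        return int(np.argmin(loads))

    # Example: 10 hidden neurons on p = 3 processors
    parts = partition_rows(10, 3)     # parts[k]: rows stored on processor k
    owner = assign_split_neuron([4, 3, 3])   # -> 1, first processor with load 3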
After each step of the forward propagation, the new output vector of the layer has to be broadcast, because every processor needs the complete output of the preceding layer to compute the activations of its own neurons in the next step.
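A minimal sketch of one such forward step, simulating the p processors in a loop: each processor evaluates only the rows it owns, and concatenating the partial results plays the role of broadcasting the new output vector to all processors. Function and variable names are assumptions of ours.

    import numpy as np

    def parallel_forward_layer(W, x_prev, parts):
        """Processor k owns the rows parts[k] of W and computes only the
        outputs of its own neurons; the concatenation stands in for the
        broadcast of the complete layer output."""
        sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
        partial = [sigma(W[rows] @ x_prev) for rows in parts]  # local work
        return np.concatenate(partial)  # full output, available to every processor

    # Example: layer with 10 neurons, 6 inputs, p = 3 processors
    rng = np.random.default_rng(0)
    W = rng.standard_normal((10, 6))
    x = rng.standard_normal(6)
    parts = np.array_split(np.arange(10), 3)
    y = parallel_forward_layer(W, x, parts)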
The backward phase is more complicated: to propagate the error, each processor needs the columns of the weight matrices to compute the error of the previous layer, but the weights are stored and updated row-wise because of the operations in the forward propagation. There are two different methods to implement the parallel computation of the errors.
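The paper does not spell out the two methods here; as one possible scheme (not necessarily either of the two), each processor can form the partial sum of the column product over its own rows and the partial sums can then be combined, e.g. by an all-reduce, so that no column redistribution is needed. The sketch below simulates this with our own names and assumptions (logistic activations in the previous layer).

    import numpy as np

    def parallel_backward_layer(W, delta, h_prev, parts):
        """Processor k owns only the rows parts[k] of W, so it can form
        just the partial sum W[rows].T @ delta[rows] of the column product.
        Summing the partial results (here a plain sum, on a real machine an
        all-reduce) yields the error of the previous layer."""
        partial = [W[rows].T @ delta[rows] for rows in parts]  # local partial sums
        back = np.sum(partial, axis=0)                         # combine ("all-reduce")
        return back * h_prev * (1.0 - h_prev)                  # logistic derivative

    # Example: 10 neurons, 6 neurons in the previous layer, p = 3 processors
    rng = np.random.default_rng(1)
    W = rng.standard_normal((10, 6))
    delta = rng.standard_normal(10)
    h_prev = 1.0 / (1.0 + np.exp(-rng.standard_normal(6)))
    parts = np.array_split(np.arange(10), 3)
    delta_prev = parallel_backward_layer(W, delta, h_prev, parts)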
Figure 9: Parallel on-line backward propagation of Yoon et al.