Table 1 shows the performance results of AIDA* for three instances of the 44-Puzzle out of Korf's problem set [1]. All efficiency data is normalized to the performance of an optimal sequential Parix implementation.
Table 1: The results on the Parsytec GCel
We used two different PVM implementations for our experiments: PVM is an -release of a homogeneous PVM version where all tasks are spawned on the transputer nodes of the parallel system. The heterogeneous version, PVM, runs on any available hardware, permitting a heterogeneous collection of serial and parallel computers to appear as a single virtual machine. The task management is done by a PVM daemon running on the front-end machine of the parallel system. Both PVM environments are based on Parix.
As can be seen from the table, all three versions scale reasonably well for moderate system sizes. Of course, the implementation on Parix is the fastest. The homogeneous PVM is less efficient, but it performs much better than the heterogeneous version. From the very nature of the heterogeneous PVM, this programming environment cannot be as efficient as the homogeneous implementation, because all messages are checked whether the destination is `within' or `outside' the MPP system.
Table 2: Communication time between neighbored processors
In a separate experiment, we benchmarked latency times and communication speed of the three environments. Table 2 shows the communication time between neighbored processors in the GCel system. The messages are not encoded to achieve maximal performance. As can be seen, the setup times of the PVM communication are very high, especially in the heterogeneous version. In our application, only small messages ( 110 byte) are sent, resulting in higher PVM communication overheads for the larger system sizes (Table 1).
Table 3: Times to send a message (128 bytes) through all processors
In another experiment we tried to determine the performance loss due to the restriction in the process model. We can't do concurrent work and loadbalancing, because we only have one task on each transputer. So, communication can only take place at certain time intervals, when the search process is interrupted. Table 3 shows the time for sending a message on a ring through all processors of the system with and without a search routine. The time measured without the search routine is the time one would expect, if the communication can be done by a special task concurrent to the node expansion. The delay caused by the search routine increases the average time for a message between `neighbored' processors on the ring from 681 to 2.4 on 512 processors when using the same communication frequency like in the real application.