While in principle any of the Parix virtual topologies could have been chosen for our implementation, it is clear that the torus- and ring-embeddings yield lowest dilation and congestion on our 2D hardware grid.
For the work distribution, we used a packet forwarding scheme, where the work requests are forwarded to neighbored processors until either some work is sent back or the message makes a full round through the system, thereby indicating that no work is available. On a torus topology, requests are first forwarded along the horizontal ring. If none of the processors has work to share, the request is then sent along the vertical ring, see Figure 2. This yields a wide-spread work distribution, while each processor communicates only with a subset of processors.
Figure 2: Process architecture on Parix
The right part of Figure 2 illustrates the process architecture of our Parix implementation. Each processor executes nine concurrent threads (=light weighted processes). All threads run in the same context, that is, they share the same global variables defined by the main program. Hence, the worker and receiver threads have direct memory access to the processor's frontier node array for retrieving new work packets. Memory contention is arbitrated by semaphores. The sender and receiver threads serve incoming messages and send work requests on demand.