# Optimal Processor Mapping for Linear- # Complement Communication on Hypercubes # and Their Variations #### Yomin Hou Department of Computer and Information Science National Chiao-Tung University, Hsinchu, Taiwan, ROC Email: ymhou@cis.nctu.edu.tw ## Chien-Min Wang Institute of Information Science, Academia Sinica Taipei, Taiwan, ROC Email: cmwang@iis.sinica.edu.tw #### Chiu-Yu Ku Avanti Technology Inc., Taipei, Taiwan, ROC Email: cyku@avanticorp.com ## Lih-Hsing Hsu Department of Computer and Information Science National Chiao-Tung University, Hsinchu, Taiwan, ROC lhhsu@cis.nctu.edu.tw #### Abstract In this technical report, we address the problem of minimizing channel contention of linear-complement communication on wormhole-routed hypercubes. Our research reveals that, for traditional routing algorithms, the degree of channel contention of a linear-complement communication can be quite large. To solve this problem, we propose an alternative approach, which applies processor mapping at compile time. In this compiler approach, processors are logically reordered according to the given communication(s) so that the new communication(s) can be efficiently realized on the hypercube network. It is proved that, for any linear-complement communication, there exists a reordering mapping such that the new communication has minimum channel contention. An $O(n^3)$ algorithm is proposed to find such a mapping for an n-dimensional hypercube. An algorithm based on dynamic programming is also proposed to find an optimal reordering mapping for a set of linear-complement communications. Several computer simulations have been conducted, and the results clearly show the advantage of the proposed approach. **Keywords**: Hypercubes, linear-complement communication, channel contention, processor mapping, wormhole routing. ### 1. Introduction In a message-passing multi-computer, efficient schemes to move messages among the processors are required for obtaining fast and efficient parallel algorithms. This problem is called the *message routing problem*. Many studies on the message routing problem have been based on *store-and-forward routing*, where the message latency is proportional to the product of the message length and the number of routing steps. Hence, most of them have concentrated on minimizing the number of routing steps in moving messages among processors [4], [15], [16], [17], [19], [20]. On the other hand, wormhole routing has been widely adopted recently due to its effectiveness in inter-processor communication [1], [2], [21]. With wormhole routing, each message is divided into a number of flits. The header flit(s) carries the address information and governs the route while the remaining flits of the message follow in a pipeline fashion. The pipelined nature of wormhole routing provides two attractions. First, in the absence of channel contention, the network latency would be relatively insensitive to the path length [9], [21]. Second, the large message buffers for each router are obviated; only small flit buffers are required [21]. However, channel contention can have a severe impact on the network latency of wormhole routing. Channel contention happens when multiple messages simultaneously use the same channel in their routes. If k messages are contending for the same channel, only one of them can reserve the channel and be forwarded through it. The other messages have to wait until the channel is released and then contend for it again. In the worst case, the network latency will become k times. Hence, an important issue on wormhole-routed parallel computers is to minimize channel contention between messages. Many studies have tried to improve the adaptability of routing algorithms to solve the contention problem [5], [7], [10], [11], [14], [21]. However, this approach requires extra hardware support, such as buffer space and control logics. Moreover, the complexity of adaptive routers significantly increases their inter-router setup delay and flow control cycle times [6]. Consequently, the claims of performance advantages in channel utilization may not be able to be balanced against losses on achievable implementation speed. For these reasons, we try to solve the contention problem by compiler approaches rather than routing algorithms at run-time. In this report, we focus on the problem of minimizing the channel contention of linear-complement communication on wormhole-routed hypercubes. The hypercube is one of the most efficient networks for parallel computation. It can efficiently simulate any other network of the same size. In particular, the N-node hypercube can simulate any O(N)-node array, binary tree, or mesh of trees with only a small constant factor slowdown [13]. Furthermore, it is complete symmetric and decomposable into sub-hypercubes [22]. These properties, in addition to its simplicity, make it be an excellent and popular architecture for distributed memory parallel computers [1], [2]. Linear-complement communication (LCC) is a class of communications where the address bits of the destination of each message are linear combinations of the address bits of its source and their complements. Many important problems, like fast Fourier transform, matrix transposition, polynomial evaluation, etc., can be effectively solved on parallel computers, which have an efficient scheme to support this type of communications. To minimize the channel contention of LCC, we adopt a new approach that applies processor mapping at compile time. In the compiler approach, processors are logically rearranged according to the given communications before the executable code of the given parallel program is generated. In this way, no extra data movement will be incurred. By appropriately rearranging the processors, the new communications can be efficiently realized on the hypercube network. It can be proved that, for any LCC, there exists a reordering mapping such that the new communication has minimum channel contention. An $O(n^3)$ algorithm is proposed to find such a mapping for an n-dimensional hypercube. For a parallel program containing more than one LCC, a dynamic programming algorithm is proposed to find an optimal reordering mapping. With these results, compiler techniques can be used to minimize the channel contention of LCCs on hypercubes. In addition, only the e-cube routing is assumed in the proposed approach, and no extra hardware is needed. Therefore, it is of practical use. Experiments based on computer simulation have been conducted, and the results clearly show the advantage of the proposed approach. Related researches for store-and-forward interconnection networks have been reported. Most of them focused on subsets of LCC, such as *linear-complement permutation* (LCP) and *bit-permute-complement permutation* (BPC). For example, Boppana and Raghavendra [4] considered LCPs on hypercubes, Nassimi and Sahni [19], [20] dealt with BPCs on meshes and hypercubes, and Masuyama [16], [17] dealt with BPCs on chordal rings and hypercubes. However, none of these methods can be applied to *linear-complement scatter* (LCS) or *linear-complement gather* (LCG). Lin and Wang [15] considered another set of communications represented by $\binom{s}{d}$ -mask formalism. Though this representation scheme can encode a broad class of communications, it still can not represent a LCS or LCG with a single $\binom{s}{d}$ -mask. All the above researches tried to give efficient routing algorithms. Hence, they require new-designed hardware and can not be applied to existing parallel computers. Moreover, these researches aim at minimizing the number of routing steps, and are not suitable for wormhole-routed networks. The rest of this report is organized as follows. Section 2 introduces the notations and definitions. In Section 3, we describe the compiler approach. Some properties and an algorithm of processor reordering mapping are presented in Section 4. In Section 5, a dynamic programming algorithm is proposed for a set of LCCs. Section 6 shows the experimental results based on computer simulation. In Section 7, we show the method to apply the processor mapping approach for performing LCCs on SCI Origin 2000. Finally, conclusions are given in Section 8. ## 2. Background When a communication is performed, messages are generated by a set of source nodes and transmitted through the interconnection network to their destination nodes. The *communication latency* is the interval from the time the source nodes begin to send messages until the last destination node has received the message. If some of the paths for transmitting these messages contend for the same channel, then the communication latency will be increased. The more paths contend for the same channel, the longer communication latency is required. Therefore, the maximum number of paths contending for the same channel has a severe impact on the communication latency. Let the degree of channel contention for a communication be defined as the maximum number of paths contending for the same channel. In this report, we consider the problem of performing LCC on a hypercube computer with the *e*-cube wormhole routing capability. Our goal is to minimize the degree of channel contention so that LCC can be performed efficiently. In this section, the definitions and notations of these terminologies will be clarified. #### 2.1 The hypercube network An *n*-dimensional hypercube is a directed graph which contains $N = 2^n$ nodes and $n \times 2^n$ channels. Each node corresponds to an *n*-bit binary string, $b_{n-1}b_{n-2}...b_1b_0$ . We shall use a binary vector $[b_0 \ b_1 \ ... \ b_{n-1}]^t$ to represent it. Two nodes are connected with a pair of channels, one for each direction, if and only if their binary strings differ in exactly one bit. As a consequence, each node is incident to *n* other nodes through *n* different channels, one for each bit position. The channel from node *x* to *x'* is denoted by (x, x'), and said to be at dimension *k* if *x* and *x'* differ in the *k*th bit position. ### 2.2 Linear-complement communication The messages generated when performing a communication operation usually can be formulated by some specific pattern. For example, in the bit-reverse communication operation, the source and destination nodes of each message can be represented by $[x_0 \ x_1 \ ... \ x_{n-1}]^t$ and $[x_{n-1} \ x_{n-2} \ ... \ x_0]^t$ , respectively. In what follows, we will define the class of linear-complement communications on an n-dimensional hypercube. The addition and multiplication in this subsection are modulo-2, i.e., they are defined in the finite field, GF(2) [18]. **Definition 1**: A communication is a *linear-complement communication* (*LCC*) if there exits a binary matrix $A_{n\times n}$ and an *n*-dimensional binary vector b such that, for every message with source node x, its destination node y is given by the equation y = Ax + b. **Definition 2**: A LCC with a binary matrix $A_{n\times n}$ and an *n*-dimensional binary vector *b* is a linear-complement permutation (LCP) if $A_{n\times n}$ is non-singular, i.e., rank(A) = n. **Definition 3**: A LCC with a binary matrix $A_{n \times n}$ and an *n*-dimensional binary vector *b* is a linear-complement gather (LCG) if rank(A) < n. Scatter, the dual operation of gather, can be implemented by simply reversing the direction of message transmission of gather. Thus, we can define linear-complement scatter as follows. **Definition 4**: A communication is a *linear-complement scatter (LCS)* if there exists a binary matrix $A_{n\times n}$ , $rank(A_{n\times n}) < n$ , and an *n*-dimensional binary vector b such that, for every message with destination node y, its source node x is given by the equation Ay+b=x. **Example 1**. The bit-reverse communication operation on an 8-dimensional hypercube is a linear-complement communication, $$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.$$ For any source-destination pair (x, y) in a LCC defined above, to obtain the address of y is equivalent to performing a linear transformation of the n-dimensional vector space on x and then adding a constant binary vector. Since there is an one-to-one correspondence between the $n \times n$ binary matrices and linear transformations on an n-dimensional vector space over GF(2), we shall utilize this property in the following sections. ## 2.3 Routing strategies The interconnection network must allow every node to send messages to every other node. In the absence of a complete network, we need a routing algorithm to determine the path selected by a message to reach its destination. Efficient routing algorithms are critical to the performance of the interconnection network. The *e*-cube routing algorithm is the simplest deadlock-free routing strategy on wormhole-routed hypercubes, which reserves the required channels in a strictly increasing order of dimensions. It allows messages to be forwarded only over channels at higher dimensions than that of the last traversed channel. Many current hypercube computers use the *e*-cube routing because of its simplicity and ease of implementation. However, the *e*-cube routing establishes only one shortest path between each pair of source-destination nodes, and does not take the advantage of the flexibility provided by hypercubes. Fully adaptive routing strategies can route messages along any of the shortest paths available in the hypercube network. Unfortunately, multiple channels are needed for a pair of neighboring nodes in order to help these strategies prevent deadlock. This means that extra hardware supports, such as buffer space and control logics, have to be added to the routers. Some partially adaptive routing strategies without the need of multiple channels have been proposed, such as [5], [7], [11], and [14]. Although they can better utilize the flexibility provided by hypercubes, the complexity of adaptive routers significantly increases the inter-router setup delay and flow control cycle times [6]. Furthermore, even though a partially adaptive routing strategy is used, our simula- tion results reveal that the average network latency grows much more rapidly than expected as the network throughput increases. This limits the utilization of the communication capacity of an interconnection network. For example, when the adaptive routing strategies in [7] or [14] is used, the throughput of performing bit-reverse on an 8-dimensional hypercube is always less than 1/8 of the communication capacity of the hypercube network. The main reason for the poor performance is that there exists a set of 8 source-destination pairs contending for the channel ([0 0 0 1 0 0 0 0]<sup>t</sup>, [0 0 0 0 0 0 0 0]<sup>t</sup>). Similar conditions also happen when performing matrix-transpose or reverse-flip. The adaptive routing strategy in [11] also suffers from similar problems when performing matrix-transpose or bit-reverse. From the above discussions, it is shown that the adaptive routing algorithms can not appropriately solve the contention problem for performing LCCs on hypercubes. Hence, in this report, we propose a compiler approach called *processor mapping* to minimize the channel contention. The hypercube network is assumed to support only *e*-cube routing. Thus, no extra hardware is needed and the proposed approach is of practical use. # 3. The Compiler Approach In the proposed compiler approach, the communications to be performed in a parallel program can be detected by compilers automatically or specified by programmers. These communications will be transformed into the matrix form for the optimization process. Next, according to the given communications, an optimal processor mapping is determined. The compiler then can generate the SPMD (Single Program Multiple Data) node program for each processor accordingly. By appropriately choosing a processor mapping, the new communications can be efficiently realized on the hypercube network. As an example, consider executing the parallel loop show in Fig. 1(a) on hypercubes. The function reverse(i, n) returns the bit-reverse of i, i.e., reverse(i, n)= $i_0\times 2^{n-1}+i_1\times 2^{n-2}+...$ $i_{n-2}\times 2+i_{n-1}$ , for the loop index $i=i_{n-1}\times 2^{n-1}+i_{n-2}\times 2^{n-2}+...$ $i_1\times 2+i_0$ . If array elements b[i] and r[i] are distributed on processor $P_i$ , and the task of executing iteration i is also assigned to processor $P_i$ , then the compiler can determine that the communication to be performed is bit-reverse and transform it into the matrix form as shown in Example 1. The SPMD node program without processor mapping can be generated as shown in Fig. 1(b) for comparison. With processor mapping, the compiler has to determine an optimal one-to-one mapping function f according to the given communication. The function f maps $virtual\ processor\ x$ onto $physical\ processor\ x'=f(x)$ . From the view point of virtual processors, data distribution and iteration assignment remain unchanged; i.e., array elements b[i] and r[i] are distributed on virtual processor $VP_i$ and the task of executing iteration i is also assigned to virtual processor $VP_i$ . The only difference is that virtual processor $VP_i$ is now mapped onto the physical processor $P_{f(i)}$ . Accordingly, the SPMD node program for each processor can be generated as shown in Fig. 1(c). Though the communication to be performed between virtual processors is still bit-reverse, the one between physical processors is changed. It will be proved in the next section that the degree of channel contention can be reduced by appropriately choosing the mapping function f. The algorithm of finding an optimal mapping function will be also proposed in the next section. Since sending and receiving of data can be accomplished by hardware as shown in Fig. 1(b), the only overhead of the program shown in Fig. 1(c) at run-time is the mapping ``` real b[2<sup>n</sup>], r[2<sup>n</sup>] doall i = 0 to 2^n-1 do b[i] = r[reverse(i,n)] enddo (a) An example parallel loop. real b_local, r_local i = get_pid() send(reverse(i,n), r_local) receive(reverse(i,n), b local) (b) An example SPMD node program. real b_local, r_local y = get_pid() send(f(reverse(f<sup>-1</sup>(y),n)), r_local) receive(f(reverse(f<sup>-1</sup>(y),n)), b_local) (c) An example SPMD node program for mapping f. real b_local, r_local int to_virtual[2<sup>n</sup>], to_physical[2<sup>n</sup>] i = v_get_pid() /* v_get_pid() returns to_virtual[get_pid()] */ v send(reverse(i,n), r local) /* v_send(v_pid, data) calls send(to_physical[v_pid], data) */ v_receive(reverse(i,n), b_local) /* v_receive(v_pid, data) calls receive(to_physical[v_pid], data) */ (d) An alternative SPMD node program for mapping f. ``` Fig. 1. An example parallel loop and its corresponding SPMD node programs. between virtual processors and physical processors. This overhead can be minimized as array indexing operations as shown in Fig. 1(d). Apparently, this overhead is far less than the communication latency and, therefore, can be neglected. Note also that the SPMD node program shown in Fig. 1(d) is very similar to the one shown in Fig. 1(b) except that the functions get\_pid(), send(), and receive() in Fig. 1(b) are replaced by the functions v\_get\_pid(), v\_send(), and v\_receive(), respectively. This property makes code generation with processor mapping much easier. In addition to generating the node program, the compiler has to determine an appropriate mapping function and set up the two mapping arrays to\_virtual[] and to\_physical[] for use in the node program. Another issue arises when more than one communications will be performed in a parallel program. Our approach is to search for an optimal processor mapping for all the communications. We shall propose a dynamic programming algorithm for that purpose in Section 5. Another approach is to perform *processor re-mapping* at run-time. However, it may result in significant overhead at run-time. Thus, it is not considered in this report. ## 4. Processor Mapping First, we shall show that the degree of channel contention for a LCC is directly related to the ranks of sub-matrices of the binary matrix A of the LCC. A sub-matrix of the binary matrix A is the matrix obtained from A by retaining entries in some row(s) and column(s) and deleting other entries. We shall use $A_{(i)}$ to denote the ith row of A and $A^{(i)}$ to denote the ith column of A. The following definition defines the special sub-matrices and their notations to be used in this report. **Definition 5**: A sub-matrix of the binary matrix A obtained by retaining rows in the set R and columns in the set C is denoted as $A_{R,C}$ . We shall use $L_i$ to denote the set of non-negative integers smaller than i. Therefore, given integer i and j, $A_{L_i,L_j}$ is the upper-left sub-matrix of A with i rows and j columns. Example 2 shows the sub-matrix $A_{L_3,L_3}$ of a 4×4 matrix A. Example 2. Given $$A = \begin{bmatrix} a_{0,0} & a_{0,1} & a_{0,2} & a_{0,3} \\ a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,0} & a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix}$$ , $A_{L_3,L_2}$ is equal to $\begin{bmatrix} a_{0,0} & a_{0,1} \\ a_{1,0} & a_{1,1} \\ a_{2,0} & a_{2,1} \end{bmatrix}$ . **Theorem 1**: Given a LCC y = Ax + b, the maximum number of paths that contend for the same channel at dimension i, denoted as $T_i(A, b)$ , can be determined as follows, $$T_i(A,b) = \begin{cases} 0 & \text{, when } x_i = y_i \\ 2^{i-rank(A_{I_{i+1},I_i})} & \text{, otherwise.} \end{cases}$$ **Proof**: For any channel $l = ([z_0 \ z_1 \ ... \ z_i \ ... \ z_{n-1}]^t, [z_0 \ z_1 \ ... \ \overline{z}_i \ ... \ z_{n-1}]^t)$ at dimension i, suppose that l is in the path from node $x = [x_0 \ x_1 \ ... \ x_{n-1}]^t$ to node $y = [y_0 \ y_1 \ ... \ y_{n-1}]^t$ according to the e-cube routing, then we have $[x_i \ x_{i+1} \ ... \ x_{n-1}]^t = [z_i \ z_{i+1} \ ... \ z_{n-1}]^t$ and $[y_0 \ y_1 \ ... \ y_i]^t = [z_0 \ z_1 \ ... \ z_{i-1}]^t$ . Since y = Ax + b, we have $$\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_i \end{bmatrix} = \begin{bmatrix} z_0 \\ z_1 \\ \vdots \\ \overline{z}_i \end{bmatrix} = \begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,i-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,i-1} \\ \vdots \\ a_{i,0} & a_{i,1} & \cdots & a_{i,i-1} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{i-1} \end{bmatrix} + \begin{bmatrix} a_{0,i} & a_{0,i+1} & \cdots & a_{0,n-1} \\ a_{1,i} & a_{1,i+1} & \cdots & a_{1,n-1} \\ \vdots \\ a_{i,i} & a_{i,i+1} & \cdots & a_{i,n-1} \end{bmatrix} \begin{bmatrix} z_i \\ z_{i+1} \\ \vdots \\ z_{n-1} \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_i \end{bmatrix}$$ Note that the number of paths contending for channel I is equivalent to the number of solutions of $[x_0 \ x_1 \ ... \ x_{i-1}]^t$ . According to Linear Algebra, either there is no solution, or there are exactly $2^{i-rank(A_{I_{i+1},I_i})}$ distinct solutions satisfying the above set of equations. Note also that no solution means no path passing channel I. Hence, for all channels at dimension i, there is no solution if and only if $y_i = x_i$ . In all other cases, there exists a channel at dimension i such that exactly $2^{i-rank(A_{I_{i+1},I_i})}$ paths contend for it. The above theorem shows that the degree of channel contention is determined only by the ranks of sub-matrices of the binary matrix A. As an example, we shall compute the degree of channel contention of matrix-transpose on an 8-dimensional hypercube according to Theorem 1. **Example 3**. Consider the matrix-transpose communication on an 8-dimensional hypercube, $$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.$$ According to Theorem 1, the degree of channel contention at each dimension can be computed as follows: $T_0=2^{0-0}=1$ , $T_1=2^{1-0}=2$ , $T_2=2^{2-0}=4$ , $T_3=2^{3-0}=8$ , $T_4=2^{4-1}=8$ , $T_5=2^{5-3}=4$ , $T_6=2^{6-5}=2$ and $T_7=2^{7-7}=1$ . **Definition 6**: The degree of channel contention of a LCC y=Ax+b is defined to be $\underset{0 \le i \le p-1}{MAX} \{T_i(A,b)\}$ and denoted by T(A,b). To minimize T(A,b), the binary matrix A must be "changed" and, at the same time, the LCC must be correctly performed. To meet these two requirements, we propose an alternative approach, called *processor mapping*. In this approach, processors are logically reordered by a linear or reordering mapping, which are defined formally as follows. **Definition** 7: A processor mapping is said to be a *linear mapping* if there exists a binary matrix $Q_{n \times n}$ such that, for every node x, it is mapped to node x' = Qx. **Definition 8**: A processor mapping is a *reordering mapping* if it is a linear mapping and the matrix O is a permutation matrix, i.e. each row and column of O has exactly one 1. Reordering mapping has the property that the neighboring relation of processors is kept unchanged after processors are reordered. Since the communication between neighboring processors is the most frequently used class of communications, this property provides a great advantage for practical use. In order to ensure that the LCC will be performed correctly, the communication after processor reordering mapping must be changed as shown in the following theorem. **Theorem 2**: Given a LCC with a binary matrix A and a binary vector b, the new communication after the reordering mapping with a permutation matrix Q is a LCC with a binary matrix $QAQ^{-1}$ and a binary vector Qb. **Proof**: Fig. 2 provides a good explanation for this theorem. For a source node x in the LCC with A and b, the destination node y can be computed by the equation, y=Ax+b. After reordering mapping by Q, we have x'=Qx and y'=Qy. Since Q is non-singular, we can Fig. 2. The new communication after reordering mapping Q for y=Ax+b. derive $x = Q^{-1}x'$ . Hence we have $y' = Qy = Q(Ax + b) = QA(Q^{-1}Q)x + Qb = QAQ^{-1}x' + Qb$ . $\Box$ In other words, the degree of channel contention after the reordering mapping is now determined by $QAQ^{-1}$ . By choosing an appropriate processor mapping Q, the degree of channel contention could be greatly reduced. In the following discussion, we will show how to find an optimal reordering mapping for a LCC. First, notice the special situation in Theorem 1 that $T_i(A,b)=0$ at dimension i. It only happens when $y_i=x_i$ for all x, i.e., $A_{(i)}=I_{(i)}$ and $b_i=0$ , where I is the *identity matrix*. Since $T_i(A,b)=0$ means no communication at dimension i, we wish to keep it unchanged when performing processor mapping. It can be accomplished by applying an reordering mapping Q, which moves the ith row of A to the last dimension, before searching for an optimal mapping. Hence, the ith row and column in A will be moved to the last dimension in $QAQ^{-1}$ . Since $[QAQ^{-1}]_{L_{n-1},L_{n-1}} = A_{L_n-\{i\}L_n-\{i\}}$ , and $T_{n-1}(QAQ^{-1}, Qb) = T_i(A, b)=0$ , we can just consider the sub-matrix $A_{L_n-\{i\}L_n-\{i\}}$ when minimizing channel contention of A. Without loss of generality, we may assume $T_i(A,b)>0$ for $0 \le i \le n-1$ in the following discussion. **Theorem 3:** Consider an LCC, y = Ax + b, on an *n*-dimensional hypercube. If $Q_{n \times n}$ is a non-singular matrix, then $T(QAQ^{-1}, Qb) \ge 2^{n-1-rank(A)}$ . **Proof:** Let $B = QAQ^{-1}$ . From Theorem 1, $T_{n-1}(QAQ^{-1}, Qb) = 2^{n-1-rank(B_{L_{n-1},L_{n-1}})}$ . Since $\underset{0 \le i \le n-1}{MAX} \{i - rank(QAQ^{-1})_{L_{i+1},L_i}\} \ge n-1-rank(B_{L_{n-1},L_{n-1}}) \ge n-1-rank(A)$ , it can also be obtained that $T(QAQ^{-1}, Qb) \ge 2^{n-1-rank(A)}$ . **Theorem 4**: For any non-singular binary matrix $A_{n \times n}$ , there exists a permutation matrix $Q_{n \times n}$ such that $rank([QAQ^{-1}]_{L_{i+1},L_i}) = i$ for $0 \le i \le n-1$ . **Proof**: The proof of this theorem is by mathematic induction on the integer i. Obviously, it is true for i=0. Suppose that it is true for $0 \le i \le k$ . There must exist a permutation matrix Q such that $rank([QAQ^{-1}]_{L_{int},L_k}) = i$ for $0 \le i \le k$ . We shall prove that it is also true for $0 \le i \le k+1$ . Note that $$(k+1) \geq rank(\left[QAQ^{-1}\right]_{L_{k+1},L_{k+1}}) \geq rank(\left[QAQ^{-1}\right]_{L_{k+1},L_{k+1}}) \geq rank(\left[QAQ^{-1}\right]_{L_{k+1},L_{k}}) = k.$$ If $rank([QAQ^{-1}]_{L_{k+2},L_{k+1}}) \neq k+1$ , it must held that $$rank([QAQ^{-1}]_{L_{k+2},L_{k+1}}) = rank([QAQ^{-1}]_{L_{k+1},L_{k+1}}) = k.$$ Since A is non-singular and Q is a permutation matrix, we have $rank(QAQ^{-1}) = rank(A) = n$ . This means the columns in $QAQ^{-1}$ are linearly independent. Hence, we have $rank(\left[QAQ^{-1}\right]_{L_n,L_{k+1}})=k+1$ . In other words, there must exist a row $j\geq k+1$ that is linearly independent of rows 0 to k in $\left[QAQ^{-1}\right]_{L_n,L_{k+1}}$ . Let $\widetilde{Q}$ be the permutation matrix exchanging rows j and k+1. Accordingly, $\widetilde{Q}^{-1}$ must be the permutation matrix exchanging column j and k+1. We can derive $rank(\left[\widetilde{Q}(QAQ^{-1})\widetilde{Q}^{-1}\right]_{L_{k+1},L_k})=i$ for $0\leq i\leq k$ , and $$rank(\left[\widetilde{Q}(QAQ^{-1})\widetilde{Q}^{-1}\right]_{L_{k+2},L_{k+1}}) = k+1.$$ Since $(\widetilde{Q}Q)^{-1} = Q^{-1}\widetilde{Q}^{-1}$ , it can be derived that $$rank(\left[\left(\widetilde{Q}Q\right)A\left(\widetilde{Q}Q\right)^{-1}\right]_{L_{i+1},L_{i}}) = i \text{ for } 0 \le i \le k+1.$$ Since $(\widetilde{Q}Q)$ is also a permutation matrix, this theorem is true for $0 \le i \le k+1$ . Therefore, by induction, this theorem is true for $0 \le i \le n-1$ . This completes the proof of this theorem. Corollary 1: For any LCP, there exists a reordering mapping such that the new communication has no channel contention. **Proof:** This corollary can be derived from Theorem 1 and Theorem 4. Following the method in the proof of Theorem 4, we can design an algorithm, as shown in Fig. 3, to find an optimal reordering mapping for any LCP. The input of the algorithm, LCP\_Optimizer, is a LCP y=Ax+b, and the outputs are the optimal reordering ``` /* Given: a LCP, y=Ax+b, where A is a n \times n matrix and b is a vector. Goal: find an optimal reordering mapping Q and the new communication y'=Dx'+d, where D=QAQ<sup>-1</sup> and d=Qb. Initial values for Q, D, and d: Q=I, D=A, and d=b. */ LCP_Optimizer( GF_Matrix A, GF_Vector b, GF_Matrix Q, GF_Matrix D, GF_Vector d){ 2. GF_Matrix V=D; /* V will be used for finding independent rows. */ 3. GF_Vector R=0; /* To mark independent rows */ 4. int i, j, k; 5. for (i=0; i< n-1; i++){ /* At the beginning of iteration i, D has already been optimal for dim \le i and will be optimized for dim=i+1 */ 6. for(j=0; j<n; j++) /* To find a new independent row in sub-matrix D_{Ln,L(i+1)}. */ if( V[j][i]!=0 && R[j]==0 ) break; 8. if(j>i+1){} /* The new independent row is not in D<sub>L(i+2),L(i+1)</sub>, so move it in. */ 9. RowSwap(D, i+1, j); ColSwap(D, i+1, j); RowSwap(d, i+1, j); 10. RowSwap(V, i+1, j); ColSwap(V, i+1, j); RowSwap(R, i+1, j); 11. RowSwap(Q, i+1, j); 12. j=i+1; 13. } 14. for(k=0; k<n; k++) 15. if( k!=j && V[k][i]!=0 ) RowSub(V, j, k); /* V[k]= V[k] - V[j] */ 16. R[j]=1; 17. } 18. } ``` Fig. 3. The proposed algorithm for finding an optimal reordering mapping for a LCP. mapping Q and the new LCP y'=Dx'+d, where $D=QAQ^{-1}$ and d=Qb. At the beginning of LCP\_Optimizer, D, d, and Q are set to A, b, and I, respectively. By appropriately exchanging the rows and the columns, the desired new LCP can be obtained. An important work of LCP\_Optimizer is to find a set of linear independent rows in some sub-matrix of D. In order to do that efficiently, a matrix V is used to find linearly independent rows in a way similar to Gaussian Elimination, and a vector R is used to mark those rows. R[j] set to 1 means the jth row of some sub-matrix of D is linearly independent, and so is the jth row of the corresponding sub-matrix of V. Initially, V is set to D and R is set to 0. During the processing of LCP\_Optimizer, $Rank(D_{L_p,L_q}) = Rank(V_{L_p,L_q})$ for any p, q. The loop from line 5 to line 17 is the main part of LCP\_Optimizer. At the beginning of iteration i, D has already been optimal for those dimensions less than or equal to i, i.e., $Rank(D_{L_{r,1},L_{r}})=Rank(V_{L_{r,1},L_{r}})=r$ for $0 \le r \le i$ , and will be optimized for dimension (i+1). In the meaning time, the selected linearly independent rows of $V_{L_n,L_1}$ are all in $V_{L_{n,1},L_1}$ and marked by R. Each of those rows contains only one 1 and the other elements are 0. All the other rows in $V_{L_n,L_1}$ are zero rows. Hence, to find a new linearly independent row in $D_{L_n,L_{i+1}}$ and $D_{L_n,L_{i+1}}$ , we only need to examine column i of $D_{L_n,L_{i+1}}$ and $D_{L_n,L_{i+1}}$ and $D_{L_n,L_{i+1}}$ are all in $D_{L_n,L_{i+1}}$ and $D_{L_n,L_{i+1}}$ are zero rows. Hence, to find a new linearly independent row in $D_{L_n,L_{i+1}}$ and are zero rows. Hence, to find a new linearly independent row in $D_{L_n,L_{i+1}}$ and $D_{L_n,L_{i+1$ O(n) execution time, the complexity of LCP\_Optimizer can be proved to be $O(n^3)$ . **Example 4**. Consider the matrix-transpose communication on an 8-dimensional hypercube as shown in Example 3. From Fig. 3, we can compute the optimal reordering mapping Q and the new LCP as follows. The degree of channel contention at each dimension for the new LCP becomes $T_0=T_1=T_2=T_3=T_4=T_5=T_6=T_7=2^0=1$ . Compared with Example 3, the degree of channel contention is greatly reduced from 8 to 1, i.e., contention-free, by the reordering mapping. $\Box$ **Theorem 5**: For any binary matrix $A_{n \times n}$ , $rank(A_{n \times n}) < n$ , there exists a permutation matrix $Q_{n \times n}$ such that $i - rank(QAQ^{-1})_{L_{i+1},L_i} \le (n-1) - rank(A)$ for $0 \le i \le n-1$ . **Proof**: Since $rank(A_{n \times n}) < n$ , there exists a column j in A such that column j is a linear combination of other columns. Let P be the permutation matrix exchanging rows j and n-1. Accordingly, $P^{-1}$ must be the permutation matrix exchanging column j and n-1. Let $B=PAP^{-1}$ , we have $rank(B_{L_n,L_{n-1}})=rank(B)=rank(A)$ . In the following, we will prove that there exists a permutation matrix Q such that $$i - rank([QBQ^{-1}]_{L_{l+1},L_i}) \le (n-1) - rank(B_{L_n,L_{n-1}}) \text{ for } 0 \le i \le n-1.$$ The proof is by mathematic induction on the integer i. Obviously, it is true for i = n-1. Suppose that it is true for $k \le i \le n-1$ . There must exist a permutation matrix Q such that $$i - rank([QBQ^{-1}]_{L_{i+1},L_i}) \le (n-1) - rank(B_{L_n,L_{n-1}}) \text{ for } k \le i \le n-1.$$ We shall prove that it is also true for $k-1 \le i \le n-1$ . Consider i=k-1 for the matrix $QBQ^{-1}$ . If $(k-1) - rank([QBQ^{-1}]_{L_k,L_{k-1}}) > (n-1) - rank(B_{L_n,L_{n-1}})$ , it must be held that $$(k-1) - rank([QBQ^{-1}]_{L_k,L_{k-1}}) > k - rank([QBQ^{-1}]_{L_{k+1},L_k}).$$ That is $rank([QBQ^{-1}]_{L_{k+1},L_k})$ - 1 > $rank([QBQ^{-1}]_{L_k,L_{k-1}})$ . So we can derive $$rank(\left[QBQ^{-1}\right]_{L_{k},L_{k}}) \geq rank(\left[QBQ^{-1}\right]_{L_{k},L_{k-1}}), \text{ and } k-1 \geq rank(\left[QBQ^{-1}\right]_{L_{k},L_{k-1}}).$$ This means in $\left[QBQ^{-1}\right]_{L_k,L_k}$ the column (k-1) is linearly independent of other columns and there must exist a column j, $0 \le j \le k-2$ , which is a linear combination of other columns. Let $\widetilde{Q}$ be the permutation matrix exchanging rows j and k+1. Accordingly, $\widetilde{Q}^{-1}$ must be the permutation matrix exchanging column j and k+1. We can derive $$i - rank(\left[\widetilde{Q}(QBQ^{-1})\widetilde{Q}^{-1}\right]_{L_{i+1},L_{i}}) \leq (n-1) - rank(B_{L_{n},L_{n-1}}) \text{ for } k \leq i \leq n-1, \text{ and}$$ $$(k-1) - rank(\left[\widetilde{Q}(QBQ^{-1})\widetilde{Q}^{-1}\right]_{L_{k},L_{k-1}}) = (k-1) - rank(\left[QBQ^{-1}\right]_{L_{k},L_{k}})$$ $$\leq k - rank(\left[QBQ^{-1}\right]_{L_{k+1},L_{k}})$$ $$\leq (n-1) - rank(B_{L_{n},L_{n-1}}).$$ Since $(\widetilde{Q}Q)$ is also a permutation matrix, the inequality is true for $k-1 \le i \le n-1$ . Hence, by mathematic induction, it can be proved that there exists a permutation matrix Q such that $i - rank([QBQ^{-1}]_{L_{i+1},L_i}) \le (n-1) - rank(B_{L_n,L_{n-1}})$ for $0 \le i \le n-1$ . Therefore, $$i - rank([Q(PAP^{-1})Q^{-1}]_{L_{i+1},L_i}) \le (n-1) - rank([PAP^{-1}]_{L_n,L_{n-1}}) \text{ for } 0 \le i \le n-1.$$ In other words, $$i - rank([(QP)A(QP)^{-1}]_{L_{i+1},L_i}) \le (n-1) - rank(A)$$ for $0 \le i \le n-1$ . Since (QP) is also a permutation matrix, this completes the proof of Theorem 5. **Corollary 2**: For any LCG or LCS, the new communication with the reordering mapping in Theorem 5 has minimum channel contention. **Proof**: Given a LCG or LCS with binary matrix $A_{n \times n}$ , for any reordering mapping Q, $$MAX_{0 \le i \le n-1} \{i - rank(QAQ^{-1}]_{L_{i+1},L_i}\} \ge (n-1)-rank(A).$$ From Theorem 5, there exists a permutation matrix $\widetilde{Q}$ , such that $$i - rank \left( \widetilde{Q} A \widetilde{Q}^{-1} \right)_{L_{i+1}, L_i} \le (n-1) - rank(A) \text{ for } 0 \le i \le n-1.$$ Therefore, by Theorem 5, we can find an optimal reordering mapping such that the new communication has minimum channel contention. Similar to LCP\_Optimizer, an algorithm for LCS/LCG can be easily derived following the method in the proof of Theorem 5. It is omitted here to save the space. ## 5. Processor Mapping for a Set of LCCs From the above section, we can find an optimal reordering mapping for any LCC. However, there are probably more than one communications in a parallel program. For efficiently executing such a parallel program, we have to deal with the problem of per- forming a set of LCCs. An example for a set of LCCs is given below, which shows some frequently used subroutines that may be in a library for image processing, including image rotation, reflection, FFT, etc. **Example 5**. Suppose a $16\times16$ image is distributed on an 8-dimensional hypercube such that the pixel at coordinates (px, py) is on the node $(py\times16+px)$ . Some frequently used subroutines for processing such an image are shown as follows. - To reflect on the diagonal, each pixel (px, py) has to be moved to new position (py, px), i.e., to be sent from node (py×16+ px) to (px×16+ py). The communication on the hypercube is a LCC, y= A<sub>1</sub>x+b<sub>1</sub>, which is the matrix transpose shown in Example Similarly, to reflect on the line py= 15-px, the required communication is y=A<sub>1</sub>x+ [1 1 1 1 1 1 1]<sup>t</sup>; to rotate 90° clockwise, the required communication is y=A<sub>1</sub>x+ [1 1 1 0 0 0 0]<sup>t</sup>; and to rotate 90° counterclockwise, the required communication is y=A<sub>1</sub>x+ [0 0 0 0 1 1 1 1]<sup>t</sup>. - 2. To reflect on a vertical line, each pixel (px, py) has to be moved to new position (15-px, py). The required communication is $y = Ix + [1 \ 1 \ 1 \ 1 \ 0 \ 0 \ 0]^t$ . Similarly, to reflect on a horizontal line, the required communication is $y = Ix + [0 \ 0 \ 0 \ 1 \ 1 \ 1]^t$ ; to rotate 180°, the required communication is $y = Ix + [1 \ 1 \ 1 \ 1 \ 1 \ 1]^t$ . - 3. To scale the lower-left 8×8 sub-image by factor 2 along both of the axes, each pixel (px, py) in the sub-image has to be sent to four new positions (2px, 2py), (2px+1, 2py), (2px, 2py+1), and (2px+1, 2py+1). The required communication is a LCS, - 4. Consider those pixels of the image to be 256 discrete data and perform 1D FFT on these data. In addition to the neighboring communications, a bit-reverse communication is required [8], [12]. The bit-reverse communication, $y=A_3x+b_3$ , is the one shown in Example 1. - 5. To perform 2D FFT on the image, the process is to perform 1D FFT for each row of the image and then perform 1D FFT for each column of the image [12]. For performing 1D FFT on each row of the image, in addition to the neighboring communications, the bit-reverse communication for the rows is required. Similarly, for performing 1D FFT on each column of the image, the bit-reverse communication for the columns is required. They are shown as follows. Since a reordering mapping that is good for some LCCs may be harmful for others, our goal is to find a reordering mapping that is good enough for all the LCCs to be performed in a parallel program. For different applications, the objective functions of optimi- 23 zation may also be different. Let $y = A_r x + b_r$ , $0 \le r \le m-1$ , be the m LCCs to be performed. In some applications, these LCCs are in different subroutines that will be called dynamically. Example 5 shows one of such applications. Since we can not determine at the compile-time which and how many times the subroutines will be called, a reasonable choice is to minimize the maximum channel contention of these subroutines. That is to minimize $\underset{0 \le r \le m-1}{MAX} (T(A_r, b_r))$ . In some other applications, those LCCs may require to be performed simultaneously. Such a situation may happen when messages are scatter-gathered among processors. Sometimes, the interference between different communications may also lead to this situation. Since those LCCs are performed simultaneously, it is appropriate to minimize $MAX\{SUM_{0 \le i \le m-1}(T_i(A_r, b_r))\}$ . There are other possible concerns, such as to minimize mize $SUM_{0 \le i \le n-1} \{SUM_{0 \le r \le m-1} (T_i(A_r, b_r))\}$ . In this section, we propose an algorithm to find an optimal reordering mapping based on the dynamic programming approach. The concept of dynamic programming can be applied to any of the three objective functions in a similar way. Without loss of generality, we focus on the problem of minimizing $\max_{0 \le r \le m-1} (T(A_r, b_r))$ , i.e., finding an optimal reordering mapping O that minimize $$\underset{0 \leq r \leq m-1}{MAX} \left\{ \underset{0 \leq i \leq m-1}{MAX} \left\{ i - rank \left( \left[ QA_r Q^{-1} \right]_{L_{l+1}, L_i} \right) \right\} \right\}.$$ Though this problem can be solved by an exhaustive search on all possible reordering matrices, the large number of all possible reordering matrices, n!, makes such a solution unfeasible for a large hypercube computer. Therefore, we present an algorithm based on the technique of dynamic programming to reduce the search space. Since the matrix Q is a permutation matrix, there is a one-to-one correspondence between reordering mapping and the order of address bits. We shall use R(S) to denote an optimal ordering of the address bits in the set S. In other words, R(S) define an optimal reordering mapping for $[A_r]_{S,S}$ , $0 \le r \le m-1$ . For example, the corresponding order of address bits for the reordering mapping Q in Example 4 is $R(L_8) = (0,4,2,6,1,5,3,7)$ . The following theorem provides the theoretical foundation for applying dynamic programming to reduce the search space. **Theorem 6**: There exists an address bit j in S such that $(R(S - \{j\}), j)$ is an optimal ordering of address bits for $[A_r]_{S,S}$ , $0 \le r \le m-1$ . **Proof**: Let s be the cardinality of S. Suppose that j is the last address bit of R(S). Thus, the maximum number of channel contention at dimension s-1 must be the same for R(S) and $(R(S - \{j\}), j)$ . Since $R(S - \{j\})$ is optimal for $[A_r]_{S - \{j\}, S - \{j\}}$ , $(R(S - \{j\}), j)$ is optimal for $[A_r]_{S,S}$ at dimensions 0 to s-2. Therefore $(R(S - \{j\}), j)$ must be an optimal reordering for $[A_r]_{S,S}$ . $\square$ According to Theorem 6, we can designate R(S) as an optimal ordering chosen from $(R(S - \{i\}), i)$ for all i in S and find $R(L_n)$ by computing R(S) for all subsets $S \subset L_n$ . The computing of R(S), for $S \subset L_n$ , can be performed according to the cardinality of S in increasing order. For any $S \subset L_n$ , the number of searches is equivalent to its cardinality. Therefore, we can compute the total number of searches as follows. $$\sum_{k=1}^{n} k \cdot C_k^n = \sum_{k=1}^{n} n \cdot C_{k-1}^{n-1} = n \cdot 2^{n-1}.$$ It can be observed the search space is reduced from n! to $n \cdot 2^{n-1}$ . Fig. 4 gives the concept for computing $R(L_4)$ . Since the running time for computing each case in the search space is $O(m \times n^3)$ , the running time for computing $R(L_n)$ is $O(m \times n^4 \times 2^n)$ . Although the running time Fig. 4. Search steps for $R(L_4)$ . is not polynomial in terms of the dimension n, it is polynomial in terms of the number of processors. Hence, the algorithm is feasible even for a 16-dimensional hypercube. **Example 6.** Consider the two LCCs $A_1$ and $A_8$ in Example 5. We know that $A_1$ is for the matrix-transpose operation and $A_8$ is for the bit-reverse operation on an 8-dimensional hypercube. From the proposed dynamic programming approach, we can find that $R(L_8) = (3,4,0,7,2,5,1,6)$ . Hence, we can compute and the two new communications are After the processor mapping, the degrees of channel contention now become 2 for matrix transpose and 1 for bit-reverse instead of 8 in the original communications. # 6. Performance Study To investigate the performance improvement of the proposed approach, some experiments were carried out by simulating the network behavior of an 8-dimensional hypercube. In Subsection 6.1, we compare the *e*-cube routing and some partially adaptive routing strategies with our approach. These partially adaptive routing strategies include P-cube routing proposed by Glass and Ni [11]; the routing strategy proposed by Chiu [7]; minimal (Min) routing proposed by Li [14]; and MIXab3 and MIXbb3 routing proposed by Chen and Yihng[5]. This simulation is based on those performed in [5] and [11]. In Subsection 6.2, a practical application, the parallel FFT program, is performed on our simulator to show the benefit of the proposed approach. The network architecture assumed in the simulation is described as follows. There are 256 nodes connected as an 8-dimensional hypercube. Each node consists of a processor, local memory, a router, and other supporting devices. Between two neighboring routers, there are two uni-directional channels, one for each direction. A router can communicate with its local processor through pairs of ports. A separate buffer with a slot for one flit is associated with each channel. When more than one input channels contain header flits waiting for the same available output channel, the arbitration policy is in favor of the header flit that arrived at the router first. If a header flit in an input channel has more than one available output channels allowed by the routing strategy, the channel with the lowest dimension is selected. #### 6.1 Simulation for three communications Network performance is significantly affected by the communications, which are application-dependent. In the following discussion, we consider three communications: matrix-transpose, bit-reverse, and reverse-flip. They are chosen not only because they are frequently used in many scientific and engineering applications but also because they are used as tested cases in many routing algorithms, so that we can compare our simulation with their results. For matrix-transpose, every node $x = [x_0 \ x_1 \ x_2 \ x_3 \ x_4 \ x_5 \ x_6 \ x_7]^t$ sends messages to node $y = [x_4 \ x_5 \ x_6 \ x_7 \ x_0 \ x_1 \ x_2 \ x_3]^t$ . For bit-reverse, node $x = [x_0 \ x_1 \ x_2 \ x_3 \ x_4 \ x_5 \ x_6 \ x_7]^t$ sends messages to node $y = [x_7 \ x_6 \ x_5 \ x_4 \ x_3 \ x_2 \ x_1 \ x_0]^t$ . Reverse-flip behaves like bit-reverse except the address bits of destination node were complemented, i.e., node $x = [x_0 \ x_1 \ x_2 \ x_3 \ x_4 \ x_5 \ x_6 \ x_7]^t$ sends messages to node $y = [x_7 \ x_6 \ x_5 \ x_4 \ x_3 \ x_2 \ x_1 \ x_0]^t$ . All these three communications are LCPs. We may find reordering mapping for them and see how the performance can be improved. In the simulation, processors generate messages at time intervals given by a negative exponential distribution random variable. Each message is assumed to have 20 flits, including the header (flits). A flit requires a cycle to be transmitted through a channel. The measures of interest in this subsection are average message latency and average sustainable network throughput. The message latency is the number of cycles spent by a message in traveling from its source processor to its destination, taking the queuing delay into account. The average network throughput indicates the average number of flits delivered per cycle per processor. It is sustainable if the number of messages queued at their source processors is small and bounded. For a given system, the average message latency, in general, grows as the throughput increases. At low throughput, the network latency is contributed mainly by the message length and the distance to travel because there is little queuing delay involved. As the throughput increases, more channel contention and longer queuing delay happened, give rise to a higher message latency. One system exhibits better communication performance than another if it has a lower message latency for any given throughput. Fig. 5 to Fig. 7 show the simulation results of matrix-transpose, bit-reverse, and reverse-flip, respectively. In these figures, M-S denotes the simulation result using an optimal reordering mapping for the specific communication, which is proved to be contention-free in Section 4; and M-G uses the reordering mapping chosen for the three communications by using the dynamic programming algorithm proposed in Section 5. After the processor mapping M-G, the degree of channel contention of matrix-transpose is 2, and it is contention free for bit-reverse and reverse-flip. From Fig. 5 to Fig. 7, it can be observed that the routing strategies MIXab3, MIXbb3, Chiu and Min indeed improve the performance over the *e*-cube routing since the difference of the inter-router setup delay and flow control cycle time are not considered in this experiment [6]. The maximum sustainable network throughput of these routing strategies is about 30% to 100% higher than that of the *e*-cube routing. Their message latencies are also lower for any given sustainable throughput. The P-cube routing performs quite different for these three communications. It performs well for reverse-flip but worst for bit-reverse. From these results, it can be observed that it is very difficult for a routing algorithm to perform well for all communication patterns. It can be also observed that, for any of these three communications, the network throughput of the *e*-cube routing is always less than 0.125, which is 1/8 of the network capacity. As we have pointed out in Section 2, this is due to the degree of channel contention being 8 for the *e*-cube routing. In other words, the maximum throughput that can be achieved is approximately inversely proportional to the degree of channel contention. After applying M-S and M-G, there is no channel contention for performing these three communications except performing matrix-transpose after applying M-G. The degree of channel contention is 2 for matrix-transpose after applying M-G. Theoretically their throughput can approach 1 and 0.5, respectively. The actual value is somewhat smaller because of the queuing delay between messages generated by the same processor. As shown in Fig. 5, its value is about 0.3 instead of 0.5 for M-G. Note also that, for any given sustainable network throughput, the message latencies of M-G and M-S are far lower than that of traditional approaches. With these results, it is obvious that our approach can greatly reduce the network latency and significantly improve the throughput for LCC. Furthermore, no extra hardware supports and sophisticated routing strategies are needed. Only the *e*-cube wormhole routing is assumed in the proposed approach. Therefore, it is of practical use. Fig. 5. Simulation result of matrix-transpose on an 8-dimensional hypercube. Fig. 6. Simulation result of bit-reverse on an 8-dimensional hypercube. Fig. 7. Simulation result of reverse-flip on an 8-dimensional hypercube. #### 6.2 Simulation for FFT The Fast Fourier Transform (FFT) is one of the most commonly used algorithms in digital signal processing and is widely used in applications such as image processing and spectral analysis. The purpose of this subsection is to investigate the benefit of the proposed approach for such a practical application. The Discrete Fourier Transform (DFT) of an m-point discrete signal x(i) is defined by $X(k) = \sum_{i=0}^{m-1} x(i)W_m^{ik}$ , $0 \le k < m$ , where $W_m = e^{j2\pi i m}$ , and $j = \sqrt{-1}$ . Direct DFT computation requires $O(m^2)$ arithmetic operations. A faster method of computing the DFT is the FFT algorithm, which requires only $O(m \mid g \mid m)$ arithmetic operations. A more detailed analysis of FFT can be found in [8]. Fig. 8 shows an example of the flow chart of the FFT algorithm for 16 points. The FFT algorithm begins with a bit-reverse permutation of the inputs, followed by $\log m$ stages, each stage consisting of m/2 butterfly operations. A FFT input x can be identified by a binary vector $[x_0 \mid x_1 \mid x_2 \mid \dots \mid x_{\log(m-1)}]^t$ . In the ith computational stage, the two inputs of a butterfly operation are $[x_0 \mid x_1 \mid \dots \mid x_{\log(m-1)}]^t$ and $[x_0 \mid x_1 \mid \dots \mid x_{\log(m-1)}]^t$ . Fig. 8. Flow chart of a 16-point FFT. We can exploit some properties of the FFT algorithm to produce an efficient parallel algorithm. The Parallel\_FFT algorithm is described in the following paragraph. Let the number of data points $m=2^{n+2d}$ , where n is dimensions of the hypercube network, and d is a positive integer. Each input $x = [x_0 \ x_1 \ \dots \ x_d \ x_{d+1} \dots \ x_{n+d-1} \ x_{n+d} \dots \ x_{n+2d-1}]^t$ is assigned to processor $[x_d x_{d+1} \dots x_{n+d-1}]^t$ . Hence, the *m* inputs are distributed on the $2^n$ processors with block-cyclic distribution. There are 2<sup>d</sup> inputs in each block, and 2<sup>d</sup> blocks in each processor. The Parallel\_FFT algorithm follows the $(1 + \lg m)$ stages in the FFT algorithm. In the first stage, a bit-reverse communication between processors is required for completing a bit-reverse permutation of the inputs. In the following n+2d computational stages, the first and the last d stages do not require communication operations, and a neighboring communication is needed for each of the other n stages. The reasons are explained as follows. In the *i*th computational stage, $1 \le i \le d$ , the two inputs of a butterfly operation, $[x_0 \ x_1 \dots x_{i-1} \dots x_d \dots x_{n+2d-1}]^t$ and $[x_0 \ x_1 \dots \overline{x}_{i-1} \dots x_d \dots x_{n+2d-1}]^t$ , are in the same processor $[x_d \ x_{d+1} \dots \ x_{n+d-1}]^t$ . Hence, the data required for computation are all in local memory and no communication is needed. Similarly, no communication is needed in the ith computational stage, $n+d+1 \le i \le n+2d$ . In the computational stage i, $d+1 \le i \le n+d$ , the two inputs of a butterfly operation, $[x_0 \ x_1 \ \dots \ x_d \ \dots \ x_{i-1} \dots \ x_{n+d-1} \ \dots \ x_{n+2d-1}]^t$ and $[x_0 \ x_1 \ \dots \ x_d x_d \ \dots \ x_d \ x_d \ \dots$ $\overline{x}_{i-1} \dots x_{n+d-1} \dots x_{n+2d-1}$ are in processors $[x_d \dots x_{i-1} \dots x_{n+d-1}]^t$ and $[x_d \dots \overline{x}_{i-1} \dots x_{n+d-1}]^t$ , respectively. Therefore, a neighboring communication at dimension i-(d+1) is performed. After these (1 + lg(m)) stages, the FFT outputs can be obtained, and are also distributed on the 2<sup>n</sup> processors with block-cyclic distribution as inputs. The Parallel\_FFT algorithm requires O( $(m/2^n)lg(m)$ ) arithmetic operations, and (n+1) communication operations for each processor. By processor reordering mapping, the channel contention of the bit-reverse communication in the first stage of the Parallel\_FFT algorithm can be obviated, and those neighboring communications will stay unchanged. To see the performance improvement of processor mapping for the Parallel\_FFT algorithm, we simulated the algorithm on an 8-dimensional hypercube. Each input of FFT is a complex number which consists of two double precision floating-point numbers, one for the real part and the other for the imaginary part. The characteristics of the hypercube computer are based on nCUBE-2. The software latency is about $164 \mu s$ for a message. The time required for transmitting one byte through a channel is about $0.57\mu s$ . A butterfly operation requires about $5.12 \mu s$ provided that the value of $W^t$ can be found in a pre-computed table. If only half of a butterfly operation is performed in a processor, about $4.47 \mu s$ is required. The simulation results are shown in Table 1. The computation time and neighboring communication time are the same for all the routing strategies and are not changed after processor reordering mapping. Bit-reverse communication time is the most important part in this comparison. We can observe that the performances of those partially adaptive routing strategies are not so good as expected, and even worse than *e*-cube routing. This situation may be caused by the channel contention between the neighboring communica- | FFT | Comp | NB | Bit-Reverse | | | | | | | |------|--------|--------|--------------------|--------|--------|--------|--------|--------|-------| | Size | Time | Comm | Communication Time | | | | | | | | | | Time | e-cube | chiu | min | mixab3 | mixbb3 | pcube | map | | 28 | 35.8 | 1398.6 | 248.9 | 276.9 | 248.4 | 258.1 | 258.6 | 299.7 | 178.8 | | 210 | 163.5 | 1617.5 | 467.8 | 577.8 | 467.2 | 504.3 | 504.9 | 693.5 | 206.2 | | 212 | 736.0 | 2493.0 | 1343.3 | 1781.7 | 1342.8 | 1489.3 | 1489.8 | 2225.7 | 315.6 | | 214 | 3271.7 | 5995.1 | 4845.4 | 6597.0 | 4844.8 | 5429.1 | 5429.7 | 8354.3 | 753.4 | Table 1. Simulation result for the Parallel\_FFT algorithm. | FFT | Computation | Neighboring | Bit-Reverse | | | Execution | | | |------|-------------|---------------|--------------------|---------|---------|-----------|---------|---------| | Size | Time | Communication | Communication Time | | | Time | | | | | | Time | Original | Mapping | Speedup | Original | Mapping | Speedup | | 28 | 35.8 | 1398.6 | 248.9 | 178.8 | 1.39 | 1683.3 | 1613.2 | 1.04 | | 210 | 163.5 | 1617.5 | 467.8 | 206.2 | 2.27 | 2248.9 | 1987.2 | 1.13 | | 212 | 736.0 | 2493.0 | 1343.3 | 315.6 | 4.26 | 4572.4 | 3544.7 | 1.29 | | 214 | 3271.7 | 5995.1 | 4845.4 | 753.4 | 6.43 | 14112.2 | 10020.2 | 1.41 | Table 2. Speedup after optimal reordering mapping. tions and the bit-reverse communication since there is no barrier synchronization in the program. The benefit of processor reordering mapping can be easily observed here because the bit-reverse communication time is greatly reduced. Table 2 shows the speedup after applied the processor reordering mapping. The ideal speedup of the bit-reverse communication time is $2^{(n/2-1)}$ for an n-dimensional hypercube because the degree of contention is decreased from $2^{(n/2-1)}$ to 1. However, the speedup in the simulation result is smaller than the ideal speedup due to the effect of the software latency. When the number of data points m is small, the software latency dominates the communication time. Hence, the performance improvement is not evident. As m increases, the message size also increases, and the effect of channel contention becomes more critical for the communication time. Therefore, the performance improvement of processor mapping becomes more significant as m increases. From Table 2, it can be observed that the speedup of the bit-reverse communication time reaches 6.43 for m= $2^{14}$ . The overall execution time is about 41% faster after applying processor mapping. # 7. Performing LCC on SCI Origin 2000 In this section, we shall introduce how the processor mapping approach can be applied for efficiently performing LCCs on SCI Origin 2000. Fig. 9. An Origin system with 32 nodes. ### **7.1 SGI Origin 2000** The SGI Origin 2000 is a *cache-coherent non-uniform memory access (ccNUMA)* multiprocessor designed and manufactured by Silicon Graphics, Inc. The Origin system consists of up to 512 nodes interconnected by a scalable *Craylink network*. Each node consists of one or two R10000 processors and up to 4 GB of coherent memory. For systems up to 32 nodes, the routers connect nodes in a bristled hypercubes. The network topology is bristled in that two nodes are connected to a single router instead of one. Fig. 9 shows an example of the Origin system. Low latency wormhole routing and software programmable routing table are important features of the router. Therefore, the proposed approached for performing LCCs on hypercubes can possibly be applied to the Origin system. The only problem to be solved is that there are two nodes connected to a single router. In the following subsections, we shall show the method to modify the proposed approach for performing LCCs on the Origin system. ### 7.2 LCC on an Origin 2000 system An **n**-dimensional LCC, y = Ax + b, on an Origin 2000 can be specified as the following formula. $$\begin{bmatrix} \mathbf{y}_{0} \\ \mathbf{y}_{1} \\ \vdots \\ \mathbf{y}_{n-1} \end{bmatrix} = \begin{bmatrix} \mathbf{a}_{0,0} & \mathbf{a}_{0,1} & \cdots & \mathbf{a}_{0,n-1} \\ \mathbf{a}_{1,0} & \mathbf{a}_{1,1} & \cdots & \mathbf{a}_{1,n-1} \\ \cdots & \cdots & \cdots & \cdots \\ \mathbf{a}_{n-1,0} & \mathbf{a}_{n-1,1} & \cdots & \mathbf{a}_{n-1,n-1} \end{bmatrix} \begin{bmatrix} \mathbf{x}_{0} \\ \mathbf{x}_{1} \\ \vdots \\ \mathbf{x}_{n-1} \end{bmatrix} + \begin{bmatrix} \mathbf{b}_{0} \\ \mathbf{b}_{1} \\ \vdots \\ \mathbf{b}_{n-1} \end{bmatrix}$$ (i) Suppose the source nodes in the LCC are separated into two sets by the address bit at dimension 0, we can have the following two formulas. $$\begin{bmatrix} \mathbf{y}_0 \\ \mathbf{y}_1 \\ \vdots \\ \mathbf{y}_{n-1} \end{bmatrix} = \begin{bmatrix} \mathbf{a}_{0,0} & \mathbf{a}_{0,1} & \cdots & \mathbf{a}_{0,n-1} \\ \mathbf{a}_{1,0} & \mathbf{a}_{1,1} & \cdots & \mathbf{a}_{1,n-1} \\ \cdots & \cdots & \cdots & \cdots \\ \mathbf{a}_{n-1,0} & \mathbf{a}_{n-1,1} & \cdots & \mathbf{a}_{n-1,n-1} \end{bmatrix} \begin{bmatrix} 0 \\ \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_{n-1} \end{bmatrix} + \begin{bmatrix} \mathbf{b}_0 \\ \mathbf{b}_1 \\ \vdots \\ \mathbf{b}_{n-1} \end{bmatrix}, \text{ and}$$ $$\begin{bmatrix} \boldsymbol{y}_0 \\ \boldsymbol{y}_1 \\ \vdots \\ \boldsymbol{y}_{n-1} \end{bmatrix} = \begin{bmatrix} \boldsymbol{a}_{0,0} & \boldsymbol{a}_{0,1} & \cdots & \boldsymbol{a}_{0,n-1} \\ \boldsymbol{a}_{1,0} & \boldsymbol{a}_{1,1} & \cdots & \boldsymbol{a}_{1,n-1} \\ \cdots & \cdots & \cdots & \cdots \\ \boldsymbol{a}_{n-1,0} & \boldsymbol{a}_{n-1,1} & \cdots & \boldsymbol{a}_{n-1,n-1} \end{bmatrix} \begin{bmatrix} 1 \\ \boldsymbol{x}_1 \\ \vdots \\ \boldsymbol{b}_{n-1} \end{bmatrix} + \begin{bmatrix} \boldsymbol{b}_0 \\ \boldsymbol{b}_1 \\ \vdots \\ \boldsymbol{b}_{n-1} \end{bmatrix}$$ They can be rewritten to be $$\begin{cases} \mathbf{y}_{0} = \mathbf{a}_{0,1} \mathbf{x}_{1} + \mathbf{a}_{0,2} \mathbf{x}_{2} + \dots + \mathbf{a}_{0,\mathbf{n}-1} \mathbf{x}_{\mathbf{n}-1} + \mathbf{b}_{0} \\ \mathbf{y}_{1} \\ \mathbf{y}_{2} \\ \vdots \\ \mathbf{y}_{\mathbf{n}-1} \end{bmatrix} = \begin{bmatrix} \mathbf{a}_{1,0} & \mathbf{a}_{1,1} & \dots & \mathbf{a}_{1,\mathbf{n}-1} \\ \mathbf{a}_{2,0} & \mathbf{a}_{2,1} & \dots & \mathbf{a}_{2,\mathbf{n}-1} \\ \vdots \\ \mathbf{a}_{\mathbf{n}-1,0} & \mathbf{a}_{\mathbf{n}-1,1} & \dots & \mathbf{a}_{\mathbf{n}-1,\mathbf{n}-1} \end{bmatrix} \begin{bmatrix} \mathbf{x}_{1} \\ \mathbf{x}_{2} \\ \vdots \\ \mathbf{x}_{\mathbf{n}-1} \end{bmatrix} + \begin{bmatrix} \mathbf{b}_{1} \\ \mathbf{b}_{2} \\ \vdots \\ \mathbf{b}_{\mathbf{n}-1} \end{bmatrix}, \text{ and}$$ $$\begin{bmatrix} \mathbf{y}_{0} = \mathbf{a}_{0,0} + \mathbf{a}_{0,1} \mathbf{x}_{1} + \mathbf{a}_{0,2} \mathbf{x}_{2} + \dots + \mathbf{a}_{0,n-1} \mathbf{x}_{n-1} + \mathbf{b}_{0} \\ \mathbf{y}_{1} \\ \mathbf{y}_{2} \\ \vdots \\ \mathbf{y}_{n-1} \end{bmatrix} = \begin{bmatrix} \mathbf{a}_{1,0} & \mathbf{a}_{1,1} & \dots & \mathbf{a}_{1,n-1} \\ \mathbf{a}_{2,0} & \mathbf{a}_{2,1} & \dots & \mathbf{a}_{2,n-1} \\ \dots & \dots & \dots & \dots \\ \mathbf{a}_{n-1,0} & \mathbf{a}_{n-1,1} & \dots & \mathbf{a}_{n-1,n-1} \end{bmatrix} \begin{bmatrix} \mathbf{x}_{1} \\ \mathbf{x}_{2} \\ \vdots \\ \mathbf{x}_{n-1} \end{bmatrix} + \begin{bmatrix} \mathbf{b}_{1} \\ \mathbf{a}_{2,0} \\ \vdots \\ \mathbf{a}_{n-1,0} \end{bmatrix} + \begin{bmatrix} \mathbf{b}_{1} \\ \mathbf{b}_{2} \\ \vdots \\ \mathbf{b}_{n-1} \end{bmatrix}$$ In an Origin system, two nodes whose addresses only differ at dimension 0 are directly connected through the router. For the reason, the hypercube topology of Origin 2000 is of (**n**-1) dimensions, from dimensions 1 to (**n**-1). Therefore, an **n**-dimensional LCC on an Origin system can be viewed as two LCCs on the (**n**-1)-dimensional hypercube as shown in the following two formulas. $$\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{n-1} \end{bmatrix} = \begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,n-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,n-1} \\ \vdots \\ a_{n-1,0} & a_{n-1,1} & \cdots & a_{n-1,n-1} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{bmatrix}$$ (ii) $$\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{n-1} \end{bmatrix} = \begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,n-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,n-1} \\ \vdots & \vdots & \vdots \\ a_{n-1,0} & a_{n-1,1} & \cdots & a_{n-1,n-1} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{bmatrix} + \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_{n-1} \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{bmatrix}$$ (iii) In the above two formulas, $n = \mathbf{n} - 1$ , $x_i = \mathbf{x}_{(i+1)}$ , $y_i = \mathbf{y}_{(i+1)}$ , $\alpha_i = \mathbf{a}_{(i+1), 0}$ , $b_i = \mathbf{b}_{(i+1)}$ , and $a_{i,j} = \mathbf{a}_{(i+1), (j+1)}$ for $0 \le i$ , $j \le n - 1$ . For convenience, given a LCC, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , as specified in formula (i), we shall use $\mathbf{LCC}_0(\mathbf{A}, \mathbf{b})$ and $\mathbf{LCC}_1(\mathbf{A}, \mathbf{b})$ to denote the two sub-LCCs as specified in formulas (ii) and (iii). **Lemma 1:** For any LCC, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin 2000, $T_{i+1}(\mathbf{A}_{\mathbf{n} \times \mathbf{n}}, \mathbf{b}) \le T_i(A, b) + T_i(A, a+b)$ for $0 \le i \le n-1$ , where $n = \mathbf{n} - 1$ , $A = \mathbf{A}_{L_{\mathbf{n}} - \{0\}, L_{\mathbf{n}} - \{0\}}$ , $\alpha = \mathbf{A}_{L_{\mathbf{n}} - \{0\}, \{0\}}$ , and $b = \mathbf{b}_{L_{\mathbf{n}} - \{0\}}$ . **Proof:** Consider **LCC**<sub>0</sub>( $\boldsymbol{A}$ , $\boldsymbol{b}$ ) on an n-dimensional hypercube, if a channel l at dimension i is used, then, from Theorem 1, the number of paths that contend for it is $T_i(A, b)$ . Similarly, consider **LCC**<sub>1</sub>( $\boldsymbol{A}$ , $\boldsymbol{b}$ ) on the n-dimensional hypercube, if l is used, the number of paths that contend for it is $T_i(A, \alpha+b)$ . Since l may be used by both of the two LCCs, by one of them, or by none of them, it can be easily derived that $T_{i+1}(\boldsymbol{A}, \boldsymbol{b}) \leq T_i(A, b) + T_i(A, \alpha+b)$ . From the above lemma and Theorem 4, we can obtain the following corollary, which gives an upper bound of the degree of channel contention by using reordering mapping on **n**-dimensional LCP. For **n**-dimensional LCS and LCG, such an upper bound is given in Corollary 4. Corollary 3: For any **n**-dimensional LCP, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin 2000, there exists a permutation matrix $\mathbf{Q}_{\mathbf{n} \times \mathbf{n}}$ such that $T_i(\mathbf{Q}\mathbf{A}\mathbf{Q}^{-1}, \mathbf{Q}\mathbf{b}) \le 2$ for $1 \le i \le \mathbf{n} - 1$ . **Lemma 2:** For any binary matrix $\mathbf{A}_{\mathbf{n} \times \mathbf{n}}$ , if $rank(\mathbf{A}) < \mathbf{n}$ , then there exists a permutation matrix $\mathbf{Q}_{\mathbf{n} \times \mathbf{n}}$ such that $rank([\mathbf{Q}\mathbf{A}\mathbf{Q}^{-1}]_{L_{\mathbf{n}}=\{0\},L_{\mathbf{n}}}) = rank(\mathbf{A})$ . **Proof:** If $rank(\mathbf{A}) < \mathbf{n}$ , then there exists a row j which is a linear combination of the other rows. Let $\mathbf{Q}_{n \times n}$ be the permutation matrix exchanging rows 0 and j. Accordingly, $\mathbf{Q}^{-1}$ must be the permutation matrix exchanging columns 0 and j. Hence, in $[\mathbf{Q}\mathbf{A}\mathbf{Q}^{-1}]$ , row 0 is a linear combination of the other rows. Therefore, $rank([\mathbf{Q}\mathbf{A}\mathbf{Q}^{-1}]_{L_n=\{0\},L_n\}}) = rank(\mathbf{A})$ . Corollary 4: For any **n**-dimensional LCS/LCG, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin 2000, there exists a permutation matrix $\mathbf{Q}_{\mathbf{n} \times \mathbf{n}}$ such that $T_i(\mathbf{Q}\mathbf{A}\mathbf{Q}^{-1}, \mathbf{Q}\mathbf{b}) \le 2^{\mathbf{n} - rank(\mathbf{A})}$ for $1 \le i \le \mathbf{n} - 1$ . **Proof:** From Lemma 2, there exists a $n \times n$ permutation matrix $\mathbf{Q}_1$ such that $rank(\left[\mathbf{Q}_1\mathbf{A}\mathbf{Q}_1^{-1}\right]_{L_n-\{0\},L_n}) = rank(\mathbf{A}).$ Let $\mathbf{B} = \mathbf{Q}_1 \mathbf{A} \mathbf{Q}_1^{-1}$ , $n = \mathbf{n} - 1$ , $B = \mathbf{B}_{L_{\mathbf{n}} - \{0\}, L_{\mathbf{n}} - \{0\}}$ , $\alpha = \mathbf{B}_{L_{\mathbf{n}} - \{0\}, \{0\}}$ , and $b = [\mathbf{Q}_1 \mathbf{b}]_{L_{\mathbf{n}} - \{0\}}$ . From Theorem 5, there exists a matrix $Q_{n \times n}$ such that $T_{i-1}(QBQ^{-1}, Qb) \leq 2^{(n-1)-rank(B)}$ and $$T_{i-1}(QBQ^{-1}, Q(\alpha+b)) \le 2^{(n-1)-rank(B)}$$ for $1 \le i \le n-1$ . Let $\mathbf{Q}_2 = \begin{bmatrix} 1 & 0 \\ 0 & Q \end{bmatrix}$ and $\mathbf{Q}_{n \times n} = \mathbf{Q}_2 \mathbf{Q}_1$ . Therefore, from Lemma 1, $T_i(\mathbf{QAQ^{-1}}, \mathbf{Qb}) \leq 2^{n-rank(B)}$ for $1 \leq i \leq n-1$ . Since $n-rank(B) = (n+1)-(1+rank(B)) \leq (n+1)-rank(\mathbf{B}) = n-rank(\mathbf{A})$ , it can be derived that $T_i(\mathbf{QAQ^{-1}}, \mathbf{Qb}) \leq 2^{n-rank(\mathbf{A})}$ The above lemmas and corollaries show that the proposed processor reorder mapping can be used for reducing channel contention when performing a LCC, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin system. It can be observed that, for each of $\mathbf{LCC}_0(\mathbf{A}, \mathbf{b})$ and $\mathbf{LCC}_1(\mathbf{A}, \mathbf{b})$ , the proposed method can minimize the channel contention. However, since the two sub-LCCs are performed simultaneously on an Origin system, a channel may be used by both of them, and the degree of channel contention may be doubled. In order to find an optimal processor mapping for a LCC, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin system, some lemmas and theorems are given in the following discussions. **Theorem 7:** Consider an **n**-dimensional LCC, y = Ax + b, on an Origin 2000. Let B = $$\boldsymbol{A}_{L_{\boldsymbol{n}}-\{0\},L_{\boldsymbol{n}}}, \text{ for } 1 \leq i \leq \boldsymbol{n}-1, \ T_{i}(\boldsymbol{A}, \boldsymbol{b}) = \begin{cases} 0 & \text{, when } \boldsymbol{x}_{i} = \boldsymbol{y}_{i} \\ 2^{i-rank}(B_{L_{i},L_{i}}) & \text{, otherwise.} \end{cases}$$ **Proof:** Let n=n-1, $A=A_{L_n-\{0\},L_n-\{0\}}$ , $\alpha=A_{L_n-\{0\},\{0\}}$ , and $b=b_{L_n-\{0\}}$ . The proof includes three parts, which are specified as follows. For $0 \le i \le n-1$ , 1. $$T_{i+1}(\mathbf{A}, \mathbf{b}) = 0$$ , If $A_{(i)} = I_{(i)}$ and $b_i = \alpha_i = 0$ . 2. $$T_{i+1}(\mathbf{A}, \mathbf{b}) = MAX(T_i(A, b), T_i(A, \alpha+b)) = 2^{i-rank(A_{L_{i+1},L_i})}, \text{ if } rank(B_{L_{i+1},L_{i+1}}) > 0$$ $rank(A_{L_{i+1},L_i}).$ 3. $T_{i+1}(\mathbf{A}, \mathbf{b}) = T_i(A, b) + T_i(A, \alpha + b) = 2^{i-rank(A_{I_{i+1},I_i})+1}$ , otherwise. After the above three items are proved to be true, the theorem can be directly derived. The detail of the proof is given below. - 1. If $A_{(i)} = I_{(i)}$ and $b_i = \alpha_i = 0$ , then $T_i(A, b) = T_i(A, \alpha + b) = 0$ . Hence, $T_{i+1}(A, b) = 0$ . - 2. For any channel $l = ([z_0 \ z_1 \ ... \ z_i \ ... \ z_{n-1}]^t, \ [z_0 \ z_1 \ ... \ \overline{z}_i \ ... \ z_{n-1}]^t)$ at dimension i, suppose that l is in the path of some source-destination pair in $\mathbf{LCC}_0(\mathbf{A}, \mathbf{b})$ , then we can have $$\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_i \end{bmatrix} = \begin{bmatrix} z_0 \\ z_1 \\ \vdots \\ \bar{z}_i \end{bmatrix} = \begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,i-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,i-1} \\ \vdots & \vdots & \vdots \\ a_{i,0} & a_{i,1} & \cdots & a_{i,i-1} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{i-1} \end{bmatrix} + \begin{bmatrix} a_{0,i} & a_{0,i+1} & \cdots & a_{0,n-1} \\ a_{1,i} & a_{1,i+1} & \cdots & a_{1,n-1} \\ \vdots & \vdots & \vdots \\ z_{n-1} \end{bmatrix} \begin{bmatrix} z_i \\ b_1 \\ \vdots \\ b_i \end{bmatrix}$$ (ii') Similarly, suppose l is in the path of some source-destination pair in $LCC_1(\boldsymbol{A}, \boldsymbol{b})$ , then we can have $$\begin{bmatrix} y_{0} \\ y_{1} \\ \vdots \\ y_{i} \end{bmatrix} = \begin{bmatrix} z_{0} \\ z_{1} \\ \vdots \\ \overline{z}_{i} \end{bmatrix} = \begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,i-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,i-1} \\ \cdots & \cdots & \cdots & \cdots \\ a_{i,0} & a_{i,1} & \cdots & a_{i,i-1} \end{bmatrix} \begin{bmatrix} x_{0} \\ x_{1} \\ \vdots \\ x_{i-1} \end{bmatrix} + \begin{bmatrix} a_{0,i} & a_{0,i+1} & \cdots & a_{0,n-1} \\ a_{1,i} & a_{1,i+1} & \cdots & a_{1,n-1} \\ \cdots & \cdots & \cdots & \cdots \\ a_{i,n} & a_{i,i+1} & \cdots & a_{i,n-1} \end{bmatrix} \begin{bmatrix} z_{i} \\ z_{i+1} \\ \vdots \\ z_{n-1} \end{bmatrix} + \begin{bmatrix} \alpha_{0} \\ \alpha_{1} \\ \vdots \\ \alpha_{i} \end{bmatrix} + \begin{bmatrix} b_{0} \\ b_{1} \\ \vdots \\ b_{i} \end{bmatrix}$$ (iii') Let p be a solution of formula (ii'), and q be a solution of formula (iii'), then we can have $$\begin{bmatrix} z_0 \\ z_1 \\ \vdots \\ \overline{z}_i \end{bmatrix} = \begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,i-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,i-1} \\ \vdots & \vdots & \vdots \\ a_{i,0} & a_{i,1} & \cdots & a_{i,i-1} \end{bmatrix} \begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_{i-1} \end{bmatrix} + \begin{bmatrix} a_{0,i} & a_{0,i+1} & \cdots & a_{0,n-1} \\ a_{1,i} & a_{1,i+1} & \cdots & a_{1,n-1} \\ \vdots & \vdots & \vdots \\ a_{i,i} & a_{i,i+1} & \cdots & a_{i,n-1} \end{bmatrix} \begin{bmatrix} z_i \\ z_{i+1} \\ \vdots \\ z_{n-1} \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_i \end{bmatrix}$$ (ii") Hence, subtracting (iii") from (ii"), we have $$0 = \begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,i-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,i-1} \\ \cdots & \cdots & \cdots & \cdots \\ a_{i,0} & a_{i,1} & \cdots & a_{i,i-1} \end{bmatrix} \begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_{i-1} \end{bmatrix} - \begin{bmatrix} q_0 \\ q_1 \\ \vdots \\ q_{i-1} \end{bmatrix} - \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_i \end{bmatrix}.$$ If $[\alpha_0 \ \alpha_1 \ \dots \ \alpha_i]^t$ is not a linear combination of the columns in $A_{L_{i+1},L_i}$ , then p and q do not exist. Hence, no l exists such that it is used in both $\mathbf{LCC}_0(\mathbf{A}, \mathbf{b})$ and $\mathbf{LCC}_1(\mathbf{A}, \mathbf{b})$ . Therefore, if $rank(B_{L_{i+1},L_{i+1}}) > rank(A_{L_{i+1},L_i})$ , then $T_{i+1}(\mathbf{A}, \mathbf{b}) = MAX(T_i(A, b), T_i(A, \alpha+b)) = 2^{i-rank(A_{L_{i+1},L_i})}$ . 3. If $rank(B_{L_{i+1},L_{i+1}}) = rank(A_{L_{i+1},L_i})$ , then $[\alpha_0 \ \alpha_1 \ \dots \ \alpha_i]^t$ is a linear combination of the columns in $A_{L_{i+1},L_i}$ . Therefore, these is at least one solution for r in the following formula $$0 = \begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,i-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,i-1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{i,0} & a_{i,1} & \cdots & a_{i,i-1} \end{bmatrix} \begin{bmatrix} r_0 \\ r_1 \\ \vdots \\ r_{i-1} \end{bmatrix} - \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_i \end{bmatrix}$$ (iv) Let r' be a solution of formula (iv), and p is a solution of formula (iii') for some channel z. We can prove that q=p-r' is also a solution of formula (iii'') for the same channel z. Therefore, we can derive $T_{i+1}(\mathbf{A}, \mathbf{b}) = T_i(A, b) + T_i(A, \alpha+b) = 2^{i-rank(A_{L_{i+1},L_i})+1}$ . $\square$ From the above theorem, we can find that the degree of channel contention is determined only by the ranks of sub-matrices of the binary matrix **A**. For the similar reason shown in Section 4, we may assume $T_i(\mathbf{A}, \mathbf{b}) > 0$ for $1 \le i \le \mathbf{n} - 1$ in the following discussion without loss of generality. The following example shows the degree of channel contention of reverse-flip communication on an Origin system according to Theorem 7. **Example** 7. Consider the reverse-flip communication on an Origin system with 32 nodes, $$\begin{bmatrix} \mathbf{y}_{0} \\ \mathbf{y}_{1} \\ \mathbf{y}_{2} \\ \mathbf{y}_{3} \\ \mathbf{y}_{4} \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \mathbf{x}_{0} \\ \mathbf{x}_{1} \\ \mathbf{x}_{2} \\ \mathbf{x}_{3} \\ \mathbf{x}_{4} \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}. \text{ According to Theorem 7, the degree of channel}$$ contention at each dimension can be computed as follows: $T_1=2^{1-0}=2$ , $T_2=2^{2-0}=4$ , $T_3=2^{3-2}=2$ , and $T_4=2^{4-4}=1$ . In the following discussions, we shall show how to minimize the channel contention for performing an LCC on an Origin system. **Lemma 3:** Consider an **n**-dimensional LCC, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin 2000. If $\mathbf{Q}_{n \times n}$ is a non-singular matrix, then - 1. $T(\mathbf{QAQ}^{-1}, \mathbf{Qb}) \ge 2^{\mathbf{n}-1-rank(\mathbf{A})}$ for $rank(\mathbf{A}) \le \mathbf{n}-2$ , and - 2. $T(QAQ^{-1}, Qb) \ge 1 \text{ for } rank(A) > n-2.$ Proof: Let $B = [\mathbf{QAQ}^{-1}]_{L_{\mathbf{n}} = \{0\}, L_{\mathbf{n}}}$ . From Theorem 7, $T_{\mathbf{n}-1}(\mathbf{QAQ}^{-1}, \mathbf{Qb}) = 2^{\mathbf{n}-1-rank(B_{L_{\mathbf{n}-1}, L_{\mathbf{n}-1}})}$ . Since $\max_{1 \le i \le \mathbf{n}-1} \{i - rank(B_{L_{i}, L_{i}})\} \ge \mathbf{n} - 1 - rank(B_{L_{\mathbf{n}-1}, L_{\mathbf{n}-1}}) \ge MAX(0, (\mathbf{n} - 1-rank(\mathbf{A})))$ , it can also be obtained that $T(\mathbf{QAQ}^{-1}, \mathbf{Qb}) \ge 2^{\mathbf{n}-1-rank(\mathbf{A})}$ for $rank(\mathbf{A}) \le \mathbf{n}-2$ , and $T(\mathbf{QAQ}^{-1}, \mathbf{Qb}) \ge 1$ for $rank(\mathbf{A}) > \mathbf{n}-2$ . Lemma 3 gives a lower bound of the degree of channel contention for performing an LCC on an Origin system. The following theorems and their proofs will show the algorithms to achieve the lower bound. **Theorem 8:** For any binary matrix $\mathbf{A}_{\mathbf{n} \times \mathbf{n}}$ , if $rank(\mathbf{A}) \leq \mathbf{n}$ -2, then there exists a permutation matrix $\mathbf{Q}_{\mathbf{n} \times \mathbf{n}}$ such that $i\text{-}rank(B_{L_i,L_i}) \leq \mathbf{n}$ -1- $rank(\mathbf{A})$ for $1 \leq i \leq \mathbf{n}$ -1, where $B = [\mathbf{Q} \mathbf{A} \mathbf{Q}^{-1}]_{L_{\mathbf{n}} = \{0\}, L_{\mathbf{n}}}$ . **Proof:** We shall prove by induction on i. - (1) Consider $i = \mathbf{n} 1$ . - a. From Lemma 2, we can find a permutation matrix $\mathbf{Q}_1$ such that $rank(\left[\mathbf{Q}\mathbf{A}\mathbf{Q}^{-1}\right]_{L_n-\{0\},L_n}) = rank(\mathbf{A}).$ - b. Let $\mathbf{B} = \left[ \mathbf{Q}_1 \mathbf{A} \mathbf{Q}_1^{-1} \right]$ . Since $rank(\mathbf{B}_{L_n \{0\}, L_n}) = rank(\mathbf{A}) \leq \mathbf{n} 2$ , there exists a column $j \neq 0$ in $\mathbf{B}_{L_n \{0\}, L_n}$ such that column j is a linear combination of the other columns. If $j = \mathbf{n} 1$ , then let $\mathbf{Q}_2 = \mathbf{I}$ . Otherwise, Let $\mathbf{Q}_2$ be the permutation matrix exchanging rows j and $\mathbf{n} 1$ . Accordingly, $\mathbf{Q}_2^{-1}$ must be the permutation matrix exchanging column j and $\mathbf{n} 1$ . Hence, $rank(\left[\mathbf{Q}_2\mathbf{B}\mathbf{Q}_2^{-1}\right]_{L_n \{0\}, L_n \{\mathbf{n} 1\}}) = rank(\mathbf{A})$ . Let $\mathbf{Q}=\mathbf{Q}_2\mathbf{Q}_1$ , we can derive that $\mathbf{n}-1$ -rank $(B_{L_{\mathbf{n}-1},L_{\mathbf{n}-1}})=\mathbf{n}-1$ -rank $(\mathbf{A})$ , where $B=[\mathbf{Q}\mathbf{A}\mathbf{Q}^{-1}]_{L_{\mathbf{n}}-\{0\},L_{\mathbf{n}}}$ . (2) Suppose the theorem is true for $k \le i \le n-1$ . Therefore, there exists a permutation matrix $\mathbf{P}_1$ such that $i\text{-}rank(C_{L_1,L_1}) \le \mathbf{n}\text{-}1\text{-}rank(\mathbf{A})$ , where $C = \left[\mathbf{P}_1\mathbf{A}\mathbf{P}_1^{-1}\right]_{L_n=\{0\},L_n}$ . - (3) Now we shall prove that the theorem is true for i=k-1. If k-1- $rank(C_{L_{k-1},L_{k-1}}) > n-1$ - $rank(\mathbf{A})$ , then - (i) k-1- $rank(C_{L_{k-1},L_{k-1}}) > n$ -1- $rank(\mathbf{A}) \geq n$ -1-(n-2)= 1. Therefore, $rank(C_{L_{k-1},L_{k-1}}) < k$ -2. - (ii) $k\text{-}1\text{-}rank(\ C_{L_{k-1},L_{k-1}}\ )> \text{ $n$-}1\text{-}rank(\ A)\geq k\text{-}rank(\ C_{L_k,L_k}\ ).}$ Hence, $rank(\ C_{L_{k-1},L_{k-1}}\ )< rank(\ C_{L_k,L_k}\ )-1$ . Since $rank(\ C_{L_{k-1},L_{k-1}}\ )\geq rank(\ C_{L_k,L_k}\ )-2$ , it can be derived that $rank(\ C_{L_{k-1},L_{k-1}}\ )= rank(\ C_{L_k,L_k}\ )-2$ . Therefore, $rank(\ C_{L_{k-1},L_{k-1}}\ )= rank(\ C_{L_{k-1},L_k}\ )-1= rank(\ C_{L_k,L_k}\ )-2$ . For the above reasons, $rank(C_{L_{k-1},L_k}) < k$ -1, and there exists a column $j\neq 0$ in $C_{L_{k-1},L_k}$ such that column j is a linear combination of the other columns. Let $\mathbf{P}_2$ be the permutation matrix exchanging rows j and k-1 of $\mathbf{P}_1\mathbf{AP}_1^{-1}$ . Accordingly, $\mathbf{P}_2^{-1}$ must be the permutation matrix exchanging columns j and k-1 of $\mathbf{P}_1\mathbf{AP}_1^{-1}$ . The effect of $\mathbf{P}_2$ and $\mathbf{P}_2^{-1}$ on C is that they exchange the rows (j-1) and (k-2), and the columns j and k-1 on C. Hence, $rank(C_{L_k,L_k}) = rank(C_{L_k,L_k})$ for $k\leq i\leq n$ -1, and $rank(C_{L_{k-1},L_{k-1}}) = rank(C_{L_{k-1},L_{k-1}}) = rank(C_{L_k,L_k}) - 1$ , where $C = \begin{bmatrix} \mathbf{P}_2\mathbf{P}_1\mathbf{AP}_1^{-1}\mathbf{P}_2^{-1} \end{bmatrix}_{L_n=\{0\},L_n}$ . Since, k-1- $rank(C_{L_{k-1},L_{k-1}}) = k$ - $rank(C_{L_k,L_k}) \leq n$ -1-rank(A), we can prove that there exists a permutation matrix $\mathbf{Q} = \mathbf{P}_2\mathbf{P}_1$ such that i- $rank(B_{L_1,L_1}) \leq n$ -1-rank(A) for k-1 $\leq i\leq n$ -1, where $B = \begin{bmatrix} \mathbf{Q}\mathbf{A}\mathbf{Q}^{-1} \end{bmatrix}_{L_n=\{0\},L_n}$ . By mathematic induction, we can prove that i-rank $(B_{L_i,L_i}) \le \mathbf{n}$ -1-rank $(\mathbf{A})$ for $1 \le i \le \mathbf{n}$ -1. This completes the proof. Corollary 5: Consider an **n**-dimensional LCC, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin 2000. If $rank(\mathbf{A}) \leq \mathbf{n}$ -2, there exists a reordering mapping such that the new communication has minimum channel contention. From Theorem 8 and Corollary 5, we can design an optimal reordering mapping for any **n**-dimensional LCC, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin 2000, where $rank(\mathbf{A}) \leq \mathbf{n}$ -2. However, if $rank(\mathbf{A}) > \mathbf{n}$ -2, there exists some LCCs, which can not be optimized to be contention free by reordering mapping. Example 8 gives two examples of such LCCs. **Example 8**. Two LCCs that can not be optimized to be contention free by reordering mapping. 1. $$\begin{bmatrix} \mathbf{y}_{0} \\ \mathbf{y}_{1} \\ \mathbf{y}_{2} \\ \mathbf{y}_{3} \\ \mathbf{y}_{4} \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x}_{0} \\ \mathbf{x}_{1} \\ \mathbf{x}_{2} \\ \mathbf{x}_{3} \\ \mathbf{x}_{4} \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}. T_{1} = 2^{1-0} = 2, T_{2} = 2^{2-1} = 2, T_{3} = 2^{3-2} = 2, \text{ and } T_{4} = 2^{4-3} = 2.$$ 2. $$\begin{bmatrix} \mathbf{y}_{0} \\ \mathbf{y}_{1} \\ \mathbf{y}_{2} \\ \mathbf{y}_{3} \\ \mathbf{y}_{4} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x}_{0} \\ \mathbf{x}_{1} \\ \mathbf{x}_{2} \\ \mathbf{x}_{3} \\ \mathbf{x}_{4} \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}. T_{1} = 2^{1-0} = 2, T_{2} = 2^{2-1} = 2, T_{3} = 2^{3-2} = 2, \text{ and } T_{4} = 2^{4-3} = 2. \quad \Box$$ For each LCC in the above example, the degree of channel contention is 2, and no reordering mapping can reduce the channel contention. If further optimization is required, the linear mapping approach can be used so that the new LCCs are contention free. In the following discussions, we shall first consider that $rank(\mathbf{A}) = \mathbf{n}-1$ for an $\mathbf{n}$ -dimensional LCC, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin 2000. The method is given in Lemma 5. Then, the condition of $rank(\mathbf{A}) = \mathbf{n}$ is discussed. Lemma 6 and Lemma 7 give the result for $\mathbf{A} = \mathbf{I}$ and $\mathbf{A} \neq \mathbf{I}$ , respectively **Lemma 4:** For any binary matrix $A_{n \times n}$ , if rank(A) < n, then there exists a non-singular matrix $Q_{n \times n}$ such that $[QAQ^{-1}]_{(0)}=0$ . **Proof:** From Lemma 2, we can find a permutation matrix $Q_1$ such that $rank([Q_1AQ_1^{-1}]_{L_n-\{0\},L_n}) = rank(A)$ . Let $B = [Q_1AQ_1^{-1}]$ , then row 0 of B is a linear combination of the other rows. Suppose $B_{(0)} = c_1B_{(1)} + c_2B_{(2)} \dots + c_{n-1}B_{(n-1)}$ , and $Q_2$ be an $n \times n$ matrix, where $[Q_2]_{(0)} = [1 - c_1 - c_2 \dots - c_{n-1}]$ and $[Q_2]_{(i)} = [I_{n \times n}]_{(i)}$ for $1 \le i \le n-1$ . We can derive that $[Q_2BQ_2^{-1}]_{(0)} = 0$ . Form the above discussions, let $Q = Q_2Q_1$ , then Q is non-singular and $[QAQ^{-1}]_{(0)} = 0$ . **Lemma 5:** For any binary matrix $\mathbf{A}_{\mathbf{n} \times \mathbf{n}}$ , if $rank(\mathbf{A}) = \mathbf{n} - 1$ , then there exists a non-singular matrix $\mathbf{Q}_{\mathbf{n} \times \mathbf{n}}$ such that i- $rank(B_{L_i,L_i}) = 0$ for $1 \le i \le \mathbf{n} - 1$ , where $B = \left[ \mathbf{Q} \mathbf{A} \mathbf{Q}^{-1} \right]_{L_{\mathbf{n}} - \{0\},L_{\mathbf{n}}}$ . #### **Proof:** - (1) From Lemma 4, we can find a non-singular matrix $\mathbf{P}$ such that $[\mathbf{PAP}^{-1}]_{(0)}=0$ . - (2) Let $\mathbf{B} = [\mathbf{PAP}^{-1}]$ , then $rank(\mathbf{B}_{L_{\mathbf{n}} \{0\}, L_{\mathbf{n}}}) = rank(\mathbf{A})$ . We shall prove by induction that there exists a non-singular matrix $\mathbf{Q}$ such that $[\mathbf{QBQ}^{-1}]_{(0)} = 0$ and $i\text{-}rank(B_{L_i, L_i}) = 0$ for $1 \le i \le \mathbf{n} 1$ , where $B = [\mathbf{QBQ}^{-1}]_{L_{\mathbf{n}} \{0\}, L_{\mathbf{n}}}$ . - a. Consider i= **n**-1. Let C= $\textbf{B}_{L_{\textbf{n}}-\{0\},L_{\textbf{n}}}$ . If $rank(C_{L_{n-1},L_{n-1}}) \neq rank(\textbf{A})$ , then there is a column $j \neq \mathbf{n}$ -1 in C such that column j is a linear combination of the other columns, and column $\mathbf{n}$ -1 is linearly independent. Let $\mathbf{Q}$ be the non-singular matrix subtracting rows j from row $\mathbf{n}$ -1. Accordingly, $\mathbf{Q}^1$ must be the non-singular matrix adding column $\mathbf{n}$ -1 to column j. Hence, $[\mathbf{Q}\mathbf{B}\mathbf{Q}^{-1}]_{(0)}=0$ , and $rank([\mathbf{Q}\mathbf{B}\mathbf{Q}^{-1}]_{L_{\mathbf{n}}-\{0\},L_{\mathbf{n}}-\{\mathbf{n}-1\}}) = rank([\mathbf{Q}\mathbf{B}\mathbf{Q}^{-1}]_{L_{\mathbf{n}}-\{0\},L_{\mathbf{n}}}) = rank(\mathbf{B}) = rank(\mathbf{A})$ . Therefore, n-1- $rank([\mathbf{B}_{L_{\mathbf{n}-1},L_{\mathbf{n}-1}}]) = 0$ for $1 \leq i \leq \mathbf{n}$ -1, where $B = [\mathbf{Q}\mathbf{B}\mathbf{Q}^{-1}]_{L_{\mathbf{n}}-\{0\},L_{\mathbf{n}}}$ . - b. Suppose there exists a non-singular $\mathbf{n} \times \mathbf{n}$ matrix $\mathbf{Q}_1$ such that $[\mathbf{Q} \mathbf{B} \mathbf{Q}^{-1}]_{(0)} = 0$ and $i\text{-}rank(B_{L_1,L_1}) = 0$ for $k \leq i \leq \mathbf{n} 1$ , where $B = \left[\mathbf{Q}_1 \mathbf{B} \mathbf{Q}_1^{-1}\right]_{L_{\mathbf{n}} = \{0\},L_{\mathbf{n}}}$ . - c. If k-1- $rank(B_{L_{k-1},L_{k-1}}) \neq 0$ , then it must be true that k-1- $rank(B_{L_{k-1},L_{k-1}}) > 0$ . Hence, k-1- $rank(B_{L_{k-1},L_{k-1}}) > k$ - $rank(B_{L_k,L_k})$ . That is, $rank(B_{L_{k-1},L_{k-1}}) < rank(B_{L_k,L_k})$ -1. Hence, $rank(B_{L_{k-1},L_{k-1}}) = rank(B_{L_{k-1},L_k})$ -1= $rank(B_{L_k,L_k})$ -2. Therefore, there is a column $j \neq k$ in $B_{L_{k-1},L_k}$ such that column j is a linear combination of other columns. Let $\mathbf{Q}_2$ be the non-singular matrix subtracting rows j from row k. Accordingly, $\mathbf{Q}_2^{-1}$ must be the non-singular matrix adding column k to column j. Hence, let $\mathbf{Q} = \mathbf{Q}_2\mathbf{Q}_1$ , we can derive that $[\mathbf{Q}\mathbf{B}\mathbf{Q}^{-1}]_{(0)} = 0$ , $rank([\mathbf{Q}\mathbf{B}\mathbf{Q}^{-1}]_{L_{\mathbf{n}}-\{0\},L_{\mathbf{n}}}]_{L_{k-1},L_{k-1}}) = rank([\mathbf{Q}\mathbf{B}\mathbf{Q}^{-1}]_{L_{\mathbf{n}}-\{0\},L_{\mathbf{n}}}]_{L_{k-1},L_{k-1}})$ and $rank([\mathbf{Q}\mathbf{B}\mathbf{Q}^{-1}]_{L_{\mathbf{n}}-\{0\},L_{\mathbf{n}}}]_{L_{k-1},L_{k-1}}) = 0$ for Therefore, $\mathbf{Q}$ is a non-singular matrix, $[\mathbf{Q}\mathbf{B}\mathbf{Q}^{-1}]_{(0)} = 0$ , and i- $rank(B_{L_1,L_1}) = 0$ for $$k-1 \le i \le n-1$$ , where $B = [QBQ^{-1}]_{L_{n}=\{0\},L_{n}}$ . From the above discussion, there exists a non-singular matrix ( $\mathbf{QP}$ ) such that this theorem can be proved to be true. Lemma 6: Consider an **n**-dimensional LCC, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin 2000. If $\mathbf{A} = \mathbf{I}_{\mathbf{n} \times \mathbf{n}}$ then there exists a non-singular matrix $\mathbf{Q}_{\mathbf{n} \times \mathbf{n}}$ such that $[\mathbf{Q}\mathbf{x}]_i = [\mathbf{Q}\mathbf{y}]_i$ , for $1 \le i \le \mathbf{n} - 1$ . **Proof:** If $\mathbf{b} = 0$ , then $\mathbf{y} = \mathbf{l}\mathbf{x} + 0 = \mathbf{x}$ . If $\mathbf{b} \neq 0$ , obviously, we can find a $\mathbf{n} \times \mathbf{n}$ permutation matrix $\mathbf{P}_1$ such that $[\mathbf{P}_1 \mathbf{b}]_0 \neq 0$ , i.e., $[\mathbf{P}_1 \mathbf{b}]_0 = 1$ . Suppose $$\mathbf{P}_2 = \begin{bmatrix} 1 & 0 \\ -[\mathbf{P}_1 \mathbf{b}]_{L_{\mathbf{n}}-\{0\}} & \mathbf{I}_{(\mathbf{n}-1)\times(\mathbf{n}-1)} \end{bmatrix}$$ . We can derive that $[\mathbf{P}_2 \mathbf{P}_1 \mathbf{b}]_0 = 1$ and $[\mathbf{P}_2\mathbf{P}_1\mathbf{b}]_i = 0$ for $1 \le i \le \mathbf{n}$ -1. Form the above discussion, let $\mathbf{Q} = \mathbf{P}_2\mathbf{P}_1$ , then $\mathbf{Q}$ is non-singular and $\mathbf{Q}\mathbf{y} = \mathbf{Q}\mathbf{x} + \mathbf{Q}\mathbf{b}$ . Since $[\mathbf{Q}\mathbf{b}]_i = 0$ for $1 \le i \le \mathbf{n} - 1$ , we can derive $[\mathbf{Q}\mathbf{x}]_i = [\mathbf{Q}\mathbf{y}]_i$ , for $1 \le i \le \mathbf{n} - 1$ . **Lemma 7:** For any binary matrix $\mathbf{A}_{\mathbf{n}\times\mathbf{n}}$ , if $rank(\mathbf{A}) = \mathbf{n}$ and $\mathbf{A} \neq \mathbf{I}_{\mathbf{n}\times\mathbf{n}}$ , then there exists a non-singular matrix $\mathbf{Q}_{\mathbf{n}\times\mathbf{n}}$ such that $i\text{-}rank(B_{L_i,L_i}) = 0$ for $1 \leq i \leq \mathbf{n} - 1$ , where $B = [\mathbf{Q}\mathbf{A}\mathbf{Q}^{-1}]_{L_i = \{0\},L_i}$ . #### **Proof:** - (1) Since $\mathbf{A} \neq \mathbf{I}$ , there must exists an $\mathbf{n} \times \mathbf{n}$ permutation matrix $\mathbf{Q}_1$ such that $[\mathbf{Q}_1 \mathbf{A} \mathbf{Q}_1^{-1}]_{0,\mathbf{n}-1}$ = 1. - (2) Let $\mathbf{B} = \mathbf{Q}_1 \mathbf{A} \mathbf{Q}_1^{-1}$ . If $\mathbf{B}_{0,j} \neq 0$ , $j \neq n-1$ , then we can find an $\mathbf{n} \times \mathbf{n}$ matrix $\mathbf{P}$ , where $\mathbf{P}_{(i)} = \mathbf{P}_{(i)}$ I<sub>(i)</sub> for $0 \le i \le n-2$ and $P_{(n-1)} = [0 \ 0 \ \dots P_{n-1,j} = B_{0,j} \ \dots \ 0 \ 1]$ . Therefore, P is a non-singular matrix and can be used for adding rows j to row n-1. Accordingly, $P^{-1}$ must be the non-singular matrix for subtracting column n-1 from column j. Hence, we can derive $[PBP^{-1}]_{0,j} = 0$ . By applying above process recursively, we can obtain a non-singular matrix $Q_2 = \begin{bmatrix} I_{(n-1)\times(n-1)} & 0 \\ -[B]_{(0),L_{n-1}} & 1 \end{bmatrix}$ so that $[Q_2BQ_2^{-1}]_{(0)} = [0 \ 0 \ \dots \ 0 \ 1]$ . (3) Let $\mathbf{C} = \mathbf{Q}_2 \mathbf{B} \mathbf{Q}_2^{-1}$ . Similar to (2) in the proof of Lemma 5, we can prove that there exists a non-singular matrix $\mathbf{Q}_3$ such that $[\mathbf{Q}_3 \mathbf{C} \mathbf{Q}_3^{-1}]_{(0)} = [0 \ 0 \ \dots \ 0 \ 1]$ and $i\text{-}rank(C_{L_i,L_i})$ = 0 for $1 \le i \le \mathbf{n} - 1$ , where $C = [\mathbf{Q}_3 \mathbf{C} \mathbf{Q}_3^{-1}]_{L_{\mathbf{n}} - \{0\},L_{\mathbf{n}}}$ . From the above discussions, let $\mathbf{Q} = \mathbf{Q}_3 \mathbf{Q}_2 \mathbf{Q}_1$ , we can derive that $i\text{-}rank(B_{L_i,L_i}) = 0$ for $1 \le i \le n-1$ , where $B = \begin{bmatrix} \mathbf{Q} \mathbf{A} \mathbf{Q}^{-1} \end{bmatrix}_{L_n = \{0\}, L_n \}}$ . Corollary 6: Consider an **n**-dimensional LCC, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin 2000. If $rank(\mathbf{A}) \ge \mathbf{n}$ -1, there exists a linear mapping such that the new communication has no channel contention. **Corollary 7:** For any LCC on Origin 2000, there exists a linear mapping such that the new communication has minimum channel contention. Example 9. Consider the reverse-flip communication on an Origin system with 32 nodes, $$\begin{bmatrix} \mathbf{y}_0 \\ \mathbf{y}_1 \\ \mathbf{y}_2 \\ \mathbf{y}_3 \\ \mathbf{y}_4 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \mathbf{x}_0 \\ \mathbf{x}_1 \\ \mathbf{x}_2 \\ \mathbf{x}_3 \\ \mathbf{x}_4 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}. \text{ According to Theorem 7, we can find a}$$ non-singular matrix $$\mathbf{Q} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & -1 & 1 & 0 & 0 \\ -1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$ and $\mathbf{Q}^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$ . Hence, the new communication is $$\begin{bmatrix} \mathbf{y'}_0 \\ \mathbf{y'}_1 \\ \mathbf{y'}_2 \\ \mathbf{y'}_3 \\ \mathbf{y'}_4 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 & 0 \\ -1 & 1 & 1 & -1 & 0 \\ 0 & 1 & 0 & 0 & -1 \\ 1 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \mathbf{x'}_0 \\ \mathbf{x'}_1 \\ \mathbf{x'}_2 \\ \mathbf{x'}_3 \\ \mathbf{x'}_4 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 1 \end{bmatrix}, \text{ and the degree of channel}$$ contention at each dimension can be computed as follows: $T_1=2^{1-1}=1$ , $T_2=2^{2-2}=1$ , $T_3=2^{3-3}=1$ , and $T_4=2^{4-4}=1$ . In the above discussion, we proposed the linear mapping approach for an $\mathbf{n}$ -dimensional LCC, $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ , on an Origin 2000, where $rank(\mathbf{A}) \ge \mathbf{n}$ -1. It can be observed that the new LCCs can be optimized to be contention free by linear mapping. All the discussions in this section clearly show the applicability of the processor mapping approaches on existing machines. ## 8. Conclusions In this report, we address the problem of minimizing the maximum number of paths contending for the same channel when performing LCC on *e*-cube wormhole-routed hypercubes. A new approach called processor reordering mapping is proposed to solve this problem. We have proved that, for any LCC, there exists a reordering mapping such that the new communication after processor reordering has minimum channel contention. An $O(n^3)$ algorithm is proposed to find such a mapping for an n-dimensional hypercube. As for a set of LCCs, an algorithm based on dynamic programming is proposed to search for an optimal reordering mapping. It can greatly reduce the search space and thus is feasible even for a large hypercube computer. Simulation results clearly show significant performance improvement provided by the proposed approach when compared with partially adaptive routing strategies. With these results, compiler techniques can be used to reduce the message latency without the need of extra hardware costs. ## References: - [1] *nCUBE 2 Supercomputers Manual*, NCUBE Company, 1990. - [2] Origin<sup>TM</sup> Servers Technical Report, Silicon Graphics, Inc., 1998. - [3] S. Abraham and K. Padmanabhan, "Performance of the Direct Binary *n*-Cube Network for Multiprocessors," *IEEE Transactions on Computers*, vol. 38, no. 7, pp. 1000-1011, Jul. 1989. - [4] R. Boppana and C. S. Raghavendra, "Optimal Self-Routing of Linear-Complement Permutations in Hypercubes," *The Fifth Distributed Memory Computing Conference*, pp. 800-808, Apr. 1990. - [5] H. L. Chen and H. S. Yihng, "Generalized Wormhole Routing Strategies in Hypercubes," *Journal of Information Science and Engineering*, vol. 10, pp. 387-341, 1994. - [6] A. A. Chien, "A Cost and Speed Model for *k*-ary *n*-cube Wormhole Routers," *IEEE Transactions on Parallel and Distributed Systems*. vol. 9, no. 2, pp. 150-162, Feb. 1998. - [7] G. M. Chiu, S. Chalasani, and C. S. Raghavendra, "Flexible, Fault-Tolerant Routing Criteria for Circuit-Switched Hypercubes," *Proc. IEEE 11th International Conference on Distributed Computing Systems*, pp.582-589, 1991. - [8] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, *Introduction to Algorithms*, The MIT Press, 1990. - [9] W. J. Dally, "Performance Analysis of *k*-ary *n*-cube Interconnection Networks," *IEEE Transactions on Computers*, vol. 39, no. 6, pp. 775-785, Jun. 1990. - [10] W. J. Dally and C. L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks," *IEEE Transactions on Computers*, vol. C-36, no. 5, pp. 547-553, May. 1987. - [11] C. J. Glass and L. M. Ni, "The Turn Model for Adaptive Routing," *Journal of the Association for Computing Machinery*, vol. 41, no. 5, pp. 874-902, Sep. 1994. - [12] R. C. Gonzalez and R. E. Woods, *Digital Image Processing*, Addison-Wesley Publishing Company, Inc., 1992. - [13] F. T. Leighton, *Introduction to Parallel Algorithm and Architectures: Arrays, Trees, Hypercubes*. Morgan Kaufmann Publishers, San Mateo, 1992. - [14] Q. Li, "Minimum Deadlock-Free Message Routing Restrictions in Binary Hypercubes," *Journal of Parallel and Distributed Computing*, vol. 15, no. 2, pp. 153-159, 1992. - [15] F. C. Lin and F. H. Wang "Minimum Deadlock-Free Message Routing Restrictions - in Binary Hypercubes," *Journal of Parallel and Distributed Computing*, vol. 29, no. 2, pp. 27-42, 1995. - [16] H. Masuyama, "Algorithms to realize an arbitrary BPC permutation in chordal ring networks and mesh connected networks," *IEICE Trans. Inf. Syst. (Japan)*, vol. E77-D, no. 10, pp. 1118-1129, Oct. 1994. - [17] H. Masuyama, Y. Morita and E. Masuyama, "A realization of an arbitrary BPC permutation in hypercube connected computer networks" *IEICE Trans. Inf. Syst.* (*Japan*), vol. E78-D, no. 4, pp. 428-435, Apr. 1995. - [18] R. J. McEliece, *Finite Fields for Computer Scientists and Engineers*. Kluwer Academic Publishers, 1987. - [19] D. Nassimi and S. Sahni, "An Optimal Routing Algorithm for Mesh-Connected Parallel Computers," *Journal of the Association for Computing Machinery*, vol. 27, no. 1, pp. 6-29, Jan. 1980. - [20] D. Nassimi and S. Sahni, "Optimal BPC Permutations on a Cube Connected SIMD Computer," *IEEE Transaction on Computers*, vol. C-31, no. 4, pp. 338-341, Apr. 1982. - [21] L. M. Ni and L. M. McKinley, "A Survey of Wormhole Routing Techniques in Direct Networks," *IEEE Computer*, vol. 26, pp. 62-76, Feb. 1993. - [22] Y. Saad and M. H. Schultz, "Topological Properties of Hypercubes," *IEEE Transaction on Computers*, vol. 37, no. 7, pp. 867-872, Jul. 1988.