Pseudocode
We are unlikely to have n^2 processes, so we modify the algorithm to work on submatrices.
Assume that the number of processes is P = q^2, where q is an integer that evenly divides n.
Let nbar = n/q and define A_ij to be the nbar x nbar submatrix of A whose
first entry is a_(i*nbar, j*nbar). Define B_ij and C_ij in the same fashion.
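The block definition above can be illustrated with a small sketch. The helper name `block` and the sizes n = 4, q = 2 are assumptions chosen for illustration only; they do not come from the original notes.

```python
def block(M, i, j, nbar):
    """Return the nbar x nbar submatrix of M whose first entry is M[i*nbar][j*nbar]."""
    return [row[j * nbar:(j + 1) * nbar] for row in M[i * nbar:(i + 1) * nbar]]

n, q = 4, 2        # illustrative sizes: P = q*q = 4 processes
nbar = n // q      # each process holds a 2 x 2 block

# Test matrix with A[r][c] = 4*r + c, so entries are easy to recognise.
A = [[n * r + c for c in range(n)] for r in range(n)]

A_10 = block(A, 1, 0, nbar)   # block for process row 1, process column 0
print(A_10)                   # [[8, 9], [12, 13]]
```

Here A_10 starts at a_(2,0) = 8, which matches the rule "first entry is a_(i*nbar, j*nbar)" with i = 1, j = 0, nbar = 2.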
/* my process row = i, my process column = j */
/* A_ij~A[i,j], B_ij~B[i,j], C_ij~C[i,j] */
/* destination = ((i-1) mod q, j), source = ((i+1) mod q, j) */
for (stage = 0; stage < q; stage++) {
    kbar = (i + stage) mod q;
    Broadcast A[i,kbar] across process row i;
    C[i,j] = C[i,j] + A[i,kbar] * B[kbar,j];
    Send B[kbar,j] to destination;
    Recv B[(kbar+1) mod q, j] from source;
}
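To check the stage loop by hand, it may help to simulate the q x q process grid serially. The sketch below is not an MPI program: it assumes kbar = (i + stage) mod q, models the row broadcast as every process in row i reading the same block of A, and models the send/receive pair as a circular upward shift of the B blocks within each column. All function names (`matmul`, `add`, `split`, `fox`) are hypothetical helpers introduced for this example.

```python
def matmul(X, Y):
    """Plain triple-loop product of two square matrices (lists of lists)."""
    m = len(X)
    return [[sum(X[r][k] * Y[k][c] for k in range(m)) for c in range(m)]
            for r in range(m)]

def add(X, Y):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(M, q):
    """Cut an n x n matrix into a q x q grid of nbar x nbar blocks."""
    nbar = len(M) // q
    return [[[row[j * nbar:(j + 1) * nbar] for row in M[i * nbar:(i + 1) * nbar]]
             for j in range(q)] for i in range(q)]

def fox(A, B, q):
    """Serial simulation of the blocked algorithm on a q x q process grid."""
    nbar = len(A) // q
    Ab, Bb = split(A, q), split(B, q)   # Bb[i][j] is the B block held by process (i, j)
    Cb = [[[[0] * nbar for _ in range(nbar)] for _ in range(q)] for _ in range(q)]
    for stage in range(q):
        for i in range(q):
            kbar = (i + stage) % q
            for j in range(q):
                # "Broadcast": every process in row i uses A[i,kbar].
                # The local B block at this stage is B[kbar,j].
                Cb[i][j] = add(Cb[i][j], matmul(Ab[i][kbar], Bb[i][j]))
        # "Send/Recv": process (i, j) receives the block held by ((i+1) mod q, j).
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble the q x q grid of blocks into one n x n matrix.
    return [sum((Cb[i][j][r] for j in range(q)), [])
            for i in range(q) for r in range(nbar)]

A = [[r + c for c in range(4)] for r in range(4)]
B = [[r * c + 1 for c in range(4)] for r in range(4)]
print(fox(A, B, 2) == matmul(A, B))   # True
```

After s shifts, process (i, j) holds B[(i+s) mod q, j], which is exactly B[kbar, j] for that stage, so each process accumulates all q terms of its block of C.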