Mindboards

Posted: **13 Jul 2011, 09:07**

Linusa,
The issue is that accessing elements using [j] isn't very efficient so accessing by [j] will be just as bad. (At least, not anywhere near the performance I'm looking for...) And there is performance differences, if you want to access [j+1] afterwards this will be more efficient in NXC than doing [i+1][j]. (In NXC multidimensional arrays should be treated as arrays of array pointers and not a linear clump of memory.)

Well, I just optimized the function instead and made it 3 times as fast.

Code: Select all

/*	Multiplies Matrices A and B and stores the result in C.
	Size of A is [N][M] and size of B is [M][K]
*/
/*/	Original
void MatrixMatrixMult(float A[][], float B[][], float &C[][], int N, int M, int K){
  int i, j, s;

  for (i=0; i<N; ++i) {
    for (j=0; j<K; ++j) {
       C[i][j]=0;
       for (s=0; s<M; ++s) {
         C[i][j]=C[i][j] + A[i][s]*B[s][j];
      }
    }
  }
}
//*/

/*/	Just optimize array-lookups
void MatrixMatrixMult(float A[][], float B[][], float &C[][], int N, int M, int K){
	for( unsigned int i=0; i<N; ++i ){
		//Speed up arrays
		float A_i[]; A_i = A[i];	//Compiler bug
		float C_i[]; C_i = C[i];
		
		for( unsigned int j=0; j<K; ++j ){
			float sum = 0;
			for( unsigned int s=0; s<M; ++s ){
				sum += A_i[s]*B[s][j];
			}
			C_i[j] = sum;
		}
		C[i] = C_i;
	}
}
//*/

/*/	Optimize loops
void MatrixMatrixMult(float A[][], float B[][], float &C[][], int N, int M, int K){
	int i = N;	lbl_Mult_start_i:	i--;	//Loop start
		//Speed up arrays
		float A_i[]; A_i = A[i];	//Compiler bug
		float C_i[]; C_i = C[i];
		
		int j = K;
		lbl_Mult_start_j: j--;	//Second loop start
			float sum = 0;
			
			int s = M;	lbl_Mult_start_s: s--;	//Third loop start
			
				sum += A_i[s]*B[s][j];
			
			asm{ brtst GT, lbl_Mult_start_s, s }
			
			C_i[j] = sum;
		
		asm{ brtst GT, lbl_Mult_start_j, j }
		
		C[i] = C_i;
		
	asm{ brtst GT, lbl_Mult_start_i, i }
}
//*/

/*/	Optimize index and replace
void MatrixMatrixMult(float A[][], float B[][], float &C[][], int N, int M, int K){
	int i = N;	lbl_Mult_start_i:	i--;	//Loop start
		//Speed up arrays
		float A_i[];
		float C_i[];
		asm{ index A_i, A, i }
		asm{ index C_i, C, i }
		
		int j = K;	lbl_Mult_start_j: j--;	//Second loop start
			float sum = 0;
			
			int s = M;	lbl_Mult_start_s: s--;	//Third loop start
				
				float arr_temp, sum_temp;
				asm{ index sum_temp, A_i, s }
				float arr_temp2[];
				asm{ index arr_temp2, B, s }
				asm{ index arr_temp, arr_temp2, j }
				sum += sum_temp * arr_temp;
			
			asm{ brtst GT, lbl_Mult_start_s, s }
			
			asm{ replace C_i, C_i, j, sum }
		
		asm{ brtst GT, lbl_Mult_start_j, j }
		
		asm{ replace C, C, i, C_i }
		
	asm{ brtst GT, lbl_Mult_start_i, i }
}
//*/

//*/	Minor optimizations and improvements
void MatrixMatrixMult(float A[][], float B[][], float &C[][], int N, int M, int K){
	float arr_temp[];
	//Set size of C
	ArrayInit( arr_temp, 0, K );
	ArrayInit( C, arr_temp, N );
	
	lbl_Mult_start_i:	N--;	//Loop start
		//Speed up arrays
		float A_i[];
		float C_i[];
		asm{ index A_i, A, N }
		asm{ index C_i, C, N }
		
		int j = K;	lbl_Mult_start_j: j--;	//Second loop start
			float sum = 0;
			
			int s = M;	lbl_Mult_start_s: s--;	//Third loop start
				
				//sum += A_i[s]*B[s][j];
				float temp, sum_temp;
				asm{ index sum_temp, A_i, s }
				asm{ index arr_temp, B, s }
				asm{ index temp, arr_temp, j }
				sum_temp *= temp;
				sum += sum_temp;
			
			asm{ brtst GT, lbl_Mult_start_s, s }
			
			asm{ replace C_i, C_i, j, sum }
		
		asm{ brtst GT, lbl_Mult_start_j, j }
		
		asm{ replace C, C, N, C_i }
		
	asm{ brtst GT, lbl_Mult_start_i, N }
}
//*/


task main() {
	float A[10][7], B[7][12], C[10][12];

	//Initialize variables
	A[0][0]=1;	A[0][1]=2;	A[0][2]=3;
	A[1][0]=3;	A[1][1]=4;	A[1][2]=4;
	A[2][0]=4;	A[2][1]=7;	A[2][2]=1;

	B[0][0]=3;	B[0][1]=1;	B[0][2]=4;
	B[1][0]=7;	B[1][1]=6;	B[1][2]=8;
	B[2][0]=4;	B[2][1]=5;	B[2][2]=9;

	TextOut(0,40,"Mul");
	
	unsigned long t0 = CurrentTick();
	repeat( 100 )
		MatrixMatrixMult(A,B,C, 10,7,12);
	
	NumOut(0,58, CurrentTick() - t0 );	//Show time taken
	
	//Show results
	NumOut(0,32,C[0][0]);	NumOut(30,32,C[0][1]);	NumOut(60,32,C[0][2]);
	NumOut(0,24,C[1][0]);	NumOut(30,24,C[1][1]);	NumOut(60,24,C[1][2]);
	NumOut(0,16,C[2][0]);	NumOut(30,16,C[2][1]);	NumOut(60,16,C[2][2]);

	while(1);
}

You can see every step in the optimization process, so if interested you can try each out and see how much/how little they matters.
One functional change in the last modification, it automatically resets C to the right array size. If you also want to reduce memory requirements while the function is not in use, add the following at the very end:

Code: Select all

ArrayInit( arr_temp, 0, 0 );
ArrayInit( A_i, 0, 0 );
ArrayInit( C_i, 0, 0 );

Posted: **13 Jul 2011, 09:13**

spillerec,
thanks for contributing, but I don't look through all of your codes.
Could you make a speed test of your new code compared to my old one, maybe multiplying a 10x15 matrix by a 15x20 matrix a few hundred of times?

Posted: **13 Jul 2011, 09:19**

My original tests:

Time for 100 runs with 3x3x3:
Original: 1881
Array lookup optimize: 1108
loop optimize: 815
index and replace: 666

Time for 100 runs with 10x7x12:
Original: 56113
Array lookup optimize: 27283
loop optimize: 21105
index and replace: 17422

Didn't write down the results for the last modification, but adds about 0.5% execution time for 10x7x12 (because of the two added ArrayInit() ).
I have included some test code in the main function btw.

Posted: **13 Jul 2011, 09:31**

what do you mean with 3x3x3 and 10x7x12?
3x3 * 3x3
and 10x7 * 7x12?

and could you please again write down only your optimized code for Matrix x Matrix multiplication?
I'll substitute and integrate it for my own code!

Posted: **13 Jul 2011, 09:37**

I mean NxMxK, so it is 3x3 * 3x3 and 10x7 * 7x12.

Code: Select all

void MatrixMatrixMult(float A[][], float B[][], float &C[][], int N, int M, int K){
	float arr_temp[];
	//Set size of C
	ArrayInit( arr_temp, 0, K );
	ArrayInit( C, arr_temp, N );
	
	lbl_Mult_start_i:	N--;	//Loop start
		//Speed up arrays
		float A_i[];
		float C_i[];
		asm{ index A_i, A, N }
		asm{ index C_i, C, N }
		
		int j = K;	lbl_Mult_start_j: j--;	//Second loop start
			float sum = 0;
			
			int s = M;	lbl_Mult_start_s: s--;	//Third loop start
				
				//sum += A_i[s]*B[s][j];
				float temp, sum_temp;
				asm{ index sum_temp, A_i, s }
				asm{ index arr_temp, B, s }
				asm{ index temp, arr_temp, j }
				sum_temp *= temp;
				sum += sum_temp;
			
			asm{ brtst GT, lbl_Mult_start_s, s }
			
			asm{ replace C_i, C_i, j, sum }
		
		asm{ brtst GT, lbl_Mult_start_j, j }
		
		asm{ replace C, C, N, C_i }
		
	asm{ brtst GT, lbl_Mult_start_i, N }
}

Posted: **13 Jul 2011, 09:44**

huuuh,
quite a complicated code for a quite simple MxM, but if it's faster - thanks, I'll try it!

edit:
ok, the result is ok anyway!

Posted: **13 Jul 2011, 11:56**

It looks complicated because most of it is rewritten into NBC. The total instruction count actually fell from 71 to 24 (using optimization level 2).
I try to stay away from NBC when possible for clarity but if the compiler doesn't do it properly I just can't help myself... In particular when seeing something like this:

Code: Select all

A_i = A[i];

ending up being

Code: Select all

mov temp, i
index arr_temp, A, temp
mov A_i, arr_temp

when it should have been

Code: Select all

index A_i, A, i

Posted: **13 Jul 2011, 12:05**

good that there are people who understand NBC-ish
I have enough to do to distinguish NXC-ish from C, too complicated for me to handle a third language too... :(

Anyway, at least we seem to complement us well :)

Posted: **13 Jul 2011, 20:47**

new version with Transposed Matrix and Rotate Matrix.
spillerrec, anything faster than this...?

Code: Select all

#define ArrayInit2D(array, tmp, init_val, dimx, dimy) { \
  ArrayInit(tmp, init_val, dimy);  \
  ArrayInit(array, tmp, dimx);     \
  ArrayInit(tmp,0,0);              \
}
                                                // Transposed / Transponierte
void MatrixTransp(float A[][], float &C[][], int R, int S) {
   int r, s;
   float tmp[];
   ArrayInit2D(C,tmp, 0, S,R);

   for (s=0; s<S; s++) {
     for (r=0; r<R; r++) {
       C[s][r]=A[r][s];
     }
   }
}

Posted: **13 Jul 2011, 22:53**

Here are my suggestions. The transpose matrix function seems to be pretty optimal - especially with the changes I made to the optimizer today and setting the optimization level to 3+, though this particular function doesn't generate a ton of useless code even before the changes I made today.

matrix09.zip: (2.07 KiB) Downloaded 253 times

John Hansen

Mindboards

NXC: Matrix algebra library

Re: NXC: resolved - Matrix algebra library

Re: NXC: resolved - Matrix algebra library

Re: NXC: resolved - Matrix algebra library

Re: NXC: resolved - Matrix algebra library

Re: NXC: resolved - Matrix algebra library

Re: NXC: resolved - Matrix algebra library

Re: NXC: resolved - Matrix algebra library

Re: NXC: resolved - Matrix algebra library

Re: NXC: resolved - Matrix algebra library

Re: NXC: resolved - Matrix algebra library