OpenCL matrix multiplication: inner product versus outer product
I'm hoping everyone is familiar with the standard "naive" method of multiplying two matrices (n x n and square, for simplicity). In C it is:
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            c[i*n + j] += a[i*n + k] * b[k*n + j];
The above method computes the dot (inner) product of a row of a with a column of b, and it is easy to implement in OpenCL as follows:
__kernel void matmul_ocl(
    __global const float *a,
    __global const float *b,
    __global float *c,
    const int n)
{
    const int row = get_global_id(1); // row
    const int col = get_global_id(0); // col

    for (int i = 0; i < n; i++)
        c[row*n + col] += a[row*n + i] * b[i*n + col];
}
Interchanging the two inner-most loops of the original C implementation results in a method that computes outer products, i.e., it computes rank-1 updates of the rows of the c matrix:
for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            c[i*n + j] += a[i*n + k] * b[k*n + j];
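As a quick sanity check that the interchange changes only the order in which the updates are visited, not the result, the two loop orders can be compared on a small matrix. This is a minimal sketch in plain C; the names `matmul_ijk`, `matmul_ikj` and `loop_orders_agree` are made up for illustration and are not part of any library.

```c
#include <string.h>

/* Inner-product order (i, j, k): each c[i*n + j] is completed in turn. */
void matmul_ijk(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Interchanged order (i, k, j): row i of c accumulates a[i*n + k]
   times row k of b, one k at a time. */
void matmul_ikj(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Returns 1 if both loop orders produce identical output for a and b.
   c1 and c2 are caller-provided n*n scratch buffers. */
int loop_orders_agree(const float *a, const float *b,
                      float *c1, float *c2, int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    memset(c1, 0, bytes);   /* both versions accumulate, so start from zero */
    memset(c2, 0, bytes);
    matmul_ijk(a, b, c1, n);
    matmul_ikj(a, b, c2, n);
    return memcmp(c1, c2, bytes) == 0;
}
```

Calling `loop_orders_agree` on two small integer-valued matrices should return 1. For any fixed element of c, both versions add the same terms in the same order of k, so only the traversal of c differs.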
Does anyone know how to implement the above outer-product method in OpenCL? I have two of my attempts pasted below, but I can't seem to nail it.
Attempt 1
__kernel void matmul_ocl(
    __global const float *a,
    __global const float *b,
    __global float *c,
    const int n)
{
    const int row = get_global_id(1); // row
    const int col = get_global_id(0); // col

    __local float r;
    r = a[row*n + col];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 0; i < n; ++i)
        c[row*n + i] += r * b[col*n + i];
}
Attempt 2
#define ts 1
__kernel void matmul_ocl(
    __global const float *a,
    __global const float *b,
    __global float *c,
    int n)
{
    // thread coordinates
    const int row = get_local_id(1); // row
    const int col = get_local_id(0); // col

    // group tile coordinates
    const int by = get_group_id(1); // row
    const int bx = get_group_id(0); // col

    a += ts*by + ts*bx*n + n*row + (col);
    b += ts*by*n + n*row + (col);
    c += ts*bx*n + n*(row) + col;

    __global const float *blast = b + n;

    float acc[2] = {0.0f, 0.0f}; // private accumulators
    float *cptr = &acc[0];

    __local float bs[2];
    do
    {
        bs[0] = b[0];
        bs[1] = b[n];
        barrier(CLK_LOCAL_MEM_FENCE);

        *cptr += a[0] * bs[0];
        *cptr++ += a[0] * bs[1];

        b++;
        barrier(CLK_LOCAL_MEM_FENCE);
    } while (b < blast);

    c[0] += acc[0];
    c[1] += acc[1];
}
The OpenCL implementation of the common algorithm maps the outer two loops to the implicit loops of the OpenCL NDRange. This works because the outer two loops can safely be run in parallel.
There are a few problems with your attempt 1:
- The __local variable r is assigned different values by multiple work-items simultaneously. There is a race condition here, so the value of r is undefined. This can be fixed by making r a private variable instead.
- The more serious problem is that there is a race condition in the assignment of c. Every value of col (NDRange dimension 0) is running its own loop on i in parallel, so several work-items read and update the same c[row*n + i] without any synchronization.
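The second problem can be made concrete by counting, for each element of c, how many work-items write to it. The following is a minimal sketch in plain C that replays only the index arithmetic of the two kernels sequentially; no OpenCL is involved, and the function names are hypothetical.

```c
/* Inner-product kernel: work-item (row, col) writes only c[row*n + col].
   writers must hold n*n zero-initialized counters. */
void count_writers_inner(int *writers, int n)
{
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col)
            writers[row*n + col] += 1;
}

/* Attempt 1: work-item (row, col) writes c[row*n + i] for every i,
   so all work-items that share a row touch the same n elements. */
void count_writers_attempt1(int *writers, int n)
{
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col)
            for (int i = 0; i < n; ++i)
                writers[row*n + i] += 1;
}
```

With n = 4, every counter ends up at 1 for the inner-product kernel and at 4 for attempt 1. In other words, n work-items perform an unsynchronized read-modify-write on each element of c, which is the race described above.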
There isn't a simple way around the second issue. The loop on k (in the transposed version) cannot be run in parallel. You can map either the outer loop or the inner loop to a single-dimensional NDRange in OpenCL.
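As an illustration of the first option, mapping the outer loop on i to a one-dimensional NDRange, here is a minimal sketch written as plain C so that it can be run without an OpenCL runtime. The parameter `gid` stands in for `get_global_id(0)`, and `run_ndrange_1d` is a hypothetical stand-in for enqueueing the kernel with a global size of n. In a real kernel the pointers would also carry `__global` qualifiers.

```c
/* Body of one work-item. It owns row gid of c, so no two work-items
   ever write the same element. The k and j loops stay sequential. */
void matmul_row_work_item(const float *a, const float *b, float *c,
                          int n, int gid)
{
    const int i = gid;
    for (int k = 0; k < n; ++k) {
        const float r = a[i*n + k];        /* private: one copy per work-item */
        for (int j = 0; j < n; ++j)
            c[i*n + j] += r * b[k*n + j];  /* row i of c += r * (row k of b) */
    }
}

/* Sequential emulation of a 1D NDRange with global size n.
   c is expected to be zero-initialized by the caller. */
void run_ndrange_1d(const float *a, const float *b, float *c, int n)
{
    for (int gid = 0; gid < n; ++gid)
        matmul_row_work_item(a, b, c, n, gid);
}
```

Because each work-item writes only its own row, the order in which the emulated work-items run does not matter. The trade-off is that only n work-items are available instead of n*n.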