OpenCL Matrix multiplication: inner product versus outer product


I'm hoping you are familiar with the standard "naive" method of multiplying two matrices (n x n and square, for simplicity). In C this is:

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                c[i*n + j] += a[i*n + k] * b[k*n + j];

The above method computes the dot (inner) product of a row of a with a column of b, and it is easy to implement in OpenCL as follows:

    __kernel void matmul_ocl(
        __global const float *a,
        __global const float *b,
        __global       float *c,
                 const int n)
    {
        const int row = get_global_id(1); // row
        const int col = get_global_id(0); // col

        // accumulates into c, so c must be zeroed before the kernel runs
        for (int i = 0; i < n; i++)
            c[row*n + col] += a[row*n + i] * b[i*n + col];
    }

Interchanging the two innermost loops of the original C implementation results in a method that computes outer products, i.e., it computes rank-1 updates of the rows of the c matrix:

    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
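
For reference, the two loop orders can be written as equations. This is only a restatement of the loops above, with $a_{ik}$, $b_{kj}$, $c_{ij}$ denoting the entries of a, b, and c, and $b_{k,:}$ denoting row $k$ of b (notation assumed here, not taken from the original post):

```latex
% Inner-product form (loop order i, j, k): one dot product per output entry
c_{ij} = \sum_{k=0}^{n-1} a_{ik}\, b_{kj}

% Interchanged form (loop order i, k, j): for fixed i and k, the innermost
% loop adds a scaled row of b to row i of c
c_{i,:} \leftarrow c_{i,:} + a_{ik}\, b_{k,:}, \qquad k = 0, \dots, n-1

% Collecting these updates over all rows i gives a sum of outer products
C = \sum_{k=0}^{n-1} a_{:,k}\, b_{k,:}
```

Each term $a_{:,k}\, b_{k,:}$ is the outer product of column $k$ of a with row $k$ of b, which is a rank-1 matrix.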

Does anyone know how to implement the above outer-product method in OpenCL? I have pasted two of my attempts below, but I can't seem to nail it.

Attempt 1

    __kernel void matmul_ocl(
        __global const float *a,
        __global const float *b,
        __global       float *c,
                 const int n)
    {
        const int row = get_global_id(1); // row
        const int col = get_global_id(0); // col

        __local float r;

        r = a[row*n + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int i = 0; i < n; ++i)
            c[row*n + i] += r * b[col*n + i];
    }

Attempt 2

    #define ts 1

    __kernel void matmul_ocl(
        __global const float *a,
        __global const float *b,
        __global float *c,
        int n)
    {
        // thread coordinates
        const int row = get_local_id(1); // row
        const int col = get_local_id(0); // col

        // group tile coordinates
        const int by = get_group_id(1); // row
        const int bx = get_group_id(0); // col

        a += ts*by + ts*bx*n + n*row + (col);
        b += ts*by*n + n*row + (col);
        c += ts*bx*n + n*(row) + col;

        __global const float *blast = b + n;

        // private accumulators, named acc here so they do not shadow the output pointer c
        float acc[2] = {0.0f, 0.0f};
        float *cptr = &acc[0];

        __local float bs[2];
        do
        {
            bs[0] = b[0];
            bs[1] = b[n];
            barrier(CLK_LOCAL_MEM_FENCE);

            *cptr += a[0] * bs[0];
            *cptr++ += a[0] * bs[1];

            b++;
            barrier(CLK_LOCAL_MEM_FENCE);

        } while (b < blast);

        c[0] += acc[0];
        c[1] += acc[1];
    }

The OpenCL implementation of the common algorithm maps the two outer loops onto the implicit loops of the OpenCL NDRange. This works because the two outer loops can safely run in parallel: each (row, col) pair writes a different element of c.

There are a few problems with Attempt 1:

  • The __local variable r is assigned different values by multiple work-items simultaneously. This is a race condition, so the value of r is undefined. It can be fixed by making r a private variable instead.
  • The more serious problem is a race condition in the assignment to c. Every value of col (NDRange dimension 0) runs its own loop over i in parallel, so several work-items update the same elements of c at the same time.
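
The second point can be made concrete by counting how many work-items write each element of c. The following is a small host-side C sketch, not OpenCL code: the function names are hypothetical, and the nested loops over row and col simply stand in for an n x n NDRange.

```c
#include <assert.h>
#include <string.h>

/* Count how many work-items of an n x n NDRange write each element of c
 * under the indexing used in Attempt 1: work-item (row, col) updates
 * c[row*n + i] for every i. writers must point to n*n ints. */
void count_writers_attempt1(int *writers, int n)
{
    assert(n > 0);
    memset(writers, 0, (size_t)n * n * sizeof *writers);
    for (int row = 0; row < n; ++row)          /* NDRange dimension 1 */
        for (int col = 0; col < n; ++col)      /* NDRange dimension 0 */
            for (int i = 0; i < n; ++i)        /* loop inside the kernel */
                writers[row * n + i] += 1;
}

/* Same count for the inner-product kernel: work-item (row, col) only
 * ever updates c[row*n + col]. */
void count_writers_inner(int *writers, int n)
{
    assert(n > 0);
    memset(writers, 0, (size_t)n * n * sizeof *writers);
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col)
            writers[row * n + col] += 1;
}
```

With the Attempt 1 indexing every element of c has n writers, one per value of col, while the inner-product kernel has exactly one writer per element. Unsynchronized `+=` from several work-items on the same global address is what makes the result undefined.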

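There isn't a simple way around the second issue. The loop over k (in the transposed version) cannot run in parallel, because every iteration of k accumulates into the same row of c. You can map either the outer loop or the inner loop to a one-dimensional NDRange in OpenCL, and keep the other two loops inside the kernel.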
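
As a minimal sketch of the first option (mapping only the outer loop), the following plain C emulates a one-dimensional NDRange in which each work-item owns one row of c. The function names are hypothetical, the host loop over row stands in for the NDRange, and in a real kernel row would presumably come from get_global_id(0). As in the code above, c is assumed to be zeroed beforehand.

```c
#include <assert.h>

/* Body of one hypothetical work-item: it owns row `row` of c and applies
 * the updates for that row sequentially (k outer, j inner). */
void outer_product_row(int row, const float *a, const float *b,
                       float *c, int n)
{
    for (int k = 0; k < n; ++k) {
        const float r = a[row * n + k];   /* private value, not shared */
        for (int j = 0; j < n; ++j)
            c[row * n + j] += r * b[k * n + j];
    }
}

/* Host-side stand-in for a one-dimensional NDRange of n work-items.
 * The iterations are independent because each one writes only its own
 * row of c, so they could run in any order or in parallel. */
void matmul_outer_1d(const float *a, const float *b, float *c, int n)
{
    assert(n > 0);
    for (int row = 0; row < n; ++row)
        outer_product_row(row, a, b, c, n);
}
```

The trade-off is that only n work-items are launched instead of n x n, so this exposes less parallelism than the inner-product kernel.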

