I'm working on some local/global memory optimization in OpenCL. After looking at this question from two years ago, I think I'm doing something wrong, since local memory I/O seems to be considerably slower than it should be.

This is my testing setup, with kernel source: `__kernel void vecAdd(__global float* results, const unsigned int n, __local float* loc)`

All the kernel does is take the local float array `loc` and add random values from it into a global output vector. The fragment `(i * 445) % 1024` is used to ensure the local memory is randomly accessed; without the randomization, performance is a little better (a ~30% speedup over the figure mentioned at the end).

I queued the kernel for 16777216 (16M) iterations, with a work group size of 256 and a local buffer of 1024 floats, all zeroes except l.

Overall, this makes a total of 16M * 1 = 16M writes and 16M * 1024 = 16G reads to the local memory. There are also around 16M * 1024 * 2 floating point operations, likely more depending on how the modulo is calculated, but the HD 6000 has floating point performance of around 768 GFLOPS, so that shouldn't be a bottleneck.

16G reads of float values lead to 64G of memory being read. Execution of the kernel took 453945 μs to complete, giving an estimated local memory bandwidth of 151 GB/s.