Lately I've taken an interest in CUDA and its associated programming interfaces. To find out just how powerful these video cards are at crunching large amounts of arithmetic, I ran a small test, and I thought the results were worth sharing.

The test code looks like this:

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

// includes, CUDA
#include <cuda_runtime.h>
#include <cuda.h>

// CUDA kernel: zero out every composite entry in the search array
__global__ void deviceKernel(unsigned long long *in_a, unsigned long long N)
{
    unsigned long long idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        unsigned long long x = in_a[idx];
        // round-up sqrt via double intrinsics: largest divisor worth testing
        unsigned long long smax = __double2ull_ru(__dsqrt_ru(__ull2double_ru(x)));
        for (unsigned long long i = 3; i <= smax; i += 2) {
            if (x % i == 0) {
                in_a[idx] = 0;  // composite: zero it out
                break;
            }
        }
    }
}

// MAIN
int main(int argc, char **argv)
{
    unsigned long long *search_h, *result_h;  // pointers to host memory
    unsigned long long *search_d;             // pointer to device memory
    unsigned long long search_max = atoll(argv[1]);  // takes a CLI argument
    int warpSize = 32;
    unsigned long long i, N;

    N = (search_max - 1) / 2;  // search all odd numbers <= search_max
    size_t size = N * sizeof(unsigned long long);
    search_h = (unsigned long long *)malloc(size);
    result_h = (unsigned long long *)malloc(size);
    cudaMalloc((void **) &search_d, size);

    // initialization of host data: the odd numbers 3, 5, 7, ...
    for (i = 3; i <= search_max; i += 2)
        search_h[(i - 1) / 2 - 1] = i;  // setup search array

    // copy data from host to device
    cudaMemcpy(search_d, search_h, size, cudaMemcpyHostToDevice);

    int blockSize = 8 * warpSize;
    int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);
    deviceKernel<<<nBlocks, blockSize>>>(search_d, N);  // fire off the kernel!
    cudaDeviceSynchronize();  // wait for the kernel to finish

    cudaMemcpy(result_h, search_d, size, cudaMemcpyDeviceToHost);

    free(search_h);
    free(result_h);
    cudaFree(search_d);
    return 0;
}

Pretty straightforward. The code block-copies a search list of odd numbers to the device, and block-copies it back after all the work is done. This is obviously not the best way to do things, but for testing purposes the copy operations are not the limiting factor.

So, what we see here is roughly O(N log N) performance for prime finding. What amazes me is that this code takes only one second to find every prime number less than ten million!

The graph shows that setup and copy operations are the limiting factor up to about one million candidate numbers. From there, this naive implementation takes off, crunching an absolutely obscene amount of arithmetic, on unsigned long longs no less!

The source for the test is here: source . Sorry, I hacked up the graph manually.