CUDA Tests

Lately i've taken an interest in CUDA and it's associated programming interfaces. To test out just out powerful these video cards are for crunching large amounts of arithmetic, and thought the results needed to be shared.

The test code looks like this:

// includes, system
#include <stdlib.h>
// includes CUDA
#include <cuda_runtime.h>
#include <stdio.h>
#include <assert.h>
#include <cuda.h>
__global__ void deviceKernel(unsigned long long *in_a, unsigned long long N) {
	unsigned long long idx = blockIdx.x*blockDim.x + threadIdx.x;
	if (idx<N) {
		unsigned long long x = in_a[idx]; 
		unsigned long long smax = __double2ull_ru(__dsqrt_ru(__ull2double_ru(x)));
		int i;
		for (i=3;i<=smax;i+=2) {
			if ( x % i == 0 ) {
				in_a[idx] = 0;
int main(int argc, char **argv) {
	unsigned long long *search_h, *result_h;           // pointers to host memory
	unsigned long long *search_d;                 // pointer to device memory
	int search_max = atoll(argv[1]);	    // takes a cli argument
	int warpSize = 32;
	unsigned long long i, N ;
	N = ( search_max - 1 ) / 2;		    // search all odd numbers < search_max
	size_t size = N*sizeof(unsigned long long);
	search_h = (unsigned long long *)malloc(size);
	result_h = (unsigned long long *)malloc(size);
	cudaMalloc((void **) &search_d, size);
	// initialization of host data
	for (i=3; i<=search_max; i+=2) search_h[(i-1)/2 -1] = i;   // setup search array
	// copy data from host to device
	cudaMemcpy(search_d, search_h, sizeof(unsigned long long)*N, cudaMemcpyHostToDevice);
	int blockSize = 8*warpSize;
	int nBlocks = N/blockSize + (N%blockSize == 0?0:1); 
	deviceKernel <<< nBlocks, blockSize >>> (search_d, N);	// fire off the kernel!
	cudaThreadSynchronize();	// wait for the kernel to finish
	cudaMemcpy(result_h, search_d, sizeof(unsigned long long)*N, cudaMemcpyDeviceToHost);
	free(search_h); free(result_h); cudaFree(search_d); 
	return 1;

Pretty straight forward. This code block copies a search list on the device, and block copies it back after all work is done. This is pretty obviously not the best way to do things, but for testing purposes, the copy operations are not the limiting factor.

Test results

So, what we see here is O(NlogN) performance for prime finding. What's amazing to me is that this code takes only one second to find every prime number less than ten million!
The graph shows that the runtime and copy operations are the limiting factor up to about one million prime numbers. From there, this naive implementation takes off, crunching an absolutely obscene amount of arithmetic, on "unsigned long long"'s no less!

The source for the test is here: source . Sorry, i hacked up the graph manually.