CUDA is a general-purpose parallel computing architecture.
CUDA was introduced by Nvidia in November 2006.
At the core of CUDA are three key abstractions:
  • a hierarchy of thread groups,
  • shared memories, and
  • barrier synchronization

Host and device

A CUDA program assumes a system consisting of a host (the CPU) and a device (the GPU).
The host and the device each have their own separate memory.

Device memory

Memory on the device can be allocated either as
  • linear memory, or as
  • CUDA arrays (which are memory layouts optimized for texture fetching)
Linear device memory can be allocated with cudaMalloc(), cudaMallocPitch() or cudaMalloc3D().
After linear memory has been allocated, data can be copied from host memory to device memory with cudaMemcpy().
When the device memory is no longer needed, it can be freed with cudaFree().
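A minimal sketch of the allocate / copy / free cycle for linear memory. The helper name roundTrip is hypothetical (it is not part of CUDA); only cudaMalloc(), cudaMemcpy() and cudaFree() are runtime functions.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: copies n ints from src to the device and back into dst.
// Returns true if every CUDA call succeeded.
bool roundTrip(const int *src, int *dst, size_t n) {
    int *devBuf = nullptr;
    size_t bytes = n * sizeof(int);

    // Allocate linear memory on the device.
    if (cudaMalloc(&devBuf, bytes) != cudaSuccess) return false;

    // Host -> device, then device -> host.
    bool ok =
        cudaMemcpy(devBuf, src, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dst, devBuf, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    // Free the device memory once it is no longer needed.
    cudaFree(devBuf);
    return ok;
}
```

Called with two host arrays of the same length, for example roundTrip(in, out, 4), the content of in should end up in out after having travelled through device memory.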

Verifying the presence of a CUDA-Capable GPU

$ lspci | grep -i nvidia
The PCI ID database might need to be updated with update-pciids before the lspci command is executed.


CUDA C makes it possible to write programs that are executed on a GPU. It consists of a minimal set of extensions to the C language and a runtime library.
A CUDA C program is compiled with nvcc.


A kernel is a function that is executed on the device in parallel by CUDA threads.
A kernel is normally launched from the host (compare with functions declared with __device__, which also run on the device but can be called from device code only).
Kernels can be written either in
  • PTX (the CUDA instruction set), or
  • C
In a CUDA program, a kernel is defined using the keyword __global__.
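A small sketch that contrasts __global__ and __device__. All function names (square, squareAll, squareOnDevice) and the block size of 128 threads are assumptions chosen for illustration, and n is assumed to be greater than zero.

```cuda
#include <cuda_runtime.h>

// Runs on the device and can only be called from device code.
__device__ int square(int x) {
    return x * x;
}

// Kernel: each CUDA thread squares one element of the array.
__global__ void squareAll(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}

// Hypothetical host-side helper: squares n ints (n > 0) in place.
bool squareOnDevice(int *hostData, int n) {
    int *dev = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, hostData, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 128;                        // threads per block (assumed)
        int blocks  = (n + threads - 1) / threads; // enough blocks to cover n
        squareAll<<<blocks, threads>>>(dev, n);    // launch from the host

        // A failed launch is reported by cudaGetLastError(); the copy back
        // waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hostData, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

The host only ever calls squareAll (through the <<< >>> launch syntax); square is reachable from device code alone.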

Shared memory

Shared memory is on-chip and is therefore expected to be much faster than global memory.
Shared memory is allocated using the __shared__ qualifier.
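A sketch of a kernel that stages data in shared memory, assuming a single block of exactly BLOCK threads. The names BLOCK, reverseBlock and reverseOnDevice are made up for this example.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64   // assumed block size and array length for this sketch

// Reverses an array of exactly BLOCK ints, using shared memory as staging area.
__global__ void reverseBlock(int *data) {
    __shared__ int tile[BLOCK];   // one instance per thread block
    int t = threadIdx.x;

    tile[t] = data[t];
    __syncthreads();              // wait until every thread has written its element
    data[t] = tile[BLOCK - 1 - t];
}

// Hypothetical host-side helper: reverses BLOCK ints in place.
bool reverseOnDevice(int *hostData) {
    int *dev = nullptr;
    size_t bytes = BLOCK * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, hostData, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseBlock<<<1, BLOCK>>>(dev);   // one block of BLOCK threads
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hostData, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

The __syncthreads() barrier matters here: without it, a thread could read an element of tile that another thread of the block has not yet written.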

Runtime / CUDA driver API

The runtime provides C functions that execute on the host and make it possible to
  • allocate and deallocate device memory
  • copy data between host and device memory
  • manage systems with multiple devices
  • etc.
The runtime is built on top of the CUDA driver API.
An application can also use the CUDA driver API directly.
The CUDA driver API exposes low-level functionality such as
  • CUDA contexts (similar to processes in an operating system)
  • CUDA modules (similar to dynamically loaded libraries)
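A rough sketch of the runtime being used to walk over several devices. The helper name reportDevices is hypothetical; cudaRuntimeGetVersion(), cudaDriverGetVersion(), cudaGetDeviceCount(), cudaSetDevice() and cudaMemGetInfo() are runtime functions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the runtime and driver versions and the free and
// total memory of each device. Returns the number of devices, or -1 on error.
int reportDevices() {
    int runtimeVersion = 0, driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);
    cudaDriverGetVersion(&driverVersion);
    printf("Runtime version: %d, driver version: %d\n", runtimeVersion, driverVersion);

    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    for (int dev = 0; dev < count; ++dev) {
        // Subsequent runtime calls of this host thread refer to device dev.
        if (cudaSetDevice(dev) != cudaSuccess) return -1;

        size_t freeBytes = 0, totalBytes = 0;
        if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) return -1;

        printf("Device %d: %zu of %zu bytes free\n", dev, freeBytes, totalBytes);
    }
    return count;
}
```

No context or module is created explicitly in this sketch. With the runtime, that bookkeeping happens behind the scenes, which is the main convenience compared to using the driver API directly.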

CUDA programs etc.

Error message: the provided PTX was compiled with an unsupported toolchain.

When I started playing with CUDA and nvcc, I got the error message "the provided PTX was compiled with an unsupported toolchain." when I compiled the following program with nvcc -o err and executed it:
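#include <stdio.h>

__global__ void testKernel(int nr) {
   printf("kernel received parameter '%d'\n", nr);
}

int main(void) {

   testKernel<<< 1,1 >>>(42);

   // A failed kernel launch is reported by cudaGetLastError()
   cudaError_t error = cudaGetLastError();

   if(error != cudaSuccess) {
      printf("CUDA error: %s\n", cudaGetErrorString(error));
   }

   // Wait for the kernel to finish so that its printf output is flushed
   cudaDeviceSynchronize();

   return 0;
}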
When I used the compiler flag -arch=native, the error went away:
$ nvcc -arch=native -o err
$ ./err
kernel received parameter '42'

The following CUDA program shows some device properties (as found in the cudaDeviceProp struct).
#include <stdio.h>
#include <cuda_runtime.h>

int main() {

   // Ask the runtime how many CUDA devices are present
   int deviceCount = 0;
   cudaError_t error = cudaGetDeviceCount(&deviceCount);

   if (error != cudaSuccess) {
      printf("CUDA error: %s\n", cudaGetErrorString(error));
      return 1;
   }

   for (int dev = 0; dev < deviceCount; ++dev) {
      cudaDeviceProp prp;
      cudaGetDeviceProperties(&prp, dev);

      // Memory sizes are of type size_t and are printed with %zu
      printf("Device %d (%s)\n", dev, prp.name);
      printf("  Compute capability: %d.%d.\n", prp.major, prp.minor);
      printf("  Multiprocessors:    %d\n"    , prp.multiProcessorCount);
      printf("  Concur. kernels:    %d\n"    , prp.concurrentKernels);
      printf("  32-bit regs/block:  %d\n"    , prp.regsPerBlock);
      printf("  Shared mem/block:   %zu\n"   , prp.sharedMemPerBlock);
      printf("  L2 cache size:      %d\n"    , prp.l2CacheSize);
      printf("  Global memory:      %zu\n"   , prp.totalGlobalMem);
//    printf("  ? :  %d\n"    , prp.reservedSharedMemPerBlock );
   }

   return 0;
}


