CUDA (Compute Unified Device Architecture) is a parallel computing platform for general-purpose computation on Nvidia GPUs. Most modern deep learning runs on CUDA, and libraries like PyTorch and TensorFlow support it out of the box.

Essentially, it simplifies the interface between our programs and the GPU for non-graphics work.

Core idea: CPUs handle parallel work well only when the number of parallel tasks is small. In deep learning, matrices and vectors can have millions of elements, so GPUs are preferable because they’re built to run many operations in parallel.

Core workflow

Our workflow looks like this (sketched in code below):

  • Host preparation — the host (CPU) prepares instructions for CUDA and sets up the input data in its own memory.
  • Data transfer — the input data is copied from host memory to device (GPU) memory.
  • Kernel launch — the host directs the device to execute a kernel, and the GPU schedules and runs it across many threads.
  • Post-processing — results are transferred back to the host for further processing.

https://www.dailydoseofds.com/implementing-massively-parallelized-cuda-programs-from-scratch-using-cuda-programming/
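
As a concrete illustration of these four steps, here is a minimal PyTorch sketch. It assumes a CUDA-capable GPU is available; the tensor names and sizes are arbitrary, and the kernel launch happens implicitly when PyTorch dispatches the matrix multiplication to the device.

```python
import torch

# Host preparation: build the input data in CPU (host) memory.
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

# Data transfer: copy the tensors from host memory to device (GPU) memory.
device = torch.device("cuda")  # assumes a CUDA-capable GPU is present
a_gpu = a.to(device)
b_gpu = b.to(device)

# Kernel launch: PyTorch dispatches a CUDA matmul kernel on the GPU.
c_gpu = a_gpu @ b_gpu

# Post-processing: copy the result back to host memory for further work.
c = c_gpu.cpu()
print(c.shape)  # torch.Size([1024, 1024])
```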

Using CUDA

Most deep learning libraries support parallelisation with CUDA.

  • PyTorch exposes an interface that requires explicitly moving objects (tensors and models) into GPU memory; see the sketch below.
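
A minimal sketch of that pattern, assuming a small linear model and batch whose sizes are chosen purely for illustration:

```python
import torch

# Pick the GPU if CUDA is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(16, 4).to(device)  # move the model's parameters to the device
x = torch.randn(8, 16, device=device)      # create the input batch directly on the device

y = model(x)       # the forward pass runs CUDA kernels when device is "cuda"
print(y.device)    # e.g. cuda:0
```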

How can we make sure our workflows are actually using CUDA? We can run nvidia-smi in a terminal while a batch is running and watch the GPU utilisation. On Windows, we can also check in Task Manager: in the GPU view, switch one of the graph panels to CUDA.
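
We can also check from Python itself. The calls below are standard torch.cuda utilities; the device index 0 is just an assumption that the first GPU is the one in use.

```python
import torch

# Quick programmatic checks that CUDA is visible and in use.
print(torch.cuda.is_available())        # True if a CUDA device is detected
print(torch.cuda.get_device_name(0))    # name of GPU 0 (assumes at least one GPU)
print(torch.cuda.memory_allocated(0))   # bytes of GPU 0 memory currently allocated by PyTorch
```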