GPU Basics: OpenACC Tutorial

OpenACC is a programming tool for parallel computing designed to simplify parallel programming of heterogeneous CPU/GPU systems. It uses sets of compiler directives to enable the compiler to generate CUDA code for the GPU. Most of the compiler directives are entered in the user’s code as comments. This example shows a minimal conversion of a vector addition CPU code to an OpenACC accelerator-directives version. Consider this an OpenACC ‘Hello World’. Modifications from the CPU version are briefly discussed. This tutorial will also show you how to compile, run, and submit jobs for CADES GPU nodes.

Prepare Your Programming Environment

CADES uses modules to control the paths for your software. For example, if no modules are loaded, LD_LIBRARY_PATH, the default library path checked for available dynamic and shared libraries, is set to this default:

$ echo $LD_LIBRARY_PATH
/software/dev_tools/swtree/cs400_centos7.2_pe2016-08/openmpi/1.10.3/centos7.2_pgi15.7.0/lib

However, if the pgi/19.4 module is loaded, LD_LIBRARY_PATH is reset to hold the pgi/19.4 libraries.

$ module load pgi/19.4
$ echo $LD_LIBRARY_PATH
/software/tools/compilers/pgi/2019.194/linux86-64/19.4/lib:/software/dev_tools/swtree/cs400_centos7.2_pe2016-08/openmpi/1.10.3/centos7.2_pgi15.7.0/lib

The command ‘module show pgi/19.4’ will show you what other paths the module sets.

As a best practice on the CADES SHPC condo, always clear your modules before compiling or running code. This is done with ‘module purge’. To use OpenACC on the CADES Open SHPC condo, you will need to load the pgi/19.4 compiler module.

Prepare your environment:

$ module purge
$ module load pgi/19.4

VecAdd.c

VecAdd.c adds two vectors A and B to produce C, where Ci = Ai + Bi. This version is by Adam Simpson and can be downloaded from the OLCF GitHub.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

    // Size of vectors
    int n = 10000;

    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }  

    // sum component wise and save result into vector c
    #pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0.0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

Changes to VecAdd.c

There are two main differences in this code that allow it to vectorize well and run on the GPU.

  1. The restrict keyword:

// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;

The restrict keyword is a non-enforced guarantee to the compiler that the pointers are not aliased. This is not required by the OpenACC standard but in some cases allows the compiler to better vectorize the code.
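
As a hypothetical illustration (not part of vecAdd.c), the sketch below contrasts a version of the addition loop without restrict, where the compiler must assume the output pointer could overlap the inputs, with a restrict-qualified version it is free to vectorize. The function names add_may_alias and add_no_alias are made up for this example.

// Hypothetical sketch, not part of vecAdd.c: without restrict the compiler
// must assume out could overlap a or b, so it is cautious about reordering
// or vectorizing the loop.
void add_may_alias(double *out, double *a, double *b, int n)
{
    for(int i=0; i<n; i++) {
        out[i] = a[i] + b[i];
    }
}

// With restrict, the programmer promises the three arrays do not overlap,
// which lets the compiler vectorize or parallelize the loop freely.
void add_no_alias(double *restrict out, double *restrict a,
                  double *restrict b, int n)
{
    for(int i=0; i<n; i++) {
        out[i] = a[i] + b[i];
    }
}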

  2. The pragma:
// sum component wise and save result into vector c
#pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}

The compiler will analyze the code following the pragma to see if it can be run on the GPU. Regions of code where many operations can be done at once, without depending on each other, are good candidates for the GPU.
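
For contrast, a loop with a loop-carried dependence is not a good candidate. The sketch below (not part of vecAdd.c) is a running sum in which every iteration needs the previous iteration's result, so the iterations cannot simply run at the same time:

// Hypothetical counterexample: each iteration reads c[i-1], which was
// written by the previous iteration, so the iterations cannot all run
// at once without restructuring the algorithm (e.g., as a parallel scan).
for(i=1; i<n; i++) {
    c[i] = c[i-1] + a[i];
}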

In this case, the acc kernels directive makes the separate iterations of the loop run in parallel on the GPU. Memory is copied from the CPU to the GPU at the start of the loop and back from the GPU to the CPU at the end of the loop.

There are many different OpenACC directives that can be used to move code to the GPU. This is just one example. To see more, go to https://www.openacc.org.
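
As one illustration (a sketch, not taken from the tutorial code), the same loop could also be written with the parallel loop directive, where the programmer asserts that the loop is safe to parallelize instead of leaving the analysis to the compiler:

// Sketch of an alternative to the kernels directive: parallel loop tells
// the compiler the iterations are independent rather than asking it to
// prove that on its own.
#pragma acc parallel loop copyin(a[0:n],b[0:n]) copyout(c[0:n])
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}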

Compiling VecAdd.c with OpenACC

To compile this code with the pgcc compiler and OpenACC, use the -acc flag:

$ module purge
$ module load pgi/19.4
$ pgcc -acc vecAdd.c -o vecAdd.o

Minfo

The PGI compiler's -Minfo flag allows you to see what the compiler is doing with your code.

Recompile with the -Minfo flag:

$ pgcc -acc -Minfo vecAdd.c -o vecAdd.o
main:
     32, Generating copyin(a[:n])
         Generating copyout(c[:n])
         Generating copyin(b[:n])
     33, Loop is parallelizable
         Generating Tesla code
         33, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Running

Your CADES home directory sits on the Network File System (NFS). HPC code needs to run in the Lustre parallel file system.

Copy vecAdd.c and vecAdd.o to your Lustre directory. Below, the CADES Open Lustre file system and the Birthright condo are used as an example.

cp vecAdd.c /lustre/or-scratch/cades-birthright/user_id
cp vecAdd.o /lustre/or-scratch/cades-birthright/user_id

One way to run this is to start an interactive job on the compute nodes with srun:

$ srun -A birthright -p gpu -N 1 -n 1 -c 2 --gres=gpu:2 --mem=8G -t 1:00:00 --pty /bin/bash
srun: job 7034 queued and waiting for resources

Note that the partition for using the GPU must be specified with -p gpu, and --gres=gpu:2 must also be specified.

When the job starts:

$ module purge
$ module load pgi/19.4
$ mpirun ./vecAdd.o

final result: 1.000000

Did the code really run on the GPU?

There are a few ways to check if your job really ran on the GPU.

nvprof

nvprof is NVIDIA's built-in profiler. It will show you that your code is running on the GPU and also give you performance information about the code.

To use nvprof, issue:

mpirun nvprof ./vecAdd.o

If the code ran on the GPU, you will get a result like this:

==119959== NVPROF is profiling process 119959, command: ./vecAddC.o
final result: 1.000000
==119959== Profiling application: ./vecAddC.o
==119959== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 61.77%  27.040us         2  13.520us  12.320us  14.720us  [CUDA memcpy HtoD]
 29.17%  12.768us         1  12.768us  12.768us  12.768us  [CUDA memcpy DtoH]
  9.06%  3.9680us         1  3.9680us  3.9680us  3.9680us  main_33_gpu

==119959== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 59.74%  139.67ms         1  139.67ms  139.67ms  139.67ms  cuDevicePrimaryCtxRetain
 21.15%  49.448ms         1  49.448ms  49.448ms  49.448ms  cuDevicePrimaryCtxRelease
 15.39%  35.987ms         1  35.987ms  35.987ms  35.987ms  cuMemHostAlloc
  3.19%  7.4473ms         1  7.4473ms  7.4473ms  7.4473ms  cuMemFreeHost
  0.21%  487.30us         1  487.30us  487.30us  487.30us  cuMemAllocHost
  0.20%  458.15us         4  114.54us  5.1670us  219.86us  cuMemAlloc
  0.04%  97.782us         1  97.782us  97.782us  97.782us  cuModuleLoadData
  0.03%  62.684us         1  62.684us  62.684us  62.684us  cuLaunchKernel
  0.02%  54.570us         2  27.285us  7.5190us  47.051us  cuMemcpyHtoDAsync
  0.01%  23.821us         3  7.9400us  2.7650us  12.937us  cuStreamSynchronize
  0.01%  12.297us         1  12.297us  12.297us  12.297us  cuMemcpyDtoHAsync
  0.00%  9.2830us         1  9.2830us  9.2830us  9.2830us  cuStreamCreate
  0.00%  7.6830us         3  2.5610us     675ns  3.5170us  cuPointerGetAttributes
  0.00%  6.6520us         2  3.3260us  1.1020us  5.5500us  cuEventCreate
  0.00%  5.8830us         1  5.8830us  5.8830us  5.8830us  cuEventRecord
  0.00%  4.2680us         3  1.4220us     177ns  3.6380us  cuDeviceGetCount
  0.00%  4.2390us        20     211ns     103ns     777ns  cuDeviceGetAttribute
  0.00%  4.1070us        12     342ns     108ns     863ns  cuDeviceGet
  0.00%  3.8610us         6     643ns     124ns  1.7480us  cuCtxSetCurrent
  0.00%  3.2310us         1  3.2310us  3.2310us  3.2310us  cuEventSynchronize
  0.00%  3.1580us         1  3.1580us  3.1580us  3.1580us  cuModuleGetFunction
  0.00%     869ns         4     217ns     106ns     507ns  cuDeviceComputeCapability
  0.00%     331ns         1     331ns     331ns     331ns  cuDriverGetVersion
  0.00%     261ns         1     261ns     261ns     261ns  cuCtxGetCurrent

==119959== OpenACC (excl):
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.76%  36.448ms         1  36.448ms  36.448ms  36.448ms  acc_enter_data@vecAdd.c:32
  0.34%  124.50us         1  124.50us  124.50us  124.50us  acc_device_init@vecAdd.c:32
  0.19%  69.908us         1  69.908us  69.908us  69.908us  acc_enqueue_launch@vecAdd.c:33 (main_33_gpu)
  0.19%  68.985us         2  34.492us  9.3040us  59.681us  acc_enqueue_upload@vecAdd.c:32
  0.16%  58.884us         1  58.884us  58.884us  58.884us  acc_wait@vecAdd.c:37
  0.14%  52.340us         1  52.340us  52.340us  52.340us  acc_exit_data@vecAdd.c:32
  0.11%  39.090us         1  39.090us  39.090us  39.090us  acc_enqueue_download@vecAdd.c:37
  0.05%  17.071us         1  17.071us  17.071us  17.071us  acc_compute_construct@vecAdd.c:32
  0.04%  16.552us         1  16.552us  16.552us  16.552us  acc_wait@vecAdd.c:32
  0.03%  9.5830us         1  9.5830us  9.5830us  9.5830us  acc_wait@vecAdd.c:33
  0.00%       0ns         3       0ns       0ns       0ns  acc_delete@vecAdd.c:37
  0.00%       0ns         3       0ns       0ns       0ns  acc_alloc@vecAdd.c:32
  0.00%       0ns         3       0ns       0ns       0ns  acc_create@vecAdd.c:32

Note that in the first line of the timing information, you can see that about 62% of the code’s running time was spent copying memory from the "CPU host" to the "GPU device". To really make code run faster on the GPU, data should be moved to and from the GPU only as much as strictly needed.
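
One common way to cut down on transfers is to wrap several compute regions in an acc data region so the arrays stay resident on the GPU between them. The sketch below is a hypothetical extension of vecAdd.c, not part of the tutorial code:

// Sketch, assuming a second loop reuses c: the data region copies a and b
// to the GPU once and copies c back once, instead of transferring data
// around each loop separately.
#pragma acc data copyin(a[0:n],b[0:n]) copyout(c[0:n])
{
    #pragma acc kernels
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }

    #pragma acc kernels
    for(i=0; i<n; i++) {
        c[i] = 2.0 * c[i];    // reuses c already on the device
    }
}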

Running OpenACC code with a batch script

A batch script is the most convenient way to submit code to the compute nodes. The following is an example of submitting vecAdd.o to the GPU using the Birthright condo.

run_vecadd.sbatch

#!/bin/bash

#SBATCH -A birthright
#SBATCH -p gpu
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --gres=gpu:2
#SBATCH -J gpu-test-job
#SBATCH --mem=1G
#SBATCH -t 10:00
#SBATCH -o ./%j-gpu-output.txt
#SBATCH -e ./%j-gpu-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>

module purge
module load pgi/19.4
mpirun ./vecAdd.o

The --gres=gpu:2 option above is used to specify the number of GPU resources your job needs. An additional specifier can be added to indicate the type of GPU that is needed if there are multiple models available. To specify K80 GPUs, the above option would change to --gres=gpu:k80:2.

New Arguments

--gres=gpu:2 : Reserve nodes with the specified GPU resources

Submit the job script

To submit your job script to the compute nodes, issue:

sbatch run_vecadd.sbatch