GPU Basics: OpenACC Tutorial
OpenACC is a directive-based programming standard for parallel computing designed to simplify parallel programming of heterogeneous CPU/GPU systems. It uses sets of compiler directives that enable the compiler to generate CUDA code for the GPU. Most of the directives are entered in the user's code as pragmas or comments. This example shows a minimal conversion of a vector addition CPU code to an OpenACC accelerator directives version. Consider this an OpenACC 'Hello World'. Modifications from the CPU version will be briefly discussed. This tutorial will also show you how to compile, run, and submit jobs on CADES GPU nodes.
Prepare Your Programming Environment
CADES uses environment modules to control the paths to your software. For example, if no modules are loaded, LD_LIBRARY_PATH, the default library path checked for available dynamic and shared libraries, is set to this default:
$ echo $LD_LIBRARY_PATH
/software/dev_tools/swtree/cs400_centos7.2_pe2016-08/openmpi/1.10.3/centos7.2_pgi15.7.0/lib
However, if the pgi/19.4 module is loaded, LD_LIBRARY_PATH is reset to hold the pgi/19.4 libraries.
$ module load pgi/19.4
$ echo $LD_LIBRARY_PATH
/software/tools/compilers/pgi/2019.194/linux86-64/19.4/lib:/software/dev_tools/swtree/cs400_centos7.2_pe2016-08/openmpi/1.10.3/centos7.2_pgi15.7.0/lib
Running ‘module show pgi/19.4’ will show you what other paths the module sets.
As a best practice on the CADES SHPC Condo, always clear your modules before compiling or running code. This is done with ‘module purge’. To use OpenACC on the CADES Open SHPC Condo, you will need the pgi/19.4 compiler module.
Prepare your environment:
$ module purge
$ module load pgi/19.4
vecAdd.c
vecAdd.c adds two vectors, A and B, to produce C, where Ci = Ai + Bi. This version is by Adam Simpson and can be downloaded from the OLCF GitHub.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 10000;
    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;
    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);
    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);
    // Initialize content of input vectors: a[i] = sin(i)^2, b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }
    // Sum component-wise and save result into vector c
    #pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
    // Sum up vector c and print result divided by n; this should equal 1 within error
    double sum = 0.0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);
    // Release memory
    free(a);
    free(b);
    free(c);
    return 0;
}
Changes to vecAdd.c
There are two main differences in this code that allow it to vectorize well and run on the GPU.
1. The restrict keyword:
// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;
The restrict keyword is a non-enforced guarantee to the compiler that the pointers are not aliased. This is not required by the OpenACC standard but in some cases allows the compiler to better vectorize the code.
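As an illustrative sketch (not part of the tutorial code), the hypothetical function below shows what that guarantee means: because out and in are declared restrict, the compiler may assume that writing out[i] never changes a later in[i], so it is free to vectorize the loop.
// Sketch only: 'scale' is a hypothetical helper used to illustrate restrict.
// Without restrict, the compiler would have to assume 'out' and 'in' might
// overlap in memory, which limits vectorization of the loop.
void scale(double *restrict out, const double *restrict in, int n)
{
    for(int i=0; i<n; i++) {
        out[i] = 2.0 * in[i];
    }
}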
2. The pragma:
// Sum component-wise and save result into vector c
#pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}
The compiler will analyze the code following a pragma to see if it can be run on the GPU. Areas of code where many operations can be done at once, without depending on each other, are good candidates for the GPU.
In this case, the acc kernels directive makes the separate iterations of the loop run in parallel on the GPU; memory is copied from the CPU to the GPU at the start of the loop and back from the GPU to the CPU at the end of the loop.
There are many different OpenACC directives that can be used to move code to the GPU; this is just one example. To see more, go to https://www.openacc.org.
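For contrast, here is a sketch (not from the tutorial) of a loop that is a poor candidate: each iteration depends on the result of the previous one, so the iterations cannot safely run in parallel.
// Sketch: a loop-carried dependence; iteration i reads c[i-1],
// so the iterations cannot be executed independently on the GPU.
for(i=1; i<n; i++) {
    c[i] = c[i-1] + a[i];
}
As one alternative to the kernels directive, the same vector addition could be written with the parallel loop directive. This is a minimal sketch for comparison, assuming the same arrays and data clauses as above:
// Sketch: 'parallel loop' with explicit data clauses; for this simple loop
// it behaves the same as the 'kernels' version shown earlier.
#pragma acc parallel loop copyin(a[0:n],b[0:n]) copyout(c[0:n])
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}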
Compiling vecAdd.c with OpenACC
To compile this code with the pgcc compiler and OpenACC, use the -acc flag:
$ module purge
$ module load pgi/19.4
$ pgcc -acc vecAdd.c -o vecAdd.o
Minfo
The PGI compiler's -Minfo flag allows you to see what the compiler is doing with your code.
Recompile with the -Minfo flag:
$ pgcc -acc -Minfo vecAdd.c -o vecAdd.o
main:
32, Generating copyin(a[:n])
Generating copyout(c[:n])
Generating copyin(b[:n])
33, Loop is parallelizable
Generating Tesla code
33, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Running
Your CADES home directory sits on the Network File System (NFS). HPC code should be run from the Lustre parallel file system.
Copy vecAdd.c and vecAdd.o to your Lustre directory. In the example below, the CADES Open Lustre scratch area and the Birthright condo are used.
cp vecAdd.c /lustre/or-scratch/cades-birthright/user_id
cp vecAdd.o /lustre/or-scratch/cades-birthright/user_id
One way to run this is to start an interactive job on the compute nodes with srun:
$ srun -A birthright -p gpu -N 1 -n 1 -c 2 --gres=gpu:2 --mem=8G -t 1:00:00 --pty /bin/bash
srun: job 7034 queued and waiting for resources
Note that the partition for using the GPUs must be specified with -p gpu, and --gres=gpu:2 must also be specified.
When the job starts:
$ module purge
$ module load pgi/19.4
$ mpirun ./vecAdd.o
final result: 1.000000
Did the code really run on the GPU?
There are a few ways to check if your job really ran on the GPU.
nvprof
nvprof is NVIDIA's built-in profiler. It will show you whether your code ran on the GPU and also give you performance information about the code.
To use nvprof, issue:
mpirun nvprof ./vecAdd.o
If the code ran on the GPU, you will get a result like this:
==119959== NVPROF is profiling process 119959, command: ./vecAddC.o
final result: 1.000000
==119959== Profiling application: ./vecAddC.o
==119959== Profiling result:
Time(%) Time Calls Avg Min Max Name
61.77% 27.040us 2 13.520us 12.320us 14.720us [CUDA memcpy HtoD]
29.17% 12.768us 1 12.768us 12.768us 12.768us [CUDA memcpy DtoH]
9.06% 3.9680us 1 3.9680us 3.9680us 3.9680us main_33_gpu
==119959== API calls:
Time(%) Time Calls Avg Min Max Name
59.74% 139.67ms 1 139.67ms 139.67ms 139.67ms cuDevicePrimaryCtxRetain
21.15% 49.448ms 1 49.448ms 49.448ms 49.448ms cuDevicePrimaryCtxRelease
15.39% 35.987ms 1 35.987ms 35.987ms 35.987ms cuMemHostAlloc
3.19% 7.4473ms 1 7.4473ms 7.4473ms 7.4473ms cuMemFreeHost
0.21% 487.30us 1 487.30us 487.30us 487.30us cuMemAllocHost
0.20% 458.15us 4 114.54us 5.1670us 219.86us cuMemAlloc
0.04% 97.782us 1 97.782us 97.782us 97.782us cuModuleLoadData
0.03% 62.684us 1 62.684us 62.684us 62.684us cuLaunchKernel
0.02% 54.570us 2 27.285us 7.5190us 47.051us cuMemcpyHtoDAsync
0.01% 23.821us 3 7.9400us 2.7650us 12.937us cuStreamSynchronize
0.01% 12.297us 1 12.297us 12.297us 12.297us cuMemcpyDtoHAsync
0.00% 9.2830us 1 9.2830us 9.2830us 9.2830us cuStreamCreate
0.00% 7.6830us 3 2.5610us 675ns 3.5170us cuPointerGetAttributes
0.00% 6.6520us 2 3.3260us 1.1020us 5.5500us cuEventCreate
0.00% 5.8830us 1 5.8830us 5.8830us 5.8830us cuEventRecord
0.00% 4.2680us 3 1.4220us 177ns 3.6380us cuDeviceGetCount
0.00% 4.2390us 20 211ns 103ns 777ns cuDeviceGetAttribute
0.00% 4.1070us 12 342ns 108ns 863ns cuDeviceGet
0.00% 3.8610us 6 643ns 124ns 1.7480us cuCtxSetCurrent
0.00% 3.2310us 1 3.2310us 3.2310us 3.2310us cuEventSynchronize
0.00% 3.1580us 1 3.1580us 3.1580us 3.1580us cuModuleGetFunction
0.00% 869ns 4 217ns 106ns 507ns cuDeviceComputeCapability
0.00% 331ns 1 331ns 331ns 331ns cuDriverGetVersion
0.00% 261ns 1 261ns 261ns 261ns cuCtxGetCurrent
==119959== OpenACC (excl):
Time(%) Time Calls Avg Min Max Name
98.76% 36.448ms 1 36.448ms 36.448ms 36.448ms acc_enter_data@vecAdd.c:32
0.34% 124.50us 1 124.50us 124.50us 124.50us acc_device_init@vecAdd.c:32
0.19% 69.908us 1 69.908us 69.908us 69.908us acc_enqueue_launch@vecAdd.c:33 (main_33_gpu)
0.19% 68.985us 2 34.492us 9.3040us 59.681us acc_enqueue_upload@vecAdd.c:32
0.16% 58.884us 1 58.884us 58.884us 58.884us acc_wait@vecAdd.c:37
0.14% 52.340us 1 52.340us 52.340us 52.340us acc_exit_data@vecAdd.c:32
0.11% 39.090us 1 39.090us 39.090us 39.090us acc_enqueue_download@vecAdd.c:37
0.05% 17.071us 1 17.071us 17.071us 17.071us acc_compute_construct@vecAdd.c:32
0.04% 16.552us 1 16.552us 16.552us 16.552us acc_wait@vecAdd.c:32
0.03% 9.5830us 1 9.5830us 9.5830us 9.5830us acc_wait@vecAdd.c:33
0.00% 0ns 3 0ns 0ns 0ns acc_delete@vecAdd.c:37
0.00% 0ns 3 0ns 0ns 0ns acc_alloc@vecAdd.c:32
0.00% 0ns 3 0ns 0ns 0ns acc_create@vecAdd.c:32
Note that in the first line of the timing information, you can see that about 62% of the GPU time was spent copying memory from the CPU host to the GPU device. To really make code run faster using the GPU, memory should be moved to and from the GPU only as much as is strictly needed.
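One common way to reduce such transfers is an OpenACC data region, which keeps arrays resident on the GPU across several compute regions. The following is an illustrative sketch only (it assumes a second, hypothetical loop over the same arrays); it is not part of vecAdd.c:
// Sketch: the 'data' region copies a and b in once and c out once,
// even though two compute regions use the arrays on the GPU.
#pragma acc data copyin(a[0:n],b[0:n]) copyout(c[0:n])
{
    #pragma acc kernels
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
    #pragma acc kernels
    for(i=0; i<n; i++) {
        c[i] = 2.0 * c[i];   // hypothetical second operation, done on the GPU
    }
}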
Running OpenACC code with a batch script
A batch script is the most convenient way to submit code to the compute nodes. The following is an example of submitting vecAdd.o to the GPU nodes using the Birthright condo.
run_vecadd.sbatch
#!/bin/bash
#SBATCH -A birthright
#SBATCH -p gpu
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --gres=gpu:2
#SBATCH -J gpu-test-job
#SBATCH --mem=1G
#SBATCH -t 10:00
#SBATCH -o ./%j-gpu-output.txt
#SBATCH -e ./%j-gpu-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>
module purge
module load pgi/19.4
mpirun ./vecAdd.o
The --gres=gpu:2 option above is used to specify the number of GPU resources your job needs. An additional specifier can be added to indicate the type of GPU that is needed if there are multiple models available. To specify K80 GPUs, the option above would change to --gres=gpu:k80:2.
New Arguments
--gres=gpu:2 : Reserve nodes with the specified GPU resources
Submit the job script
To submit your job script to the compute nodes, issue:
sbatch run_vecadd.sbatch