
Compile and run an OpenMP GPU application

Compile an OpenMP target offload code

For this tutorial, we will consider the basic saxpy code. The source code of this example is presented below, and we assume that the source file is named saxpy_gpu.c.

Source code for this example

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

void saxpy(int n, float a, float *x, float *y) {
  double elapsed = -1.0 * omp_get_wtime();

  // We don't need to map the variable a as scalars are firstprivate by default
  #pragma omp target teams distribute parallel for \
                     map(to:x[0:n]) map(tofrom:y[0:n])
  for(int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
  }

  elapsed += omp_get_wtime();
  printf("saxpy done in %6.3lf seconds.\n", elapsed);
}

int main() {
  int n = 2000000;
  float *x = (float*) malloc(n*sizeof(float));
  float *y = (float*) malloc(n*sizeof(float));
  float alpha = 2.0;

  #pragma omp parallel for
  for (int i = 0; i < n; i++) {
     x[i] = 1;
     y[i] = i;
  }

  saxpy(n, alpha, x, y);

  free(x);
  free(y);

  return 0;
}

To compile the code, we will use the Clang compiler, which was built with support for OpenMP target offload to NVIDIA GPUs. To get access to the compiler, we load the corresponding module:

module load Clang/16.0.6-GCCcore-11.3.0-CUDA-11.7.0

The code can then be compiled with the following command:

clang -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o saxpy_gpu saxpy_gpu.c

where -fopenmp-targets=nvptx64-nvidia-cuda enables OpenMP target offload for NVIDIA GPUs.

An alternative is to use the (newer) --offload-arch compiler flag to enable OpenMP target offloading.

clang -O3 -fopenmp --offload-arch=sm_80 -o saxpy_gpu saxpy_gpu.c

where we have to specify the target GPU architecture. For the Lucia NVIDIA A100 GPUs, the compute capability is 8.0, so we use sm_80 as the argument of the --offload-arch option.
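
If you do not know the compute capability of the target GPU, recent versions of nvidia-smi can report it directly (the compute_cap query field requires a fairly recent driver, and the command must be run on a node that actually has a GPU):

nvidia-smi --query-gpu=name,compute_cap --format=csv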

Note

When compiling, Clang will print the following warning:

clang-16: warning: CUDA version 11.7 is only partially supported 
[-Wunknown-cuda-version]

This warning is harmless and can be ignored.
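
If you prefer a clean build log, the warning can be silenced by disabling its diagnostic category, shown in brackets above, with the corresponding -Wno- flag:

clang -O3 -fopenmp --offload-arch=sm_80 -Wno-unknown-cuda-version -o saxpy_gpu saxpy_gpu.c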

The resulting saxpy_gpu application can be executed directly on the login node:

 $ ./saxpy_gpu
saxpy done in  0.004 seconds.

However, the login nodes do not have any GPU, so the code actually ran on the CPU. We can highlight this absence of GPUs by setting the OMP_TARGET_OFFLOAD environment variable to MANDATORY, which tells the OpenMP runtime that using a GPU is mandatory.

If we set this variable and execute on the login node, the execution fails:

 $ OMP_TARGET_OFFLOAD=MANDATORY ./saxpy_gpu
Libomptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted (core dumped)
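
Instead of relying on the runtime aborting, you can also query device availability from the code itself. The following minimal sketch (a hypothetical check_devices.c, compiled with the same flags as saxpy_gpu.c) uses the standard OpenMP routines omp_get_num_devices() and omp_is_initial_device() to report how many offload devices the runtime sees and whether a target region actually ran on one of them:

#include <stdio.h>
#include <omp.h>

int main() {
  // Number of offload devices visible to the OpenMP runtime (0 on the login nodes)
  printf("Number of devices: %d\n", omp_get_num_devices());

  int on_host = 1;
  // Without a GPU, the target region silently falls back to the host
  // (unless OMP_TARGET_OFFLOAD=MANDATORY is set, in which case the runtime aborts)
  #pragma omp target map(from: on_host)
  {
    on_host = omp_is_initial_device();
  }
  printf("The target region ran on the %s.\n", on_host ? "host" : "device");

  return 0;
}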

To execute our example on the GPU, we need to allocate a GPU on a compute node. This is described in the next section.

Running the saxpy example

Note

Lucia uses the Slurm scheduler like NIC5. As a consequence, most of the content of the Slurm chapter applies to Lucia.

To submit a job on Lucia, you need to specify which project should be "billed" for your job, which was not the case on NIC5. This is done using the --account directive. The name of the project used for the course is ulghpsc:

#SBATCH --account=ulghpsc
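
If you are not sure which accounts you are allowed to charge, Slurm's sacctmgr command can list your associations (a quick check; the exact fields shown depend on the site's Slurm configuration):

sacctmgr show associations user=$USER format=account,partition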

Warning

If you don't specify an account, the submission of your job will be denied:

sbatch: error: You must specify an account!
sbatch: error: Batch job submission failed: Invalid account or 
        account/partition combination specified

To use the GPU nodes, we need to use the gpu partition with the --partition directive and allocate a GPU with the --gpus directive.

#SBATCH --partition=gpu
#SBATCH --gpus=1
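
Before submitting, you can inspect the state of the gpu partition (node count, time limit, how many nodes are free) with the standard sinfo command:

sinfo --partition=gpu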

Warning

Using the gpu partition requires allocating a GPU to the job, otherwise the submission will be denied:

sbatch: error: QOSMinGRES
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy
(job submit limit, user's size and/or time limits)

Below is an example job script to run the saxpy example on a GPU compute node of Lucia. We store this job script in a file named lucia_gpu.job.

Source code for this example

#!/bin/bash
#SBATCH --job-name="saxpy GPU"
#SBATCH --output=saxpy_gpu.out
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --gpus=1
#SBATCH --time=15:00
#SBATCH --account=ulghpsc

module load Clang/16.0.6-GCCcore-11.3.0-CUDA-11.7.0

export OMP_TARGET_OFFLOAD=MANDATORY

./saxpy_gpu

Note that we use the OMP_TARGET_OFFLOAD environment variable to make the application fail if GPU offloading is not possible. Using this variable is not strictly required, but it is a way to be on the safe side and make sure that the application really runs on a GPU.

To submit this job, we use the sbatch command like on NIC5:

sbatch lucia_gpu.job
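
After submission, you can monitor the job with squeue (--me is available on recent Slurm versions; use -u $USER otherwise) and, once it has completed, read the result in the output file declared in the job script:

squeue --me
cat saxpy_gpu.out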

Get an interactive session on a compute node

For development purposes, when you want to quickly test your code, submitting a job to get access to a GPU might not be practical. To make your work easier, you can create an interactive session on a GPU node using the following command:

srun --partition=gpu --account=ulghpsc --time=01:00:00 --gpus=1 --pty $SHELL

where we request access to one GPU for one hour. You should see output looking like this:

srun: job XXXXXXX queued and waiting for resources
...
srun: job XXXXXXX has been allocated resources
(JOB_ID: XXXXXXX) user@cnaXXX:~ #

You should now be on a compute node. You can check that a GPU is available using the nvidia-smi command.

 $ nvidia-smi
Fri Nov 17 15:09:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   27C    P0    52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
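
From this interactive session, you can rebuild and run the example directly on the allocated GPU; with a GPU now available, the mandatory-offload run succeeds:

module load Clang/16.0.6-GCCcore-11.3.0-CUDA-11.7.0
OMP_TARGET_OFFLOAD=MANDATORY ./saxpy_gpu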

When you have finished your tests, you can end the session using the exit command.

Interactive jobs need to queue like regular jobs

Interactive jobs are put in the queue like regular jobs (submitted with sbatch). This means that if all GPU compute nodes are allocated, you will have to wait to get an allocation.