DAY 3: Reading the CUDA C Programming Interface

2021-03-02 GPUS開發者

We are leading readers through the English edition of the CUDA C Programming Guide. Today is day three, and we will spend three days studying CUDA's programming interface. We hope that over the coming 97 days you will learn CUDA straight from the source while building the habit of reading English.


3. Programming Interface

CUDA C provides a simple path for users familiar with the C programming language to easily write programs for execution by the device.

It consists of a minimal set of extensions to the C language and a runtime library.

The core language extensions were introduced in DAY 2: Reading the CUDA C Programming Guide (Programming Model). They allow programmers to define a kernel as a C function and to use some new syntax to specify the grid and block dimensions each time the function is called. Any source file that contains any of these extensions must be compiled with nvcc.
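
As a quick refresher, here is a minimal sketch of what those extensions look like in practice (the kernel, sizes, and file name are illustrative, not taken from the guide):

    // vecadd.cu -- build with: nvcc vecadd.cu -o vecadd
    #include <cuda_runtime.h>

    // Device code: the __global__ qualifier marks VecAdd as a kernel.
    __global__ void VecAdd(const float *A, const float *B, float *C, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    // Host code: the <<<grid, block>>> execution configuration is the new
    // syntax that only nvcc understands.
    int main(void)
    {
        const int N = 1024;
        float *A, *B, *C;
        cudaMalloc(&A, N * sizeof(float));
        cudaMalloc(&B, N * sizeof(float));
        cudaMalloc(&C, N * sizeof(float));

        VecAdd<<<(N + 255) / 256, 256>>>(A, B, C, N);
        cudaDeviceSynchronize();

        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }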

The runtime is introduced in CUDA C Runtime. It provides C functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, etc. A complete description of the runtime can be found in the CUDA reference manual.
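
As a sketch of that host-side runtime usage (the array size and device index are arbitrary choices, not from the guide):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const int N = 256;
        float hostData[256];
        for (int i = 0; i < N; ++i) hostData[i] = (float)i;

        // Allocate device memory, copy host -> device and back, then free.
        float *devData;
        cudaMalloc(&devData, N * sizeof(float));
        cudaMemcpy(devData, hostData, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(hostData, devData, N * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(devData);

        // Systems with multiple devices: enumerate and select one.
        int count = 0;
        cudaGetDeviceCount(&count);
        printf("%d CUDA device(s)\n", count);
        if (count > 1) cudaSetDevice(1);
        return 0;
    }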

The runtime is built on top of a lower-level C API, the CUDA driver API, which is also accessible by the application. The driver API provides an additional level of control by exposing lower-level concepts such as CUDA contexts - the analogue of host processes for the device - and CUDA modules - the analogue of dynamically loaded libraries for the device. Most applications do not use the driver API as they do not need this additional level of control and when using the runtime, context and module management are implicit, resulting in more concise code. The driver API is introduced in Driver API and fully described in the reference manual.

3.1. Compilation with NVCC

Kernels can be written using the CUDA instruction set architecture, called PTX, which is described in the PTX reference manual. It is however usually more effective to use a high-level programming language such as C. In both cases, kernels must be compiled into binary code by nvcc to execute on the device.

nvcc is a compiler driver that simplifies the process of compiling C or PTX code: it provides simple and familiar command line options and executes them by invoking the collection of tools that implement the different compilation stages. This section gives an overview of the nvcc workflow and command options. A complete description can be found in the nvcc user manual.

3.1.1. Compilation Workflow

3.1.1.1. Offline Compilation

Source files compiled with nvcc can include a mix of host code (i.e., code that executes on the host) and device code (i.e., code that executes on the device). nvcc's basic workflow consists in separating device code from host code and then:

· compiling the device code into an assembly form (PTX code) and/or binary form (cubin object),

· and modifying the host code by replacing the <<<...>>> syntax introduced in Kernels (and described in more detail in Execution Configuration) by the necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code and/or cubin object.

The modified host code is output either as C code that is left to be compiled using another tool or as object code directly by letting nvcc invoke the host compiler during the last compilation stage.
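
For instance, each output form can be requested separately; the following invocations are illustrative (x.cu is a stand-in source file, and the options shown are standard nvcc options):

    nvcc -ptx x.cu -o x.ptx          (device code only, emitted as PTX text)
    nvcc -cubin -arch=sm_35 x.cu     (device code only, emitted as a cubin object)
    nvcc x.cu -o x                   (full build: nvcc also invokes the host compiler)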

Applications can then:

· Either link to the compiled host code (this is the most common case),

· Or ignore the modified host code (if any) and use the CUDA driver API (see Driver API) to load and execute the PTX code or cubin object.

3.1.1.2. Just-in-Time Compilation

Any PTX code loaded by an application at runtime is compiled further to binary code by the device driver. This is called just-in-time compilation. Just-in-time compilation increases application load time, but allows the application to benefit from any new compiler improvements coming with each new device driver. It is also the only way for applications to run on devices that did not exist at the time the application was compiled, as detailed in Application Compatibility.

When the device driver just-in-time compiles some PTX code for some application, it automatically caches a copy of the generated binary code in order to avoid repeating the compilation in subsequent invocations of the application. The cache - referred to as compute cache - is automatically invalidated when the device driver is upgraded, so that applications can benefit from the improvements in the new just-in-time compiler built into the device driver.
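
To make the mechanism concrete, here is a minimal sketch (not code from the guide) of an application that loads PTX at runtime through the driver API; the cuModuleLoadData() call is where the driver JIT-compiles the PTX for whatever device is present. The kernel name myKernel and the ptxSource string are assumptions:

    #include <cuda.h>    /* CUDA driver API; link with -lcuda */
    #include <stdio.h>

    /* Assumed to be defined elsewhere: a null-terminated PTX string for a
       parameterless kernel named "myKernel", e.g. produced by nvcc -ptx. */
    extern const char ptxSource[];

    int main(void)
    {
        CUdevice dev;
        CUcontext ctx;
        CUmodule mod;
        CUfunction fn;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        /* The driver JIT-compiles the PTX into binary code for dev here,
           caching the result in the compute cache for subsequent runs. */
        if (cuModuleLoadData(&mod, ptxSource) != CUDA_SUCCESS) {
            fprintf(stderr, "PTX JIT compilation failed\n");
            return 1;
        }
        cuModuleGetFunction(&fn, mod, "myKernel");

        /* Launch one block of 32 threads; the kernel takes no arguments. */
        cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, NULL, NULL, NULL);
        cuCtxSynchronize();

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }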

Environment variables are available to control just-in-time compilation, as described in CUDA Environment Variables.

3.1.2. Binary Compatibility

Binary code is architecture-specific. A cubin object is generated using the compiler option -code that specifies the targeted architecture: For example, compiling with -code=sm_35 produces binary code for devices of compute capability 3.5. Binary compatibility is guaranteed from one minor revision to the next one, but not from one minor revision to the previous one or across major revisions. In other words, a cubin object generated for compute capability X.y will only execute on devices of compute capability X.z where z≥y.
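
For instance (an illustrative invocation; x.cu is a stand-in):

    nvcc -arch=compute_35 -code=sm_35 x.cu

The resulting cubin runs on devices of compute capability 3.5 and 3.7, but not on 3.0 devices and not on 5.x or later devices.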

3.1.3. PTX Compatibility

Some PTX instructions are only supported on devices of higher compute capabilities. For example, Warp Shuffle Functions are only supported on devices of compute capability 3.0 and above. The -arch compiler option specifies the compute capability that is assumed when compiling C to PTX code. So, code that contains warp shuffle, for example, must be compiled with -arch=compute_30 (or higher).

PTX code produced for some specific compute capability can always be compiled to binary code of greater or equal compute capability. Note that a binary compiled from an earlier PTX version may not make use of some hardware features. For example, a binary targeting devices of compute capability 7.0 (Volta) compiled from PTX generated for compute capability 6.0 (Pascal) will not make use of Tensor Core instructions, since these were not available on Pascal. As a result, the final binary may perform worse than would be possible if the binary were generated using the latest version of PTX.

3.1.4. Application Compatibility

To execute code on devices of specific compute capability, an application must load binary or PTX code that is compatible with this compute capability as described in Binary Compatibility and PTX Compatibility. In particular, to be able to execute code on future architectures with higher compute capability (for which no binary code can be generated yet), an application must load PTX code that will be just-in-time compiled for these devices (see Just-in-Time Compilation).

Which PTX and binary code gets embedded in a CUDA C application is controlled by the -arch and -code compiler options or the -gencode compiler option as detailed in the nvcc user manual. For example,
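
    nvcc x.cu
            -gencode arch=compute_35,code=sm_35
            -gencode arch=compute_50,code=sm_50
            -gencode arch=compute_60,code=\'compute_60,sm_60\'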

embeds binary code compatible with compute capability 3.5 and 5.0 (first and second -gencode options) and PTX and binary code compatible with compute capability 6.0 (third -gencode option).

Host code is generated to automatically select at runtime the most appropriate code to load and execute, which, in the above example, will be:

· 3.5 binary code for devices with compute capability 3.5 and 3.7,

· 5.0 binary code for devices with compute capability 5.0 and 5.2,

· 6.0 binary code for devices with compute capability 6.0 and 6.1,

· PTX code which is compiled to binary code at runtime for devices with compute capability 7.0 and higher.

x.cu can have an optimized code path that uses warp shuffle operations, for example, which are only supported in devices of compute capability 3.0 and higher. The __CUDA_ARCH__ macro can be used to differentiate various code paths based on compute capability. It is only defined for device code. When compiling with -arch=compute_35 for example, __CUDA_ARCH__ is equal to 350.
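
A sketch of such a differentiated code path (the kernel, and the assumption of 32 threads per block, are illustrative, not from the guide):

    __global__ void warpSum(float *data)
    {
        /* Assumes a launch with exactly 32 threads per block. */
        float v = data[threadIdx.x];
    #if __CUDA_ARCH__ >= 300
        /* Optimized path: warp shuffle, available from compute capability 3.0. */
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        /* After the loop, lane 0 holds the sum of all 32 values. */
    #else
        /* Fallback for older devices: accumulate through shared memory. */
        __shared__ float s[32];
        s[threadIdx.x] = v;
        __syncthreads();
        if (threadIdx.x == 0)
            for (int i = 1; i < 32; ++i)
                s[0] += s[i];
        __syncthreads();
        v = s[0];
    #endif
        if (threadIdx.x == 0)
            data[0] = v;   /* thread 0 writes the 32-element sum */
    }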

Applications using the driver API must compile code to separate files and explicitly load and execute the most appropriate file at runtime.

The Volta architecture introduces Independent Thread Scheduling which changes the way threads are scheduled on the GPU. For code relying on specific behavior of SIMT scheduling in previous architectures, Independent Thread Scheduling may alter the set of participating threads, leading to incorrect results. To aid migration while implementing the corrective actions detailed in Independent Thread Scheduling, Volta developers can opt-in to Pascal's thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.

The nvcc user manual lists various shorthand for the -arch, -code, and -gencode compiler options. For example, -arch=sm_35 is a shorthand for -arch=compute_35 -code=compute_35,sm_35 (which is the same as -gencode arch=compute_35,code=\'compute_35,sm_35\').

3.1.5. C/C++ Compatibility

The front end of the compiler processes CUDA source files according to C++ syntax rules. Full C++ is supported for the host code. However, only a subset of C++ is fully supported for the device code as described in C/C++ Language Support.

3.1.6. 64-Bit Compatibility

The 64-bit version of nvcc compiles device code in 64-bit mode (i.e., pointers are 64-bit). Device code compiled in 64-bit mode is only supported with host code compiled in 64-bit mode.

Similarly, the 32-bit version of nvcc compiles device code in 32-bit mode and device code compiled in 32-bit mode is only supported with host code compiled in 32-bit mode.

The 32-bit version of nvcc can also compile device code in 64-bit mode, using the -m64 compiler option.

The 64-bit version of nvcc can also compile device code in 32-bit mode, using the -m32 compiler option.
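
Illustrative invocations of these modes (x.cu is a stand-in):

    nvcc -m64 x.cu    (64-bit host and device code)
    nvcc -m32 x.cu    (32-bit host and device code)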

"Just-in-time compilation" is abbreviated JIT. Concretely, it means the code is compiled at the last moment before it is about to execute (the opposite approach is called AOT, Ahead Of Time). From the developer's perspective, ordinary compilation happens now, on the developer's machine. JIT compilation happens after the application has shipped, on the user's machine, possibly at some point in the future: for example, when a new generation of GPU is released. While developing, the new card does not exist yet, and the current compiler has no idea how to generate code for it. With JIT this is not a problem: the future compiler is built into the future card's driver, and compilation simply happens then, which resolves the mismatch in time. Better still, if compiler technology advances one day, JIT compilation lets an application keep benefiting from those advances many years after development is finished, even when the original developers are long gone (for example, the team has disbanded). Because compilation is not done once up front, as ordinary compilation is, but is completed later, on the user's machine, at the moment it is needed, it is called "just-in-time" compilation.

"Binary code is architecture-specific" refers to SASS. SASS (short for Shader ASSembly) is fixed per GPU architecture: SASS compiled for one architecture (a cubin, for example) can only be used on cards of that architecture, unlike PTX, which is portable. (Binary compatibility on a CPU works differently: a ten-year-old exe still runs on a CPU released this year.) A GPU maintains compatibility only at the PTX level; plain SASS code does not carry over unless the card belongs to the same architecture generation. It is as if a v5 CPU could only run exes compiled on v5, neither earlier ones nor later ones.

PTX Compatibility: PTX comes in several versions, and later drivers and cards support higher PTX versions. Code written against a lower PTX version runs under a higher one, which preserves compatibility with old code. This is unlike binary SASS, where one generation runs only on that generation, neither on older nor on newer cards; that is the biggest drawback of shipping SASS (i.e., binary) code. PTX keeps running on future new cards (thanks to JIT). You can think of PTX as a virtual machine together with the virtual instructions that run on it.

"Full C++ is supported for the host code. However, only a subset of C++ is fully supported for the device code": in host code (that is, on the ordinary CPU side) there is complete C++ support; in device code (that is, on the GPU) only some C++ features are fully supported.

"Device code compiled in 64-bit mode is only supported with host code compiled in 64-bit mode."

If the GPU side is 64-bit, the CPU side must be as well. That seems obvious, so why does it need to be stated? Because CUDA 3.2 and earlier supported a mixed mode that allowed one side to be 64-bit and the other 32-bit. This turned out to confuse many people, so the two were later required to match. It is also part of CUDA's ease of use: OpenCL, for example, does not require this. As a result, CUDA can easily pass structs (containing all sorts of word-size-dependent members, 32-bit or 64-bit) between GPU and CPU, which is hard to achieve in OpenCL.

If anything is unclear, please leave a comment at the end of this article,

or post on our technical forum at bbs.gpuworld.cn.
