DAY 3: Reading the CUDA C Programming Interface

2021-03-02 GPUS開發者

We are leading readers through the English edition of the CUDA C Programming Guide. Today is day three, and we will spend three days studying CUDA's programming interface. We hope that over the coming 97 days you will learn CUDA straight from the source while building the habit of reading English.


3. Programming Interface

CUDA C provides a simple path for users familiar with the C programming language to easily write programs for execution by the device.

It consists of a minimal set of extensions to the C language and a runtime library.

The core language extensions were introduced in DAY 2: Reading the CUDA C Programming Guide (Programming Model). They allow programmers to define a kernel as a C function and to use some new syntax to specify the grid and block dimensions each time the function is called. Any source file that contains any of these extensions must be compiled with nvcc.
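
As a quick refresher, here is a minimal sketch of what those extensions look like in practice (the kernel, sizes, and file name are illustrative, not taken from the guide):

    // vecadd.cu -- build with: nvcc vecadd.cu -o vecadd
    #include <cuda_runtime.h>

    // Device code: the __global__ qualifier marks VecAdd as a kernel.
    __global__ void VecAdd(const float *A, const float *B, float *C, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    // Host code: the <<<grid, block>>> execution configuration is the new
    // syntax that only nvcc understands.
    int main(void)
    {
        const int N = 1024;
        float *A, *B, *C;
        cudaMalloc(&A, N * sizeof(float));
        cudaMalloc(&B, N * sizeof(float));
        cudaMalloc(&C, N * sizeof(float));

        VecAdd<<<(N + 255) / 256, 256>>>(A, B, C, N);
        cudaDeviceSynchronize();

        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }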

The runtime is introduced in CUDA C Runtime. It provides C functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, etc. A complete description of the runtime can be found in the CUDA reference manual.
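
As a sketch of that host-side runtime usage (the array size and device index are arbitrary choices, not from the guide):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const int N = 256;
        float hostData[256];
        for (int i = 0; i < N; ++i) hostData[i] = (float)i;

        // Allocate device memory, copy host -> device and back, then free.
        float *devData;
        cudaMalloc(&devData, N * sizeof(float));
        cudaMemcpy(devData, hostData, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(hostData, devData, N * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(devData);

        // Systems with multiple devices: enumerate and select one.
        int count = 0;
        cudaGetDeviceCount(&count);
        printf("%d CUDA device(s)\n", count);
        if (count > 1) cudaSetDevice(1);
        return 0;
    }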

The runtime is built on top of a lower-level C API, the CUDA driver API, which is also accessible by the application. The driver API provides an additional level of control by exposing lower-level concepts such as CUDA contexts - the analogue of host processes for the device - and CUDA modules - the analogue of dynamically loaded libraries for the device. Most applications do not use the driver API as they do not need this additional level of control and when using the runtime, context and module management are implicit, resulting in more concise code. The driver API is introduced in Driver API and fully described in the reference manual.

3.1. Compilation with NVCC

Kernels can be written using the CUDA instruction set architecture, called PTX, which is described in the PTX reference manual. It is however usually more effective to use a high-level programming language such as C. In both cases, kernels must be compiled into binary code by nvcc to execute on the device.

nvcc is a compiler driver that simplifies the process of compiling C or PTX code: it provides simple and familiar command line options and executes them by invoking the collection of tools that implement the different compilation stages. This section gives an overview of the nvcc workflow and command options. A complete description can be found in the nvcc user manual.

3.1.1. Compilation Workflow

3.1.1.1. Offline Compilation

Source files compiled with nvcc can include a mix of host code (i.e., code that executes on the host) and device code (i.e., code that executes on the device). nvcc's basic workflow consists in separating device code from host code and then:

· compiling the device code into an assembly form (PTX code) and/or binary form (cubin object),

· and modifying the host code by replacing the <<<...>>> syntax introduced in Kernels (and described in more detail in Execution Configuration) by the necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code and/or cubin object.

The modified host code is output either as C code that is left to be compiled using another tool or as object code directly by letting nvcc invoke the host compiler during the last compilation stage.
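
For instance, each output form can be requested separately; the following invocations are illustrative (x.cu is a stand-in source file, and the options shown are standard nvcc options):

    nvcc -ptx x.cu -o x.ptx          (device code only, emitted as PTX text)
    nvcc -cubin -arch=sm_35 x.cu     (device code only, emitted as a cubin object)
    nvcc x.cu -o x                   (full build: nvcc also invokes the host compiler)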

Applications can then:

· Either link to the compiled host code (this is the most common case),

· Or ignore the modified host code (if any) and use the CUDA driver API (see Driver API) to load and execute the PTX code or cubin object.

3.1.1.2. Just-in-Time Compilation

Any PTX code loaded by an application at runtime is compiled further to binary code by the device driver. This is called just-in-time compilation. Just-in-time compilation increases application load time, but allows the application to benefit from any new compiler improvements coming with each new device driver. It is also the only way for applications to run on devices that did not exist at the time the application was compiled, as detailed in Application Compatibility.

When the device driver just-in-time compiles some PTX code for some application, it automatically caches a copy of the generated binary code in order to avoid repeating the compilation in subsequent invocations of the application. The cache - referred to as compute cache - is automatically invalidated when the device driver is upgraded, so that applications can benefit from the improvements in the new just-in-time compiler built into the device driver.
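
To make the mechanism concrete, here is a minimal sketch (not code from the guide) of an application that loads PTX at runtime through the driver API; the cuModuleLoadData() call is where the driver JIT-compiles the PTX for whatever device is present. The kernel name myKernel and the ptxSource string are assumptions:

    #include <cuda.h>    /* CUDA driver API; link with -lcuda */
    #include <stdio.h>

    /* Assumed to be defined elsewhere: a null-terminated PTX string for a
       parameterless kernel named "myKernel", e.g. produced by nvcc -ptx. */
    extern const char ptxSource[];

    int main(void)
    {
        CUdevice dev;
        CUcontext ctx;
        CUmodule mod;
        CUfunction fn;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        /* The driver JIT-compiles the PTX into binary code for dev here,
           caching the result in the compute cache for subsequent runs. */
        if (cuModuleLoadData(&mod, ptxSource) != CUDA_SUCCESS) {
            fprintf(stderr, "PTX JIT compilation failed\n");
            return 1;
        }
        cuModuleGetFunction(&fn, mod, "myKernel");

        /* Launch one block of 32 threads; the kernel takes no arguments. */
        cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, NULL, NULL, NULL);
        cuCtxSynchronize();

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }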

Environment variables are available to control just-in-time compilation, as described in CUDA Environment Variables.

3.1.2. Binary Compatibility

Binary code is architecture-specific. A cubin object is generated using the compiler option -code that specifies the targeted architecture: For example, compiling with -code=sm_35 produces binary code for devices of compute capability 3.5. Binary compatibility is guaranteed from one minor revision to the next one, but not from one minor revision to the previous one or across major revisions. In other words, a cubin object generated for compute capability X.y will only execute on devices of compute capability X.z where z≥y.
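
For instance (an illustrative invocation; x.cu is a stand-in):

    nvcc -arch=compute_35 -code=sm_35 x.cu

The resulting cubin runs on devices of compute capability 3.5 and 3.7, but not on 3.0 devices and not on 5.x or later devices.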

3.1.3. PTX Compatibility

Some PTX instructions are only supported on devices of higher compute capabilities. For example, Warp Shuffle Functions are only supported on devices of compute capability 3.0 and above. The -arch compiler option specifies the compute capability that is assumed when compiling C to PTX code. So, code that contains warp shuffle, for example, must be compiled with -arch=compute_30 (or higher).

PTX code produced for some specific compute capability can always be compiled to binary code of greater or equal compute capability. Note that a binary compiled from an earlier PTX version may not make use of some hardware features. For example, a binary targeting devices of compute capability 7.0 (Volta) compiled from PTX generated for compute capability 6.0 (Pascal) will not make use of Tensor Core instructions, since these were not available on Pascal. As a result, the final binary may perform worse than would be possible if the binary were generated using the latest version of PTX.

3.1.4. Application Compatibility

To execute code on devices of specific compute capability, an application must load binary or PTX code that is compatible with this compute capability as described in Binary Compatibility and PTX Compatibility. In particular, to be able to execute code on future architectures with higher compute capability (for which no binary code can be generated yet), an application must load PTX code that will be just-in-time compiled for these devices (see Just-in-Time Compilation).

Which PTX and binary code gets embedded in a CUDA C application is controlled by the -arch and -code compiler options or the -gencode compiler option as detailed in the nvcc user manual. For example,
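
    nvcc x.cu
            -gencode arch=compute_35,code=sm_35
            -gencode arch=compute_50,code=sm_50
            -gencode arch=compute_60,code=\'compute_60,sm_60\'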

embeds binary code compatible with compute capability 3.5 and 5.0 (first and second -gencode options) and PTX and binary code compatible with compute capability 6.0 (third -gencode option).

Host code is generated to automatically select at runtime the most appropriate code to load and execute, which, in the above example, will be:

· 3.5 binary code for devices with compute capability 3.5 and 3.7,

· 5.0 binary code for devices with compute capability 5.0 and 5.2,

· 6.0 binary code for devices with compute capability 6.0 and 6.1,

· PTX code which is compiled to binary code at runtime for devices with compute capability 7.0 and higher.

x.cu can have an optimized code path that uses warp shuffle operations, for example, which are only supported in devices of compute capability 3.0 and higher. The __CUDA_ARCH__ macro can be used to differentiate various code paths based on compute capability. It is only defined for device code. When compiling with -arch=compute_35 for example, __CUDA_ARCH__ is equal to 350.
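
A sketch of such a differentiated code path (the kernel, and the assumption of 32 threads per block, are illustrative, not from the guide):

    __global__ void warpSum(float *data)
    {
        /* Assumes a launch with exactly 32 threads per block. */
        float v = data[threadIdx.x];
    #if __CUDA_ARCH__ >= 300
        /* Optimized path: warp shuffle, available from compute capability 3.0. */
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        /* After the loop, lane 0 holds the sum of all 32 values. */
    #else
        /* Fallback for older devices: accumulate through shared memory. */
        __shared__ float s[32];
        s[threadIdx.x] = v;
        __syncthreads();
        if (threadIdx.x == 0)
            for (int i = 1; i < 32; ++i)
                s[0] += s[i];
        __syncthreads();
        v = s[0];
    #endif
        if (threadIdx.x == 0)
            data[0] = v;   /* thread 0 writes the 32-element sum */
    }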

Applications using the driver API must compile code to separate files and explicitly load and execute the most appropriate file at runtime.

The Volta architecture introduces Independent Thread Scheduling which changes the way threads are scheduled on the GPU. For code relying on specific behavior of SIMT scheduling in previous architectures, Independent Thread Scheduling may alter the set of participating threads, leading to incorrect results. To aid migration while implementing the corrective actions detailed in Independent Thread Scheduling, Volta developers can opt-in to Pascal's thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.

The nvcc user manual lists various shorthand for the -arch, -code, and -gencode compiler options. For example, -arch=sm_35 is a shorthand for -arch=compute_35 -code=compute_35,sm_35 (which is the same as -gencode arch=compute_35,code=\'compute_35,sm_35\').

3.1.5. C/C++ Compatibility

The front end of the compiler processes CUDA source files according to C++ syntax rules. Full C++ is supported for the host code. However, only a subset of C++ is fully supported for the device code as described in C/C++ Language Support.

3.1.6. 64-Bit Compatibility

The 64-bit version of nvcc compiles device code in 64-bit mode (i.e., pointers are 64-bit). Device code compiled in 64-bit mode is only supported with host code compiled in 64-bit mode.

Similarly, the 32-bit version of nvcc compiles device code in 32-bit mode and device code compiled in 32-bit mode is only supported with host code compiled in 32-bit mode.

The 32-bit version of nvcc can also compile device code in 64-bit mode, using the -m64 compiler option.

The 64-bit version of nvcc can also compile device code in 32-bit mode, using the -m32 compiler option.
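
Illustrative invocations of these modes (x.cu is a stand-in):

    nvcc -m64 x.cu    (64-bit host and device code)
    nvcc -m32 x.cu    (32-bit host and device code)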

"Just-in-time compilation" is abbreviated JIT. Concretely, it means the code is compiled at the last moment before it is about to execute (the opposite approach is called AOT, Ahead Of Time). From the developer's perspective, ordinary compilation happens now, on the developer's machine. JIT compilation happens after the application has shipped, on the user's machine, possibly at some point in the future: for example, when a new generation of GPU is released. While developing, the new card does not exist yet, and the current compiler has no idea how to generate code for it. With JIT this is not a problem: the future compiler is built into the future card's driver, and compilation simply happens then, which resolves the mismatch in time. Better still, if compiler technology advances one day, JIT compilation lets an application keep benefiting from those advances many years after development is finished, even when the original developers are long gone (for example, the team has disbanded). Because compilation is not done once up front, as ordinary compilation is, but is completed later, on the user's machine, at the moment it is needed, it is called "just-in-time" compilation.

"Binary code is architecture-specific" refers to SASS. SASS (short for Shader ASSembly) is fixed per GPU architecture: SASS compiled for one architecture (a cubin, for example) can only be used on cards of that architecture, unlike PTX, which is portable. (Binary compatibility on a CPU works differently: a ten-year-old exe still runs on a CPU released this year.) A GPU maintains compatibility only at the PTX level; plain SASS code does not carry over unless the card belongs to the same architecture generation. It is as if a v5 CPU could only run exes compiled on v5, neither earlier ones nor later ones.

PTX Compatibility: PTX comes in several versions, and later drivers and cards support higher PTX versions. Code written against a lower PTX version runs under a higher one, which preserves compatibility with old code. This is unlike binary SASS, where one generation runs only on that generation, neither on older nor on newer cards; that is the biggest drawback of shipping SASS (i.e., binary) code. PTX keeps running on future new cards (thanks to JIT). You can think of PTX as a virtual machine together with the virtual instructions that run on it.

"Full C++ is supported for the host code. However, only a subset of C++ is fully supported for the device code": in host code (that is, on the ordinary CPU side) there is complete C++ support; in device code (that is, on the GPU) only some C++ features are fully supported.

"Device code compiled in 64-bit mode is only supported with host code compiled in 64-bit mode."

If the GPU side is 64-bit, the CPU side must be as well. That seems obvious, so why does it need to be stated? Because CUDA 3.2 and earlier supported a mixed mode that allowed one side to be 64-bit and the other 32-bit. This turned out to confuse many people, so the two were later required to match. It is also part of CUDA's ease of use: OpenCL, for example, does not require this. As a result, CUDA can easily pass structs (containing all sorts of word-size-dependent members, 32-bit or 64-bit) between GPU and CPU, which is hard to achieve in OpenCL.

If anything is unclear, please leave a comment at the end of this article,

or post on our technical forum at bbs.gpuworld.cn.
