Reading the CUDA C Programming Guide, Chapter 3, Section 2: Device Memory

2021-03-02 AliceWanderAI

The guide explains how to write C programs that are parallelized and optimized with CUDA. It has five chapters:

Introduction

Programming Model

Programming Interface

Hardware Implementation

Performance Guidelines

The main topics of this chapter are:

Compilation with NVCC

CUDA C Runtime

Versioning and Compatibility

Compute Modes

Mode Switches

Tesla Compute Cluster Mode for Windows

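Section 2 of Chapter 3 covers a great deal of material and is central to learning CUDA, so this walkthrough splits it into 14 parts to make it easier to study. This post covers Device Memory.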

Initialization

Device Memory

Device Memory L2 Access Management

Shared Memory

Page-locked Host Memory

Asynchronous Concurrent Execution

Multi-Device System

Unified Virtual Address Space

Interprocess Communication

Error Checking

Call Stack

Texture and Surface Memory

Graphics Interoperability

External Resource Interoperability

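As noted in the discussion of heterogeneous programming, the CUDA programming model assumes a system made up of a host and a device, each with its own memory. Device memory here includes global memory, constant memory, and texture memory.

Device memory can be allocated in two ways: as linear memory or as CUDA arrays.

CUDA arrays are optimized for texture fetching and are not covered in detail here.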

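Linear memory is allocated in a single address space. The size of that address space depends on the host system (CPU) and on the compute capability of the GPU, so GPUs with different compute capabilities have address spaces of different sizes.

[Figure: linear memory address space by compute capability]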

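Linear memory is typically allocated with cudaMalloc() and freed with cudaFree(). Data is transferred between host and device with cudaMemcpy(), and a region is filled with a value using cudaMemset().

Take vector addition as an example: the vectors have to be copied from the host to the device before the computation can run.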

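// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate the input vectors h_A and h_B and the result vector h_C in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize the input vectors
    ...

    // Allocate the vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);

    // Copy the inputs from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel with enough blocks to cover all N elements
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy the result from device memory back to host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    ...
}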
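
The vector-addition listing does not use cudaMemset() and ignores the return codes of the runtime calls. The sketch below is one possible way to exercise all four calls with error checking. It is only an illustration: the CHECK_CUDA macro and the function count_nonzero_after_memset() are hypothetical names made up for this example and are not part of the CUDA runtime.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA runtime):
// print the error string and abort if a runtime call fails.
#define CHECK_CUDA(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical function: zero a device buffer with cudaMemset(),
// copy it back, and return how many bytes are still nonzero on the host.
int count_nonzero_after_memset(size_t n)
{
    unsigned char* h_buf = (unsigned char*)malloc(n);
    if (h_buf == NULL) {
        fprintf(stderr, "host allocation of %zu bytes failed\n", n);
        exit(EXIT_FAILURE);
    }
    memset(h_buf, 0xFF, n);  // poison the host buffer so stale data is visible

    unsigned char* d_buf = NULL;
    CHECK_CUDA(cudaMalloc(&d_buf, n));        // allocate linear device memory
    CHECK_CUDA(cudaMemset(d_buf, 0, n));      // set every byte to 0 on the device
    CHECK_CUDA(cudaMemcpy(h_buf, d_buf, n, cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaFree(d_buf));              // release the device memory

    int nonzero = 0;
    for (size_t i = 0; i < n; ++i)
        if (h_buf[i] != 0)
            ++nonzero;
    free(h_buf);
    return nonzero;
}

int main()
{
    printf("nonzero bytes: %d\n", count_nonzero_after_memset(1024));
    return 0;
}
```

Note that cudaMemset() sets bytes, not elements, so it is mainly useful for zeroing a buffer or filling it with a repeated byte pattern.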
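
Linear memory can also be allocated with cudaMallocPitch() and cudaMalloc3D(). These two functions are recommended for allocating 2D and 3D arrays: they pad the allocation automatically so that it meets the alignment requirements, which makes memory reads and writes more efficient.

For example, the code below allocates a width x height 2D array of floats and loops over its elements in device code.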
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r)
    {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
        {
            float element = row[c];
        }
    }
}
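
Because a pitched allocation pads each row, ordinary cudaMemcpy() calls cannot be used directly on it when the host array is tightly packed; cudaMemcpy2D() takes both pitches into account. The following is a minimal sketch, assuming a tightly packed host array and one thread per element. The kernel ScaleRows and the function scale_pitched() are hypothetical names chosen for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one element.
// Consecutive rows start pitch bytes apart, not width * sizeof(float).
__global__ void ScaleRows(float* devPtr, size_t pitch,
                          int width, int height, float factor)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < height && c < width) {
        float* row = (float*)((char*)devPtr + r * pitch);
        row[c] *= factor;
    }
}

// Hypothetical function: returns the number of mismatching elements,
// or -1 if a CUDA call fails.
int scale_pitched(int width, int height, float factor)
{
    size_t rowBytes = width * sizeof(float);
    float* h = (float*)malloc(rowBytes * height);
    if (h == NULL) return -1;
    for (int i = 0; i < width * height; ++i)
        h[i] = (float)i;

    float* d = NULL;
    size_t pitch = 0;
    if (cudaMallocPitch(&d, &pitch, rowBytes, height) != cudaSuccess) {
        free(h);
        return -1;
    }

    // Host rows are tightly packed (host pitch == rowBytes);
    // device rows use the pitch returned by cudaMallocPitch().
    cudaError_t err = cudaMemcpy2D(d, pitch, h, rowBytes, rowBytes, height,
                                   cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        ScaleRows<<<grid, block>>>(d, pitch, width, height, factor);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy2D(h, rowBytes, d, pitch, rowBytes, height,
                           cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) {
        free(h);
        return -1;
    }

    int mismatches = 0;
    for (int i = 0; i < width * height; ++i)
        if (h[i] != (float)i * factor)
            ++mismatches;
    free(h);
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", scale_pitched(64, 64, 2.0f));
    return 0;
}
```

The 16 x 16 block size is an arbitrary choice for the sketch; the pitched addressing inside the kernel is the same as in MyKernel.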

Similarly, the code below allocates a width x height x depth 3D array of floats and loops over its elements in device code.
// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);

// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr, int width, int height, int depth)
{
    // ptr is a void*, so C++ needs an explicit cast before byte arithmetic
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * height;  // bytes per 2D slice
    for (int z = 0; z < depth; ++z)
    {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y)
        {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x)
            {
                float element = row[x];
            }
        }
    }
}
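
To move data in and out of a cudaMalloc3D() allocation, the runtime provides cudaMemcpy3D(), which is driven by a cudaMemcpy3DParms structure. The sketch below is one way this might look, assuming a tightly packed host array and one thread per element; the kernel AddOne3D and the function add_one_3d() are hypothetical names used only for this illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds 1 to one element of the 3D array.
__global__ void AddOne3D(cudaPitchedPtr p, int width, int height, int depth)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < width && y < height && z < depth) {
        char* slice = (char*)p.ptr + z * p.pitch * height;  // start of slice z
        float* row = (float*)(slice + y * p.pitch);         // start of row y
        row[x] += 1.0f;
    }
}

// Hypothetical function: returns the number of elements that are not 1.0f
// after the round trip, or -1 if a CUDA call fails.
int add_one_3d(int width, int height, int depth)
{
    size_t count = (size_t)width * height * depth;
    float* h = (float*)calloc(count, sizeof(float));  // all zeros
    if (h == NULL) return -1;

    cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);
    cudaPitchedPtr d;
    if (cudaMalloc3D(&d, extent) != cudaSuccess) {
        free(h);
        return -1;
    }

    // Describe the host array as a pitched pointer with no padding.
    cudaPitchedPtr hostPtr =
        make_cudaPitchedPtr(h, width * sizeof(float), width, height);

    cudaMemcpy3DParms up = {0};
    up.srcPtr = hostPtr;
    up.dstPtr = d;
    up.extent = extent;
    up.kind = cudaMemcpyHostToDevice;
    cudaError_t err = cudaMemcpy3D(&up);

    if (err == cudaSuccess) {
        dim3 block(8, 8, 4);
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y,
                  (depth + block.z - 1) / block.z);
        AddOne3D<<<grid, block>>>(d, width, height, depth);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        cudaMemcpy3DParms down = {0};
        down.srcPtr = d;
        down.dstPtr = hostPtr;
        down.extent = extent;
        down.kind = cudaMemcpyDeviceToHost;
        err = cudaMemcpy3D(&down);
    }
    cudaFree(d.ptr);
    if (err != cudaSuccess) {
        free(h);
        return -1;
    }

    int mismatches = 0;
    for (size_t i = 0; i < count; ++i)
        if (h[i] != 1.0f)
            ++mismatches;
    free(h);
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", add_one_3d(64, 64, 64));
    return 0;
}
```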
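
Note:
To avoid allocating too much memory and hurting system-wide performance, base the allocation parameters on the user's actual problem size. If the allocation fails, you can fall back to other, slower kinds of memory (cudaMallocHost(), cudaHostRegister(), and so on), or return an error that tells the user how much memory could not be allocated.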
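
The fallback described in this note could be sketched roughly as follows. This is only one possible arrangement, not a prescribed pattern: the BufferKind enum and the functions alloc_with_fallback() and free_buffer() are hypothetical names invented for the example, and a real application would also need its kernels and copies to handle the pinned-host case.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical tag recording where a buffer ended up.
enum BufferKind { BUFFER_NONE, BUFFER_DEVICE, BUFFER_PINNED_HOST };

// Hypothetical function: try device memory first, then fall back to
// page-locked host memory, and finally report the size that was denied.
BufferKind alloc_with_fallback(void** ptr, size_t bytes)
{
    *ptr = NULL;
    if (cudaMalloc(ptr, bytes) == cudaSuccess)
        return BUFFER_DEVICE;
    cudaGetLastError();  // reset the error state left by the failed call

    if (cudaMallocHost(ptr, bytes) == cudaSuccess)
        return BUFFER_PINNED_HOST;
    cudaGetLastError();

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    fprintf(stderr,
            "could not allocate %zu bytes (device reports %zu of %zu bytes free)\n",
            bytes, freeBytes, totalBytes);
    *ptr = NULL;
    return BUFFER_NONE;
}

// Release a buffer with the call that matches how it was allocated.
void free_buffer(void* ptr, BufferKind kind)
{
    if (kind == BUFFER_DEVICE)
        cudaFree(ptr);
    else if (kind == BUFFER_PINNED_HOST)
        cudaFreeHost(ptr);
}

int main()
{
    void* buf = NULL;
    size_t bytes = 1 << 20;  // 1 MiB, an arbitrary size for the sketch
    BufferKind kind = alloc_with_fallback(&buf, bytes);
    if (kind == BUFFER_DEVICE)
        printf("allocated %zu bytes in device memory\n", bytes);
    else if (kind == BUFFER_PINNED_HOST)
        printf("allocated %zu bytes in page-locked host memory\n", bytes);
    free_buffer(buf, kind);
    return kind == BUFFER_NONE ? 1 : 0;
}
```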
