This document introduces the CUDA C Programming Guide for parallel optimization. It consists of five chapters:
Introduction
Programming Model
Programming Interface
Hardware Implementation
Performance Guidelines
The main contents of this chapter are:
Compilation with NVCC
CUDA C Runtime
Versioning and Compatibility
Compute Modes
Mode Switches
Tesla Compute Cluster Mode for Windows
Section 3.2 of chapter three contains a great deal of material and is the focus of study, so to better understand CUDA programming, the walkthrough of that section is divided into the following 14 parts. This installment covers Device Memory:
Initialization
Device Memory
Device Memory L2 Access Management
Shared Memory
Page-locked Host Memory
Asynchronous Concurrent Execution
Multi-Device System
Unified Virtual Address Space
Interprocess Communication
Error Checking
Call Stack
Texture and Surface Memory
Graphics Interoperability
External Resource Interoperability
As mentioned in Heterogeneous Programming, the CUDA programming model assumes a system composed of a host and a device, each with its own separate memory. Device memory here includes global memory, constant memory, and texture memory.
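As a quick illustration (the variable names here are my own, not from the guide), global and constant memory can also be declared statically at file scope; constant memory is written from the host with cudaMemcpyToSymbol():

__device__   float d_table[256];   // statically allocated global memory
__constant__ float c_coeffs[16];   // constant memory, read-only in kernels

void setup()
{
    float h_coeffs[16] = {0};
    // Copy host data into the device's constant memory.
    cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
}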
Device memory can be allocated in two ways: as linear memory or as CUDA arrays.
CUDA arrays are memory layouts optimized for texture fetching and are not discussed further here.
Linear memory is allocated in a single address space. The size of this address space depends on the host system (CPU) and on the compute capability of the GPU; GPUs of different compute capabilities support address spaces of different sizes (see the table in the Programming Guide).
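As a side note (this snippet is my own sketch, not from the guide), the compute capability and the amount of physical device memory can be queried at runtime with cudaGetDeviceProperties():

#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("compute capability %d.%d, %zu bytes of global memory\n",
           prop.major, prop.minor, prop.totalGlobalMem);
    return 0;
}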
Linear memory is typically allocated with cudaMalloc() and freed with cudaFree(); data transfers between host and device use cudaMemcpy(), and cudaMemset() sets memory contents.
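For instance, a freshly allocated buffer can be zeroed with cudaMemset() before use; a minimal sketch (the buffer name d_buf is illustrative):

float* d_buf;
size_t bytes = 1024 * sizeof(float);
cudaMalloc(&d_buf, bytes);
cudaMemset(d_buf, 0, bytes);   // sets every byte to 0, so every float becomes 0.0f
// ... use d_buf in kernels ...
cudaFree(d_buf);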
For example, in the vector addition code below, the vectors must first be copied from host memory to device memory before the computation runs:
// Device code: each thread adds one pair of elements.
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);

    // Initialize input vectors (h_C is also allocated in this elided host code)
    ...

    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    ...
}

Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocating 2D and 3D arrays, because they pad the allocation as needed to meet alignment requirements, which makes memory accesses more efficient.
For example, the following code allocates a width × height 2D array of floats and shows how to loop over the array elements in device code:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        // Each row starts pitch bytes after the previous one.
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
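To fill such a pitched allocation from the host, cudaMemcpy2D() copies row by row, taking the destination pitch and the (dense) source pitch separately; a sketch reusing devPtr and pitch from above, with hostData as an assumed dense host buffer:

float* hostData = (float*)malloc(width * height * sizeof(float));
// Copy width * sizeof(float) bytes per row for height rows;
// destination rows are pitch bytes apart, source rows are packed.
cudaMemcpy2D(devPtr, pitch,
             hostData, width * sizeof(float),
             width * sizeof(float), height,
             cudaMemcpyHostToDevice);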
Similarly, the following code allocates a width × height × depth 3D array of floats and shows how to loop over the array elements in device code:

// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);

// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
                         int width, int height, int depth)
{
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * height;   // bytes per 2D slice
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}
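Similarly (again a sketch of my own, with hostData as an assumed dense host buffer), a pitched 3D allocation is filled with cudaMemcpy3D(), which describes source and destination as cudaPitchedPtr values:

cudaMemcpy3DParms p = {0};
p.srcPtr = make_cudaPitchedPtr(hostData, width * sizeof(float), width, height);
p.dstPtr = devPitchedPtr;            // from cudaMalloc3D() above
p.extent = extent;                   // same extent used for the allocation
p.kind   = cudaMemcpyHostToDevice;
cudaMemcpy3D(&p);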
Note: To avoid allocating too much memory and thereby impacting system-wide performance, request allocation parameters from the user based on the actual problem size. If an allocation fails, you can fall back to other, slower memory types such as cudaMallocHost() or cudaHostRegister(), or return an error that tells the user how much memory could not be allocated.
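A minimal sketch of that fallback pattern (the helper function allocateWithFallback is illustrative, not a real API):

#include <stdio.h>

// Try device memory first; fall back to slower page-locked host memory.
float* allocateWithFallback(size_t bytes, int* onDevice)
{
    float* buf = NULL;
    *onDevice = 1;
    if (cudaMalloc(&buf, bytes) != cudaSuccess) {
        *onDevice = 0;
        if (cudaMallocHost(&buf, bytes) != cudaSuccess) {
            // Both allocations failed: report how much memory was requested.
            fprintf(stderr, "failed to allocate %zu bytes\n", bytes);
            return NULL;
        }
    }
    return buf;
}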