CUDA Programming Model

Published 2017-01-05, Beijing

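CUDA/GPU Programming Model
Zhou Bin @ NVIDIA, USTC, October 2014
© NVIDIA Corporation

Contents
- CPU-GPU interaction model
- GPU thread organization model (reinforced repeatedly)
- GPU memory model
- Basic programming issues

CPU-GPU interaction
- The CPU and the GPU each have their own physical memory space
- They are connected by the PCIe bus (8 GB/s to 16 GB/s)
- Interaction between them is relatively expensive

GPU memory hierarchy (hardware)
[Figure: GPU memory hierarchy]

Access speed
- Register: dedicated HW, single cycle
- Shared Memory: dedicated HW, single cycle
- Local Memory: DRAM, no cache, *slow*
- Global Memory: DRAM, no cache, *slow*
- Constant Memory: DRAM, cached, 1…10s…100s of cycles, depending on cache locality
- Texture Memory: DRAM, cached, 1…10s…100s of cycles, depending on cache locality
- Instruction Memory (invisible): DRAM, cached

GPU architecture review
[Figure: GPU architecture]

GPU thread organization model
[Figure: thread organization]

Thread organization explained
- A kernel consists of a large number of threads
- Threads are divided into thread blocks ('blocks')
  - Threads within a block share 'Shared Memory'
  - Threads within a block can synchronize with '__syncthreads()'
- A kernel launches one 'grid' containing a number of thread blocks
  - The number is set by the user
- Every thread and every thread block has a unique identifier

GPU thread mapping
[Figure: thread mapping]

Relationship between GPU memory and threads
- Thread: private Local Memory per thread
- Block: Shared Memory per block
- Kernel 0, Kernel 1, ... (sequential kernels): Global Memory shared across the whole device
[Figure: memory of device GPU0, memory of device GPU1, and host memory, connected through cudaMemcpy()]
[Figure: Host; Global Memory, Constant Memory, Texture Memory; Block (0,0) and Block (1,0), each with its own Shared Memory; Thread (0,0) and Thread (1,0) in each block, each with its own Registers and Local Memory]

Programming model
- Conventionally, GPUs are used for graphics and image processing
- They operate on pixels, and the operation on each pixel is similar
- SIMD (single instruction, multiple data) can therefore be applied

SIMD (Single Instruction Multiple Data)
- Can also be viewed as data-parallel partitioning
[Figure: a single instruction, a[] = a[] + k, applied by the ALUs to a[0], a[1], ..., a[n-2], a[n-1]]

Single Instruction Multiple Thread (SIMT)
- The GPU version of SIMD
- A model with a very large number of threads achieves a high degree of parallelism
- Switching between threads hides latency
- Multiple threads execute the same instruction stream
- The GPU hosts and schedules a large number of threads

CUDA programming: Extended C
- Declspecs: global, device, shared, local, constant
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads
- Runtime API: memory, symbol, and execution management
- Function launch

    __device__ float filter[N];

    __global__ void convolve(float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads();  // wait until every thread in the block has written region[]
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    float *myimage;
    cudaMalloc((void **)&myimage, bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);

CUDA function declarations
(table columns: Executed on, Callable from)
__device_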