存储层次结构设计.ppt

存储层次结构设计.ppt

The University of Adelaide, School of Computer Science * Chapter 2 — Instructions: Language of the Computer * * SM: Streaming Multiprocessor PTX: Parallel Thread Execution * * * In SIMD, you need to specify the data array + an instruction (on which to operate the data on) + THE INSTRUCTION WIDTH. Eg: You might want to add 2 integer arrays of length 16, then a SIMD instruction would look like (the instruction has been cooked-up by me for demo) add.16 arr1 arr2 However, SIMT doesnt bother about the instruction width. So, essentially, you could write the above example as: arr1[i] + arr2[i] and then launch as many threads as the length of the array, as you want. Note that, if the array size was, let us say, 32, then SIMD EXPECTS you to explicitly call two such add.16 instructions! Whereas, this is not the case with SIMT. * With a large number of shader threads multiplexed on the same execution re- sources, our architecture employs fine-grained multithreading where individual threads are interleaved by the fetch unit to proactively hide the potential latency of stalls before they occur. As illustrated by Figure, warps are issued fairly in a round-robin queue. When a thread is blocked by a memory request, shader core simply removes that thread’s warp from the pool of “ready” warps and thereby allows other threads to proceed while the memory system processes its request. With a large number of threads (1024 per shader core) interleaved on the same pipeline, FGMT effectively hides the latency of most memory operations since the pipeline is occupied with instructions from other threads while memory operations complete. also hides the pipeline latency so that data bypassing logic can potentially be omitted to save area with minimal impact on performance. simplify the dependency check logic design by restricting each thread to have at most one instruction running in the pipeline at any time. * * Reconv. PC : 重新汇聚地址 * * 30 * 32 * 32 = 30 * 1024 = 30K fragments 64KB reg

文档评论(0)

1亿VIP精品文档

相关文档