TPUTCACHE HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT…

  • About 3,440 characters
  • About 24 pages
  • Published 2017-12-24 from Hubei

* TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS
  Aaron Severance
  University of British Columbia
  Advised by Guy Lemieux

* Our Problem
  • We use overlays for data processing
      • Partially/fully fixed processing elements
      • Virtual CGRAs, soft vector processors
  • Memory: large register files/scratchpad in overlay
      • Low latency, local data
  • Trivial (large DMA): burst to/from DDR
  • Non-trivial? Scatter/gather
      • Data-dependent store/load
      • vscatter adr_ptr, idx_vect, data_vect
          for i in 1..N
            adr_ptr[idx_vect[i]] = data_vect[i]
      • Random narrow (32-bit) accesses
      • Waste bandwidth on DDR interfaces

* If Data Fits on the FPGA…
  • BRAMs with interconnect network
  • General network…
      • Not customized per application
      • Shared: all masters - all slaves
  • Memory-mapped BRAM
      • Double-pump (2x clk) if possible
      • Banking/LVT/etc. for further ports

* Example BRAM system

* But if data doesn't fit… (oversimplified)

* So Let's Use a Cache
  • But a throughput-focused cache
      • Low-latency data held in local memories
      • Amortize latency over multiple accesses
      • Focus on bandwidth
  • Replace on-chip memory or augment memory controller?
      • Data fits on-chip
          • Want BRAM-like speed, bandwidth
          • Low overhead compared to shared BRAM
      • Data doesn't fit on-chip
          • Use 'leftover' BRAMs for performance

* TputCache Design Goals
  • Fmax near BRAM Fmax
  • Fully pipelined
  • Support multiple outstanding misses
  • Write coalescing
  • Associativity

* TputCache Architecture
  • Replay-based architecture
      • Reinsert misses back into pipeline
      • Separate line fill/evict logic in background
      • Token FIFO for completing requests in order
  • No MSHRs for tracking misses
      • Fewer muxes (only single replay request mux)
  • 6-stage pipeline - 6 outstanding misses
  • Good performance with high hit rate
      • Common case fast

* TputCache Architecture

* Cache Hit

* Cache Miss

* Evict/Fill Logic

* Area and Fmax Results
  • Reaches 253 MHz compared to 270 MHz BRAM Fmax on Cyclone IV
  • 423 MHz compared to 490 MHz BRAM Fmax on Stratix IV
  • Minor degradation with increasing size, associativity
  • 13% to 35% extr
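
The vscatter pseudocode on the "Our Problem" slide can be read as an ordinary data-dependent store loop. The following is a minimal Python sketch of that reading, not the overlay's actual instruction semantics. The names vgather and useful_fraction are hypothetical helpers added for illustration, and the burst size passed to useful_fraction is an assumed parameter that does not come from the slides.

```python
def vscatter(memory, idx_vect, data_vect):
    """Data-dependent store: memory[idx_vect[i]] = data_vect[i] for each i.

    The slide writes the loop as 1..N; Python lists are 0-based, so the
    loop here simply walks both vectors in step.
    """
    if len(idx_vect) != len(data_vect):
        raise ValueError("idx_vect and data_vect must have the same length")
    for idx, value in zip(idx_vect, data_vect):
        # On duplicate indices the later element wins (an assumption of
        # this sketch; the slides do not specify the ordering).
        memory[idx] = value
    return memory


def vgather(memory, idx_vect):
    """Hypothetical read-side counterpart: a data-dependent load."""
    return [memory[idx] for idx in idx_vect]


def useful_fraction(access_bytes, burst_bytes):
    """Fraction of a memory burst that carries requested data when each
    access touches a different burst. burst_bytes is an assumed
    parameter, not a figure taken from the slides."""
    if access_bytes <= 0 or burst_bytes < access_bytes:
        raise ValueError("need 0 < access_bytes <= burst_bytes")
    return access_bytes / burst_bytes


if __name__ == "__main__":
    mem = [0] * 8
    vscatter(mem, [5, 1, 5], [10, 20, 30])
    print(mem)
    print(vgather(mem, [1, 5]))
    # 32-bit (4-byte) accesses against an assumed 64-byte burst.
    print(useful_fraction(4, 64))
```

Because each element's address depends on the contents of idx_vect, consecutive stores need not fall in the same burst, which is why the slide describes narrow random accesses as wasting DDR bandwidth.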
