Tiny but Mighty: Designing and Realizing Scalable Latency Tolerance for Manycore SoCs

Orenes-Vera, M.; Manocha, A.; Balkind, J.; Gao, F.; Aragón, J.L.; Wentzlaff, D.; Martonosi, M.

Files

MAPLE-ISCA2022-final.pdf (1.3 MB)

Date

2022-06-18

Authors

Orenes-Vera, M. ; Manocha, A. ; Balkind, J. ; Gao, F. ; Aragón, J.L. ; Wentzlaff, D. ; Martonosi, M.

Publisher

ACM and IEEE

Department

Ingeniería y Tecnología de Computadores

DOI

https://doi.org/10.1145/3470496.3527400

Type

info:eu-repo/semantics/lecture

Abstract

Modern computing systems employ significant heterogeneity and specialization to meet performance targets at manageable power. However, memory latency bottlenecks remain problematic, particularly for sparse neural network and graph analytic applications where indirect memory accesses (IMAs) challenge the memory hierarchy. Decades of prior art have proposed hardware and software mechanisms to mitigate IMA latency, but they fail to analyze real-chip considerations, especially when used in SoCs and manycores. In this paper, we revisit many of these techniques while taking into account manycore integration and verification. We present the first system implementation of latency tolerance hardware that provides significant speedups without requiring any memory hierarchy or processor tile modifications. This is achieved through a Memory Access Parallel-Load Engine (MAPLE), integrated through the Network-on-Chip (NoC) in a scalable manner. Our hardware-software co-design allows programs to perform long-latency memory accesses asynchronously from the core, avoiding pipeline stalls and enabling greater memory-level parallelism (MLP). In April 2021, we taped out a manycore chip that includes tens of MAPLE instances for efficient data supply. MAPLE demonstrates a full RTL implementation of out-of-core latency-mitigation hardware, with virtual memory support and automated compilation targeting it. This paper evaluates MAPLE integrated with a dual-core FPGA prototype running applications with full SMP Linux, and demonstrates geomean speedups of 2.35× and 2.27× over software-based prefetching and decoupling, respectively. Compared to state-of-the-art hardware, it provides geomean speedups of 1.82× and 1.72× over prefetching and decoupling techniques, respectively.

Subject

Computer systems organization, Multicore architectures, Reconfigurable computing, Heterogeneous systems, Memory, Latency tolerance, Decoupling, Modular RTL

Citation

Proc. of the 49th IEEE/ACM International Symposium on Computer Architecture (ISCA), New York, NY, USA, pp. 817-830, ISBN: 978-1-4503-8610-4, June 2022

URI

http://hdl.handle.net/10201/138304

Collections

Articles

This item is subject to a Creative Commons license. http://creativecommons.org/licenses/by/4.0/
