A complexity-effective local delta prefetcher

Navarro-Torres, Agustín; Panda, Biswabandan; Alastruey-Benedé, Jesús; Ibáñez, Pablo; Viñals-Yúfera, Víctor; Ros Bardisa, Alberto


Files

anavarrotorres-tc25.pdf (2.62 MB)

Date

2025-01-31

Authors

Navarro-Torres, Agustín ; Panda, Biswabandan ; Alastruey-Benedé, Jesús ; Ibáñez, Pablo ; Viñals-Yúfera, Víctor ; Ros Bardisa, Alberto

Publisher

Institute of Electrical and Electronics Engineers

Department

Ingeniería y Tecnología de Computadores

DOI

https://doi.org/10.1109/TC.2025.3533086

Type

info:eu-repo/semantics/article

Description

© 2025, IEEE. All rights reserved. This manuscript version is made available under the CC-BY 4.0 license: http://creativecommons.org/licenses/by/4.0/. This document is the accepted version of a published work that appeared in final form in IEEE Transactions on Computers. To access the final edited and published work, see https://doi.org/10.1109/TC.2025.3533086

Abstract

Data prefetching is crucial for performance in modern processors because it effectively masks long-latency memory accesses. Over the past decades, numerous data prefetching mechanisms have been proposed, and they have steadily reduced the access latency of the memory hierarchy. Several state-of-the-art prefetchers, namely the Instruction Pointer Classifier Prefetcher (IPCP) and Berti, target the first-level data cache and can therefore completely hide the miss latency of timely prefetched cache lines. Berti exploits timely local deltas to achieve high accuracy and performance. This paper extends the previous conference paper on Berti with a larger evaluation and additional optimizations. The result is a complexity-effective version of Berti that outperforms the original on a large number of workloads and simplifies its control logic. The key to these advances is a simple mechanism for learning timely deltas without tracking the fetch latency of each cache miss. Our experiments, conducted with a wide range of workloads (CVP traces by Qualcomm, SPEC CPU2017, and GAP), show performance improvements of 4.0% over a mainstream stride prefetcher and of a non-negligible 1.4% over the previously published version of Berti, which requires similar storage.
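To make the term "local delta" in the abstract concrete, the following is a minimal illustrative sketch, not the authors' mechanism. It assumes that a local delta is the difference between the cache line an instruction pointer (IP) accesses now and a line the same IP accessed recently, and it simply counts how often each delta recurs per IP. The names `learn_local_deltas`, `best_delta`, and `HISTORY_LEN` are hypothetical, and the sketch ignores timeliness entirely, which is the part the paper actually addresses.

```python
from collections import Counter, defaultdict, deque

# Assumption for illustration: how many past line addresses are kept per IP.
HISTORY_LEN = 4


def learn_local_deltas(accesses, history_len=HISTORY_LEN):
    """Count, per instruction pointer, the deltas between each accessed
    cache line and the lines the same IP touched recently."""
    history = defaultdict(lambda: deque(maxlen=history_len))
    delta_counts = defaultdict(Counter)
    for ip, line in accesses:
        for past_line in history[ip]:
            delta = line - past_line
            if delta != 0:  # a repeated line carries no prefetch information
                delta_counts[ip][delta] += 1
        history[ip].append(line)
    return delta_counts


def best_delta(delta_counts, ip):
    """Return the most frequently observed delta for an IP, or None."""
    counts = delta_counts.get(ip)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Two IPs interleaved: one walks forward by 2 lines, the other backward by 1.
trace = [
    (0x400, 100), (0x500, 50),
    (0x400, 102), (0x500, 49),
    (0x400, 104), (0x500, 48),
    (0x400, 106),
]
counts = learn_local_deltas(trace)
print(best_delta(counts, 0x400))  # 2
print(best_delta(counts, 0x500))  # -1
```

In the interleaved trace, deltas between consecutive accesses regardless of IP would look irregular, while the per-IP view exposes a stable delta for each instruction. A real prefetcher would additionally have to decide which deltas yield prefetches that arrive in time, which this sketch does not model.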

Subject

Data prefetching, Hardware prefetching, First-level cache, Stride, Local deltas, Accuracy, Timeliness

Citation

IEEE Transactions on Computers 2025

URI

http://hdl.handle.net/10201/151244

Collections

Articles

This item is subject to a Creative Commons license. http://creativecommons.org/licenses/by/4.0/
