Browsing by Subject "Write operations"
- Publication (Restricted): Temporarily unauthorized stores: write first, ask for permission later (IEEE Computer Society, 2024-12-03). Cebrian, Juan M.; Jahre, Magnus; Ros Bardisa, Alberto; Ingeniería y Tecnología de Computadores

x86 processors implement a total store order (x86-TSO) consistency model, which requires stores to update memory in a sequenced manner. The latency of stores is then hidden by the store buffer (SB), which holds stores until the write is performed. On a long-latency cache miss, however, stores block the SB, eventually stalling the processor and degrading performance. Contemporary industrial high-performance processors deal with this situation by overprovisioning the size of the SB, but this comes at the cost of energy and latency overheads. In this work, we remove the stalls caused by stores blocked at the head of the SB while reusing existing processor resources, either improving performance when SB size is kept constant or maintaining performance while reducing SB size. Our proposal, Temporarily Unauthorized Stores (TUS), achieves this by 1) extending the functionality of the write combining buffers so that they can coalesce stores while maintaining x86-TSO consistency, and 2) immediately writing data to the first-level cache upon a miss (i.e., providing an always-hit illusion) while temporarily keeping the written data invisible to the cache coherence protocol, i.e., these stores are temporarily unauthorized. TUS makes temporarily unauthorized stores visible in x86-TSO order without speculation or rollbacks once write permission is obtained. In essence, TUS logically transforms the write combining buffers and the first-level cache into an "extension" of the SB. TUS improves performance by up to 26% (3.2% on average) while reducing the total energy-delay-product (EDP) by up to 35.9% (6.4% on average) for SB-bound benchmarks with a 114-entry SB compared to our baseline architecture with an SB of the same size.
When configured with a 32-entry SB, TUS yields a performance improvement of 2% over a 114-entry SB baseline while reducing SB energy per search by a factor of 2, SB area by 21%, and store-to-load forwarding latency from 5 to 3 cycles.
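
The head-of-SB blocking problem that motivates the abstract can be illustrated with a toy model. The sketch below is not the authors' simulator and does not model TUS, write combining, or coherence. It assumes a core that tries to issue one store per cycle, a FIFO store buffer whose head is the only entry allowed to perform (a stand-in for x86-TSO ordering), and made-up latencies of 1 cycle for a hit and 100 cycles for a miss. The function name `simulate` and all parameters are hypothetical.

```python
from collections import deque


def simulate(latencies, sb_size):
    """Toy in-order store buffer; returns (total_cycles, stall_cycles).

    latencies: cycles each store needs once it reaches the SB head (>= 1).
    sb_size:   number of SB entries (assumed parameter, not from the paper).
    """
    sb = deque()          # pending store latencies, oldest at the left
    head_remaining = 0    # cycles left until the head store performs
    next_store = 0        # index of the next store the core wants to issue
    cycles = 0
    stalls = 0            # cycles in which the core could not insert a store

    while next_store < len(latencies) or sb:
        cycles += 1

        # Drain: only the head may perform, so a slow head blocks the rest.
        if sb:
            head_remaining -= 1
            if head_remaining == 0:
                sb.popleft()
                if sb:
                    head_remaining = sb[0]

        # Issue: insert one store per cycle if an entry is free.
        if next_store < len(latencies):
            if len(sb) < sb_size:
                sb.append(latencies[next_store])
                if len(sb) == 1:
                    head_remaining = sb[0]
                next_store += 1
            else:
                stalls += 1

    return cycles, stalls


if __name__ == "__main__":
    # One long-latency miss followed by nine hits (illustrative values only).
    trace = [100] + [1] * 9
    for size in (4, 16):
        total, stalled = simulate(trace, size)
        print(f"sb_size={size}: cycles={total}, stall_cycles={stalled}")
```

In this model a small buffer fills behind the missing store and the core stalls, while a buffer large enough to hold the whole trace never stalls. That is the overprovisioning trade-off the abstract describes, shown here only qualitatively.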