Browsing by Subject "Multi core architectures"
- Publication (Restricted): Hardware cache locking for all memory updates (IEEE Computer Society, 2024). Asgharzadeh, Ashkan; Gómez Hernández, Eduardo José; Cebrián, Juan M.; Kaxiras, Stefanos; Ros Bardisa, Alberto; Ingeniería y Tecnología de Computadores. Many applications need to perform operations that involve reading a value from memory, modifying it, and then writing it back. Multiple architectures provide hardware support for these operations via read-modify-write (RMW) instructions. The primary benefit is that the read can request a cacheline with write permissions, reducing coherence protocol overhead since the write will find the cacheline with appropriate permissions. RMWs can be either atomic or non-atomic. Atomic RMWs, used for synchronization, commonly require (i) locking the cacheline to guarantee atomicity by preventing invalidations and (ii) enforcing serialization of instructions in the program (e.g., via memory fences), which may cause performance degradation based on the implemented memory consistency model. Non-atomic RMWs, while not requiring such strict measures, should only be used in data-race free code sections. However, other cores may invalidate a cacheline during a non-atomic RMW (e.g., due to false sharing), flushing the pipeline and causing the loss of write permissions obtained by the read, which is detrimental to performance. In this work, we propose a microarchitectural mechanism that enables non-atomic RMWs to fetch the cacheline locking it, thus preventing other cores from "stealing" the cacheline while allowing them to run concurrently with other instructions in the same core. Our proposal enables concurrent hardware cache locking for multiple non-atomic RMWs while guaranteeing deadlock freedom and no programmer/compiler intervention. We also propose a lock-chaining mechanism to allow multiple consecutive memory updates to the same cacheline up to a predefined maximum (to prevent starvation and load imbalance). Our evaluation using gem5 full-system simulator shows that for an eight-core configuration, our proposal improves performance by up to 5.36% (2.05% on average), requiring just 45 bytes of storage per core.
- Publication (Restricted): Temporarily unauthorized stores: write first, ask for permission later (IEEE Computer Society, 2024-12-03). Cebrian, Juan M.; Jahre, Magnus; Ros Bardisa, Alberto; Ingeniería y Tecnología de Computadores. x86 processors implement a total store order (x86-TSO) consistency model, which requires stores to update memory in a sequenced manner. The latency of stores is then hidden by the store buffer (SB), which holds stores until the write is performed. On a long latency cache miss, however, stores block the SB, eventually stalling the processor and degrading performance. Contemporary industrial high-performance processors deal with this situation by overprovisioning the size of the SB, but this comes at the cost of energy and latency overheads. In this work, we remove the stalls caused by stores blocked at the head of the SB while reusing existing processor resources, either improving performance when SB size is kept constant or maintaining performance while reducing SB size. Our proposal, Temporarily Unauthorized Stores (TUS), achieves this by 1) extending the functionality of the write combining buffers, allowing them to coalesce stores while maintaining x86-TSO consistency, and 2) immediately writing data to the first-level cache upon a miss (i.e., providing an always-hit illusion) while temporarily keeping the written data invisible to the cache coherence protocol, i.e., these stores are temporarily unauthorized. TUS makes temporarily unauthorized stores visible in x86-TSO order without speculation or rollbacks once write permission is obtained. In essence, TUS logically transforms the write combining buffers and the first-level cache into an "extension" of the SB. TUS improves performance by up to 26% (3.2% on average) while reducing the total energy-delay-product (EDP) by up to 35.9% (6.4% on average) for SB-bound benchmarks with a 114-entry SB compared to our baseline architecture with an SB of the same size. When configured with a 32-entry SB, TUS yields a performance improvement of 2% over a 114-entry SB baseline while reducing SB energy per search by a factor of 2, SB area by 21%, and store-to-load forwarding latency from 5 to 3 cycles.
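
The store-buffer stall described in the second abstract can be illustrated with a toy model. The sketch below is not TUS and not the authors' simulator: it assumes a hypothetical core that issues one store per cycle into a FIFO store buffer in which only the head store makes progress, as in-order writes under a TSO-like model would require. The function name `count_stall_cycles` and the latencies are made up for illustration.

```python
from collections import deque


def count_stall_cycles(latencies, sb_size):
    """Count cycles in which a toy core cannot issue a store because the SB is full.

    latencies[i] is the (hypothetical) number of cycles store i needs at the
    head of the store buffer before its write completes.
    """
    sb = deque()      # remaining cycles per buffered store, head on the left
    stalls = 0
    next_store = 0
    while next_store < len(latencies) or sb:
        # Drain: only the head store progresses, so writes complete in order.
        if sb:
            sb[0] -= 1
            if sb[0] <= 0:
                sb.popleft()
        # Issue: at most one new store per cycle, and only if an entry is free.
        if next_store < len(latencies):
            if len(sb) < sb_size:
                sb.append(latencies[next_store])
                next_store += 1
            else:
                stalls += 1
    return stalls


# One long-latency miss (100 cycles) followed by 50 single-cycle stores.
workload = [100] + [1] * 50
small = count_stall_cycles(workload, sb_size=4)    # small SB fills up behind the miss
large = count_stall_cycles(workload, sb_size=64)   # overprovisioned SB absorbs the burst
print(small, large)
```

In this toy setup the 4-entry buffer fills behind the miss and the core stalls until the head store completes, while the 64-entry buffer absorbs all 51 stores without stalling. This mirrors the overprovisioning approach the abstract attributes to contemporary processors, and it says nothing about how TUS itself avoids the stall.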