DigitalUM :: Browsing by Subject "Memory management"

Browsing by Subject "Memory management"

Now showing 1 - 4 of 4

Open Access
Architectural Support for Optimizing Huge Page Selection Within the OS
(2023-10-30) Manocha, A.; Yan, Z.; Tureci, E.; Aragón, J.L.; Nellans, D.; Martonosi, M.; Ingeniería y Tecnología de Computadores
Irregular, memory-intensive applications often incur high translation lookaside buffer (TLB) miss rates that result in significant address translation overheads. Employing huge pages is an effective way to reduce these overheads, however in real systems the number of available huge pages can be limited when system memory is nearly full and/or fragmented. Thus, huge pages must be used selectively to back application memory. This work demonstrates that choosing memory regions that incur the most TLB misses for huge page promotion best reduces address translation overheads. We call these regions High reUse TLB-sensitive data (HUBs). Unlike prior work which relies on expensive per-page software counters to identify promotion regions, we propose new architectural support to identify these regions dynamically at application runtime. We propose a promotion candidate cache (PCC) that identifies HUB candidates based on hardware page table walks after a lastlevel TLB miss. This small, fixed-size structure tracks huge pagealigned regions (consisting of 𝑁 base pages), ranks them based on observed page table walk frequency, and only keeps the most frequently accessed ones. Evaluated on applications of various memory intensity, our approach successfully identifies application pages incurring the highest address translation overheads. Our approach demonstrates that with the help of a PCC, the OS only needs to promote 4% of the application footprint to achieve more than 75% of the peak achievable performance, yielding 1.19-1.33× speedups over 4KB base pages alone. In real systems where memory is typically fragmented, the PCC outperforms Linux’s page promotion policy by 14% (when 50% of total memory is fragmented) and 16% (when 90% of total memory is fragmented) respectively.
Restricted
Chaining transactions for effective concurrency management in hardware transactional memory
(IEEE Computer Society, 2024-12-03) Nicolas Conesa, Víctor; Titos Gil, Rubén; Fernández Pascual, Ricardo; Acacio, Manuel E.; Ros Bardisa, Alberto; Ingeniería y Tecnología de Computadores
Hardware Transactional Memory (HTM) offers the opportunity to ease parallel programming. However, driven by hardware limitations, commercial implementations eschew the complexity involved in early sophisticated proposals from academia, and, among other things, opt for simple conflict resolution policies that inevitably increase transaction aborts. To increase thread level parallelism, previous works propose conflict resolution schemes that, instead of aborting, add a second level of speculation consisting in using not-yet-committed data from another transaction. This policy, which we refer to as requester-speculates, has not yet been considered in the context of the kind of best-effort HTM support provided by commercial processors. This work proposes CHAining TransactionS (CHATS), a simple yet effective realization of the requester-speculates con-flict resolution policy in which cyclic dependencies between transactions are avoided and the commit ordering respects the dependencies that transactions make once speculative values are communicated. The ultimate result is a best-effort HTM implementation that forces a partial order between transactions in a way that ensures effective utilization of forwarded data and that gets away from the complexity of previous proposals. Simulations using gem5 demonstrate the effectiveness of CHATS in both commercial-like setups and academic state-of-the-art best-effort systems (22% and 16% reduction in execution time, on average, respectively). These improvements are achieved by requiring less than 280 bytes of extra storage.
Open Access
Static Instruction Scheduling for High Performance on Limited Hardware
(IEEE, 2018-04-01) Tran, Kim-Anh; Carlson, Trevor E.; Koukos, Konstantinos; Själander, Magnus; Spiliopoulos, Vasileios; Kaxiras, Stefanos; Jimborean, Alexandra; Ingeniería y Tecnología de Computadores
Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses at the cost of increased energy consumption. Simple, limited OoO processors are a compromise in terms of energy consumption and performance, as they have fewer hardware resources to tolerate the penalties of long-latency loads. In worst case, these loads may stall the processor entirely. We present Clairvoyance, a compiler based technique that generates code able to hide memory latency and better utilize simple OoO processors. By clustering loads found across basic block boundaries, Clairvoyance overlaps the outstanding latencies to increases memory-level parallelism. We show that these simple OoO processors, equipped with the appropriate compiler support, can effectively hide long-latency loads and achieve performance improvements for memory-bound applications. To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Clairvoyance achieves a geomean execution time improvement of 14 percent for memory-bound applications, on top of standard O3 optimizations, while maintaining compute-bound applications' high-performance.
Open Access
TCOR: A Tile Cache with Optimal Replacement
(IEEE, 2022-04-06) Joseph, D.; Aragón, J.L.; Parcerisa, J.M.; González, A,; Ingeniería y Tecnología de Computadores
Cache Replacement Policies are known to have an important impact on hit rates. The OPT replacement policy [27] has been formally proven as optimal for minimizing misses. Due to its need to look far ahead for future memory accesses, it is often reduced to a yardstick for measuring the efficacy of other practical caches. In this paper, we bring the OPT to life, in architectures for mobile GPUs, for which energy efficiency is of great consequence. We also mold other factors in the memory hierarchy to enhance its impact. The end results are a 13.8% decrease in the memory hierarchy energy consumption and an increased throughput in the Tiling Engine. We also observe a 5.5% decrease in the total GPU energy and a 3.7% increase in frames per second (FPS).

Browsing by Subject "Memory management"

Results Per Page

Sort Options