Low-overhead General-purpose Near-Data Processing in CXL Memory Expanders

Ham, Hyungkyu; Hong, Jeongmin; Park, Geonwoo; Shin, Yunseon; Woo, Okkyun; Yang, Wonhyuk; Bae, Jinhoon; Park, Eunhyeok; Sung, Hyojin; Lim, Euicheol; Kim, Gwangsun

(Submitted on 30 Apr 2024)

Abstract: To overcome the memory capacity wall of large-scale AI and big data applications, Compute Express Link (CXL) enables cost-efficient memory expansion beyond the local DRAM of processors. Although its CXL.mem protocol stack minimizes interconnect latency, CXL memory accesses can still cause significant slowdowns for memory-bound applications. Near-data processing (NDP) in CXL memory can overcome this limitation, but prior works propose application-specific hardware units that are unsuitable for practical CXL memory-based systems, which must support a variety of applications. Existing CPU or GPU cores, on the other hand, are not cost-effective for NDP because they are not optimized for memory-bound applications. In addition, communication between the host processor and the CXL controller for NDP offloading should have low latency, but the CXL.io (or PCIe) protocol incurs $\mu$s-scale latency and is not suitable for fine-grained NDP.
To achieve high-performance NDP end-to-end, we propose a low-overhead general-purpose NDP architecture for CXL memory, referred to as Memory-Mapped NDP (M$^2$NDP), which comprises memory-mapped functions (M$^2$func) and memory-mapped $\mu$threading (M$^2\mu$thr). M$^2$func is a CXL.mem-compatible, low-overhead communication mechanism between the host processor and the NDP controller in the CXL memory. M$^2\mu$thr enables a low-cost, general-purpose NDP unit design by introducing lightweight $\mu$threads that support highly concurrent execution of NDP kernels with minimal resource wastage. Combining the two, M$^2$NDP speeds up various applications, including in-memory OLAP, key-value store, large language model, recommendation model, and graph analytics, by up to 128$\times$ (11.5$\times$ overall) and reduces energy by up to 87.9% (80.1% overall) compared to a baseline CPU or GPU host with passive CXL memory.
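
The abstract does not say how M$^2$func encodes a call, so the following is only a toy sketch of the general idea of a memory-mapped function: ordinary stores and loads to a reserved address window are interpreted by the device as a function call and a result read, so no separate CXL.io-style command path is needed. Everything here is an assumption made for illustration, including the `ToyExpander` class, the window base `FUNC_BASE`, the slot size `FUNC_STRIDE`, and the `sum_range` kernel; none of it is taken from the paper.

```python
# Toy model: a device that treats stores to a reserved address window as
# function calls. All names, addresses, and function ids are hypothetical.
from dataclasses import dataclass, field

FUNC_BASE = 0x1000_0000      # hypothetical start of the reserved window
FUNC_STRIDE = 0x40           # hypothetical: one 64-byte slot per function


@dataclass
class ToyExpander:
    memory: dict = field(default_factory=dict)     # ordinary data: addr -> value
    results: dict = field(default_factory=dict)    # function id -> last return value
    functions: dict = field(default_factory=dict)  # function id -> callable

    def register(self, func_id, fn):
        self.functions[func_id] = fn

    def _func_id(self, addr):
        """Return the function id if addr hits a registered slot, else None."""
        if addr < FUNC_BASE:
            return None
        func_id = (addr - FUNC_BASE) // FUNC_STRIDE
        return func_id if func_id in self.functions else None

    def store(self, addr, value):
        func_id = self._func_id(addr)
        if func_id is None:
            self.memory[addr] = value              # plain memory write
        else:
            # The written payload is interpreted as the call arguments.
            self.results[func_id] = self.functions[func_id](self, *value)

    def load(self, addr):
        func_id = self._func_id(addr)
        if func_id is None:
            return self.memory.get(addr, 0)        # plain memory read
        return self.results.get(func_id)           # read back the return value


def sum_range(dev, start, count):
    """Hypothetical near-data kernel: sum `count` 8-byte words from `start`."""
    return sum(dev.load(start + 8 * i) for i in range(count))


dev = ToyExpander()
dev.register(0, sum_range)
for i in range(4):
    dev.store(0x100 + 8 * i, i + 1)    # data words: 1, 2, 3, 4
dev.store(FUNC_BASE, (0x100, 4))       # "call" function 0 with a single store
total = dev.load(FUNC_BASE)            # fetch the result with a single load
```

In this sketch the host issues one store and one load per offloaded call, and the summation runs next to the data instead of moving four words across the link. A real device would also have to handle concurrency, argument layout, and completion signaling, which the toy model ignores.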

Subjects:	Hardware Architecture (cs.AR)
Cite as:	arXiv:2404.19381 [cs.AR]
	(or arXiv:2404.19381v1 [cs.AR] for this version)

Submission history

From: Gwangsun Kim
[v1] Tue, 30 Apr 2024 09:14:12 GMT (990kb,D)
