Scaling Up Memory Disaggregated Applications With Smart

Abstract

Recent developments in RDMA networks are driving a trend toward memory disaggregation. However, the performance of each compute node is still limited by the network, especially when the node must perform a large number of concurrent fine-grained remote accesses. According to our evaluations, existing IOPS-bound disaggregated applications do not scale well beyond 32 cores, and therefore do not take full advantage of today’s many-core machines.

After an in-depth analysis of the internal architecture of the RNIC, we found three major scale-up bottlenecks that limit the throughput of today’s disaggregated applications: (1) implicit contention on doorbell registers, (2) cache thrashing caused by excessive outstanding work requests, and (3) IOPS wasted on unsuccessful CAS retries. However, the solutions to these problems involve many low-level details that are unfamiliar to application developers. To ease the burden on developers, we propose Smart, an RDMA programming framework that hides these details behind an interface similar to one-sided RDMA verbs.
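To make bottleneck (3) concrete, the following is a minimal sketch of a toy lockstep model, not RDMA code and not Smart's mechanism. It assumes that every client still waiting to update a shared counter reads the same value and then issues a CAS, so exactly one CAS succeeds per round and the rest are wasted. The function name `simulate_cas_contention` and the model itself are illustrative assumptions.

```python
from typing import Tuple


def simulate_cas_contention(num_clients: int) -> Tuple[int, int]:
    """Return (total CAS attempts, successful CAS) in a lockstep toy model.

    Assumption: each client wants to increment a shared counter once, and
    all pending clients read the counter before any of them issues a CAS.
    """
    counter = 0
    pending = num_clients  # clients that have not yet completed their increment
    attempts = 0
    successes = 0
    while pending > 0:
        expected = counter  # every pending client observes the same value
        for _ in range(pending):
            attempts += 1
            if counter == expected:
                # Only the first CAS in the round still sees the expected value.
                counter = expected + 1
                successes += 1
        pending -= 1  # exactly one client finished in this round
    return attempts, successes


if __name__ == "__main__":
    for n in (1, 4, 32):
        total, ok = simulate_cas_contention(n)
        print(f"{n} clients: {total} CAS issued, {ok} succeeded, {total - ok} wasted")
```

Under this model, n clients issue n(n+1)/2 CAS operations for only n successful updates, so the share of operations spent on failed retries grows with the number of contending clients.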

We refactor the state-of-the-art disaggregated hash table (RACE) and persistent transaction processing system (FORD) with Smart by changing 44 and 16 lines of code, improving their throughput by up to 132.4× and 5.2×, respectively. We have also refactored Sherman (a recent disaggregated B+Tree) with Smart and an additional speculative lookup optimization (48 lines of code changed), which changes its memory access pattern from bandwidth-bound to IOPS-bound and leads to a speedup of 2.0×. Smart is publicly available at https://github.com/madsys-dev/smart.

Publication
The ACM International Conference on Architectural Support for Programming Languages and Operating Systems 2024