
Distributed, Parallel, and Cluster Computing

New submissions


New submissions for Wed, 1 May 24

[1]  arXiv:2404.19139 [pdf, other]
Title: HMTRace: Hardware-Assisted Memory-Tagging based Dynamic Data Race Detection
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Data races, a category of insidious software concurrency bugs, are often challenging and resource-intensive to detect and debug. Existing dynamic race detection tools incur significant execution-time and memory overhead while exhibiting high false-positive rates. This paper proposes HMTRace, a novel Armv8.5-A memory tag extension (MTE) based dynamic data race detection framework, emphasizing low compute and memory requirements while maintaining high accuracy and precision. HMTRace supports race detection in userspace OpenMP- and Pthread-based multi-threaded C applications. HMTRace achieves a combined F1-score of 0.86 while incurring a mean execution-time overhead of 4.01% and a peak memory (RSS) overhead of 54.31%. HMTRace also reports no false positives: all reported races are confirmed.
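
For illustration only (the paper targets C programs using OpenMP/Pthreads, not Python), the following sketch shows the kind of unsynchronized read-modify-write on shared state that dynamic race detectors are designed to flag:

```python
import sys
import threading

sys.setswitchinterval(1e-6)  # force frequent thread switches to expose the race

counter = 0  # shared state accessed without any lock

def worker(iterations):
    global counter
    for _ in range(iterations):
        # Unsynchronized read-modify-write: two threads can read the same value
        # and both write back value + 1, silently losing one increment.
        value = counter
        counter = value + 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # expected 200000 with proper synchronization; typically less here
```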

[2]  arXiv:2404.19143 [pdf, other]
Title: Workload Intelligence: Punching Holes Through the Cloud Abstraction
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Today, cloud workloads are essentially opaque to the cloud platform. Typically, the only information the platform receives is the virtual machine (VM) type and possibly a decoration to the type (e.g., the VM is evictable). Similarly, workloads receive little to no information from the platform; generally, workloads might receive telemetry from their VMs or exceptional signals (e.g., shortly before a VM is evicted). The narrow interface between workloads and platforms has several drawbacks: (1) a surge in VM types and decorations in public cloud platforms complicates customer selection; (2) essential workload characteristics (e.g., low availability requirements, high latency tolerance) are often unspecified, hindering platform customization for optimized resource usage and cost savings; and (3) workloads may be unaware of potential optimizations or lack sufficient time to react to platform events.
In this paper, we propose a framework, called Workload Intelligence (WI), for dynamic bi-directional communication between cloud workloads and the cloud platform. Via WI, workloads can programmatically adjust their key characteristics and requirements, and even dynamically adapt behaviors such as VM priorities. In the other direction, WI allows the platform to programmatically inform workloads about upcoming events, opportunities for optimization, and other scenarios. With WI, the cloud platform can drastically simplify its offerings, reduce its costs without fear of violating any workload requirements, and reduce prices to its customers by 48.8% on average.
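
Purely as a hypothetical sketch of what such a bi-directional interface could look like (the classes, fields, and event names below are invented for illustration and are not the WI API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class WorkloadHints:
    """Characteristics a workload could declare to the platform (illustrative only)."""
    availability_target: float = 0.99   # e.g. a batch job might tolerate 0.9
    latency_tolerance_ms: int = 100
    evictable: bool = False

class PlatformChannel:
    """Toy stand-in for a platform endpoint; the real WI interface is not shown in the abstract."""
    def __init__(self):
        self._handlers: list[Callable[[str], None]] = []

    def declare(self, hints: WorkloadHints) -> None:
        # Workload -> platform direction: declare requirements programmatically.
        print(f"workload declared: {hints}")

    def on_platform_event(self, handler: Callable[[str], None]) -> None:
        # Platform -> workload direction, e.g. "vm-eviction-in-60s" or "cheap-capacity-available".
        self._handlers.append(handler)

channel = PlatformChannel()
channel.declare(WorkloadHints(availability_target=0.9, latency_tolerance_ms=5000, evictable=True))
channel.on_platform_event(lambda event: print(f"reacting to platform event: {event}"))
```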

[3]  arXiv:2404.19209 [pdf, other]
Title: AdaOper: Energy-efficient and Responsive Concurrent DNN Inference on Mobile Devices
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Deep neural networks (DNNs) have driven extensive applications in mobile technology. However, for long-running mobile apps like voice assistants or video applications on smartphones, energy efficiency is critical for battery-powered devices. The rise of heterogeneous processors in mobile devices today has introduced new challenges for optimizing energy efficiency. Our key insight is that partitioning computations across different processors for parallelism and speedup does not necessarily reduce energy consumption and may even increase it. To address this, we present AdaOper, an energy-efficient concurrent DNN inference system. It optimizes energy efficiency on mobile heterogeneous processors while maintaining responsiveness. AdaOper includes a runtime energy profiler that dynamically adjusts operator partitioning to optimize energy efficiency based on dynamic device conditions. We conduct preliminary experiments, which show that AdaOper reduces energy consumption by 16.88% compared to the existing concurrent method while ensuring real-time performance.
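
A toy sketch of the underlying observation, that the fastest partitioning is not necessarily the most energy-efficient one, using invented per-operator latency/energy profiles and a greedy selection (not AdaOper's actual profiler or scheduler):

```python
# Hypothetical per-operator profiles: (latency_ms, energy_mJ) for each placement.
# Splitting an operator across CPU+GPU may be fastest yet cost the most energy.
profiles = {
    "conv1": {"cpu": (12.0, 30.0), "gpu": (4.0, 55.0), "cpu+gpu": (3.0, 70.0)},
    "conv2": {"cpu": (20.0, 48.0), "gpu": (6.0, 60.0), "cpu+gpu": (4.5, 85.0)},
}

def pick_placement(op, deadline_ms):
    """Choose the lowest-energy placement that still meets the latency deadline."""
    feasible = [(energy, lat, p) for p, (lat, energy) in profiles[op].items() if lat <= deadline_ms]
    energy, latency, placement = min(feasible)
    return placement, latency, energy

for op in profiles:
    print(op, pick_placement(op, deadline_ms=8.0))   # picks "gpu", not the faster "cpu+gpu"
```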

[4]  arXiv:2404.19429 [pdf, other]
Title: Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
Comments: 11 pages, 16 figures. Published in MLSys'24
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

The Mixture-of-Experts (MoE) technique plays a crucial role in expanding the size of DNN model parameters. However, it faces the challenge of extended all-to-all communication latency during the training process. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation, yet they frequently fall short of achieving sufficient overlap, restricting the potential for performance enhancement. In our study, we extend the scope of this challenge by considering overlap at the broader training-graph level. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling gradient weight computations. We implement these techniques in Lancet, a system using compiler-based optimization to automatically enhance MoE model training. Our extensive evaluation reveals that Lancet significantly reduces the time devoted to non-overlapping communication, by as much as 77%. Moreover, it achieves a notable end-to-end speedup of up to 1.3x compared to state-of-the-art solutions.
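
The general idea of overlapping all-to-all communication with independent computation can be sketched abstractly; the following is a generic pipelining illustration with stand-in delays, not Lancet's compiler-based scheduling:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def all_to_all(chunk):
    time.sleep(0.05)   # stand-in for the network latency of exchanging one micro-batch
    return chunk

def non_moe_compute(chunk):
    time.sleep(0.05)   # stand-in for computation that does not depend on the exchange
    return chunk

chunks = list(range(8))
start = time.time()
with ThreadPoolExecutor(max_workers=2) as pool:
    # Communication for later chunks proceeds in background threads while the
    # main thread computes on chunks whose exchange has already completed.
    comm_futures = [pool.submit(all_to_all, c) for c in chunks]
    results = [non_moe_compute(f.result()) for f in comm_futures]
elapsed = time.time() - start
print(f"overlapped: {elapsed:.2f}s vs ~{0.10 * len(chunks):.2f}s if comm and compute were serialized")
```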

[5]  arXiv:2404.19612 [pdf, other]
Title: Quantum Cloud Computing: Trends and Challenges
Comments: 9 pages, 6 figures
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)

Quantum computing (QC) is a new paradigm that will revolutionize various areas of computing, especially cloud computing. QC, still in its infancy, is a costly technology that must operate in highly isolated environments because of its extreme sensitivity to environmental disturbances. For this reason, it remains difficult for most researchers to access. Integrating QC into an isolated remote server, like a cloud, and making it available to users can overcome these problems. Furthermore, experts predict that QC, with its ability to swiftly resolve complex and computationally intensive operations, will offer significant benefits in systems that process large amounts of data, like cloud computing. This article presents the vision and challenges for the quantum cloud computing (QCC) paradigm that will emerge from the integration of quantum and cloud computing. Next, we present the advantages of QC over classical computing applications. We analyze the effects of QC on cloud systems, such as cost, security, and scalability. Beyond these advantages, we highlight open research gaps in QCC, such as qubit stability and efficient resource allocation, identifying QCC's advantages and challenges for future research.

[6]  arXiv:2404.19634 [pdf, other]
Title: DF Louvain: Fast Incrementally Expanding Approach for Community Detection on Dynamic Graphs
Authors: Subhajit Sahu
Comments: 22 pages, 15 figures, 3 tables
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Social and Information Networks (cs.SI)

Community detection is the problem of recognizing natural divisions in networks. A relevant challenge in this problem is finding communities on rapidly evolving graphs. In this report, we present our Parallel Dynamic Frontier (DF) Louvain algorithm, which, given a batch update of edge deletions and insertions, incrementally identifies and processes an approximate set of affected vertices in the graph with minimal overhead, while using a novel approach of incrementally updating the weighted degrees of vertices and the total edge weights of communities. We also present our parallel implementations of Naive-dynamic (ND) and Delta-screening (DS) Louvain. On a server with a 64-core AMD EPYC-7742 processor, our experiments show that DF Louvain obtains speedups of 179x, 7.2x, and 5.3x on real-world dynamic graphs compared to Static, ND, and DS Louvain, respectively, and is 183x, 13.8x, and 8.7x faster, respectively, on large graphs with random batch updates. Moreover, DF Louvain improves its performance by 1.6x for every doubling of threads.
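
A rough, sequential sketch of how an initial set of affected vertices might be seeded from a batch update (a simplification of the paper's Dynamic Frontier approach, which then expands this set incrementally as community labels change):

```python
# Community labels from the previous graph snapshot (illustrative).
community = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}

def initial_frontier(deletions, insertions):
    """Seed the set of vertices whose community assignment may change after the batch."""
    affected = set()
    for u, v in deletions:
        if community[u] == community[v]:   # deleting an intra-community edge
            affected.update((u, v))
    for u, v in insertions:
        if community[u] != community[v]:   # inserting a cross-community edge
            affected.update((u, v))
    return affected

print(initial_frontier(deletions=[(1, 2)], insertions=[(3, 4)]))   # {1, 2, 3, 4}
```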

[7]  arXiv:2404.19638 [pdf, other]
Title: SpComm3D: A Framework for Enabling Sparse Communication in 3D Sparse Kernels
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Existing 3D algorithms for distributed-memory sparse kernels suffer from limited scalability due to their reliance on bulk, sparsity-agnostic communication. While easier to use, sparsity-agnostic communication leads to unnecessary bandwidth and memory consumption. We present SpComm3D, a framework for enabling sparsity-aware communication with a minimal memory footprint, such that no unnecessary data is communicated or stored in memory. SpComm3D performs sparse communication efficiently with minimal or no communication buffers to further reduce memory consumption. SpComm3D detaches the local computation at each processor from the communication, allowing flexibility in choosing the best accelerated version for computation. We build 3D algorithms with SpComm3D for two important sparse ML kernels: sampled dense-dense matrix multiplication (SDDMM) and sparse matrix-matrix multiplication (SpMM). Experimental evaluations on up to 1800 processors demonstrate that SpComm3D has superior scalability and outperforms state-of-the-art sparsity-agnostic methods, with up to 20x improvement in the communication, memory, and runtime of SDDMM and SpMM. The code is available at: https://github.com/nfabubaker/SpComm3D
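
For reference, SDDMM evaluates the dense product A Bᵀ only at the nonzero coordinates of a sparse sampling matrix S; a dense, single-node numpy sketch (ignoring the distributed 3D aspects that SpComm3D addresses):

```python
import numpy as np

def sddmm(S_coo, A, B):
    """Sampled dense-dense matrix multiplication: out[i, j] = S[i, j] * (A[i] @ B[j]),
    evaluated only at the nonzero coordinates of S (given here in COO form)."""
    rows, cols, vals = S_coo
    out_vals = np.array([v * (A[i] @ B[j]) for i, j, v in zip(rows, cols, vals)])
    return rows, cols, out_vals

A = np.random.rand(4, 8)
B = np.random.rand(5, 8)
S = ([0, 2, 3], [1, 4, 0], [1.0, 2.0, 0.5])   # nonzeros of a 4x5 sparse sampling matrix
print(sddmm(S, A, B))
```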

[8]  arXiv:2404.19717 [pdf, other]
Title: Automated, Reliable, and Efficient Continental-Scale Replication of 7.3 Petabytes of Climate Simulation Data: A Case Study
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

We report on our experiences replicating 7.3 petabytes (PB) of Earth System Grid Federation (ESGF) climate simulation data from Lawrence Livermore National Laboratory (LLNL) in California to Argonne National Laboratory (ANL) in Illinois and Oak Ridge National Laboratory (ORNL) in Tennessee. This movement of some 29 million files, twice, undertaken in order to establish new ESGF nodes at ANL and ORNL, was performed largely automatically by a simple replication tool, a script that invoked Globus to transfer large bundles of files while tracking progress in a database. Under the covers, Globus organized transfers to make efficient use of the high-speed Energy Sciences network (ESnet) and the data transfer nodes deployed at participating sites, and also addressed security, integrity checking, and recovery from a variety of transient failures. This success demonstrates the considerable benefits that can accrue from the adoption of performant data replication infrastructure.
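
The overall pattern of bundling files, submitting transfers, and tracking their state in a database can be sketched generically; submit_bundle below is a hypothetical placeholder, not the actual replication tool or the Globus API:

```python
import sqlite3

def submit_bundle(bundle_id, files):
    """Placeholder for submitting one bundle of files to a transfer service
    (e.g. Globus) and returning a task identifier; illustrative only."""
    return f"task-{bundle_id}"

db = sqlite3.connect("replication.db")
db.execute("CREATE TABLE IF NOT EXISTS bundles (id INTEGER PRIMARY KEY, task TEXT, status TEXT)")

bundles = {0: ["/esgf/CMIP6/file_000.nc"], 1: ["/esgf/CMIP6/file_001.nc"]}
for bundle_id, files in bundles.items():
    row = db.execute("SELECT status FROM bundles WHERE id=?", (bundle_id,)).fetchone()
    if row and row[0] == "done":
        continue                       # resume: skip bundles already replicated
    task = submit_bundle(bundle_id, files)
    db.execute("INSERT OR REPLACE INTO bundles VALUES (?, ?, ?)", (bundle_id, task, "submitted"))
    db.commit()
```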

Cross-lists for Wed, 1 May 24

[9]  arXiv:2404.19019 (cross-list from cs.DS) [pdf, other]
Title: Optimal Parallel Algorithms for Dendrogram Computation and Single-Linkage Clustering
Comments: To appear at SPAA 2024
Subjects: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC)

Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree $T$, the SLD of $T$ is a binary dendrogram that summarizes the $n-1$ clusterings obtained by contracting the edges of $T$ in order of weight. Existing algorithms for computing the SLD all require $\Omega(n\log n)$ work where $n = |T|$. Furthermore, to the best of our knowledge no prior work provides a parallel algorithm obtaining non-trivial speedup for this problem.
In this paper, we design faster parallel algorithms for computing SLDs both in theory and in practice based on new structural results about SLDs. In particular, we obtain a deterministic output-sensitive parallel algorithm based on parallel tree contraction that requires $O(n \log h)$ work and $O(\log^2 n \log^2 h)$ depth, where $h$ is the height of the output SLD. We also give a deterministic bottom-up algorithm for the problem inspired by the nearest neighbor chain algorithm for hierarchical agglomerative clustering, and show that it achieves $O(n\log h)$ work and $O(h \log n)$ depth. Our results are based on a novel divide-and-conquer framework for building SLDs, inspired by divide-and-conquer algorithms for Cartesian trees. Our new algorithms can quickly compute the SLD on billion-scale trees, and obtain up to 150x speedup over the highly-efficient Union-Find algorithm typically used to compute SLDs in practice.
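
For context, the sequential Union-Find baseline mentioned above processes the tree edges in increasing weight order and records one dendrogram node per merge; a minimal sketch (not the paper's parallel algorithms):

```python
def single_linkage_dendrogram(n, edges):
    """edges: (weight, u, v) for the n-1 edges of a tree over vertices 0..n-1.
    Returns one merge record (weight, cluster_a, cluster_b, new_cluster) per edge."""
    parent = list(range(2 * n - 1))          # union-find over leaves and merged clusters

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    merges, next_id = [], n
    for w, u, v in sorted(edges):            # contract edges in order of weight
        a, b = find(u), find(v)
        merges.append((w, a, b, next_id))
        parent[a] = parent[b] = next_id      # the new cluster becomes the root
        next_id += 1
    return merges

# A path 0-1-2-3 with weights 3, 1, 2: merges happen at weights 1, 2, then 3.
print(single_linkage_dendrogram(4, [(3.0, 0, 1), (1.0, 1, 2), (2.0, 2, 3)]))
```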

[10]  arXiv:2404.19331 (cross-list from cs.PF) [pdf, other]
Title: Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs
Subjects: Performance (cs.PF); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)

Depthwise and pointwise convolutions have fewer parameters and perform fewer operations than standard convolutions. As a result, they have become increasingly used in various compact DNNs, including convolutional neural networks (CNNs) and vision transformers (ViTs). However, they have a lower compute-to-memory-access ratio than standard convolutions, making their memory accesses often the performance bottleneck. This paper explores fusing depthwise and pointwise convolutions to overcome the memory-access bottleneck, focusing on fusing these operators on GPUs. The prior art on GPU-based fusion suffers from one or more of the following: (1) fusing only a convolution with an element-wise operator, or fusing multiple non-convolutional operators; (2) not explicitly optimizing for memory accesses; (3) not supporting depthwise convolutions. This paper proposes Fused Convolutional Modules (FCMs), a set of novel fused depthwise and pointwise GPU kernels. FCMs significantly reduce the memory accesses of pointwise and depthwise convolutions, improving execution time and energy efficiency. To evaluate the trade-offs associated with fusion and determine which convolutions are beneficial to fuse and the optimal FCM parameters, we propose FusePlanner. FusePlanner consists of cost models that estimate the memory accesses of depthwise, pointwise, and FCM kernels given GPU characteristics. Our experiments on three GPUs using representative CNNs and ViTs demonstrate that FCMs save up to 83% of memory accesses and achieve speedups of up to 3.7x compared to cuDNN. Complete model implementations of various CNNs using our modules outperform TVM's, achieving speedups of up to 1.8x and saving up to two-thirds of the energy.
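
To make the fusion target concrete, here is a plain numpy sketch of an unfused depthwise convolution followed by a pointwise (1x1) convolution; a fused GPU kernel would compute the same result while avoiding writing the depthwise intermediate back to global memory (illustration only, not the FCM kernels):

```python
import numpy as np

def depthwise_then_pointwise(x, dw, pw):
    """x: (C, H, W) input, dw: (C, k, k) depthwise filters, pw: (C_out, C) pointwise filters.
    Unfused reference: materializes the full depthwise output before the 1x1 convolution."""
    C, H, W = x.shape
    k = dw.shape[1]
    Ho, Wo = H - k + 1, W - k + 1
    depthwise_out = np.zeros((C, Ho, Wo))
    for c in range(C):                         # each channel convolved with its own filter
        for i in range(Ho):
            for j in range(Wo):
                depthwise_out[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * dw[c])
    # Pointwise (1x1) convolution is a channel-mixing matrix multiply at every pixel.
    return np.einsum("oc,chw->ohw", pw, depthwise_out)

x = np.random.rand(3, 8, 8)
dw = np.random.rand(3, 3, 3)
pw = np.random.rand(4, 3)
print(depthwise_then_pointwise(x, dw, pw).shape)   # (4, 6, 6)
```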

[11]  arXiv:2404.19725 (cross-list from cs.LG) [pdf, other]
Title: Fairness Without Demographics in Human-Centered Federated Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Federated learning (FL) enables collaborative model training while preserving data privacy, making it suitable for decentralized human-centered AI applications. However, a significant research gap remains in ensuring fairness in these systems. Current fairness strategies in FL require knowledge of bias-creating/sensitive attributes, clashing with FL's privacy principles. Moreover, in human-centered datasets, sensitive attributes may remain latent. To tackle these challenges, we present a novel bias mitigation approach inspired by "Fairness without Demographics" in machine learning. The presented approach achieves fairness without needing knowledge of sensitive attributes by minimizing the top eigenvalue of the Hessian matrix during training, ensuring equitable loss landscapes across FL participants. Notably, we introduce a novel FL aggregation scheme that promotes participating models based on error rates and loss landscape curvature attributes, fostering fairness across the FL system. This work represents the first approach to attaining "Fairness without Demographics" in human-centered FL. Through comprehensive evaluation, our approach demonstrates effectiveness in balancing fairness and efficacy across various real-world applications, FL setups, and scenarios involving single and multiple bias-inducing factors, representing a significant advancement in human-centered FL.
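
The key quantity here, the top eigenvalue of the loss Hessian, can be estimated without materializing the Hessian by power iteration on Hessian-vector products; a toy numpy sketch on an explicit matrix (in practice the HVPs would come from autodiff on each client's loss):

```python
import numpy as np

def top_eigenvalue(hvp, dim, iters=100, seed=0):
    """Power iteration using only Hessian-vector products (hvp: v -> H @ v)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)
    return v @ hvp(v)                  # Rayleigh quotient at convergence

# Toy Hessian standing in for the curvature of one participant's loss landscape.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
print(top_eigenvalue(lambda v: H @ v, dim=2))   # ~3.62, matches np.linalg.eigvalsh(H)[-1]
```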

Replacements for Wed, 1 May 24

[12]  arXiv:2310.14821 (replaced) [pdf, other]
Title: Mysticeti: Reaching the Limits of Latency with Uncertified DAGs
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR)
[13]  arXiv:2311.11645 (replaced) [pdf, other]
Title: Exploding AI Power Use: an Opportunity to Rethink Grid Planning and Management
Comments: Accepted by ACM e-Energy '24: the 15th ACM International Conference on Future and Sustainable Energy Systems
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
[14]  arXiv:2404.09285 (replaced) [pdf, other]
Title: Egret: Reinforcement Mechanism for Sequential Computation Offloading in Edge Computing
Comments: Submitted to IEEE TSC
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
[15]  arXiv:2111.03683 (replaced) [pdf, ps, other]
Title: On Homomorphism Graphs
Subjects: Logic (math.LO); Distributed, Parallel, and Cluster Computing (cs.DC); Combinatorics (math.CO)
[16]  arXiv:2309.11924 (replaced) [pdf, other]
Title: Generic Selfish Mining MDP for DAG Protocols
Authors: Patrik Keller
Comments: v2: update authors, fix my implementation of Roi's model
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
