Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

Distributed SGD is limited by communication rather than computation, since each iteration requires an AllReduce across processes. We study mixed-precision communication-avoiding SGD (CA-SGD) for generalized linear models on NVIDIA GPUs, decomposing the local rounding error of one CA-SGD outer iteration into nine independent precision choices that depend on the hardware only through its low-precision unit roundoffs. On NERSC Perlmutter A100 GPUs, mixed-precision CA-SGD matches the FP32 SGD loss to within 0.5% and reaches speedups of 5.1x to 6.8x over FP32 SGD on the epsilon, SUSY, HIGGS, synth, and Poisson-synth datasets.

June 2026 · Aditya Devarakonda, Irene Simó Muñoz, Giulia Guidi

Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs

Clustering is an important tool in data analysis, with K-means being popular for its simplicity and versatility. However, it cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel matrix, making it computationally and memory intensive. Prior work has accelerated Kernel K-means by formulating it using sparse linear algebra primitives and implementing it on a single GPU. However, that approach cannot run on datasets with more than approximately 80,000 samples due to limited GPU memory. In this work, we address this issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems.

January 2026 · Julian Bellavita, Matthew Rubino, Nakul Iyer, Andrew Chang, Aditya Devarakonda, Flavio Vella, Giulia Guidi

DistShap: Scalable GNN Explanations with Distributed Shapley Values

We propose DistShap, a parallel algorithm that distributes Shapley value-based explanations of graph neural network predictions across multiple GPUs. DistShap samples subgraphs in a distributed setting, executes GNN inference in parallel across GPUs, and solves a distributed least squares problem to compute edge importance scores, scaling to GNN models with millions of features on up to 128 GPUs.

June 2025 · Selahattin Akkas, Aditya Devarakonda, Ariful Azad