Communication Avoidance

Mixed-precision CA-SGD outer iteration with per-kernel precision slots

Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

Distributed SGD is limited by communication rather than computation, since each iteration requires an AllReduce across processes. We study mixed-precision communication-avoiding SGD (CA-SGD) for generalized linear models on NVIDIA GPUs, decomposing the local rounding error of one CA-SGD outer iteration into nine independent precision choices that depend on the hardware only through its low-precision unit roundoffs. On NERSC Perlmutter A100 GPUs, mixed-precision CA-SGD matches FP32 SGD loss within 0.5% and reaches 5.1-6.8x speedup over FP32 SGD on the epsilon, SUSY, HIGGS, synth, and Poisson-synth datasets.

Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs

Clustering is an important tool in data analysis, with K-means being popular for its simplicity and versatility. However, it cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel matrix, making it computationally and memory intensive. Prior work has accelerated Kernel K-means by formulating it using sparse linear algebra primitives and implementing it on a single GPU. However, that approach cannot run on datasets with more than approximately 80,000 samples due to limited GPU memory. In this work, we address this issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems.

Communication-avoiding s-step dual coordinate descent

Scalable Dual Coordinate Descent for Kernel Methods

We develop scalable dual coordinate descent (DCD) and block dual coordinate descent (BDCD) methods for kernel support vector machines and kernel ridge regression. We derive s-step variants that reduce communication frequency by a tunable factor of s while computing the same solution in exact arithmetic, achieving strong scaling speedups of up to 9.8x over existing methods on up to 512 cores. This paper received the Outstanding Paper Award at HPC Asia 2025.

Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

This work generalizes 1D s-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) that attains a continuous performance trade-off between the two baseline algorithms. We present theoretical analysis of the convergence, computation, communication, and memory trade-offs, and a C++/MPI implementation that achieves speedups of up to 5.3x over s-step SGD and up to 121x over FedAvg on a Cray EX system.

Strong scaling comparison between SGD and CA-SGD

Avoiding Communication in Logistic Regression

This work introduces Communication-Avoiding SGD (CA-SGD) for distributed-memory logistic regression. CA-SGD reorganizes stochastic gradient computations to communicate every $s$ iterations instead of every iteration and achieves speedups of up to 4.97x over SGD on a high-performance InfiniBand cluster without altering convergence behavior or accuracy.

Avoiding Communication in Primal and Dual Block Coordinate Descent Methods

This work develops communication-avoiding variants of primal and dual block coordinate descent for regularized least-squares problems. The variants communicate every $s$ iterations instead of every iteration and attain strong-scaling speedups up to 6.1x on a Cray XC30 supercomputer.

Reducing Communication in Proximal Newton Methods for Sparse Least Squares Problems

This work proposes RC-SFISTA with iteration-overlapping and Hessian reuse for sparse least-squares problems. The method reduces latency costs by a factor of $k$ and demonstrates speedups up to 12x compared to ProxCoCoA on MPI and Spark implementations evaluated on 1 to 512 nodes.

Avoiding Communication in Proximal Methods for Convex Optimization Problems

This technical report studies communication-avoiding proximal methods for large-scale convex optimization problems. The methods use iteration overlap and Hessian reuse to reduce latency costs while preserving the bandwidth profile of the baseline proximal algorithms.