Distributed-Memory

Parallel rank-adaptive HOOI for Tucker decomposition

Parallel Rank-Adaptive Higher Order Orthogonal Iteration

Higher Order Orthogonal Iteration (HOOI) is an iterative algorithm that computes a Tucker decomposition of an input tensor. We present distributed-memory parallel, rank-adaptive variants of HOOI that determine the core tensor ranks automatically rather than requiring them as fixed inputs. The variants use efficient parallel tensor-times-matrix (TTM) and SVD kernels to scale Tucker decomposition to large tensors.
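As a rough illustration of the two kernels named above, the following is a minimal sequential sketch of plain fixed-rank HOOI in NumPy. It is not the paper's implementation: the rank-adaptive logic and the distributed data layout are omitted, and the helper names (`unfold`, `ttm`, `hooi`, `reconstruct`) are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def ttm(tensor, matrix, mode):
    """Tensor-times-matrix (TTM): multiply `matrix` into `tensor` along `mode`."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))  # new axis lands in front
    return np.moveaxis(out, 0, mode)

def leading_left_singular_vectors(matrix, rank):
    u, _, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank]

def hooi(tensor, ranks, n_iters=10):
    """Fixed-rank HOOI; the ranks are inputs here (no rank adaptation)."""
    # Initialize each factor from the SVD of the corresponding unfolding.
    factors = [leading_left_singular_vectors(unfold(tensor, n), r)
               for n, r in enumerate(ranks)]
    for _ in range(n_iters):
        for n in range(tensor.ndim):
            # Compress every mode except n, then refresh factor n by an SVD.
            y = tensor
            for m in range(tensor.ndim):
                if m != n:
                    y = ttm(y, factors[m].T, m)
            factors[n] = leading_left_singular_vectors(unfold(y, n), ranks[n])
    core = tensor
    for m in range(tensor.ndim):
        core = ttm(core, factors[m].T, m)
    return core, factors

def reconstruct(core, factors):
    out = core
    for m, u in enumerate(factors):
        out = ttm(out, u, m)
    return out

# Usage on a synthetic tensor with exact multilinear rank (2, 3, 2).
rng = np.random.default_rng(0)
true_core = rng.standard_normal((2, 3, 2))
true_factors = [rng.standard_normal((dim, r))
                for dim, r in zip((6, 5, 4), (2, 3, 2))]
tensor = reconstruct(true_core, true_factors)
core, factors = hooi(tensor, ranks=(2, 3, 2))
rel_error = (np.linalg.norm(reconstruct(core, factors) - tensor)
             / np.linalg.norm(tensor))
```

In this sketch every sweep applies TTMs in all modes but one and then takes an SVD of the unfolded result, which is why those two kernels dominate the cost.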

DistShap distributed Shapley value explanation pipeline

DistShap: Scalable GNN Explanations with Distributed Shapley Values

We propose DistShap, a parallel algorithm that distributes Shapley value-based explanations of graph neural network predictions across multiple GPUs. DistShap samples subgraphs in a distributed setting, executes GNN inference in parallel across GPUs, and solves a distributed least squares problem to compute edge importance scores, scaling to GNN models with millions of features on up to 128 GPUs.

Communication-avoiding s-step dual coordinate descent

Scalable Dual Coordinate Descent for Kernel Methods

We develop scalable dual coordinate descent (DCD) and block dual coordinate descent (BDCD) methods for kernel support vector machines and kernel ridge regression. We derive $s$-step variants that reduce communication frequency by a tunable factor of $s$ while computing the same solution in exact arithmetic, achieving strong-scaling speedups of up to 9.8x over existing methods on up to 512 cores. This paper received the Outstanding Paper Award at HPC Asia 2025.

Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

This work generalizes 1D s-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) that attains a continuous performance trade-off between the two baseline algorithms. We present theoretical analysis of the convergence, computation, communication, and memory trade-offs, and a C++/MPI implementation that achieves speedups of up to 5.3x over s-step SGD and up to 121x over FedAvg on a Cray EX system.
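The FedAvg baseline can be pictured with a small sequential simulation. This is a sketch under assumptions, not the C++/MPI implementation: it covers only the model-averaging dimension, it assumes a logistic-regression loss with 0/1 labels and a cyclic sample order, and the names `local_sgd` and `fedavg` are hypothetical. In a 2D method, the local update inside each team would be the place where an s-step SGD solver is used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_sgd(w, X, y, lr, steps):
    """Run `steps` single-sample SGD updates on the logistic loss, cycling over rows."""
    w = w.copy()
    for k in range(steps):
        i = k % X.shape[0]
        w -= lr * (sigmoid(X[i] @ w) - y[i]) * X[i]
    return w

def fedavg(X, y, n_teams, rounds, local_steps, lr):
    """Simulate FedAvg: each team updates a model copy on its own rows, then
    the copies are averaged (the step that would be an all-reduce)."""
    row_blocks = np.array_split(np.arange(X.shape[0]), n_teams)
    w = np.zeros(X.shape[1])
    for _ in range(rounds):
        local_models = [local_sgd(w, X[rows], y[rows], lr, local_steps)
                        for rows in row_blocks]
        w = np.mean(local_models, axis=0)
    return w

# Usage on synthetic data (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((32, 5))
y = (rng.random(32) < 0.5).astype(float)
w_four_teams = fedavg(X, y, n_teams=4, rounds=3, local_steps=8, lr=0.1)
w_one_team = fedavg(X, y, n_teams=1, rounds=3, local_steps=32, lr=0.1)
w_plain_sgd = local_sgd(np.zeros(5), X, y, lr=0.1, steps=96)
```

With a single team and one full pass per round, the simulation reduces to plain SGD. More teams trade fewer synchronizations against averaging models trained on different row blocks.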

Strong scaling comparison between SGD and CA-SGD

Avoiding Communication in Logistic Regression

This work introduces Communication-Avoiding SGD (CA-SGD) for distributed-memory logistic regression. CA-SGD reorganizes stochastic gradient computations so that processors communicate every $s$ iterations instead of every iteration. It achieves speedups of up to 4.97x over SGD on a high-performance InfiniBand cluster without altering convergence behavior or accuracy.
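The reorganization can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than the paper's code: it uses single-sample updates, 0/1 labels, and a fixed step size, and the function names are hypothetical. For each block of $s$ sampled rows it forms a small Gram matrix and one matrix-vector product up front, then recovers every inner product $x_j^T w_j$ from those quantities alone.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(w, X, y, lr, sample_ids):
    """Reference: one single-sample logistic-regression update per sampled row."""
    w = w.copy()
    for i in sample_ids:
        w -= lr * (sigmoid(X[i] @ w) - y[i]) * X[i]
    return w

def s_step_sgd(w, X, y, lr, sample_ids, s):
    """Same iterates in exact arithmetic, but the products that involve the
    full feature dimension are formed once per block of s samples."""
    w = w.copy()
    for start in range(0, len(sample_ids), s):
        block = sample_ids[start:start + s]
        Xb = X[block]
        # Assumed layout: features are partitioned across ranks, so these two
        # products are the only quantities that would need a reduction.
        gram = Xb @ Xb.T
        z0 = Xb @ w
        resid = np.zeros(len(block))
        for j in range(len(block)):
            # x_j^T w_j = x_j^T w_0 - lr * sum_{t<j} resid_t * (x_j^T x_t)
            z = z0[j] - lr * (gram[j, :j] @ resid[:j])
            resid[j] = sigmoid(z) - y[block[j]]
        # Apply the s deferred updates at once.
        w -= lr * (Xb.T @ resid)
    return w

# Usage on synthetic data (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
y = (rng.random(40) < 0.5).astype(float)
sample_ids = rng.integers(0, 40, size=20)
w_ref = sgd(np.zeros(6), X, y, 0.1, sample_ids)
w_s5 = s_step_sgd(np.zeros(6), X, y, 0.1, sample_ids, s=5)
```

The two results agree up to floating-point rounding. The price of communicating less often is the extra $s \times s$ Gram matrix computed per block.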

Avoiding Communication in Primal and Dual Block Coordinate Descent Methods

This work develops communication-avoiding variants of primal and dual block coordinate descent for regularized least-squares problems. The variants communicate every $s$ iterations instead of every iteration and attain strong-scaling speedups of up to 6.1x on a Cray XC30 supercomputer.

Reducing Communication in Proximal Newton Methods for Sparse Least Squares Problems

This work proposes RC-SFISTA, which uses iteration overlapping and Hessian reuse to solve sparse least-squares problems. The method reduces latency costs by a factor of $k$ and demonstrates speedups of up to 12x over ProxCoCoA, with MPI and Spark implementations evaluated on 1 to 512 nodes.

Strong scaling and speedups for SA-accCD

Avoiding Synchronization in First-Order Methods for Sparse Convex Optimization

This work extends communication-avoiding Krylov subspace techniques to first-order block coordinate descent methods for support vector machines and proximal least-squares problems. The synchronization-avoiding variants reduce latency by a tunable factor of $s$ and attain speedups of up to 5.1x on a Cray XC30 supercomputer.

Avoiding Communication in Proximal Methods for Convex Optimization Problems

This technical report studies communication-avoiding proximal methods for large-scale convex optimization problems. The methods use iteration overlap and Hessian reuse to reduce latency costs while preserving the bandwidth profile of the baseline proximal algorithms.