Stochastic Gradient Descent

Mixed-precision CA-SGD outer iteration with per-kernel precision slots

Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

Distributed SGD is limited by communication rather than computation, since each iteration requires an AllReduce across processes. We study mixed-precision communication-avoiding SGD (CA-SGD) for generalized linear models on NVIDIA GPUs, decomposing the local rounding error of one CA-SGD outer iteration into nine independent precision choices that depend on the hardware only through its low-precision unit roundoffs. On NERSC Perlmutter A100 GPUs, mixed-precision CA-SGD matches the loss of FP32 SGD to within 0.5% and reaches speedups of 5.1x to 6.8x over FP32 SGD on the epsilon, SUSY, HIGGS, synth, and Poisson-synth datasets.
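As a rough illustration of what a unit roundoff is, the sketch below computes it for IEEE FP16 and FP32 with NumPy and checks that rounding values to FP16 stays within that bound. The helper name `unit_roundoff` and the test data are assumptions made for this example; it is not the paper's error analysis, only the quantity such an analysis would be expressed in.

```python
import numpy as np

def unit_roundoff(dtype):
    # Under round-to-nearest, the unit roundoff u is half the machine epsilon.
    return float(np.finfo(dtype).eps) / 2.0

u16 = unit_roundoff(np.float16)  # 2**-11 for IEEE half precision
u32 = unit_roundoff(np.float32)  # 2**-24 for IEEE single precision

# Round values in [1, 2) to FP16 and measure the relative error in FP64.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 2.0, size=1000)
x_fp16 = x.astype(np.float16).astype(np.float64)
max_rel_err = float(np.max(np.abs(x_fp16 - x) / np.abs(x)))

print(u16, u32, max_rel_err)
```

The measured relative error should not exceed `u16`, which is the sense in which a rounding-error bound can depend on the hardware only through such constants.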

Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

This work generalizes 1D s-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) that attains a continuous performance trade-off between the two baseline algorithms. We present theoretical analysis of the convergence, computation, communication, and memory trade-offs, and a C++/MPI implementation that achieves speedups of up to 5.3x over s-step SGD and up to 121x over FedAvg on a Cray EX system.

Strong scaling comparison between SGD and CA-SGD

Avoiding Communication in Logistic Regression

This work introduces Communication-Avoiding SGD (CA-SGD) for distributed-memory logistic regression. CA-SGD reorganizes stochastic gradient computations to communicate every $s$ iterations instead of every iteration and achieves speedups of up to 4.97x over SGD on a high-performance InfiniBand cluster without altering convergence behavior or accuracy.
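The idea of communicating every $s$ iterations can be sketched on a single process as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: labels are taken to be in {0, 1}, batches are consecutive row blocks of `X`, and the function names `sgd` and `s_step_sgd` are hypothetical. The products `Y @ Y.T` and `Y @ w` mark the places where a distributed version would need a reduction, once per $s$ iterations instead of once per iteration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(X, y, w0, eta, b, iters):
    # Reference mini-batch SGD for logistic regression (labels in {0, 1}).
    w = w0.copy()
    for k in range(iters):
        rows = slice(k * b, (k + 1) * b)
        r = sigmoid(X[rows] @ w) - y[rows]
        w -= (eta / b) * (X[rows].T @ r)
    return w

def s_step_sgd(X, y, w0, eta, b, iters, s):
    # Same iterates as sgd(), but w is touched only once per s mini-batches.
    assert iters % s == 0, "sketch assumes iters is a multiple of s"
    w = w0.copy()
    for k in range(0, iters, s):
        Y = X[k * b:(k + s) * b]   # the next s mini-batches, stacked
        t = y[k * b:(k + s) * b]
        G = Y @ Y.T                # Gram matrix (would need one reduction)
        v = Y @ w                  # batch products with the current w
        r = np.zeros(s * b)
        for j in range(s):
            cur = slice(j * b, (j + 1) * b)
            # Recover X_j @ w_j from G and earlier residuals, without forming w_j.
            z = v[cur] - (eta / b) * (G[cur, :j * b] @ r[:j * b])
            r[cur] = sigmoid(z) - t[cur]
        w -= (eta / b) * (Y.T @ r)  # apply all s deferred updates at once
    return w

rng = np.random.default_rng(0)
b, iters, s, d = 2, 12, 4, 5
X = rng.standard_normal((b * iters, d))
y = rng.integers(0, 2, size=b * iters).astype(float)
w0 = np.zeros(d)
w_ref = sgd(X, y, w0, 0.1, b, iters)
w_ca = s_step_sgd(X, y, w0, 0.1, b, iters, s)
print(np.allclose(w_ref, w_ca))
```

In exact arithmetic the two routines produce the same iterates, so any difference in the output is due to floating-point rounding. The trade-off is the extra `s*b` by `s*b` Gram matrix computed in each outer step.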

CIFAR-100 speedup and test-error comparison for adaptive and fixed batch sizes

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

AdaBatch adaptively increases the batch size during training to preserve the convergence behavior of small batches while improving computational efficiency. The method is evaluated with AlexNet, ResNet, and VGG on CIFAR-10, CIFAR-100, and ImageNet and improves performance by up to 6.25x on 4 NVIDIA Tesla P100 GPUs while changing accuracy by less than 1% relative to fixed batch sizes.
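One simple way to grow the batch size during training is a step schedule. The sketch below is an assumption for illustration only: the summary above does not state the schedule, and the function name, the doubling factor, the interval, and the cap are all hypothetical values.

```python
def batch_size_schedule(epoch, base_batch=128, interval=20, factor=2, max_batch=2048):
    # Hypothetical schedule: multiply the batch size by `factor` every
    # `interval` epochs, capped at `max_batch`.
    return min(base_batch * factor ** (epoch // interval), max_batch)

# Batch sizes at a few epochs under the assumed defaults.
sizes = [batch_size_schedule(e) for e in (0, 19, 20, 40, 200)]
print(sizes)
```

Starting small keeps the early updates noisy and frequent, while the later, larger batches give each device more work per step.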