Research Foundations
Two compression schemes make residential-bandwidth training viable. Subspace Networks (SSN) reduce the activations that travel between pipeline stages. Async SPARTA (asynchronous sparse parameter averaging) reduces the parameter sync that keeps same-stage replicas in agreement. They operate on different axes of the communication graph: SSN on the pipeline-parallel axis, SPARTA on the data-parallel axis.
Subspace Networks
Paper
Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism
The dominant cost of pipeline parallelism is the bandwidth between adjacent stages. Forward activations flow downstream; activation gradients flow back along the same edges. Both have the size of the hidden state: batch × sequence length × hidden dimension. Uncompressed, this makes training over WAN bandwidth infeasible: a single sample's activation saturates a residential uplink for seconds, and the pipeline becomes communication-bound long before compute becomes the bottleneck.
Subspace Networks (SSN) integrate compression into the architecture itself. At every stage boundary, the hidden state is constrained to a low-rank subspace before crossing to the next stage, and the receiving stage reconstructs the full state from the subspace coefficients. The transformer block is modified so the forward-backward signal stays consistent through the projection: what the next stage receives, and what gradient flows back, is exactly what the model's bulk linear path expects. The compression is therefore lossless with respect to the backpropagated gradient signal, in contrast to lossy compression schemes that accumulate compression error across pipeline stages.
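The linear algebra behind "reconstructs the full state from the subspace coefficients" can be illustrated with a minimal sketch. It is not the paper's construction: the basis here is random rather than learned, the dimensions `D` and `K` are toy values, and the helper names (`orthonormal_basis`, `compress`, `reconstruct`) are hypothetical. It only shows that a hidden state already lying in a rank-`K` subspace survives the round trip exactly (up to floating-point error) while only `K` numbers cross the boundary.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(d, k, seed=0):
    """Gram-Schmidt on k random vectors of length d.

    Assumption: a random basis stands in for the learned shared subspace.
    """
    rng = random.Random(seed)
    basis = []
    while len(basis) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in basis:
            proj = dot(u, v)
            v = [vi - proj * ui for vi, ui in zip(v, u)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-8:  # skip (near-)dependent draws
            basis.append([vi / norm for vi in v])
    return basis

def compress(h, basis):
    """Sending stage: K subspace coefficients instead of D values."""
    return [dot(u, h) for u in basis]

def reconstruct(coeffs, basis):
    """Receiving stage: rebuild the D-dimensional state from the coefficients."""
    d = len(basis[0])
    return [sum(c * u[i] for c, u in zip(coeffs, basis)) for i in range(d)]

D, K = 64, 4  # toy sizes, not the model's
BASIS = orthonormal_basis(D, K)

# A hidden state constrained to the subspace, as the architecture enforces.
true_coeffs = [3.0, -2.0, 0.5, 1.25]
h = reconstruct(true_coeffs, BASIS)

sent = compress(h, BASIS)              # what crosses the stage boundary
h_received = reconstruct(sent, BASIS)  # what the next stage works with
round_trip_error = max(abs(a - b) for a, b in zip(h, h_received))
```

A state outside the subspace would not be recovered exactly by this projection, which is why the constraint has to be built into the architecture rather than applied as a post-hoc compression step.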
Two empirical observations make this work. First, the projection matrices of large pretrained transformers exhibit rank collapse: their effective rank is far below their nominal dimension, so constraining them to a shared learned low-rank subspace from the start costs little. Second, the residual stream is preserved at full rank where the architecture needs it; only the bulk linear path is run in the compressed subspace. The recursive structure of transformer blocks is reused so the same low-rank parameters can be shared across layers.
A concrete example: an 8.5B-parameter transformer with a 5K embedding dimension, 32 layers, FP32 activations, and a 4K sequence length sends a 671 Mbit activation per stage boundary at batch size 1. Uncompressed, that takes about 3.4 seconds on a 200 Mbps link, and the pipeline becomes communication-bound. SSN compresses the same activation to approximately 7 Mbit (up to 100× less), which keeps the boundary cost within typical step time.
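The figures in this example can be reproduced with a few lines of arithmetic. The sketch assumes "5K" means 5120, "4K" means 4096, and 1 Mbit is 10^6 bits; the function names are illustrative only.

```python
def activation_bits(batch, seq_len, hidden, bits_per_value):
    """Size of one activation tensor crossing a stage boundary, in bits."""
    return batch * seq_len * hidden * bits_per_value

def transfer_seconds(bits, link_bits_per_second):
    """Idealized transfer time: ignores latency and protocol overhead."""
    return bits / link_bits_per_second

LINK_BPS = 200e6          # 200 Mbps link from the text
COMPRESSED_BITS = 7e6     # approximately 7 Mbit after SSN, from the text

bits = activation_bits(batch=1, seq_len=4096, hidden=5120, bits_per_value=32)
uncompressed_s = transfer_seconds(bits, LINK_BPS)
compressed_s = transfer_seconds(COMPRESSED_BITS, LINK_BPS)
ratio = bits / COMPRESSED_BITS

print(f"uncompressed: {bits / 1e6:.0f} Mbit, {uncompressed_s:.2f} s")
print(f"compressed:   {COMPRESSED_BITS / 1e6:.0f} Mbit, {compressed_s * 1e3:.0f} ms")
print(f"reduction:    {ratio:.0f}x")
```

Under these assumptions the uncompressed activation is 671,088,640 bits, and the reduction comes out just under 100×, consistent with the "up to 100×" figure.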
Asynchronous Sparse Parameter Averaging
Workers in the same stage are data-parallel replicas: they each train on different microbatches and should converge to the same parameters. The standard mechanisms for maintaining this agreement are gradient AllReduce and parameter AllReduce. Agora replaces gradient AllReduce entirely and uses a sparse variant of parameter AllReduce.
The first standard mechanism Agora supersedes is gradient AllReduce: every worker computes a local gradient, the cluster averages those gradients, and each worker applies the same update. This is the standard data-parallel training step (e.g. DDP). On a WAN link the all-reduce dominates step time, and one slow peer delays the rest of the stage. Async SPARTA forgoes gradient AllReduce: each worker computes its own gradient and runs its own local optimizer step. Replicas diverge as a result and require a separate mechanism to re-synchronize.
The second standard mechanism Agora supersedes is parameter AllReduce: periodically average the parameters themselves to remove the divergence. A full-parameter AllReduce at every averaging round would still dominate the network link, so Async SPARTA averages only a sparse subset of parameters per round. An averaging round runs once every 20 local steps and averages 5% of the parameter set across the matched group, with successive rounds covering non-overlapping slices until the full parameter set has been averaged. Communication still scales with parameter count (each round transfers 5% × parameter_count × 4 bytes), but the per-round volume fits within a residential uplink, and rounds run alongside ongoing local training so most of the averaging cost overlaps with compute.
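The round schedule can be sketched as a small simulation. Several things here are assumptions rather than Agora's implementation: slices are taken as contiguous index ranges (the text only says they do not overlap), replicas are plain Python lists, averaging runs synchronously inside the loop instead of overlapping with compute, and the names `slice_for_round`, `average_slice`, `bytes_per_round`, and `train` are hypothetical.

```python
import math

SYNC_EVERY = 20        # local steps between averaging rounds (from the text)
ROUNDS_PER_CYCLE = 20  # 5% per round, so 20 rounds cover every parameter
BYTES_PER_PARAM = 4    # from the text

def slice_for_round(round_idx, num_params):
    """Contiguous, non-overlapping slice for this round; wraps after a full cycle."""
    size = math.ceil(num_params / ROUNDS_PER_CYCLE)
    start = (round_idx % ROUNDS_PER_CYCLE) * size
    return start, min(start + size, num_params)

def average_slice(replicas, start, stop):
    """Replace the slice on every replica with the element-wise mean."""
    for i in range(start, stop):
        mean = sum(params[i] for params in replicas) / len(replicas)
        for params in replicas:
            params[i] = mean

def bytes_per_round(num_params):
    """5% x parameter_count x 4 bytes, in integer arithmetic."""
    return num_params * BYTES_PER_PARAM // ROUNDS_PER_CYCLE

def train(replicas, total_steps, local_step):
    """Each replica takes its own local step; every SYNC_EVERY steps one slice is averaged."""
    round_idx = 0
    for step in range(1, total_steps + 1):
        for rank, params in enumerate(replicas):
            local_step(rank, params)  # no gradient all-reduce
        if step % SYNC_EVERY == 0:
            start, stop = slice_for_round(round_idx, len(replicas[0]))
            average_slice(replicas, start, stop)
            round_idx += 1
    return round_idx

# Three divergent replicas of a 100-parameter toy model; the local step is a no-op
# so that the effect of averaging alone is visible.
replicas = [[float(rank)] * 100 for rank in range(3)]
rounds = train(replicas, total_steps=400, local_step=lambda rank, params: None)
```

With 5% per round and one round per 20 steps, a full pass over the parameters takes 20 rounds, or 400 local steps; in this toy run all three replicas agree on every parameter after that pass. For a hypothetical 1-billion-parameter stage, `bytes_per_round` gives 200 MB per round.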
Same-stage workers stay in close agreement without the training loop ever stopping for a synchronous all-reduce.