Training Architecture¶
An Agora training run is composed of three layers. Discovery runs a stateless DHT for peer-to-peer addressing. Compute is the pipeline itself: workers, each holding one stage's parameters, running forward and backward passes. Coordination is a small set of CPU-only trainers that route microbatches through the pipeline and supply dataset shards. The full system runs on heterogeneous, untrusted hardware, and no single contributor ever holds the complete model weights.
[Figure: architecture diagram; labels include "Load balancing" and "Data management".] Activation gradients (grad_input) flow Tail → Body → Head. Parameter gradients never cross stage boundaries.
Discovery Layer¶
A seed is a stateless DHT bootstrap. Workers query a seed for an entry point into the swarm and then communicate with other peers directly; the seed never stores model data of its own. Pluralis runs two seeds for redundancy: if Seed 0 is unreachable, Seed 1 serves the same role, and a new worker can bootstrap from either one.
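The seed fallback described above can be sketched as a loop over the seed list. This is a minimal illustration, not Agora's bootstrap code: the `bootstrap` and `fake_connect` functions and the `seed0.example` / `seed1.example` addresses are hypothetical stand-ins for a real DHT bootstrap call.

```python
from typing import Callable, List, Sequence


def bootstrap(seeds: Sequence[str], connect: Callable[[str], List[str]]) -> List[str]:
    """Return the initial peer list from the first seed that answers."""
    errors = []
    for seed in seeds:
        try:
            return connect(seed)
        except ConnectionError as exc:
            # This seed is unreachable; fall through to the next one.
            errors.append(f"{seed}: {exc}")
    raise ConnectionError("no seed reachable: " + "; ".join(errors))


def fake_connect(seed: str) -> List[str]:
    """Hypothetical stand-in for a DHT bootstrap call; Seed 0 is down."""
    if seed == "seed0.example":
        raise ConnectionError("unreachable")
    return ["peer-a", "peer-b"]


# Seed 0 fails, so the worker bootstraps from Seed 1 instead.
peers = bootstrap(["seed0.example", "seed1.example"], fake_connect)
```

After bootstrapping, the worker talks to the returned peers directly, so the seed is no longer on the data path.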
Compute Layer¶
A worker holds the model parameters and performs the compute. It owns one pipeline stage (Head, Body, or Tail) and runs forward and backward passes for batches the trainer routes to it. Workers within the same stage participate together in periodic, async SPARTA averaging rounds. The Workers section below covers the runtime structure.
Coordination Layer¶
A trainer holds no model parameters. Its role is to orchestrate the pipeline: route microbatches to a healthy worker in each stage, balance load across the workers in each stage, and supply dataset shards. Trainers are CPU-only and run on Pluralis-owned infrastructure rather than on contributor nodes; see the Trainers section for fault-tolerance and load-balancing details.
Component Deep Dive¶
Workers¶
A worker is a single process holding one stage's parameters (one or more transformer layers). It runs forward and backward passes on those parameters, runs its own local optimizer to apply gradients, and joins same-stage peers in periodic AllReduce rounds for state averaging. A worker knows only its own stage and nothing about the rest of the pipeline.
DHT¶
Agora uses Hivemind's Kademlia DHT for four functions: peer discovery, expert registration, progress tracking, and matchmaking for AllReduce.
ModuleBackend¶
The nn.Module for this stage and the forward and backward functions the Runtime invokes. Also owns the two task pools (forward and backward) where incoming trainer requests queue up before the Runtime processes them.
Async SPARTA¶
Each worker accumulates gradients from its own backward passes and runs its own optimizer step locally; there is no per-step gradient AllReduce. Same-stage replicas drift apart as a result. To re-synchronize, every 20 local steps the worker matches with same-stage peers over the DHT and AllReduces 5% of its parameters. Successive rounds cover non-overlapping slices, so the full parameter set is averaged once every 20 rounds (400 local steps).
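The schedule above (a sync every 20 local steps, 5% of the parameters per round, non-overlapping slices) can be sketched as index arithmetic over a flattened parameter vector. The text does not say how slices are chosen, so the contiguous index ranges here are an assumption, and `slice_for_round` and `should_sync` are hypothetical helper names.

```python
SYNC_EVERY = 20        # local optimizer steps between averaging rounds (from the text)
SLICE_FRACTION = 0.05  # share of parameters averaged per round (from the text)


def rounds_per_cycle(fraction=SLICE_FRACTION):
    """Number of rounds needed to cover every parameter once."""
    return round(1 / fraction)


def should_sync(local_step):
    """True when this local step should trigger an averaging round."""
    return local_step > 0 and local_step % SYNC_EVERY == 0


def slice_for_round(round_idx, num_params, fraction=SLICE_FRACTION):
    """Half-open index range [start, stop) averaged in the given round.

    Assumption: slices are contiguous ranges of the flattened parameters.
    """
    n = rounds_per_cycle(fraction)
    k = round_idx % n  # position inside the current cycle
    start = k * num_params // n
    stop = (k + 1) * num_params // n
    return start, stop


# With 1000 parameters, each round covers 50 of them and round 20 wraps around.
slices = [slice_for_round(r, 1000) for r in range(rounds_per_cycle())]
```

Integer division on both bounds keeps the slices gap-free even when the parameter count is not a multiple of 20.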
Connection Handlers¶
gRPC listeners that receive trainer requests and put each batch into the right queue: forward or backward. Multiple listeners share a single port.
DHTHandler¶
A background thread that keeps re-announcing this worker in the DHT under its stage-prefixed UID (head.0.0, body1.0.1, tail.0.0). Trainers read the announcements to find workers; same-stage peers use them for matchmaking during AllReduce.
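As a small illustration of how a reader of these announcements might group workers by stage, the sketch below splits each UID at its first dot. It assumes only what the examples show (a stage name followed by dotted fields); the meaning of the numeric fields is not stated here, so they are left uninterpreted, and `stage_of` and `group_by_stage` are hypothetical names.

```python
from collections import defaultdict


def stage_of(uid):
    """Return the stage prefix of a UID such as 'body1.0.1'."""
    stage, _, rest = uid.partition(".")
    if not stage or not rest:
        raise ValueError(f"not a stage-prefixed UID: {uid!r}")
    return stage


def group_by_stage(uids):
    """Map each stage prefix to the UIDs announced under it."""
    groups = defaultdict(list)
    for uid in uids:
        groups[stage_of(uid)].append(uid)
    return dict(groups)


announced = ["head.0.0", "body1.0.1", "tail.0.0", "body1.0.0"]
by_stage = group_by_stage(announced)
```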
Runtime¶
The main loop. Dequeues batches from the forward and backward queues and runs them through ModuleBackend. On the backward path it rebuilds the autograd graph by re-running the forward (see activation recomputation for details), calls torch.autograd.backward(), and triggers the optimizer step at the appropriate point in the batch-size accumulator.
Batch processing¶
Once running, the Worker runs an event loop processing batches from trainers:
- Trainer sends a forward request via gRPC → Connection Handler places it in the forward queue.
- Runtime dequeues the batch → calls ModuleBackend.forward() → returns the output.
- Trainer sends a backward request with gradient outputs → Connection Handler places it in the backward queue.
- Runtime dequeues the batch → calls ModuleBackend.backward() → triggers the optimizer step.
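The four steps above can be sketched with two in-process queues. This is a toy model rather than the worker's actual code: `ToyBackend`, `handle_request`, and `run_once` are hypothetical names, gRPC is replaced by a direct function call, the "model" is a single scalar weight, and the optimizer steps on every backward call instead of at an accumulation boundary.

```python
import queue


class ToyBackend:
    """Hypothetical stand-in for ModuleBackend: y = weight * x on scalars."""

    def __init__(self, weight=2.0, lr=0.1):
        self.weight = weight
        self.lr = lr

    def forward(self, x):
        return self.weight * x

    def backward(self, x, grad_output):
        grad_input = self.weight * grad_output  # returned to the trainer
        grad_weight = x * grad_output           # stays on this worker
        self.weight -= self.lr * grad_weight    # local optimizer step (plain SGD)
        return grad_input


forward_q = queue.Queue()
backward_q = queue.Queue()


def handle_request(kind, payload):
    """Connection Handler role: place the request in the matching queue."""
    {"forward": forward_q, "backward": backward_q}[kind].put(payload)


def run_once(backend):
    """Runtime role: drain both queues once and collect the replies."""
    replies = []
    while not forward_q.empty():
        x = forward_q.get()
        replies.append(("forward", backend.forward(x)))
    while not backward_q.empty():
        x, grad_output = backward_q.get()
        replies.append(("backward", backend.backward(x, grad_output)))
    return replies


backend = ToyBackend()
handle_request("forward", 3.0)
handle_request("backward", (3.0, 1.0))
replies = run_once(backend)
```

Only the forward output and `grad_input` appear in the replies; the weight gradient is consumed by the local optimizer and never leaves the backend.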
See also¶
- How a new Worker joins a running swarm (state download, queue, sync mode) → Contributor Join Flow.
- What happens at the optimizer step (ProgressTracker, Matchmaking, SPARTA AllReduce) → Communication Patterns.
- Sync-mode entry / exit conditions and Worker-failure handling → Fault Tolerance.
Trainers¶
The trainer's role looks like an ordinary PyTorch training loop: forward through the model, compute a loss, call backward. The difference is that none of those calls run locally. Every forward and backward is dispatched over the network to a worker holding the relevant stage. The trainer's responsibilities are to track the full pipeline topology, select a healthy worker for each stage, and route activations forward and activation gradients back. Parameter gradients themselves never leave a worker, and the trainer never holds parameters of its own.
Training flow¶
Startup¶
- Loads configuration and tokenizer config.
- Prepares the dataset from Hugging Face.
- Creates DHT connections using seed peers. Each model stage has its own dedicated DHT, enabling partitioning across stages.
- Using the DHT, the Trainer discovers workers in each stage; all stages must show at least one available Worker before training can start.
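The last startup condition, that every stage must show at least one available worker, amounts to a simple check over the discovery results. The sketch below assumes the trainer has already collected a mapping from stage name to discovered worker UIDs; `missing_stages` and `ready_to_train` are hypothetical helper names.

```python
def missing_stages(stage_workers):
    """Return the stages that currently have no available worker."""
    return [stage for stage, workers in stage_workers.items() if not workers]


def ready_to_train(stage_workers):
    """Training can start only when no stage is empty."""
    return not missing_stages(stage_workers)


# Example discovery result: body1 has not announced any worker yet.
discovered = {
    "head": ["head.0.0"],
    "body1": [],
    "tail": ["tail.0.0"],
}
blocked_on = missing_stages(discovered)
```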
Training loop¶
For each batch, the trainer iterates through the pipeline in order:
hidden = head.forward(input_ids[:, :-1])
hidden = body1.forward(hidden)
hidden = body2.forward(hidden)
loss = tail.forward(hidden, input_ids[:, 1:]) # shifted labels for LM
loss.backward() # triggers backward on all stages
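To make the data flow concrete, here is a toy version of one pipeline step in which each stage is a scalar multiply. It is only a sketch under stated assumptions: `Stage`, `TailStage`, and `train_step` are hypothetical names, the remote calls are plain method calls, and the loss is a squared error rather than a language-modeling loss. It shows that the trainer only ever handles activations and `grad_input`, while each parameter gradient stays inside its stage.

```python
class Stage:
    """Hypothetical remote stage computing y = w * x on scalars."""

    def __init__(self, w):
        self.w = w
        self.grad_w = 0.0
        self._x = None

    def forward(self, x):
        self._x = x  # kept so backward can compute the parameter gradient
        return self.w * x

    def backward(self, grad_output):
        self.grad_w += self._x * grad_output  # parameter gradient: stays local
        return self.w * grad_output           # grad_input: returned to the trainer


class TailStage(Stage):
    """Last stage: also computes the loss, here 0.5 * (y - target) ** 2."""

    def forward(self, x, target):
        self._x, self._target = x, target
        self._y = self.w * x
        return 0.5 * (self._y - target) ** 2

    def backward(self, grad_output=1.0):
        dloss_dy = (self._y - self._target) * grad_output
        self.grad_w += self._x * dloss_dy
        return self.w * dloss_dy


head, body, tail = Stage(2.0), Stage(3.0), TailStage(0.5)


def train_step(x, target):
    """Trainer role: route activations forward and grad_input back."""
    hidden = head.forward(x)
    hidden = body.forward(hidden)
    loss = tail.forward(hidden, target)
    grad = tail.backward()      # Tail -> Body
    grad = body.backward(grad)  # Body -> Head
    grad = head.backward(grad)
    return loss, grad


loss, grad_wrt_input = train_step(1.0, 2.0)
```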
Worker selection within a stage uses a min-heap keyed by accumulated virtual runtime: the least-loaded worker is selected. When a worker finishes a request, its runtime is credited with the task's estimated duration and the worker re-enters the heap; new arrivals enter at the end. A transient network error triggers a retry against the same worker or a different one depending on the error class, and an unreachable worker is short-banned from the heap for 30s. A pipeline step only stalls if a stage empties entirely: every worker in that stage unreachable or short-banned at once.
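The selection policy above can be sketched with Python's `heapq`. This is an illustration under assumptions, not the trainer's implementation: `StageScheduler` and its methods are hypothetical names, time is passed in explicitly so the example is deterministic, and "new arrivals enter at the end" is read as "a new worker starts at the largest virtual runtime currently in the heap".

```python
import heapq
import itertools

BAN_SECONDS = 30.0  # short-ban duration for an unreachable worker (from the text)


class StageScheduler:
    """Least-loaded worker selection for one pipeline stage."""

    def __init__(self):
        self._heap = []              # entries: (virtual_runtime, tie_breaker, worker_id)
        self._banned_until = {}      # worker_id -> time at which the ban expires
        self._tie = itertools.count()

    def add(self, worker_id):
        # Assumption: a new arrival queues behind every worker already present.
        start = max((rt for rt, _, _ in self._heap), default=0.0)
        heapq.heappush(self._heap, (start, next(self._tie), worker_id))

    def acquire(self, now):
        """Pop the least-loaded worker that is not banned, or None if none is left."""
        skipped, chosen = [], None
        while self._heap:
            entry = heapq.heappop(self._heap)
            if self._banned_until.get(entry[2], 0.0) > now:
                skipped.append(entry)  # still banned: keep it, but do not select it
                continue
            chosen = entry
            break
        for entry in skipped:
            heapq.heappush(self._heap, entry)
        return chosen

    def release(self, entry, estimated_duration):
        """Credit the finished task and put the worker back in the heap."""
        runtime, _, worker_id = entry
        heapq.heappush(
            self._heap, (runtime + estimated_duration, next(self._tie), worker_id)
        )

    def ban(self, entry, now):
        """Short-ban an unreachable worker without crediting any runtime."""
        self._banned_until[entry[2]] = now + BAN_SECONDS
        heapq.heappush(self._heap, entry)


sched = StageScheduler()
sched.add("a")
sched.add("b")
first = sched.acquire(now=0.0)    # "a": tied at zero runtime, added first
sched.release(first, 2.0)
second = sched.acquire(now=0.0)   # "b": now the least loaded
sched.ban(second, now=0.0)        # "b" turned out to be unreachable
third = sched.acquire(now=1.0)    # "a": "b" is banned until t = 30
sched.release(third, 2.0)
fourth = sched.acquire(now=31.0)  # "b": ban expired and runtime is still lowest
order = [first[2], second[2], third[2], fourth[2]]
```

When `acquire` returns `None`, every worker in the stage is banned or absent, which corresponds to the stalled pipeline step described above.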
Each pipeline stage gets its own DHT connection on the trainer, so worker discovery for stage head is independent of discovery for stage body3. The full protocol for how workers and trainers announce and refresh each other is in Communication Patterns → Periodic node announcement.