
Training Architecture

Three layers compose an Agora training run. Discovery runs a stateless DHT for peer-to-peer addressing. Compute is the pipeline itself: workers, each holding one stage's parameters, running forward and backward. Coordination is a small set of CPU-only trainers that route microbatches through the pipeline and shard the dataset across them. The full system runs on heterogeneous, untrusted hardware without any one contributor holding the complete model weights.

[Figure: Agora architecture diagram]

Discovery Layer: Seeds. DHT bootstrap nodes for peer registration and discovery (multiaddr exchange over libp2p / DHT).
  • Seed 0: primary DHT bootstrap. Publishes multiaddr for initial peer connections.
  • Seed 1: redundant bootstrap. Ensures discovery continues if Seed 0 is unreachable.

Compute Layer: Training Pipeline. Workers hold one pipeline stage's parameters, process fwd/bwd, and run periodic SPARTA state averaging within their stage. Each stage has two pipes (Pipe 0, Pipe 1).
  • Stage 0 (Head): H100 80G, 6 layers + embed
  • Stage 1 (Body 1): RTX 4090, 4 layers
  • Bodies 2–4: mixed hardware, 4 layers each
  • Stage 5 (Body 5): L40S 48G, 4 layers
  • Stage 6 (Tail): A100 40G, 6 layers + lm_head
  • Forward pass: activations, Head → Body → Tail
  • Backward pass: activation gradients, Tail → Body → Head (grad_input only)

Coordination Layer: Trainers. Microbatch coordination, load balancing, data management.
  • Trainer 0: CPU only
  • Trainer 1: CPU only
Three-zone Agora architecture: Discovery (Seeds), Compute (Workers), Coordination (Trainers). Heterogeneous example configuration; real swarms vary by participant hardware. Forward activations flow Head → Body → Tail; backward activation gradients (grad_input) flow Tail → Body → Head. Parameter gradients never cross stage boundaries.

Discovery Layer

A seed is a stateless DHT bootstrap. Workers query a seed for an entry point into the swarm and then communicate with other peers directly; the seed never stores model data of its own. Pluralis runs two seeds for redundancy: if Seed 0 is unreachable, Seed 1 serves the same role, and a new worker can bootstrap from either one.
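
The fallback between seeds can be illustrated with a minimal sketch. Everything here is hypothetical: the `bootstrap` helper, the `dial` callback, and the placeholder addresses are illustrative stand-ins, not Agora's or libp2p's actual API.

```python
from typing import Callable, Optional, Sequence


def bootstrap(seeds: Sequence[str], dial: Callable[[str], bool]) -> Optional[str]:
    """Return the first seed address that `dial` reports as reachable, else None."""
    for addr in seeds:
        try:
            if dial(addr):
                return addr
        except ConnectionError:
            # An unreachable seed is not fatal; fall through to the next one.
            continue
    return None


# Hypothetical placeholder addresses, not real Agora seeds.
SEEDS = ["/dns/seed0.example/tcp/4001", "/dns/seed1.example/tcp/4001"]


def fake_dial(addr: str) -> bool:
    # Simulate Seed 0 being down.
    if "seed0" in addr:
        raise ConnectionError("seed 0 unreachable")
    return True


chosen = bootstrap(SEEDS, fake_dial)
```

With Seed 0 simulated as down, `chosen` is the Seed 1 address; once a worker has an entry point, the seed plays no further role in this sketch.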

Compute Layer

A worker holds the model parameters and performs the compute. It owns one pipeline stage (Head, Body, or Tail) and runs forward and backward passes for batches the trainer routes to it. Workers within the same stage participate together in periodic, async SPARTA averaging rounds. The Workers section below covers the runtime structure.

Coordination Layer

A trainer holds no model parameters. Its role is to orchestrate the pipeline: route microbatches to a healthy worker in each stage, balance load across the workers in each stage, and supply dataset shards. Trainers are CPU-only and run on Pluralis-owned infrastructure rather than on contributor nodes; see the Trainers section for fault-tolerance and load-balancing details.


Component Deep Dive

Workers

A worker is a single process holding one stage's parameters (one or more transformer layers). It performs forward and backward on those parameters, runs its own local optimizer to apply gradients, and joins same-stage peers in periodic AllReduce rounds for state averaging. A worker has no knowledge of the rest of the pipeline: only its own stage.

[Figure: Worker internals, Stage X]

  • Connection Handlers: listen for trainer gRPC requests and place batches in the fwd / bwd queues. Multiplexed on the same port.
  • Runtime: loops over the fwd / bwd queues and dispatches batches into ModuleBackend for execution.
  • ModuleBackend: stores the nn.Module for this stage. Owns the forward / backward task pools.
  • DHTHandler: declares this Worker's availability in its stage (head.0.0, body1.0.1, …) for trainer + peer discovery.
  • SPARTA Optimizer: accumulates gradients locally, runs the local optimizer step, then matches with same-stage peers and AllReduces 5% of parameters.
  • DHT: Hivemind Kademlia DHT. Peer discovery, expert registration, progress tracking, matchmaking.
Worker internals: six co-resident components inside a single Worker process. Runtime drives ModuleBackend for compute; the SPARTA Optimizer coordinates the parameter-averaging step with same-stage peers via the shared DHT.

DHT

Agora uses Hivemind's Kademlia DHT for four functions: peer discovery, expert registration, progress tracking, and matchmaking for AllReduce.

ModuleBackend

The nn.Module for this stage and the forward and backward functions the Runtime invokes. Also owns the two task pools (forward and backward) where incoming trainer requests queue up before the Runtime processes them.

Async SPARTA

Each worker accumulates gradients from its own backward passes and runs its own optimizer step locally; there is no per-step gradient AllReduce. Same-stage replicas drift apart as a result. To re-synchronize, every 20 local steps the worker matches with same-stage peers over the DHT and AllReduces 5% of its parameters. Successive rounds cover non-overlapping slices, so the full parameter set is covered once every 20 rounds (20 × 5% = 100%).
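
The schedule described above can be sketched in a few lines. This is an illustration under stated assumptions: the text does not say how slices are chosen, so the sketch assumes a contiguous rotation over a flattened parameter vector, and the function names `sparta_slice` and `is_averaging_step` are hypothetical.

```python
def sparta_slice(round_index: int, num_params: int, fraction: float = 0.05) -> range:
    """Indices averaged in a given round (assumed: contiguous rotating slices)."""
    num_slices = round(1 / fraction)              # 20 slices at 5%
    k = round_index % num_slices                  # wrap around after a full cycle
    start = k * num_params // num_slices
    stop = (k + 1) * num_params // num_slices
    return range(start, stop)


def is_averaging_step(local_step: int, interval: int = 20) -> bool:
    """True on every `interval`-th local optimizer step."""
    return local_step > 0 and local_step % interval == 0


NUM_PARAMS = 1000
covered = set()
total = 0
for r in range(20):
    idx = sparta_slice(r, NUM_PARAMS)
    covered.update(idx)
    total += len(idx)
```

After 20 rounds every index has been visited exactly once, which is the non-overlapping coverage the text describes. Since a round happens every 20 local steps, a full cycle under these assumptions spans 400 local steps.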

Connection Handlers

gRPC listeners that receive trainer requests and put each batch into the right queue: forward or backward. Multiple listeners share a single port.

DHTHandler

A background thread that keeps re-announcing this worker in the DHT under its stage-prefixed UID (head.0.0, body1.0.1, tail.0.0). Trainers read the announcements to find workers; same-stage peers use them for matchmaking during AllReduce.
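
Why re-announcement matters can be shown with a toy model. The sketch below assumes announcements expire after a time-to-live, so a worker that stops refreshing drops out of discovery; `ToyDHT`, its methods, and the 30-second TTL are illustrative assumptions, not Hivemind's API or Agora's actual settings.

```python
from typing import Dict, List


class ToyDHT:
    """In-memory stand-in for a DHT: maps UID -> expiration timestamp."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}

    def announce(self, uid: str, now: float, ttl: float) -> None:
        # Re-announcing simply pushes the expiration forward.
        self._expiry[uid] = now + ttl

    def alive(self, stage_prefix: str, now: float) -> List[str]:
        # Match "body1." rather than "body1" so that "body10" is not included.
        return sorted(
            uid for uid, exp in self._expiry.items()
            if uid.startswith(stage_prefix + ".") and exp > now
        )


dht = ToyDHT()
dht.announce("head.0.0", now=0.0, ttl=30.0)
dht.announce("body1.0.1", now=0.0, ttl=30.0)
dht.announce("head.0.0", now=20.0, ttl=30.0)   # refreshed; body1.0.1 is not

alive_head = dht.alive("head", now=40.0)    # refreshed entry is still valid
alive_body = dht.alive("body1", now=40.0)   # entry lapsed at t=30
```

In this model a trainer looking up `head` at t=40 still finds `head.0.0`, while `body1.0.1`, which was never refreshed, is no longer visible.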

Runtime

The main loop. Dequeues batches from the forward and backward queues and runs them through ModuleBackend. On the backward path it rebuilds the autograd graph by re-running the forward (see activation recomputation for details), calls torch.autograd.backward(), and triggers the optimizer step once the batch-size accumulator has collected enough samples.


Batch processing

Once started, the Worker processes batches from trainers in an event loop:

  1. Trainer sends a forward request via gRPC → Connection Handler places it in the forward queue.
  2. Runtime dequeues the batch → calls ModuleBackend.forward() → returns the output.
  3. Trainer sends a backward request with gradient outputs → Connection Handler places it in the backward queue.
  4. Runtime dequeues the batch → calls ModuleBackend.backward() → triggers the optimizer step.
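
The four steps above can be sketched with two standard-library queues. This is a simplified, single-threaded illustration: `ToyWorker` and its methods are hypothetical, the forward and backward functions are toy callables rather than an nn.Module, and the optimizer is counted on every backward instead of going through a batch-size accumulator.

```python
import queue
from typing import Any, Callable, List, Tuple


class ToyWorker:
    def __init__(self, forward: Callable[[Any], Any], backward: Callable[[Any], Any]) -> None:
        self.fwd_q: "queue.Queue[Any]" = queue.Queue()
        self.bwd_q: "queue.Queue[Any]" = queue.Queue()
        self._fns = {"fwd": forward, "bwd": backward}
        self.optimizer_steps = 0

    def handle_request(self, kind: str, batch: Any) -> None:
        """Connection Handler: route an incoming request to the matching queue."""
        if kind not in self._fns:
            raise ValueError(f"unknown request kind: {kind}")
        (self.fwd_q if kind == "fwd" else self.bwd_q).put(batch)

    def run_pending(self) -> List[Tuple[str, Any]]:
        """Runtime: drain both queues once; a real loop would block and repeat."""
        results = []
        for kind, q in (("fwd", self.fwd_q), ("bwd", self.bwd_q)):
            while True:
                try:
                    batch = q.get_nowait()
                except queue.Empty:
                    break
                results.append((kind, self._fns[kind](batch)))
                if kind == "bwd":
                    self.optimizer_steps += 1   # simplification: one step per backward
        return results


worker = ToyWorker(forward=lambda x: 2 * x, backward=lambda g: 2 * g)
worker.handle_request("fwd", 3)      # step 1
worker.handle_request("bwd", 1.5)    # step 3
results = worker.run_pending()       # steps 2 and 4
```

Separating the handler from the runtime means network I/O never blocks on compute: requests only ever touch the queues, and the runtime decides when each batch executes.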

See also

  • How a new Worker joins a running swarm (state download, queue, sync mode) → Contributor Join Flow.
  • What happens at the optimizer step (ProgressTracker, Matchmaking, SPARTA AllReduce) → Communication Patterns.
  • Sync-mode entry / exit conditions and Worker-failure handling → Fault Tolerance.

Trainers

The trainer's role looks like an ordinary PyTorch training loop: forward through the model, compute a loss, call backward. The difference is that none of those calls run locally. Every forward and backward is dispatched over the network to a worker holding the relevant stage. The trainer's responsibilities are to track the full pipeline topology, select a healthy worker for each stage, and route activations forward and activation gradients back. Parameter gradients themselves never leave a worker, and the trainer never holds parameters of its own.

[Figure: Trainer (CPU, holds no parameters) connected over libp2p gRPC (forward / backward) to four stages, each with Pipe 0 / Pipe 1: Stage 0 Head (embed, early layers); Stage 1 Body 1 (transformer block); Stage 2 Body 2 (transformer block); Stage 3 Tail (final layers, loss). Forward: activations, Head → Body → Tail. Backward: grad_input, Tail → Body → Head.]
The Trainer (CPU, no parameters) orchestrates remote workers across pipeline stages over libp2p gRPC. Forward activations flow Head → Body → Tail; backward activation gradients flow Tail → Body → Head. Two pipes per stage give data-parallel redundancy.
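
The split between activation gradients (which travel) and parameter gradients (which stay put) can be made concrete with a toy scalar pipeline. This is a sketch under simplifying assumptions: each `ToyStage` computes y = w · x, calls are local rather than remote, and the loss is computed in the driver function for brevity, whereas in Agora the Tail computes it. All names are hypothetical.

```python
class ToyStage:
    """A scalar 'stage' computing y = w * x; stands in for a remote worker."""

    def __init__(self, w: float) -> None:
        self.w = w
        self.grad_w = 0.0      # parameter gradient: never leaves the stage
        self._x = 0.0

    def forward(self, x: float) -> float:
        self._x = x            # cached here; a real worker recomputes it instead
        return self.w * x

    def backward(self, grad_out: float) -> float:
        self.grad_w += grad_out * self._x   # dL/dw, accumulated locally
        return grad_out * self.w            # dL/dx (grad_input), sent upstream


def train_step(stages, x: float, target: float) -> float:
    h = x
    for s in stages:                 # Head -> Body -> Tail
        h = s.forward(h)
    loss = 0.5 * (h - target) ** 2
    grad = h - target                # dL/dh at the pipeline output
    for s in reversed(stages):       # Tail -> Body -> Head
        grad = s.backward(grad)
    return loss


stages = [ToyStage(2.0), ToyStage(3.0), ToyStage(0.5)]
loss = train_step(stages, x=1.0, target=2.0)
grads = [s.grad_w for s in stages]
```

The driver only ever sees activations and `grad_input` values; each stage's `grad_w` is computed and kept where the weight lives, which mirrors the trainer's position in the text.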

Training flow

Startup
  1. Loads configuration and tokenizer config.
  2. Prepares the dataset from Hugging Face.
  3. Creates DHT connections using seed peers. Each model stage has its own dedicated DHT, enabling partitioning across stages.
  4. Discovers workers in each stage through the DHT; every stage must show at least one available Worker before training can start.
Training loop

For each batch, the trainer iterates through the pipeline in order:

hidden = head.forward(input_ids[:, :-1])
hidden = body1.forward(hidden)
hidden = body2.forward(hidden)
loss   = tail.forward(hidden, input_ids[:, 1:])   # shifted labels for LM
loss.backward()                                   # triggers backward on all stages

Worker selection within a stage uses a min-heap keyed by accumulated virtual runtime: the least-loaded worker is selected. When a worker finishes a request, its runtime is credited with the task's estimated duration and the worker re-enters the heap; new arrivals enter at the end. A transient network error triggers a retry against the same worker or a different one depending on the error class, and an unreachable worker is short-banned from the heap for 30s. A pipeline step only stalls if a stage empties entirely: every worker in that stage unreachable or short-banned at once.
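
A minimal sketch of this selection policy, using the standard-library `heapq`. The class and method names are hypothetical, retries are omitted, and "new arrivals enter at the end" is interpreted here as starting a newcomer at the largest virtual runtime currently in the heap; the 30-second ban is the only constant taken from the text.

```python
import heapq
from typing import Dict, List, Optional, Tuple

BAN_SECONDS = 30.0   # short-ban duration stated in the text

Entry = Tuple[float, str]   # (virtual runtime, worker UID)


class StageBalancer:
    def __init__(self) -> None:
        self._heap: List[Entry] = []
        self._banned_until: Dict[str, float] = {}

    def add(self, worker: str) -> None:
        # Assumed reading of "enter at the end": start at the current maximum.
        vruntime = max((v for v, _ in self._heap), default=0.0)
        heapq.heappush(self._heap, (vruntime, worker))

    def acquire(self, now: float) -> Optional[Entry]:
        """Pop the least-loaded worker that is not banned; None if the stage is empty."""
        skipped, chosen = [], None
        while self._heap:
            entry = heapq.heappop(self._heap)
            if self._banned_until.get(entry[1], 0.0) > now:
                skipped.append(entry)
                continue
            chosen = entry
            break
        for entry in skipped:            # banned workers stay queued for later
            heapq.heappush(self._heap, entry)
        return chosen

    def release(self, entry: Entry, estimated_duration: float) -> None:
        vruntime, worker = entry
        heapq.heappush(self._heap, (vruntime + estimated_duration, worker))

    def ban(self, entry: Entry, now: float) -> None:
        self._banned_until[entry[1]] = now + BAN_SECONDS
        heapq.heappush(self._heap, entry)


lb = StageBalancer()
lb.add("head.0.0")
lb.add("head.0.1")
first = lb.acquire(now=0.0)
lb.release(first, estimated_duration=2.0)
second = lb.acquire(now=1.0)
lb.ban(second, now=1.0)                  # head.0.1 unreachable until t=31
third = lb.acquire(now=2.0)              # skips the banned worker
stalled = lb.acquire(now=3.0)            # only the banned worker remains
lb.release(third, estimated_duration=2.0)
fourth = lb.acquire(now=32.0)            # ban has expired
```

The `stalled` case corresponds to the stall condition in the text: the one worker still in the heap is short-banned, so `acquire` returns `None` until either the ban expires or another worker is released.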

Each pipeline stage gets its own DHT connection on the trainer, so worker discovery for stage head is independent of discovery for stage body3. The full protocol for how workers and trainers announce and refresh each other is in Communication Patterns → Periodic node announcement.