Training Architecture¶

Three layers make up an Agora training run. Discovery is a stateless DHT used for peer-to-peer addressing. Compute is the pipeline itself: workers, each holding one stage's parameters, running forward and backward passes. Coordination is a small set of CPU-only trainers that route microbatches through the pipeline and shard the dataset across them. The full system runs on heterogeneous, untrusted hardware without any one contributor holding the complete model weights.

Discovery Layer: Seeds (DHT bootstrap nodes; peer registration and discovery)
Seed 0: primary DHT bootstrap. Publishes multiaddr for initial peer connections.
Seed 1: redundant bootstrap. Ensures discovery continues if Seed 0 is unreachable.
Peer discovery: multiaddr exchange over libp2p / DHT.

Compute Layer: Training Pipeline (workers hold one pipeline stage's parameters, process fwd/bwd, and run periodic SPARTA state averaging within their stage)
Stage 0 (Head): Pipe 0 and Pipe 1, H100 80G, 6 layers + embed each; SPARTA state averaging
Stage 1 (Body 1): Pipe 0 and Pipe 1, RTX 4090, 4 layers each; SPARTA state averaging
… (Bodies 2–4): Pipe 0 and Pipe 1, mixed hardware, 4 layers each; SPARTA state averaging
Stage 5 (Body 5): Pipe 0 and Pipe 1, L40S 48G, 4 layers each; SPARTA state averaging
Stage 6 (Tail): Pipe 0 and Pipe 1, A100 40G, 6 layers + lm_head each; SPARTA state averaging
Forward pass: activations flow Head → Body → Tail.
Backward pass: activation gradients flow Tail → Body → Head (grad_input only).

Coordination Layer: Trainers (microbatch coordination, load balancing, data management)
Trainer 0: CPU only
Trainer 1: CPU only

Three-zone Agora architecture: Discovery (Seeds), Compute (Workers), Coordination (Trainers). Heterogeneous example configuration; real swarms vary by participant hardware. Forward activations flow Head → Body → Tail; backward activation gradients (grad_input) flow Tail → Body → Head. Parameter gradients never cross stage boundaries.
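
As a rough illustration, the example configuration in the figure can be written down as plain data. This is only a sketch: the `Stage` class and its field names are hypothetical, and the stage names `body4` and `body5` are inferred from the naming pattern rather than taken from Agora's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical description of one pipeline stage in the example figure."""
    name: str
    layers: int       # transformer layers held by each worker in this stage
    hardware: str     # GPU type shown in the figure
    pipes: int = 2    # replicas of this stage (Pipe 0 and Pipe 1)


# Values copied from the example figure; real swarms vary by participant hardware.
EXAMPLE_PIPELINE = [
    Stage("head", 6, "H100 80G"),    # also holds the embedding
    Stage("body1", 4, "RTX 4090"),
    Stage("body2", 4, "mixed"),
    Stage("body3", 4, "mixed"),
    Stage("body4", 4, "mixed"),
    Stage("body5", 4, "L40S 48G"),
    Stage("tail", 6, "A100 40G"),    # also holds lm_head
]

total_layers = sum(stage.layers for stage in EXAMPLE_PIPELINE)
total_workers = sum(stage.pipes for stage in EXAMPLE_PIPELINE)
print(total_layers, total_workers)
```

Under these numbers the example model has 32 transformer layers spread over 14 workers, and no single worker holds more than 6 of them.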

Discovery Layer¶

A seed is a stateless DHT bootstrap. Workers query a seed for an entry point into the swarm and then communicate with other peers directly; the seed never stores model data of its own. Pluralis runs two seeds for redundancy: if Seed 0 is unreachable, Seed 1 serves the same role, and a new worker can bootstrap from either one.
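
The failover behaviour described above can be sketched as a simple loop over seed addresses. This is a minimal illustration, not Agora's bootstrap code: `bootstrap`, `fake_connect`, and the address strings are hypothetical stand-ins for whatever connection routine a worker actually uses.

```python
from typing import Callable, Iterable


def bootstrap(seeds: Iterable[str], connect: Callable[[str], object]) -> object:
    """Try each seed in order and return the first successful connection."""
    errors = []
    for addr in seeds:
        try:
            return connect(addr)
        except ConnectionError as exc:
            # This seed is unreachable; remember why and try the next one.
            errors.append((addr, str(exc)))
    raise ConnectionError(f"no seed reachable: {errors}")


def fake_connect(addr: str) -> str:
    """Hypothetical connector used only for this demo: seed0 is down."""
    if addr == "seed0":
        raise ConnectionError("seed0 unreachable")
    return f"connected via {addr}"


result = bootstrap(["seed0", "seed1"], fake_connect)
print(result)
```

Because the seed stores no model data, falling back to the second seed changes only the entry point into the swarm, not what the worker finds there.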

Compute Layer¶

A worker holds the model parameters and performs the compute. It owns one pipeline stage (Head, Body, or Tail) and runs forward and backward passes for batches the trainer routes to it. Workers within the same stage participate together in periodic, async SPARTA averaging rounds. The Workers section below covers the runtime structure.

Coordination Layer¶

A trainer holds no model parameters. Its role is to orchestrate the pipeline: route microbatches to a healthy worker in each stage, balance load across the workers in each stage, and supply dataset shards. Trainers are CPU-only and run on Pluralis-owned infrastructure rather than on contributor nodes; see the Trainers section for fault-tolerance and load-balancing details.

Component Deep Dive¶

Workers¶

A worker is a single process holding one stage's parameters (one or more transformer layers). It performs forward and backward on those parameters, runs its own local optimizer to apply gradients, and joins same-stage peers in periodic AllReduce rounds for state averaging. A worker has no knowledge of the rest of the pipeline: only its own stage.

Worker: Stage X

Connection Handlers: listen for trainer gRPC requests and place batches in the fwd / bwd queues. Multiplexed on the same port.
Runtime: loops over the fwd / bwd queues and dispatches batches into ModuleBackend for execution.
ModuleBackend: stores the nn.Module for this stage. Owns the forward / backward task pools.
DHTHandler: declares this Worker's availability in its stage (head.0.0, body1.0.1, …) for trainer + peer discovery.
SPARTA Optimizer: accumulates gradients locally, runs the local optimizer step, then matches with same-stage peers and AllReduces 5% of parameters.
DHT: Hivemind Kademlia DHT. Peer discovery, expert registration, progress tracking, matchmaking.

Worker internals: six co-resident components inside a single Worker process. Runtime drives ModuleBackend for compute; the SPARTA Optimizer coordinates the parameter-averaging step with same-stage peers via the shared DHT.

DHT¶

Agora uses Hivemind's Kademlia DHT for four functions: peer discovery, expert registration, progress tracking, and matchmaking for AllReduce.

ModuleBackend¶

ModuleBackend holds the nn.Module for this stage together with the forward and backward functions the Runtime invokes. It also owns the two task pools (forward and backward) where incoming trainer requests queue up before the Runtime processes them.

Async SPARTA¶

Each worker accumulates gradients from its own backward passes and runs its own optimizer step locally; there is no per-step gradient AllReduce. Same-stage replicas drift apart as a result. To re-synchronize, every 20 local steps the worker matches with same-stage peers over the DHT and AllReduces 5% of its parameters. Successive rounds cover non-overlapping slices, so the full parameter set has cycled through over a 20-round window.
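
The schedule above can be made concrete with a little arithmetic: at 5% per round, 20 rounds cover 100% of the parameters, and at one round every 20 local steps a full cycle takes 400 local steps. The sketch below assumes the slices are contiguous index ranges over a flattened parameter vector, which the text does not specify; `sparta_slice` and `should_average` are hypothetical names.

```python
def should_average(local_step: int, interval: int = 20) -> bool:
    """True on the local steps where an averaging round is due."""
    return local_step > 0 and local_step % interval == 0


def sparta_slice(round_idx: int, n_params: int, fraction: float = 0.05) -> range:
    """Indices of the flattened parameters averaged in a given round.

    Assumption: slices are contiguous and visited in order, so that
    successive rounds never overlap within one cycle.
    """
    n_slices = round(1 / fraction)            # 20 slices at 5% each
    k = round_idx % n_slices                  # position inside the current cycle
    start = k * n_params // n_slices
    stop = (k + 1) * n_params // n_slices     # equals the next slice's start
    return range(start, stop)


n_params = 1003  # deliberately not a multiple of 20
covered = []
for r in range(20):
    covered.extend(sparta_slice(r, n_params))

# After 20 rounds every index has been visited exactly once.
print(len(covered) == n_params, len(set(covered)) == n_params)
```

Integer division keeps the slice boundaries aligned, so the last slice absorbs any remainder and no parameter is skipped.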

Paper

AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism

Connection Handlers¶

gRPC listeners that receive trainer requests and put each batch into the right queue: forward or backward. Multiple listeners share a single port.

DHTHandler¶

A background thread that keeps re-announcing this worker in the DHT under its stage-prefixed UID (head.0.0, body1.0.1, tail.0.0). Trainers read the announcements to find workers; same-stage peers use them for matchmaking during AllReduce.
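
Re-announcing matters because DHT records are normally stored with an expiration time, so a worker that stops announcing eventually disappears from discovery. The sketch below illustrates that idea with an in-memory stand-in; `InMemoryDHT`, `announce`, the 30-second TTL, the addresses, and the UID `body1.0.0` are assumptions for illustration, and explicit timestamps replace the background thread so the example stays deterministic.

```python
class InMemoryDHT:
    """Toy stand-in for a DHT: key -> (value, expiration_time)."""

    def __init__(self):
        self._records = {}

    def store(self, key, value, expiration_time):
        self._records[key] = (value, expiration_time)

    def live_keys(self, prefix, now):
        """Keys under a stage prefix whose records have not yet expired."""
        return sorted(
            key
            for key, (_, expires) in self._records.items()
            if key.startswith(prefix) and expires > now
        )


def announce(dht, uid, addr, now, ttl=30.0):
    """Publish (or refresh) this worker's record; ttl is an assumed value."""
    dht.store(uid, addr, expiration_time=now + ttl)


dht = InMemoryDHT()
announce(dht, "body1.0.0", "addr-a", now=0.0)
announce(dht, "body1.0.1", "addr-b", now=0.0)
announce(dht, "body1.0.0", "addr-a", now=20.0)   # only this worker keeps announcing

live_at_25 = dht.live_keys("body1.", now=25.0)   # both records still valid
live_at_40 = dht.live_keys("body1.", now=40.0)   # the silent worker has expired
print(live_at_25, live_at_40)
```

A trainer or same-stage peer that lists the stage prefix therefore sees only workers that announced recently.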

Runtime¶

The main loop. It dequeues batches from the forward and backward queues and runs them through ModuleBackend. On the backward path it rebuilds the autograd graph by re-running the forward pass (see activation recomputation for details), calls torch.autograd.backward(), and triggers the optimizer step once the batch-size accumulator has collected enough samples.
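
One plausible reading of the batch-size accumulator is a counter that fires the optimizer step once enough samples have been processed. The sketch below shows that reading only; `BatchAccumulator`, `target_batch_size`, and the reset-to-zero rule are assumptions, since the text does not spell out the exact trigger.

```python
class BatchAccumulator:
    """Counts samples seen in backward passes and signals when to step."""

    def __init__(self, target_batch_size: int):
        self.target = target_batch_size
        self.samples = 0       # samples accumulated since the last step
        self.steps = 0         # optimizer steps triggered so far

    def add(self, n_samples: int) -> bool:
        """Record one backward pass; return True if an optimizer step is due."""
        self.samples += n_samples
        if self.samples >= self.target:
            self.steps += 1
            self.samples = 0   # assumption: any overshoot is discarded
            return True
        return False


acc = BatchAccumulator(target_batch_size=8)
fired = [acc.add(n) for n in (4, 2, 2, 4, 4)]
print(fired, acc.steps)
```

With a target of 8 samples, the third and fifth microbatches in this run complete an accumulation window and would trigger a local optimizer step.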

Batch processing¶

Once running, the Worker processes batches from trainers in an event loop:

1. Trainer sends a forward request via gRPC → Connection Handler places it in the forward queue.
2. Runtime dequeues the batch → calls ModuleBackend.forward() → returns the output.
3. Trainer sends a backward request with gradient outputs → Connection Handler places it in the backward queue.
4. Runtime dequeues the batch → calls ModuleBackend.backward() → triggers the optimizer step when the batch-size accumulator calls for it.
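
The steps above can be sketched with two queues and a dispatch loop. This is a single-threaded toy rather than the actual Worker: `ToyWorker`, `RecordingBackend`, and the choice to drain the forward queue before the backward queue are assumptions, and plain method calls stand in for gRPC.

```python
import queue


class RecordingBackend:
    """Stand-in for ModuleBackend that records which calls were made."""

    def __init__(self):
        self.calls = []

    def forward(self, batch):
        self.calls.append(("forward", batch))
        return f"out({batch})"

    def backward(self, batch):
        self.calls.append(("backward", batch))
        return f"grad_input({batch})"


class ToyWorker:
    def __init__(self, backend):
        self.backend = backend
        self.fwd_queue = queue.Queue()
        self.bwd_queue = queue.Queue()

    def handle_request(self, kind, batch):
        """Connection Handler role: put the batch in the matching queue."""
        target = {"forward": self.fwd_queue, "backward": self.bwd_queue}[kind]
        target.put(batch)

    def run_once(self):
        """Runtime role: drain both queues and dispatch into the backend."""
        results = []
        for q, fn in ((self.fwd_queue, self.backend.forward),
                      (self.bwd_queue, self.backend.backward)):
            while True:
                try:
                    batch = q.get_nowait()
                except queue.Empty:
                    break
                results.append(fn(batch))
        return results


worker = ToyWorker(RecordingBackend())
worker.handle_request("forward", "mb0")
worker.handle_request("backward", "mb0")
results = worker.run_once()
print(results)
```

Separating the handler from the loop means network listeners only enqueue work, while all compute is issued from one place.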

Trainers¶

The trainer's role looks like an ordinary PyTorch training loop: forward through the model, compute a loss, call backward. The difference is that none of those calls run locally. Every forward and backward is dispatched over the network to a worker holding the relevant stage. The trainer's responsibilities are to track the full pipeline topology, select a healthy worker for each stage, and route activations forward and activation gradients back. Parameter gradients themselves never leave a worker, and the trainer never holds parameters of its own.

The Trainer (CPU, no parameters) orchestrates remote workers across pipeline stages over libp2p gRPC. Forward activations flow Head → Body → Tail; backward activation gradients flow Tail → Body → Head. Two pipes per stage give data-parallel redundancy.

Training flow¶

Startup¶

Loads configuration and tokenizer config.
Prepares the dataset from Hugging Face.
Creates DHT connections using seed peers. Each model stage has its own dedicated DHT, enabling partitioning across stages.
Using the DHT, the Trainer discovers workers in each stage; all stages must show at least one available Worker before training can start.
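
The final startup condition amounts to a readiness check over the discovered workers. The sketch below is hypothetical: `missing_stages`, `can_start`, and the dictionary layout are illustrative and not Agora's API.

```python
def missing_stages(stages, discovered):
    """Stages for which discovery has not yet returned any worker."""
    return [stage for stage in stages if not discovered.get(stage)]


def can_start(stages, discovered):
    """Training may begin only when every stage has at least one worker."""
    return not missing_stages(stages, discovered)


stages = ["head", "body1", "tail"]
discovered = {"head": ["head.0.0"], "body1": [], "tail": ["tail.0.0"]}

waiting_on = missing_stages(stages, discovered)
print(waiting_on, can_start(stages, discovered))

discovered["body1"].append("body1.0.1")   # a body1 worker announces itself
print(can_start(stages, discovered))
```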

Training loop¶

For each batch, the trainer iterates through the pipeline in order:

hidden = head.forward(input_ids[:, :-1])
hidden = body1.forward(hidden)
hidden = body2.forward(hidden)
loss   = tail.forward(hidden, input_ids[:, 1:])   # shifted labels for LM
loss.backward()                                   # triggers backward on all stages
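
To make the data flow in the loop above concrete, here is a toy version in which each stage is a single scalar weight. It is only a sketch of the routing pattern: `ToyStage` and `train_step` are hypothetical, the stages are local objects rather than remote workers, and the loss is computed inline instead of inside the tail stage.

```python
class ToyStage:
    """Scalar stand-in for a remote stage computing y = w * x."""

    def __init__(self, w: float):
        self.w = w
        self.grad_w = 0.0     # parameter gradient: stays on the worker
        self._x = None        # input cached for the backward pass

    def forward(self, x: float) -> float:
        self._x = x
        return self.w * x

    def backward(self, grad_output: float) -> float:
        self.grad_w += grad_output * self._x   # local parameter gradient
        return grad_output * self.w            # grad_input, sent upstream


def train_step(stages, x: float, target: float) -> float:
    hidden = x
    for stage in stages:                       # Head → Body → Tail
        hidden = stage.forward(hidden)
    loss = 0.5 * (hidden - target) ** 2
    grad = hidden - target                     # d(loss)/d(output)
    for stage in reversed(stages):             # Tail → Body → Head
        grad = stage.backward(grad)            # only grad_input crosses stages
    return loss


head, body, tail = ToyStage(2.0), ToyStage(3.0), ToyStage(0.5)
loss = train_step([head, body, tail], x=1.0, target=2.0)
print(loss, head.grad_w, body.grad_w, tail.grad_w)
```

The trainer-side function only ever sees activations and activation gradients; each `grad_w` is computed and kept inside its own stage.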

Worker selection within a stage uses a min-heap keyed by accumulated virtual runtime, so the least-loaded worker is selected first. When a worker finishes a request, the task's estimated duration is added to its virtual runtime and the worker re-enters the heap; new arrivals enter at the end. A transient network error triggers a retry against the same worker or a different one, depending on the error class, and an unreachable worker is short-banned from the heap for 30 s. A pipeline step stalls only if a stage empties entirely, that is, when every worker in that stage is unreachable or short-banned at once.
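
The selection policy can be sketched with Python's `heapq`. This is an illustration under assumptions, not the trainer's implementation: `StageScheduler` and its method names are hypothetical, time is passed in explicitly, ties are broken by insertion order, and the handling of newly arriving workers is left out.

```python
import heapq


class StageScheduler:
    """Least-virtual-runtime worker selection with a short ban on failure."""

    def __init__(self, workers, ban_seconds=30.0):
        self._vruntime = {w: 0.0 for w in workers}
        self._heap = [(0.0, seq, w) for seq, w in enumerate(workers)]
        heapq.heapify(self._heap)
        self._seq = len(self._heap)       # tie-breaker for equal runtimes
        self._banned_until = {}
        self._ban_seconds = ban_seconds

    def _push(self, worker):
        heapq.heappush(self._heap, (self._vruntime[worker], self._seq, worker))
        self._seq += 1

    def acquire(self, now):
        """Pop the least-loaded worker that is not currently banned."""
        skipped, chosen = [], None
        while self._heap:
            entry = heapq.heappop(self._heap)
            if self._banned_until.get(entry[2], 0.0) > now:
                skipped.append(entry)     # still banned: keep for later
            else:
                chosen = entry[2]
                break
        for entry in skipped:
            heapq.heappush(self._heap, entry)
        if chosen is None:
            raise RuntimeError("stage empty: all workers busy, unreachable, or banned")
        return chosen

    def release(self, worker, est_duration):
        """Request finished: charge the estimated duration and re-enter."""
        self._vruntime[worker] += est_duration
        self._push(worker)

    def ban(self, worker, now):
        """Worker was unreachable: keep it out of selection for a while."""
        self._banned_until[worker] = now + self._ban_seconds
        self._push(worker)


sched = StageScheduler(["body1.0.0", "body1.0.1"])
first = sched.acquire(now=0.0)
sched.release(first, est_duration=2.0)
second = sched.acquire(now=1.0)           # the other worker now has less runtime
sched.ban(second, now=1.0)                # pretend it was unreachable
third = sched.acquire(now=5.0)            # banned worker is skipped
sched.release(third, est_duration=2.0)
fourth = sched.acquire(now=40.0)          # ban has expired by now
print(first, second, third, fourth)
```

A worker that is in flight is simply absent from the heap, so in this sketch the stage stalls exactly when `acquire` finds no eligible entry.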

Each pipeline stage gets its own DHT connection on the trainer, so worker discovery for stage head is independent of discovery for stage body3. The full protocol for how workers and trainers announce and refresh each other is in Communication Patterns → Periodic node announcement.
