About

Pluralis is a research lab focused on collectively-owned AI.

Closed models capture enormous value but lead to an unacceptable concentration of power. Open-weight models distribute power but face massive headwinds to financial sustainability. Our work is on a third path: collective, community-driven training that is self-sustaining. Our team came from Google, Anthropic, and Amazon, where we worked together for many years prior to Pluralis. We publish openly.

We are currently carrying out open, multi-participant training runs. You can find information about previous runs here and the current run here, and you can apply to join in the planning and development of future runs here.

Research

Factored Gossip DiLoCo: Reducing Blocking Communication within DiLoCo

ICML 2026

C. Koneputugodage, T. Ajanthan, S. Ramasinghe, H. Dolatabadi, S. Siriwardhana, G. Avraham, V. Shevchenko, K. Pajak, J. Snewin, A. Long

We relax DiLoCo’s exact outer synchronization to approximate synchronization via mixing and gossip, factorizing it into a non-blocking step that overlaps computation with no staleness and a blocking step that tightens worker agreement. On billion-parameter language models in low-bandwidth settings, the method substantially improves compute utilization while matching DiLoCo’s training progress and is more robust to failures.
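To give a feel for what "approximate synchronization via mixing and gossip" means, here is a minimal sketch of generic gossip averaging on a ring. It is not the factored scheme from the paper: the ring topology, the mixing weights, the scalar "parameters", and the function names are all assumptions chosen for illustration. Each worker only talks to its two neighbours, yet the workers drift toward the same value an exact all-reduce would produce.

```python
def ring_gossip_round(values, self_weight=0.5):
    """One mixing round on a ring (illustrative topology, not from the paper).

    Each worker blends its own value with those of its two ring neighbours.
    The weights sum to 1 for every worker and are symmetric, so the global
    mean is preserved while disagreement between workers shrinks.
    """
    n = len(values)
    neighbour_weight = (1.0 - self_weight) / 2.0
    return [
        self_weight * values[i]
        + neighbour_weight * (values[(i - 1) % n] + values[(i + 1) % n])
        for i in range(n)
    ]


def spread(values):
    """Largest disagreement between any two workers."""
    return max(values) - min(values)


# Hypothetical scalar "parameters" held by four workers.
workers = [1.0, 5.0, 9.0, 13.0]
exact_mean = sum(workers) / len(workers)  # what an exact all-reduce would give

spreads = [spread(workers)]
for _ in range(20):
    workers = ring_gossip_round(workers)
    spreads.append(spread(workers))

gossip_mean = sum(workers) / len(workers)
print(f"exact mean: {exact_mean}, gossip mean: {gossip_mean}")
print(f"initial spread: {spreads[0]}, final spread: {spreads[-1]:.2e}")
```

In this toy case the spread falls from 12 to 4 after the first round and then halves every round, while the mean stays at 7 throughout. A worker never waits for a global barrier, which is the intuition behind trading exact agreement for approximate agreement.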

S. Ramasinghe, T. Ajanthan, H. Dolatabadi, C. Koneputugodage, G. Avraham, V. Shevchenko, Y. Zuo, K. Pajak, A. Long

We introduce a fast online curvature estimator that tracks preconditioned Hessian behavior during billion-parameter Transformer training. It reveals depth-driven curvature surges behind loss spikes and motivates architecture warm-up: progressively growing depth to stabilize training without slowing convergence.

T. Ajanthan, S. Ramasinghe, Y. Zuo, G. Avraham, A. Long

Pipeline parallelism trains large models by splitting them into stages, but idle “bubbles” slow training, especially when network latency is high. Our Nesterov method corrects stale updates and outperforms existing async techniques and the synchronous baseline.

S. Ramasinghe, T. Ajanthan, H. Dolatabadi, G. Avraham, V. Shevchenko, Y. Zuo, C. Koneputugodage, A. Long

We introduce a compression method for communication-efficient context parallelism that achieves over 95% compression with negligible overhead and no convergence loss. By exploiting low-rank activation structure through learned mixtures of subspaces, it scales billion-parameter decentralized models to 100K+ context lengths on 300 Mbps networks while matching centralized wall-clock convergence.
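A rough back-of-envelope sketch shows why compression at this level matters on a slow link. Only the 95% ratio, the 100K-token context, and the 300 Mbps bandwidth come from the summary above. The hidden size of 4096, the 2-byte (bf16) activations, the single one-way transfer, and the helper name `transfer_seconds` are assumptions made for illustration, not details of the method.

```python
def transfer_seconds(tokens, hidden_dim, bytes_per_value, link_mbps, compression=0.0):
    """Time to send one activation tensor over a link, ignoring latency and protocol overhead.

    compression is the fraction of the payload removed (0.95 means 5% is sent).
    """
    payload_bits = tokens * hidden_dim * bytes_per_value * 8 * (1.0 - compression)
    return payload_bits / (link_mbps * 1e6)


# Assumed model shape: hidden size 4096, bf16 activations (2 bytes per value).
raw = transfer_seconds(tokens=100_000, hidden_dim=4096, bytes_per_value=2, link_mbps=300)
compressed = transfer_seconds(
    tokens=100_000, hidden_dim=4096, bytes_per_value=2, link_mbps=300, compression=0.95
)

print(f"uncompressed: {raw:.1f} s per transfer")
print(f"95% compressed: {compressed:.2f} s per transfer")
```

Under these assumptions a single uncompressed transfer takes about 21.8 seconds, against roughly 1.1 seconds at 95% compression. Real runs move activations many times per step, so the absolute numbers are only indicative.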
