About
Pluralis carries out foundational research on Protocol Learning: multi-participant training of foundation models where no single participant has, or can ever obtain, a full copy of the model. The purpose of Protocol Learning is to facilitate the creation of community-trained and community-owned frontier models with self-sustaining economics.
Research
S. Ramasinghe, T. Ajanthan, H. Dolatabadi, C. Koneputugodage, G. Avraham, V. Shevchenko, Y. Zuo, K. Pajak, A. Long | ICLR 2026
We introduce a fast online curvature estimator that tracks preconditioned Hessian behavior during billion-parameter Transformer training. It reveals depth-driven curvature surges behind loss spikes and motivates architecture warm-up: progressively growing depth to stabilize training without slowing convergence.
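A minimal sketch of the kind of signal such an estimator tracks (a generic Hessian power-iteration probe, not the paper's estimator; it omits the optimizer preconditioning, and the function name and signature are illustrative):

```python
import torch

def top_curvature(loss, params, iters=5):
    """Approximate the largest Hessian eigenvalue of `loss` w.r.t. `params` by power iteration."""
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = torch.tensor(0.0)
    for _ in range(iters):
        # Normalise the probe vector across all parameter tensors.
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: differentiate <grads, v> a second time.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((h * x).sum() for h, x in zip(hv, v))  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig.item()
```

Called every few hundred steps on a loss whose graph is still alive, this gives a cheap online curvature trace of the sort used to detect the depth-driven spikes described above.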
S. Ramasinghe, T. Ajanthan, G. Avraham, Y. Zuo, A. Long | NeurIPS 2025
This work demonstrates that model-parallel training over low-bandwidth networks is possible: an 8B LLaMA model is trained on par with centralized training while its transformer blocks are split across four locations connected only by standard internet links.
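A minimal sketch of the partitioning this setting relies on (not Pluralis's system; it assumes torch.distributed has already been initialized with the TCP-capable gloo backend, and names and shapes are illustrative):

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def run_stage(blocks: nn.Sequential, rank: int, world_size: int,
              act_shape=torch.Size([4, 1024, 4096])):
    """Forward pass for one pipeline stage; assumes dist.init_process_group("gloo", ...) was called."""
    if rank == 0:
        x = torch.randn(act_shape)        # stage 0 consumes the input batch
    else:
        x = torch.empty(act_shape)
        dist.recv(x, src=rank - 1)        # boundary activations from the upstream stage
    x = blocks(x)                         # only this rank's slice of transformer blocks
    if rank < world_size - 1:
        dist.send(x, dst=rank + 1)        # only activations ever cross the network
    return x
```

The paper's contribution is making this viable when the links between stages are slow; the sketch only shows that nothing beyond boundary activations leaves a participant.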
T. Ajanthan, S. Ramasinghe, Y. Zuo, G. Avraham, A. Long | ICML 2025
Pipeline parallelism trains large models by splitting them into stages, but idle “bubbles” slow training, especially when network latency is high. Asynchronous execution removes the bubbles at the cost of stale updates; our Nesterov-based method corrects this staleness and outperforms existing async techniques as well as the synchronous baseline.
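A minimal sketch of the idea (a Nesterov-style lookahead applied to a delayed gradient; this is illustrative, not the paper's exact update rule, and all names and hyperparameters are ours):

```python
import torch

@torch.no_grad()
def delayed_nesterov_step(param, stale_grad, momentum_buf, lr=1e-3, mu=0.9, staleness=1):
    """Apply a gradient computed `staleness` steps ago with a Nesterov-style correction."""
    momentum_buf.mul_(mu).add_(stale_grad)
    # Look further ahead along the momentum direction when the gradient is older,
    # to approximate where the parameters will be when the update actually lands.
    lookahead = stale_grad + (mu ** staleness) * momentum_buf
    param.add_(lookahead, alpha=-lr)
    return momentum_buf
```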
A. Long*, C. Koneputugodage*, S. Ramasinghe, T. Ajanthan, G. Avraham, Y. Zuo | NeurIPS 2025
UPMs enable collaborative training and inference without ever materializing the full model weights for any participant, making decentralized models unextractable in practice.
S. Ramasinghe, T. Ajanthan, H. Dolatabadi, G. Avraham, V. Shevchenko, Y. Zuo, C. Koneputugodage, A. Long | NeurIPS 2025
We introduce a compression method for communication-efficient context parallelism that achieves over 95% compression with negligible overhead and no convergence loss. By exploiting low-rank activation structure through learned mixtures of subspaces, it scales billion-parameter decentralized models to 100K+ context lengths on 300 Mbps networks while matching centralized wall-clock convergence.
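A minimal sketch of the underlying idea (routing each token to one of a few learned low-rank subspaces before its activation crosses the network; the module, names, and dimensions are illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn

class SubspaceMixtureCompressor(nn.Module):
    def __init__(self, d_model=4096, rank=64, n_subspaces=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_subspaces)  # picks one subspace per token
        self.down = nn.Parameter(torch.randn(n_subspaces, d_model, rank) / d_model ** 0.5)
        self.up = nn.Parameter(torch.randn(n_subspaces, rank, d_model) / rank ** 0.5)

    def compress(self, x):                                    # x: [tokens, d_model]
        # Hard routing shown for clarity; a real system would need a differentiable
        # or pre-trained router. Only `code` and `idx` would be sent over the wire.
        idx = self.router(x).argmax(dim=-1)                   # [tokens]
        code = torch.einsum("td,tdr->tr", x, self.down[idx])  # [tokens, rank]
        return code, idx

    def decompress(self, code, idx):
        return torch.einsum("tr,trd->td", code, self.up[idx])  # [tokens, d_model]
```

With d_model=4096 and rank=64, the transmitted code per token is under 2% of the original activation size, i.e. a compression ratio above 95%.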