FedPOD: the deployable units of training for federated learning
Daewoon Kim, Jae Sung Lee et al.
Paper
Details
#FederatedLearning #FedPOD #DistributedTraining
FlowMoE Adds Scalable Pipeline Scheduling for Distributed MoE Training
FlowMoE reduces training time by up to 57% and energy use by up to 39% on two GPU clusters. Read more: getnews.me/flowmoe-adds-scalable-pi... #flowmoe #mixtureofexperts #distributedtraining
Text Shot: Flower AI and Vana, two startups pursuing unconventional approaches to building AI, worked together to create the new model, called Collective-1. Flower created techniques that allow training to be spread across hundreds of computers connected over the internet. The company’s technology is already used by some firms to train AI models without needing to pool compute resources or data. Vana provided sources of data including private messages from X, Reddit, and Telegram.
These Startups Are Building Advanced AI Models Without Data Centers www.wired.com/story/these-startups-are... #AI #DistributedTraining
Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models
Feifei Niu, Haoxuan Chen et al.
Paper
Details
#DistributedTraining #LargeLanguageModels #AIResearch
Why this matters - decentralized training will alter the political economy of superintelligence: Currently, a lot of AI policy relies on the idea that powerful AI systems will be trained by a very small number of entities that can individually amass very large amounts of compute - for instance, frontier labs like Anthropic or OpenAI, or hyperscalers like Google. As distributed training software gets better and more 'proof points' emerge of good models trained in a distributed way, this dynamic could change - if models like INTELLECT-2 are good and generate economic value, it might lead to a new type of player on the AGI gameboard: loose federations of organizations pooling compute in a globally distributed way to train models.
Import AI 409: Huawei trains a model on 8,000+ Ascend chips; 32B decentralized training run; and the era of experience and superintelligence importai.substack.com/p/import-ai-409-huawei-t... #AI #DistributedTraining
Why this matters - distributed training breaks some of the assumptions of AI policy: Distributed training means it becomes easy to train AI systems using multiple disaggregated blobs of compute rather than one single blob of compute. If you push this idea far enough - say, training a 70B model across ~10 distinct datacenters - then you enter a regime where many of the tools of AI policy (monitoring of large concentrations of compute, controls over the export of certain quantities of compute) might be invalidated.
Import AI 404: Scaling laws for distributed training; misalignment predictions made real; and Alibaba's good translation model importai.substack.com/p/import-ai-404-scaling-... #AI #DistributedTraining
Cambridge researchers show how to use distributed training to make a 1.3bn parameter LLM: ..More evidence that distributed training works well for relatively small models... Researchers with the University of Cambridge and Flower Labs have shown that it's possible to use cheap, distributed training approaches to train LLMs at the billion-parameter scale, providing more clues that in the future, some AI models could be trained via collectives pooling their hardware akin to the filesharing communities that developed around BitTorrent.
Today, frontier AI systems are trained in large data centers that contain lots of computers which are densely networked together. This makes training AI systems expensive and hard for anyone without access to a large data center. Alongside the rise of LLMs, various researchers have been trying to figure out how to train LLMs in a much more distributed way - where your computers sit in separate data centers many miles from one another (sometimes in completely different countries), and you train your system by sharding it across all of your different computers, doing some local computation, aggregating data back at some cadence, and using this to update the global model and step through training. These techniques used to be very fragile and of dubious utility, but they have started to improve recently, and major AI research organizations such as Google DeepMind have been pouring resources into this area.
Import AI 380: Distributed 1.3bn parameter LLM. importai.substack.com/p/import-ai-380-distribu... #AI #DistributedTraining (interesting)
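The loop described above - shard the work across machines, run local computation, aggregate at some cadence, update the global model - can be sketched in a few lines. This is a minimal toy illustration using plain NumPy on a quadratic objective; all names, the number of workers, the cadence, and the objective are illustrative assumptions, not the actual method used in any of the systems mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_WORKERS = 4    # stand-ins for separate data centers
LOCAL_STEPS = 10   # local computation done between aggregations ("cadence")
ROUNDS = 20
LR = 0.1

# Each worker holds its own data shard; here, noisy observations of a target.
target = np.array([3.0, -2.0])
shards = [target + 0.1 * rng.standard_normal((50, 2)) for _ in range(NUM_WORKERS)]

global_model = np.zeros(2)

for _ in range(ROUNDS):
    local_models = []
    for shard in shards:
        w = global_model.copy()                 # start from the global model
        for _ in range(LOCAL_STEPS):
            grad = np.mean(w - shard, axis=0)   # gradient of mean squared error
            w -= LR * grad                      # local gradient step
        local_models.append(w)
    # Aggregate: average the workers' parameters into a new global model.
    global_model = np.mean(local_models, axis=0)

print(global_model)  # converges close to the target [3.0, -2.0]
```

The point of the sketch is the communication pattern: workers exchange parameters only once every LOCAL_STEPS gradient steps rather than every step, which is what makes training tolerable over slow links between distant sites.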