MLX now supports distributed inference and training across multiple Apple Silicon Macs via RDMA over Thunderbolt 5 and the JACCL collective communication library, enabling larger models and faster token generation than a single machine can achieve. Developers can shard billion-parameter LLMs across a cluster of Macs using pipeline or tensor parallelism, orchestrated through MLX's Swift, Python, or CLI APIs.
• Run models too large for a single Mac: a 1-trillion-parameter model like Kimi 2.6 requires ~1 TB of memory at 8-bit quantization, which fits across four M3 Ultras but not on one
• Near-linear inference speedup: a 4-node M3 Ultra cluster achieves nearly 3× the tokens-per-second of a single machine for a 27B-parameter model like Qwen 3.6
• Zero-change model code: wrapping an existing mlx_lm command with mlx.launch shards and coordinates the model across the cluster automatically, with no model-level code changes required
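The sizing and speedup figures in the list above follow from simple arithmetic, sketched here in plain Swift. The function names are illustrative, the 512 GB per-node figure is an assumption (the largest M3 Ultra memory configuration), and the estimate covers weights only, ignoring KV cache, activations, and runtime overhead.

```swift
import Foundation

/// Bytes needed to hold the weights alone (no KV cache or activations).
func weightBytes(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8.0
}

/// Smallest number of nodes whose combined memory can hold `bytes`.
func minimumNodes(forBytes bytes: Double, memoryPerNodeBytes: Double) -> Int {
    Int((bytes / memoryPerNodeBytes).rounded(.up))
}

/// Parallel efficiency: measured speedup divided by the ideal (linear) speedup.
func parallelEfficiency(speedup: Double, nodes: Int) -> Double {
    speedup / Double(nodes)
}

let gigabyte = 1e9
let weights = weightBytes(parameters: 1e12, bitsPerWeight: 8)  // 1e12 bytes, i.e. 1 TB
let perNode = 512 * gigabyte  // assumption: 512 GB of unified memory per node
let nodes = minimumNodes(forBytes: weights, memoryPerNodeBytes: perNode)

print("Weights: \(weights / 1e12) TB")
print("Fits on one node: \(weights <= perNode)")
print("Minimum nodes for weights alone: \(nodes)")
print("Efficiency at 3x on 4 nodes: \(parallelEfficiency(speedup: 3, nodes: 4))")
```

Under these assumptions the weights alone need at least two nodes, so a four-node cluster leaves headroom for the KV cache and activations, and a 3× speedup on four nodes corresponds to 75% parallel efficiency.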
Demonstrates how to use the MLX Swift API to initialize a distributed process group, query the local rank and world size, and perform a basic all-reduce communication primitive, the building block of tensor-parallel inference across a Thunderbolt 5 Mac cluster.

import Foundation
import MLX
import MLXDistributed

// MARK: - Distributed Group Setup
// Each node in the cluster runs this same executable.
// mlx.launch SSHes into each node and starts it with the correct rank env vars.
struct DistributedClusterDemo {
    static func run() async throws {
        // Initialize the default distributed process group.
        // MLX reads JACCL/RDMA configuration from environment variables
        // set by mlx.launch via the hostfile.
        let group = MLXDistributed.init()
        let rank = group.rank        // Index of this machine in the cluster (0-based)
        let worldSize = group.size   // Total number of machines in the cluster
        print("Node \(rank) of \(worldSize) initialized.")

        // MARK: - All-Reduce: Sum activations across all nodes
        // In tensor parallelism each node computes a partial result for every
        // layer. An all-reduce sums those partials so every node has the
        // full result before the next layer.

        // Simulate a partial activation computed locally on this node.
        // In a real LLM this would be the output of a sharded linear layer.
        // Declared with `let` because the array is never mutated.
        let localActivation = MLXArray(
            (0..<8).map { Float($0) + Float(rank) * 8 },
            [8]
        )
        print("Node \(rank) partial activations: \(localActivation)")

        // All-reduce (sum) across the group; blocks until all nodes contribute.
        let globalActivation = group.allReduce(
            localActivation,
            operation: .sum
        )
        // Force evaluation so the result is materialized before printing.
        eval(globalActivation)
        print("Node \(rank) global activations after all-reduce: \(globalActivation)")

        // MARK: - All-Gather: Collect each node's shard on every node
        // All-gather concatenates the per-node arrays so every node holds the
        // complete tensor, e.g. to reassemble a sharded output in tensor
        // parallelism. (Pipeline parallelism instead passes activations
        // point-to-point between adjacent stages.)
        let gatheredActivations = group.allGather(localActivation)
        eval(gatheredActivations)

        if rank == 0 {
            // Only the rank-0 (coordinator) node logs the aggregated result.
            print("Coordinator sees gathered shape: \(gatheredActivations.shape)")
            // Shape is [worldSize * 8]: each node's shard concatenated.
        }

        // Barrier: wait for all nodes before exiting.
        group.barrier()
        print("Node \(rank): distributed run complete.")
    }
}

// Entry point. Run as a Swift executable via:
// mlx.launch --hostfile cluster.json -- /path/to/DistributedClusterDemo
try await DistributedClusterDemo.run()

• RDMA must be explicitly enabled per machine in System Settings and requires a reboot; it is off by default
• The Thunderbolt Bridge must be disabled on each link used for RDMA; the mlx.distributed_config --auto-setup flag handles this automatically
• The MLX_METAL_FAST_SYNCH=1 environment variable is critical for distributed tasks, which interleave GPU compute with CPU communication; omitting it degrades performance significantly
• Tensor parallelism requires a low-latency full-mesh topology; a ring topology is better suited to bandwidth-bound pipeline parallelism with large activations
• All executables and model weights must be accessible on every node at the same path; SSH access from the orchestrating machine to all nodes is required
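For intuition about what the demo's two collectives compute, the following single-process sketch reproduces their semantics in plain Swift with no MLX dependency. It assumes, as the demo's comments describe, that all-reduce sums element by element and that all-gather concatenates shards in rank order. The helper names and the four-node world size are illustrative, not part of any MLX API.

```swift
import Foundation

/// Elementwise sum across ranks: every rank ends up with the same vector.
func allReduceSum(_ perRank: [[Float]]) -> [Float] {
    guard let first = perRank.first else { return [] }
    return perRank.dropFirst().reduce(first) { acc, next in
        zip(acc, next).map { $0 + $1 }
    }
}

/// Concatenation in rank order: every rank ends up with all shards.
func allGather(_ perRank: [[Float]]) -> [Float] {
    perRank.flatMap { $0 }
}

let worldSize = 4  // assumption: a four-node cluster

// Same per-rank input as the demo: rank r holds [0, 1, ..., 7] + 8r.
let locals: [[Float]] = (0..<worldSize).map { rank in
    (0..<8).map { Float($0) + Float(rank) * 8 }
}

let reduced = allReduceSum(locals)
let gathered = allGather(locals)

print("all-reduce result:", reduced)
print("all-gather element count:", gathered.count)
```

With four ranks, element i of the reduced vector is 4i + 48, and the gathered vector has worldSize × 8 = 32 elements, matching the shape the demo's coordinator expects to log.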
Requires Apple Silicon Macs with Thunderbolt 5 ports, physical Thunderbolt 5 cables connecting the cluster, RDMA over Thunderbolt enabled in System Settings, and macOS 26.2+. Not available on iOS or iPadOS. All cluster nodes must have MLX and required libraries installed.
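When planning the physical cabling, it can help to count links for the two topologies mentioned above. This is a rough sketch that assumes exactly one Thunderbolt 5 cable per directly connected pair of nodes; the function names are illustrative.

```swift
import Foundation

/// Cables for a full mesh: every pair of nodes is directly connected.
func fullMeshLinks(nodes n: Int) -> Int {
    n * (n - 1) / 2
}

/// Cables for a ring: each node connects to its two neighbours.
/// Two nodes degenerate to a single cable.
func ringLinks(nodes n: Int) -> Int {
    switch n {
    case ..<2: return 0
    case 2: return 1
    default: return n
    }
}

/// Thunderbolt ports each node must dedicate to the cluster.
func portsPerNode(nodes n: Int, fullMesh: Bool) -> Int {
    let neighbours = max(n - 1, 0)
    return fullMesh ? neighbours : min(2, neighbours)
}

for n in 2...4 {
    print("\(n) nodes: full mesh \(fullMeshLinks(nodes: n)) cables, "
        + "\(portsPerNode(nodes: n, fullMesh: true)) ports per node; "
        + "ring \(ringLinks(nodes: n)) cables, "
        + "\(portsPerNode(nodes: n, fullMesh: false)) ports per node")
}
```

Full-mesh cabling grows quadratically with node count while a ring grows linearly, which is one practical reason to reserve the full mesh for latency-sensitive tensor parallelism.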