Mixture-of-Experts Routing at Trillion-Parameter Scale
How frontier labs partition expert layers across GPU clusters — activation sparsity, load balancing, and the hidden cost of all-to-all communication.
Intelligence
In-depth analysis on the systems, architectures, and infrastructure shaping the future of artificial intelligence.
A deep look at next-gen GPU interconnect bandwidth, rack-scale NVSwitch layouts, and what it means for 100k+ token context windows.
From quantized MoE checkpoints to community-maintained inference stacks — why the weights themselves are becoming the platform.
Sliding window, linear attention, state-space hybrids — mapping the architectural primitives that define 2026 sequence modeling.
Power density, CDU design, and why thermal engineering is now a first-class constraint in large-scale training clusters.
Draft models, acceptance rates, and the systems-level tradeoffs between throughput and time-to-first-token.