Zai Replaces Network Architecture for GLM-5.1 Inference with ZCube, Boosting Throughput and Cutting Costs

by /u/Scared-Biscotti2287

Adapt your inference deployments to ZCube’s flattened bipartite topology and track P99 first-token latency, which fell 40.6% in Zai’s deployment.

What to do now

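If you integrate ZCube’s flattened topology, validate the gains on your own workload: measure GPU inference throughput and P99 first-token latency before and after the change, keeping the GPUs, software stack, and model fixed so that the network topology is the only variable.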
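A minimal sketch of that before/after check, assuming you already collect per-request first-token latencies (in milliseconds) and an aggregate throughput figure for each run. The function names and the nearest-rank percentile method are illustrative choices, not something described in the post.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def relative_change(before, after):
    """Fractional change from before to after; a negative value is a reduction."""
    return (after - before) / before


def compare_runs(ttft_before_ms, ttft_after_ms, tput_before, tput_after):
    """Summarize P99 first-token latency and throughput changes between two runs."""
    p99_before = percentile(ttft_before_ms, 99)
    p99_after = percentile(ttft_after_ms, 99)
    return {
        "p99_ttft_before_ms": p99_before,
        "p99_ttft_after_ms": p99_after,
        "p99_ttft_change": relative_change(p99_before, p99_after),
        "throughput_change": relative_change(tput_before, tput_after),
    }
```

A result with `p99_ttft_change` near -0.406 and `throughput_change` near 0.15 would match the figures reported for Zai's cluster. Both runs should replay the same request mix, otherwise the comparison says little about the network.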

Summary

Zai has replaced the network architecture on a thousand‑GPU cluster that runs GLM‑5.1 coding inference, swapping the standard ROFT setup for a new design called ZCube, developed with Tsinghua University and HarnetsAI. Switch and optical module costs dropped 33%, GPU inference throughput rose 15%, and P99 tail latency on the first token fell 40.6%. The GPUs, software stack, and model are unchanged; only the network topology is different.

The problem ZCube addresses is the highly asymmetric traffic produced by KV Cache transfers in Prefill‑Decode disaggregated inference. ROFT’s static rail mapping could not handle that pattern, which led to hotspots and PFC backpressure. ZCube eliminates the spine layer entirely and uses a fully flattened bipartite interconnect between two switch groups, removing a whole class of congestion that ROFT cannot avoid by design. The result is notable because better network performance normally requires more expensive hardware; here, hardware savings and performance gains arrived together.
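To make the "flattened bipartite interconnect" idea concrete, here is a toy model. It assumes every switch in one group links directly to every switch in the other group (a complete bipartite graph); the post does not give ZCube's actual wiring, port counts, or routing, so the group sizes and names below are purely illustrative.

```python
from collections import deque
from itertools import product


def bipartite_fabric(group_a, group_b):
    """Adjacency map in which each switch in group_a links to each switch in group_b.

    Assumption: a full mesh between the two groups, with no links inside a group
    and no separate spine tier.
    """
    adj = {switch: set() for switch in (*group_a, *group_b)}
    for a, b in product(group_a, group_b):
        adj[a].add(b)
        adj[b].add(a)
    return adj


def hops(adj, src, dst):
    """Breadth-first search: number of switch-to-switch links on a shortest path."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError(f"{dst} is unreachable from {src}")


# Hypothetical sizes: four switches per group.
group_a = [f"a{i}" for i in range(4)]
group_b = [f"b{i}" for i in range(4)]
fabric = bipartite_fabric(group_a, group_b)
link_count = sum(len(peers) for peers in fabric.values()) // 2
```

In this toy fabric, switches in opposite groups are one link apart and switches in the same group are two links apart, with several equal-length paths to choose from. It says nothing about how ZCube places GPUs or balances KV Cache traffic across those paths.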

Key changes

Switch and optical module costs down 33%
GPU inference throughput up 15%
P99 tail latency on first token dropped 40.6%
ZCube removes the spine layer and uses a fully flattened bipartite interconnect between two switch groups
KV Cache transfer traffic asymmetry mitigated
Same GPUs, same software stack, same model
ROFT’s static rail mapping is a poor fit for Prefill‑Decode disaggregation
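The cost and throughput figures above can be combined into one back-of-envelope number. The sketch below covers only switch and optical module cost per unit of inference throughput; this ratio is derived here from the two reported percentages and is not a figure stated in the post. GPU, power, and other costs are ignored.

```python
def cost_per_throughput_change(cost_change, throughput_change):
    """Fractional change in cost per unit of throughput.

    Both inputs are fractional changes, e.g. -0.33 for a 33% drop.
    """
    return (1 + cost_change) / (1 + throughput_change) - 1


network_cost_change = -0.33   # switch and optical module costs down 33%
throughput_change = 0.15      # GPU inference throughput up 15%

ratio_change = cost_per_throughput_change(network_cost_change, throughput_change)
print(f"{ratio_change:.1%}")  # prints -41.7%
```

That is 0.67 / 1.15 ≈ 0.583 of the original network spend per unit of throughput, under the assumption that both percentages were measured on the same cluster and workload.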

Affects

internal
