China built a giant AI supercomputer without Nvidia GPUs, using millions of Huawei Arm cores instead

The Huawei-connected LineShine supercomputer crams 12.45 million Arm cores into one huge AI cluster
Huawei’s processors power one of China’s largest AI computing installations today
CPU-only supercomputers eliminate expensive data transfers between processors and accelerators during workloads

China has deployed a massive CPU-only supercomputer called LineShine that delivers 1.54 exaflops of AI training performance without using any GPUs at all.

The system packs 20,480 compute nodes, each containing two LX2 processors, for a total of 40,960 chips across the entire machine.

Each LX2 processor has 304 CPU cores, meaning the entire supercomputer uses approximately 12.45 million Armv9 cores in total (40,960 chips × 304 cores).

Inside the unusual architecture of the LX2 processor

The processor was developed by Huawei or through a joint design with China’s National Supercomputing Center, although the exact origin remains undisclosed.

Each LX2 processor uses two computing chiplets with cores organized into eight clusters containing 38 cores per cluster.
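As a quick cross-check, the core counts can be reproduced from the node, chip, and cluster figures quoted in the article. This is only a back-of-the-envelope sketch; the variable names are illustrative and do not come from any vendor documentation.

```python
# Figures as quoted in the article
NODES = 20_480
CHIPS_PER_NODE = 2
CLUSTERS_PER_CHIP = 8
CORES_PER_CLUSTER = 38

# Derived totals
cores_per_chip = CLUSTERS_PER_CHIP * CORES_PER_CLUSTER
total_chips = NODES * CHIPS_PER_NODE
total_cores = total_chips * cores_per_chip

print(f"cores per chip: {cores_per_chip}")
print(f"total chips:    {total_chips:,}")
print(f"total cores:    {total_cores:,} (~{total_cores / 1e6:.2f} million)")
```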

Each core includes Arm’s Scalable Vector Extension and Scalable Matrix Extension units, which accelerate the matrix operations used in AI training.

The processor delivers 60.3 teraflops of FP64 performance, 240 teraflops of BF16 throughput and 960 teraflops of INT8 performance from a single chip.
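To put the per-chip numbers in context, the sketch below scales them to the full 40,960-chip machine. The results are theoretical peaks obtained by simple multiplication, assuming every chip runs at its quoted rate simultaneously. They are not directly comparable to the 1.54 exaflops of delivered AI training performance, since the article does not state the precision or benchmark behind that figure.

```python
TOTAL_CHIPS = 40_960  # 20,480 nodes x 2 LX2 processors

# Per-chip peak throughput quoted in the article, in tera-operations per second
PEAK_TERA_PER_CHIP = {"FP64": 60.3, "BF16": 240.0, "INT8": 960.0}


def aggregate_peak_exa(tera_per_chip, chips=TOTAL_CHIPS):
    """Scale a per-chip peak (tera-ops/s) to a machine-wide peak (exa-ops/s)."""
    return tera_per_chip * chips / 1e6  # 1 exa = 1e6 tera


aggregate = {fmt: aggregate_peak_exa(rate) for fmt, rate in PEAK_TERA_PER_CHIP.items()}

for fmt, exa in aggregate.items():
    print(f"{fmt}: {exa:.2f} exa-ops/s theoretical peak")
```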

The memory subsystem combines 32 GB of on-package HBM, which delivers up to 4 TB/s of bandwidth, with up to 256 GB of off-package DDR5 memory.
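The memory figures can be scaled up the same way. The sketch below assumes every chip is populated with the maximum 256 GB of DDR5, which the article gives only as an upper bound, and uses decimal units (1 PB = 1,000,000 GB). The bytes-per-flop ratio is a rough indicator of how much HBM bandwidth is available per FP64 operation at peak.

```python
TOTAL_CHIPS = 40_960   # 20,480 nodes x 2 LX2 processors
HBM_GB = 32            # on-package HBM per chip
DDR5_MAX_GB = 256      # off-package DDR5 per chip (assumed fully populated)
HBM_TB_PER_S = 4.0     # peak HBM bandwidth per chip
FP64_TFLOPS = 60.3     # peak FP64 throughput per chip

per_chip_gb = HBM_GB + DDR5_MAX_GB
total_hbm_pb = TOTAL_CHIPS * HBM_GB / 1e6
total_ddr5_pb = TOTAL_CHIPS * DDR5_MAX_GB / 1e6

# TB/s divided by TFLOP/s gives bytes per floating-point operation
hbm_bytes_per_flop = HBM_TB_PER_S / FP64_TFLOPS

print(f"memory per chip:        {per_chip_gb} GB")
print(f"machine-wide HBM:       {total_hbm_pb:.2f} PB")
print(f"machine-wide DDR5 max:  {total_ddr5_pb:.2f} PB")
print(f"HBM bytes per FP64 op:  {hbm_bytes_per_flop:.3f}")
```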

CPU-only systems offer several advantages for complex scientific tasks that combine AI training with massive data ingestion and preprocessing.

Since everything runs on the same processor and memory space, they avoid expensive and bandwidth-intensive CPU-to-GPU data transfers.
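A rough illustration of that transfer cost follows. The 64 GB/s host-to-accelerator link is an assumption chosen for illustration (roughly one direction of a PCIe 5.0 x16 link), not a figure from the article; the 4 TB/s HBM bandwidth is the LX2 number quoted above. The comparison ignores latency, overlap of copies with compute, and faster proprietary interconnects, so treat it as an order-of-magnitude sketch only.

```python
def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Idealized time to move size_gb at a sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s


WORKING_SET_GB = 32          # e.g. one chip's worth of HBM
ASSUMED_LINK_GB_PER_S = 64   # hypothetical CPU-to-GPU link (assumption)
HBM_GB_PER_S = 4000          # 4 TB/s on-package HBM, per the article

# Extra copy a CPU+GPU system pays before the accelerator can start
copy_time = transfer_seconds(WORKING_SET_GB, ASSUMED_LINK_GB_PER_S)

# Time for a CPU-only chip to stream the same data from its own HBM
read_time = transfer_seconds(WORKING_SET_GB, HBM_GB_PER_S)

print(f"copy over assumed link: {copy_time * 1000:.0f} ms")
print(f"stream from local HBM:  {read_time * 1000:.0f} ms")
```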

Homogeneous CPU-based systems can also expose much larger contiguous memory pools by combining HBM with large DDR capacities.

This is useful for handling massive scientific data sets, retrieval-augmented generation, and long context windows that limited GPU memory cannot easily accommodate.

The big caveat that comes with this approach

CPU-only systems are usually less power efficient and deliver lower-density AI throughput than GPU-based supercomputers.

This is the main reason why most of the industry is betting on heterogeneous CPU plus GPU architectures for large-scale AI workloads.

China is pursuing this path mainly because of US bans on GPU exports, not because CPU-only systems are technically superior for AI tasks.

LineShine shows that CPUs can successfully handle jobs normally given to GPUs, but the efficiency gap between the two approaches remains significant and is unlikely to close anytime soon.

China is making a strategic trade-off, accepting lower performance and higher power consumption in exchange for independence from foreign hardware and software ecosystems such as Nvidia’s GPUs and CUDA.

Whether that trade-off makes sense for long-term AI development depends entirely on how quickly Chinese manufacturers can close the performance gap with their own GPU designs.

Until then, LineShine will remain a remarkable technical achievement and a practical necessity, but probably not a blueprint for how most of the world will build AI supercomputers.

Via Tom’s Hardware
