- A new approach called DualPipe appears to be key to DeepSeek's success
- One commentator describes it as a virtual DPU on the GPU that maximizes bandwidth efficiency
- While DeepSeek has only used Nvidia GPUs, one wonders how AMD's Instinct accelerators would fare
China's DeepSeek AI chatbot has stunned the tech industry, presenting a credible alternative to OpenAI's ChatGPT at a fraction of the cost.
A recent paper revealed that DeepSeek-V3 was trained on a cluster of 2,048 Nvidia H800 GPUs – deliberately hobbled versions of the H100 (we can only imagine how much faster it would run on AMD Instinct accelerators!). Training reportedly required 2.79 million GPU hours for pretraining and fine-tuning on 14.8 trillion tokens, which, according to calculations made by The Next Platform, cost only $5.58 million.
But exactly how DeepSeek's developers managed this feat likely comes down to a smart hack.
A virtual DPU on the GPU itself
First, a little background. DeepSeek is an advanced mixture-of-experts (MoE) language model designed to optimize performance by selectively activating only the most relevant parts of its architecture for each task. The third version of the model, DeepSeek-V3, contains 671 billion parameters in total, with only 37 billion activated for any given token prediction. This selective activation massively reduces compute costs while maintaining high performance and accuracy – as you will see if you try it.
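The idea behind that selective activation can be sketched with a toy top-k gating function. This is only an illustration of the general MoE routing pattern; the expert count and k value below are made up and are not DeepSeek-V3's actual configuration:

```python
import numpy as np

def topk_gating(token_scores, k=8):
    """Pick the top-k experts for one token and normalize their weights.

    Toy sketch of MoE selective activation: only k experts out of the
    full set do any work for this token; the rest stay idle."""
    topk = np.argsort(token_scores)[-k:]      # indices of the k highest-scoring experts
    weights = np.exp(token_scores[topk])      # softmax over the selected experts only
    weights /= weights.sum()
    return topk, weights

# One token is routed to 8 of 256 (hypothetical) experts.
scores = np.random.randn(256)
experts, weights = topk_gating(scores, k=8)
print(len(experts), round(float(weights.sum()), 6))  # 8 1.0
```

With routing like this, the cost per token scales with k, not with the total number of experts, which is why a 671B-parameter model can run a token through only 37B of them.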
It is easy to be skeptical of DeepSeek and the claims made about its training, but the paper reveals some of the magic the developers came up with to make the most of the hobbled hardware they had to work with. This includes the creation of the DualPipe algorithm for efficient pipeline parallelism.
According to the information published by DeepSeek, DualPipe overlaps forward and backward computation, reducing latency and optimizing data movement across GPUs. By managing communication efficiently, it minimizes idle time (pipeline bubbles) and dynamically balances GPU compute cores (streaming multiprocessors, or SMs) between computation and communication, preventing data-transfer bottlenecks as the model scales.
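The payoff of overlapping computation with communication can be shown with a toy cost model. The numbers and the function below are invented for illustration and are not DualPipe's actual schedule; they only show why hiding communication behind compute shrinks the total time:

```python
def pipeline_time(micro_batches, compute=2, comm=1, overlap=False):
    """Toy per-stage cost model (made-up units, not DualPipe's real schedule).

    Without overlap, each micro-batch pays compute + communication in
    sequence. With overlap, communication for micro-batch i runs
    concurrently with computation for micro-batch i+1, so only the
    final transfer is left exposed."""
    if overlap:
        return micro_batches * max(compute, comm) + min(compute, comm)
    return micro_batches * (compute + comm)

print(pipeline_time(8, overlap=False))  # 24
print(pipeline_time(8, overlap=True))   # 17
```

The more micro-batches in flight, the closer the overlapped schedule gets to pure compute time, which is the same intuition behind DualPipe's bubble reduction.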
A commentator on The Next Platform describes DualPipe as “essentially creating a virtual DPU on the GPU itself to handle all-to-all communication”, highlighting its role in optimizing data transmission efficiency.
The paper goes into further detail: “In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster.”
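The dispatch and combine steps the paper refers to can be sketched in miniature. This single-process toy only shows the data movement pattern; DeepSeek's actual kernels do this across nodes on the GPU, co-designed with their gating algorithm and network topology:

```python
import numpy as np

def dispatch(tokens, expert_of_token, n_experts):
    """Group tokens by destination expert - the 'dispatch' half of the
    all-to-all (here a local sketch; real kernels ship these buffers
    across the cluster)."""
    return [tokens[expert_of_token == e] for e in range(n_experts)]

def combine(outputs, expert_of_token, n_tokens):
    """Scatter each expert's outputs back into original token order -
    the 'combine' half of the all-to-all."""
    result = np.empty(n_tokens)
    for e, out in enumerate(outputs):
        result[expert_of_token == e] = out
    return result

tokens = np.arange(6, dtype=float)
routing = np.array([0, 1, 0, 2, 1, 2])    # which expert each token is routed to
buckets = dispatch(tokens, routing, 3)
processed = [b * 10 for b in buckets]     # stand-in for each expert's computation
out = combine(processed, routing, len(tokens))
print(out)  # [ 0. 10. 20. 30. 40. 50.]
```

Every token comes back in its original position after visiting its expert, which is exactly the round trip the custom kernels accelerate while tying up as few SMs as possible.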