- DeepSeek’s Engram separates static memory from computation, increasing the efficiency of large AI models
- The method reduces high-speed memory requirements by letting DeepSeek models retrieve static knowledge through lookups
- Engram supports asynchronous prefetching across multiple GPUs with minimal performance overhead
DeepSeek, in collaboration with Peking University, introduced a new training method called Engram, designed to decouple memory storage from computational processes.
Traditional large language models require high-bandwidth memory for knowledge retrieval and basic computation, creating a bottleneck in both performance and cost.
This HBM bottleneck is widely seen as a key reason DRAM prices rose fivefold in just 10 weeks, as demand for hardware to support large AI models surged.
Validation and technical approach
The researchers said that existing models waste sequential depth on trivial operations that could otherwise support higher-level reasoning.
Engram allows models to efficiently “look up” essential information without overloading GPU memory, freeing up capacity for more complex reasoning tasks.
The system was tested on a model with 27 billion parameters and showed measurable improvements across standard industry benchmarks.
By performing knowledge retrieval through hashed N-grams, Engram provides static memory access independent of the current context.
The retrieved information is then adjusted using a context-aware gating mechanism to adapt to the hidden state of the model.
This design enables models to handle long-context inputs more efficiently and supports system-level prefetching with minimal performance overhead.
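For a rough sense of how such a mechanism might look in code, the sketch below combines a hashed N-gram lookup table with a sigmoid gate driven by the model's hidden state. The class, slot counts, and hashing scheme are illustrative assumptions, not DeepSeek's published implementation.

```python
# Minimal sketch, assuming a hashed N-gram table gated by the hidden state.
# Names (HashedNgramMemory, num_slots, etc.) are assumptions for illustration.
import torch
import torch.nn as nn

class HashedNgramMemory(nn.Module):
    def __init__(self, num_slots: int, dim: int, ngram: int = 2):
        super().__init__()
        self.ngram = ngram
        self.num_slots = num_slots
        self.table = nn.Embedding(num_slots, dim)   # static memory slots
        self.gate = nn.Linear(dim, dim)             # context-aware gate

    def hash_ngrams(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Deterministic hash of each trailing N-gram into a slot index.
        # token_ids: (batch, seq_len) integer tensor.
        idx = token_ids.clone()
        for k in range(1, self.ngram):
            prev = torch.roll(token_ids, shifts=k, dims=1)  # previous tokens
            idx = idx * 1000003 + prev                      # simple rolling hash
        return idx.remainder(self.num_slots)

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # The lookup itself is independent of context; the gate adapts the
        # retrieved rows to the model's current hidden state.
        slots = self.hash_ngrams(token_ids)         # (batch, seq_len)
        mem = self.table(slots)                     # (batch, seq_len, dim)
        g = torch.sigmoid(self.gate(hidden))        # (batch, seq_len, dim)
        return hidden + g * mem                     # gated injection

# Toy usage with illustrative shapes
mem = HashedNgramMemory(num_slots=1 << 16, dim=64)
tokens = torch.randint(0, 32000, (2, 16))
hidden = torch.randn(2, 16, 64)
out = mem(tokens, hidden)                           # (2, 16, 64)
```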
The Engram method complements other hardware-efficient approaches, including solutions such as Phison’s AI inference accelerators.
Engram minimizes the amount of high-speed memory required by using lookups for static information, making memory usage more efficient.
Phison offers a cost-effective way to expand total memory capacity using SSDs, which can support large AI models built on approaches such as Engram or Mixture-of-Experts.
Combined, these approaches allow AI systems to optimize the use of fast memory while increasing overall memory capacity at an affordable cost.
It also works with new Compute Express Link (CXL) standards, which aim to overcome GPU memory bottlenecks in large AI workloads.
The method separates static pattern storage from dynamic computation, improving the transformer backbone without increasing FLOPs or parameter counts.
DeepSeek formalized a U-shaped expansion rule to optimize the assignment of parameters between the MoE conditional computation module and the Engram memory module.
Tests show that reallocating about 20-25% of the sparse parameter budget to the Engram module outperforms pure MoE models, with stable gains maintained across different scales.
Memory slot expansion provides predictable improvements without additional computational overhead.
This confirms the scalability of conditional memory as an independent axis for sparse models.
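As a back-of-the-envelope illustration of that split, the snippet below applies the reported 20-25% range to a sparse budget; the 27 billion figure is borrowed from the tested model size and is only an assumption about the budget itself.

```python
# Rough illustration only: the sparse budget and split are assumptions,
# not DeepSeek's published configuration.
sparse_budget = 27e9          # example sparse parameter budget
engram_share = 0.25           # upper end of the reported 20-25% range

engram_params = sparse_budget * engram_share
moe_params = sparse_budget - engram_params
print(f"Engram memory: {engram_params / 1e9:.1f}B parameters")
print(f"MoE experts:   {moe_params / 1e9:.1f}B parameters")
```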
Engram’s deterministic retrieval mechanism allows memory capacity to scale linearly across multiple GPUs while supporting asynchronous prefetching during inference.
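Because the slot indices depend only on the input tokens, they can be computed before a layer runs and the required rows copied to the GPU in the background. The function below is a hedged sketch of that idea; the names and the pinned host-memory table are assumptions.

```python
# Sketch, assuming slot indices are deterministic in the tokens and the full
# table lives in pinned host memory so copies can overlap with compute.
import torch

def prefetch_engram_rows(token_ids, hash_fn, host_table, device, stream):
    # Indices are known before the forward pass reaches the memory layer.
    slots = hash_fn(token_ids).unique()
    rows = host_table[slots]                    # gather only the rows needed
    with torch.cuda.stream(stream):             # overlap the copy with compute
        return slots, rows.to(device, non_blocking=True)
```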
It offloads static knowledge reconstruction from the lower layers, freeing the attention mechanism to focus on global context.
Hierarchical caching of frequently used embeddings increases efficiency, and the module works with existing GPU and system memory architectures, potentially avoiding expensive HBM upgrades.
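A hierarchical cache of this kind could be as simple as keeping the hottest memory rows on the GPU and falling back to a full table in system memory. The toy class below sketches that pattern; the names are assumptions, and a real system would batch these lookups rather than fetch rows one at a time.

```python
# Toy two-tier cache: hot rows stay in a small GPU-resident store, everything
# else lives in host DRAM (or further tiers such as CXL memory or SSDs).
from collections import OrderedDict
import torch

class TwoTierEmbeddingCache:
    def __init__(self, host_table: torch.Tensor, gpu_capacity: int, device="cuda"):
        self.host_table = host_table     # full table in system memory
        self.capacity = gpu_capacity     # number of rows kept on the GPU
        self.device = device
        self.hot = OrderedDict()         # slot id -> GPU row, in LRU order

    def get(self, slot: int) -> torch.Tensor:
        if slot in self.hot:
            self.hot.move_to_end(slot)   # refresh LRU position on a hit
            return self.hot[slot]
        row = self.host_table[slot].to(self.device)  # miss: fetch from host
        self.hot[slot] = row
        if len(self.hot) > self.capacity:
            self.hot.popitem(last=False) # evict the least-recently-used row
        return row
```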
This technique could ease the pressure on expensive memory hardware, especially in regions like China, where access to HBM from leading suppliers such as Samsung, SK Hynix and Micron is limited.
Early validation of Engram suggests that models can expand parameter scale and reasoning capacity while handling memory requirements more efficiently.
This approach can help ease memory constraints across AI infrastructure, potentially reducing sharp DDR5 DRAM price swings.
Via SCMP