This tiny AMD PC just ran a massive 397B AI model that required a server room full of GPUs a year ago

AMD’s Ryzen AI Halo recently went on sale for $4,000, sparking an interesting debate about how it compares to Nvidia’s slightly more expensive DGX Spark offering.

However, the configuration offered by the Ryzen AI Halo has been on the market for a few months now, and while most OEMs and enterprise providers offer the same flavor and configuration, Shenzhen-based memory and storage company Longsys has taken things a step further.

The storage giant demonstrated a 397B-parameter AI model running locally on its own version of the Ryzen AI Halo, with the same 16-core Ryzen AI Max+ 395 and 128GB RAM configuration.

How was the Ryzen AI Max+ 395 able to run such a massive model with only 128GB of RAM?

Although Longsys did not explicitly name the model, it appears to be a customized version derived from Alibaba’s Qwen 3.5 397B (A17B), a multi-modal foundation model that uses a Mixture-of-Experts (MoE) approach, the same design that made the original DeepSeek such a potent contender.

Even with INT4 quantization, the model’s memory requirements far exceed what the demo device has to offer: only 96GB of the 128GB total is available to the GPU as VRAM, against an estimated 200-250GB needed for the model to run.
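The gap is easy to check with back-of-the-envelope arithmetic. The sketch below is an estimate, not a figure from Longsys: it counts weights only (no KV cache, activations, or quantization metadata, which is why real-world estimates land higher), and it assumes "A17B" means roughly 17B parameters are active per token, as in Qwen's naming convention. The function name is made up for illustration.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Assumption: 4 bits per weight (INT4), no overhead for scales or zero points.
total_gb = weight_footprint_gb(397, 4)   # every expert in the model
active_gb = weight_footprint_gb(17, 4)   # experts touched per token (assumed A17B = 17B)
gpu_budget_gb = 96                       # VRAM available in the 128GB configuration

print(f"All weights:    {total_gb:.1f} GB")
print(f"Active weights: {active_gb:.1f} GB")
print(f"Fits entirely in VRAM: {total_gb <= gpu_budget_gb}")
```

Under these assumptions the full set of weights is roughly twice the available VRAM, while the portion needed for any single token is a small fraction of it. That asymmetry is what makes keeping idle experts somewhere other than DRAM plausible.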

The secret sauce is Longsys’ recently unveiled custom SPU and iSA configuration that allows for real-time data compression, a feat the company says allows it to fit as much as twice the amount of data in storage drives up to 128GB, leveraging a caching layer that reduces DRAM requirements considerably.

The approach involves offloading experts that are not in active use to a large, fast storage buffer, from which the AI chip can reload them when necessary.

In a press release, Longsys claimed that its approach works by targeting the “pain points of MoE LLMs”, such as large parameter counts, rapid KV cache expansion, and I/O latency that hamper inference efficiency.

“It leverages expert offloading, intelligent cache management, and predictive prefetch algorithms to effectively solve storage planning challenges and greatly improve local AI inference smoothness,” the company added.
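Longsys has not published how its cache management works, so the following is only a generic illustration of the expert-offloading idea, not the company's implementation. It keeps a fixed number of experts in fast memory and evicts the least recently used one when a new expert has to be pulled from storage. The `ExpertCache` class, the `fake_storage_read` loader, and the sequence of expert IDs are all hypothetical; a real system would add the predictive prefetching the company mentions.

```python
from collections import OrderedDict

class ExpertCache:
    """Keeps at most `capacity` experts in fast memory; the rest stay on storage."""

    def __init__(self, capacity, load_from_storage):
        self.capacity = capacity
        self.load_from_storage = load_from_storage  # hypothetical slow-path loader
        self.resident = OrderedDict()               # expert_id -> weights, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            # Already in memory: mark as most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id]
        # Not in memory: read from storage, then evict the oldest entry if over capacity.
        self.misses += 1
        weights = self.load_from_storage(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)
        return weights

def fake_storage_read(expert_id):
    # Stand-in for an SSD read; returns a placeholder instead of real weights.
    return f"weights-for-expert-{expert_id}"

cache = ExpertCache(capacity=2, load_from_storage=fake_storage_read)
for expert_id in [0, 1, 0, 2, 0, 1]:   # made-up routing decisions
    cache.get(expert_id)

print(f"hits={cache.hits} misses={cache.misses} resident={list(cache.resident)}")
```

Every miss in this sketch stands for a storage read, which is where the I/O latency cited in the press release comes from. The higher the hit rate, the less often inference stalls waiting on the drive.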

It’s important to note that while the move itself is an impressive feat, Longsys didn’t provide details on computing power in terms of tokens per second, with the Ryzen AI chip being relatively limited compared to most modern AI GPU offerings.

Regardless, the approach that essentially treats storage as memory suggests that localized AI may be able to run significantly larger models, and that memory may not be such a severe limitation for certain approaches.

In other words, memory limitations can be circumvented by leveraging fast storage, allowing a device to run a model at a scale that would otherwise require tens of thousands of dollars in AI hardware, which is no small feat. Models previously limited to data centers can now run on a device that fits in the palm of your hand.
