AMD Zen 5 Architecture Reveal: A Ryzen 9000 And Ryzen AI 300 Deep Dive


A Closer Look At RDNA 3.5 Graphics, The XDNA 2 AI Engine And Final Thoughts

[Image: AMD Ryzen AI 300 chip]
As we've mentioned previously, although they both leverage Zen 5, Granite Ridge-based Ryzen 9000 series desktop processors and Strix Point-based Ryzen AI 300 series mobile processors are built from completely different slices of silicon. Strix Point features RDNA 3.5-based on-processor graphics and an XDNA 2-based NPU for AI workloads, and is manufactured using TSMC's 4nm process node. RDNA 3.5 and XDNA 2 do not appear on the first wave of Ryzen 9000 series desktop processors, but MAY be used on as-yet-unannounced future desktop APUs. But we digress...

RDNA 3.5 Graphics Architecture

[Slide: RDNA 3.5 summary]

RDNA 3.5 is not a complete departure from the existing RDNA 3 architecture used in current-gen Ryzen and Radeon products. It has, however, received a number of updates and enhancements designed to optimize power, efficiency, and performance. AMD claims RDNA 3.5 is optimized for performance per watt and performance per bit, and that it should ultimately offer better battery life for mobile users.

AMD said much of what it learned working with Samsung on low-power smartphone GPUs for Exynos processors made its way into RDNA 3.5.

[Slide: RDNA 3.5 architecture overview]

So, how'd AMD do it? Well, first off, RDNA 3.5 offers 2x the texture sample rate of RDNA 3. A subset of the most common texture sampling operations now runs at double the data rate of the previous generation. Most of the interpolation and comparison rates have been doubled for common shader operations as well. RDNA 3.5 also features improvements to primitive batch processing, and it employs better compression techniques to reduce memory footprint. In the end, the updates to RDNA 3.5's memory management reduce memory accesses, make better use of available memory capacity and bandwidth, and reduce the overall workload. The memory controller has also been optimized for LPDDR5 access.

[Slide: RDNA 3.5 efficiency]

All told, AMD is claiming up to 32% higher performance per watt for RDNA 3.5 vs. RDNA 3, with 19-32% better realized performance at iso power. 3DMark Time Spy and Night Raid scores for Strix Point vs. Hawk Point show the kind of performance advantages Strix Point should offer when both are configured at similar 15W power levels. AMD notes that the efficiency -- and ultimately battery life -- improvements offered by RDNA 3.5 come by way of the aforementioned architectural enhancements, in addition to manufacturing advancements and power profile optimizations.

XDNA 2 NPU Architecture Details

[Slide: XDNA 2 efficiency]

Whether consumers want it or not, AI is the buzzword du jour, and it is likely here to stay. At this early stage, however, many companies are making different bets on how best to process AI workloads. NPUs, which complement CPU and GPU cores, have been integrated into mobile processors from AMD, Intel, and Qualcomm, but each company has taken a somewhat different approach.

The common thread among the three is that NPUs are vastly more efficient than CPUs and GPUs for some tasks, which makes them ideally suited to longer-running, sustained workloads that often run alongside traditional compute tasks. This includes things like blurring backgrounds or performing real-time translation during a video call. In AMD's testing, its latest XDNA 2 NPU is up to 35x more efficient than its CPU cores, and up to 8x more efficient than its iGPU.

[Slide: XDNA 2 architecture column layout]

AMD's XDNA 2 NPUs employ what the company calls a spatial data flow architecture. Unlike today's CPUs, where data is fetched and stored in multiple layers of cache memory before any processing is handled by the cores, XDNA 2 is a cacheless architecture with dedicated memory resources attached to multiple compute tiles. Data can be read out of the NPU's memory once and multi-cast out through the engine, with "many terabytes" of north-south and east-west bandwidth.
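To make that distinction concrete, here's a minimal sketch of the read-once, multicast idea using a matrix-vector product as a stand-in workload. The row-wise partitioning and the structure of the code are our own illustrative assumptions, not AMD's actual design:

```python
import numpy as np

# Toy model of a spatial dataflow matvec: the input vector is read from
# NPU memory once and multi-cast to every tile, while each tile keeps
# its own slice of the weights in dedicated local memory (no shared
# cache hierarchy to fetch through). Illustrative only.

NUM_TILES = 32  # XDNA 2's tile count, per AMD

def spatial_matvec(weights, x):
    # Partition the weight matrix's rows across tiles; each slice stays
    # resident in that tile's local memory.
    tile_slices = np.array_split(weights, NUM_TILES, axis=0)
    # "Multicast": x is read once and sent to all tiles over the
    # interconnect, rather than fetched separately by each one.
    partials = [w_tile @ x for w_tile in tile_slices]
    return np.concatenate(partials)

w = np.random.randn(256, 128).astype(np.float32)
x = np.random.randn(128).astype(np.float32)
assert np.allclose(spatial_matvec(w, x), w @ x, atol=1e-4)
```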

[Slide: XDNA 2 spatial architecture]

XDNA 2 NPUs feature tiled resources linked via a flexible programmable interconnect. The design allows AMD to dynamically partition the NPU at runtime for optimal multi-tasking and to shift resources as necessary based on the workload. If you think of the design as featuring eight 4-tile columns, each column can be partitioned individually or grouped together to best handle multiple tasks. 

Should the NPU be required for, say, video effects, audio processing, or a generative AI workload, the NPU's resources can be divvied up as necessary to provide a good experience. We should also note that the NPU can be partitioned temporally as well, which means multiple applications or models can be assigned time slices of the entire NPU's resources and run concurrently.

The NPU's partitioning also allows for efficient power usage. The NPU features only a single power domain, but power is distributed per column and can be gated as necessary, depending on the workload and utilization at the time.
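To illustrate how spatial partitioning, time slicing, and per-column power gating could fit together, here's a toy scheduler sketch. The eight-column layout comes from AMD's description above, but the allocation policy and all of the names are hypothetical:

```python
# Toy scheduler for the column-partitioned design described above,
# assuming eight 4-tile columns. Hypothetical, purely illustrative.

NUM_COLUMNS = 8

def assign_columns(requests):
    """requests: dict mapping workload name -> number of columns wanted.
    Returns the spatial assignment plus the idle columns, which a real
    NPU could power-gate since power is distributed per column."""
    assignment, next_col = {}, 0
    for name, wanted in requests.items():
        granted = list(range(next_col, min(next_col + wanted, NUM_COLUMNS)))
        assignment[name] = granted
        next_col += len(granted)
    gated = list(range(next_col, NUM_COLUMNS))
    return assignment, gated

# Two concurrent tasks share the array spatially; if their combined
# demand exceeded eight columns, a scheduler could instead time-slice
# the whole array between them, as described above.
assignment, gated = assign_columns({"video_effects": 2, "llm_assistant": 4})
print(assignment)             # {'video_effects': [0, 1], 'llm_assistant': [2, 3, 4, 5]}
print("power-gated:", gated)  # [6, 7]
```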

[Slide: XDNA generational comparison]

In terms of the actual implementation of the XDNA 2 NPU on Strix Point, it's larger, has more resources, supports more data types, and features architectural improvements that enhance performance and improve efficiency by up to 2x versus XDNA 1. In fact, XDNA 2 offers 5x the TOPS of XDNA 1, with less than 2x the number of tiles (32 vs. 20). XDNA 2 features 2x the compute resources per tile, 1.6x the amount of on-chip memory, support for Block Floating Point data types, and improved support for non-linear functions as well.
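Some quick math on those figures: 1.6x the tiles multiplied by 2x the compute per tile works out to roughly a 3.2x raw uplift, so the remaining ~1.55x needed to reach 5x the TOPS presumably comes from clock speed and the other architectural improvements AMD cites, though the company didn't provide an exact breakdown.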

[Slide: XDNA 2 Block FP16 support]

Another feature of XDNA 2 that AMD has already talked about is support for Block FP16, among other data types. For applications and models that support 16-bit data types, XDNA 2 can offer the same performance its competitors deliver with INT8, but with the precision of FP16 and without the quantization step typically required. In other words, it should be easier to set up and deploy existing FP16 AI workloads on Ryzen AI 300 without sacrificing performance or precision.
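To show why this matters, here's a minimal sketch of the general block floating point idea: values in a block share one power-of-two exponent and each keeps a small signed mantissa, so storage and multiply costs look like INT8 while precision tracks each block's dynamic range. AMD hasn't published Block FP16's exact bit layout, so the block size and mantissa width below are assumptions:

```python
import numpy as np

def quantize_block_fp(values, block_size=8, mantissa_bits=8):
    """Quantize to block floating point: each block of values shares one
    power-of-two exponent, and each value keeps a signed mantissa."""
    max_mant = 2 ** (mantissa_bits - 1) - 1  # e.g. 127 for 8-bit mantissas
    out = np.empty_like(values, dtype=np.float32)
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_mag = float(np.max(np.abs(block)))
        if max_mag == 0.0:
            out[start:start + block_size] = 0.0
            continue
        # Shared exponent: smallest power-of-two scale that fits the block
        scale = 2.0 ** int(np.ceil(np.log2(max_mag / max_mant)))
        mantissas = np.clip(np.round(block / scale), -max_mant - 1, max_mant)
        out[start:start + block_size] = mantissas * scale
    return out

# Compare against plain INT8 with a single per-tensor scale. On data with
# a wide dynamic range, the per-block exponent tracks small values far
# better than one global scale can.
rng = np.random.default_rng(42)
x = (rng.standard_normal(64) * np.logspace(-3, 3, 64)).astype(np.float32)
bfp = quantize_block_fp(x)
s = np.abs(x).max() / 127.0
int8 = np.clip(np.round(x / s), -127, 127) * s
rel = lambda q: float(np.mean(np.abs(q - x) / (np.abs(x) + 1e-12)))
print(f"block-fp mean relative error: {rel(bfp):.4f}")
print(f"int8     mean relative error: {rel(int8):.4f}")
```

Run it and the block floating point version shows a far lower relative error than per-tensor INT8, which is the gap quantization tuning normally has to close by hand.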

[Slide: XDNA 2 TOPS comparison]

To date, NPU 8-bit TOPS is the de facto standard used to describe PC NPU performance, and the Snapdragon X Elite features an NPU capable of 45 INT8 TOPS. Should Block FP16 be used with a given model, it effectively allows AMD's XDNA 2 NPU to offer similar performance to INT8, but with nearly identical precision to FP16. On architectures that don't support Block FP16, using FP16 precision approximately halves performance versus INT8 -- a 45 TOPS INT8 NPU would deliver only about 22.5 TOPS at FP16 -- hence the Snapdragon X Elite landing at the bottom of the AMD-supplied chart above. AMD is also projecting that Lunar Lake's reported 40-45 TOPS NPU won't perform quite as well with Block FP16, but Intel's Lunar Lake likewise supports Block Float, so we're not sure this slide is completely accurate.

[Chart: INT8 NPU TOPS comparison]

This is how today's NPUs stack up in terms of 8-bit TOPS. The Ryzen AI 300 series still offers the highest theoretical peak performance, but there's a lot more to NPU performance than a peak TOPS rating. Software support and optimization are critical to optimally leverage NPU resources, and the software landscape is evolving rapidly as it relates to AI.

Speaking of software support, 40 TOPS is the NPU performance threshold Microsoft set for Copilot+ support. To date, Qualcomm is Microsoft's exclusive Copilot+ partner, but that will likely change in the coming months. AMD claims all Copilot+ models are already up and running on its platform, and that it expects actual support to arrive sometime next year.

[Slide: XDNA roadmap]

AMD expects NPU capabilities and performance to be one of the key areas of future processor development. Next-gen Ryzen AI NPUs will target emergent LLMs and presumably offer support for additional data types and significantly increased performance, but no hard details were given just yet.

As its NPU hardware evolves, so too will AMD's software suite. The company is working on a Ryzen AI platform that will ultimately offer a unified AI software stack targeting all available compute engines on a given platform. That's an ambitious goal, and one that has historically been difficult to achieve, but we'll remain cautiously optimistic this time around. The end game is to make it easy for developers to leverage AMD's platforms across all hardware targets, and the company expects to share early versions of the Ryzen AI software suite with partners before the end of the year.

[Photo: AMD's Jack Huynh with ASUS]

AMD Ryzen 9000 And Ryzen AI 300 Featuring Zen 5: Coming Soon

AMD's next-gen mobile and desktop processors featuring Zen 5 will both be available before the month is out. AMD is touting 100+ new design wins for its Ryzen AI 300 mobile processors, and Ryzen 9000 series desktop processors will drop right into existing socket AM5 motherboards after a BIOS update. All told, both Ryzen 9000 and Ryzen AI 300 appear to offer significant performance and efficiency gains versus their predecessors, and we look forward to taking them for a spin. Their arrival also gives AMD quite a head start versus Intel. Although Lunar Lake is shaping up to be an exciting platform, it won't arrive for a couple more months, and rumors suggest Arrow Lake for desktops won't be here until December.

There's a lot of fresh technology for the PC that has arrived recently or is coming down the pipeline. The next few months are shaping up to be quite busy, and we're excited to see how things ultimately shake out.
