Updated: Aug 15, 2020
UPDATE 15-08-2020: Please read my Knowledge Update on RDNA's Primitive Shaders!
So you know AMD has done a pretty huge rework on the Compute Unit with the new "RDNA" Architecture that powers the "Navi 10" chip in the new RX 5700 series, which is going to launch soon. Well, this post is about those changes, in a way.
You know I did a lot of posts about geometry performance with GCN? Well, I did, and you can check my main one out here. Then... AMD goes and completely re-designs the Compute Unit's SIMD structure (that's Single Instruction Multiple Data, and it basically means you take a group of numbers and execute one operation on all of them at once; the instruction is the same but you get a set of different results), and that change was pretty big and, honestly, for me: unexpected. :o
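If the SIMD idea is easier to see in code, here is a tiny Python sketch of it. This is only an illustration of the concept, not how a GPU is actually programmed, and the names `simd_execute` and `double` are made up for this example.

```python
def simd_execute(instruction, lanes):
    """Apply ONE instruction to EVERY number in the group (one result per lane)."""
    return [instruction(value) for value in lanes]


def double(x):
    # The single instruction that every lane executes.
    return x * 2


lanes = [1, 2, 3, 4]                   # the group of numbers
results = simd_execute(double, lanes)  # same instruction, different results
print(results)
```

Real hardware does all the lanes at the same time rather than looping, but the idea is the same: one instruction in, a whole set of results out.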
I admit I don't really know as much about the really nitty-gritty underlying GPU architecture numbery-crunchery bit that actually goes on within those Compute Units, but I learn more every day! Please don't hate me. And basically now I feel I have a better understanding of some of the 'issues' with Graphics Core Next's bottlenecks.
GCN and bottlenecking because of Geometry
So I originally believed a lot of GCN's issues were due to somewhat limited geometry performance relative to its absolutely enormous compute throughput. Well, I believe this is still an area of concern for GCN but after the RDNA announcement my idea of the bottlenecks in GCN changed. But first a bit about the geometry aspect of GCN...
Let's go back to early January 2012, when the first GCN-based GPU launched: Tahiti. This chip powers the HD 7900 and R9 280 series of graphics cards. Anyway, this GPU has a monolithic Geometry Front end with a pair of Raster Engines and geometry engines (monolithic compared to NVIDIA architectures, which use a more "distributed" approach to geometry, where the "PolyMorph" engines are attached to each SM). Each one of these geometry engines handles things like Tessellation and actually setting up the geometry data. The Raster Engines actually spit out the primitives, so you get 2 Triangles per clock on Tahiti.
This is Tahiti's official block-diagram, which shows the very simplified top-level distribution of all its resources. You can see the two distinct "Shader Engines", each with 16 CU (AMD does not call them Shader Engines on Tahiti as far as I know, by the way, but I think they are). And right at the top you can see the monolithic front end with the dual geometry and raster engines. I don't know about you, but this always looked heavily lopsided towards Compute throughput rather than geometry. And it sort of is.
With Hawaii and all chips going forward from this, AMD increased the Geometry front-end by 2X: they doubled it up and it became a quad-raster design.
Here is the block diagram for Tonga. I chose this chip to show as it has a very similar resource complement to Tahiti... So you can see they have the same number of Compute Units, but Tonga distributes those resources between four shader engines, each with its own Raster and Geometry engine. I measured some pretty huge gains in tessellation, and in what I believe to be triangle-limited workloads, on this chip over Tahiti. Obviously, the doubled throughput did help.
So AMD never actually upgraded the GPU design like this again, not with GCN anyway. Even the mighty Vega 20 GPU, powering the single most powerful Radeon gaming graphics card you can buy today, Radeon VII, still uses the Quad-design.
The overall GPU layout is very similar, but this time you get twice the Compute Units per engine. AMD made some fairly significant improvements to the actual Geometry engines themselves, though. And Polaris and Vega have a hardware-level Primitive Discard engine, whose performance gain I also measured, if you look at the "GCN Geometry Performance" post I made.
But overall, the gains weren't massive, and this has always been the area I thought was primarily limiting GCN, especially the Vega 10 and Vega 20 GPUs, which have buckets of Compute but relatively lighter geometry throughput.
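To put a rough number on that "buckets of Compute, lighter geometry" feeling, here is a quick back-of-the-envelope sketch in Python. The engine and CU counts are the ones described above (the Vega 20 figure is four engines with twice Tahiti's 16 CU each, so the full chip). The big assumption is that each raster engine outputs at most one triangle per clock, which matches the "2 Triangles per clock" figure for Tahiti but ignores everything else that affects real geometry performance. The function name is just made up for this example.

```python
# (name, raster engines, compute units), as described in this post
CHIPS = [
    ("Tahiti", 2, 32),
    ("Tonga", 4, 32),
    ("Vega 20", 4, 64),
]


def triangles_per_clock_per_cu(raster_engines, compute_units):
    # Assumption: one triangle per raster engine per clock, at peak.
    return raster_engines / compute_units


for name, engines, cus in CHIPS:
    print(name, triangles_per_clock_per_cu(engines, cus))
```

By this very crude measure, Tonga has twice the peak geometry rate per Compute Unit that Tahiti has, and Vega 20 falls right back to Tahiti's ratio.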
GCN shaders are under-utilised in graphics workloads
I think we knew this already. But I was more interested in: Why? Why does Vega especially not seem to be able to pump out significantly more FPS in video games versus the comparatively leaner GP104 chip in GTX 1080, for example? (That is 4096 Stream Processors compared to 2560 CUDA cores, and they do largely the same job.) Well, I still think it's somewhat limited by geometry in some frames, but I don't think that is the primary cause.
If you look at what AMD has stated and changed with the new RDNA architecture for Navi, I get a sense of why this is. I also did some reading of tech sites and I updated my understanding since that information was released.
You know, before I get into the technicalities of these Compute Units, let's throw back to when GCN replaced the VLIW-based TeraScale back in very late 2011...
GCN: A Complex Compute Powerhouse
TeraScale based cards (HD 6000 and prior) were pretty great for Graphics workloads but they had a huge Achilles heel: Compute. Well, this was primarily due to an inefficient way of handling dependency heavy code. I won't go too much into detail, but my understanding here is that VLIW had huge utilisation issues in compute tasks with dependency in them, that is, instructions relying on results from other instructions. And GCN was built to get around this, by getting its parallelism from running lots of threads side by side rather than from packing independent instructions together.
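Here is a toy Python sketch of why dependencies hurt a VLIW design, as far as I understand it. It is not TeraScale's real compiler: `pack_vliw` is a made-up, very naive in-order packer, and the bundle width of 5 is an assumption for illustration. An instruction can only share a bundle with instructions it does not depend on.

```python
def pack_vliw(instructions, width):
    """Greedy in-order packing. `instructions` is a list of (name, dependencies),
    where dependencies only name earlier instructions."""
    bundles, current = [], []
    for name, deps in instructions:
        # Start a new bundle if this one is full, or if the instruction
        # needs a result produced inside the current bundle.
        if len(current) == width or any(d in current for d in deps):
            bundles.append(current)
            current = []
        current.append(name)
    if current:
        bundles.append(current)
    return bundles


def slot_utilisation(bundles, width):
    # Fraction of available ALU slots that actually hold an instruction.
    return sum(len(b) for b in bundles) / (len(bundles) * width)


WIDTH = 5  # assumed bundle width, for illustration only

independent = [("a", []), ("b", []), ("c", []), ("d", []), ("e", [])]
chain = [("a", []), ("b", ["a"]), ("c", ["b"]), ("d", ["c"]), ("e", ["d"])]

print(slot_utilisation(pack_vliw(independent, WIDTH), WIDTH))
print(slot_utilisation(pack_vliw(chain, WIDTH), WIDTH))
```

Five independent instructions fit in a single bundle, while a chain of five dependent ones needs five mostly empty bundles, so most of the ALU slots sit idle.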
A GCN Compute Unit contains four Vector units, each 16 ALUs wide; call them "SIMD-16" units if you want. With GCN, AMD schedules work into groups of 64 threads called "Wavefronts". A single GCN Compute Unit, in my understanding, takes four separate Wavefronts and executes 25% of each one (16 threads) at a time, one on each of its four Vector SIMD-16 units. That means each instruction for a Wavefront takes 4 clock cycles to get through its SIMD-16 unit. In highly parallel code with lots of dependency (compute) this is very effective. In Graphics, it's good too, but it requires some tricky optimising and code writing/scheduling to achieve peak utilisation.
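The 4-clock cadence can be sketched like this in Python. It is a simplified model of my understanding, not real hardware behaviour: the function name is made up, and it only shows one Wavefront on one SIMD-16 unit, stepping through 16 threads per clock.

```python
WAVEFRONT_SIZE = 64  # threads per Wavefront
SIMD_WIDTH = 16      # ALUs in one SIMD-16 unit


def execute_wavefront_instruction(instruction, wavefront):
    """Run one instruction over a 64-thread Wavefront, 16 threads per clock."""
    assert len(wavefront) == WAVEFRONT_SIZE
    results = []
    cycles = 0
    for start in range(0, WAVEFRONT_SIZE, SIMD_WIDTH):
        quarter = wavefront[start:start + SIMD_WIDTH]  # 25% of the Wavefront
        results.extend(instruction(value) for value in quarter)
        cycles += 1  # one clock per group of 16 threads
    return results, cycles


results, cycles = execute_wavefront_instruction(lambda x: x + 1, list(range(64)))
print(cycles)
```

A full Compute Unit would have four of these running side by side, each one working through its own Wavefront.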
Everything about the move from TeraScale to GCN was about Compute. It was designed to be a computing, number-crunching powerhouse, and that's what it is. In fact, I often joked that GCN is a "Parallel Computing architecture with a graphics pipeline tacked on", lol. RDNA changes things by being optimised and designed for the one thing us PC Gamers care about: Graphics.
Oh wow, I typed a lot already and it's 4AM and my mind is fuzzy and I want to play Warframe so I will get to the point.
What does all this mean? About GCN Shaders and bottlenecking? Oh, and RDNA: Built for Graphics (at last)!
GCN was built for Compute first and foremost. Its Compute Unit structure is excellent for complex Compute tasks and it excels there, but it's tricky to schedule for in graphics work unless a developer actually explicitly includes GCN-specific optimisations in the shader code. Do you notice how some games run really well on GCN and others, well, run like pants (at first)? That is this in action. AMD probably has been fighting a huge battle keeping the Compute Units on GCN chips occupied with work because of this architectural design. Now, I do not know the complex intricacies (yet!) but RDNA is built to be much simpler for shading code to scale to, and thus those Stream Processors are going to see much higher utilisation, which leads you to the "IPC" increase observed on Navi GPUs.
Basically what I am trying to say is: It wasn't all geometry. In fact, I think it was equally, or maybe even more so, limited by the shaders. GCN being shader limited? What? Who would have thought? Yes, shader limited, but not because it is overworked and running out of resources. It's because a large portion of those shaders are sat doing nothing due to GCN-unfriendly code in video games.
I'm not going to say too much on the actual architecture of the CU for RDNA, as I am still learning a lot about how this all works, but each CU now uses two vector units, each with 32 ALUs (SIMD-32), to make up the 64 slots. This does actually trade a bit of instruction granularity, I think (you can only work with a minimum of 32 numbers at a time vs 16 in GCN), but apparently AMD believes this approach will lead to better utilisation in 3D graphics. That's good for gamers!
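Here is a very rough Python toy model of why the new layout might keep more ALUs busy when there isn't much work around. Big assumptions, none of which come from AMD: RDNA works on groups of 32 threads, each group sits on a single vector unit, an instruction takes (group size / unit width) clocks, and I ignore latency hiding, scalar units and everything else. The function name is made up.

```python
def alu_utilisation(threads_of_work, simd_count, simd_width, wave_size):
    """Fraction of a CU's ALU slots doing useful work while it issues one
    instruction for each group of threads it has been given (toy model)."""
    waves = -(-threads_of_work // wave_size)       # ceiling division
    busy_simds = min(waves, simd_count)            # one group per vector unit
    cycles_per_instruction = wave_size // simd_width
    alu_slots = simd_count * simd_width * cycles_per_instruction
    useful = min(threads_of_work, busy_simds * wave_size)
    return useful / alu_slots


# The same 64 threads of work on each design:
gcn = alu_utilisation(64, simd_count=4, simd_width=16, wave_size=64)
rdna = alu_utilisation(64, simd_count=2, simd_width=32, wave_size=32)  # assumed 32-wide groups
print(gcn, rdna)
```

In this toy model, GCN needs four full Wavefronts (256 threads) in flight to fill the whole Compute Unit, while the RDNA-style layout is already full with 64 threads. Take the exact numbers with a big pinch of salt; it's only meant to show the direction of the change.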
So basically GCN was having Shader under-utilisation issues, but for a different reason than I largely originally believed. Don't get me wrong, I still think geometry plays a big role in this too, and Navi does have some huge gains in geometry (working Primitive Shaders, by the way. These can take 8 triangles in and spit out 4, culling invisible triangles super quickly, though it is still limited by the quad-raster design, oddly).
I think, in a way, RDNA was built to be more similar to NVIDIA's approach to shading, and may even benefit from NVIDIA friendly code. Hopefully Navi based GPUs will show smaller performance variance between games because of this.
As always, thanks very much for reading my babble. I hope you like it, oh and I am thinking about adding a way for people to contact me but not sure. I want to talk to new people who are in the know about these topics. So I can expand my own understanding. ^-^
But am super shy so probably won't :c