Updated: Aug 15, 2020
UPDATE 15-08-2020: Please read my Knowledge Update on RDNA's Primitive Shaders!
So you know AMD has done a pretty huge rework on the Compute Unit with the new"RDNA" Architecture that powers the "Navi 10" chip in the new RX 5700 series that are going to launch soon. Well, this post is about those changes in a way.
You know I did a lot of posts about geometry performance with GCN? Well I did you can check my main one out here. Then.... AMD goes and completely re-designs the Compute Unit's SIMD (That's Single Instruction Multiple Data, and it basically means you take a group of numbers and execute one operation on all of them at once. The instruction is the same but you get a set of different results) structure, and that change was pretty big and honestly for me: unexpected. :o
I admit I don't really know as much about the really nitty-gritty underlying GPU architecture numbery-crunchery bit that actually goes on within those Compute Units, but I learn more every day! Please don't hate me. And basically now I feel I have a better understanding of some of the 'issues' with Graphics Core Next's bottlenecks.
GCN and bottlenecking because of Geometry
So I originally believed a lot of GCN's issues were due to somewhat limited geometry performance relative to its absolutely enormous compute throughput. Well, I believe this is still an area of concern for GCN but after the RDNA announcement my idea of the bottlenecks in GCN changed. But first a bit about the geometry aspect of GCN...
Going back to early January 2012, when the first GCN-based GPU launched; Tahiti. This card powers the HD 7900 and R9 280 series of graphics cards. Anyway, this GPU has a monolithic (compared to NVIDIA architectures, which use a more "distributed" way of geometry, where the "PolyMorph" engines are attached to each SM) Geometry Front end with a pair of Raster Engines and geometry engines. Each one of these geometry engines handles things like Tessellation and actually setting up the geometry data. The Raster Engines actually spit out the primitives so you get 2 Triangles per clock on Tahiti.
This is Tahiti's official block-diagram which shows the very simplified top-level distribution of all its resources. You can see the two distinct "Shader Engines" each with 16 CU (AMD does not call them Shader Engines on Tahiti as far as I know, by the way, but I think they are). And right at the top you can see the monolithic front end with the dual geometry and raster engines. I don't know about you, but this was always looking heavily lopsided towards Compute throughput than geometry. And it sort of is.
With Hawaii and all chips going forward from this, AMD had increased the Geometry front-end by 2X, they doubled it up and also it became a quad-raster design.
Here is the block diagram for Tonga. I chose this chip to show as it has a very similar resource compliment to Tahiti... So you can see they have the same number of Compute Units but Tonga distributes those resources between four shader engines each with its own Raster and Geometry engine. I measured some pretty huge gains in tessellation and what I believe to be triangle-limited workloads on this chip over Tahiti. Obviously, the doubled throughput did help.
So AMD never actually upgraded the GPU design like this again, not with GCN anyway. Even the Mighty Vega 20 GPU powering the single most powerful Radeon gaming graphics card you can buy today: Radeon VII, still uses the Quad-design.
The overall GPU layout is very similar but this time you get twice the Compute Units per engine. AMD made some fairly significant improvements to the actual Geometry engines themselves, though. And with Polaris and Vega; they have a hardware level Primitive Discard engine, which I also measured the performance gain from if you look at the "GCN Geometry Performance" post I made.
But overall, the gains weren't massive and this has always been an area I thought was primarily limiting GCN, and primarily Vega 10 and Vega 20 GPUs, which have buckets of Compute but relatively lighter geometry throughput.
GCN shaders are under-utilised in graphics workloads
I think we knew this already. But I was more interested in: Why? Why does Vega especially not seem to be able to pump out significantly more FPS in video games versus the comparatively leaner GP104 chip in GTX 1080, for example? (That is 4096 Stream Processors compared to 2560 CUDA cores, they do largely the same job). Well, I still think it's somewhat limited by geometry in some frames but that is not the primary cause I think.
If you look at what AMD has stated and changed with the new RDNA architecture for Navi, I get a sense of why this is. I also did some reading of tech sites and I updated my understanding since that information was released.
You know, before I get into the technicalities of these Compute Units, lets throw-back to when GCN replaced the VLIW-based TeraScale back in very late 2011...
GCN: A Complex Compute Powerhouse
TeraScale based cards (HD 6000 and prior) were pretty great for Graphics workloads but they had a huge Achilles heal: Compute. Well, this was primarily due to an inefficient way of handling dependency heavy code. I won't go too much into detail but my understand here is that VLIW had huge utilisation issues in compute tasks with dependency in them, that is, instructions relying on results from other instructions. And GCN was built around this, by extracting more parallelism.
A GCN Compute Unit contains four Vector units, each 16 ALU wide, call them "SIMD-16" units if you want. With GCN AMD schedules work into groups of 64 threads called "Wavefronts". A single GCN Compute Unit in my understanding, takes four separate Wavefronts and executes 25% of each one (16 numbers) at a time, on each of its four Vector SIMD-16 units. But that means each CU can only complete a single Wavefront in 4 clock cycles. But in highly parallel code with lots of dependency (compute) this is very effective. In Graphics, it's good too but requires some tricky optimising and code writing/scheduling to achieve peak utilisation.
Everything about GCN from TeraScale was about Compute. It was designed to be a computing, number-crunching powerhouse, and that's what it is. In fact, I often joked that GCN is a "Parallel Computing architecture with a graphics pipeline tacked on", lol. RDNA changes things by being optimised and design for the one thing us PC Gamers care about: Graphics.
Oh wow, I typed a lot already and it's 4AM and my mind is fuzzy and I want to play Warframe so I will get to the point.