Updated: Mar 16
Okay, so I just want to make an addition to my tech babble, which itself was an "addition" to my assessment of GCN bottlenecks. I'm updating it as I learn new things I want to add to the original post. :D
Wavefronts go from 64 to 32 threads with RDNA
AMD calls batches of threads (the numbers to be worked on) Wavefronts; NVIDIA terms these Warps. Anyway, with GCN a batch of numbers to be operated on is issued in groups of 64, which a single GCN CU can execute in four clock cycles. I'll explain why in just a moment.
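If it helps, here is the arithmetic behind that "four clock cycles" claim as a tiny sketch. The constant names are mine, not AMD's; the numbers (64-thread wavefronts, 16-wide vector units) come from the GCN layout described below.

```python
# Toy calculation (my own names, not AMD's): how many clocks one
# wavefront needs on a single GCN vector unit (a SIMD-16).
GCN_WAVEFRONT_SIZE = 64   # threads per wavefront on GCN
GCN_SIMD_WIDTH = 16       # ALU slots in one vector unit

def cycles_per_wavefront(wave_size, simd_width):
    # Each clock, the SIMD chews through simd_width threads of the wave.
    return wave_size // simd_width

print(cycles_per_wavefront(GCN_WAVEFRONT_SIZE, GCN_SIMD_WIDTH))  # prints 4
```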
Here is a GCN Compute Unit: the part of the GPU that operates on the numbers in graphics code (it also filters textures and such). The overall structure of this Compute Unit has barely changed since GCN1 ("Tahiti", 2012), as far as I know. Newer revisions gained new features (like shader-instruction prefetch with GCN4 and 5) and new instructions, but the SIMD layout remained the same.
That is, the CU has four Vector Units, each with 16 ALU slots, and each of these vector units operates on a Wavefront in the following way:
This diagram shows how GCN handles SIMD compared to the Very Long Instruction Word 4 (VLIW4) based TeraScale. I won't go into too much detail, but the TeraScale layout has issues with dependencies in compute tasks. The GCN design gets around this by letting the CU work on four separate wavefronts at the same time. These wavefronts are largely independent of each other, so GCN can achieve higher utilisation in code with dependencies than the VLIW approach, which fills all of its ALU slots with operations from just one wavefront. If instructions within that one wavefront depend on each other, ALU slots sit unused while the dependencies are resolved.
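Here is a deliberately crude toy model of that utilisation difference (my own simplification, not how the real schedulers work): imagine an instruction stream that is a pure dependency chain, so only one op is ever ready at a time.

```python
# Toy model (my simplification): slot utilisation under a pure
# dependency chain, where only 1 op per wavefront is ready each issue.

def vliw4_utilisation(independent_ops_ready):
    # VLIW4 bundles up to 4 ops from ONE wavefront per issue;
    # a dependency chain can only supply 1 of the 4 slots.
    return min(independent_ops_ready, 4) / 4

def gcn_utilisation(resident_wavefronts):
    # GCN feeds each of its 4 SIMDs from a DIFFERENT wavefront,
    # so dependencies inside one wavefront don't starve the others.
    return min(resident_wavefronts, 4) / 4

print(vliw4_utilisation(1))  # 0.25 -> 3 of 4 VLIW slots idle
print(gcn_utilisation(4))    # 1.0  -> all 4 SIMDs kept busy
```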
So you can see GCN getting around this by extracting high parallelism across wavefronts. Each of those Vector Units can execute 25% of one wavefront per clock cycle, which means a GCN CU can theoretically complete four full wavefronts in four clock cycles. But if only one wavefront is needed, it still takes those four clock cycles.
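The back-of-envelope numbers behind that, assuming the figures above (4 SIMDs of 16 lanes, 64-thread wavefronts):

```python
# Throughput vs latency for a GCN CU, using the assumed figures above.
SIMDS = 4
LANES_PER_SIMD = 16
WAVE_SIZE = 64

threads_per_clock = SIMDS * LANES_PER_SIMD          # 64 threads/clock CU-wide
single_wave_latency = WAVE_SIZE // LANES_PER_SIMD   # 4 clocks, even for a lone wavefront
waves_done_in_4_clocks = threads_per_clock * 4 // WAVE_SIZE  # 4 wavefronts

print(threads_per_clock, single_wave_latency, waves_done_in_4_clocks)  # 64 4 4
```

So the throughput is great when four wavefronts are in flight, but the latency floor for any single wavefront is four clocks.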
RDNA: better for gaming shader code?
With RDNA and "Navi", AMD has changed the way threads are batched, placing them into groups of 32 numbers instead of 64. The Navi RDNA Compute Unit now consists of two SIMD-32 units instead of four SIMD-16s.
I am still digesting how this works, but it theoretically means a single Wavefront takes only one clock cycle to complete on RDNA. This improves single-threaded performance, as stated in the diagram above.
Here, look: you can see how Navi can execute the same number of threads in less time than a GCN CU operating on those same threads.
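A side-by-side sketch of that comparison (the figures are the assumed ones from above: 4×SIMD-16 with wave64 for GCN, 2×SIMD-32 with wave32 for RDNA):

```python
# Comparison sketch with assumed figures: GCN CU vs RDNA ("Navi") CU.
def wave_latency(wave_size, simd_width):
    # Clocks for one wavefront to pass through one SIMD.
    return wave_size // simd_width

# Both CUs retire the same 64 threads per clock overall...
gcn_throughput = 4 * 16    # 4 SIMDs x 16 lanes
rdna_throughput = 2 * 32   # 2 SIMDs x 32 lanes

# ...but each individual wavefront finishes sooner on RDNA.
gcn_latency = wave_latency(64, 16)    # 4 clocks per wave64
rdna_latency = wave_latency(32, 32)   # 1 clock per wave32

print(gcn_throughput == rdna_throughput, gcn_latency, rdna_latency)  # True 4 1
```

Same peak throughput per CU, but a quarter of the per-wavefront latency, which is the "same threads, less time" picture in the diagram.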
Due to the way the SIMD units work, I am assuming each Wavefront of 32 threads also issues as just one instruction (rather than two separate instructions of 16 numbers each). Unless the SIMD design is very different, I think this is likely the case. A very good article explaining the complex details of this was made over at ExtremeTech if you want to check it out, since I did a crap job of explaining it T_T
Anyway, this should let Navi scale to graphics workloads more easily, without the complex pipelining and optimisations I talked about in my previous post on Shaders and GCN. And in situations where GCN simply cannot be pipelined effectively (games that run really badly on GCN, for example), Navi should do much better. Hooray for video games.
Thanks for reading my little update babble. :o