Updated: Aug 15, 2020
This is a pure Babble from my inner self. It might not even make sense at some points. Warning: contains typos. I'll fix them later. Maybe.
UPDATE 15-08-2020: Please read my Knowledge Update on RDNA's Primitive Shaders!
The little Guy
Nobody paid much attention to Little Navi. Well, at least not as much as they did to his bigger Brother, Middle-sized Navi (Navi 10), which debuted in the RX 5700 XT. People are obsessed with the high-performance, large-die, tier-pushing parts that usually cost a fortune too. And that's okay, I love those too. As a tech enthusiast, it's awesome to see the boundaries of computing get pushed forward with each new architecture.
However, we must remember the Little Guys. I am talking about GPUs like Polaris (10/11) and Navi 14, and before those, the likes of Pitcairn, the Immortal GPU. It is the smaller, 'entry-level' processors that enable low-cost access to the latest technology, and it is they that let everyone - regardless of budget - get at it.
Now, you may already be aware that I hold Polaris in very high esteem, much like I do Ryzen, especially with how they have enabled people on low incomes to break down the barrier to computing and PC gaming. Indeed, I am a big proponent of high-value technology products because I feel they are the most progressive for the vast majority of the population.
That brings me to Navi 14.
But Sash, RX 5500 XT was kinda 'meh' on launch and the value wasn't even that great. You're an idiot, why are you typing this crap?
I am fully aware of that. But the purpose of this post is to talk about a GPU that often doesn't get a lot of attention, and well, I have one as my main GPU and I just wanted to talk about it. So you can just deal with that. CHUMP! That was a joke. :3
After my Polaris 30-based RX 590 burped and I decided to retire him permanently, I found myself using his somewhat ill-fated (value position) replacement, RX 5500 XT; based on the tiny - and somewhat adorable - Navi 14 graphics processor.
Navi 14 is a bit different from the GPU that it 'replaces' in terms of performance level; it is actually more of a successor to Polaris 10/20/30's little brother, Polaris 11/21, which featured in the RX 460 and 560 cards. I actually touched on the subject of GPU succession when some dumb people threw their toys out of the pram complaining that RDNA (Navi) is crap because RX 5700 XT wasn't 'That much faster' than Vega 64.
I mean, RX 5700 XT's Navi 10 processor is designed to replace Vega 64, not succeed it. The job of succeeding the relatively fat (~500 mm²) Vega 10 will be down to the so-called 'Big Navi' that everyone is hyped about for the next few months. Anyway, on the subject of Navi 14, this little guy is most likely intended to succeed the Polaris 11 processor - also known as 'Baffin' - which serves on the RX 460 graphics card.
So in effect, what we have seen with RDNA is a pretty huge jump in performance, so much so that AMD has been able to push an entire tier of GPU up a notch - much like Nvidia did with the GK104-based GTX 680 in 2012. Navi 14; RX 5500 XT; for all intents and purposes, is an RX 660 at the same performance level as an RX 580.
Technical details of Little Navi
We really have to dive into the details of this little chip and compare it to its predecessors to get an understanding of the sort of market this little processor was built for. The post I mentioned above on GPU succession has a nice little layout of specifications that you can read, it is relevant to this subject.
Anyway, Navi 14 is a very small processor with design choices that make the chip cheaper to manufacture and improve yields, maximising margins in a market that already has very low ones (entry-level). Despite these constraints, Navi 14 achieves a full performance tier gain over GCN, being able to provide RX 580-like performance with fewer stream processors (though, interestingly, more transistors - increasing clock speeds and upgrading internal caches eats into the transistor budget, along with new features such as the video engine, and I believe an RDNA WGP [2x CU] has significantly more transistors than two GCN CUs: additional Scalar unit, bigger caches... wow, this bracket sentence is huge. It's 5.7b transistors for Polaris 10 and 6.4b for Navi 14, by the way; Navi 14 has 200m more transistors than even Hawaii [R9 290X]), half the memory interface width (GDDR6 memory helps), half the PCI-E lanes (4.0!) and only two primitive output units.
Bus Width to Graphics Memory
The memory interface on this GPU is only 128 bits wide. That is to say, it only needs four GDDR chips to fill the interface completely, as each GDDR memory chip has a 32-bit wide interface (or 16-bit when a card is configured in 'clamshell mode'). The very fact that the bus is so narrow is a big indicator that this GPU was built to be cost effective: less complex PCBs due to fewer traces, and fewer memory chips. A smaller bus also uses less power, and the physical connections (PHYs) on the chip occupy less space; about half as much space, as you might have thought, as the 256-bit connection on Navi 10, but maybe a bit more than half the space of Polaris 10's, assuming we normalise for the process density. That is because I believe GDDR6 PHYs are slightly larger on chip than GDDR5 ones.
This contrasts with the RX 580's 256-bit interface, and is likely helping to offset the added cost of using GDDR6 instead of GDDR5. Obviously, I have to point out that the data rate of Navi 14's standard GDDR6 - 14 Gbps per pin - is significantly greater than the standard 8 Gbps of the GDDR5 on Polaris 10. With 1.75x the transfers per second, the bus can be 50% the width and the effective data bandwidth is still almost the same - netting all those space savings in the process.
I said almost; Navi 14 with 14 Gbps GDDR6 over a 128-bit bus produces a theoretical peak raw memory bandwidth of 224 GB/s. Polaris 10, with its 256-bit interface and 8 Gbps GDDR5, produces 256 GB/s in raw bandwidth; 32 GB/s more than Little Navi.
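If you want to check my sums, here is a quick sketch of the arithmetic from this section. The function names are just mine for illustration; the only inputs are the per-pin data rates, bus widths and the 32-bit (or 16-bit clamshell) per-chip width mentioned above.

```python
def peak_bandwidth_gb_s(data_rate_gbps, bus_width_bits):
    """Peak raw bandwidth in GB/s: per-pin rate (Gbit/s) x number of pins / 8 bits per byte."""
    return data_rate_gbps * bus_width_bits / 8


def chips_needed(bus_width_bits, bits_per_chip=32):
    """Memory chips required to populate the whole bus."""
    return bus_width_bits // bits_per_chip


navi14 = peak_bandwidth_gb_s(14, 128)    # GDDR6 on a 128-bit bus
polaris10 = peak_bandwidth_gb_s(8, 256)  # GDDR5 on a 256-bit bus

print(f"Navi 14:    {navi14:.0f} GB/s with {chips_needed(128)} chips")
print(f"Polaris 10: {polaris10:.0f} GB/s with {chips_needed(256)} chips")
print(f"Navi 14 keeps {navi14 / polaris10:.1%} of the raw bandwidth on half the bus")
```

So half the pins and half the chips still gets Little Navi to within 12.5% of Polaris 10's raw figure.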
However. This would bring me to the little extra section tacked on under the Graphics memory Section.
Navi 14 has new tricks to improve bandwidth efficiency and I wish I had more Cache.
Since GCN3 (Tonga/Fiji) AMD has followed Nvidia and implemented a lossless compression technology on their GPUs that essentially tries to minimise the amount of raw colour data sent to memory by grouping similar bits of data together - essentially compressing them and reducing the overall bits of data sent to memory. This allows additional traffic to occupy that saved space - increasing effective bandwidth.
This technology is in its 2nd generation on Polaris; it will almost certainly have been upgraded to a 3rd generation on RDNA1 (Navi 10 & 14) GPUs. Since this compression technology is baked into the silicon logic, upgrades cannot be back-ported to older chips; only newer ones will feature the improvements and upgrades.
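To make the general idea a bit more concrete, here is a toy sketch of lossless delta encoding. I have to be loud about this: it is NOT AMD's actual colour compression scheme (I don't know the real on-chip format), just my own made-up illustration of why similar neighbouring values take fewer bits to store while still decoding back exactly.

```python
def encode(block):
    """Store the first value plus each other value's difference from it."""
    base = block[0]
    return base, [value - base for value in block[1:]]


def decode(base, deltas):
    """Rebuild the original block exactly (lossless)."""
    return [base] + [base + delta for delta in deltas]


def stored_bits(block, bits_per_value=8):
    """Bits needed for base + fixed-width signed deltas, falling back to raw storage."""
    base, deltas = encode(block)
    widest = max((abs(delta) for delta in deltas), default=0)
    # Width of the largest delta plus one sign bit; zero bits if every value is identical.
    delta_bits = 0 if widest == 0 else widest.bit_length() + 1
    packed = bits_per_value + delta_bits * len(deltas)
    raw = bits_per_value * len(block)
    return min(packed, raw)  # never store more than the uncompressed block


sky = [200, 201, 199, 200, 202, 200, 201, 200]  # hypothetical smooth gradient
noise = [0, 255, 3, 128, 77, 250, 9, 64]        # hypothetical noisy block

assert decode(*encode(sky)) == sky  # nothing is lost
print(stored_bits(sky), "bits for the smooth block vs", 8 * len(sky), "raw")
print(stored_bits(noise), "bits for the noisy block vs", 8 * len(noise), "raw")
```

The smooth block shrinks a lot and the noisy one doesn't shrink at all, which is the general reason this sort of compression helps with things like skies and flat surfaces more than with noisy textures.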
Polaris also did something very special for bandwidth efficiency: More L2 Cache. A big trend in more modern processor designs is increasing cache sizes and performance; the more data you can keep on-chip, the better performance is going to be; and the lower power consumption will be, because going off-die to an external memory uses a pretty significant amount more energy than accessing an internal cache.
Polaris 10 increased the L2 cache size for the 'mid-tier' GPU from 768 KiB on Tonga to 2 MiB: this allows more memory accesses to remain on-chip and thus frees up more bandwidth for use by heavier memory operations such as those often performed by the Render Back-Ends (ROPs). Now, Navi 14 also has a 2 MiB L2 cache - but before you bite my head off and shout that this isn't an improvement, remember that Navi 14 replaces Polaris 11 (RX 460), which had just 1 MiB of L2 cache - that is a 2X increase for this GPU 'class' (Little ones).
But it's not just L2 that got upgraded on Navi-based GPUs. Little Navi, just like Middle-sized Navi, features an all-new 'L1' cache level that sits between the L0 caches inside the Compute Units (which themselves sit above the register files) and the L2 cache, essentially stopping even more of the smaller requests from reaching the L2 cache at all. This alone allows Navi 14's L2 cache to be used for heavier memory transactions more often than Polaris's, and means efficiency of bandwidth increases even more.
Also, Navi 14's L2 cache will have likely seen some old-fashioned improvements to bandwidth internally, so that helps a lot too.
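Here is a tiny back-of-the-envelope model of why an extra cache level in front of the L2 matters. The hit rates are completely hypothetical numbers I picked to make the sums easy; they are not measurements of any real GPU.

```python
def requests_reaching_each_level(requests, hit_rates):
    """Return the number of requests arriving at each cache level, then at memory.

    hit_rates lists the hit rate of each level from closest-to-the-shaders outwards;
    whatever misses one level is passed on to the next.
    """
    arriving = [requests]
    for hit_rate in hit_rates:
        requests = requests * (1.0 - hit_rate)
        arriving.append(requests)
    return arriving


# Hypothetical 50% hit rate at every level, purely for illustration.
two_levels = requests_reaching_each_level(1000, [0.5, 0.5])         # per-CU cache -> L2
three_levels = requests_reaching_each_level(1000, [0.5, 0.5, 0.5])  # L0 -> L1 -> L2

print("Requests arriving at L2, two-level hierarchy:  ", two_levels[1])
print("Requests arriving at L2, three-level hierarchy:", three_levels[2])
print("Requests going off-chip to memory:", two_levels[-1], "vs", three_levels[-1])
```

With these made-up numbers the L2 sees half as many requests once there is another level in front of it, which is the 'frees up the L2 for heavier transactions' effect I was babbling about.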
Okay, so with this memory bit out of the way, you can see that Navi 14 is truly equipped to replace the previous 'middle-tier' GPU (Polaris 10) in performance but with 'Little-tier' hardware specs - that is how improved RDNA is, and we have not seen that very often with a GCN GPU. (Polaris 11 wasn't able to beat Pitcairn in R9 270X viably, and since we never got a GCN5-based 'middle tier' GPU [or at least, it was never released to consumers] we had to deal with GCN4 for this class all the way up until Navi 10 pushed the middle tier up to GCN5's top tier, and the little tier to GCN4's middle tier. If you follow me? :D)
RX 5500XT only has 8 PCI-E lanes because AMD is greedy and they want you to buy a Gen4 motherboard so it's actually worth it!!!
Untangle the jimmies there, this is a joke, aimed mainly at the chumps who actually claimed this. It's true that Navi 14 - and the card based on it, RX 5500 XT - only has 8 PCI-E lanes, and it's also true that they are rated for the Gen4 spec like all Navi-based GPUs so far, but the reason for there being half the normally expected amount really has nothing to do with greed or milking like those sensationalists like to believe.
The reason is that PCI-E lane PHYs take up actual die space on the GPU, they require more complex PCBs because of the additional traces (we already talked about this with the memory controllers), and they also use more power. The best way to look at this objectively is to understand that Navi 14's immediate predecessor in terms of GPU 'lineage', Polaris 11 (Baffin) in RX 460 (and 21 in RX 560), also had only 8 PCI-E lanes. These were of course Gen3 rated, but the idea is the same. Navi 14 is a Baby Chip built for cost-sensitive markets where simplicity of design, small chips and ease of deployment are key to making them profitable. And let's face it, even at Gen3 speeds, 8 lanes is not going to bottleneck the class of performance on RX 5500 XT in any meaningful way...
... No I'm not even going to recognise the 4GB model of the RX 5500 XT, and its VRAM-related woes; up to and including stuttering - where the 8X interface actually makes a tangible difference. But I just said that it didn't? Well, I use the 8GB card as a reference point because it disgusts me that this tier even has 4GB models available. Did I also mention that I vomited in my mouth when AMD announced the RX 5600 XT with 6GB? Well, I did.
Anyway, our Little Navi has only 8 PCI-E lanes because it means the chip can be smaller, and you know why that matters? You know why processors being smaller matters at all? Because AMD pays GlobalFoundries/TSMC a fixed amount per wafer (I believe this is regardless of defective dies) and the smaller the chip is, the more of them you can make per wafer. That essentially means the cost of fabricating each chip is directly linked to its physical size. That's why this matters. For some people to think it was a deliberate decision to justify X570/B550 motherboards is just plain stupid and I won't comment further on that subject. :D
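For a sense of the link speeds being argued about, here is a quick sketch of the per-direction bandwidth sums. The signalling rates (8 GT/s per lane for Gen3, 16 GT/s for Gen4) and the 128b/130b encoding are from the PCI-E specs rather than from anything above, and the function name is just mine; real-world throughput will be a bit lower thanks to protocol overhead.

```python
# Per-lane signalling rate in gigatransfers per second for each PCI-E generation.
GT_PER_S = {3: 8.0, 4: 16.0}
# Gen3 and Gen4 both carry 128 payload bits in every 130 bits on the wire.
ENCODING_EFFICIENCY = 128 / 130


def pcie_gb_s(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s for a given generation and lane count."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY * lanes / 8


for gen, lanes in [(3, 8), (3, 16), (4, 8)]:
    print(f"Gen{gen} x{lanes}: {pcie_gb_s(gen, lanes):.2f} GB/s")
```

The takeaway: 8 lanes at Gen4 gives the same theoretical bandwidth as 16 lanes at Gen3, and on a Gen3 board the card gets half of that.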
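To put rough numbers on the 'smaller chip, more per wafer' point, here is a sketch using a common dies-per-wafer approximation. Big assumptions here, none of which come from this post: a 300 mm wafer, die areas of roughly 158 mm² for Navi 14 and 251 mm² for Navi 10 (the commonly quoted figures), and a made-up wafer price used only to show the ratio. It also ignores defects, scribe lines and edge exclusion.

```python
import math


def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Approximate gross dies per wafer: wafer area / die area, minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)


HYPOTHETICAL_WAFER_COST = 10_000  # placeholder currency units, not a real price

navi14_dies = dies_per_wafer(158.0)  # assumed Navi 14 die area
navi10_dies = dies_per_wafer(251.0)  # assumed Navi 10 die area

print("Navi 14 dies per wafer:", navi14_dies)
print("Navi 10 dies per wafer:", navi10_dies)
print("Cost per die ratio (Navi 14 / Navi 10):",
      round((HYPOTHETICAL_WAFER_COST / navi14_dies) / (HYPOTHETICAL_WAFER_COST / navi10_dies), 2))
```

Because the wafer price cancels out of the ratio, the exact figure you plug in doesn't matter for the comparison; only the die sizes do.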
Okay, I think I am starting to lose focus now. I have been here for a long time and typing this crap purely from my Brain. But another subject of the Navi 14 GPU that is interesting with how it gets more from less (hardware, that is), is that Navi 14's geometry front-end only has two primitive units. That is, the GPU (as far as my understanding goes) can only draw (as in, rasterise, you're turning that 3-point shape into something that can be turned into a colour value on a grid; your monitor is just a grid of little coloured boxes after all), uh, where was I?
Oh yeah, Navi 14 can only draw two triangles per clock for primitive output. That is exactly half of what Polaris 10 can do; each of Polaris 10's 4 Shader Engines contains a Rasteriser that can handle a single primitive per clock cycle. So, what gives?
In another awesome twist of increased efficiency: Navi's geometry engine is a bit of a beast. Each primitive unit can actually accept two triangles into the geometry pipeline; using fast hardware-level methods it can essentially discard half of those if necessary; drawing the resultant (hopefully visible) triangle for rendering. In essence, Navi 14 can take the same number of triangles into the pipeline as Polaris 10, 4 per clock, but Navi 14 can only draw 2 of them. To put this into perspective, in geometry-heavy scenes, Navi 14 can likely match the geometry performance of Polaris 10 despite having 50% of the draw rate, because the GPU is still accepting 4 in, and is culling/discarding invisible geometry so quickly that it's not bottlenecked by the 2-triangle output.
Of course, an ideal situation for Navi 14 then, is in a scene with complex geometry (tessellation) and lots of hidden meshes. Meshes that the game might not occlude properly, or can't; perhaps a person places their hand on their face covering their nose - to the game this might still be "in the scene" and thus not occluded in software. But the GPU doesn't draw the person's nose because it isn't visible; the hand is in the way in the final, 2D image. It can do that extremely quickly in hardware using primitive shaders, which give the GPU fine control to analyse geometry and throw out these hidden (or degenerate) triangles.
Polaris 10 actually has something similar; the Primitive Discard Accelerator. This was added to address an issue that had been plaguing GCN-based GPUs for a long time, especially in the GCN1 Tahiti era (a GPU with only 2 raster engines in total); drawing hidden geometry. I also realised just now; I use semi-colons quite a lot; but you'll have to just go along with that; sorry. :D
Anyway, if you remember a game called Crysis 2, you might remember a certain underground ocean of tessellated water. It wasn't visible - the game engine wasn't occluding it, and it reduced performance on Radeon GPUs because they were spending precious frame-time drawing waves you couldn't even see. The PDA is a hardware-level (likely fixed-function) block that checks triangles for visible parts, but the Primitive Shader system on RDNA is vastly more efficient, and fully integrated into the geometry pipeline from the beginning at the shader level.
Of course, in scenes with lots of complex, visible geometry, Navi 14 is going to be at a disadvantage in hardware; as the chip can only technically draw 2 primitives (triangles) per clock... But that's the thing; Navi 14 runs at much, much higher clock speeds.
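Here is a very simplified steady-state model of that '4 in, 2 out' idea, using only the per-clock figures from this section. It is my own toy model (it ignores Polaris 10's discard hardware, queues, and everything else that happens in a real pipeline), so treat it as a sketch of the reasoning rather than a performance prediction.

```python
def scene_triangles_per_clock(intake_per_clock, draw_per_clock, culled_fraction):
    """Scene triangles consumed per clock, counting the ones that get culled.

    The front-end cannot take in more than intake_per_clock, and the survivors
    cannot leave faster than draw_per_clock.
    """
    visible_fraction = 1.0 - culled_fraction
    if visible_fraction == 0.0:
        return float(intake_per_clock)  # everything is culled, only intake limits us
    return min(float(intake_per_clock), draw_per_clock / visible_fraction)


for culled in (0.0, 0.5, 0.75):
    navi14 = scene_triangles_per_clock(4, 2, culled)     # 4 in, 2 drawn per clock
    polaris10 = scene_triangles_per_clock(4, 4, culled)  # 4 in, 4 drawn per clock
    print(f"{culled:.0%} culled -> Navi 14: {navi14}/clk, Polaris 10: {polaris10}/clk")
```

Once half or more of the incoming triangles are thrown away, the 2-per-clock output stops being the limit and both chips chew through the scene at the same per-clock rate in this model.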
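For the curious, the simplest kind of per-triangle check looks something like the sketch below: a generic back-face and zero-area test on screen-space coordinates. This is a textbook technique, not AMD's primitive shader or PDA logic (which I can't see inside), and I'm assuming a y-up coordinate system where counter-clockwise triangles face the camera. Note it does not handle the hand-over-nose case; that kind of occlusion needs depth information as well.

```python
def signed_area_times_two(a, b, c):
    """Twice the signed area of a 2D triangle; positive when a, b, c wind counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def should_cull(a, b, c):
    """Cull triangles that face away from the camera or cover no area at all."""
    return signed_area_times_two(a, b, c) <= 0


front_facing = ((0, 0), (1, 0), (0, 1))  # counter-clockwise
back_facing = ((0, 0), (0, 1), (1, 0))   # same points, clockwise
degenerate = ((0, 0), (1, 1), (2, 2))    # all three points on one line

print(should_cull(*front_facing))  # False: this one gets drawn
print(should_cull(*back_facing))   # True: facing away
print(should_cull(*degenerate))    # True: zero area, nothing to draw
```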
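A quick sketch of how much the clock speed claws back. The clock figures are assumptions on my part (reference boost clocks of roughly 1845 MHz for RX 5500 XT and 1340 MHz for RX 580, which real cards will not hold constantly), so take the output as a ballpark only.

```python
def peak_triangles_per_second(primitives_per_clock, clock_mhz):
    """Peak drawn triangles per second, in millions."""
    return primitives_per_clock * clock_mhz


# Assumed reference boost clocks, not figures from this post.
navi14_rate = peak_triangles_per_second(2, 1845)     # RX 5500 XT
polaris10_rate = peak_triangles_per_second(4, 1340)  # RX 580

print(f"Navi 14:    {navi14_rate} Mtris/s")
print(f"Polaris 10: {polaris10_rate} Mtris/s")
print(f"Navi 14 reaches {navi14_rate / polaris10_rate:.0%} of Polaris 10's peak draw rate")
```

So on paper the gap in worst-case draw rate is noticeably smaller than the 50% the per-clock numbers suggest.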
Navi Shaders :O
Focus... really losing it. Uh, continuing this trend of "Little Navi does more with less", a really important aspect is the shader processors, because they were entirely upgraded for bursty (hilariously: "Single threaded improvement") game-shader related workloads by addressing a utilisation issue that GCN had, due to a quirky 4-clock instruction issue latency. (GCN takes being parallel to the extreme: pure throughput at the expense of latency.) Long story short, GCN shader cores are often underutilised in games because not all of the SIMDs can be filled up (pipelined) effectively, because games often use small, bursty, sporadic shader code, some of which might need to complete in less than 4 clock cycles. You can read an in-depth Babble about the Shaders issue with Vega and how RDNA addressed that in my 10th Tech Babble; Here.
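Where the 4 clocks come from, as a quick sketch. The wave sizes and SIMD widths below are my understanding of AMD's public architecture material (GCN runs 64-wide wavefronts on 16-lane SIMDs; RDNA runs 32-wide wavefronts on 32-lane SIMDs and can still run 64-wide ones); they are not spelled out in this post, so treat them as assumptions.

```python
import math


def cycles_per_instruction(wave_size, simd_width):
    """Clock cycles to push one instruction for a whole wavefront through one SIMD."""
    return math.ceil(wave_size / simd_width)


print("GCN,  wave64 on SIMD16:", cycles_per_instruction(64, 16), "cycles")
print("RDNA, wave32 on SIMD32:", cycles_per_instruction(32, 32), "cycle")
print("RDNA, wave64 on SIMD32:", cycles_per_instruction(64, 32), "cycles")
```

A short shader with only a handful of dependent instructions finishes sooner when each instruction takes one clock to issue instead of four, which is the whole 'bursty workload' point.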
Because my Fingers hurt and I don't know what I'm doing anymore, I'm also hungry, I am going to stop now. If you got this far without skipping to the end, thanks, this is for you (and only you):
Love you <3