This is annotated to the best of my knowledge. I am not guaranteeing that it is 100% accurate, but I am reasonably confident in it. Please do your own research, or add a disclaimer, before citing my diagram. Thanks. :D
Processor Diagram - Navi 10
Navi 10 silicon is used in the RX 5700 XT, RX 5700 and RX 5600 XT in various configurations. It is the first chip of the new RDNA graphics architecture, and its layout in silicon is significantly different from previous GCN chips. Below is my attempt to annotate this big jump in GPU technology for AMD Radeon.
(die shot credit: Fritzchen Fritz)
40x Compute Unit (CU)
(RDNA 1.0) The processing core of the GPU; this is where the numbers are 'crunched'. Each Compute Unit on Navi 10 (RDNA1) consists of two vector SIMD units (that is, they perform Single Instruction, Multiple Data operations), each 32 ALU slots wide. The total vector width of the CU is 64, the same as previous-generation GCN, but the SIMD structure is different (GCN has 4x SIMD-16). A big motive for moving to dual SIMD-32 is that Navi 10 now dispatches Wavefronts (groups of tasks) in batches of 32, and a SIMD-32 can execute an instruction for a whole Wavefront in one clock cycle. This allows a new instruction to be issued every clock, whereas on GCN a Wavefront of 64 tasks takes 4 clocks to pass through a SIMD-16. (Each GCN CU works on 4 Waves at the same time, but a single Wave of 64 still takes 4 clocks to complete before its next instruction can be issued.) This allows Navi 10 to execute game shader code more quickly and achieve higher utilisation with, theoretically, fewer transistors than a GCN part.
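The clock counts above can be sanity-checked with a toy model. This is only a sketch: it assumes an instruction occupies a SIMD for the Wavefront width divided by the SIMD width (rounded up), and the function name `cycles_per_instruction` is made up for illustration. It ignores everything else a real pipeline does.

```python
import math


def cycles_per_instruction(wave_width: int, simd_width: int) -> int:
    """Clocks one SIMD needs to run one instruction across a whole Wavefront (toy model)."""
    return math.ceil(wave_width / simd_width)


gcn_wave64 = cycles_per_instruction(64, 16)   # GCN: Wave64 on a SIMD-16
rdna_wave32 = cycles_per_instruction(32, 32)  # RDNA: Wave32 on a SIMD-32
rdna_wave64 = cycles_per_instruction(64, 32)  # RDNA: legacy Wave64 on a SIMD-32

print(gcn_wave64, rdna_wave32, rdna_wave64)  # 4 1 2
```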
20x Work Group Processor
The Navi 10 silicon's Compute Units are arranged into groups of two called Work Group Processors (WGPs), which represent the smallest granular processing elements. Each WGP thus contains 128 Stream Processors and, with four SIMD-32 units across its two CUs, can execute four 32-wide Wavefronts in a single clock. GCN-legacy 64-wide Wavefronts are supported, but incur a 2-cycle instruction latency. (This is still 2x better than GCN on Wave64.)
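For anyone who wants to check the unit counts, here is the arithmetic written out. It only multiplies the figures already given in this annotation; the constant names are my own.

```python
SIMD_WIDTH = 32    # ALU slots per SIMD unit
SIMDS_PER_CU = 2   # dual SIMD-32 per Compute Unit
CUS_PER_WGP = 2    # Compute Units per Work Group Processor
WGP_COUNT = 20     # Work Group Processors on full Navi 10

alus_per_cu = SIMD_WIDTH * SIMDS_PER_CU   # 64
alus_per_wgp = alus_per_cu * CUS_PER_WGP  # 128
total_cus = WGP_COUNT * CUS_PER_WGP       # 40
total_alus = alus_per_wgp * WGP_COUNT     # 2560

print(alus_per_cu, alus_per_wgp, total_cus, total_alus)  # 64 128 40 2560
```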
2x Shader Engine
On the above diagram, each set of 10 Work Group Processors and its associated front-end (Prim/raster) sits in one of two distinct partitions, one on each side of the chip. These are the Shader Engines. On Navi 10 (RDNA1), each SE can now handle two primitives per clock, compared to only one on GCN designs. However, the full Navi 10 silicon contains only two SEs, so the raw triangle output remains at 4 tri/clock. Advanced primitive shading techniques do allow fast triangle culling in hardware, however, and as such the GPU can accept up to 8 triangles into the pipeline per clock.
8 x 32-bit (256-bit) GDDR6 PHY
These are the physical connections to the traces on the PCB around the GPU chip that link it with its external memory packages. Each block here connects to a single GDDR6 memory chip via a 32-bit interface.
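The bus width translates into peak bandwidth with simple arithmetic, sketched below. Note the assumption: the 14 Gbps per-pin data rate is not from the die shot or this annotation; it is the commonly quoted GDDR6 speed for the RX 5700 XT, and other configurations may differ. The function name is mine.

```python
PHY_BLOCKS = 8       # GDDR6 PHY blocks on the die
BITS_PER_BLOCK = 32  # interface width per block

bus_width_bits = PHY_BLOCKS * BITS_PER_BLOCK  # 256


def peak_bandwidth_gb_s(bus_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak bandwidth in GB/s: pins * Gbit/s per pin / 8 bits per byte."""
    return bus_bits * data_rate_gbps / 8


# ASSUMPTION: 14 Gbps per pin (RX 5700 XT retail spec, not stated above).
print(bus_width_bits, peak_bandwidth_gb_s(bus_width_bits, 14.0))  # 256 448.0
```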
2x Memory Controllers
These logic circuits are responsible for handling memory accesses to and from the external GDDR6 memory chips. They connect directly to the GDDR6 PHY blocks and the Render Back-Ends, which are typically very bandwidth heavy in their operations.
16x Render Back-End (ROP)
The Render Back-End is one of the final stages in the graphics pipeline. These blocks are tasked with turning all the crunched 3D data into a final colour that represents a pixel to be displayed on the screen. Each individual block highlighted here contains an RBE which can work on 4 pixels per clock, so the 16 blocks together give the 64 pixel/clock throughput of the Navi 10 graphics processor. The RBE layout is similar to previous-generation GCN designs.
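As a quick check of the pixel throughput figure, and to show how it turns into a fill rate, here is a small sketch. The 1900 MHz clock is purely a placeholder I picked for the example, not a measured or official figure; plug in whatever clock your card actually runs at.

```python
RBE_COUNT = 16            # Render Back-Ends on full Navi 10
PIXELS_PER_RBE_CLOCK = 4  # pixels each RBE handles per clock

pixels_per_clock = RBE_COUNT * PIXELS_PER_RBE_CLOCK  # 64


def fill_rate_mpixels_s(clock_mhz: int) -> int:
    """Theoretical pixel fill rate in Mpixels/s at a given core clock in MHz."""
    return pixels_per_clock * clock_mhz


# ASSUMPTION: 1900 MHz is a hypothetical example clock.
print(pixels_per_clock, fill_rate_mpixels_s(1900))  # 64 121600
```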
2x(?) L2 Cache Partition
These large blocks of SRAM contain fast on-chip memory. The L2 cache is essentially a buffer for the ROPs and the WGPs before their accesses go out to external memory. Navi 10 has 4 MiB of on-chip L2 cache.
2x(?) L1 Cache Partition
Navi 10 has a large, dedicated L1 graphics cache attached to each Shader Engine (?), which further buffers memory accesses in latency-sensitive workloads like gaming.
4x Raster/Primitive Unit
These blocks are responsible for taking 3D data and turning it into a Primitive that can be sent to the Render Back-End to be rasterised into pixel data. Geometry setup, tessellation and viewport transformation all happen here. Each "Prim Unit" can output a single primitive per clock, giving Navi 10 the familiar 4 tri/clock throughput of previous larger GCN GPUs. However, advanced geometry culling in hardware allows up to 8 triangles to be accepted, with up to 50% of degenerate (hidden) geometry being discarded early in the pipeline.
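The relationship between the 8 tri/clock intake, the 4 tri/clock output and the 50% culling figure is just a ratio, sketched here with the numbers from this annotation. The variable names are mine, and this says nothing about how the hardware actually decides what to cull.

```python
PRIM_UNITS = 4           # Raster/Primitive Units on full Navi 10
OUTPUT_PER_PRIM_UNIT = 1 # primitives each unit can output per clock
ACCEPTED_PER_CLOCK = 8   # triangles the front-end can take in per clock

output_per_clock = PRIM_UNITS * OUTPUT_PER_PRIM_UNIT  # 4

# Fraction of incoming triangles that must be culled for the output
# stage to keep pace with the maximum intake rate.
cull_fraction = 1 - output_per_clock / ACCEPTED_PER_CLOCK

print(output_per_clock, cull_fraction)  # 4 0.5
```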
This central logic block contains the circuitry responsible for issuing work to all the functional units of the Graphics Processor. Hardware Schedulers responsible for Asynchronous Compute are also located here. On Navi 10, this block (or around here) also contains the primary Geometry processor which is linked to the 4 Primitive Units.
This large block on the left of the die diagram contains the chip's Input/Output blocks and functionality. Within this section are the Display PHY (wires to display outputs, DP, HDMI, etc.), the Display Controllers for those, the Video Engine and its codecs (fixed-function encode/decode in hardware), and the PCI-E 4.0 PHY and controller. The Navi 10 chip has a 16-lane PCI-E 4.0 interface to communicate with the system. There appears to be a large amount of SRAM in this block, which could be buffers/caches for the I/O, or potentially another slice of L2 (unsure at the moment).
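To put the 16-lane PCI-E 4.0 link in perspective, here is a rough bandwidth estimate. The 16 GT/s per-lane rate and 128b/130b line encoding come from the PCI-E 4.0 standard rather than from the die shot, and the result is a theoretical per-direction ceiling before protocol overhead. The function name is mine.

```python
LANES = 16               # PCI-E lanes on Navi 10
GT_PER_S_PER_LANE = 16   # PCI-E 4.0 raw rate per lane, in GT/s
ENCODING = 128 / 130     # 128b/130b line encoding efficiency


def pcie_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return lanes * GT_PER_S_PER_LANE * ENCODING / 8


print(round(pcie_bandwidth_gb_s(LANES), 2))  # 31.51
```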