(Tech Stuff) Zen1+ (2950X) vs Zen2 (3900XT) IPC and L3 Hit Rate in OpenPandemics C-19 (WCG)
I recently put these two machines on an all-OPN diet (OpenPandemics, currently C-19) on World Community Grid to collect some performance statistics. The numbers come from a performance metrics tool written by a friend, which reads the Zen processor's internal performance counters and displays them in a handy interface.
Before I continue, I should state that there are some significant differences between these two systems, so this is absolutely not an 'apples to apples' comparison between Zen1+ and Zen2 cores in identical scenarios. The differences are:
The Ryzen Threadripper 2950X (16 cores) has a NUMA memory subsystem that uses two dual-channel memory controllers to access memory over four channels in total. During this test the processor is configured to expose itself as UMA, so memory accesses are balanced across both dies and all four channels, which results in much higher average latency: over 100 ns versus 70-80 ns on the 3900XT. On the flip side, thanks to the four channels, it has higher sustained memory bandwidth for each thread in use.
Furthermore, the 2950X's cores are arranged as 4 cores per 8 MB of L3 cache in each CCX, versus 3 cores per 16 MB on the 3900XT. The doubled L3 is part of the Zen2 architecture, but the 3900XT also has one core per CCX disabled while retaining the full L3, whereas the 2950X has all four cores per CCX enabled. In effect, the 3900XT has more than twice the L3 cache per core of the 2950X (about 5.3 MB versus 2 MB), which goes beyond the architectural difference alone. A fairer comparison would have been the 2920X (3 cores per 8 MB), but I do not have one.
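To put numbers on the cache-per-core point, here is a small Python sketch of the arithmetic. It only uses the per-CCX core counts and L3 sizes quoted above; the function and variable names are my own and purely illustrative.

```python
def l3_per_core_mb(l3_mb_per_ccx: float, active_cores_per_ccx: int) -> float:
    """L3 cache available per active core within one CCX, in MB."""
    return l3_mb_per_ccx / active_cores_per_ccx


# Per-CCX figures as stated in the post.
tr_2950x = l3_per_core_mb(8, 4)     # Zen1+, all four cores enabled
r9_3900xt = l3_per_core_mb(16, 3)   # Zen2, one core per CCX disabled
tr_2920x = l3_per_core_mb(8, 3)     # Zen1+, one core per CCX disabled

# Ratio against the 2950X mixes the architectural doubling with the
# disabled core; ratio against the 2920X isolates the doubling.
ratio_vs_2950x = r9_3900xt / tr_2950x
ratio_vs_2920x = r9_3900xt / tr_2920x

print(f"2950X : {tr_2950x:.2f} MB/core")
print(f"3900XT: {r9_3900xt:.2f} MB/core")
print(f"2920X : {tr_2920x:.2f} MB/core")
print(f"3900XT vs 2950X: {ratio_vs_2950x:.2f}x")
print(f"3900XT vs 2920X: {ratio_vs_2920x:.2f}x")
```

Against the 2920X the ratio comes out at exactly the 2x that Zen2's larger L3 provides, which is why it would have been the cleaner comparison.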
With that out of the way, it's impressive to see the 3900XT achieving nearly 44% more IPC per thread than the 2950X, with an L3 hit rate of nearly 90% versus the 2950X's still reasonable ~75%. Zen2's doubled FPU pipes (256b from 128b), doubled micro-op cache (4K entries from 2K) and additional L3 cache are definitely helping extract more parallelism from the core.
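For anyone wondering how figures like these are derived from raw counters, here is a minimal Python sketch. I am assuming the usual definitions (IPC as retired instructions over core cycles, hit rate as hits over total L3 lookups); the function names and the sample counter values are hypothetical placeholders chosen to give round numbers, not readings from either machine or from the tool itself.

```python
def ipc(instructions_retired: int, core_cycles: int) -> float:
    """Instructions per cycle over a sampling interval."""
    return instructions_retired / core_cycles


def l3_hit_rate(l3_hits: int, l3_misses: int) -> float:
    """Fraction of L3 lookups that were served from the L3."""
    return l3_hits / (l3_hits + l3_misses)


def uplift_percent(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Placeholder counter values for illustration only (NOT measured data).
ipc_a = ipc(instructions_retired=1_000, core_cycles=1_000)
ipc_b = ipc(instructions_retired=1_440, core_cycles=1_000)

hit_a = l3_hit_rate(l3_hits=750, l3_misses=250)
hit_b = l3_hit_rate(l3_hits=900, l3_misses=100)

print(f"IPC uplift: {uplift_percent(ipc_b, ipc_a):.1f}%")
print(f"L3 hit rate: {hit_a:.0%} vs {hit_b:.0%}")
```

Note that the uplift is relative: a thread going from 1.00 to 1.44 instructions per cycle is the kind of change a "+44%" figure describes.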