Date: February 1, 2019
Author(s): Rob Williams
Performance regression issues in Windows on AMD’s top-end Ryzen Threadripper CPUs haven’t gone unnoticed by those who own them, and six months after launch, the issues remain. Fortunately, there’s a new tool making the rounds that can help smooth out those regressions. We’re taking an initial look.
AMD’s 32-core Ryzen Threadripper 2990WX is an impressive chip for many reasons, but as we quickly discovered at its launch, not all workloads are going to utilize it (or the 24-core 2970WX) efficiently. When those 32 cores are used properly, the gains can be huge, but when software can’t figure out where to put the next thread, bottlenecks will instantly rear their ugly heads.
For general use, performance regression shouldn’t be noticeable until the CPU begins assigning tasks to threads that can’t access the memory directly, and instead have to travel through another CCX module to send or get what they need. The problem is fairly easy to understand, but the fix is a bit more complicated.
Last fall, AMD’s Robert Hallock wrote about Dynamic Local Mode, a new feature of Ryzen Master software that moves the system’s most demanding threads to the cores that have local memory access. According to AMD, this can greatly improve gaming (and Euler) performance, as seen in this official performance chart:
We didn’t touch gaming for this article, because application performance strikes us as more important (since time to tackle everything is a definite issue). Nonetheless, not long after we posted a look at Windows performance with the 2990WX, we followed up with a look at its Linux performance, and were surprised to see much better scalability on the whole.
Not too long after the 2990WX’s launch, the general consensus from industry people we talked to was that Microsoft was to blame for the performance regressions. To be more specific, the Windows kernel needs an update to better accommodate many-core chips, especially those that have a unique design like the top two Threadripper chips. Since the same kind of regression isn’t easily seen in Linux, there may be some truth to that.
No one at AMD we’ve talked to is willing to pin the blame on Microsoft, and we don’t blame them, since Microsoft is clearly an important partner. We’re under the assumption that we could see a proper fix deployed to Windows 10 within the year, which would likely mean the October update. There’s another major Win10 release in April, but it might take a mini-miracle to see the fix appear in that build.
Ahead of CES, Wendell from Level1Techs released a video tackling the regression issue with EPYC, heaping praise on a new tool called Coreprio. The timing of this video was nutty, and sleep was lost the night before an early-morning CES flight trying to quickly test things out. Alas, we couldn’t get very far, so more thorough testing had to wait until Sin City madness subsided. Here’s Wendell’s video:
While AMD’s Dynamic Local Mode can help with gaming, and perhaps other situations, Wendell focused on a new tool called Coreprio, developed by Bitsum. Coreprio can smartly assign thread affinities on and after application load, aiming to make sure that every important thread is going to be on a core with local memory access.
The “32” shown in the screenshot indicates that 32 cores will have affinities assigned to them, while the other half act as threads. Depending on your situation, you may need to fiddle with these options, but in our testing, default settings with the Enable Dynamic Local Mode option checked worked best.
IndigoBench was the main benchmark focus in Level1Techs’ video, an interesting choice since we used that one for our Linux testing of the same chip. In this article, we test a whack of others, to see where potential improvements can be found. In addition to stock settings, the 2990WX was tested using AMD’s Dynamic Local Mode, as well as Bitsum’s Coreprio.
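The idea of steering threads toward cores with local memory access can be sketched in a few lines. This is only an illustration, not Coreprio’s actual implementation: the layout (four dies of eight cores with two threads each), the die-by-die numbering of logical CPUs, and the choice of dies 0 and 2 as the ones with memory attached are all assumptions made for the example, and the function names are hypothetical.

```python
from typing import Iterable, List

# Assumed layout for illustration only: 4 dies x 8 cores x 2 threads = 64 logical CPUs.
DIES = 4
CORES_PER_DIE = 8
THREADS_PER_CORE = 2

# Assumption: only these dies have memory attached directly.
LOCAL_MEMORY_DIES = (0, 2)


def logical_cpus_for_die(die: int) -> range:
    """Logical CPU indices on one die, assuming CPUs are numbered die by die."""
    per_die = CORES_PER_DIE * THREADS_PER_CORE
    return range(die * per_die, (die + 1) * per_die)


def affinity_mask(cpus: Iterable[int]) -> int:
    """Build a bitmask with one bit set per allowed logical CPU."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


# Collect every logical CPU that sits on a die with local memory access.
local_cpus: List[int] = [
    cpu for die in LOCAL_MEMORY_DIES for cpu in logical_cpus_for_die(die)
]
local_mask = affinity_mask(local_cpus)

print(f"{len(local_cpus)} logical CPUs with local memory, mask {local_mask:#x}")
```

A bitmask like this is the kind of value that operating system affinity calls generally accept, so a tool could apply it to the threads it considers most demanding. The count of logical CPUs here falls out of the assumed layout and should not be read as an explanation of the “32” setting in Coreprio.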
|AMD TR4 Test Platform| |
|---|---|
|Processor|AMD Ryzen Threadripper 2990WX (3.0GHz, 32C/64T)|
|Motherboard|MSI MEG X399 Creation; tested with BIOS 1.20 (Nov 14, 2018)|
|Memory|G.SKILL TridentZ (F4-3200C14-8GTZ) 8GB x 4; operates at DDR4-3200 14-14-14 (1.35V)|
|Graphics|NVIDIA TITAN Xp (12GB; GeForce 398.82)|
|Storage|WD Blue 3D NAND 1TB (SATA 6Gbps)|
|Power Supply|Cooler Master Silent Pro Hybrid (1300W)|
|Chassis|Cooler Master MasterCase H500P Mesh|
|Cooling|Cooler Master Wraithripper Tower Cooler|
|Et cetera|Windows 10 Pro (Build 17763), Ubuntu 18.04 (4.20.5 kernel)|
Most performance articles are straightforward, in that scaling is usually obvious between one product and another. Comparing a product against itself is a little harder, likely for obvious reasons. With this kind of testing, it’s really challenging to get results that instill a great deal of confidence. That implies that extra test runs are required, and believe us when we say there were too many test runs involved for this one.
The first batch of results below is going to exhibit almost no differences between the modes at all, which was largely to be expected.
In most cases above, Dynamic Local Mode and Coreprio both hurt performance more than they help. That said, the performance degradation caused by these solutions is barely noticeable in the real world. Contrast that with the regression problem, which is very noticeable, as we’ll see below:
In simple encodes of one format to another, the 2990WX performs extremely well. With a real video project, such as one of our YouTube videos, that’s not so much the case. With default settings, and also Dynamic Local Mode, absolutely no improvements can be seen. Enable Coreprio, however, and the YouTube project video suddenly encodes in 4m 42s, rather than 7m 50s.
As mentioned before, this kind of testing is very tedious, so these encodes were tested many times over on each setting to verify their accuracy. Without any help, Premiere Pro would hit 7m 50s (give or take 1 or 2 seconds) every single time, but as soon as Coreprio was enabled, that dropped significantly.
That said, Coreprio is unfortunately not perfect. Even with it in use, there were occasions when we’d still see that 7m 50s result, but it was rare. We’re not sure of the logic behind whether or not it will work, but we never found a perfect recipe that would ensure Coreprio would assign the proper threads the first time, every time.
Nonetheless, Coreprio also managed to help out with a particular KeyShot project:
The project in question here is a bathroom scene, which features lots of glass and mirrors. At stock, and even with DLM, the performance is poor in that project, while it’s just fine with another (a simpler auto render). Meanwhile, Coreprio managed to almost cut the interior render time in half.
Again, this result was not bullet-proof. There were occasions when Coreprio would be loaded, and the first result would be 2m 38s, while another time, it may have been 1m 20s, or thereabouts. Meanwhile, most often, the result will be properly accelerated, giving us a dramatic reduction in render time.
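To put the timings quoted above in perspective, here is a small sketch that converts them to seconds and works out the time saved and the speedup. The helper names are made up for this example, and it treats 2m 38s as the unaccelerated KeyShot result and 1m 20s as the accelerated one, which is our reading of the numbers rather than a separately measured pair.

```python
from typing import Tuple


def to_seconds(stamp: str) -> int:
    """Convert a string such as '7m 50s' into a number of seconds."""
    minutes, seconds = stamp.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def reduction(before: str, after: str) -> Tuple[float, float]:
    """Return (percent of time saved, speedup factor) going from before to after."""
    b, a = to_seconds(before), to_seconds(after)
    return (1 - a / b) * 100, b / a


# Timings taken from the text; labels are for display only.
results = [
    ("Premiere Pro YouTube project", "7m 50s", "4m 42s"),
    ("KeyShot interior render", "2m 38s", "1m 20s"),
]

for label, before, after in results:
    saved, speedup = reduction(before, after)
    print(f"{label}: {saved:.1f}% less time, about {speedup:.1f}x as fast")
```

The Premiere Pro encode works out to 470 seconds dropping to 282 seconds, which is 40% less time, while the KeyShot figures come out just shy of a halving, in line with the "almost cut in half" description.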
Because we had limited benchmarks that showed proper advantages when using DLM or Coreprio, we loaded up GeekBench to see if the myriad tests it runs would result in different scaling. Across stock, DLM, and Coreprio, there are few differences, although notably, Coreprio delivered the best result overall (averaged from five runs). We then loaded up Linux, and decided to see how things fared there. Welp:
In GeekBench, simply using Linux will double the multi-core score. Even the single-threaded performance is somehow improved. This is the kind of boost that seems very unrealistic, but after searching for other 2990WX results on GeekBench’s database, we saw results that lined up with ours. Digging into the results a bit deeper, differences between the OSes are clear:
Here are some specific results online if you want to dig deeper:
Does all of this mean that Linux is the best OS for a chip like the 2990WX? It’s really hard to believe otherwise. To base that off of GeekBench alone would be nonsense, but we have other testing experience to back up those opinions. Blender almost always performs better in Linux than in Windows, so the fact that a many-core chip works better in the penguin OS isn’t a huge surprise.
Across all of this testing, it’s clear that if you are not personally affected by a performance regression, then there’s no need to jump on either Dynamic Local Mode, or Coreprio. Based on our testing, and AMD’s own, it seems like Dynamic Local Mode is best-suited to improve gaming performance, but not necessarily gaming CPU performance. That is to say, frame rates may increase, but the actual CPU crunching might not change too much, based on our 3DMark physics results.
Fortunately, using either DLM or Coreprio won’t hurt your performance in other areas too much, but it’s important to note that they can in fact negatively impact them. On the flipside, if you bought a 2990WX (or 2970WX) and are running up against a regression, you shouldn’t hesitate in giving the tool a test. Don’t like the result, or don’t need it active all of the time? All you need to do is simply stop the service from within the applet, and you’ll be back to normal.
If you give Coreprio or DLM a go, and have interesting experiences to share, please feel free to leave a comment. Again, we’re having a hard time believing that the Windows fix for this regression issue will hit Windows 10 soon, but we can hope something will speed the process up. Until then, know that Linux takes better advantage of the chip than Windows, and that Coreprio can work as a stop-gap until an official fix drops.
Copyright © 2005-2019 Techgage Networks Inc. - All Rights Reserved.