It’s been long rumored that NVIDIA designs PhysX to run in a throttled state on a CPU in order to have more impressive results on its own GeForce hardware, and after seeing the research done by David Kanter of Real World Technologies, it’s going to be hard to continue calling it a rumor. By analyzing the finest details of a running PhysX-enabled application, David was able to better understand how it was doing its work on the CPU.
The fact of the matter is, NVIDIA optimizes its PhysX technology for its own hardware, and only the most rabid anti-NVIDIA zealot would object to that. If I were to create software to pair up with my hardware, I’d sure be optimizing both to the best of my abilities as well. But that’s not the issue here. Rather, the problem is that NVIDIA doesn’t keep the playing field fair, because it deliberately runs less-efficient code on the CPU to get the same job done.
For those who aren’t familiar with “x87”, it’s the floating-point subset of x86, originally implemented in a separate co-processor and later built straight into the CPU itself. Unlike SSE, x87 isn’t relevant in today’s computing, and proof of that is in the fact that the first processor to use it was Intel’s 8087… released in 1980. As David found out, Intel began discouraging usage of x87 as soon as SSE came to market, because it’s far less efficient, in much the same way that the upcoming AVX instruction set will make SSE look inefficient by comparison.
Just what does x87 have to do with anything? Does “PhysX87” give you an idea? That’s right… according to David’s research, which enlisted the help of Intel’s VTune performance analyzer, we can see that when PhysX is run on a CPU, x87 instructions are heavily used, while SSE instructions are not. This of course results in a lack of efficiency, meaning much slower performance.
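To give a rough idea of why the choice of instructions matters, here is a minimal sketch in C. It is not PhysX code, and the function names `add_scalar` and `add_sse` are made up for illustration. The first function adds two arrays one float at a time, which is the shape of work a compiler turns into one floating-point add per element when it targets x87 (for example with GCC's `-m32 -mfpmath=387`). The second uses SSE intrinsics to add four floats per instruction.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86 / x86-64 only */

/* Hypothetical scalar version: one float add per loop iteration.
   When built for x87, each element costs its own load, add and store. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Hypothetical SSE version: a 128-bit register holds four floats,
   so a single _mm_add_ps performs four additions at once. */
void add_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: four elements per iteration, unaligned loads/stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* Tail: handle the remaining 0-3 elements one at a time. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Both functions produce the same results; the SSE one simply issues roughly a quarter as many arithmetic instructions. Real-world gains depend on memory access, data layout and the processor, so this only illustrates the idea and makes no claim about actual PhysX speedups.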
There’s no excuse NVIDIA could make for sticking to x87, but the motive is obvious, so it hardly matters. If PhysX made use of SSE instructions, like it should, then performance could be boosted by roughly 2 – 4x, depending on the processor. To add to the problem, PhysX on the CPU isn’t multi-threaded by default, but according to NVIDIA, it can be if the game developer wants to make it so. In an ideal situation, meaning that PhysX was compiled to support SSE and had improved multi-threading, we might see PhysX running just as fast on a CPU as it does on the GPU. Unfortunately, without having access to PhysX’s source code, such an apples-to-apples test isn’t going to happen anytime soon.
In Cryostasis, there is only one process of significance, cryostasis.exe itself; all others constitute roughly 2% of instructions retired and 10% of the cycles. Strangely enough, Cryostasis uses a tremendous amount of x87 instructions; roughly 31% of the instructions retired are x87. There are plenty of x87 uops, but hardly any SSE floating point uops, roughly a 100:1 ratio. Perhaps at finer granularity, it will be clear exactly where these x87 instructions are coming from. Despite the x87 instructions, the IPC is a respectable 1.15.