Imagine an alpinist climbing within 100 feet of a summit and being told he could go no farther without a special pass. Or a chef being informed that the transcendent ingredient in her signature dish could not be obtained because of a punitive trade agreement. Or Steph Curry being told by a coach that he mustn’t shoot beyond 23 feet from the basket.
Exaggerated perhaps, but it gives an idea of how developers of multi-threaded applications must feel when they run into Microsoft Windows restrictions that prevent significant performance scaling beyond 64 logical processors.
Fortunately, Microsoft developed a concept called processor groups that enables developers of multi-threaded applications to transcend the previous restrictions. The functionality, introduced with Windows 7 and Windows Server 2008 R2, is available in 64-bit versions of those operating systems and later releases, including Windows 10.
Beyond 64 Threads
So, why the 64-thread limit to begin with, and why did it take until just recently to implement scaling for more than 64 logical processors?
“Long ago, the thought was ‘surely no system would have more than 64 threads’, because back then each system had very few cores,” says Alex Shows, chair of the SPEC Workstation Performance Characterization (SPECwpc) subcommittee, developer of the SPECworkstation benchmark. “As the size of transistors shrank with each generation of hardware, it became possible to pack more and more processors into a single system.”
Microsoft defines processor groups as “a static set of up to 64 logical processors that is treated as a single scheduling entity.” When processor groups are implemented within multi-threaded code, systems with more than 64 logical processors will see better scaling due to improved partitioning and affinitization. Processor affinity enables the binding and unbinding of a process or a thread to a CPU or a range of CPUs, so that the process or thread will execute only on the designated CPU or CPUs.
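To make the idea of affinity within a group concrete, here is a minimal Python sketch. It mirrors the Win32 GROUP_AFFINITY structure with ctypes and, on Windows only, passes it to SetThreadGroupAffinity. The helper names make_group_affinity and bind_current_thread are hypothetical and chosen for illustration; on other platforms the binding step is skipped.

```python
import ctypes
import sys


class GroupAffinity(ctypes.Structure):
    """Illustrative mirror of the Win32 GROUP_AFFINITY structure."""
    _fields_ = [
        ("Mask", ctypes.c_size_t),         # KAFFINITY: one bit per logical processor in the group
        ("Group", ctypes.c_uint16),        # WORD: processor group number
        ("Reserved", ctypes.c_uint16 * 3),
    ]


def make_group_affinity(group, cpus):
    """Build an affinity for the given group-relative CPU indices (0..63)."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 64:
            raise ValueError("a group holds at most 64 logical processors")
        mask |= 1 << cpu
    return GroupAffinity(Mask=mask, Group=group)


def bind_current_thread(affinity):
    """Bind the calling thread on Windows; return False where unsupported."""
    if sys.platform != "win32":
        return False  # assumption: non-Windows hosts simply skip the binding
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentThread.restype = ctypes.c_void_p
    kernel32.SetThreadGroupAffinity.argtypes = [
        ctypes.c_void_p,
        ctypes.POINTER(GroupAffinity),
        ctypes.POINTER(GroupAffinity),
    ]
    kernel32.SetThreadGroupAffinity.restype = ctypes.c_int
    ok = kernel32.SetThreadGroupAffinity(
        kernel32.GetCurrentThread(), ctypes.byref(affinity), None
    )
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return True


# Example: the first four logical processors of group 1.
affinity = make_group_affinity(group=1, cpus=range(4))
print(affinity.Group, bin(affinity.Mask))  # 1 0b1111
```

The mask has one bit per logical processor, which is why a single group cannot describe more than 64 of them on a 64-bit system.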
Anshu Arya, vice chair and director of compute research for the SPECwpc subcommittee, explains how processor groups work.
“For any system with more than 64 logical threads, Windows will evenly divide the threads into processor groups such that no group has more than 64 threads. On a dual-socket system with two 28-core CPUs and 112 total threads, for example, Windows will create two processor groups, each with 56 threads. On a single socket system with 64 cores and 128 threads, two processor groups will be created, each with 64 threads.”
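The arithmetic Arya describes can be sketched in a few lines of Python. This models only the even split given in the quote; the function name split_into_groups is hypothetical, and the actual assignment made by Windows may also take hardware topology into account.

```python
MAX_GROUP_SIZE = 64  # upper bound on logical processors per group


def split_into_groups(logical_processors, max_group_size=MAX_GROUP_SIZE):
    """Evenly divide logical processors into the fewest groups of <= 64."""
    if logical_processors < 1:
        raise ValueError("need at least one logical processor")
    # Ceiling division gives the smallest number of groups that fits.
    groups = -(-logical_processors // max_group_size)
    base, extra = divmod(logical_processors, groups)
    # Spread any remainder one processor at a time over the first groups.
    return [base + 1 if i < extra else base for i in range(groups)]


print(split_into_groups(112))  # [56, 56]  dual-socket, 2 x 28 cores
print(split_into_groups(128))  # [64, 64]  single socket, 64 cores
print(split_into_groups(48))   # [48]      fits in one group
```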
Demanding professional applications such as the open-source 3D content generator Blender were among the first to implement processor groups.
Designed For Demanding Applications
Arya says that the early implementers of processor groups have been developers of professional applications whose users are most affected by the 64-thread limit.
“Some of the most demanding professional applications, such as Blender, already support processor groups because some of their users demanded performance on high-end, dual-socket workstations. Applications using MPI or other multi-process APIs naturally circumvent processor group limits, while others are catching up quickly, especially if the application is already multi-threaded.”
Migrating to processor groups is relatively simple for those applications that have an existing multi-threaded library, according to Shows.
“For some applications it’s as easy as updating to the latest version of their multi-threading library. The library then handles support for processor groups in a way that requires no change to the application. Other applications, particularly those that implement their own multi-threading layers, need to include processor groups in their code that interrogates the system to identify logical processor counts. They also have to update the partitioning code that divides work among available resources.”
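The two changes Shows mentions, interrogating the system and re-partitioning the work, might look roughly like the following Python sketch. On Windows it calls the Win32 functions GetActiveProcessorGroupCount and GetActiveProcessorCount through ctypes; elsewhere it falls back to a single pseudo-group, which is an assumption made so the example runs anywhere. The names logical_processors_per_group and partition_work are hypothetical.

```python
import os
import sys


def logical_processors_per_group():
    """Return the active logical processor count of each processor group."""
    if sys.platform == "win32":
        import ctypes
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetActiveProcessorGroupCount.restype = ctypes.c_uint16
        kernel32.GetActiveProcessorCount.argtypes = [ctypes.c_uint16]
        kernel32.GetActiveProcessorCount.restype = ctypes.c_uint32
        groups = kernel32.GetActiveProcessorGroupCount()
        return [kernel32.GetActiveProcessorCount(g) for g in range(groups)]
    # Assumption: other platforms are treated as one group holding every CPU.
    return [os.cpu_count() or 1]


def partition_work(num_items, group_sizes):
    """Split range(num_items) into contiguous (group, start, stop) chunks,
    sized in proportion to each group's logical processor count."""
    total = sum(group_sizes)
    chunks, start, assigned = [], 0, 0
    for group, size in enumerate(group_sizes):
        assigned += size
        stop = num_items * assigned // total
        chunks.append((group, start, stop))
        start = stop
    return chunks


# Example with the 128-thread layout described earlier (two groups of 64).
print(partition_work(1000, [64, 64]))  # [(0, 0, 500), (1, 500, 1000)]
print(partition_work(1000, logical_processors_per_group()))
```

Code that sizes its thread pool from a single processor count and never looks at groups would, on a system with more than 64 logical processors, leave part of the machine idle.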
Processor Groups In SPECworkstation
While the vast majority of workstations currently have fewer than 64 logical processors, ISVs are trying to be proactive in helping users take advantage of systems with very high core counts as they become more commonplace on users’ desktops.
In anticipation of this movement, the SPECwpc subcommittee has implemented processor group awareness within most of the CPU workloads in version 3.0.4 of its industry-standard SPECworkstation benchmark. The update is available in two forms: as a minor-version patch for those already using SPECworkstation 3.0.3, and as a full benchmark installation for new users or those using a version before 3.0.3. Whether users download the full 3.0.4 benchmark or just the patch, results remain comparable between 3.0.4 and all previous 3.0 versions.
SPECworkstation 3 testing results on the SPEC website will include the minor version information in order to disclose up front what version was used to obtain a published result.
Tests Verify Better Scaling
CPU tests on workloads using processor groups in SPECworkstation 3.0.4 show a wide range of scaling improvements compared to results from previous versions. Results for this article are based on a 64GB DDR4-3200 quad-channel system with a 64-core, 128-thread CPU running on a Windows 10 Pro 1909 operating system. Scoring is derived from run times for each test that are then composited based on a reference machine, yielding a “bigger is better” result.
Under the testing used for this article, SPECworkstation 3.0.4 workloads containing algorithms with nearly perfect CPU scaling typically double in performance when running with processor groups. Other CPU workloads show gains ranging from less than one percent for lightly threaded workloads to more than 90 percent for heavily threaded workloads. Overall average gains are more than 64 percent for the 14 workloads in which processor group support was implemented.
The chart accompanying this article shows gains for three different methods of implementing processor groups in SPECworkstation 3.0.4:
- Convolution – a workload that simulates a mathematical method of combining two signals to form a third signal – was modified from OpenMP to the Microsoft Parallel Patterns Library to support processor groups. Performance increased from 9.79 to 17.67, a more than 80 percent gain.
- Handbrake – an open-source media encoder – was modified to launch additional processes (one per processor group) when needed. Performance increased from 4.70 to 8.58, a more than 82 percent gain.
- Kirchhoff – a wave equation used to calculate the back propagation of a seismic wave field for the oil and gas industry – was modified to correctly assign affinity across multiple processor groups. Performance increased from 8.32 to 14.16, a more than 70 percent gain.
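The percentage gains follow directly from the before and after scores listed above. A short Python check, using only those published numbers (the function name percent_gain is just a label for this example):

```python
def percent_gain(before, after):
    """Relative improvement of a bigger-is-better score, in percent."""
    return (after / before - 1.0) * 100.0


# Scores without and with processor group support, as quoted above.
scores = {
    "Convolution": (9.79, 17.67),
    "Handbrake": (4.70, 8.58),
    "Kirchhoff": (8.32, 14.16),
}

for name, (before, after) in scores.items():
    print(f"{name}: {percent_gain(before, after):.1f}% gain")
# Convolution: 80.5% gain
# Handbrake: 82.6% gain
# Kirchhoff: 70.2% gain
```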
Will 4 Million Be Enough?
While processor groups might seem to promise infinite scaling, there are limits to how many logical processors can be accommodated: 4,194,304 to be exact.
“Windows defines the data structure for processor groups as a processor number, and within that structure is a data value called a group,” says Shows. “A group is a word data type, which is defined as a 16-bit unsigned integer. This means that one could have a maximum of 65,536 processor groups containing 64 threads each.
“In all our wisdom, we’re saying ‘surely no system would have more than 4,194,304 processors!’ But eternity is a long time and if 4 million isn’t enough, we’ll need to modify the data type in the future to allow for more groups.”
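The 4,194,304 figure can be reproduced with a small Python sketch that mirrors the layout of the Win32 PROCESSOR_NUMBER structure using ctypes. The helper to_processor_number is hypothetical, and it assumes every group is filled to 64 processors, which real systems need not do.

```python
import ctypes

MAX_PER_GROUP = 64  # logical processors per processor group


class ProcessorNumber(ctypes.Structure):
    """Illustrative mirror of the Win32 PROCESSOR_NUMBER structure."""
    _fields_ = [
        ("Group", ctypes.c_uint16),    # WORD: processor group index
        ("Number", ctypes.c_uint8),    # BYTE: processor index within the group
        ("Reserved", ctypes.c_uint8),
    ]


def to_processor_number(global_index):
    """Map a flat index to (group, number), assuming full 64-wide groups."""
    group, number = divmod(global_index, MAX_PER_GROUP)
    return ProcessorNumber(Group=group, Number=number)


group_bits = ctypes.sizeof(ctypes.c_uint16) * 8  # 16 bits in a WORD
max_groups = 2 ** group_bits
max_logical_processors = max_groups * MAX_PER_GROUP
print(max_groups, max_logical_processors)  # 65536 4194304

last = to_processor_number(max_logical_processors - 1)
print(last.Group, last.Number)  # 65535 63
```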
Bob Cramblitt is communications director for SPEC. He writes frequently about performance issues and digital design, engineering and manufacturing technologies.
A reminder: SPEC/GWPG benchmarks are available for free downloading to everyone except vendors of computers and related products and services that are not members of the SPEC Graphics and Workstation Performance Group (SPEC/GWPG).