Date: March 15, 2021
Author(s): Rob Williams
AMD has just launched its third-generation EPYC server processor series, also known as Milan. This EPYC update brings AMD’s Zen 3 architecture to the data center, with its improved efficiency, faster performance, and bolstered security. With some of the new chips in hand, we’re going to explore how they handle our most demanding workloads.
It was four years ago this month when AMD announced its EPYC processor series, which marked the company’s triumphant return to the battle at the heart of the data center. Out of the gate, AMD offered 32-core options with dual-socket (2P) potential, giving customers up to 128 threads in one node. In all this time, Intel has been stuck with a 28-core top-end, although its upcoming Ice Lake is set to finally shake that up.
Since the first-gen Naples launch in 2017, AMD improved upon its Zen architecture with Rome to reduce latencies, improve performance, and deliver chips equipped with twice as many cores. Today’s launch of the third-gen EPYC (Milan) represents the first server chips built around AMD’s most advanced Zen 3 architecture.
When AMD launched its Rome parts in 2019, it didn’t have a 280W option from the get-go, but did deliver one later with the EPYC 7H12. The new third-gen line-up launches with not just one high-performance 280W part, but two. As with the previous generation, AMD has special “F” series SKUs that focus on not just reaching higher clock speeds, but also bundling more cache.
**AMD EPYC 7003 Processor Line-up**

| Model | Cores (Threads) | Base Clock (Boost) | Cache | Min cTDP | Max cTDP | 1Ku Price |
|-------|-----------------|--------------------|-------|----------|----------|-----------|
| 7763 | 64C (128T) | 2.45 GHz (3.5) | 256MB | 225W | 280W | $7,890 |
| 7713 | 64C (128T) | 2.0 GHz (3.675) | 256MB | 225W | 240W | $7,060 |
| 7713P | 64C (128T) | 2.0 GHz (3.675) | 256MB | 225W | 240W | $5,010 |
| 7663 | 56C (112T) | 2.0 GHz (3.5) | 256MB | 225W | 240W | $6,366 |
| 7643 | 48C (96T) | 2.3 GHz (3.6) | 256MB | 225W | 240W | $4,995 |
| 7543 | 32C (64T) | 2.8 GHz (3.7) | 256MB | 225W | 240W | $3,761 |
| 7543P | 32C (64T) | 2.8 GHz (3.7) | 256MB | 225W | 240W | $2,730 |
| 7513 | 32C (64T) | 2.6 GHz (3.65) | 128MB | 165W | 200W | $2,840 |
| 7453 | 28C (56T) | 2.75 GHz (3.45) | 64MB | 225W | 240W | $1,570 |
| 7443 | 24C (48T) | 2.85 GHz (4.0) | 128MB | 165W | 200W | $2,010 |
| 7443P | 24C (48T) | 2.85 GHz (4.0) | 128MB | 165W | 200W | $1,337 |
| 7413 | 24C (48T) | 2.65 GHz (3.6) | 128MB | 165W | 200W | $1,825 |
| 7343 | 16C (32T) | 3.2 GHz (3.9) | 128MB | 165W | 200W | $1,565 |
| 7313 | 16C (32T) | 3.0 GHz (3.7) | 128MB | 155W | 180W | $1,083 |
| 7313P | 16C (32T) | 3.0 GHz (3.7) | 128MB | 155W | 180W | $913 |
| **“F” Series Processors** | | | | | | |
| 75F3 | 32C (64T) | 2.95 GHz (4.0) | 256MB | 225W | 280W | $4,860 |
| 74F3 | 24C (48T) | 3.2 GHz (4.0) | 256MB | 225W | 240W | $2,900 |
| 73F3 | 16C (32T) | 3.5 GHz (4.0) | 256MB | 225W | 240W | $3,521 |
| 72F3 | 8C (16T) | 3.7 GHz (4.1) | 256MB | 165W | 200W | $2,468 |

All EPYC 7003 SKUs support 128x PCIe 4.0 lanes and up to 4TB of DDR4-3200 memory (with 128GB DIMMs).
AMD had more eight-core SKUs last-generation, but the one that remains here is quite interesting, even if it’s not obvious at first. All of the “F” series chips prioritize clock speed, but the eight-core 72F3 also manages to pack 256MB of cache under its hood, for 32MB per core. That doubles the best cache count of the eight-core options last-generation, so the 72F3 is quite an intriguing chip.
At the top-end of AMD’s line-up, the EPYC 7763 effectively succeeds the 7H12 as AMD’s most powerful offering, sharing a peak TDP of 280W. All of these EPYC chips have adjustable TDPs, with both 280W SKUs in this current-gen line-up able to be tuned down to 225W. While many of the TDPs carry over from last gen to current gen, AMD has improved clock speeds in all cases that we can see.
Another interesting SKU in this line-up is the first 56-core model to hit EPYC: 7663. With 112 threads, the 7663 matches the number of threads available in a top-spec 2P Intel server, but with just a single processor. You could say AMD has a bit of a stranglehold on the core count game right now.
A huge number of cores may be great for certain workloads, but AMD naturally fills out its EPYC lineup with a large assortment to suit many different needs.
As with the previous-gen EPYC processors, these latest models support up to 32x 128GB L/RDIMM for a total of 4TB of DDR4-3200 memory. The eight-channel memory controller returns, as well, delivering some seriously impressive bandwidth, as we’ll see later. Lastly, all third-gen EPYC SKUs can be used in dual or single-socket configurations, with P models being restricted to single-socket only.
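That 4TB figure falls straight out of the channel layout. A minimal sketch of the arithmetic, assuming the standard two DIMMs per channel:

```python
# Back-of-the-envelope check of the 2P memory capacity figure:
# 8 channels per socket, 2 DIMMs per channel, populated with 128GB DIMMs.
sockets = 2
channels_per_socket = 8
dimms_per_channel = 2
dimm_gb = 128

total_dimms = sockets * channels_per_socket * dimms_per_channel
total_tb = total_dimms * dimm_gb / 1024

print(f"{total_dimms} DIMMs, {total_tb} TB")  # 32 DIMMs, 4.0 TB
```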
Since the launch of its Zen architecture in 2017, AMD has made great strides towards improving its chips’ instructions-per-clock performance, with notable gains being seen from generation to generation. This “from the ground-up” third-gen Zen core proves to deliver the biggest improvement yet, with AMD now stomping on Intel’s previous IPC performance leadership.
As with the previous generation of EPYC chips, the third-gen Milan awards a 256MB cache to its beefier models, with up to 32MB being available to a single core (such as with the 72F3). That’s made possible by AMD’s move to implement eight cores (from four) into each CCX.
AMD says that Milan brings a 19% improvement to its IPC performance, and based on the performance we’ve seen previously from AMD’s desktop and workstation Zen 3 CPUs, we believe that to be a pretty accurate statement. Much has been polished with this new design, including faster integer throughput, improved TAGE branch prediction, and a notable reduction to memory latencies.
AMD made some big improvements to its branch predictor between Zen 1 and 2, and Zen 3 continues these enhancements. For starters, the L1 BTB has doubled in size to 1,024 entries, and thanks to a new “no bubble” mechanism, branch targets can be pulled every single cycle. Further, improvements have been made to reduce latency penalties from branch mispredictions.
On the execution side, both integer and floating-point have seen many improvements. Integer gains a dedicated branch and data store picker, while the re-order buffer gains a larger window for more instructions in flight. On the floating-point side, throughput has doubled for the IMAC and ALU pipes, FMAC latency has dropped by one cycle (from five to four), and dispatch and issue have been widened from 2- to 6-wide compared to Zen 2.
The load/store unit on Zen 3 can handle additional memory operations per cycle: three loads and two stores. The TLB has also seen a big improvement, with four additional table walkers (six total) that will reduce latency for sparse and random memory requests.
All of these changes contribute to the +19% IPC performance boost, highlighting the fact that it can require a lot of work with many different pieces of an architecture to accomplish such a feat.
One of the most notable additions to EPYC’s Milan variant is the addition of Secure Nested Paging, or what AMD calls SEV-SNP. This improved memory integrity will help protect guest instances against untrusted hypervisors. The company has also added Shadow Stack to protect against ROP (return-oriented programming) malware.
AMD has introduced a new instruction called Invalidate Page Broadcast, which avoids using inter-processor interrupts to do TLB shootdowns; the invalidate can be broadcast to the fabric instead, ultimately increasing efficiency. Encryption support within AVX2 has also been bolstered, with the VAES and VPCLMULQDQ extensions added in, and 256-bit data now supported.
Over the next three pages, we’re going to explore performance from a few of AMD’s new EPYC chips in a variety of tests, and compare them against the last-gen EPYC 7742 64-core, and Intel’s Xeon Platinum 8280 28-core. Let’s move on:
For our testing, we’re using the most up-to-date version of Ubuntu Server, with the kernel updated in all cases. On the third-gen platform, the latest 5.11 was used, although in our quick-ish testing, there were no performance differences to be seen with a move from 5.8 to 5.11. As the older 5.4 kernel ships with the latest Ubuntu Server, we felt compelled to update from that before starting.
Because we want accurate, repeatable results at the best performance possible, we always adjust the scaling governor to put the CPU profile into performance mode before any testing is kicked off. For the Xeon Platinum platform, we opted for DDR4-2933 memory, as it’s the maximum supported for that generation (the newer third-gen Xeons bump that to DDR4-3200 to match EPYC), while the last-gen EPYC 7742 uses DDR4-3200, albeit with less overall density than what we’re testing the latest EPYCs with.
All tests are run at least three times over, but in some cases, even more runs are needed for more stable results. The nature of multi-thread tests is that they don’t always execute the same way each time, so there are occasions when a handful (or two handfuls) of runs are needed to bolster the confidence level. If we don’t include a test here that you were hoping to see, please let us know in the comments.
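To give a sense of what that looks like in practice, here’s a minimal sketch of the kind of run-to-run stability check we lean on. The 3% coefficient-of-variation threshold here is illustrative, not our exact methodology:

```python
import statistics

def needs_more_runs(times, max_cv=0.03, min_runs=3):
    """Return True if the run-to-run spread is still too wide to trust.

    times: wall-clock results (in seconds) from repeated runs of one benchmark.
    max_cv: maximum acceptable coefficient of variation (stdev / mean).
    """
    if len(times) < min_runs:
        return True
    cv = statistics.stdev(times) / statistics.mean(times)
    return cv > max_cv

print(needs_more_runs([100.2, 99.8, 100.1]))  # tight spread -> False
print(needs_more_runs([100.2, 92.5, 104.1]))  # noisy -> True, run it again
```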
We’d like to give a shout out to our good friend Michael Larabel from Phoronix for helping us generate fresh numbers for the last-gen EPYC 7742, as well as Intel’s Xeon Platinum 8280. We didn’t have access to these platforms in-house, and Michael was kind enough to run our test suite on his systems so we could have more useful and relevant comparison data to share today.
| Component | Details |
|-----------|---------|
| Processors | 2x AMD EPYC 7763 (128C/256T)<br>2x AMD EPYC 7713 (128C/256T)<br>2x AMD EPYC 75F3 (64C/128T)<br>2x AMD EPYC 7742 (128C/256T)<br>2x Intel Xeon Platinum 8280 (56C/112T) |
| Memory | AMD EPYC 7003: 16 x 32GB Micron 36ASF4G72PZ-3G2E2<br>AMD EPYC 7742: 16 x 8GB Hynix HMA81GR7CJR8N-XN<br>Intel Xeon: 12 x 32GB Hynix HMA84GR7CJR4N-WM |
| OS & Kernel | AMD EPYC 7003: Ubuntu 20.04.2 (5.11 kernel)<br>AMD EPYC 7742: Ubuntu 20.04.2 (5.8 kernel)<br>Intel Xeon: Ubuntu 20.04.2 (5.8 kernel) |
To kick off our look at the performance of AMD’s latest EPYC 7003 chips, we’ll start with a crowd favorite: compiling. It’s often thought that 3D rendering is one of the de facto best examples of how to maximize the full potential of a CPU, but compiling stands to benefit, as well. The more threads that are available to tackle individual jobs, the faster the overall compile finishes – even if the entire process doesn’t top-out a CPU’s usage like some other hardcore work could.
Prior to AMD releasing its first 32-core and higher parts, it wasn’t entirely obvious whether or not compiling would still scale with so many additional cores, but we can see here that it can, with defined gains seen moving up the stack. AMD’s new clock-focused EPYC 75F3 beats out Intel’s Xeon Platinum 8280 pretty easily, and actually outperforms the 256-thread configurations with the LLVM compile.
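The reason compiles don’t scale perfectly is that every build has serial phases (configure steps, the final link) that cap the achievable speedup, per Amdahl’s law. A quick sketch, with an illustrative 95% parallel fraction:

```python
def amdahl_speedup(parallel_fraction, threads):
    """Amdahl's law: the upper bound on speedup when only part of a job parallelizes."""
    serial = 1 - parallel_fraction
    return 1 / (serial + parallel_fraction / threads)

# Even with 128 threads, a build that is 95% parallel tops out around 17x:
print(round(amdahl_speedup(0.95, 128), 1))  # 17.4
```

This is why more cores keep helping, but with diminishing returns at the top of the stack.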
As with many other workload types, not all molecular dynamics solutions are built alike, but all of them seem to prefer having as many cores as possible to get the job done quicker. With both the NAMD and GROMACS tests, the new top-end EPYC 7763 soars to the top, carving out a nice performance delta between itself and the last-gen 7742 and lower-power current-gen 7713.
Despite having only a handful more cores, the dual 32-core 75F3 chips pull ahead a notable distance from Intel’s 28-core 8280s. In the LAMMPS test, the 7763 performed on par with the last-gen 7742, but made a more obvious performance impact with the 20K Atoms test.
Oddly enough, the last-gen 7742 inched ahead of the new 7763 in the OpenFOAM test, although due to the more sporadic nature of many-core tests, re-running that test again next time might level them out more. As far as the interconnect HPCG test goes, all of the new chips perform about the same – but far ahead of the last-gen part.
This marks the first time we’ve run the toyBrot test, and we’re pretty pleased with the level of scaling we’re seeing there. AMD’s top 7763 soars to the top, while the 7713 sits just behind it. In our real-world use, the 7713 dual-CPU platform uses about 200W less than the 7763, so to keep up like it does is really great to see. It’s also clear as ever with a test like this that Intel could use more cores if it wants to place near the top of most of our performance charts.
We’re going to be tackling both video encoding and rendering performance on this page, and while that sounds like only two things, the reality is that there are a countless number of workloads available in each one, with loads of opportunity to see interesting scaling.
On the encoding front, we’re taking a look at WebP2 image encoding (yes – that’s the successor to WebP, and not a typo), while Intel’s SVT video encoder is used to crunch some VP9.
Encoding is a strange beast, especially for many-core chips. In order for so many cores to actually enjoy an improved encoding performance, the jobs need to be properly parallelized. This is a grand feat, and plays out a lot better with a solution like SVT than with consumer products.
Both of these video and image encoding graphs paint a similar picture. In these particular workloads, the faster clock speeds of the 75F3 32-core parts are enough to push them ahead of the dual 64-core parts. As usual, it always pays to understand your workload. While the 7713 sits ahead of the 7763 here, we chalk it up to normal benchmark variance. The SVT test is one where Intel’s last-gen 8280 managed to inch past the last-gen EPYC 7742.
Chips like the super-fast 75F3 may see themselves leapfrog the higher core count chips in certain workloads, but rendering certainly isn’t one of those. Rendering will take advantage of as much CPU horsepower as you can deliver, so while higher clocks may help in some cases, the additional cores are going to make an even more notable improvement.
In all cases here, the new top-end EPYC 7763 is pegged to the top of the charts, with it distancing itself a fair bit from the last-gen 7742. While Intel showed some strength in the SVT test to not place last, rendering proves too much. Cores matter, and the more you have, the better. In the event that two CPUs have the same number of cores, then chances are the faster clocks will help the new chips prevail.
In another example of where not all workloads are alike, Intel’s Xeon Platinum actually scaled both Embree projects differently than the AMD chips did, proving stronger in the Asian Dragon test over the Crown one. Either way, those chips are still beat out by AMD’s latest 32-core 75F3 in both instances.
The previous pages have been focused on specific scenarios, like science, rendering, and compiling, so to help wrap up this initial EPYC third-gen performance look, this page is going to revolve around some more general scenarios, like compression and memory bandwidth.
We also have some more unusual tests on this page, including the largely synthetic NAS Parallel tests, as well as CoreMark, and even John the Ripper. Again, not all workloads are built alike, so it’s good to be thorough and see just how dominant one chip can be.
Yet again, we see an example above of how two similar workloads can behave quite differently. With 7-Zip, the core count matters more than anything, so it’s no surprise that AMD’s chips are glued to the top part of the chart. Interestingly, this gen’s 32-core clock-focused 75F3 performed the same as last-gen’s 64-core 7742, but the newer 64-core options, with their improved design over Zen 2, help propel them far beyond the rest.
With the Zstd compression test, clocks apparently matter quite a bit, as the 32-core 75F3 once again places on top, with both of the current-gen 64-core parts performing about the same (the 7713 wins against 7763 here, but when the performance level is so similar, they will flip-flop with subsequent runs.)
Chess engines aren’t exactly an important workload for most people who seek out server processors, but they still act as a perfect example of branching workloads if you take full advantage of all of the cores a CPU gives you. In the past, we’ve only tested with Stockfish, but we’re glad to have added Crafty this go-around, because you can see that once again, the 32-core 75F3 could make more sense in some cases over an even bigger option. It pays to know your workload.
It’s also worth noting that Crafty is kinder to Intel’s 28-core chips than most of the other tests. It really does prove that not one chip can be great at absolutely everything.
Here’s a chart that AMD’s competition can’t dig too much. AMD’s eight-channel memory controller is powerful, delivering an immense amount of bandwidth at the top-end compared to that Xeon competition. There’s such a stark divide here, that you’d imagine Intel’s controller is only quad-channel – but it’s actually six-channel. You can also see the performance uplift from AMD’s own previous generation CPU.
We regret not being able to include Intel’s latest-gen Xeon in here, as the memory support has been bumped to 3200MHz, but that still wouldn’t be enough to bring it that much closer to EPYC. For those with seriously heavy memory bandwidth needs, it’s hard to ignore data like this.
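The divide also makes sense on paper. Assuming the standard 8 bytes per transfer per DDR4 channel, theoretical peak bandwidth is easy to compute:

```python
def peak_bw_gbs(channels, mts):
    """Theoretical peak DDR4 bandwidth in GB/s: channels x transfers/sec x 8 bytes."""
    return channels * mts * 8 / 1000  # MT/s * 8 bytes = MB/s -> GB/s

epyc_socket = peak_bw_gbs(8, 3200)   # 204.8 GB/s per EPYC socket
xeon_socket = peak_bw_gbs(6, 2933)   # ~140.8 GB/s per Xeon 8280 socket
print(epyc_socket, xeon_socket, 2 * epyc_socket)
```

Our ~230GB/s Stream result on the 2P EPYC platforms works out to roughly 56% of the 409.6GB/s theoretical 2P peak, which is why we note that tuned software could reach even greater heights.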
Note that EPYC 7003 CPUs with less than 128MB of cache can use four-channel memory, while all of the SKUs can take advantage of either six- or eight-channel memory.
Not even all synthetic benchmarks are built alike. The NAS Parallel suite of tests (from NASA) is in fact synthetic in nature, but it’s meant to show where one CPU will excel over another. With the Embarrassingly Parallel tests, we see pretty much expected scaling. The new-gen 7713 outperforms the last-gen 7742 ever-so-slightly. In the LU.C test, the higher clocks of the 75F3 help give it a boost over the 7713 and 7742.
Both CoreMark and John the Ripper perform according to our expectations, aside from the fact that the 7713 delivered notably more performance than the last-gen 7742 in JTR. It’s clear over and over throughout these results that AMD’s Zen 3 architecture has benefited a number of workloads to an obvious degree, which is great to see.
As some of our system sensors were not working correctly when we tested the latest EPYCs on AMD’s Daytona platform, we’re only able to include the three new EPYC chips that we’ve tested here via different power testing methods. Our stress test involves a Blender project with an obscene number of iterations, so that any CPU can be taxed for a long time.
The 32-core and 64-core parts use about the same amount of power, which is to be expected, as AMD itself gave both the same top-end TDP. What’s interesting to us, however, is that the 64-core 7713 used significantly less power than the 7763, more so than what the printed TDP differences suggest. Really – 637W for 256 threads is quite impressive. Don’t fret, though: the server will still be just as loud as you’d expect it to be!
It’s been said by many over the past few years, but the CPU market has become really interesting since the launch of AMD’s EPYC four years ago. Despite those first-gen parts being impressive in their own right, AMD has continued to iterate nicely from gen to gen. The new Milan-based chips prove undeniably impressive.
As with the previous generation of EPYC chips, AMD has launched Milan with many different models to cater to the varied nature of the data center market. For cache or clock-bound workloads, the F series parts are alluring. We love that there is an eight-core option bundled with 256MB of cache, not to mention the highest clock speed of the line-up.
Not only does Milan bring the new Zen 3 architecture to AMD’s EPYC, clock speeds have been improved all over the place. Every little improvement made here enhances many different types of performance, and reduces latencies to boot. There are no core count increases this generation, but AMD still leads the core density pack, able to deliver more cores in a 1P solution than an Intel 2P, and likewise more cores in a 2P solution than an Intel 4P.
Of course, not everyone needs systems with 256 threads, so AMD offers a wide range of core counts across the entire line-up. Interestingly, there are no 12-core models this go-around, nor are there multiple eight-core options. We have seen the introduction of a 56-core model, however, which is actually pretty interesting. We can’t help but wonder if AMD was inspired to create that 7663 offering because it matches the total thread count of a top-end 2P Intel server.
Something we couldn’t help but notice right away is that the least-expensive EPYC option for this third-gen launch is $913, whereas the second-gen Rome chips started at $450. We could see AMD augment this latest line-up with less-expensive models later, but for now, it seems to be focusing on pushing its clock-targeted F series if you’re needing fewer than 16 cores. If you want low-end EPYC, you’ll need to opt for the last-gen Rome.
One thing worth mentioning is that this generation’s EPYC chip lineup is priced a smidgen higher than the previous generation, something that could have tariffs to thank, or the fact that AMD knows it can command a higher price for what it’s offering. It’d be hard to argue with the latter, since Intel still charges more for its top chips, despite their lower core count.
Those wanting to upgrade previous-generation EPYC platforms with the latest chips can do so with the help of firmware updates. Do note, however, that if you update your firmware for Milan support, you could possibly lose Rome support until you reflash the previous firmware. We can’t see this impacting too many, but it’s worth being aware of anyway.
The EPYC 7003 processor series becomes AMD’s second server offering to include PCIe 4.0 support, something Intel’s not going to match until its Ice Lake Xeons land later this year. With 128 lanes available to each CPU, EPYC is capable of delivering truly impressive bandwidth performance. On the topic of bandwidth, we hit 230GB/s with the memory in our Stream testing, but with tuned software, it could reach even greater heights.
With regards to security enhancements this generation, there are not too many changes to speak about, but that’s not really a bad thing considering AMD had a great security feature set with its Rome series. What is new with this gen is Secure Nested Paging, which can protect guest instances against untrusted hypervisors. AMD’s Secure Processor makes a return this generation, allowing vendors to enable robust hardware root-of-trust to protect sensitive data from untrusted firmware or software.
After poring over what it’s capable of, and having tested it extensively, we can say that the third-gen EPYC is really good. It’s great to see AMD’s Zen 3 architecture finally hit the enterprise, and bring better clock speeds and IPC along with it. Effectively, the fastest server processors just got faster, so there’s a lot to like about Milan.
We’re not entirely sure when Intel will be launching its next-gen Ice Lake-based Xeon Scalable processors, but with a new design, more cores, and PCIe 4.0 added in, we’re looking forward to seeing both that and this third-gen EPYC series duke it out later this year.
Copyright © 2005-2021 Techgage Networks Inc. - All Rights Reserved.