Date: August 4, 2008
Author(s): Rob Williams
Intel today takes a portion of the veil off their upcoming Larrabee architecture, so we can better understand its implementation, how it differs from a typical GPU, why it benefits from taking the ‘many cores’ route, its performance scaling and of course, what else it has in store.
During a press briefing late last week, Intel discussed their upcoming Larrabee ‘visual computing’ architecture in much greater detail than ever. Some of what was discussed has been known for a while, but exactly how the architecture works and what it offers hasn’t been explored much in the past.
This briefing nicely prefaces the SIGGRAPH conference, which happens next week in Los Angeles, and also Intel’s own Developer Forum, occurring in two weeks in San Francisco. Although much was unveiled during last week’s briefing, even more may be covered at IDF.
As a quick refresher, Larrabee is Intel’s architectural solution to visual computing. While a Larrabee-equipped card could be considered a ‘graphics card’, in that it can render your game’s 3D graphics, ‘visual computing’ is the term Intel makes sure hovers around the name. It’s no secret that GPUs are excellent performers in specific non-gaming scenarios, and as a result, both Intel and NVIDIA have been going forth with their own projects in order to push the GPU far beyond gaming.
From an architectural standpoint, Larrabee can’t be directly compared to current GPUs on the market, as the innards are vastly different. While ATI’s and NVIDIA’s solutions put one or two large GPU cores on a graphics card, Larrabee will be going in a different direction, offering a processor with many smaller cores underneath.
As previously leaked, but now confirmed, the cores inside Larrabee feature a pipeline derived from the Pentium processor, using a short execution pipeline with a fully coherent cache structure. Larrabee will of course carry its own improvements, which include 64-bit extensions, multi-threading, a wide vector processing unit and an advanced pre-fetcher.
How many of these cores Larrabee will feature is currently unknown, although it is sure to vary from model to model, just like current desktop CPUs. In an image I’ll show later in this brief article, the graphs provided show anywhere from 8 cores all the way up to 48.
The question of why Intel decided to go this route is a common one, but the simplified answer is that it just makes sense to go in this direction. Prior to the advent of Dual-Core processors, the issue we were facing was that processor cores were topping out at a certain frequency, and no improvements could immediately be made to increase it. Even if it could have been increased further, the benefits would have been small relative to the amount of technical work required to go that route.
To help matters, Dual-Core and eventually Quad-Core processors were released. We’ve been fortunate to have the best of both worlds with the Core architecture, because it proved faster overall than previous NetBurst-based CPUs, plus we had the benefit of being able to fit more than one core under the hood.
The reason for Larrabee’s direction is that it’s easier to scale a handful of cores than it is to create one mammoth core. Taking a look at NVIDIA’s latest GTX core will verify this.
Much like CPUs, GPUs have followed a similar progression. They started out as fixed function parts, limited in what they could execute, then became partially programmable, and now look set to become fully programmable with the help of Larrabee. It should be noted that NVIDIA’s CUDA architecture is similar in some regards to Larrabee, but until we see raw performance data from Larrabee, it’s impossible to say who has the better product.
During the briefing, Intel also unveiled Larrabee diagrams and lots of information on what helps to make the architecture an innovative one. Sadly, we are unable to post most of the information we were given, but I would expect that to change either next week at SIGGRAPH, or later this month during their Developer Forum.
In the block diagram below, we can see how Larrabee is structured, and I should mention that this is unlikely to be an accurate representation of scale of each component. The center of the processor is comprised of the L2 Cache, which is shared amongst all of the available cores, and will likely grow in density depending on the number of cores implemented.
These multi-threaded cores are found on top and bottom, and are connected to other cores and the memory controller via a 1024-bit wide ring bus (512-bit in each direction), which also ties in the fixed function logic. Each one of the cores offers four threads of execution, and includes a 32KB instruction cache and a 32KB data cache.
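To make the "512-bit in each direction" part a little more concrete, here is a minimal sketch of why a two-way ring helps: a message can travel whichever way around is shorter. The stop count, the numbering of stops and the function names are all assumptions for illustration; Intel has not described how Larrabee actually routes traffic on the ring.

```python
from typing import Tuple


def ring_hops(src: int, dst: int, n_stops: int) -> Tuple[int, str]:
    """Return (hops, direction) for the shorter path on a two-way ring.

    Stops are numbered 0..n_stops-1 around the ring. This numbering and the
    tie-break toward 'clockwise' are illustrative assumptions.
    """
    if n_stops <= 0:
        raise ValueError("n_stops must be positive")
    clockwise = (dst - src) % n_stops
    counterclockwise = (src - dst) % n_stops
    if clockwise <= counterclockwise:
        return clockwise, "clockwise"
    return counterclockwise, "counterclockwise"


def worst_case_hops(n_stops: int) -> int:
    """Farthest any two stops can be apart when both directions are usable."""
    return n_stops // 2


if __name__ == "__main__":
    # Hypothetical 16-stop ring: stop 13 is 3 hops away going "backwards".
    print(ring_hops(0, 13, 16))
    print(worst_case_hops(16))
```

With a one-way ring, the worst case would be nearly a full lap; with two directions it is half a lap, which is one plausible reason to split the bus into two 512-bit halves.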
The L2 Cache in Larrabee is designed a little differently than how it’s implemented on a normal desktop CPU. Rather than being ‘banked’, the Cache is divided into sub-sections where each section is directly connected to a specific core. If one core is reading data not being written by the other cores, it’s stored in its local cache, which improves latency and also bandwidth.
Further information on the cores themselves was provided to us, but as mentioned earlier, we are unable to share the slides themselves. However, I can say that the x86 in-order scalar core features both a Scalar Unit and a Vector Unit attached to the Instruction Decode, each connected to its respective registers before passing off to the L1 and L2 caches.
The Vector Unit that each core employs is one of the key features of Larrabee, and acts a little differently than other vector units out there. We won’t get into the nitty gritty, but thanks to features included within the ‘Vector complete instruction set’, Larrabee features high computational density. Because each vector unit is 16 lanes wide and applications can take advantage of all 16 lanes, along with the other improvements in place, efficiency sits at a comfortable 85%.
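For a rough feel of what lane efficiency means on a 16-wide unit, the sketch below counts how many of the 16 lanes do useful work when a batch of elements is split into vector operations. Only the width of 16 comes from the briefing; the function names and the simple "last operation is partly empty" model are assumptions, and this is not how Intel arrived at its 85% figure.

```python
LANES = 16  # vector width given in the briefing


def lanes_per_op(n_elements, lanes=LANES):
    """Active lanes for each vector op needed to cover n_elements items."""
    full, rest = divmod(n_elements, lanes)
    return [lanes] * full + ([rest] if rest else [])


def lane_utilization(active_per_op, lanes=LANES):
    """Fraction of issued lanes that carried useful work."""
    if not active_per_op:
        return 0.0
    return sum(active_per_op) / (len(active_per_op) * lanes)


if __name__ == "__main__":
    ops = lanes_per_op(100)        # six full ops plus one with 4 lanes active
    print(len(ops), lane_utilization(ops))
```

In this toy model, 100 elements need seven vector operations and fill 100 of 112 lanes. Real workloads lose lanes for other reasons too, such as branches that only some elements take.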
That’s as much of the technical side as I’ll touch on for now, so I’ll shift focus briefly to why Larrabee is ultimately different from typical GPUs. As mentioned earlier, the architecture employs numerous cores which act together to deliver the overall speed, whereas a normal GPU sticks with a single core, which is rather large and elaborate in itself. Intel believes that bigger is not better, but rather more is better.
Because each core is actually based on the x86 architecture, development is made so much easier on coders looking to utilize the processor to its full potential. C and C++ are widely used in development of all types of software, including games, so to make the shift over to developing for IA shouldn’t be too difficult.
Key features of a Larrabee core include context switching and preemptive multi-tasking, virtual memory support in addition to page swapping, and also fully coherent caches at all levels of the cache hierarchy. Intel also boasts efficient inter-block communication with the help of the ring bus, which is 512 bits wide in each direction. In layman’s terms, it means that the communication between all the various components will be swift, resulting in low latencies and higher performance.
One of the most important features is that while fixed function logic exists, it doesn’t get in the way, allowing excellent load balancing and general functionality. Intel touched on the fact that there is no such thing as a ‘typical workload’, and then showed off slides to back that up. Different games will act differently during gameplay. One might be heavier on rasterization, while another is heavier on the pixel shader. Because this was kept in mind during the development of Larrabee, the architecture is meant to handle all workloads effectively.
Though we can’t publish slides, the thing to take away from Larrabee is that it’s ultra-scalable and opens up many pipelines that used to be fixed. Development should be made easier, and a Larrabee chip should be extremely efficient and fast. Just how fast, we don’t know. Given that IDF takes place in two weeks, it would be surprising not to see hands-on demos there.
Not surprisingly, no raw performance data has been revealed yet, but Intel did give us a few graphs that show how well binned rendering compares to immediate mode per frame, and also how game performance can scale with Larrabee.
Using the technology called ‘binned rendering’, Intel’s benchmarking results show that it’s possible for a game to use much less memory bandwidth than when aspects of the game are rendered with traditional methods. As you can see, using the new binned mode, each game was able to use substantially less memory bandwidth overall, sometimes half.
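Intel didn’t walk through its binning implementation, but the general idea behind binned (tile-based) rendering can be sketched in a few lines: sort triangles into screen tiles first, then render one tile at a time so that tile’s pixels can stay in fast on-chip memory instead of bouncing to and from main memory. The 64-pixel tile size, the bounding-box test and the function name below are assumptions for illustration, not details from the briefing.

```python
from collections import defaultdict

TILE = 64  # hypothetical tile edge in pixels; not a figure from Intel


def bin_triangles(triangles, width, height, tile=TILE):
    """Map each tile (tx, ty) to indices of triangles whose bounding box touches it.

    triangles: list of three (x, y) screen-space points per triangle.
    A bounding-box test is conservative: it may list a triangle in a tile
    it does not actually cover, but it never misses one.
    """
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the tile range to the screen; off-screen triangles yield empty ranges.
        tx0 = max(0, int(min(xs) // tile))
        tx1 = min(max_tx, int(max(xs) // tile))
        ty0 = max(0, int(min(ys) // tile))
        ty1 = min(max_ty, int(max(ys) // tile))
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


if __name__ == "__main__":
    tris = [
        ((10, 10), (20, 10), (10, 20)),        # sits inside one tile
        ((60, 60), (70, 60), (60, 70)),        # straddles four tiles
        ((200, 200), (210, 200), (200, 210)),  # far corner
    ]
    for tile_xy, members in sorted(bin_triangles(tris, 256, 256).items()):
        print(tile_xy, members)
```

A renderer built this way would then loop over the bins and draw each tile’s triangle list in turn. Tiles are independent of one another, which also makes them a natural unit of work to hand out across many cores.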
As far as scalability is concerned, Intel also showed in another graph just how much gaming performance could be enhanced as the number of cores is increased. These numbers aren’t the result of typical benchmarks, as their tests were specific in nature.
It’s interesting to note the stark increases, which are almost completely linear in nature. The largest variation was 10%, which is rather modest compared to some even higher variations when doing these kinds of tests with our desktop Quad-Cores. Intel continued by noting that while the typical GPU has to ‘assume’ what typical workloads will be like in the future, Larrabee is completely scalable and ready to tackle whatever workload comes at it.
What’s that mean to the consumer? Instead of seeing certain games excel on certain GPUs, all games should scale similarly on Larrabee, regardless of the game engine. The three games shown above all use different engines, yet all scaled on a linear path. If only most desktop applications took such great advantage of our desktop CPUs…
There was a lot more discussed during this press briefing, but most of it is specific to game developers and has little relevance to the end consumer or enthusiast. During the briefing though, we were able to learn a lot more about the architecture and what benefits it holds, and overall, it’s all rather impressive. Still, until we see real-world examples of all its benefits, it’s of course going to be difficult to state much of an opinion.
The architecture itself is fascinating, because it differs substantially from the typical GPUs. Instead of a single massive GPU core on-board, Larrabee will feature numerous Pentium-derived cores that are redesigned to better handle applications of all sorts in the visual computing scheme of things. Just how effective this method of scaling will prove to be in real-world application is yet to be seen, and it’s sure going to be hard to wait.
Intel did provide performance scaling information, as seen in the chart above, and while it does give us a brief idea of the performance, it in no way gives us a real-world idea of what to expect. Eight cores might have delivered 25% of the performance of thirty-two cores, but that doesn’t mean much if the eight cores mustered 4FPS in a randomly selected title.
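The arithmetic behind that caveat is easy to sketch. Assuming perfectly linear scaling, performance is simply proportional to core count; the function names and the optional `efficiency` knob are hypothetical, and the 4FPS starting point is the made-up number from the paragraph above, not a measured result.

```python
def relative_performance(cores, reference_cores):
    """Performance of `cores` as a fraction of `reference_cores`, if scaling is linear."""
    return cores / reference_cores


def projected_fps(base_fps, base_cores, target_cores, efficiency=1.0):
    """Project frame rate to a different core count.

    efficiency is a hypothetical fudge factor (1.0 = perfectly linear) for
    modelling scaling that falls somewhat short of the ideal line.
    """
    return base_fps * (target_cores / base_cores) * efficiency


if __name__ == "__main__":
    print(relative_performance(8, 32))   # eight cores vs. thirty-two
    print(projected_fps(4, 8, 32))       # the 4FPS example, scaled to 32 cores
```

Under these assumptions, a title that manages 4FPS on eight cores would reach only 16FPS on thirty-two, which is why a scaling curve by itself says nothing about absolute speed.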
For those interested in even deeper specifics regarding how Larrabee can improve graphic rendering (transparency, handling irregular Z-buffer, etc), Intel will be giving an in-depth presentation at SIGGRAPH next week in Los Angeles, and I’m sure many game development websites will be publishing lots of information on that.
With Intel’s Developer Forum occurring in just two weeks, we can be sure that there will be some form of a Larrabee demo on display. I’m sure at that time we’ll learn even more about the new architecture and potentially see just how life-changing it could prove to be, when the processor is released in 2009/2010.
We will once again be at this year’s IDF, so stay tuned as we’ll be sure to pass any of the cool happenings along to you as they take place.