Intel's new heads of silicon development, Raja Koduri (Senior Vice President of Core and Visual Computing) and Jim Keller (Senior Vice President of Silicon Engineering), hosted the company's Architecture Day here in Santa Clara to outline its broad new vision for the future. Dr. Murthy Renduchintala, Intel's chief engineering officer and group president of the Technology, Systems Architecture & Client Group (TSCG), also presented at the event, which was held in the former home of Intel co-founder Robert Noyce.
Highlights included the unveiling of the company's new Sunny Cove CPU microarchitecture, its new Gen11 integrated graphics, 'Foveros' 3D chip-stacking technology, a teaser of the company's new Xe line of discrete graphics cards, and a new "One API" software initiative designed to simplify programming across Intel's entire product stack. We also caught a glimpse of the first 10nm Ice Lake processor for the data center.
Intel has amassed a treasure trove of new technologies over the last several years as it has diversified into new areas like AI, autonomous driving, 5G, FPGAs, and IoT. It's even added GPUs to the list. Intel's process technology touches every one of those segments, as well as the chips that power them, but its delayed 10nm process has slowed the company's progress.
To help get back on track, Intel brought in Raja Koduri and Jim Keller to outline a new cohesive vision that spans all facets of its operations. Together with the company's leadership, the pair identified six key building blocks that the company will focus on over the coming years. Those pillars include process technology, architectures, memory, interconnects, security, and software. The company hopes that focusing on these key areas will accelerate its pace of innovation and help it regain its competitive footing.
The event was a wide-ranging affair with an almost overwhelming amount of information and insight into the company's plans for the future, but a few new key technologies stood out as particularly promising. Let's take a look at some of the most interesting new technologies Intel is working on.
3D Chip Stacking With Foveros
Foveros (Greek for "awesome") is a new 3D packaging technology that Intel plans to use to build processors from dies stacked atop one another. 3D chip stacking is a well-traveled concept that has been under development for decades, but the industry hasn't been able to overcome the power and thermal challenges, not to mention poor yields, well enough to bring the technology to high-volume manufacturing.
Intel says it built Foveros upon the lessons it learned with its innovative EMIB (Embedded Multi-Die Interconnect Bridge) technology, which is a complicated name for a technique that provides high-speed communication between several chips. That technique allowed the company to connect multiple dies together with a high-speed pathway that provides nearly the same performance as a single large processor. Now Intel has expanded on the concept to allow for stacking die atop each other, thus improving density.
The key idea behind chip stacking is to mix and match different types of dies, such as CPUs, GPUs, and AI processors, to build custom SOCs (System-On-Chip). It also allows Intel to combine several different components with different processes onto the same package. That lets the company use larger nodes for the harder-to-shrink or purpose-built components. That's a key advantage as shrinking chips becomes more difficult.
Intel had a fully functioning Foveros chip on display at the event that it built for an unnamed customer. The package consists of a 10nm CPU and an I/O chip. The two chips mate with TSVs (Through-Silicon Vias) that connect the dies through vertical electrical connections in the center of the die. The channels then mate with microbumps on an underlying package. Intel also added a memory chip to the top of the stack using a conventional PoP (Package on Package) implementation. The company envisions even more complex implementations in the future that include radios, sensors, photonics, and memory chiplets.
The current design consists of two dies. The lower die houses all of the typical southbridge features, like I/O connections, and is fabbed on the 22FFL process. The upper die is a 10nm CPU that features one large compute core and four smaller 'efficiency' cores, similar to an ARM big.LITTLE processor. Intel calls this a "hybrid x86 architecture," and it could denote a fundamental shift in the company's strategy. The company later confirmed that it is working on building a new line of products based on the new hybrid x86 architecture, which could be the company's response to the Qualcomm Snapdragon processors that power Always Connected laptops. Intel representatives did confirm the first product draws less than 7 watts (2 mW standby) and is destined for fanless devices, but wouldn't elaborate further.
The package measures 12x12x1mm, but Intel isn't disclosing the measurements of the dies. Stacking small dies should be relatively simple compared to stacking larger dies, but Intel seems confident in its ability to bring the technology to larger processors. Ravishankar Kuppuswamy, Vice President & General Manager of Intel's Programmable Solutions Group, announced that the company is already developing a new FPGA using the Foveros technology. Kuppuswamy claims the Foveros technology will enable up to a 2x performance improvement over the Falcon Mesa FPGAs.
Intel Xe Discrete Graphics and Gen11 Graphics
Intel also unveiled its new Gen11 integrated graphics engine and presented a demo of Tekken 7 playing amazingly well on the new graphics architecture. The demo ran on a 10nm processor, marking the company's first public demonstration of the new graphics engine on 10nm silicon.
Intel cautioned that the diagrams it used for the presentation aren't entirely to scale, but they do give us a close look under the hood of the new graphics engine. The Gen11 engineering team focused heavily on creating a dramatic performance improvement over its previous-gen graphics engine, stating that the goal was to cram 1 teraflop of 32-bit and 2 teraflops of 16-bit floating point performance into a low power envelope.
Intel employed the familiar modular arrangement of sub-slices that each house eight execution units (EUs). Intel brought the design up to eight sub-slices, for a total of 64 EUs, a big improvement over Gen9's 24 EUs. The revamped engine also processes two pixels per clock.
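As a sanity check on those targets, here's a back-of-the-envelope calculation. It assumes each EU sustains 16 FP32 FLOPs per clock (two 4-wide FMA units, with a fused multiply-add counted as two operations, the same per-EU rate as prior Gen architectures) and a roughly 1 GHz graphics clock; both figures are our assumptions, not Intel's disclosures.

```python
# Rough FLOPS estimate for the 64-EU Gen11 design.
# Per-EU rate and clock speed are assumptions for illustration.

EUS = 8 * 8                 # eight sub-slices of eight EUs each
FLOPS_PER_EU_PER_CLK = 16   # FP32; FP16 doubles this rate
CLOCK_HZ = 1.0e9            # assumed ~1 GHz graphics clock

fp32_tflops = EUS * FLOPS_PER_EU_PER_CLK * CLOCK_HZ / 1e12
fp16_tflops = fp32_tflops * 2

print(f"FP32: {fp32_tflops:.2f} TFLOPS, FP16: {fp16_tflops:.2f} TFLOPS")
```

At that assumed clock, the math lands right on the stated goal of 1 teraflop of FP32 and 2 teraflops of FP16 throughput.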
The new design features support for tile-based rendering in addition to immediate mode rendering, which helps to reduce memory demands during some rendering workloads. The engineers also improved the memory subsystem by quadrupling the L3 cache to 3MB and separated the shared local memory to promote parallelism. The new design also has enhanced memory compression algorithms.
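To illustrate why tile-based rendering eases memory pressure, here's a minimal binning sketch (our own illustration, not Intel's implementation): triangles are sorted into screen-space tiles up front, so each tile's slice of the framebuffer can stay in on-chip memory while it is shaded instead of streaming the whole framebuffer through DRAM.

```python
# Minimal tile-binning sketch. Tile size and the bounding-box test
# are illustrative choices, not details of the Gen11 hardware.

TILE = 32  # assumed tile edge in pixels

def bin_triangles(triangles, width, height):
    """Map each tile (tx, ty) to the triangles whose bounding box touches it."""
    bins = {}
    for tri in triangles:
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        x0, x1 = max(min(xs), 0), min(max(xs), width - 1)
        y0, y1 = max(min(ys), 0), min(max(ys), height - 1)
        for ty in range(y0 // TILE, y1 // TILE + 1):
            for tx in range(x0 // TILE, x1 // TILE + 1):
                bins.setdefault((tx, ty), []).append(tri)
    return bins

tris = [[(5, 5), (20, 8), (10, 25)],      # fits entirely in tile (0, 0)
        [(30, 30), (90, 40), (60, 100)]]  # spans several tiles
bins = bin_triangles(tris, 128, 128)
print(sorted(bins))
```

Once binning is done, the renderer walks the tiles one at a time and shades only the triangles in each tile's list.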
Other improvements include a new HEVC Quick Sync Video engine that provides up to a 30% bitrate reduction over Gen9 (at the same or better visual quality), support for multiple 4K and 8K video streams within a lower power envelope, and support for Adaptive Sync technology.
Intel Xe Graphics Technology, Now Including a Discrete GPU
Intel shocked the enthusiast community earlier this year when it announced that it would enter the discrete graphics market. Intel is quick to remind us that it "lights up quintillions of pixels across the planet every day," which is a true statement based on the fact that, courtesy of its integrated graphics chips in its CPUs, Intel is the world's largest GPU producer. Now the company is bringing that experience to the discrete GPU market, and yes, that means it is bringing gaming-focused GPUs to market.
Translating that experience in integrated graphics to its new lineup of discrete GPUs isn't going to be an easy task: its last entry into the discrete GPU space came roughly two decades ago. But Intel has an IP war chest (at one point it owned more graphics patents than the other vendors combined) and has been on a full-court press recruiting the right talent for the task.
Intel presented a slide outlining its new Xe architecture that will come after the Gen11 graphics engine. Intel says the next generation of its graphics architecture will denote a transition from the "Gen" naming convention and will scale from integrated on-chip graphics up to discrete GPUs that will span the mid-range, enthusiast, and data center markets. That means it will scale from teraflops of performance integrated into a standard processor up to petaflops of performance with discrete cards.
This announcement certainly hints that both the integrated graphics and discrete cards will share the same underlying architecture, but Intel wouldn't answer further questions. Intel is also on track to deliver on its previously-announced timeline, saying the Xe graphics cards will debut in 2020.
CPU Core Roadmap, Sunny Cove Microarchitecture, 10nm Ice Lake
Intel also took the wraps off its new CPU core roadmap and "Sunny Cove" microarchitecture at the event. The company's dominance in the chip market has long been predicated on process and microarchitecture leadership, but Intel's approach to designing new processor cores has been inextricably tied to its onward march to smaller process nodes. That means that its new CPU core designs (microarchitectures) have traditionally required a move to new, smaller manufacturing processes.
That approach became a liability as Intel encountered massive delays with its 10nm process. Instead of bringing out new core designs, the company was mired on the 14nm process for four years as it constantly refined the process through a cadence of "+" iterations. Each new iteration of the 14nm process brought higher frequencies, and thus more performance, as the company marched from 4.2 GHz up to 5.1 GHz. These improvements delivered up to 70% more performance since 14nm's debut in 2014, but the lack of a new microarchitecture, which typically improves the processor's instructions per clock (IPC) throughput, slowed its progress.
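The dynamic described above follows from the rough identity performance ≈ IPC × clock frequency: with IPC flat, gains had to come from the clock. A quick sketch shows how much the frequency march alone contributes, with the remainder of the cumulative improvement coming from other refinements such as added cores and platform tuning.

```python
# Performance scales roughly as IPC x clock frequency. With no new
# microarchitecture, IPC stays flat, so only the frequency term moves.

def relative_perf(ipc_ratio, freq_ratio):
    return ipc_ratio * freq_ratio

# 4.2 GHz -> 5.1 GHz with unchanged IPC:
freq_gain = relative_perf(ipc_ratio=1.0, freq_ratio=5.1 / 4.2)
print(f"Frequency-only gain: {(freq_gain - 1) * 100:.1f}%")
```

The clock march works out to roughly a 21% gain by itself, which is why a new IPC-lifting core design matters so much.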
After learning a hard lesson exacerbated by the resurgent AMD nipping at its heels, Intel tells us that the company will now design new microarchitectures to be portable between nodes. That will allow the company to move forward even if it encounters roadblocks on its path to smaller transistors.
The Sunny Cove microarchitecture is the first new design that can be used on multiple nodes, and even though Intel has stated the new core will debut on the 10nm node, it hasn't confirmed that it will come with the Ice Lake chips. In line with its new design ethos, Intel also tells us that it will select different nodes for different products based on the needs of the segment. That's similar to the approach taken by third-party fabs like TSMC and GlobalFoundries, and it means Intel could choose to pair Sunny Cove with 14nm processors as well.
Intel's CPU Core Roadmap
Intel has typically used the same naming convention for its microarchitectures as it does for its processors. Hence, Skylake processors came with the Skylake architecture, and Kaby Lake processors came packing the Kaby Lake architecture. That old paradigm changes now that Intel has decoupled its architectures from the end products, so the company debuted a new roadmap specifically for CPU cores.
Intel presented its new roadmap for both its Core and Atom lineups. As usual, the Core series addresses the company's bread-and-butter high-performance chips, while the Atom chips serve the low power segment.
Intel's Sunny Cove will debut in 2019, bringing with it higher performance in single-threaded applications, a new instruction set architecture (ISA), and a design geared for scalability. Willow Cove will follow with an improved cache hierarchy, security features, and transistor optimizations. The Golden Cove microarchitecture will debut in 2021 with a focus on yet more single-threaded performance, AI performance, networking improvements, and new security features. Atom will receive a slower cadence of improvements, with Tremont debuting in 2019 and Gracemont in 2021, and a 'Next Mont' design to follow.
Intel plans for general performance improvements through three key design tenets of going deeper, wider, and smarter, but it is also improving what it calls 'special purpose' use cases, like AI, cryptography, and compression/decompression workloads.
Intel gave us a quick refresher on the Skylake architecture that underpins its Skylake, Kaby Lake, Coffee Lake, and Cascade Lake processors. The design processes operations through two reservation stations (RS). It can process seven operations simultaneously, dispatching them to the integer (INT), vector (VEC), store data, and address generation units (AGU).
Deeper
The new Sunny Cove design features improvements at every level of the pipeline. Key improvements include larger reorder, load, and store buffers, along with larger reservation stations. This allows the processor to look deeper into the stream of incoming instructions to find operations that are independent of each other and can run simultaneously. Those operations are then executed in parallel to improve IPC.
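A toy sketch (purely illustrative, far simpler than a real out-of-order scheduler) shows why deeper buffers help: with a larger instruction window, the scheduler can look past a stalled operation and find later, independent operations that are ready to issue.

```python
# Toy model of an out-of-order instruction window.
# An instruction is (destination_register, source_registers).

def ready_ops(instrs, window):
    """Return indices of instructions within the window whose sources
    are not produced by an earlier, not-yet-executed instruction."""
    pending = set()
    ready = []
    for i, (dest, srcs) in enumerate(instrs[:window]):
        if not (set(srcs) & pending):
            ready.append(i)
        pending.add(dest)   # dest is in flight until this op executes
    return ready

prog = [("r1", ["r0"]),   # 0
        ("r2", ["r1"]),   # 1: depends on 0
        ("r3", ["r0"]),   # 2: independent
        ("r4", ["r3"]),   # 3: depends on 2
        ("r5", ["r0"])]   # 4: independent

print(ready_ops(prog, window=2))  # shallow window finds only op 0
print(ready_ops(prog, window=5))  # deeper window also finds ops 2 and 4
```

The shallow window sees only the dependent pair and can issue one operation; the deeper window exposes two more independent operations to run in parallel.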
Intel increased the L1 data cache from 32KB, the same capacity it has used in its chips for a decade, to 48KB. The L2 cache is also larger, but the capacity is dependent upon each specific type of product, such as chips designed for either the desktop or server market. Intel also expanded the micro-op cache (UOP) and the second-level translation lookaside buffer (TLB).
Wider
A key facet of improving performance is to increase parallelism. That starts with the deeper buffers and reservation stations we covered above, but it also requires more execution units to process the operations.
Intel moved from a four-wide allocation to five-wide to allow the in-order portion of the pipeline (front end) to feed the out-of-order (back end) portion faster. Intel also increased the number of execution units to handle ten operations per cycle (up from eight with Skylake). The Store Data unit can now process two store data operations for every cycle (up from one). The address generation units (AGU) can now also handle two loads and two stores every cycle. These improvements are necessary to match the increased bandwidth from the larger L1 data cache, which now does two reads and two writes every cycle. Intel also tweaked the design of the sub-blocks in the execution units to enable data shuffles within the registers.
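The effect of widening can be sketched with a simple ceiling calculation (a simplification: real sustained IPC also depends on the workload's mix of operations and which ports they can use), since sustained throughput is bounded by the narrowest pipeline stage.

```python
# Sustained IPC cannot exceed the narrowest stage of the pipeline.
# Widths below are the figures from the article; the model is ours.

def ipc_ceiling(alloc_width, exec_ops_per_cycle):
    return min(alloc_width, exec_ops_per_cycle)

skylake = ipc_ceiling(alloc_width=4, exec_ops_per_cycle=8)
sunny_cove = ipc_ceiling(alloc_width=5, exec_ops_per_cycle=10)
print(skylake, sunny_cove)
```

In both designs allocation is the binding stage, which is why moving from four-wide to five-wide allocation matters even though the execution units could already handle more operations per cycle.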