Nvidia calls Turing its biggest architectural leap forward in more than 10 years. To prove it, the company is showing off a compendium of forward-looking capabilities that speed up performance in today’s games, introduce artificial intelligence to desktop graphics, make real-time ray tracing viable for the first time, accelerate video streaming, and support the next generation of VR hardware.
But there’s one problem with taking a victory lap before the opening bell rings: expectations get set very, very high.
Most of Turing’s flashiest features can’t even be tested yet. And although GeForce RTX 2080 Ti and 2080 cards are starting to show up in the Tom’s Hardware lab, drivers remain closely guarded by Nvidia. Really, there’s no way to tell how these things perform across our benchmark suite. But we do know quite a bit about the underlying Turing architecture. We can also tell you about TU102, TU104, and TU106—the first three Turing GPUs—plus the Founders Edition products based on those processors.
It’s abundantly clear to everyone that Nvidia will emerge on the other side of this Turing launch with the fastest gaming graphics cards you can buy. What remains uncertain is whether the company’s eyebrow-raising prices, ranging from $600 to $1200, justify an upgrade now or encourage gamers to hold off until ray tracing gains momentum.
Grand Turing: Meet the TU102 GPU
The centerpiece of today’s graphics-focused smorgasbord is TU102, a 754-square-millimeter GPU that sits at the heart of Nvidia’s GeForce RTX 2080 Ti. Its 18.6 billion transistors are fabricated on TSMC’s 12nm FinFET manufacturing process, which purportedly reflects a slight density improvement over TSMC’s previous 16nm node. The foundry even classifies 12nm technology under the same umbrella as 16nm on its website. We’re not accustomed to covering Nvidia’s “big” gaming GPU at the same time as a new architecture. But Nvidia knows that for real-time ray tracing to entice enthusiasts, it needs to run at smooth frame rates. Getting TU102 into the hands of early adopters was critical this time around.
Compared to the biggest Pascal-based GPU used in a desktop graphics card, GP102, Nvidia’s TU102 is 60% larger with a 55%-higher transistor count. But it’s not the company’s most massive processor. The Turing-based flagship is eclipsed by GV100, a 21.1 billion-transistor behemoth measuring 815mm². That GPU was introduced in 2017 with an emphasis on data center applications, and is still found on the $3000 Titan V.
TU102 is aimed at a different target market than GV100, and it’s consequently provisioned with a list of resources to match. While elements of Turing do borrow from Nvidia’s work in Volta/GV100, pieces of the architecture that either don’t benefit gamers or aren’t cost-effective on the desktop are deliberately stripped out.
For example, each Volta Streaming Multiprocessor (SM) includes 32 FP64 cores for fast double-precision math, adding up to 2688 FP64 cores across GV100. They aren’t really useful in games though, and they eat up a lot of die space, so Nvidia pulled all but two of them from each Turing SM. As a result, TU102’s double-precision rate is 1/32 of its FP32 performance, leaving just enough FP64 compute to maintain compatibility with software dependent on it. Similarly, GV100’s eight 512-bit memory controllers attached to four stacks of HBM2 would have ended up being very expensive (just ask AMD about the trouble it had pricing HBM2-equipped Radeons competitively). They were consequently replaced with Micron-made GDDR6, facilitating a cheaper solution that’s still able to serve up a big bandwidth upgrade over Pascal-based predecessors.
A complete TU102 processor comprises six Graphics Processing Clusters (GPCs), each made up of a Raster Engine and six Texture Processing Clusters (TPCs). Each TPC is composed of one PolyMorph Engine (fixed-function geometry pipeline) and two Streaming Multiprocessors (SMs). Drilling down to the SM level, we find 64 CUDA cores, eight Tensor cores, one RT core, four texture units, 16 load/store units, 256KB of register file space, four L0 instruction caches, and a 96KB configurable L1 cache/shared memory structure.
Multiply all of that out and you get a GPU with 72 SMs, 4608 CUDA cores, 576 Tensor cores, 72 RT cores, 288 texture units, and 36 PolyMorph engines.
Those resources are fed by 12 32-bit GDDR6 memory controllers, each attached to an eight-ROP cluster and 512KB of L2 cache yielding an aggregate 384-bit memory bus, 96 ROPs, and a 6MB L2 cache.
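That roll-up is straightforward multiplication down the GPC/TPC/SM hierarchy. As a sanity check, here is a minimal Python sketch using the per-SM resource counts quoted above:

```python
# TU102 resource roll-up, using the hierarchy and per-SM counts from the text.
GPCS = 6            # Graphics Processing Clusters
TPCS_PER_GPC = 6    # Texture Processing Clusters per GPC
SMS_PER_TPC = 2     # Streaming Multiprocessors per TPC

sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC   # 72 SMs
cuda_cores = sms * 64                     # 64 CUDA cores per SM  -> 4608
tensor_cores = sms * 8                    # 8 Tensor cores per SM -> 576
rt_cores = sms * 1                        # 1 RT core per SM      -> 72
texture_units = sms * 4                   # 4 texture units per SM -> 288
polymorph_engines = GPCS * TPCS_PER_GPC   # one per TPC           -> 36

print(sms, cuda_cores, tensor_cores, rt_cores, texture_units, polymorph_engines)
```

The same arithmetic applies to the cut-down retail configurations: disable a TPC and you lose two SMs' worth of everything at once.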
Putting It All Together: GeForce RTX 2080 Ti
The TU102 found on GeForce RTX 2080 Ti isn’t a complete processor, though. Whether Nvidia wanted to leave room for a Titan-class model or found yields of fully functional GPUs unsatisfactory above a certain bin, the RTX 2080 Ti has two of its TPCs disabled, leaving the card with 4352 CUDA cores, 544 Tensor cores, 68 RT cores, 272 texture units, and 34 PolyMorph engines.
Moreover, one of TU102’s 32-bit memory controllers is turned off, creating an aggregate 352-bit bus that moves data to 88 ROPs and 5.5MB of L2 cache. Nvidia matches its strategically hobbled GPU to Micron’s MT61K256M32JE-14:A modules. Eleven of these populate the RTX 2080 Ti’s PCB, leaving one emplacement vacant. Nevertheless, theoretical peak bandwidth rises sharply compared to previous-generation cards due to GDDR6’s higher data rate: at 14 Gb/s on a 352-bit interface, you’re looking at 616 GB/s. In comparison, GDDR5X at 11 Gb/s held GeForce GTX 1080 Ti to 484 GB/s.
At least on the Founders Edition card, a base core frequency of 1350 MHz jumps all the way up to a typical GPU Boost rate of 1635 MHz, so long as GeForce RTX 2080 Ti is running cool enough. And because Nvidia cites peak compute performance using GPU Boost numbers, its top-end model achieves up to 14.2 TFLOPS of single-precision math.
The reference specification calls for a GPU Boost frequency of 1545 MHz and a slightly lower TDP. Whereas the Founders Edition card’s overclock imposes a maximum board power of 260W, reference-class implementations should duck in around 250W.
Both configurations feature two NVLink interfaces for multi-GPU connectivity, though. This technology is covered in greater depth further along, but in short, each x8 link enables 50 GB/s of bi-directional bandwidth to support higher resolutions and faster refresh rates. On GeForce RTX 2080 Ti, 100 GB/s of total throughput is enough for 8K monitors in Surround mode.
Meet TU104 and GeForce RTX 2080
TU104: Turing With Middle Child Syndrome
It’s not that TU104 goes unloved, but again, we’re not used to introducing three GPUs alongside a new architecture. Then again, with GeForce RTX 2080 Ti starting at $1000, the RTX 2080, priced from $700, is going to find its way into more gaming PCs.
Similar to TU102, TSMC manufactures TU104 on its 12nm FinFET node. But a transistor count of 13.6 billion results in a smaller 545 mm² die. “Smaller,” of course, requires a bit of context. Turing Jr out-measures the last generation’s 471 mm² flagship (GP102) and comes close to the size of GK110 from the 2013-era GeForce GTX Titan.
TU104 is constructed with the same building blocks as TU102; it just features fewer of them. Streaming Multiprocessors still sport 64 CUDA cores, eight Tensor cores, one RT core, four texture units, 16 load/store units, 256KB of register space, and 96KB of L1 cache/shared memory. TPCs are still composed of two SMs and a PolyMorph geometry engine. Only here, there are four TPCs per GPC, and six GPCs spread across the processor. Therefore, a fully enabled TU104 wields 48 SMs, 3072 CUDA cores, 384 Tensor cores, 48 RT cores, 192 texture units, and 24 PolyMorph engines.
A correspondingly narrower back end feeds the compute resources through eight 32-bit GDDR6 memory controllers (256-bit aggregate) attached to 64 ROPs and 4MB of L2 cache.
TU104 also loses an eight-lane NVLink connection, limiting it to one x8 link and 50 GB/s of bi-directional throughput.
GeForce RTX 2080: TU104 Gets A (Tiny) Haircut
After seeing the GeForce RTX 2080 Ti serve up respectable performance in Battlefield V at 1920x1080 with ray tracing enabled, we can’t help but wonder if GeForce RTX 2080 is fast enough to maintain playable frame rates. Even a complete TU104 GPU is limited to 48 RT cores, compared to the 68 enabled on GeForce RTX 2080 Ti’s TU102. But because Nvidia goes in and turns off one of TU104’s TPCs to create GeForce RTX 2080, another pair of RT cores is lost (along with 128 CUDA cores, eight texture units, 16 Tensor cores, and so on).
In the end, GeForce RTX 2080 struts onto the scene with 46 SMs hosting 2944 CUDA cores, 368 Tensor cores, 46 RT cores, 184 texture units, 64 ROPs, and 4MB of L2 cache. Eight gigabytes of 14 Gb/s GDDR6 on a 256-bit bus move up to 448 GB/s of data, adding more than 100 GB/s of memory bandwidth beyond what GeForce GTX 1080 could do.
Reference and Founders Edition RTX 2080s have a 1515 MHz base frequency. Nvidia’s own overclocked models ship with a GPU Boost rating of 1800 MHz, while the reference spec is 1710 MHz. Peak FP32 compute performance of 10.6 TFLOPS puts GeForce RTX 2080 Founders Edition behind GeForce GTX 1080 Ti (11.3 TFLOPS), but well ahead of GeForce GTX 1080 (8.9 TFLOPS). Of course, the faster Founders Edition model also uses more power. Its 225W TDP is 10W higher than the reference GeForce RTX 2080, and a full 45W above last generation’s GeForce GTX 1080.
Meet TU106 and GeForce RTX 2070
A Turing Baby Is Born
GeForce RTX 2070 is the third and final card Nvidia announced at its Gamescom event. Unlike GeForce RTX 2080 and 2080 Ti, the 2070 won’t be available until sometime in October. Gamers who wait can expect to find reference models starting around $500 and Nvidia’s own Founders Edition model selling for $100 more.
The 2070 is built around a complete TU106 GPU composed of three GPCs, each with six TPCs. Naturally, the TPCs include two SMs each, adding up to 36 SMs across the processor. Those blocks are unchanged between Turing GPUs, so RTX 2070 ends up with 2304 CUDA cores, 288 Tensor cores, 36 RT cores, and 144 texture units. TU106 maintains the same 256-bit memory bus as TU104, and it’s likewise populated with 8GB of 14 Gb/s GDDR6 modules capable of moving up to 448 GB/s. A 4MB L2 cache and 64 ROPs carry over as well. The only capability blatantly missing is NVLink, which isn't supported on RTX 2070.
Although TU106 is the least-complex Turing-based GPU at launch, its 445 mm² die contains no fewer than 10.8 billion transistors. That’s still pretty enormous for what Nvidia might have once considered the middle of its portfolio. In comparison, GP106—“mid-range Pascal”—was a 200 mm² chip with 4.4 billion transistors inside. GP104 measured 314 mm² and included 7.2 billion transistors. Targeting greater-than GTX 1080 performance levels, RTX 2070 really does seem like an effort to drive Tensor and RT cores as deep as possible down the chip stack, while keeping those features useful. It’ll be interesting to see how practical they remain in almost-halved quantities versus RTX 2080 Ti once optimized software becomes available.
Pumped-up die size aside, reference GeForce RTX 2070 cards based on TU106 have a 175W TDP. That’s less than GeForce GTX 1080.