Thread: Nvidia’s Turing Architecture Explored: Inside the GeForce RTX 2080

  1. #1 Rhialto (Super Moderator)

    Nvidia’s Turing Architecture Explored: Inside the GeForce RTX 2080

    Nvidia calls Turing its biggest architectural leap forward in more than 10 years. To prove it, the company is showing off a compendium of forward-looking capabilities that speed up performance in today’s games, introduce artificial intelligence to desktop graphics, make real-time ray tracing viable for the first time, accelerate video streaming, and support the next generation of VR hardware.

    But there’s one problem with taking a victory lap before the opening bell rings: expectations get set very, very high.

    Most of Turing’s flashiest features can’t even be tested yet. And although GeForce RTX 2080 Ti and 2080 cards are starting to show up in the Tom’s Hardware lab, drivers remain closely guarded by Nvidia. Really, there’s no way to tell how these things perform across our benchmark suite. But we do know quite a bit about the underlying Turing architecture. We can also tell you about TU102, TU104, and TU106—the first three Turing GPUs—plus the Founders Edition products based on those processors.

    It’s abundantly clear to everyone that Nvidia will emerge on the other side of this Turing launch with the fastest gaming graphics cards you can buy. What remains uncertain is whether the company’s eyebrow-raising prices, ranging from $600 to $1200, justify an upgrade now or encourage gamers to hold off until ray tracing gains momentum.


    Grand Turing: Meet the TU102 GPU

    The centerpiece of today’s graphics-focused smorgasbord is TU102, a 754-square-millimeter GPU that sits at the heart of Nvidia’s GeForce RTX 2080 Ti. Its 18.6 billion transistors are fabricated on TSMC’s 12nm FinFET manufacturing process, which purportedly reflects a slight density improvement over TSMC’s previous 16nm node. The foundry even classifies 12nm technology under the same umbrella as 16nm on its website. We’re not accustomed to covering Nvidia’s “big” gaming GPU at the same time as a new architecture. But Nvidia knows that for real-time ray tracing to entice enthusiasts, it needs to run at smooth frame rates. Getting TU102 into the hands of early adopters was critical this time around.

    Compared to the biggest Pascal-based GPU used in a desktop graphics card, GP102, Nvidia’s TU102 is 60% larger with a 55%-higher transistor count. But it’s not the company’s most massive processor. The Turing-based flagship is eclipsed by GV100, a 21.1 billion-transistor behemoth measuring 815mm². That GPU was introduced in 2017 with an emphasis on data center applications, and is still found on the $3000 Titan V.

    TU102 is aimed at a different target market than GV100, and it’s consequently provisioned with a list of resources to match. While elements of Turing do borrow from Nvidia’s work in Volta/GV100, pieces of the architecture that either don’t benefit gamers or aren’t cost-effective on the desktop are deliberately stripped out.

    For example, each Volta Streaming Multiprocessor (SM) includes 32 FP64 cores for fast double-precision math, adding up to 2688 FP64 cores across GV100. They aren’t really useful in games though, and they eat up a lot of die space, so Nvidia pulled all but two of them from each Turing SM. As a result, TU102’s double-precision rate is 1/32 of its FP32 performance, leaving just enough FP64 compute to maintain compatibility with software dependent on it. Similarly, GV100’s eight 512-bit memory controllers attached to four stacks of HBM2 would have ended up being very expensive (just ask AMD about the trouble it had pricing HBM2-equipped Radeons competitively). They were consequently replaced with Micron-made GDDR6, facilitating a cheaper solution that’s still able to serve up a big bandwidth upgrade over Pascal-based predecessors.

    A complete TU102 processor comprises six Graphics Processing Clusters (GPCs) made up of a Raster Engine and six Texture Processing Clusters (TPCs). Each TPC is composed of one PolyMorph Engine (fixed-function geometry pipeline) and two Streaming Multiprocessors (SMs). Again, at the SM level, we find 64 CUDA cores, eight Tensor cores, one RT core, four texture units, 16 load/store units, 256KB of register file space, four L0 instruction caches, and a 96KB configurable L1 cache/shared memory structure.

    Multiply all of that out and you get a GPU with 72 SMs, 4608 CUDA cores, 576 Tensor cores, 72 RT cores, 288 texture units, and 36 PolyMorph engines.

    Those resources are fed by twelve 32-bit GDDR6 memory controllers, each attached to an eight-ROP cluster and 512KB of L2 cache, yielding an aggregate 384-bit memory bus, 96 ROPs, and 6MB of L2 cache.
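
    For readers who want to sanity-check the arithmetic, the full-chip totals fall straight out of the per-block counts quoted above. Here is a minimal Python sketch of that tally (the FP64 ratio from the previous section is included as an extra check; everything else comes from the figures in this article):

        # Back-of-the-envelope tally of a fully enabled TU102, using the
        # per-block counts quoted above.
        GPCS = 6                  # Graphics Processing Clusters per GPU
        TPCS_PER_GPC = 6          # Texture Processing Clusters per GPC
        SMS_PER_TPC = 2           # Streaming Multiprocessors per TPC

        CUDA_PER_SM = 64
        TENSOR_PER_SM = 8
        RT_PER_SM = 1
        TEX_PER_SM = 4
        FP64_PER_SM = 2           # all but two FP64 cores stripped out per SM

        sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC       # 72
        cuda_cores = sms * CUDA_PER_SM                # 4608
        tensor_cores = sms * TENSOR_PER_SM            # 576
        rt_cores = sms * RT_PER_SM                    # 72
        texture_units = sms * TEX_PER_SM              # 288
        polymorph_engines = GPCS * TPCS_PER_GPC       # 36, one per TPC
        fp64_ratio = FP64_PER_SM / CUDA_PER_SM        # 1/32 of the FP32 rate, as noted earlier

        # Memory back end: twelve 32-bit GDDR6 controllers, each paired with
        # eight ROPs and 512KB of L2 cache.
        controllers = 12
        bus_width_bits = controllers * 32             # 384-bit aggregate
        rops = controllers * 8                        # 96
        l2_cache_mb = controllers * 512 / 1024        # 6MB

        print(sms, cuda_cores, tensor_cores, rt_cores, texture_units, polymorph_engines)
        print(bus_width_bits, rops, l2_cache_mb)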


    Putting It All Together: GeForce RTX 2080 Ti

    The TU102 found on GeForce RTX 2080 Ti isn’t a complete processor, though. Whether Nvidia wanted to leave room for a Titan-class model or found yields of fully functional GPUs unsatisfactory above a certain bin, the RTX 2080 Ti has two of its TPCs disabled, leaving the card with 4352 CUDA cores, 544 Tensor cores, 68 RT cores, 272 texture units, and 34 PolyMorph engines.

    Moreover, one of TU102’s 32-bit memory controllers is turned off, creating an aggregate 352-bit bus that moves data to 88 ROPs and 5.5MB of L2 cache. Nvidia matches its strategically hobbled GPU to Micron’s MT61K256M32JE-14:A modules. Eleven of these populate the RTX 2080 Ti’s PCB, leaving one footprint vacant. Nevertheless, theoretical peak bandwidth rises sharply compared to the previous-generation card due to GDDR6’s higher data rate: at 14 Gb/s on a 352-bit interface, you’re looking at 616 GB/s. In comparison, GDDR5X at 11 Gb/s held GeForce GTX 1080 Ti to 484 GB/s.

    At least on the Founders Edition card, a base core frequency of 1350 MHz jumps all the way up to a typical GPU Boost rate of 1635 MHz, so long as GeForce RTX 2080 Ti is running cool enough. And because Nvidia cites peak compute performance using GPU Boost numbers, its top-end model achieves up to 14.2 TFLOPS of single-precision math.
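
    Two quick formulas reproduce the bandwidth and compute figures above: peak memory bandwidth is the per-pin data rate times the bus width divided by eight bits per byte, and peak FP32 throughput counts one fused multiply-add as two operations. A minimal sketch, plugging in the GeForce RTX 2080 Ti Founders Edition numbers quoted here:

        def gddr_bandwidth_gbs(data_rate_gbps_per_pin, bus_width_bits):
            """Peak bandwidth in GB/s: per-pin rate (Gb/s) * bus width / 8 bits per byte."""
            return data_rate_gbps_per_pin * bus_width_bits / 8

        def fp32_tflops(cuda_cores, boost_clock_ghz):
            """Peak single-precision throughput, counting an FMA as two FP operations."""
            return cuda_cores * boost_clock_ghz * 2 / 1000

        print(gddr_bandwidth_gbs(14, 352))           # 616.0 GB/s for RTX 2080 Ti's GDDR6
        print(gddr_bandwidth_gbs(11, 352))           # 484.0 GB/s for GTX 1080 Ti's GDDR5X
        print(round(fp32_tflops(4352, 1.635), 1))    # ~14.2 TFLOPS at the 1635 MHz GPU Boost rating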

    The reference specification calls for a GPU Boost frequency of 1545 MHz and a slightly lower TDP. Whereas the Founders Edition card’s overclock imposes a maximum board power of 260W, reference-class implementations should duck in around 250W.

    Both configurations feature two NVLink interfaces for multi-GPU connectivity, though. This technology is covered in greater depth further along, but in short, each x8 link enables 50 GB/s of bi-directional bandwidth to support higher resolutions and faster refresh rates. On GeForce RTX 2080 Ti, 100 GB/s of total throughput is enough for 8K monitors in Surround mode.

    Meet TU104 and GeForce RTX 2080

    TU104: Turing With Middle Child Syndrome

    It’s not that TU104 goes unloved, but again, we’re not used to introducing three GPUs alongside a new architecture. Then again, with GeForce RTX 2080 Ti starting at $1000, the RTX 2080, priced from $700, is going to find its way into more gaming PCs.


    Similar to TU102, TSMC manufactures TU104 on its 12nm FinFET node. But a transistor count of 13.6 billion results in a smaller 545 mm² die. “Smaller,” of course, requires a bit of context. Turing Jr. out-measures the last generation’s 471 mm² flagship (GP102) and comes close to the size of GK110 from the 2013-era GeForce GTX Titan.

    TU104 is constructed with the same building blocks as TU102; it just features fewer of them. Streaming Multiprocessors still sport 64 CUDA cores, eight Tensor cores, one RT core, four texture units, 16 load/store units, 256KB of register space, and 96KB of L1 cache/shared memory. TPCs are still composed of two SMs and a PolyMorph geometry engine. Only here, there are four TPCs per GPC, and six GPCs spread across the processor. Therefore, a fully enabled TU104 wields 48 SMs, 3072 CUDA cores, 384 Tensor cores, 48 RT cores, 192 texture units, and 24 PolyMorph engines.

    A correspondingly narrower back end feeds the compute resources through eight 32-bit GDDR6 memory controllers (256-bit aggregate) attached to 64 ROPs and 4MB of L2 cache.

    TU104 also loses an eight-lane NVLink connection, limiting it to one x8 link and 50 GB/s of bi-directional throughput.

    GeForce RTX 2080: TU104 Gets A (Tiny) Haircut

    After seeing the GeForce RTX 2080 Ti serve up respectable performance in Battlefield V at 1920x1080 with ray tracing enabled, we can’t help but wonder if GeForce RTX 2080 is fast enough to maintain playable frame rates. Even a complete TU104 GPU is limited to 48 RT cores compared to TU102’s 68. But because Nvidia goes in and turns off one of TU104’s TPCs to create GeForce RTX 2080, another pair of RT cores is lost (along with 128 CUDA cores, eight texture units, 16 Tensor cores, and so on).

    In the end, GeForce RTX 2080 struts onto the scene with 46 SMs hosting 2944 CUDA cores, 368 Tensor cores, 46 RT cores, 184 texture units, 64 ROPs, and 4MB of L2 cache. Eight gigabytes of 14 Gb/s GDDR6 on a 256-bit bus move up to 448 GB/s of data, adding more than 100 GB/s of memory bandwidth beyond what GeForce GTX 1080 could do.

    Reference and Founders Edition RTX 2080s have a 1515 MHz base frequency. Nvidia’s own overclocked models ship with a GPU Boost rating of 1800 MHz, while the reference spec is 1710 MHz. Peak FP32 compute performance of 10.6 TFLOPS puts GeForce RTX 2080 Founders Edition behind GeForce GTX 1080 Ti (11.3 TFLOPS), but well ahead of GeForce GTX 1080 (8.9 TFLOPS). Of course, the faster Founders Edition model also uses more power. Its 225W TDP is 10W higher than the reference GeForce RTX 2080, and a full 45W above last generation’s GeForce GTX 1080.
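
    The same back-of-the-envelope math reproduces the RTX 2080 figures in this paragraph; a short sketch using the Founders Edition counts and clocks quoted above:

        # GeForce RTX 2080 Founders Edition, per the counts and clocks above.
        cuda_cores = 2944
        boost_ghz = 1.800                                    # Founders Edition GPU Boost rating
        print(round(cuda_cores * boost_ghz * 2 / 1000, 1))   # ~10.6 TFLOPS FP32

        print(14 * 256 / 8)                                  # 448.0 GB/s from 14 Gb/s GDDR6 on a 256-bit bus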

    Meet TU106 and GeForce RTX 2070

    A Turing Baby Is Born

    GeForce RTX 2070 is the third and final card Nvidia announced at its Gamescom event. Unlike GeForce RTX 2080 and 2080 Ti, the 2070 won’t be available until sometime in October. Gamers who wait can expect to find reference models starting around $500 and Nvidia’s own Founders Edition model selling for $100 more.


    The 2070 is built around a complete TU106 GPU composed of three GPCs, each with six TPCs. Naturally, the TPCs include two SMs each, adding up to 36 SMs across the processor. Those blocks are unchanged between Turing GPUs, so RTX 2070 ends up with 2304 CUDA cores, 288 Tensor cores, 36 RT cores, and 144 texture units. TU106 maintains the same 256-bit memory bus as TU104, and it’s likewise populated with 8GB of 14 Gb/s GDDR6 modules capable of moving up to 448 GB/s. A 4MB L2 cache and 64 ROPs carry over as well. The only capability blatantly missing is NVLink, which isn't supported on RTX 2070.

    Although TU106 is the least-complex Turing-based GPU at launch, its 445 mm² die contains no fewer than 10.8 billion transistors. That’s still pretty enormous for what Nvidia might have once considered the middle of its portfolio. In comparison, GP106—“mid-range Pascal”—was a 200 mm² chip with 4.4 billion transistors inside. GP104 measured 314 mm² and included 7.2 billion transistors. Targeting performance beyond GTX 1080 levels, RTX 2070 really does seem like an effort to drive Tensor and RT cores as deep as possible down the chip stack while keeping those features useful. It’ll be interesting to see how practical they remain in almost-halved quantities versus RTX 2080 Ti once optimized software becomes available.

    Pumped-up die size aside, reference GeForce RTX 2070 cards based on TU106 have a 175W TDP. That’s less than GeForce GTX 1080.

  2. #2 Rhialto (Super Moderator)

    Turing Improves Performance in Today’s Games

    Some enthusiasts have expressed concern that Turing-based cards don’t boast dramatically higher CUDA core counts than their previous-gen equivalents. The older boards even have higher GPU Boost frequencies. Nvidia didn’t help matters by failing to address generational improvements in today’s games at its launch event in Cologne, Germany. But the company did put a lot of effort into rearchitecting Turing for better per-core performance.


    To start, Turing borrows from the Volta playbook in its support for simultaneous execution of FP32 arithmetic instructions, which constitute most shader workloads, and INT32 operations (for addressing/fetching data, floating-point min/max, compare, etc.). When you hear about Turing cores achieving better performance than Pascal at a given clock rate, this capability largely explains why.

    In generations prior, a single math data path meant that dissimilar instruction types couldn’t execute at the same time, causing the floating-point pipeline to sit idle whenever non-FP operations were needed in a shader program. Volta sought to change this by creating separate pipelines. Although Nvidia eliminated the second dispatch unit assigned to each warp scheduler, it also claimed that instruction issue throughput rose.

    How is that possible? It's all about the composition of each architecture's SM.

    Check out the two block diagrams below. Pascal has one warp scheduler per quad, with each quad containing 32 CUDA cores. A quad's scheduler can issue one instruction pair per clock through the two dispatch units with the stipulation that both instructions come from the same 32-thread warp, and only one can be a core math instruction. Still, that's one dispatch unit per 16 CUDA cores.

    In contrast, Turing packs fewer CUDA cores into an SM, and then spreads more SMs across each GPU. There's now one scheduler per 16 CUDA cores (2x Pascal), along with one dispatch unit per 16 CUDA cores (same as Pascal). Gone is the instruction-pairing constraint. And because Turing doubles up on schedulers, it only needs to issue an instruction to the CUDA cores every other cycle to keep them full (with 32 threads per warp, 16 CUDA cores take two cycles to consume them all). In between, it's free to issue a different instruction to any other unit, including the new INT32 pipeline. The new instruction can also be from any warp.

    Turing's flexibility comes from having twice as many schedulers as Pascal, so that each one has less math to feed per cycle, not from a more complicated design. The schedulers still issue one instruction per clock cycle. It's just that the architecture is better able to utilize resources thanks to its improved balance throughout the SM.

    According to Nvidia, the potential gains are significant. In a game like Battlefield 1, for every 100 floating-point instructions, there are 50 non-FP instructions in the shader code. Other titles bias more heavily toward floating-point math. But the company claims there are an average of 36 integer pipeline instructions that would stall the floating-point pipeline for every 100 FP instructions. Those now get offloaded to the INT32 cores.
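
    Treating those numbers as a simple issue-slot model gives a rough sense of the potential uplift. The sketch below assumes, as a simplification, that every integer instruction previously cost the FP32 pipeline one issue slot and now overlaps completely with floating-point work; real shaders won't hit this bound:

        # Rough upper bound on the gain from co-issuing INT32 alongside FP32.
        def concurrent_issue_speedup(int_ops_per_100_fp):
            serial_slots = 100 + int_ops_per_100_fp   # single-path (Pascal-style) issue
            overlapped_slots = 100                    # INT32 hidden behind FP32 on Turing
            return serial_slots / overlapped_slots

        print(concurrent_issue_speedup(36))   # 1.36x for Nvidia's "average" shader mix
        print(concurrent_issue_speedup(50))   # 1.5x for a Battlefield 1-like mix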

    Despite the separation of FP32 and INT32 paths on its block diagrams, Nvidia says each Turing SM contains 64 CUDA cores to keep things straightforward. The Turing SM also comprises 16 load/store units, 16 special-function units, 256KB of register file space, 96KB of shared memory and L1 data cache, four texture units, eight Tensor cores, and one RT core.

    On paper, an SM in the previous-generation GP102 appears more complex, sporting twice as many CUDA cores, load/store units, SFUs, texture units, just as much register file capacity, and more cache. But remember that the new TU102 boasts as many as 72 SMs across the GPU, while GP102 topped out at 30 SMs. The result is a Turing-based flagship with 21% more CUDA cores and texture units than GeForce GTX 1080 Ti, but also way more SRAM for registers, shared memory, and L1 cache (not to mention 6MB of L2 cache, doubling GP102’s 3MB).

    That increase of on-die memory plays another critical role in improving performance, as does its hierarchical organization. Consider the three different data memories: texture cache for textures, L1 cache for load/store data, and shared memory for compute workloads. As far back as Kepler, each SM had 48KB of read-only texture cache, plus a 64KB shared memory/L1 cache. In Maxwell/Pascal, the L1 and texture caches were combined, leaving 96KB of shared memory on its own. Now, Turing combines all three into one shared and configurable 96KB pool.

    The benefit of unification, of course, is that regardless of whether a workload is optimized for L1 or shared memory, on-chip storage is utilized rather than sitting idle as it may have before. Moving L1 functionality down has the additional benefit of putting it on a wider bus, doubling L1 cache bandwidth (at the TPC level, Pascal supports 64 bytes per clock cache hit bandwidth, while Turing can do 128 bytes per clock). And because those 96KB can be configured as 64KB L1 and 32KB shared memory (or vice versa), L1 capacity can be 50% higher on a per-SM basis.


    Combined, Nvidia claims that the effect of its redesigned math pipelines and memory architecture is a 50% performance uplift per CUDA core. To keep those data-hungry cores fed more effectively, Nvidia paired TU102 with GDDR6 memory and further optimized its traffic-reduction technologies (like delta color compression). Pitting GeForce GTX 1080 Ti’s 11 Gb/s GDDR5X modules against RTX 2080 Ti’s 14 Gb/s GDDR6 memory, both on an aggregate 352-bit bus, you’re looking at a 27%-higher data rate and peak bandwidth. Then, depending on the game, when RTX 2080 Ti can avoid sending data over the bus at all, effective throughput climbs by a further double-digit percentage.


    Designing for The Future: Tensor Cores and DLSS

    Although the Volta architecture was full of significant changes compared to Pascal, the addition of Tensor cores was most indicative of GV100’s ultimate purpose: to accelerate 4x4 matrix operations with FP16 inputs, which form the basis of neural network training and inferencing.

    Like the Volta SM, Turing exposes two Tensor cores per quad, or eight per Streaming Multiprocessor. TU102 does feature fewer SMs than GV100 (72 versus 84), and GeForce RTX 2080 Ti has fewer SMs enabled than Titan V (68 versus 80). So, the RTX 2080 Ti only has 544 Tensor cores to Titan V’s 640. But TU102’s Tensor cores are implemented differently in that they also support INT8 and INT4 operations. This makes sense of course; GV100 was designed to train neural networks, while TU102 is a gaming chip able to use trained networks for inferencing.

    Nvidia claims that TU102’s Tensor cores deliver up to 114 TFLOPS for FP16 operations, 228 TOPS of INT8, and 455 TOPS INT4. The FP16 multiply with FP32 accumulation operations used for deep learning training are supported as well, but at half-speed compared to FP16 accumulate.
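
    Those peaks follow from a rate of 64 FP16 FMA operations per Tensor core per clock, with INT8 and INT4 running at two and four times the FP16 rate. A minimal sketch using the RTX 2080 Ti's 544 enabled Tensor cores and its 1635 MHz Founders Edition boost clock (which is where the 113.8 TFLOPS figure in the RTX-OPS section later in this article comes from):

        # Tensor throughput at GeForce RTX 2080 Ti Founders Edition clocks.
        tensor_cores = 544
        boost_hz = 1.635e9
        fma_per_core_per_clock = 64            # 4x4x4 FP16 FMA per Tensor core per clock

        fp16_tflops = tensor_cores * boost_hz * fma_per_core_per_clock * 2 / 1e12
        print(round(fp16_tflops, 1))           # ~113.8 TFLOPS FP16 (FP16 accumulate)
        print(round(fp16_tflops * 2))          # ~228 TOPS INT8
        print(round(fp16_tflops * 4))          # ~455 TOPS INT4
        print(round(fp16_tflops / 2, 1))       # FP16 with FP32 accumulate runs at half speed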

    Most of Nvidia’s current plans for the Tensor cores revolve around neural graphics. However, the company is also researching other applications of deep learning on desktop cards. Intelligent enemies, for instance, would completely change the way gamers approach boss fights. Speech synthesis, voice recognition, material/art enhancement, cheat detection, and character animation are all areas where AI is already in use, or where Nvidia sees potential.


    But of course, Deep Learning Super Sampling (DLSS) is the focus for GeForce RTX. The process by which DLSS is implemented does require developer support through Nvidia’s NGX API. But the company claims integration is fairly easy, and has a list of games with planned support to demonstrate industry enthusiasm for what DLSS can do for image quality. This might be because the heavy lifting is handled by Nvidia itself. The company offers to generate ground truth images—the highest-quality representation possible, achieved through super-high resolution, lots of samples per frame, or lots of frames averaged together. Then, it’s able to train an AI model with the 660-node DGX-1-based SaturnV server to get lower-quality images as close to the ground truth as possible. These models are downloaded through Nvidia’s driver and accessed using the Tensor cores on any GeForce RTX graphics card. Nvidia claims that each AI model is measured in megabytes, making them relatively lightweight.

    While we hoped Nvidia's GeForce Experience (GFE) software wouldn't be a requisite of DLSS, we suspected it probably would be. Sure enough, the company confirmed that the features of NGX are tightly woven into GFE. If the software detects a Turing-based GPU, it downloads a package called NGX Core, which determines if games/apps are relevant to NGX. When there's a match, NGX Core retrieves any associated deep neural networks for later use.


  3. #3 Rhialto (Super Moderator)

    Will DLSS be worth the effort? It’s hard to say at this point. We’ve seen one example of DLSS from Epic’s Infiltrator demo and it looked great. But it’s unclear if Nvidia can get the same caliber of results from any game, regardless of genre, pace, environmental detail, and so on. What we do know is that DLSS is a real-time convolutional auto-encoder trained on images sampled 64 times. It’s given a normal-resolution frame through the NGX API, and spits back a higher-quality version of that frame.
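
    Nvidia hasn't published DLSS's network topology or training recipe, but the general shape of a convolutional auto-encoder is easy to illustrate. Below is a deliberately tiny, hypothetical PyTorch sketch of the idea: an encoder compresses the rendered frame, a decoder reconstructs a cleaner version, and training pushes the output toward a 64-sample reference image. The layer counts, sizes, and loss are invented for illustration and have nothing to do with Nvidia's actual model:

        import torch
        import torch.nn as nn

        class TinyFrameAutoEncoder(nn.Module):
            """Toy convolutional auto-encoder: rendered RGB frame in, reconstructed frame out."""
            def __init__(self):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                )
                self.decoder = nn.Sequential(
                    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
                    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
                )

            def forward(self, frame):
                return self.decoder(self.encoder(frame))

        # One training step, pushing the output toward a 64-sample "ground truth" frame.
        model = TinyFrameAutoEncoder()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        aliased = torch.rand(1, 3, 256, 256)        # stand-in for a normally rendered frame
        ground_truth = torch.rand(1, 3, 256, 256)   # stand-in for a 64x super-sampled reference

        loss = nn.functional.l1_loss(model(aliased), ground_truth)
        loss.backward()
        optimizer.step()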

    Shortly after its Gamescom announcement, Nvidia started teasing performance figures of GeForce RTX 2080 with DLSS enabled versus GeForce RTX 2080 versus GTX 1080. Those results made it seem like turning DLSS on improved frame rates outright, but there just wasn't much supporting data to clarify how the benchmark numbers were achieved. As it turns out, DLSS improves performance by reducing the card's shading workload while achieving similar quality.

    Turing can produce higher-quality output from a given number of input samples compared to a post-processing algorithm like Temporal Anti-Aliasing (TAA). For DLSS, Nvidia turns that into a performance benefit by reducing input samples to the network until the final output (at the same resolution as TAA) is close to a similar quality. Even though Turing spends time running the deep neural network, the savings attributable to less shading work are greater. Set to 2x DLSS, Nvidia says it can achieve the equivalent of 64x super-sampling by rendering its inputs at the target resolution, while side-stepping the transparency artifacts and blurriness sometimes seen from TAA.

    Twenty-five games are already queued up with DLSS support, including existing titles like Ark: Survival Evolved, Final Fantasy XV, and PlayerUnknown’s Battlegrounds, plus several others that aren’t out yet.

    Hybrid Ray Tracing in Real-Time

    Beyond the scope of anything that Volta touched, and arguably the most promising chapter in Turing’s story, is the RT core bolted to the bottom of each SM in TU102. Nvidia’s RT cores are essentially fixed-function accelerators for Bounding Volume Hierarchy (BVH) traversal and triangle intersection evaluation. Both operations are essential to the ray tracing algorithm. For more background about ray tracing and why it’s so visually appealing, see our feature: What is Ray Tracing and Why Do You Want it in Your GPU?

    In short, a BVH wraps a scene’s geometry in a hierarchy of nested boxes. Those boxes let the algorithm narrow down which triangles a ray might hit by walking a tree structure: each time a ray is found to intersect a box, the smaller boxes inside it are tested in turn, until a leaf box containing actual triangles is reached. Without BVHs, an algorithm would be forced to search through the entire scene, burning tons of cycles testing every triangle for an intersection.


    Running this algorithm today is entirely possible using the Microsoft D3D12 Raytracing Fallback Layer APIs, which use compute shaders to emulate DirectX Raytracing on devices without native support (and redirect to DXR when driver support is identified). On a Pascal-based GPU, for instance, the BVH scan happens on programmable cores, which fetch each box, decode it, test for an intersection, and determine if there’s a sub-box or triangles inside. The process iterates until triangles are found, at which point they’re tested for intersection with the ray. As you might imagine, this operation is very expensive to run in software, preventing real-time ray tracing from running smoothly on today’s graphics processors.
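
    Conceptually, the loop being emulated looks like the sketch below: walk the box hierarchy, descend only into boxes the ray actually hits, and test triangles only at the leaves. This is a generic illustration of BVH traversal rather than Nvidia's hardware or the Fallback Layer's code; ray_hits_box and ray_hits_triangle are placeholder intersection tests:

        # Generic BVH traversal sketch. A node holds either child boxes or, at a leaf,
        # a list of triangles. Intersection tests are passed in as placeholders.
        def trace_ray(ray, root, ray_hits_box, ray_hits_triangle):
            """Return the closest triangle hit along `ray`, or None."""
            closest_hit = None
            stack = [root]
            while stack:
                node = stack.pop()
                if not ray_hits_box(ray, node.bounds):
                    continue                        # prune: the ray misses this whole box
                if node.triangles is not None:      # leaf: test the actual geometry
                    for tri in node.triangles:
                        hit = ray_hits_triangle(ray, tri)
                        if hit and (closest_hit is None or hit.t < closest_hit.t):
                            closest_hit = hit
                else:                               # interior node: descend into child boxes
                    stack.extend(node.children)
            return closest_hit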


    By creating fixed-function accelerators for the box and triangle intersection steps, the SM casts a ray into the scene using a ray generation shader and hands off acceleration structure traversal to the fixed-function RT core. All of the intersection evaluation happens much more quickly as a result, and the SM’s other resources are freed up for shading just as they would for a traditional rasterization workload.


    According to Nvidia, a GeForce GTX 1080 Ti can cast about 1.1 billion rays per second in software using its CUDA cores, which are capable of 11.3 FP32 TFLOPS. In comparison, GeForce RTX 2080 Ti can cast about 10 billion rays per second using its 68 RT cores. It’s important to note that neither of those figures is based on calculated peaks like a lot of speeds and feeds. Rather, Nvidia took the geometric mean of results from several workloads to settle on its “10+ gigarays” value.

    NVLink: A Bridge To…Anywhere?

    TU102 and TU104 are Nvidia’s first desktop GPUs rocking the NVLink interconnect rather than a Multiple Input/Output (MIO) interface for SLI support. The former makes two x8 links available, while the latter is limited to one. Each link facilitates up to 50 GB/s of bidirectional bandwidth. So, GeForce RTX 2080 Ti is capable of up to 100 GB/s between cards and RTX 2080 can do half of that.

    One link or two, though, SLI over NVLink only works across a pair of GeForce RTX boards with at least one empty slot between them for airflow. Officially, Pascal-era GPUs endured the same two-card maximum. Technically, however, as many as four top-end GeForce GTXes could be made to work together in a handful of benchmarks. These days, you’ll also have to purchase your own GeForce RTX NVLink Bridge for multi-GPU connectivity. Three- and four-slot sizes are both available for $80 from Nvidia’s website.

    Some of the trouble last generation was caused by bandwidth constraints between SLI bridges. Compared to the original SLI interface’s 1 GB/s MIO link, Pascal’s implementation drove ~4 GB/s. That was fast enough to get the second card’s rendered frame back to the primary board in time for smooth output to a 4K monitor at 60 Hz. But it wouldn’t have been able to keep up at 120 Hz and higher, which is where today’s highest-end gaming displays operate.


    Even in a single-link configuration, NVLink can move data so quickly that SLI on an 8K screen is possible. Driving three 4K monitors at 144 Hz in Surround mode is no problem at all. Two x8 links have the throughput needed for 8K displays in Surround.

    Really, the question is: who cares anymore? AMD and Nvidia did such a good job of pumping the brakes on multi-GPU configurations that Tom’s Hardware readers rarely, if ever, ask for benchmark results from an SLI setup. Back in the day, value-minded gamers used SLI to match the performance of higher-end cards. Nvidia put a stop to that by removing support from lower-end models in its product stack. Now, even GeForce RTX 2070 lacks an NVLink connector. Older DirectX 11-based games still run well across two cards, and a handful of DirectX 12-based titles do exploit the API’s explicit multi-adapter control. But the fact that developers like EA DICE are pouring time into taxing features like real-time ray tracing and ignoring multi-GPU says a lot about SLI’s future.

    We’ve heard Nvidia representatives say they’ll have more to discuss on this front in the future. For now, NVLink support on GeForce RTX 2080 Ti and 2080 is a novelty, particularly as we’re able to focus on playable frame rates at 4K and G-Sync technology to keep the action smooth.

  4. #4 Rhialto (Super Moderator)

    Mesh Shading: A Foundation for More On-Screen Objects

    Architectural enhancements are commonly split between changes that affect gaming today and new features that require support from future titles. At least at launch, Nvidia’s GeForce RTX cards are going to be judged predominantly on the former. Today’s games are what we can quantify. And although veterans in the hardware field have their own opinions of what real-time ray tracing means to an immersive gaming experience, I’ve been around long enough to know that you cannot recommend hardware based only on promises of what’s to come.


    The Turing architecture does, however, include a few more advanced graphics features loaded with potential but not yet accessible. Mesh shaders, for instance, augment the existing DirectX 11/12 graphics pipeline, in which the host processor is responsible for calculating levels of detail, culling objects that aren’t in view, and issuing draw calls for each object. Up to a certain point, CPUs are fine for this. But in complex scenes with hundreds of thousands of objects, they have a hard time keeping up.

    By using a mesh shader instead, game developers can offload LOD calculation and object culling to a task shader, which replaces the vertex shader and hull shader stages. Because the task shader is more general than a vertex shader, it’s able to take a list of objects from the CPU and run it through a compute program that determines where objects are located, what version of each object to use based on LOD, and if an object needs to be culled before passing it down the pipeline.
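
    The division of labor is easier to see in code. The CPU hands over a flat object list, and a task-shader-style pass decides, per object, whether to cull it and which level of detail to use. The sketch below is an illustrative Python analogue of that compute pass, not shader code; the object fields, distance thresholds, and in_frustum test are all invented for the example:

        from dataclasses import dataclass

        @dataclass
        class SceneObject:
            position: tuple       # world-space center (x, y, z)
            radius: float         # bounding-sphere radius
            lods: list            # meshes, ordered from most to least detailed

        def select_work(objects, camera_pos, in_frustum):
            """Return (object, lod_mesh) pairs that survive culling."""
            draw_list = []
            for obj in objects:
                if not in_frustum(obj.position, obj.radius):
                    continue                        # culled before any geometry is processed
                dx, dy, dz = (o - c for o, c in zip(obj.position, camera_pos))
                distance = (dx * dx + dy * dy + dz * dz) ** 0.5
                lod = min(int(distance // 50), len(obj.lods) - 1)   # 50 units per LOD step (arbitrary)
                draw_list.append((obj, obj.lods[lod]))
            return draw_list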

    Variable Rate Shading: Get Smarter About Shading, Too

    In addition to optimizing the way Turing processes geometry, Nvidia also supports a mechanism for choosing the rate at which 16x16 blocks of pixels are shaded in different parts of a scene to improve performance. Naturally, the hardware can still shade every single pixel in a 1x1 pattern. But the architecture also facilitates 2x1 and 1x2 options, along with 2x2 and 4x4 blocks.
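
    The arithmetic behind each coarse rate is simple: a 16x16 block holds 256 pixels, and a WxH rate runs one shader invocation per WxH group of pixels. A short sketch of the resulting workload reduction for the rates listed above:

        # Shading work per 16x16 tile for each supported coarse-shading rate.
        TILE_PIXELS = 16 * 16                    # 256 pixels per block

        for w, h in [(1, 1), (2, 1), (1, 2), (2, 2), (4, 4)]:
            samples = TILE_PIXELS // (w * h)     # one shader invocation per WxH pixel group
            print(f"{w}x{h}: {samples} invocations ({samples / TILE_PIXELS:.0%} of full-rate shading)")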


    Nvidia offers several use cases where variable rate shading is practical (you don’t want to apply it gratuitously and negatively affect image quality). The first is content-adaptive shading, where less detailed parts of a scene don’t change as much and can be shaded at a lower rate. There’s actually a build of Wolfenstein II with variable rate shading active. In it, you can turn on the shading rate visualization to watch how complex objects aren’t affected at all by this technology, while lower-frequency areas get shaded at a lower resolution. A number of intermediate steps facilitate multiple rates. We must imagine that game developers looking to exploit variable rate shading in a content-adaptive manner will prioritize quality over performance. Still, we’d like to see this enabled as a toggleable option so third parties can draw comparisons with the feature on and off.


    Motion-adaptive shading is another interesting application of Nvidia’s variable rate shading technology, where objects flying by are perceived at a lower resolution than whatever subject we’re focused on. Based on the motion vector of each pixel, game developers can determine how aggressively to reduce the shading rate and apply the same patterns seen in the content-adaptive example. Doing this correctly does require an accurate frequency response model to ensure the right rates are used when you spin around, sprint forward, or slow back down.

    Again, Nvidia presented a demo of Wolfenstein II with content- and motion-adaptive shading enabled. The performance uplift attributed to variable rate shading in that title was on the order of ~15%, partly because Wolfenstein II already runs at such high frame rates. But on a slower card in a more demanding game, it may become possible to gain 20% or more at the 60ish FPS level. Perhaps more important, there was no perceivable image quality loss.

    Although Nvidia hasn’t said much about how these capabilities are going to be utilized by developers, we do know that its Wolfenstein II demo was made possible through Vulkan extensions. The company is working with Microsoft to enable DirectX support for Variable Rate Shading. Until then, it'll expose Adaptive Shading functionality through the NVAPI software development kit, which allows direct access to GPU features beyond the scope of DirectX and OpenGL.

  5. #5 Rhialto (Super Moderator)

    RTX-OPS: Trying to Make Sense of Performance

    As modern graphics processors become more complex, integrating resources that perform dissimilar functions but still affect the overall performance picture, it becomes increasingly difficult to summarize their capabilities. We already use terms like fillrate to compare how many billions of pixels or texture elements a GPU can theoretically render to screen in a second. Memory bandwidth, processing power, primitive rates—the graphics world is full of peaks that become the basis for back-of-the-envelope calculations.

    Well, with the addition of Tensor and RT cores to its Turing Streaming Multiprocessors, Nvidia found it necessary to devise a new metric that’d suitably encompass the capabilities of its INT32 and FP32 math pipelines, its RT cores, and the Tensor cores. Tom’s Hardware doesn’t plan to use the resulting “RTX-OPS” specification for any of its comparisons, but since Nvidia is citing it, we want to at least describe the equation’s composition.

    The RTX-OPS model assumes utilization of all resources, which is a bold and very future-looking assumption. After all, until games broadly adopt Turing’s ray tracing and deep learning capabilities, RT and Tensor cores sit idle. Anticipating that they will come online, though, Nvidia developed its own approximation of the processing involved in one frame rendered by a Turing-based GPU.


    In the diagram above, Nvidia shows roughly 80% of the frame consumed by rendering and 20% going into AI. In the slice dedicated to shading, there’s a roughly 50/50 split between ray tracing and FP32 work. Drilling down even deeper into the CUDA cores, we already mentioned that Nvidia observed roughly 36 INT32 operations for every 100 FP32 instructions across a swathe of shader traces, yielding a reasonable idea of what happens in an “ideal” scene leveraging every functional unit.

    So, given that…

    FP32 compute = 4352 FP32 cores * 1635 MHz clock rate (GPU Boost rating) * 2 = 14.2 TFLOPS

    RT core compute = 10 TFLOPS per gigaray, assuming GeForce GTX 1080 Ti (11.3 TFLOPS FP32 at 1582 MHz) can cast 1.1 billion rays using software emulation = ~100 TFLOPS on a GeForce RTX 2080 Ti capable of casting ~10 billion rays

    INT32 instructions per second = 4352 INT32 cores * 1635 MHz clock rate (GPU Boost rating) * 2 = 14.2 TIPS

    Tensor core compute = 544 Tensor cores * 1635 MHz clock rate (GPU Boost rating) * 64 floating-point FMA operations per clock * 2 = 113.8 FP16 Tensor TFLOPS

    …we can walk Nvidia’s math backwards to see how it reached a 78 RTX-OPS specification for its GeForce RTX 2080 Ti Founders Edition card:

    (14 TFLOPS [FP32] * 80%) + (14 TIPS [INT32] * 28% [~35 INT32 ops for every 100 FP32 ops, which take up 80% of the workload]) + (100 TFLOPS [ray tracing] * 40% [half of 80%]) + (114 TFLOPS [FP16 Tensor] * 20%) = 77.9
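
    That weighted sum is easy to reproduce; here it is as a short sketch, using the rounded peaks Nvidia plugs in:

        # Reproducing Nvidia's RTX-OPS figure for GeForce RTX 2080 Ti Founders Edition.
        fp32_tflops   = 14     # peak FP32
        int32_tips    = 14     # peak INT32
        rt_tflops_eq  = 100    # ray tracing expressed as "equivalent" FP32 TFLOPS
        tensor_tflops = 114    # peak FP16 Tensor throughput

        rtx_ops = (fp32_tflops   * 0.80 +   # shading occupies ~80% of the frame
                   int32_tips    * 0.28 +   # ~35 INT ops per 100 FP ops within that 80%
                   rt_tflops_eq  * 0.40 +   # ray tracing: half of the 80% shading slice
                   tensor_tflops * 0.20)    # Tensor/AI work: the remaining ~20% of the frame

        print(rtx_ops)                      # 77.92, which Nvidia rounds to 78 RTX-OPS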

    Again, there are a lot of assumptions made in this model; we see no way to use it for generational or competitive comparisons, and we don’t want to get in the habit of generalizing ratings across many different resources. At the same time, it’s clear that Nvidia wanted a way to represent performance holistically, and we cannot fault the company for trying, particularly since it didn’t just add up the capabilities of each subsystem, but rather isolated their individual contributions to a frame.

    Display Outputs and the Video Controller

    Display Outputs: A Controller For The Future

    The GeForce RTX Founders Edition cards sport three DisplayPort 1.4a connectors, one HDMI 2.0b output, and one VirtualLink interface. They support up to four monitors simultaneously, and naturally are HDCP 2.2-compatible. Why no HDMI 2.1? That standard was released in November of 2017, long after Turing was finalized.

    Turing also enables Display Stream Compression (DSC) over DisplayPort, making it possible to drive an 8K (7680x4320) display over a single stream at 60 Hz. GP102 lacked this functionality. DSC is also the key to running at 4K (3840x2160) with a 120 Hz refresh and HDR.

    Speaking of HDR, Turing now natively reads in HDR content through dedicated tone mapping hardware. Pascal, on the other hand, needed to apply processing that added latency.

    Finally, all three Founders Edition cards include VirtualLink connectors for next-generation VR HMDs. The VirtualLink interface utilizes a USB Type-C connector but is based on an Alternate Mode with reconfigured pins to deliver four DisplayPort lanes, a bi-directional USB 3.1 Gen2 data channel for high-res sensors, and up to 27W of power. According to the VirtualLink Consortium, existing headsets typically operate within a 15W envelope, including displays, controllers, audio, power loss over a 5m cable, cable electronics, and connector losses. But this new interface is designed to support higher-power devices with improved display capabilities, better audio, higher-end cameras, and accessory ports as well. Just be aware that VirtualLink’s power delivery is not reflected in the TDP specification of GeForce RTX cards; Nvidia says using the interface requires up to an additional 35W.

    Video Acceleration: Encode And Decode Improvements

    Hardware-accelerated video features don’t get as much attention as gaming, but new graphics architectures typically do add improvements that support the latest compression standards, incorporate more advanced coding tools/profiles, and offload work from the CPU.

    Encode performance is more important than ever for gamers streaming content to social platforms. A GPU able to handle the workload in hardware takes the load off the rest of the platform, so the encode has a smaller impact on game frame rates. Historically, GPU-accelerated encoding didn’t quite match the quality of a software encode at a given bit rate, though. Nvidia claims Turing changes this by absorbing the workload and exceeding the quality of a software-based x264 Fast encode (according to its own peak signal-to-noise ratio benchmark). Beyond a certain point, quality improvements do offer diminishing returns. It’s notable, then, that Nvidia claims to at least match the Fast profile. But streamers are most interested in the GPU’s ability to minimize CPU utilization and bit rate.


    This generation, the NVEnc encoder is fast enough for real-time 8K HDR at 30 FPS. Optimizations to the encoder facilitate bit rate savings of up to 25% in HEVC (or a corresponding quality increase at the same bit rate) and up to 15% in H.264. Hardware acceleration makes real-time encoding at 4K a viable option as well, although Nvidia doesn’t specify what CPU it tested against to generate a 73% utilization figure in its software-only comparison.

    Additionally, the decode block supports VP9 10/12-bit HDR. It’s not clear how that differs from GP102-based cards though, since GeForce GTX 1080 Ti is clearly listed in Nvidia’s NVDec support matrix as having VP9 10/12-bit support as well. Similarly, HEVC 4:4:4 10/12-bit HDR is listed as a new feature for Turing.

    Nvidia’s Founders Edition: Farewell, Beautiful Blower

    The GeForce RTX 2080 Ti, 2080, and 2070 Founders Edition cards represent a departure from the “reference” design we were first introduced to in 2012 with GeForce GTX 690. After the 690, we grew to appreciate cards like the original GeForce GTX Titan with its centrifugal fan that blew heated air out the dual-slot bracket and away from your PC’s guts. Many enthusiasts felt differently, though. Because centrifugal fans move air through heat sinks quickly, they tend to be noisier under load than more free-flowing thermal solutions with multiple axial fans. And because Nvidia tried to keep its reference boards running quietly, it was often accused of limiting peak GPU Boost clock rates or even outright throttling performance.


    Sadly for us, those days are gone. The new Founders Edition design eschews a centrifugal fan for two axial fans. They sit atop a dense fin stack that surrounds what Nvidia calls the largest vapor chamber ever used on a graphics card. A forged aluminum cover encircles the cooler’s length but leaves the top and bottom open. Heated air is consequently directed up out the top and down towards your motherboard, possibly in the direction of an M.2-based SSD installed underneath. The aesthetic just isn’t as distinct, sacrificing the prestige of Nvidia’s previous-gen reference boards.


    Gamers who put a lot of care into building PCs with plenty of airflow should enjoy a better experience overall though; the axial fans can dissipate more power or drive lower temperatures at a given noise level. Alternatively, the axial fans offer improved acoustics compared to blowers in a power-limited condition. Either way, there’s a silver lining for anyone who preferred the elegance of a centrifugal fan exhausting heated air but is willing to entertain the merits of Nvidia’s latest creation.

    Another potential selling point is improved overclocking headroom. Some of this comes from the cooler’s increased capacity. But Nvidia also says its power supply facilitates close to 60W of additional capacity beyond the stock 260W. At the same time, efficiency is optimized using an eight-phase power supply able to dynamically turn phases on and off based on load.

  6. #6 Rhialto (Super Moderator)

    Overclocking: Making The Most Of Headroom With Nvidia Scanner

    The Founders Edition cards are already overclocked beyond Nvidia’s base specification, but they aren’t tuned right up to their breaking points. In order for the company to define a frequency floor and typical GPU Boost rate, all samples must be capable of the same clocks at minimum. From there, though, every GeForce RTX hits a different limit before becoming unstable. That ceiling even changes depending on workload. Enthusiasts often make it a point of pride to push right up to this threshold by finessing whatever gears, levers, and dials are exposed through popular apps like Precision XOC and Afterburner.


    To the best of its ability, Nvidia is taking the trial and error out of overclocking with an API/DLL package that partners like EVGA and MSI can build into their utilities. Instead of an enthusiast going back and forth, testing one part of the frequency/voltage curve at a time and adjusting based on the stability of various workloads, Nvidia’s Scanner runs an arithmetic-based routine in its own process, evaluating stability without user input. Although Nvidia says the test usually encounters math errors before crashing outright, the fact that it’s contained in its own process means the algorithm can recover gracefully if a crash does occur. This gives the tuning software a chance to increase voltage and try the same frequency again. Once the Scanner hits its maximum voltage setting and encounters one last failure, a new frequency/voltage curve is calculated based on the known-good results. From start to finish, the process purportedly takes less than 20 minutes.
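
    In outline, that search is a loop: run the stress test at a frequency, and on failure raise the voltage and retry until the cap is reached, recording every stable point along the way. The sketch below is a hypothetical illustration of that kind of automated frequency/voltage search; run_stress_test, the step sizes, and the limits are all invented for the example and do not reflect Nvidia's actual API:

        # Hypothetical sketch of an automated frequency/voltage search in the spirit
        # of Nvidia Scanner. run_stress_test(freq_mhz, voltage_mv) -> bool stands in
        # for the contained arithmetic workload; step sizes and limits are invented.
        def build_vf_curve(run_stress_test, start_mhz=1635, step_mhz=15,
                           start_mv=900, step_mv=10, max_mv=1050):
            curve = []                              # known-good (MHz, mV) points
            freq, volt = start_mhz, start_mv
            while True:
                if run_stress_test(freq, volt):
                    curve.append((freq, volt))      # stable: record it and try a higher clock
                    freq += step_mhz
                elif volt + step_mv <= max_mv:
                    volt += step_mv                 # unstable: add voltage and retry this clock
                else:
                    break                           # voltage ceiling hit; stop searching
            return curve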


    Interestingly, Nvidia Scanner functionality won’t be limited to owners of GeForce RTX graphics cards. The company says it’ll characterize older boards as well (though it isn’t specific about how far back support will stretch).

    Ray Tracing And AI: Betting It All on Black

    Nvidia could have poured all of its efforts into features that make Turing faster than the Pascal architecture in today’s games, rather than splitting attention between immediate speed-ups and investments in the future. But the company saw an opportunity to drive innovation and took a calculated risk. And why not? Its performance and efficiency lead over AMD’s Vega RX series creates a comfortable margin to push next-gen technology more than we’ve seen in years.


    Dedicating gobs of transistors to Tensor cores and RT cores doesn’t pay off today. It doesn’t pay off tomorrow. Rather, it pays off months or even years down the road as developers become comfortable optimizing their games for a combination of real-time ray tracing and rasterization. Nvidia needs them to spend time experimenting with neural graphics and AI. Technologies like mesh shading and variable rate shading aren’t used anywhere currently; they require adoption and support. With all of that said, developer enthusiasm for real-time ray tracing appears unanimous. This truly seems like a vision everyone wants to see realized.


    In the meantime, gamers aren’t going to spend their hard-earned dollars on a future promise of better-looking visuals. They want instant gratification in the form of higher frame rates. We can talk architecture ad nauseam. However, the features found in today’s discussion only amount to half of the story (the academic half, no less). In a few days, we’ll have a battery of performance, power, and acoustic results to present. Only then can we tell you if GeForce RTX 2080 and 2080 Ti warrant the high prices that have enthusiasts gnashing their teeth.


