SPECviewperf 13
The most recent version of SPECviewperf employs traces from Autodesk 3ds Max, Dassault Systemes Catia, PTC Creo, Autodesk Maya, Autodesk Showcase, Siemens NX, and Dassault Systemes SolidWorks. Two additional tests, Energy and Medical, aren’t based on a specific application, but rather on datasets typical of those industries.
[Chart: SPECviewperf 13 results]
In some workloads, Nvidia’s DirectX driver allows GeForce RTX 2080 Ti to match or even exceed the performance of Titan V. But Catia and NX, specifically, respond well to the professional driver optimizations that benefit Titan cards. The GeForce even loses to the older Titan Xp in those workloads.
Across the board, Titan RTX beats the still-formidable Titan V.
[Chart: SPECviewperf 13 results, 3800x2120]
Titan V scores a slight win in the Energy tests, but again succumbs to Titan RTX everywhere else once we step the resolution up to 3800x2120.
Performance Results: Deep Learning
The introduction of Turing saw Nvidia’s Tensor cores make their way from the data center-focused Volta architecture to a more general-purpose design that debuted in gaming PCs. However, the company’s quick follow-up with Turing-based Quadro cards and the inferencing-oriented Tesla T4 GPU made it pretty clear that DLSS wasn’t the only purpose of those cores.
Now that we have Titan RTX—a card with lots of memory—it’s possible to leverage TU102’s training and inferencing capabilities in a non-gaming context.
Before we get to the benchmarks, it’s important to understand how Turing, Volta, and Pascal stack up in theory.
Titan Xp’s GP102 processor is more like GP104 than the larger GP100 in that it executes general-purpose IEEE 754 FP16 arithmetic at a small fraction of its FP32 rate (1/64), rather than at double rate. GP102 also lacks support for mixed precision (FP16 inputs with FP32 accumulates) in training. That distinction becomes important in our Titan Xp benchmarks.
Volta greatly improved on Pascal’s ability to accelerate deep learning workloads with Nvidia’s first-generation Tensor cores. These specialized cores performed fused multiply-adds exclusively, multiplying a pair of FP16 matrices and adding the result to an FP16 or FP32 matrix. Titan V’s Tensor performance can be as high as 119.2 TFLOPS for FP16 inputs with FP32 accumulates, making it an adept option for training neural networks.
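A quick way to see why the accumulator’s precision matters: the sketch below (plain Python, using the struct module’s half-precision format to emulate FP16 rounding; the increment value is arbitrary, chosen for illustration) adds a small FP16 value ten thousand times. The FP16 accumulator stalls once its rounding step grows larger than the increment, while the higher-precision accumulator lands near the true sum.

```python
import struct

def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

# FP16 input in both cases (as a Tensor core would see it)...
increment = fp16(1e-4)

# ...but one accumulator stays in FP16, the other keeps higher precision.
acc_fp16 = 0.0
acc_fp32 = 0.0
for _ in range(10_000):
    acc_fp16 = fp16(acc_fp16 + increment)  # result rounded back to FP16 each step
    acc_fp32 = acc_fp32 + increment        # accumulated at higher precision

print(f"FP16 accumulator: {acc_fp16:.4f}")  # stalls at 0.25
print(f"FP32 accumulator: {acc_fp32:.4f}")  # lands near the true sum of ~1.0
```

Once the FP16 running total reaches 0.25, the gap between adjacent representable values (about 0.000244) exceeds the increment, so every further addition rounds back down and the sum stops growing. Training with FP32 accumulates avoids exactly this kind of silent error.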
Although TU102 hosts fewer Tensor cores than Titan V’s GV100, its higher GPU Boost clock rate gives Titan RTX a theoretically higher 130.5 TFLOPS of Tensor performance. GeForce RTX 2080 Ti Founders Edition should be able to muster 113.8 TFLOPS. However, Nvidia artificially limits the desktop card’s FP16-with-FP32-accumulates rate to half of that.
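These peak figures fall straight out of the hardware specs: each Tensor core executes 64 FP16 fused multiply-adds per clock, and each FMA counts as two floating-point operations. The sketch below reproduces the numbers above from Nvidia’s published Tensor core counts and GPU Boost clocks.

```python
FMA_PER_CORE_PER_CLOCK = 64   # FP16 multiply-accumulates per Tensor core per clock
OPS_PER_FMA = 2               # each FMA counts as two floating-point operations

# (Tensor core count, GPU Boost clock in Hz) -- Nvidia's published specifications.
cards = {
    "Titan RTX": (576, 1.770e9),
    "Titan V": (640, 1.455e9),
    "GeForce RTX 2080 Ti FE": (544, 1.635e9),
}

peak_tflops = {
    name: cores * FMA_PER_CORE_PER_CLOCK * OPS_PER_FMA * clock_hz / 1e12
    for name, (cores, clock_hz) in cards.items()
}

for name, tflops in peak_tflops.items():
    print(f"{name}: {tflops:.1f} TFLOPS")  # 130.5, 119.2, 113.8
```

Note that the 2080 Ti figure is the card’s full FP16 rate; the half-rate limit on FP32 accumulates halves it for mixed-precision training.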
Training Performance
Our first set of benchmarks utilizes Nvidia’s TensorFlow Docker container to train a ResNet-50 convolutional neural network using ImageNet. We separately charted training performance using FP32 and FP16 (mixed) precision.
[Chart: ResNet-50 training throughput, FP16 (mixed) precision]
[Chart: ResNet-50 training throughput, FP32 precision]
The numbers in each chart’s legend represent batch sizes. In brief, batch size determines the number of input images fed to the network concurrently. The larger the batch, the faster you’re able to get through all of ImageNet’s 14+ million images, given ample GPU performance, GPU memory, and system memory.
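As a rough illustration of that trade-off, the snippet below shows how larger batches cut the number of optimizer steps needed per pass over the dataset (the 14.2-million-image count is an approximation of full ImageNet, assumed here for illustration):

```python
import math

# Approximate size of the full ImageNet dataset ("14+ million images");
# the exact count here is an assumption for illustration.
DATASET_SIZE = 14_200_000

# Larger batches mean fewer optimizer steps per epoch, but each step
# must hold the entire batch in GPU memory.
steps_per_epoch = {
    batch: math.ceil(DATASET_SIZE / batch) for batch in (64, 128, 256, 512)
}

for batch, steps in steps_per_epoch.items():
    print(f"batch {batch:>3}: {steps:,} steps per epoch")
```

Doubling the batch size halves the step count, which is why the extra memory headroom on Titan RTX translates directly into training throughput.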
In both charts, Titan RTX can handle larger batches than the other cards thanks to its 24GB of GDDR6. This confers a sizeable advantage over cards with less on-board memory. Of course, Titan V remains a formidable graphics card, and it’s able to trade blows with Titan RTX using FP16 and FP32.
GeForce RTX 2080 Ti’s half-rate mixed-precision mode causes it to shed quite a bit of performance compared to Titan RTX. Why isn’t the difference greater? The FP32 accumulate operation is only a small part of the training computation. Most of the matrix multiplication pipeline is the same on Titan RTX and GeForce RTX 2080 Ti, creating a closer contest than the theoretical specs would suggest. Switching to FP32 mode erases some of the discrepancy between Nvidia’s two TU102-based boards.
It’s also worth mentioning the mixed precision results for Titan Xp. Remember, GP102 doesn’t support FP16 inputs with FP32 accumulates, so it operates in native FP16 mode. The benchmarks do run successfully. But their accuracy is inferior to what Volta and Turing enable. To compare generational improvements at a like level of accuracy, you’d need to pit Titan RTX with mixed precision (644 images/sec) against Titan Xp in FP32 mode (233 images/sec).
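Plugging those two throughput figures into a quick back-of-the-envelope calculation shows what the generational comparison means in practice (the ~14.2-million-image dataset size is an assumed approximation):

```python
# Throughput figures from the ResNet-50 results above.
titan_rtx_mixed = 644   # images/sec, Titan RTX, mixed precision
titan_xp_fp32 = 233     # images/sec, Titan Xp, FP32

speedup = titan_rtx_mixed / titan_xp_fp32
print(f"Generational speedup at comparable accuracy: {speedup:.2f}x")

# Rough time for one epoch over a ~14.2M-image dataset (assumed count).
DATASET_SIZE = 14_200_000
epoch_hours = {
    "Titan RTX (mixed)": DATASET_SIZE / titan_rtx_mixed / 3600,
    "Titan Xp (FP32)": DATASET_SIZE / titan_xp_fp32 / 3600,
}
for name, hours in epoch_hours.items():
    print(f"{name}: ~{hours:.1f} hours per epoch")
```

At these rates, a single epoch drops from roughly 17 hours on Titan Xp to about 6 on Titan RTX, and full training runs span many epochs.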
Next, we train the Google Neural Machine Translation recurrent neural network using the OpenSeq2Seq toolkit, again in TensorFlow.