Focusing the microscope deeper into the execution unit (EU), we can see a pair of SIMD floating-point units (ALUs), though these units actually support both floating-point and integer operations. Intel lists each ALU as capable of executing up to four 32-bit FP or integer operations, or up to eight 16-bit FP operations. That equates to 16 FP32 operations per clock, or 32 FP16 operations per clock, per EU. The EUs are multithreaded.
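The per-clock totals follow from simple multiplication, provided each SIMD lane is credited with two operations per clock, as it would be for a multiply-add. That factor of two is our reading of how the totals are reached, not something spelled out above. A minimal sketch of the arithmetic, with a hypothetical helper name and parameters chosen purely for illustration:

```python
def ops_per_clock(alus: int, lanes: int, ops_per_lane: int = 2) -> int:
    """Peak operations per clock for one EU.

    ops_per_lane=2 is an assumption: it counts a multiply-add
    as two operations (one multiply, one add) per lane.
    """
    return alus * lanes * ops_per_lane


# Two ALUs per EU, four 32-bit lanes or eight 16-bit lanes per ALU.
fp32_ops = ops_per_clock(alus=2, lanes=4)
fp16_ops = ops_per_clock(alus=2, lanes=8)

print(f"FP32 ops per clock per EU: {fp32_ops}")
print(f"FP16 ops per clock per EU: {fp16_ops}")
```

Multiplying either result by the EU count and the clock speed would give a theoretical peak throughput for the whole GPU.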
A closer look at the shared local memory (SLM) design, which feeds the eight EUs in each subslice, reveals that Intel brought the SLM into the subslice to reduce contention through the dataport when the L3 cache is being accessed at the same time. The SLM's closer proximity to the EUs also helps reduce latency and boost efficiency.
Here we see a bird's-eye view of the memory hierarchy and the associated theoretical peak bandwidths between the components. Intel's move to support LPDDR4 represents a significant step forward in bandwidth on the low-power front, but the true innovation lies in the shared memory design that reduces the need for copying data through buffers.
We're diving deeper into the details in the document and will update this post as necessary.