Memory moves onto the CPU package, Hyper-Threading is dropped, and performance improves significantly!
2024-03-18 / tech
Since the release of the 12th generation Core processors, Intel has mostly adjusted core specifications and frequencies, with only minor changes to the core microarchitecture in the 13th and 14th generation products. There has been no significant overhaul of the overall processor architecture. Intel's Core i-series processors still dominate the market thanks to their strong performance and very high frequencies, but as AI technology rises and the industry shifts, Intel needs to change more if it wants to keep its leading position. On June 5, 2024, Intel unveiled a new generation of processors codenamed Lunar Lake, with completely new P-core, E-core, GPU, NPU, and SoC designs. The aim is to better meet the computing needs of the AI era while maintaining very high traditional computing performance. Let's take a look at the main changes.
Continuing the modular strategy, memory is integrated into the processor package for the first time
Intel first adopted a chiplet design with Meteor Lake, allowing different modules, such as the compute, GPU, IO, and SoC dies, to be manufactured on different processes and then integrated through advanced packaging technology. Decoupling process from design in this way, so that each module uses the manufacturing process best suited to it, was a significant change in processor design. With Lunar Lake, Intel keeps this approach and adds on-package memory, producing a more integrated product with better performance, energy efficiency, and user experience.
From an overall architectural perspective, on-package memory brings a system-level efficiency improvement. Motherboard manufacturers no longer need to route separate memory power and data lines on the PCB, because these functions move to the processor's package substrate, and the signal routing and anti-interference design that high-frequency memory normally requires on the motherboard is no longer needed. For Intel, moving memory onto the processor substrate also yields more stable performance and higher energy efficiency and, most importantly for mobile devices, saves internal space.
According to Intel, moving memory onto the processor substrate reduces memory PHY (physical-layer interface) power by 40%, saves 250 square millimeters of area, and supports a transfer rate of 8.5 GT/s per chip with capacities of up to 32GB. This is sufficient for laptop products.
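To put the 8.5 GT/s figure in perspective, here is a minimal sketch of how a transfer rate converts to theoretical peak bandwidth. The article does not state the memory bus width, so the 128-bit value below is an assumption used purely for illustration, and the function name is made up for this example.

```python
def peak_bandwidth_gb_s(transfer_rate_gt_s: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s.

    Each transfer moves bus_width_bits / 8 bytes, so bandwidth is
    (transfers per second) x (bytes per transfer).
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_gt_s * bytes_per_transfer


# Assumption: a 128-bit total memory bus (not stated in the article).
ASSUMED_BUS_WIDTH_BITS = 128

print(peak_bandwidth_gb_s(8.5, ASSUMED_BUS_WIDTH_BITS))  # 136.0
```

This is an upper bound only; sustained bandwidth in real workloads will be lower.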
Performance and efficiency cores have evolved comprehensively, with a significant increase in IPC
Another major change from Meteor Lake is that the core microarchitectures, the most critical part of the entire processor, have been updated. Lunar Lake's performance core (P-core) has evolved to Lion Cove, and its efficiency core (E-core) has evolved to Skymont, bringing significant performance improvements over the previous generation.
At a high level, Lion Cove is a larger design: it improves internal execution capabilities, adds more execution ports, and substantially reworks the cache. Intel says Lion Cove improves both performance and area efficiency while being better suited to modern needs.
More specifically, the improvements to the performance core fall into several areas: branch prediction is 8 times wider than before, the out-of-order engine now schedules vector (VEC) and integer (INT) operations separately, the scheduling units are wider, and the memory subsystem has been completely overhauled, including the addition of an L0 cache level. For performance and power, it introduces AI-based power management along with optimizations for core area and performance.
If the improvements above seem subtle, the most notable change of this generation is not: Lion Cove removes Hyper-Threading and the transistor resources associated with it. Intel believes that the E-cores now largely fill the role Hyper-Threading used to play, and that Hyper-Threading itself requires a large amount of transistor resources. This generation therefore drops it entirely to obtain better performance per area, while also reducing core area, power consumption, and cost.
In terms of performance, the P-cores deliver an average IPC increase of 14% over the previous generation, with larger gains at lower power levels and gains still exceeding 10% at higher power levels. Once higher frequencies are taken into account, the performance gains are even larger.
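The way IPC and frequency gains combine can be sketched with a first-order model in which single-thread performance is proportional to IPC times clock frequency. The 14% IPC figure comes from the article; the 5% frequency uplift is a hypothetical value chosen only to show the arithmetic, since no frequency numbers are given.

```python
def relative_performance(ipc_gain: float, frequency_gain: float) -> float:
    """First-order model: performance scales with IPC x frequency.

    Gains are fractions, e.g. 0.14 for +14%.
    Returns performance relative to the previous generation (1.0 = equal).
    """
    return (1 + ipc_gain) * (1 + frequency_gain)


IPC_GAIN = 0.14                   # average IPC gain quoted in the article
HYPOTHETICAL_FREQ_GAIN = 0.05     # assumption: illustrative value only

print(f"{relative_performance(IPC_GAIN, 0.0):.3f}")                     # 1.140
print(f"{relative_performance(IPC_GAIN, HYPOTHETICAL_FREQ_GAIN):.3f}")  # 1.197
```

Because the two factors multiply, even a small frequency increase pushes the total gain above the IPC figure alone.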
If the P-core improvements are significant, those in the E-cores can be described as revolutionary. Skymont's main improvements are a higher overall IPC, the ability to deliver performance across a wider range of workloads, and enhanced vector and AI computation.
Overall, branch prediction in the E-cores has been significantly strengthened, the front-end instruction decoder has been widened to a 3x3 design (9-wide in total), and the architecture has grown substantially in scale, scheduling ports, cache, and queue depth. For vector computation, the SIMD units have been increased to 4x128 bits, doubling throughput compared to the previous generation, and VNNI instruction support has also been greatly improved. In short, the E-cores are no longer designed only for saving energy; with this expansion in scale they deliver correspondingly higher performance and can now serve as main cores.
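The doubling of vector throughput follows directly from the number of SIMD pipes. The sketch below counts how many elements can be processed per cycle, under the simplifying assumption that each pipe issues one operation per cycle. The previous-generation layout of 2x128 bits is inferred from the "doubled" claim rather than stated in the article.

```python
def simd_lanes_per_cycle(num_pipes: int, pipe_width_bits: int, element_bits: int) -> int:
    """Elements processed per cycle, assuming one operation per pipe per cycle."""
    return num_pipes * (pipe_width_bits // element_bits)


# Skymont: 4 x 128-bit SIMD (from the article).
skymont_fp32 = simd_lanes_per_cycle(4, 128, 32)
# Previous generation: 2 x 128-bit, inferred from the "doubled" claim.
previous_fp32 = simd_lanes_per_cycle(2, 128, 32)

print(skymont_fp32, previous_fp32, skymont_fp32 / previous_fp32)  # 16 8 2.0
```

The same ratio holds for narrower element types such as INT8, where each 128-bit pipe holds 16 elements.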
In terms of performance, Skymont increases single-thread floating-point performance by 1.68 times and multi-thread integer performance by up to 4 times (thanks to the expanded power range), or delivers the same performance at only one-third of the power of the previous generation. Since the previous-generation Crestmont already surpassed Skylake and its various "++" derivatives, which Intel used for years, Skymont after such significant changes may not be far behind Lion Cove in microarchitectural IPC while offering better energy efficiency. This could be one of the important directions in Intel's future development.
In terms of scheduling across core clusters, Lunar Lake is more mature. Thanks to the new process technology, the stronger P-cores and E-cores, and a design with better performance per watt, the crossover point between Skymont and Lion Cove on the performance-power curve has shifted significantly. More tasks can be handled by Skymont, with Lion Cove stepping in only when higher performance is required, which leads to better performance per watt.
Because the design mixes big and small cores, Intel continues to use a hardware thread scheduler but has improved it further, with better OS partition settings, better integrated power management, overall algorithm optimization, the addition of AI-based decisions, and finer-grained control, all of which increase overall thread scheduling efficiency.
Scheduling on Lunar Lake is now more dynamic and autonomous. The priorities for P-cores and E-cores lean toward improving energy efficiency, but they are also well optimized for performance demands. Because the E-cores are now more powerful and cover a broader part of the optimal performance-per-watt range, tasks are less likely to move to the P-cores, which go full throttle only under sudden heavy loads.
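The E-core-first policy described here can be illustrated with a toy model. This is not Intel's scheduler: the task names, demand values, and capacity thresholds are all hypothetical, and the real hardware and OS logic weighs far more signals. The sketch only shows why a stronger E-core means fewer tasks get escalated to a P-core.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    demand: float  # hypothetical performance demand, where 1.0 = P-core peak


def place(task: Task, e_core_capacity: float) -> str:
    """Prefer an E-core; escalate to a P-core only when the task
    needs more performance than an E-core can deliver."""
    return "E-core" if task.demand <= e_core_capacity else "P-core"


def placements(tasks, e_core_capacity):
    return {t.name: place(t, e_core_capacity) for t in tasks}


tasks = [
    Task("video call", 0.3),
    Task("web browsing", 0.6),
    Task("code compile", 0.95),
]

# Hypothetical capacities: a weaker E-core (0.5) versus a stronger one (0.8).
print(placements(tasks, e_core_capacity=0.5))
print(placements(tasks, e_core_capacity=0.8))
```

With the lower capacity, only the lightest task stays on an E-core; with the higher one, the mid-weight task stays there too and only the heaviest task reaches a P-core.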
GPU and NPU enhancements, significant AI computation upgrades
Intel has made significant progress with its in-house GPUs, and its products have won over many consumers with their strong price-performance ratio. In Lunar Lake, Intel introduces the second-generation Xe GPU architecture, with new vector engines and significantly improved overall performance and efficiency.
The main improvements in the second-generation Xe GPU are its larger scale and stronger ray tracing and AI performance. It includes 8 Xe cores, 8 stronger ray tracing units, and enhanced XeSS cores. A larger scale means stronger performance, a tried-and-true approach for GPUs. Lunar Lake's GPU performance is 1.5 times that of the previous generation, better meeting users' graphics needs.
For AI computing, the new Xe GPU integrates a new vector engine that is likewise larger in scale, with native SIMD16 support, and supports a wider range of precisions, including INT2, INT4, INT8, INT16, BF16, and FP16, which improves the efficiency and functionality of AI model computation.
Intel has also introduced a completely redesigned media engine for this generation, with support for the AV1 and VVC codecs. Key features include the power-saving capabilities of eDP 1.5: adaptive synchronization of the display refresh rate with the media frame rate to reduce screen flicker, content queuing to save CPU power, and selective updating of display content (Early Transport) to reduce overall display power consumption. In terms of specifications, the main addition is decoding support for H.266, also known as VVC, which reduces file sizes by about 10% compared to AV1, along with features such as adaptive encoding and screen content coding (SCC). On the display side, it supports three display channels, DP 2.1, HDMI 2.1, and more.
Overall, Lunar Lake's graphics performance has been significantly enhanced, with Intel's data showing an improvement of roughly 50%, AI performance of up to 67 TOPS, and support for more new features. Thanks to the GPU upgrade, more users can choose models with integrated graphics alone and still get a satisfactory experience in graphics workloads.
The NPU in Lunar Lake has also been substantially strengthened in response to the growth of AI applications. Its overall computing power reaches up to 48 TOPS. Although this is lower than the GPU's figure, the NPU is more power-efficient, allowing more AI tasks to run directly on the NPU without engaging the CPU or GPU. The NPU changes mainly bring new functionality, such as native support for activation functions and data conversion, and support for embedding tokenization in large language models. Architecturally, this 4th-generation NPU is larger in scale, with 12 enhanced SHAVE DSPs and 6 neural network engines, doubled bandwidth, and an optimized MAC architecture, leading to a significant overall performance improvement.
Intel sums up that Lunar Lake provides up to 120 TOPS of computing power, enough to handle a large amount of AI computation, including text-to-image generation and running large models locally. As more and more software ships with built-in AI capabilities, local AI computing remains very important, and Intel is adapting to the times.
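As a quick sanity check on the headline number, the per-engine figures quoted above can be added up. The GPU and NPU values come from the article; the remainder is simply what is left over to reach 120 TOPS, and attributing it to the CPU cores is an assumption, since the article gives no breakdown.

```python
PLATFORM_TOPS = 120  # total quoted in the article
GPU_TOPS = 67        # GPU figure quoted in the article
NPU_TOPS = 48        # NPU figure quoted in the article

gpu_plus_npu = GPU_TOPS + NPU_TOPS
# Assumption: whatever remains is contributed by the CPU cores.
remainder_tops = PLATFORM_TOPS - gpu_plus_npu

print(gpu_plus_npu, remainder_tops)  # 115 5
```

The GPU and NPU together account for almost all of the total, which matches the article's emphasis on those two engines for AI work.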
A new generation of high-performance AI mobile processors with very high energy efficiency
Finally, let's summarize Lunar Lake and the technologies it introduces. Lunar Lake represents a comprehensive renewal by Intel after entering the chiplet era. The P-cores, E-cores, GPU, NPU, and interconnect have all been changed and enhanced, bringing many new technologies and support for newer specifications. It is no exaggeration to say that the number and complexity of the new technologies in Lunar Lake far exceed those of any previous product. Intel's technological evolution in recent years has been extremely aggressive: in both the previous-generation Meteor Lake and this generation's Lunar Lake, the architecture design, technologies, and overall specifications are undergoing a comprehensive shift. The real-world performance of the shipping Lunar Lake product, the Core Ultra 200 series, is something to look forward to after launch, and we will do our best to contact manufacturers, get hold of the product as soon as possible, and share the specific performance with everyone.