Understanding Qualcomm's Snapdragon 810: Performance Preview
by Joshua Ho & Andrei Frumusanu on February 12, 2015 9:00 AM EST - Posted in
- SoCs
- Qualcomm
- Mobile
- Gobi
- Snapdragon 810
Energy Aware Scheduler
Although big.LITTLE has been around for a few years at this point, it's still worth going over the basics of how it works in mobile SoCs. Fundamentally, the smartphone SoC space has benefited greatly from playing catch-up to the PC space. At first, SoCs were on lagging process nodes and their CPUs were simple and almost entirely in-order in nature. For the first few years, doubling CPU performance every year was possible by adding cores, increasing clock speeds, widening the pipeline, and jumping down a process node.
Once we reached the limit for optimizing in-order architectures, the only way to improve performance in a meaningful way was to focus on avoiding stalls in the CPU pipeline. In an in-order CPU architecture, if the data an instruction needs hasn't arrived yet, the CPU must wait for it to come in from DRAM or some other level of the memory hierarchy. Even a CPU that otherwise executes incredibly quickly ends up stuck waiting on these dependencies, which can significantly degrade performance.
The solution is to execute operations out-of-order. After all, if you have to build a PC, you don’t sit around waiting a few weeks for a graphics card to arrive before building the rest of the PC. Similarly, modern CPUs execute instructions out-of-order to improve performance and avoid stalls. However, implementing this logic in a CPU is far from a trivial matter, as a CPU has to be designed to know which instructions can be executed out-of-order and which must be executed in order. Even instructions with dependencies that have yet to be resolved can be executed speculatively, which can save a great deal of time if the results of this speculation are used. As a result of this speculative execution and the logic needed to implement out-of-order execution (OoOE), the number of transistors and interconnects grows dramatically, which means power consumption grows dramatically as well.
It is in the context of this fundamental problem that big.LITTLE came to be. While there are multiple approaches to the power problem that comes with OoOE, ARM currently sees big.LITTLE as the best solution. Fundamentally, big.LITTLE seeks to use small, in-order, low-power processors for the vast majority of mobile computing, and switches tasks to big, out-of-order processors only when a task is too much for the little cores to handle. In theory, this is the ideal solution, as it retains the power-saving advantages of in-order cores and the performance advantages of big OoOE cores.
Meanwhile it should be noted that there are other ways of using two heterogeneous CPU clusters, such as cluster migration, which was employed in the first Samsung Exynos bL SoCs and is still employed in NVIDIA's Tegra X1. But for now we will only focus on big.LITTLE HMP operation, which allows all cores to be active and exposed to the operating system. Translating this simple idea into reality is a difficult task. Currently, the de facto solution in the mobile space for big.LITTLE is the ARM- and Linaro-developed Global Task Scheduling (GTS), which relies on a per-entity load tracking (PELT) mechanism with two load thresholds that decide whether a process should be migrated to the corresponding cluster, as the sketch below illustrates.
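To make the two-threshold idea concrete, here is a minimal sketch in C of how an up/down migration decision could look. This is not Linaro's actual GTS code; the threshold values, names, and the 0-100 load scale are assumptions made purely for illustration.

```c
#include <stdbool.h>

/* Invented threshold values purely for illustration (0-100 load scale). */
#define UP_THRESHOLD   80
#define DOWN_THRESHOLD 30

enum cluster { LITTLE_CLUSTER, BIG_CLUSTER };

/* Decide which cluster a task should run on given its tracked load and the
 * cluster it currently occupies. The gap between the two thresholds acts as
 * hysteresis, so a task doesn't bounce between clusters on every update. */
static enum cluster pick_cluster(int tracked_load, enum cluster current)
{
    if (current == LITTLE_CLUSTER && tracked_load > UP_THRESHOLD)
        return BIG_CLUSTER;
    if (current == BIG_CLUSTER && tracked_load < DOWN_THRESHOLD)
        return LITTLE_CLUSTER;
    return current; /* stay put between the thresholds */
}
```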
There's a significant amount of terminology flying around regarding how this works, and we've covered the mechanism in our Huawei Honor 6 review and more in depth in our recent ARM A57/A53 investigation of the Galaxy Note 4 with the Exynos 5433. To recap, per-entity load tracking is the core mechanism that makes thread placement work in GTS.
This system is designed to track load per task by weighting recent load the most and slowly reducing the impact of older load through a decay factor, which follows a geometric series by default. Unfortunately, this load metric does have some disadvantages. Primarily, if a task idles for a long period of time and then suddenly demands significant work, as in a race-to-sleep scenario, per-entity load tracking can take a significant amount of time to reach the up-migration threshold needed to move the thread from the little cores to the big cores. Similarly, once a demanding task goes idle, it can take a significant amount of time for the tracked load to decay enough for the task to be migrated back down to the little cores. This system is also completely unaware of the real-world energy characteristics of the CPU cores, as load is the only input the scheduler considers.
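As a rough illustration of why this ramp-up lag happens, the sketch below models PELT-style tracking as a simple exponentially weighted average in which older windows contribute geometrically less. The decay factor and load scale are placeholders, not the kernel's actual PELT constants; the point is only how many windows a long-idle task needs before its tracked load climbs past a plausible up-migration threshold.

```c
#include <stdio.h>

/* Placeholder decay factor: each older window contributes geometrically
 * less than the one after it. Not the kernel's actual PELT constant. */
#define DECAY 0.9

/* Fold one more window into the running load average as an exponentially
 * weighted mean: new = (1 - DECAY) * sample + DECAY * old. */
static double update_load(double old_load, double window_sample)
{
    return (1.0 - DECAY) * window_sample + DECAY * old_load;
}

int main(void)
{
    double load = 0.0;

    /* A task that was idle for a long time and suddenly runs flat out
     * (a race-to-sleep burst): the tracked load needs many windows to
     * approach 100, so it may sit on a little core well past the point
     * where an up-migration threshold of, say, 80 would have moved it. */
    for (int window = 1; window <= 20; window++) {
        load = update_load(load, 100.0);
        printf("window %2d: tracked load %.1f\n", window, load);
    }
    return 0;
}
```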
For the Snapdragon 810, Qualcomm has fundamentally done away with per-entity load tracking and uses a window-based system instead. While we weren't told the size of each window or how many non-idle windows are retained, the load tracking system uses the average load across all of the recent windows while also looking at the recent maximum value to determine whether a task suddenly requires a significant amount of CPU power. This means there's a much shorter waiting period before rescheduling a thread that goes from mostly or completely idle to a high load, or vice versa.
This scenario is common throughout smartphone use and can be as simple as reading a web page and then opening a new link. The average across all of the windows is used to determine whether a thread needs to continue running on a big core, or whether it can safely be moved down to a little core. In addition, this window-based system accounts for cases where cores are throttled below their maximum frequencies, which means a task may stay on a little core even when its load is high for that core, if it would perform worse on a thermally throttled big core. A rough sketch of this kind of decision follows below.
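The sketch below is our best guess at the shape of such a window-based decision, not Qualcomm's actual implementation: the number of retained windows, the threshold, and the capacity comparison are all assumptions made for illustration.

```c
#include <stdbool.h>

/* Invented parameters: Qualcomm hasn't disclosed the window size, the
 * number of retained windows, or the migration thresholds. */
#define NUM_WINDOWS  5
#define UP_THRESHOLD 80   /* percent of a little core's capacity */

struct task_history {
    int window_load[NUM_WINDOWS]; /* demand seen in each retained window, 0-100 */
};

static int average_load(const struct task_history *h)
{
    int sum = 0;
    for (int i = 0; i < NUM_WINDOWS; i++)
        sum += h->window_load[i];
    return sum / NUM_WINDOWS;
}

static int max_recent_load(const struct task_history *h)
{
    int max = 0;
    for (int i = 0; i < NUM_WINDOWS; i++)
        if (h->window_load[i] > max)
            max = h->window_load[i];
    return max;
}

/* Up-migrate if either the average across the retained windows or the recent
 * maximum crosses the threshold, so a burst after a long idle period is seen
 * within roughly one window. Even then, only move the task if the (possibly
 * thermally throttled) big core actually offers more capacity right now. */
static bool should_run_big(const struct task_history *h,
                           int big_capacity_now, int little_capacity_now)
{
    bool demanding = average_load(h) > UP_THRESHOLD ||
                     max_recent_load(h) > UP_THRESHOLD;
    return demanding && big_capacity_now > little_capacity_now;
}
```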
While there are some areas where we can compare and contrast current GTS solutions and Qualcomm’s solution for the Snapdragon 810, there are areas where no comparisons can be made at all. Although ARM is working on an energy model and an energy aware scheduler, we haven’t seen this working in any shipping SoC.
For the Snapdragon 810, there is an energy model for all CPUs that accounts for how power consumption changes with temperature and can provide a performance-per-watt metric at every frequency state. However, unlike ARM's energy cost model, there's no tracking of the power cost incurred when a task raises the frequency of its whole cluster (synchronous architectures require that all CPUs in a cluster run at the same frequency), nor are wake-ups tracked and accounted for in the energy modeling.
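To give a sense of what such an energy model might look like in practice, here is a minimal sketch of a per-frequency-state table with a temperature adjustment. The operating points, power figures, and leakage scaling factor are all invented for illustration; they are not Qualcomm's characterization data.

```c
/* Illustrative per-frequency-state energy model for one CPU type. */
struct freq_state {
    int freq_mhz;   /* operating point             */
    int power_mw;   /* modeled core power at 25 C  */
};

static const struct freq_state big_core_states[] = {
    {  384,  150 },
    {  960,  450 },
    { 1555,  900 },
    { 1958, 1600 },
};

/* Leakage rises with temperature, so scale the modeled power up as the
 * silicon heats; the 0.5% per degree factor is a placeholder. */
static double power_at_temp(const struct freq_state *s, int temp_c)
{
    return s->power_mw * (1.0 + 0.005 * (temp_c - 25));
}

/* Performance per watt for one frequency state, using frequency as a crude
 * stand-in for delivered performance. */
static double perf_per_watt(const struct freq_state *s, int temp_c)
{
    return s->freq_mhz / power_at_temp(s, temp_c);
}
```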
To be fair, there are a lot of aspects that are shared with the latest GTS mechanism, such as packing small tasks onto already-awake CPUs in order to avoid the cost of waking a CPU from a power collapse state. However, on the Snapdragon 810 there are evaluations throughout the execution of a task, so that it can be moved to a big core if its load increases from when the task first started, or moved from one big core to another depending on the perf/W of each big core. In addition, if a single core is running only a small load or task, the scheduler can move that thread to another core and allow the now-empty core to go to sleep and save power. The scheduler is also said to be aware of the power cost of migration itself.
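A hedged sketch of those placement trade-offs follows: pack a small task onto an already-awake CPU, otherwise prefer the candidate with the best modeled perf/W. The structure fields, wake-up cost accounting, and scoring are assumptions for illustration, not the actual Qualcomm scheduler.

```c
#include <stdbool.h>

/* All fields and costs are invented for illustration. */
struct cpu_candidate {
    bool   awake;           /* already out of power collapse?         */
    double perf_per_watt;   /* from the energy model at current temp  */
    double wakeup_cost;     /* modeled cost of leaving power collapse */
};

/* Pick a destination CPU for a task: prefer the best modeled perf/W, but
 * for a small task penalize candidates that would have to be woken from
 * power collapse, which tends to pack small tasks onto already-awake CPUs. */
static int pick_cpu(const struct cpu_candidate *cpus, int ncpus, bool small_task)
{
    int best = 0;
    double best_score = -1.0e9;

    for (int i = 0; i < ncpus; i++) {
        double score = cpus[i].perf_per_watt;
        if (small_task && !cpus[i].awake)
            score -= cpus[i].wakeup_cost;
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```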
Finally, the scheduler in the Snapdragon 810 is used to help guide CPU frequency governor policy by notifying the frequency governor at the appropriate moments, avoiding cases where migrating a task to another CPU causes inefficient behavior. For example, if a task at 100% load is migrated to another core in the middle of a sampling window, the original core isn't kept at an unnecessarily high frequency, and the core that the task was migrated to ramps to the right frequency for the task's aggregate load. This appears to be somewhat of a mitigation for the window-based system, as ARM's scheduler uses events to handle these issues without having to resort to such patchwork fixes.
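In rough C, that hand-off could be sketched as below. The function names and the aggregate-load bookkeeping are hypothetical, not Qualcomm's actual interface; the point is only that both the source and destination CPUs get re-evaluated immediately rather than at the next sampling period.

```c
#include <stdio.h>

struct cpu_state {
    int cpu_id;
    int aggregate_load;   /* sum of tracked demand of the tasks on this CPU */
};

/* Stand-in for the cpufreq governor hook: re-pick a frequency for this CPU
 * now instead of waiting for the next sampling period. */
static void governor_reevaluate(struct cpu_state *cpu)
{
    printf("cpu %d: re-evaluating frequency for load %d\n",
           cpu->cpu_id, cpu->aggregate_load);
}

static void notify_migration(struct cpu_state *src, struct cpu_state *dst,
                             int task_load)
{
    /* Take the task's demand out of the source CPU so it can clock down
     * right away rather than idling at an unnecessarily high frequency... */
    src->aggregate_load -= task_load;
    governor_reevaluate(src);

    /* ...and add it to the destination so it ramps to the right frequency
     * for its new aggregate load without waiting out the sampling window. */
    dst->aggregate_load += task_load;
    governor_reevaluate(dst);
}
```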
In terms of how the power arbitration is actually implemented compared to traditional power management mechanisms in existing SoCs, Qualcomm replaces the old Intel-developed CPUIdle "Menu" and "Ladder" governors. These worked based on the achieved and target residency time of the individual idle states of a CPU core. Qualcomm's solution is a completely new approach (called the Low-Power-Mode CPUIdle driver, or LPM) as it ignores the time characteristic in its entirety and looks only at energy modeling. For this, the SoC's drivers need to have precise arbitration data to be able to properly model the SoC's real power consumption without actually measuring it. Thankfully Qualcomm does this, and it's the most complete model of a commercially available SoC's power characteristics to date.
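To contrast the two approaches, the sketch below places a residency-driven choice (roughly how menu/ladder-style governors behave) next to an energy-driven choice that ignores residency targets and picks whichever state the model says costs the least energy. The state parameters, the simple linear energy model, and the use of a predicted idle duration are all assumptions for illustration, not the LPM driver's actual logic.

```c
/* Idle-state parameters are illustrative, not the Snapdragon 810's.
 * States are assumed to be ordered from shallowest to deepest. */
struct idle_state {
    int    entry_exit_energy_uj;  /* modeled cost of entering and leaving      */
    double sleep_power_mw;        /* modeled power while resident in the state */
    int    target_residency_us;   /* what menu/ladder-style governors key off  */
};

/* Residency-driven choice: pick the deepest state whose target residency
 * fits the predicted idle duration (roughly the traditional behavior). */
static int pick_by_residency(const struct idle_state *s, int n,
                             int predicted_idle_us)
{
    int best = 0;
    for (int i = 0; i < n; i++)
        if (s[i].target_residency_us <= predicted_idle_us)
            best = i;
    return best;
}

/* Energy-driven choice: ignore residency targets and pick the state whose
 * modeled total energy over the predicted idle period is lowest. */
static int pick_by_energy(const struct idle_state *s, int n,
                          int predicted_idle_us)
{
    int best = 0;
    double best_uj = 1.0e18;
    for (int i = 0; i < n; i++) {
        /* mW * us = nJ; divide by 1000 to get uJ. */
        double total_uj = s[i].entry_exit_energy_uj +
                          s[i].sleep_power_mw * predicted_idle_us / 1000.0;
        if (total_uj < best_uj) {
            best_uj = total_uj;
            best = i;
        }
    }
    return best;
}
```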
We not only see the energy models for the various CPU and cluster idle states, but also for the idle states of the CCI (Cache Coherent Interconnect), something which is lacking in GTS's software stack.
Ultimately, while it's clear that Qualcomm's approach to the big.LITTLE problem has its inefficiencies, it appears to be far superior to anything else with big.LITTLE on the market. And as previously mentioned in our Note 4 Exynos review, ARM's energy aware scheduler is still far from implementation on a shipping SoC. This issue is only compounded by ARM's need to make a solution that works for all big.LITTLE SoCs, and OEM adoption is often slow in these scenarios. While the Snapdragon 810 could be behind other SoCs in process technology, advantages in areas such as the thread scheduler could narrow the gap.
119 Comments
twizzlebizzle22 - Thursday, February 12, 2015 - link
The speed on modern/flagship SoCs are phenomenal. The right implementation and power savings are what I'm focussed on this year.
ddriver - Thursday, February 12, 2015 - link
Either there is a typo in the "PNG Comp ST" test, or the Exynos 5433 is ~1000 times faster than the competition...
MrCommunistGen - Thursday, February 12, 2015 - link
Probably a comma instead of a decimal point. You'll see that the multithreaded PNG score for the Exynos 5433 is roughly in line with the other SoCs and much "lower" than the single-threaded score.
Mondozai - Thursday, February 12, 2015 - link
"The speed on modern/flagship SoCs are phenomenal."
Yes, but not this chip. It's going to be Qualcomm's main chip in 2015, yet it's still getting beaten by year-old tech. Then again, the OEMs want a "total solution", and while Nvidia is crushing them in the GPU benchmarks, Nvidia still doesn't have a good integrated LTE solution, for example.
Nevertheless, GPU power matters. This SoC will struggle with 4K and it's supposed to be the high end. Disappointing.
Makaveli - Thursday, February 12, 2015 - link
Does 4K really matter that much on a 5" display?
fokka - Thursday, February 12, 2015 - link
i say no, but sadly that is where the market will go, especially on phablets and tablets. there already are rumours about an lg g4 with a 1800p screen and as we see on qualcomm's reference platform, i'm pretty sure we'll see some 4k tablets enter the market pretty soon.
Frenetic Pony - Friday, February 13, 2015 - link
Then don't buy their bullshit, that's easy enough. Anything beyond 1080 for sub-6" is ridiculous and wasteful.
Uplink10 - Friday, February 13, 2015 - link
I think anything beyond HD for a smartphone is worthless, the difference is not worth the price and energy. Do people need 4K, FullHD, QHD screens because they edit photos and videos on their smartphone which we then see in the cinemas?
xnay - Saturday, February 14, 2015 - link
I totally agree with you. And I'm waiting impatiently for the new HTC M9 because it's said to be using a 1080p display.
Laststop311 - Friday, February 20, 2015 - link
I'm with you. I wish they would stick to standard full HD and focus on improving reflectance of outside light to a lower percentage (better performance in this area is critical, as it allows easier viewing in sunlight without having to crank the brightness up and use more power), luminance per watt for either a brighter screen or the same brightness at less power (which is easily possible if they quit using smaller pixels that block more of the backlight), and better color accuracy and gamma, with even a higher-bit screen to display more color while keeping accuracy high. Pre-calibrated with professional tools at the factory, the way Dell does with their high-end U3014. Almost 100% of people I know would trade a couple extra hours of battery life to have fewer pixels. Fewer pixels = less power used by the GPU, a lower-power backlight needed, less heat generated by the backlight, a smaller backlight needed (can make the phone a bit thinner), and a more responsive phone when scrolling, since fewer pixels have to be rendered for the scroll animation, so it's smoother and faster and uses less energy. And there isn't really a downside. You would have to have superhuman eagle eyes to see the difference between a 1080 RGB stripe and a 1440 RGB stripe. There are many more benefits to sticking with 1080. Anything higher is utterly ridiculous for a 5-6 inch phone.
I could honestly get by with 1280 x 720 or 1366 x 768 or whatever it is. I loved the screen on my 5.5" Galaxy Note 2 with its RGB stripe 1280x720 AMOLED. Everything looked plenty crisp, and switching to the Note 4, sure, things do look a bit more crisp, but just imagine the battery life saved if it was 1280x720. I bet hours would be added to it.