Contact: OlivierCertner
Motivation
For the amd64 architecture, Intel started shipping hybrid CPUs with the low-volume Lakefield parts and then, much more widely, with Alder Lake (Gen12). Apart from some Alder Lake models, it is now practically impossible to buy an Intel chip that does not have at least P (Performance) and E (Efficiency) cores.
ARM released the first incarnations of its big.LITTLE arrangement as early as 2011. DynamIQ is an evolution where big and LITTLE CPUs can be grouped into a single cluster (in particular, sharing the L3 cache), offering more flexibility for task migration.
See the Wikipedia page on Heterogeneous Computing.
Scope
- Make the scheduler (ULE) aware of CPUs whose cores have different performance/energy mixes.
- Implement policies to steer scheduling decisions as the user/admin sees fit, with different possible goals (e.g., max performance, energy savings, latency, etc.).
- First, target amd64/Intel.
Design Ideas
Policies
Of course, choosing a core to execute a thread when cores have different performance and efficiency characteristics is not only an optimization process, as many degrees of freedom remain (consider the simple example of a single thread and two available cores, a P core and an E core; which one should be chosen?). This selection must therefore be guided by specific policies decided by the administrator/user.
Currently, it is possible to use cpuset(1) to restrict threads/processes (and other objects) to a specific subset of cores. Each cpuset holds a mask designating the allowed cores (and domains, but we are not concerned with those here). Cpusets are useful for a number of reasons, in general to get more performance and more predictability (soft real-time), more specifically by, e.g., moving compute threads next to where their data appears (such as NICs and their interrupts), reducing the overhead of thread migration, partitioning the system, etc. Cpusets form a tree in which each node restricts the mask of its descendants. Specific "root" cpusets apply to the whole system and to each jail, making it possible to restrict jail processes to some cores only.
Performance/efficiency policies are conceptually different from cpusets, as the latter only say which cores are allowed, not which ones should actually be chosen. Directly mapping policies to cpuset masks would lead to a number of practical problems, for example:
- Only trivial policies could be implemented.
- Changing a policy would amount to changing a mask, which would need to be propagated down the cpuset tree, where it may be further restricted at each level, to the point where no CPUs are actually available at some nodes.
That said, we may want to leverage the cpuset tree structure and the fact that interesting objects are already tagged with cpusets, by attaching a policy to a struct cpuset. Cpusets down the tree would only be able to further restrict parent policies, which implies that we must establish an ordering of policies.
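As a rough illustration of that idea, the policy could be an enumerated value stored in each cpuset and resolved by walking up the tree. This is only a sketch: the enum, the cs_hybrid_policy member and the helper below are hypothetical, not existing kernel interfaces (only cs_parent exists in the real struct cpuset).

#include <stddef.h>

enum cpuset_hybrid_policy {
	CPUSET_HYBRID_INHERIT = 0,	/* defer to the closest ancestor's policy */
	CPUSET_HYBRID_MAX_PERF,		/* use all allowed cores, most performant first */
	CPUSET_HYBRID_EFFICIENT,	/* restrict to the most efficient cores */
};

/*
 * Simplified stand-in for the real struct cpuset (sys/cpuset.h), showing
 * only the member used here plus the proposed new one.
 */
struct cpuset {
	struct cpuset			*cs_parent;		/* existing member: parent set */
	enum cpuset_hybrid_policy	 cs_hybrid_policy;	/* proposed new member */
};

/* Hypothetical lookup: a set without an explicit policy inherits its ancestors'. */
static enum cpuset_hybrid_policy
cpuset_get_hybrid_policy(const struct cpuset *set)
{
	while (set->cs_hybrid_policy == CPUSET_HYBRID_INHERIT &&
	    set->cs_parent != NULL)
		set = set->cs_parent;
	return (set->cs_hybrid_policy);
}

Enforcing the ordering mentioned above would then mean rejecting, in the cpuset code, any attempt by a child set to select a policy less restrictive than its parent's.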
Managing policies independently of cpuset masks makes it possible to define partitions, each comprising different types of cores, inside which the selected policy can apply. This seems to give maximum flexibility: the administrator can still decide to forcibly rule out some cores for some threads, but also enjoy the benefits of a performance/efficiency policy, establish different policies for different system partitions, etc.
Among the possible policies, there should be at least a maximum performance one: all cores are kept busy when possible, and if not all of them can be (not enough threads to run in parallel), the most performant ones are used first. Of course, there cannot be a maximum efficiency policy without any other constraint (that would be equivalent to turning the computer off :-)). This is where optimization comes into play: trying to keep the same performance, or degrading it only slightly, while consuming less power; or, conversely, capping power consumption and trying to reach maximum performance within that envelope.
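For instance, the core-selection step of the maximum performance policy could look roughly like the following sketch. cpuset_t and CPU_ISSET() are real interfaces; the perf_class[] and idle[] inputs are hypothetical stand-ins for whatever per-CPU state the scheduler would actually maintain.

#include <sys/param.h>
#include <sys/cpuset.h>
#include <stdbool.h>

/*
 * Purely illustrative: pick the most performant idle CPU among the allowed
 * ones.  perf_class[] ranks core types (higher = more performant).
 */
static int
pick_cpu_max_perf(const cpuset_t *allowed, const int *perf_class,
    const bool *idle, int ncpus)
{
	int best, cpu;

	best = -1;
	for (cpu = 0; cpu < ncpus; cpu++) {
		if (!CPU_ISSET(cpu, allowed) || !idle[cpu])
			continue;
		if (best == -1 || perf_class[cpu] > perf_class[best])
			best = cpu;
	}
	return (best);	/* -1 means no idle CPU: queue on the best busy one */
}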
Nobody wants to reduce performance without reason, so it seems that what is needed is some kind of efficiency control value. Let's say we define a unitless scale from 0 to 100 (this is arbitrary at this point, and just to get more concrete). Maximum efficiency would be 100, and we would define it as running threads only on the "lowest" type of available cores (E or LP-E for Intel CPUs). Minimum efficiency (0) would correspond to maximum performance as explained above. As the value grows from 0, we would allow a higher ratio of use of the most efficient cores relative to the more performant ones (so we need a suitable increasing mathematical function that sends 0 to 0 and 100 to infinity, and probably sends 50 to 1; in practice, though, the ratios towards 0 would be discretized to avoid inefficiencies). The single-value model is perhaps too simplistic when there are more than two types of cores; it may be desirable to have one such value per core type, except for the most efficient type (which serves as the base).
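As a purely illustrative candidate (the exact shape would of course need tuning and experimentation), the function r(v) = v / (100 - v) has the required properties: r(0) = 0, r(50) = 1, and r(v) tends to infinity as v approaches 100, with r(v) interpreted as the target ratio of use of the most efficient cores relative to the more performant ones.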
Processes/threads that should never consume too much power or, in the case of Meteor Lake, should not wake the compute tile, would be put in an appropriate cpuset whose policy completely rules out running on anything other than the most efficient cores. Such processes/threads may include the power management daemon (e.g., powerd), interrupts for low-speed devices, etc.
Interrupts
(TBD) - adrian
Kernel threads and default cpusets
(TBD) - adrian
Kernel threads and per-CPU workloads/workqueues/callbacks
(TBD) - adrian
Impacts on the Scheduler
Thread Migration Policy
This seems to be the most delicate part, and requires more thinking.
Currently, ULE essentially works as a per-core scheduler, and a long-term balancer rotates threads among available cores.
Taking just the example of long-running threads (though not necessarily running at 100% CPU), we may want to rotate them among the different core types to average out their progress per unit of time. Alternating between core types only after long periods of time, as the long-term balancer would do, will make the computing power discrepancy between them visible. For applications giving continuous feedback to users (games, multimedia, etc.), that may prove unacceptable. It is unclear how short a time frame we can go down to for migrating threads between core types without causing too much inefficiency, and whether doing so can actually solve the problem just exposed (experiments will be needed). If that problem cannot be solved this way, then such applications will simply have to be put in a cpuset that only includes cores of a single type providing enough computing power for the task.
ULE has a notion of interactive versus batch threads (in the timesharing scheduling policy). We could leverage that to decide on which core type a resuming thread should start executing (e.g., P for interactive ones, E for batch ones), provided there are available cores of each type. If, e.g., there are more interactive threads than there are P cores, again we must rotate them. Interactive threads are by nature bursty, so rotation here should happen mainly while they sleep. For the same reason (burstiness), however, this situation is not expected to happen often. The converse situation, i.e., more batch threads than there are E cores, is expected to be more frequent, and requires rotation while executing.
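A minimal sketch of that placement hint follows, assuming ULE's existing interactivity score (in [0, 100], compared against the kern.sched.interact threshold, default 30, with lower meaning interactive); the enum and the helper itself are hypothetical.

enum core_type { CORE_TYPE_E, CORE_TYPE_P };

/*
 * Illustrative only: map ULE's interactivity classification to a preferred
 * core type on wakeup.  A score below the threshold means the thread is
 * considered interactive by ULE.
 */
static enum core_type
preferred_core_type(int interact_score, int interact_thresh)
{
	return (interact_score < interact_thresh ? CORE_TYPE_P : CORE_TYPE_E);
}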
Fairness
A scheduler has to preserve fairness of access to the CPU resource among threads with the same priority. ULE currently implements "fairness" for timesharing threads by allocating the same quantum to all of them. On hybrid architectures, though, a single quantum does not translate to the same available computing power, which ultimately depends on the core type. So we will have to compensate for that discrepancy by taking into account the allocated runtime weighted by the performance ratio of the core used versus the most efficient cores. The most natural path to do that is to actually compute that "virtual" runtime and schedule according to it. Incidentally, that would allow fixing some other problems ULE has, in particular that timesharing threads most of the time do not use up their whole quantum (they are quite penalized both in execution time and latency) and the dismal effect of nice values. However, this represents a significant departure from ULE, and the result might as well be considered a brand new scheduler (with similarities with, e.g., CFS). There may be mechanisms to achieve similar effects within ULE's current framework; I have some sketches of ideas, but they require more research.
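As a minimal sketch of that weighting (the percentage representation and the helper name are assumptions, not existing ULE code), the runtime charged to a thread could be scaled by the relative performance of the core it ran on:

#include <stdint.h>

/*
 * Illustrative only: charge "virtual" runtime proportional to the computing
 * power actually received.  perf_ratio_pct is the performance of the core the
 * thread ran on relative to the most efficient core type, in percent
 * (100 = most-efficient-core baseline; e.g., 160 for a P core assumed to be
 * 1.6x faster on this workload).
 */
static uint64_t
vruntime_charge(uint64_t runtime, unsigned int perf_ratio_pct)
{
	return (runtime * perf_ratio_pct / 100);
}

Scheduling timesharing threads by smallest virtual runtime would then tend to equalize delivered computing power rather than raw CPU time.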
Intel Hybrid Architecture
On Alder Lake (Gen12), there are P and E cores in varying combinations (see, e.g., Wikipedia).
On Meteor Lake (Gen14), there are also LP-E cores, i.e., ultra-low-power E cores located on a separate tile (the SoC tile), suitable for use while the compute tile is in deep sleep.
Detection
There is a CPUID leaf (0x1a) that allows distinguishing P cores from E cores. See the #References below.
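As a quick user-space illustration (in the kernel this would naturally go through the existing CPU identification code): CPUID.07H:EDX[15] reports whether the part is hybrid, and CPUID.1AH:EAX[31:24] gives the type of the core executing the instruction (0x20 for Atom/E, 0x40 for Core/P). Since the answer describes the current core, the querying thread would have to be pinned to each CPU in turn (e.g., with cpuset_setaffinity(2)) to build a complete map.

#include <cpuid.h>
#include <stdio.h>

int
main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID.07H:EDX[15] indicates a hybrid part. */
	if (!__get_cpuid_count(0x07, 0, &eax, &ebx, &ecx, &edx) ||
	    (edx & (1u << 15)) == 0) {
		printf("not a hybrid CPU\n");
		return (0);
	}

	/* CPUID.1AH:EAX[31:24] is the type of the core we are running on. */
	if (!__get_cpuid_count(0x1a, 0, &eax, &ebx, &ecx, &edx))
		return (1);
	switch (eax >> 24) {
	case 0x20:
		printf("E core (Atom microarchitecture)\n");
		break;
	case 0x40:
		printf("P core (Core microarchitecture)\n");
		break;
	default:
		printf("unknown core type 0x%x\n", eax >> 24);
		break;
	}
	return (0);
}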
Detecting LP-E cores is more complex, as these are reported as plain E cores by CPUID leaf 0x1a. It seems there is no forward-compatible way of doing this detection (see https://community.intel.com/t5/Mobile-and-Desktop-Processors/Detecting-LP-E-Cores-on-Meteor-Lake-in-software/m-p/1578915).
Thread Director
Intel's hybrid architectures feature a hardware component called the Thread Director. Its goal is to give feedback on threads' instruction mix and to report in real time the package's state with respect to attainable performance and efficiency. The latter reporting already exists as part of the Hardware Feedback Interface sub-functionality; the Thread Director extends it by giving feedback for different classes of instruction mix.
Using this information, we expect to be able to preferentially schedule threads that benefit more from P cores onto them when performance is desired, and conversely to favor E cores for threads that benefit more from them, to improve efficiency. Intel's documentation talks about comparing the performance or efficiency ratios of a thread on two cores with those of another thread, to decide which thread should preferably go onto which core. That provides a policy for deciding whether to exchange two threads between two different core types. This approach alone might be acceptable for short-lived threads, but for longer-lived ones the scheduler has to ensure fairness, which will sometimes go against that policy.
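A minimal sketch of that comparison (the inputs would come from the per-class performance capabilities reported by the hardware; the names here are made up): thread A should get the P core in preference to thread B when A's P/E performance ratio is larger, which can be checked without division.

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: true when thread A benefits more from the P core than
 * thread B does, i.e., a_perf_p / a_perf_e > b_perf_p / b_perf_e, evaluated
 * by cross-multiplication to stay in integer arithmetic.
 */
static bool
a_prefers_p_core(uint64_t a_perf_p, uint64_t a_perf_e,
    uint64_t b_perf_p, uint64_t b_perf_e)
{
	return (a_perf_p * b_perf_e > b_perf_p * a_perf_e);
}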
References
Some high-level documentation by Intel.
Intel Software Developer Manuals, in particular volume 2, CPUID instruction and volume 3, chapter 16.6 "HARDWARE FEEDBACK INTERFACE AND INTEL® THREAD DIRECTOR".