Energy cost model for energy-aware scheduling (EXPERIMENTAL)

Introduction
=============

The basic energy model uses platform energy data stored in sched_group_energy
data structures attached to the sched_groups in the sched_domain hierarchy. The
energy cost model offers two functions that can be used to guide scheduling
decisions:

	1. static unsigned int sched_group_energy(struct energy_env *eenv)
	2. static int energy_diff(struct energy_env *eenv)

sched_group_energy() estimates the energy consumed by all cpus in a specific
sched_group including any shared resources owned exclusively by this group of
cpus. Resources shared with other cpus are excluded (e.g. later level caches).

energy_diff() estimates the total energy impact of a utilization change. That
is, adding, removing, or migrating utilization (tasks).

Both functions use a struct energy_env to specify the scenario to be evaluated:

	struct energy_env {
		struct sched_group	*sg_top;
		struct sched_group	*sg_cap;
		int			cap_idx;
		int			util_delta;
		int			src_cpu;
		int			dst_cpu;
		int			energy;
	};

sg_top: sched_group to be evaluated. Not used by energy_diff().

sg_cap: sched_group covering the cpus in the same frequency domain. Set by
sched_group_energy().

cap_idx: Capacity state to be used for energy calculations. Set by
find_new_capacity().

util_delta: Amount of utilization to be added, removed, or migrated.

src_cpu: Source cpu from where 'util_delta' utilization is removed. Should be
-1 if no source (e.g. task wake-up).

dst_cpu: Destination cpu where 'util_delta' utilization is added. Should be -1
if utilization is removed (e.g. terminating tasks).

energy: Result of sched_group_energy().

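The three scenarios named in the field descriptions (migration, wake-up,
removal) differ only in which of src_cpu and dst_cpu is set to -1. The sketch
below illustrates that, using a local copy of the struct. The eenv_*() helper
names are hypothetical and exist only for this example; the fields that the
descriptions say are set by sched_group_energy() and find_new_capacity() are
simply left zeroed.

```c
#include <stddef.h>

struct sched_group;	/* opaque here; defined by the scheduler */

struct energy_env {
	struct sched_group	*sg_top;
	struct sched_group	*sg_cap;
	int			cap_idx;
	int			util_delta;
	int			src_cpu;
	int			dst_cpu;
	int			energy;
};

/* Hypothetical helper: move 'util' utilization from src_cpu to dst_cpu. */
static struct energy_env eenv_migrate(int util, int src_cpu, int dst_cpu)
{
	struct energy_env eenv = {
		.util_delta	= util,
		.src_cpu	= src_cpu,
		.dst_cpu	= dst_cpu,
	};

	return eenv;	/* all other fields are zero/NULL */
}

/* Hypothetical helper: task wake-up, so there is no source cpu. */
static struct energy_env eenv_wakeup(int util, int dst_cpu)
{
	return eenv_migrate(util, -1, dst_cpu);
}

/* Hypothetical helper: utilization is removed (e.g. terminating task). */
static struct energy_env eenv_remove(int util, int src_cpu)
{
	return eenv_migrate(util, src_cpu, -1);
}
```

A wake-up of a task with utilization 200 on cpu 2 would then be described by
eenv_wakeup(200, 2) before the result is handed to energy_diff().
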
The metric used to represent utilization is the actual per-entity running time
averaged over time using a geometric series. It is very similar to the existing
per-entity load-tracking, but _not_ scaled by task priority and capped by the
capacity of the cpu. The latter property does mean that utilization may
underestimate the compute requirements for tasks on fully/over utilized cpus.
The greatest potential for energy savings without affecting performance too much
is in scenarios where the system isn't fully utilized. If the system is deemed
fully utilized, load-balancing should be done with task load (which includes
task priority) instead, in the interest of fairness and performance.

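As a rough illustration of a running time average built from a geometric
series, the sketch below folds one period at a time into a fixed-point average
and then caps the result at the cpu capacity. It is not the scheduler's
load-tracking code: SCALE, the 'decay' parameter and both function names are
assumptions made for this example, and the decay constant actually used by the
kernel is not specified in this document.

```c
#define SCALE	1024	/* assumed fixed-point unit: 1024 == always running */

/*
 * Fold one period into the running average.
 *
 * avg:     current average, 0..SCALE
 * running: part of the latest period spent running, 0..SCALE
 * decay:   weight given to history, 0..SCALE (512 == 0.5)
 *
 * Applied repeatedly, older periods are weighted by decay, decay^2,
 * decay^3, ... which is the geometric series.
 */
static unsigned int util_update(unsigned int avg, unsigned int running,
				unsigned int decay)
{
	return (avg * decay + running * (SCALE - decay)) / SCALE;
}

/* Utilization is capped by the capacity of the cpu. */
static unsigned int util_capped(unsigned int avg, unsigned int capacity)
{
	return avg < capacity ? avg : capacity;
}
```

With decay = 512, a task that starts running continuously moves from 0 to 512
after one period and to 768 after two. The cap is what can hide part of the
compute requirement on a fully utilized cpu.
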

Background and Terminology
===========================

To make it clear from the start:

energy = [joule] (resource like a battery on battery-powered devices)
power = energy/time = [joule/second] = [watt]

The goal of energy-aware scheduling is to minimize energy, while still getting
the job done. That is, we want to maximize:

	performance [inst/s]
	--------------------
	    power [W]

which is equivalent to minimizing:

	energy [J]
	-----------
	instruction

while still getting 'good' performance. It is essentially an alternative
optimization objective to the current performance-only objective for the
scheduler. This alternative considers two objectives: energy-efficiency and
performance. Hence, there needs to be a user controllable knob to switch the
objective. Since it is early days, this is currently a sched_feature
(ENERGY_AWARE).

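The equivalence between the two ratios above follows directly from the unit
definitions at the start of this section. Written out, with no new quantities
introduced:

```latex
\[
\frac{\text{performance}}{\text{power}}
  = \frac{\text{inst}/\text{s}}{\text{J}/\text{s}}
  = \frac{\text{inst}}{\text{J}}
\qquad\Longrightarrow\qquad
\max \frac{\text{inst}}{\text{J}}
  \;\Longleftrightarrow\;
\min \frac{\text{J}}{\text{inst}}
\]
```

The time unit cancels, so the ratio says nothing about how fast the work is
done. That is why 'good' performance has to be stated as a separate objective.
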
The idea behind introducing an energy cost model is to allow the scheduler to
evaluate the implications of its decisions rather than applying energy-saving
techniques blindly that may only have positive effects on some platforms. At
the same time, the energy cost model must be as simple as possible to minimize
the scheduler latency impact.

Platform topology
------------------

The system topology (cpus, caches, and NUMA information, not peripherals) is
represented in the scheduler by the sched_domain hierarchy which has
sched_groups attached at each level that cover one or more cpus (see
sched-domains.txt for more details). To add energy awareness to the scheduler
we need to consider power and frequency domains.

Power domain:

A power domain is a part of the system that can be powered on/off
independently. Power domains are typically organized in a hierarchy where you
may be able to power down just a cpu or a group of cpus along with any
associated resources (e.g. shared caches). Powering up a cpu means that all
power domains it is a part of in the hierarchy must be powered up. Hence, it is
more expensive to power up the first cpu that belongs to a higher level power
domain than powering up additional cpus in the same high level domain. Two-level
power domain hierarchy example:

                Power source
                         +-------------------------------+----...
per group PD             G                               G
                         |           +----------+        |
                    +--------+-------| Shared   |  (other groups)
per-cpu PD          G        G       | resource |
                    |        |       +----------+
                +-------+ +-------+
                | CPU 0 | | CPU 1 |
                +-------+ +-------+

Frequency domain:

Frequency domains (P-states) typically cover the same group of cpus as one of
the power domain levels. That is, there might be several smaller power domains
sharing the same frequency (P-state) or there might be a power domain spanning
multiple frequency domains.

From a scheduling point of view there is no need to know the actual frequencies
[Hz]. All the scheduler cares about is the compute capacity available at the
current state (P-state) the cpu is in and any other available states. For that
reason, and to also factor in any cpu micro-architecture differences, compute
capacity scaling states are called 'capacity states' in this document. For SMP
systems this is equivalent to P-states. For mixed micro-architecture systems
(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
performance relative to the other cpus in the system.

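One way to picture a capacity state is as a P-state frequency scaled twice:
once by the frequency range of the cpu and once by how the cpu type compares
to the fastest cpu type in the system. The sketch below assumes that compute
capacity scales linearly with frequency, which is a simplification.
CAPACITY_SCALE, capacity_of_state() and the 'uarch_perf' parameter are names
made up for this example.

```c
#define CAPACITY_SCALE	1024	/* assumed: fastest cpu type at its highest P-state */

/*
 * freq_khz:     frequency of the P-state
 * max_freq_khz: highest frequency of this cpu
 * uarch_perf:   performance of this cpu type at max_freq_khz relative to
 *               the fastest cpu type, 0..CAPACITY_SCALE. For SMP systems
 *               this is CAPACITY_SCALE for every cpu.
 */
static unsigned long capacity_of_state(unsigned long freq_khz,
				       unsigned long max_freq_khz,
				       unsigned long uarch_perf)
{
	return uarch_perf * freq_khz / max_freq_khz;
}
```

On an SMP system a P-state at half the maximum frequency gets capacity 512.
For a slower cpu type with a hypothetical uarch_perf of 430, the highest
P-state has capacity 430 and a P-state at 35% of its maximum frequency has
capacity 150.
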
Energy modelling:
------------------

Due to the hierarchical nature of the power domains, the most obvious way to
model energy costs is to associate power and energy costs with domains (groups
of cpus). Energy costs of shared resources are associated with the group of
cpus that share the resources. Only the cost of powering the cpu itself and any
private resources (e.g. private L1 caches) is associated with the per-cpu
groups (lowest level).

For example, for an SMP system with per-cpu power domains and a cluster level
(group of cpus) power domain we get the overall energy costs to be:

	energy = energy_cluster + n * energy_cpu

where 'n' is the number of cpus powered up and energy_cluster is the cost paid
as soon as any cpu in the cluster is powered up.

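The formula can be evaluated directly. The sketch below does so and also makes
explicit one assumption that the formula leaves implicit: a cluster with no
cpus powered up is taken to cost nothing. The function name and the numbers in
the note that follows are for illustration only.

```c
/*
 * energy = energy_cluster + n * energy_cpu
 *
 * The cluster cost is paid once, as soon as any cpu in the cluster is
 * powered up. Assumption: with no cpu powered up the cost is zero.
 */
static unsigned long cluster_energy(unsigned int nr_cpus_up,
				    unsigned long energy_cluster,
				    unsigned long energy_cpu)
{
	if (nr_cpus_up == 0)
		return 0;

	return energy_cluster + nr_cpus_up * energy_cpu;
}
```

With energy_cluster = 100 and energy_cpu = 30 (bogo-units), the first cpu
costs 130 while each additional cpu costs only 30. This is the asymmetry
described for power domains: waking the first cpu in a cluster is the
expensive step.
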
The power and frequency domains can naturally be mapped onto the existing
sched_domain hierarchy and sched_groups by adding the necessary data to the
existing data structures.

The energy model considers energy consumption from two contributors (shown in
the illustration below):

1. Busy energy: Energy consumed while a cpu and the higher level groups that it
belongs to are busy running tasks. Busy energy is associated with the state of
the cpu, not an event. The time the cpu spends in this state varies. Thus, the
most obvious platform parameter for this contribution is busy power
(energy/time).

2. Idle energy: Energy consumed while a cpu and higher level groups that it
belongs to are idle (in a C-state). Like busy energy, idle energy is associated
with the state of the cpu. Thus, the platform parameter for this contribution
is idle power (energy/time).

Energy consumed during transitions from an idle-state (C-state) to a busy state
(P-state) or going the other way is ignored by the model to simplify the energy
model calculations.

     Power
     ^
     |            busy->idle             idle->busy
     |            transition             transition
     |
     |                _                      __
     |               / \                    /  \__________________
     |______________/   \                  /
     |                   \                /
     |  Busy              \    Idle      /        Busy
     |  low P-state        \____________/         high P-state
     |
     +------------------------------------------------------------> time

Busy |--------------|                          |-----------------|

Wakeup              |------|            |------|

Idle                       |------------|


The basic algorithm
====================

The basic idea is to determine the total energy impact when utilization is
added or removed by estimating the impact at each level in the sched_domain
hierarchy starting from the bottom (sched_group contains just a single cpu).
The energy cost comes from busy time (sched_group is awake because one or more
cpus are busy) and idle time (in an idle-state). Energy model numbers account
for energy costs associated with all cpus in the sched_group as a group.

	for_each_domain(cpu, sd) {
		sg = sched_group_of(cpu)
		energy_before = curr_util(sg) * busy_power(sg)
				+ (1-curr_util(sg)) * idle_power(sg)
		energy_after = new_util(sg) * busy_power(sg)
				+ (1-new_util(sg)) * idle_power(sg)
		energy_diff += energy_before - energy_after
	}

	return energy_diff

{curr, new}_util: The cpu utilization at the lowest level and the overall
non-idle time for the entire group for higher levels. Utilization is in the
range 0.0 to 1.0 in the pseudo-code.

busy_power: The power consumption of the sched_group.

idle_power: The power consumption of the sched_group when idle.

Note: It is a fundamental assumption that the utilization is (roughly) scale
invariant. Task utilization tracking factors in any frequency scaling and
performance scaling differences due to different cpu microarchitectures such
that task utilization can be used across the entire system.

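The pseudo-code can be turned into a small self-contained C function for
experimenting with numbers. This is only a sketch, not the scheduler's
energy_diff(): the sched_domain walk is replaced by a plain array with one
entry per level, utilization uses an assumed fixed-point range of 0..1024
instead of 0.0..1.0, and struct level and both function names are made up for
the example. The sign follows the pseudo-code (before minus after).

```c
#define UTIL_SCALE	1024	/* assumed: 1024 corresponds to utilization 1.0 */

/* Hypothetical per-level data, lowest level (single cpu) first. */
struct level {
	unsigned long busy_power;
	unsigned long idle_power;
	unsigned long curr_util;	/* 0..UTIL_SCALE */
	unsigned long new_util;		/* 0..UTIL_SCALE */
};

/* util * busy_power + (1 - util) * idle_power, in fixed point */
static long level_energy(const struct level *l, unsigned long util)
{
	return (long)(util * l->busy_power +
		      (UTIL_SCALE - util) * l->idle_power);
}

/*
 * Sum energy_before - energy_after over all levels. The result is in
 * units of power * UTIL_SCALE. A negative value means the new scenario
 * is estimated to cost more energy.
 */
static long energy_diff_sketch(const struct level *levels, int nr_levels)
{
	long diff = 0;
	int i;

	for (i = 0; i < nr_levels; i++)
		diff += level_energy(&levels[i], levels[i].curr_util) -
			level_energy(&levels[i], levels[i].new_util);

	return diff;
}
```

For a cpu level {busy 100, idle 10} going from utilization 256 to 512 and a
cluster level {busy 50, idle 5} going from 512 to 768, the per-level
differences are -23040 and -11520, so the function returns -34560.
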

Platform energy data
=====================

struct sched_group_energy can be attached to sched_groups in the sched_domain
hierarchy and has the following members:

cap_states:
	List of struct capacity_state representing the supported capacity states
	(P-states). struct capacity_state has two members: cap and power, which
	represent the compute capacity and the busy_power of the state. The
	list must be ordered by capacity low->high.

nr_cap_states:
	Number of capacity states in cap_states list.

idle_states:
	List of struct idle_state containing idle_state power cost for each
	idle-state supported by the system ordered by shallowest state first.
	All states must be included at all levels in the hierarchy, i.e. a
	sched_group spanning just a single cpu must also include coupled
	idle-states (cluster states). In addition to the cpuidle idle-states,
	the list must also contain an entry for idling using the arch
	default idle (arch_idle_cpu()). Although this state may not be a true
	hardware idle-state, it is considered the shallowest idle-state in the
	energy model and must be the first entry. cpus may enter this state
	(possibly 'active idling') if cpuidle decides not to enter a cpuidle
	idle-state. Default idle may not be used when cpuidle is enabled.
	In this case, it should just be a copy of the first cpuidle idle-state.

nr_idle_states:
	Number of idle states in idle_states list.

There are no unit requirements for the energy cost data. Data can be normalized
with any reference, however, the normalization must be consistent across all
energy cost data. That is, one bogo-joule/watt must be the same quantity for
all data, but we don't care what it is.

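To make the member list concrete, the sketch below declares the three
structures as described above and fills in one table. The member names come
from the text; the member types, the example_* names and every number are
assumptions for illustration (the numbers are invented bogo-units, not
measurements of any platform). A small helper checks the one ordering rule
that is easy to get wrong.

```c
#include <stdbool.h>

struct capacity_state {
	unsigned long cap;	/* compute capacity */
	unsigned long power;	/* busy power at this capacity */
};

struct idle_state {
	unsigned long power;	/* power consumed in this idle state */
};

struct sched_group_energy {
	unsigned int nr_idle_states;
	struct idle_state *idle_states;
	unsigned int nr_cap_states;
	struct capacity_state *cap_states;
};

/* Invented bogo-numbers, ordered by capacity low->high. */
static struct capacity_state example_cap_states[] = {
	{ .cap =  256, .power = 100 },
	{ .cap =  512, .power = 250 },
	{ .cap = 1024, .power = 700 },
};

/* Shallowest state first; entry 0 stands for the arch default idle. */
static struct idle_state example_idle_states[] = {
	{ .power = 20 },	/* arch default idle (copy of first cpuidle state) */
	{ .power = 20 },	/* first cpuidle idle-state */
	{ .power =  5 },	/* deeper idle-state */
};

static struct sched_group_energy example_energy = {
	.nr_idle_states	= sizeof(example_idle_states) /
			  sizeof(example_idle_states[0]),
	.idle_states	= example_idle_states,
	.nr_cap_states	= sizeof(example_cap_states) /
			  sizeof(example_cap_states[0]),
	.cap_states	= example_cap_states,
};

/* The cap_states list must be ordered by capacity low->high. */
static bool cap_states_ordered(const struct sched_group_energy *sge)
{
	unsigned int i;

	for (i = 1; i < sge->nr_cap_states; i++)
		if (sge->cap_states[i].cap < sge->cap_states[i - 1].cap)
			return false;

	return true;
}
```

Because the data has no unit requirements, only the ratios between the power
numbers matter, as long as all tables in the system use the same reference.
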
A recipe for platform characterization
=======================================

Obtaining the actual model data for a particular platform requires some way of
measuring power/energy. There isn't a tool to help with this (yet). This
section provides a recipe for use as reference. It covers the steps used to
characterize the ARM TC2 development platform. This sort of measurement is
expected to be done anyway when tuning cpuidle and cpufreq for a given
platform.

The energy model needs two types of data (struct sched_group_energy holds
these) for each sched_group where energy costs should be taken into account:

1. Capacity state information

A list containing the compute capacity and power consumption when fully
utilized attributed to the group as a whole for each available capacity state.
At the lowest level (group contains just a single cpu) this is the power of the
cpu alone without including power consumed by resources shared with other cpus.
It basically needs to fit the basic modelling approach described in the
"Background and Terminology" section:

	energy_system = energy_shared + n * energy_cpu

for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
the lowest level. 'energy_shared' is included at the next level which
represents the group of cpus among which the resources are shared.

This model is, of course, a simplification of reality. Thus, power/energy
attributions might not always exactly represent how the hardware is designed.
Also, busy power is likely to depend on the workload. It is therefore
recommended to use a representative mix of workloads when characterizing the
capacity states.

If the group has no capacity scaling support, the list will contain a single
state where power is the busy power attributed to the group. The capacity
should be set to a default value (1024).

When frequency domains include multiple power domains, the group representing
the frequency domain and all child groups share capacity states. This must be
indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
all levels that share the capacity state must have the list of capacity states
with the power set to the contribution of the individual group.

2. Idle power information

Stored in the idle_states list. The power number is the group idle power
consumption in each idle state as well as when the group is idle but has not
entered an idle-state ('active idle' as mentioned earlier). Due to the way the
energy model is defined, the idle power of the deepest group idle state can
alternatively be accounted for in the parent group busy power. In that case the
group idle state power values are offset such that the idle power of the
deepest state is zero. It is less intuitive, but it is easier to measure as
idle power consumed by the group and the busy/idle power of the parent group
cannot be distinguished without per group measurement points.

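The offsetting step can be sketched as follows. The function name is
hypothetical, and the code assumes that the list is ordered shallowest state
first (so the deepest state is the last entry) and that the deepest state has
the lowest power. Values that would go negative are clamped to zero.

```c
/*
 * Offset idle-state power values in place so that the deepest state
 * (last entry) becomes zero.
 *
 * Returns the amount subtracted, i.e. the part that is instead
 * accounted for in the parent group busy power.
 */
static unsigned long offset_idle_power(unsigned long *idle_power,
				       unsigned int nr_states)
{
	unsigned long deepest;
	unsigned int i;

	if (nr_states == 0)
		return 0;

	deepest = idle_power[nr_states - 1];

	for (i = 0; i < nr_states; i++)
		idle_power[i] = idle_power[i] > deepest ?
				idle_power[i] - deepest : 0;

	return deepest;
}
```

For measured values {25, 25, 10} the list becomes {15, 15, 0} and 10 is
returned.
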
Measuring capacity states and idle power:

The capacity states' capacity and power can be estimated by running a benchmark
workload at each available capacity state. By restricting the benchmark to run
on subsets of cpus it is possible to extrapolate the power consumption of
shared resources.

ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
benchmark workload on just one cpu in a cluster means that power is consumed in
the cluster (higher level group) and a single cpu (lowest level group). Adding
another benchmark task to another cpu increases the power consumption by the
amount consumed by the additional cpu. Hence, it is possible to extrapolate the
cluster busy power.

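The extrapolation amounts to solving two equations for two unknowns. Under the
model used in this document, a measurement with one busy cpu gives cluster +
cpu, and a measurement with two busy cpus gives cluster + 2 * cpu. The sketch
below separates the two contributions. The struct and function names are
hypothetical, and it assumes both measurements were taken at the same capacity
state and that idle cpus contribute nothing.

```c
struct busy_power_split {
	unsigned long cpu;	/* busy power of a single cpu */
	unsigned long cluster;	/* busy power of the shared resources */
};

/*
 * p_one: cluster power with the benchmark on one cpu  (cluster + cpu)
 * p_two: cluster power with the benchmark on two cpus (cluster + 2 * cpu)
 *
 * Assumes p_two >= p_one and p_one >= (p_two - p_one).
 */
static struct busy_power_split split_busy_power(unsigned long p_one,
						unsigned long p_two)
{
	struct busy_power_split s;

	s.cpu = p_two - p_one;		/* cost of the additional cpu */
	s.cluster = p_one - s.cpu;	/* what remains is shared */

	return s;
}
```

With hypothetical readings of 130 (one cpu busy) and 160 (two cpus busy), the
per-cpu busy power is 30 and the cluster busy power is 100. The same
measurement pair would be repeated for each capacity state.
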
For platforms that don't have energy counters or equivalent instrumentation
built-in, it may be possible to use an external DAQ to acquire similar data.

If the benchmark includes some performance score (for example sysbench cpu
benchmark), this can be used to record the compute capacity.

Measuring idle power requires insight into the idle state implementation on the
particular platform, specifically whether the platform has coupled idle-states
(or package states). To measure non-coupled per-cpu idle-states it is necessary
to keep one cpu busy to keep any shared resources alive to isolate the idle
power of the cpu from idle/busy power of the shared resources. The cpu can be
tricked into different per-cpu idle states by disabling the other states. Based
on various combinations of measurements with specific cpus busy and disabling
idle-states it is possible to extrapolate the idle-state power.