CPU frequency and voltage scaling code in the Linux(TM) kernel


                     L i n u x    C P U F r e q

                  C P U F r e q   G o v e r n o r s

               - information for users and developers -


                 Dominik Brodowski  <linux@brodo.de>
        some additions and corrections by Nico Golde <nico@ngolde.de>



Clock scaling allows you to change the clock speed of the CPUs on the
fly. This is a nice method to save battery power, because the lower
the clock speed, the less power the CPU consumes.


Contents:
---------
1.   What is a CPUFreq Governor?

2.   Governors In the Linux Kernel
2.1  Performance
2.2  Powersave
2.3  Userspace
2.4  Ondemand
2.5  Conservative
2.6  Interactive

3.   The Governor Interface in the CPUfreq Core



1. What Is A CPUFreq Governor?
==============================

Most cpufreq drivers (except intel_pstate and longrun), and indeed most
CPU frequency scaling algorithms, only allow the CPU to be set to one
frequency. In order to offer dynamic frequency scaling, the cpufreq
core must be able to tell these drivers of a "target frequency". So
these specific drivers are converted to offer a "->target/target_index"
call instead of the existing "->setpolicy" call. For "longrun",
everything stays the same, though.

How is the frequency to be used within the CPUfreq policy decided?
That's done using "cpufreq governors". Two are already in this patch
-- they're the already existing "powersave" and "performance", which
set the frequency statically to the lowest or highest frequency,
respectively. At least two more such governors will be ready for
addition in the near future, and likely many more, as there are various
theories and models about dynamic frequency scaling around. Because
cpufreq offers such a generic interface to scaling governors, these can
be tested extensively, and the best one can be selected for each
specific use.

Basically, it's the following flow graph:

CPU can be set to switch independently   |         CPU can only be set
      within specific "limits"           |       to specific frequencies

                             "CPUfreq policy"
            consists of frequency limits (policy->{min,max})
                 and CPUfreq governor to be used
                         /                    \
                        /                      \
                       /                       the cpufreq governor decides
                      /                        (dynamically or statically)
                     /                         what target_freq to set within
                    /                          the limits of policy->{min,max}
                   /                                \
                  /                                  \
        Using the ->setpolicy call,              Using the ->target/target_index call,
            the limits and the                    the frequency closest
             "policy" is set.                     to target_freq is set.
                                                  It is assured that it
                                                  is within policy->{min,max}
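
As a rough illustration of the right-hand branch of the graph, the
following Python sketch clamps a governor's target_freq to the policy
limits and then picks the closest supported frequency. The function name
resolve_target and the frequency table are hypothetical. Real drivers
select a frequency according to the "relation" argument rather than by a
plain nearest match, so treat this only as a model of the idea.

```python
def resolve_target(target_khz, policy_min, policy_max, table):
    """Return the table frequency closest to target_khz within the policy limits.

    Assumption: ties are resolved in favour of the lower frequency.
    """
    allowed = [f for f in table if policy_min <= f <= policy_max]
    if not allowed:
        raise ValueError("no table frequency lies within the policy limits")
    # Never aim outside policy->{min,max}, whatever the governor asked for.
    clamped = min(max(target_khz, policy_min), policy_max)
    return min(allowed, key=lambda f: (abs(f - clamped), f))


# Hypothetical frequency table in kHz.
TABLE = [800000, 1200000, 1600000, 2000000]
```

With policy limits of 800000 and 1600000 kHz, a request for 2000000 kHz
resolves to 1600000 kHz, and a request for 1300000 kHz resolves to
1200000 kHz.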


2. Governors In the Linux Kernel
================================

2.1 Performance
---------------

The CPUfreq governor "performance" sets the CPU statically to the
highest frequency within the borders of scaling_min_freq and
scaling_max_freq.


2.2 Powersave
-------------

The CPUfreq governor "powersave" sets the CPU statically to the
lowest frequency within the borders of scaling_min_freq and
scaling_max_freq.


2.3 Userspace
-------------

The CPUfreq governor "userspace" allows the user, or any userspace
program running with UID "root", to set the CPU to a specific frequency
by making a sysfs file "scaling_setspeed" available in the CPU-device
directory.
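
A minimal sketch of how a root-owned program might use this file is
shown below. The helper name set_userspace_speed is hypothetical, and the
sketch assumes that the directory also contains the usual
"scaling_governor" file and that frequencies are written in kHz.

```python
from pathlib import Path


def set_userspace_speed(cpufreq_dir, khz):
    """Write khz to scaling_setspeed, refusing if another governor is active."""
    directory = Path(cpufreq_dir)
    governor = (directory / "scaling_governor").read_text().strip()
    if governor != "userspace":
        raise RuntimeError(f"governor is '{governor}', not 'userspace'")
    (directory / "scaling_setspeed").write_text(f"{khz}\n")
```

On a typical system the directory would be something like
/sys/devices/system/cpu/cpu0/cpufreq, and the call would need root
privileges.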


2.4 Ondemand
------------

The CPUfreq governor "ondemand" sets the CPU depending on the
current usage. To do this the CPU must have the capability to
switch the frequency very quickly. A number of parameters are
accessible through sysfs files:

sampling_rate: measured in uS (10^-6 seconds), this is how often you
want the kernel to look at the CPU usage and to make decisions on
what to do about the frequency. Typically this is set to values of
around '10000' or more. Its default value is (compare with
users-guide.txt):
    transition_latency * 1000
Be aware that transition latency is in ns and sampling_rate is in us, so you
get the same sysfs value by default.
The sampling rate should always be adjusted with the transition latency
in mind. To set the sampling rate 750 times as high as the transition
latency in bash (as said, 1000 is default), do:
    echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) \
        >ondemand/sampling_rate
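
The unit conversion in the command above can be spelled out as a small
Python sketch. The function name sampling_rate_us is hypothetical; the
arithmetic simply mirrors the shell expression.

```python
def sampling_rate_us(transition_latency_ns, multiplier=1000):
    """Sampling rate in us for a transition latency given in ns.

    multiplier is how many times the transition latency the sampling
    interval should be; 1000 is the default named in the text.
    """
    # ns * multiplier gives ns; dividing by 1000 converts to us.
    return transition_latency_ns * multiplier // 1000
```

For a transition latency of 10000 ns, the default multiplier yields a
sampling rate of 10000 us (the same number as the latency in ns), and a
multiplier of 750 yields 7500 us.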

sampling_rate_min:
The sampling rate is limited by the HW transition latency:
    transition_latency * 100
Or by kernel restrictions:
If CONFIG_NO_HZ_COMMON is set, the limit is 10ms fixed.
If CONFIG_NO_HZ_COMMON is not set or the nohz=off boot parameter is used,
the limits depend on the CONFIG_HZ option:
    HZ=1000: min=20000us  (20ms)
    HZ=250:  min=80000us  (80ms)
    HZ=100:  min=200000us (200ms)
The highest value of kernel and HW latency restrictions is shown and
used as the minimum sampling rate.
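
The rule can be sketched in Python as follows. The function name
min_sampling_rate_us is hypothetical. Two assumptions are made: the
transition latency is converted from ns to us before being multiplied by
100 (mirroring the default sampling_rate formula), and the HZ-dependent
limit is modelled as 20 timer ticks, which is inferred from the three
values listed above rather than stated in the text.

```python
def min_sampling_rate_us(transition_latency_ns, hz, no_hz_active):
    """Return the larger of the HW and kernel limits, in us."""
    # HW limit: 100 times the transition latency, expressed in us.
    hw_limit = transition_latency_ns * 100 // 1000
    if no_hz_active:
        kernel_limit = 10_000  # fixed 10 ms
    else:
        # Assumed: 20 ticks of 1/HZ seconds each, matching the table.
        kernel_limit = 20 * (1_000_000 // hz)
    return max(hw_limit, kernel_limit)
```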

up_threshold: defines what the average CPU usage between the samplings
of 'sampling_rate' needs to be for the kernel to make a decision on
whether it should increase the frequency. For example, when it is set
to its default value of '95', the CPU needs to be on average more than
95% in use between the checking intervals for the kernel to decide
that the CPU frequency needs to be increased.

ignore_nice_load: this parameter takes a value of '0' or '1'. When
set to '0' (its default), all processes are counted towards the
'cpu utilisation' value. When set to '1', the processes that are
run with a 'nice' value will not count (and thus be ignored) in the
overall usage calculation. This is useful if you are running a
CPU-intensive calculation on your laptop and do not care how long it
takes to complete: you can 'nice' it and prevent it from taking part
in the decision on whether to increase your CPU frequency.
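
How up_threshold and ignore_nice_load interact can be modelled with a
few lines of Python. This is a simplified sketch, not the kernel's
accounting code: the function name and the idea of passing busy and
nice time for one sampling window as plain numbers are assumptions made
for illustration.

```python
def should_raise_frequency(busy_us, nice_us, window_us,
                           up_threshold=95, ignore_nice_load=0):
    """Decide whether the load in one sampling window exceeds up_threshold.

    busy_us is all non-idle time in the window, including nice_us, the
    time spent in niced processes.
    """
    counted = busy_us - nice_us if ignore_nice_load else busy_us
    load_percent = 100 * counted / window_us
    return load_percent > up_threshold
```

With a 10000 us window that is 98% busy, of which half is niced work, the
frequency is raised when ignore_nice_load is 0 and left alone when it
is 1.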

sampling_down_factor: this parameter controls the rate at which the
kernel makes a decision on when to decrease the frequency while running
at top speed. When set to 1 (the default) decisions to reevaluate load
are made at the same interval regardless of current clock speed. But
when set to greater than 1 (e.g. 100) it acts as a multiplier for the
scheduling interval for reevaluating load when the CPU is at its top
speed due to high load. This improves performance by reducing the overhead
of load evaluation and helping the CPU stay at its top speed when truly
busy, rather than shifting back and forth in speed. This tunable has no
effect on behavior at lower speeds/lower CPU loads.

powersave_bias: this parameter takes a value between 0 and 1000. It
defines the percentage (times 10) value of the target frequency that
will be shaved off of the target. For example, when set to 100 (that
is, 10%) and the ondemand governor would have targeted 1000 MHz, it
will target 1000 MHz - (10% of 1000 MHz) = 900 MHz instead. This is
set to 0 (disabled) by default.
When the AMD frequency sensitivity powersave bias driver
(drivers/cpufreq/amd_freq_sensitivity.c) is loaded, this parameter
defines the workload frequency sensitivity threshold below which a lower
frequency is chosen instead of the ondemand governor's original target.
The frequency sensitivity is a hardware reported (on AMD Family 16h
Processors and above) value between 0 and 100% that tells software how
the performance of the workload running on a CPU will change when
frequency changes. A workload with sensitivity of 0% (memory/IO-bound)
will not perform any better on higher core frequency, whereas a
workload with sensitivity of 100% (CPU-bound) will perform better
the higher the frequency. When the driver is loaded, this is set to 400
by default -- for CPUs running workloads with sensitivity value below
40%, a lower frequency is chosen. Unloading the driver or writing 0
will disable this feature.
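
The generic powersave_bias arithmetic (without the AMD driver) can be
written out as a short Python sketch. The function name
apply_powersave_bias is hypothetical, frequencies are assumed to be in
kHz, and the sketch stops at the biased value without mapping it back
onto a frequency the hardware supports.

```python
def apply_powersave_bias(target_khz, powersave_bias=0):
    """Shave powersave_bias tenths of a percent off the target frequency."""
    if not 0 <= powersave_bias <= 1000:
        raise ValueError("powersave_bias must be between 0 and 1000")
    return target_khz - target_khz * powersave_bias // 1000
```

A bias of 100 turns a 1000000 kHz (1000 MHz) target into 900000 kHz,
matching the example in the text, and a bias of 0 leaves the target
unchanged.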


2.5 Conservative
----------------

The CPUfreq governor "conservative", much like the "ondemand"
governor, sets the CPU depending on the current usage. It differs in
behaviour in that it gracefully increases and decreases the CPU speed
rather than jumping to max speed the moment there is any load on the
CPU. This behaviour is more suitable in a battery powered environment.
The governor is tweaked in the same manner as the "ondemand" governor
through sysfs with the addition of:

freq_step: this describes the percentage step by which the cpu freq
is smoothly increased and decreased. By default the cpu frequency will
increase in 5% chunks of your maximum cpu frequency. You can change this
value to anywhere between 0 and 100 where '0' will effectively lock your
CPU at a speed regardless of its load whilst '100' will, in theory, make
it behave identically to the "ondemand" governor.

down_threshold: same as the 'up_threshold' found for the "ondemand"
governor but for the opposite direction. For example, when set to its
default value of '20' it means that the CPU usage needs to be below
20% between samples for the frequency to be decreased.

sampling_down_factor: similar functionality as in "ondemand" governor.
But in "conservative", it controls the rate at which the kernel makes
a decision on when to decrease the frequency while running at any
speed. Load for frequency increase is still evaluated at every
sampling interval.
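
The stepping behaviour described by freq_step and down_threshold can be
modelled as follows. This is a simplified sketch under stated
assumptions: the function name conservative_step is hypothetical,
up_threshold has to be supplied by the caller, and sampling_down_factor
is left out.

```python
def conservative_step(cur_khz, load_percent, policy_min, policy_max,
                      up_threshold, down_threshold=20, freq_step=5):
    """Return the next frequency after one sampling interval."""
    # One step is freq_step percent of the maximum frequency.
    step = policy_max * freq_step // 100
    if load_percent > up_threshold:
        return min(cur_khz + step, policy_max)
    if load_percent < down_threshold:
        return max(cur_khz - step, policy_min)
    return cur_khz
```

With a maximum of 2000000 kHz the default step is 100000 kHz. A
freq_step of 0 never changes the frequency, and a freq_step of 100
reaches the maximum in a single step.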


2.6 Interactive
---------------

The CPUfreq governor "interactive" is designed for latency-sensitive,
interactive workloads. This governor sets the CPU speed depending on
usage, similar to "ondemand" and "conservative" governors, but with a
different set of configurable behaviors.

The tunable values for this governor are:

above_hispeed_delay: When speed is at or above hispeed_freq, wait for
this long before raising speed in response to continued high load.
The format is a single delay value, optionally followed by pairs of
CPU speeds and the delay to use at or above those speeds. Colons can
be used between the speeds and associated delays for readability. For
example:

   80000 1300000:200000 1500000:40000

uses delay 80000 uS until CPU speed 1.3 GHz, at which speed delay
200000 uS is used until speed 1.5 GHz, at which speed (and above)
delay 40000 uS is used. If speeds are specified these must appear in
ascending order. Default is 20000 uS.
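
To make the format concrete, here is a small Python sketch that parses
such a string and looks up the value that applies at a given speed. The
function names are hypothetical and the parsing rules are only those
stated above: a leading default value, then speed/value pairs in
ascending speed order, with colons treated as optional separators.

```python
def parse_speed_table(spec):
    """Parse 'default [speed:value ...]' into (default, [(speed, value), ...])."""
    tokens = [int(t) for t in spec.replace(":", " ").split()]
    if not tokens:
        raise ValueError("empty specification")
    default, rest = tokens[0], tokens[1:]
    if len(rest) % 2:
        raise ValueError("speeds and values must come in pairs")
    pairs = list(zip(rest[0::2], rest[1::2]))
    speeds = [speed for speed, _ in pairs]
    if speeds != sorted(speeds):
        raise ValueError("speeds must appear in ascending order")
    return default, pairs


def lookup(spec, speed_khz):
    """Return the value that applies at speed_khz."""
    default, pairs = parse_speed_table(spec)
    value = default
    for speed, pair_value in pairs:
        if speed_khz >= speed:
            value = pair_value  # the last pair at or below speed_khz wins
    return value
```

For the example string, lookup() gives 80000 at 1200000 kHz, 200000 at
1300000 kHz and 40000 at 1500000 kHz. The target_loads tunable uses the
same layout, so the same sketch applies to it.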

boost: If non-zero, immediately boost speed of all CPUs to at least
hispeed_freq until zero is written to this attribute. If zero, allow
CPU speeds to drop below hispeed_freq according to load as usual.
Default is zero.

boostpulse: On each write, immediately boost speed of all CPUs to
hispeed_freq for at least the period of time specified by
boostpulse_duration, after which speeds are allowed to drop below
hispeed_freq according to load as usual. It is a write-only file.

boostpulse_duration: Length of time to hold CPU speed at hispeed_freq
on a write to boostpulse, before allowing speed to drop according to
load as usual. Default is 80000 uS.

go_hispeed_load: The CPU load at which to ramp to hispeed_freq.
Default is 99%.

hispeed_freq: An intermediate "high speed" at which to initially ramp
when CPU load hits the value specified in go_hispeed_load. If load
stays high for the amount of time specified in above_hispeed_delay,
then speed may be bumped higher. Default is the maximum speed allowed
by the policy at governor initialization time.

io_is_busy: If set, the governor accounts IO time as CPU busy time.

min_sample_time: The minimum amount of time to spend at the current
frequency before ramping down. Default is 80000 uS.

target_loads: CPU load values used to adjust speed to influence the
current CPU load toward that value. In general, the lower the target
load, the more often the governor will raise CPU speeds to bring load
below the target. The format is a single target load, optionally
followed by pairs of CPU speeds and CPU loads to target at or above
those speeds. Colons can be used between the speeds and associated
target loads for readability. For example:

   85 1000000:90 1700000:99

targets CPU load 85% below speed 1GHz, 90% at or above 1GHz, until
1.7GHz and above, at which load 99% is targeted. If speeds are
specified these must appear in ascending order. Higher target load
values are typically specified for higher speeds, that is, target load
values also usually appear in an ascending order. The default is
target load 90% for all speeds.

timer_rate: Sample rate for reevaluating CPU load when the CPU is not
idle. A deferrable timer is used, such that the CPU will not be woken
from idle to service this timer until something else needs to run.
(The maximum time to allow deferring this timer when not running at
minimum speed is configurable via timer_slack.) Default is 20000 uS.

timer_slack: Maximum additional time to defer handling the governor
sampling timer beyond timer_rate when running at speeds above the
minimum. For platforms that consume additional power at idle when
CPUs are running at speeds greater than minimum, this places an upper
bound on how long the timer will be deferred prior to re-evaluating
load and dropping speed. For example, if timer_rate is 20000uS and
timer_slack is 10000uS then timers will be deferred for up to 30msec
when not at lowest speed. A value of -1 means defer timers
indefinitely at all speeds. Default is 80000 uS.
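
The relationship between timer_rate and timer_slack reduces to a small
piece of arithmetic, sketched here in Python. The function name is
hypothetical, and returning None to mean "no upper bound" is a choice
made for this illustration only.

```python
def max_timer_deferral_us(timer_rate_us, timer_slack_us, at_min_speed):
    """Upper bound on how long the sampling timer may be deferred, in us.

    Returns None when the deferral is unbounded: at minimum speed, or
    when timer_slack is -1.
    """
    if at_min_speed or timer_slack_us == -1:
        return None
    return timer_rate_us + timer_slack_us
```

With timer_rate at 20000 uS and timer_slack at 10000 uS, the bound above
minimum speed is 30000 uS, that is, the 30 msec from the example.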


3. The Governor Interface in the CPUfreq Core
=============================================

A new governor must register itself with the CPUfreq core using
"cpufreq_register_governor". The struct cpufreq_governor, which has to
be passed to that function, must contain the following values:

governor->name      - A unique name for this governor
governor->governor  - The governor callback function
governor->owner     - THIS_MODULE for the governor module (if
                      appropriate)

The governor->governor callback is called with the current (or to-be-set)
cpufreq_policy struct for that CPU, and an unsigned int event. The
following events are currently defined:

CPUFREQ_GOV_START:   This governor shall start its duty for the CPU
                     policy->cpu
CPUFREQ_GOV_STOP:    This governor shall end its duty for the CPU
                     policy->cpu
CPUFREQ_GOV_LIMITS:  The limits for CPU policy->cpu have changed to
                     policy->min and policy->max.

If you need other "events" outside of your driver, _only_ use the
cpufreq_governor_l(unsigned int cpu, unsigned int event) call to the
CPUfreq core to ensure proper locking.
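
The shape of such a callback can be modelled in Python for a
"performance"-like governor. This is a userspace illustration only, not
kernel code: the GovEvent enum and the Policy class are hypothetical
stand-ins for the CPUFREQ_GOV_* events and struct cpufreq_policy, and
the enum values are arbitrary.

```python
from dataclasses import dataclass
from enum import Enum, auto


class GovEvent(Enum):
    START = auto()   # stands in for CPUFREQ_GOV_START
    STOP = auto()    # stands in for CPUFREQ_GOV_STOP
    LIMITS = auto()  # stands in for CPUFREQ_GOV_LIMITS


@dataclass
class Policy:
    cpu: int
    min: int  # kHz
    max: int  # kHz


def performance_like_governor(policy, event):
    """Return the frequency to request for this event, or None for none."""
    if event in (GovEvent.START, GovEvent.LIMITS):
        # Starting up, or the limits moved: go to the highest allowed speed.
        return policy.max
    if event is GovEvent.STOP:
        return None
    raise ValueError(f"unknown event: {event!r}")
```

A static governor like this one only needs to act when it starts and
when the limits change. A dynamic governor would additionally start and
stop its sampling on the START and STOP events.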


The CPUfreq governor may call the CPU processor driver using one of
these two functions:

int cpufreq_driver_target(struct cpufreq_policy *policy,
                          unsigned int target_freq,
                          unsigned int relation);

int __cpufreq_driver_target(struct cpufreq_policy *policy,
                            unsigned int target_freq,
                            unsigned int relation);

target_freq must be within policy->min and policy->max, of course.
What's the difference between these two functions? When your governor
is still in a direct code path of a call to governor->governor, the
per-CPU cpufreq lock is still held in the cpufreq core, and there's
no need to lock it again (in fact, this would cause a deadlock). So
use __cpufreq_driver_target only in these cases. In all other cases
(for example, when there's a "daemonized" function that wakes up
every second), use cpufreq_driver_target to take the per-CPU cpufreq
lock before the command is passed to the cpufreq processor driver.
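
The locking rule can be illustrated with a userspace Python model. None
of the names below are kernel APIs: policy_lock stands in for the
per-CPU cpufreq lock, driver_target_locked for __cpufreq_driver_target,
driver_target for cpufreq_driver_target, and core_invoke_governor for
the core calling governor->governor with the lock held.

```python
import threading

policy_lock = threading.Lock()  # non-reentrant, like the lock in the text
applied = []                    # records the frequencies handed to the "driver"


def driver_target_locked(target_khz):
    """Caller must already hold policy_lock."""
    if not policy_lock.locked():
        raise RuntimeError("policy_lock must be held by the caller")
    applied.append(target_khz)


def driver_target(target_khz):
    """Take policy_lock, then pass the request on."""
    with policy_lock:
        driver_target_locked(target_khz)


def core_invoke_governor(callback):
    """Model the core calling into the governor with the lock held."""
    with policy_lock:
        callback()
```

A callback run through core_invoke_governor() has to use
driver_target_locked(). Calling driver_target() from there would block
forever on the already-held lock, which is the deadlock the text warns
about. A separate worker that runs outside the callback path uses
driver_target() instead.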