Linux process priorities demystified
Simple questions often have not so simple answers. One example is the question, What priority does this process have? This question normally comes up when a customer asks us to explain choices made by the CPU scheduler.
To answer it we need to discuss further sub-questions.
- What thread of the process are you interested in?
- What scheduling policy has this thread?
- Who gave us the process priority we’re looking into?
Processes? Threads? Tasks?
Like most operating systems Linux supports threads, which can be seen as a distinct execution flow. In the simplest case we have a process with just one thread, the main thread. While a thread models the execution flow, a process models shared resources between threads. The most prominent resource is memory. Within a process all threads share the same memory. Hence, many threads that run in parallel can work on the same memory to solve a problem.
The Linux CPU scheduler does not operate on processes, in fact it schedules tasks. Within the kernel the smallest scheduling entity is a task, it can be anything that can run on a CPU. It might be a kernel thread, a userspace thread, a threaded interrupt, etc. So don’t get confused, for simplicity we can assume that threads and tasks are the same for now.
Since the scheduler works on tasks, it ensures that each task (and therefore,
thread) has its own priority.
To learn about the priority one must inspect all threads of a process.
By default most Linux tools just show processes, the priorities shown there
are the priorities of the main thread. Many offer a way to show threads
too. For example in the top utility by pressing
Now we are able to answer the first sub-question. We now know that process priorities actually apply per thread, so we need to inspect all threads of a process. If the process in question has just one thread or we’re only interested in a specific thread, things become easy again.
POSIX requires that an operating system offers different scheduling policies, also
sometimes called scheduling classes.
By default all tasks in Linux use the default schedulung policy, called
The scheduling algorithm behind
SCHED_OTHER is not defined by POSIX and is implementation
specific. Linux makes sure that all tasks within this scheduling policy are treated
fairly. This means no task will starve and from the user’s point of view it seems like
all programs are running all time, nothing will feel sluggish but heavy workloads
still get stuff done in a timely manner.
In this policy every task has a priority, it is called nice value.
The nice value denotes how nice a task is to other tasks. The value ranges from
-20 means that a task is not nice at all and wants to run as often as possible,
19 signifies it is very nice to other tasks and is fine when the scheduler picks
other tasks much more often.
0, meaning the task is neither selfish nor too devoted.
The current algorithm behind
SCHED_OTHER is called Completely Fair Scheduling (CFS).
CFS offers two more policies:
SCHED_IDLE. They make sure that a
thread either runs when possible in one batch, thus
interactivity does not matter, or when
SCHED_IDLE is selected that the thread shall
run when no other task can run, so the very least priority.
Please note that
SCHED_IDLE has nothing to do with the idle thread.
The idle thread is scheduled when absolutely nothing runable was found by the CPU
SCHED_IDLE are selected CFS’s priority have no meanining,
although some tools report a number.
Linux also implements the POSIX Realtime Extensions. Using these extensions time
critical programs can run on Linux that have to meet strict timing constraints.
It offers two policies:
These policies allow deterministic scheduling between threads.
Both policies have a priority between
1 is the lowest.
The scheduler will pick a runnable task depending on the priority.
Linux offers another scheduling policy,
SCHED_DEADLINE. This policy does not
allow the setting of a priority, so it’s irrelevant here.
As opposed to a priority it allows setting a deadline by which a given work
would be completed. So the CPU scheduler has to make sure that the task
runs for long enough before the deadline.
One might wonder how all these policies relate to each other since they can be mixed on the same system. The Linux CPU scheduler consists of five decision algorithms that select tasks, they get executed in the following order.
Decision algorithms for task selection in Linux CPU scheduler and their order
Stop-Task: This algorithm can only be used by one internal kernel thread to shut down a given CPU. The sole purpose is to preempt the tasks of a given CPU before it can be shut down.
Deadline: It implements the user visible
SCHED_DEADLINEpolicy. If a
SCHED_DEADLINEthread is present on the system it will run before all other user visible threads, always.
SCHED_RRpolicies are implemented here. So realtime threads always run before regular tasks and right after
SCHED_DEADLINEthreads, if present.
Fair (best known as CFS): CFS implements
SCHED_IDLE. Most threads present on a system are handled by this algorithm.
Idle: If no algorithm finds a runnable thread the idle algorithm schedules the idle task on the CPU. The idle task is always runnable and puts the CPU in idle mode.
With this knowledge of scheduling policies we can answer the second sub-question.
SCHED_OTHER has a priority between
19 is the lowest and
-20 the highest.
SCHED_RR threads use a priority between
is the lowest.
SCHED_BATCH tasks have a nice value too, but the scheduling algorithm does
not honour it. Don’t get confused when a tool reports a nice value for
The ps tool is powerful and allows us to query this information.
ps axHo pid,lwp,comm,policy,nice,rtprio.
So we ask ps for a full list of all threads on the system, as well as their process id, thread id, short name, scheduling policy, nice value and realtime-priority.
SCHED_DEADLINE as DLN,
SCHED_OTHER as TS,
SCHED_BATCH as B,
SCHED_IDLE as IDL,
SCHED_FIFO as FF and
SCHED_RR as RR.
Dealing with different priority representations
So far we saw that priorities of type nice range from
-20 and realtime from
These are the ranges you can always expect from process tools like ps.
Things get complicated when raw kernel interfaces are used or kernel internal data is analyzed. You’ll face such situations when developing your own tools or using Linux’s kernel instrumentation facilities.
getpriority() system call
A prominent example is the
getpriority() system call. It returns the nice value of a thread.
If used in C, it will return an integer between
19 which is the expected nice value,
but only if the C library performs a fixup.
The raw kernel system call will return a shifted value within the range
The reason behind the shift is Linux’s system call ABI. Negative return values
represent error numbers.
So if your application uses system calls directly without a libc, this might be confusing.
|shifted value||nice value|
Using this system call makes only sense if the current scheduling policy is
For all other policies the returned priority is meaningless.
When you work with kernel instrumentation tools such as ftrace you will encounter the
deep internals of Linux.
Developers commonly use Linux’s
sched_switch trace point to see which task is picked
by the CPU scheduler.
ftrace will report a line like this one:
 d... 565635.368831: sched_switch: prev_comm=kworker/u8:2 prev_pid=891 prev_prio=120 prev_state=R+ ==> next_comm=kworker/u9:4 next_pid=983 next_prio=100
We observe that the scheduler on CPU 2 schedules from PID 891 to PID 983.
Please note that the PID in this context is not a process id as userspace would expect - it is in fact a thread id. Within the kernel, PID is the thread id.
What userspace gets via the
getpid() system call is actually the thread group id.
The alert reader will notice that this trace point has more than just the PID, it also shows the priorities of both tasks. However, they don’t make sense:
100. Neither are in the range of a nice level or a real time priority. Additionally, the trace point does not indicate what scheduling policy is being used.
To decipher these values we need to know that Linux internally uses a unified priority among all scheduling policies.
It ranges from
-1 is the highest and
139 the lowest priority.
The unified priority is calcualted as table shows:
|policy||specific priority||unified priority|
With this information we can translate
prev_prio to a nice value of
next_prio to a nice value of
A common pitfall when working with real time applications is to confuse real time priorities from kernel internals and reported by user space tools. Both are in the same range but reversed.
Attentive readers will notice that unified priority
99 is not claimed by
any policy. This is an inconsistency within the Linux scheduling code and as of
writing (v5.17-rc3) still present.
The proc filesystem is the source of all process information tools such as ps display. For each thread it offers a stat file which contains, beside many other statisticts, all stats about the scheduling policy.
To learn the priority of a task, the 18th and 19th field of
If a task has the scheduling policy
SCHED_OTHER the 19th field is non-zero and exhibits
the nice value.
The 18th field is more generic and exposes the unified priority of a task,
but shifted by
|shifted value||unified priority|
- CPU scheduler decisions affect threads, not processes.
- Each thread has a scheduling class, either
- Scheduling classes
SCHED_OTHERpriority is the so-called nice value, ranging from
SCHED_RRsupport priorities between
- Raw kernel interfaces and instrumentation facilities expose a unified priority, often in a normalised/shifted way. Here be dragons.