Linux process priorities demystified

Simple questions often have not so simple answers. One example is the question, What priority does this process have? This question normally comes up when a customer asks us to explain choices made by the CPU scheduler.

To answer it we need to discuss further sub-questions.

  1. What thread of the process are you interested in?
  2. What scheduling policy has this thread?
  3. Who gave us the process priority we’re looking into?

Processes? Threads? Tasks?

Like most operating systems Linux supports threads, which can be seen as a distinct execution flow. In the simplest case we have a process with just one thread, the main thread. While a thread models the execution flow, a process models shared resources between threads. The most prominent resource is memory. Within a process all threads share the same memory. Hence, many threads that run in parallel can work on the same memory to solve a problem.

The Linux CPU scheduler does not operate on processes, in fact it schedules tasks. Within the kernel the smallest scheduling entity is a task, it can be anything that can run on a CPU. It might be a kernel thread, a userspace thread, a threaded interrupt, etc. So don’t get confused, for simplicity we can assume that threads and tasks are the same for now.

Since the scheduler works on tasks, it ensures that each task (and therefore, thread) has its own priority. To learn about the priority one must inspect all threads of a process. By default most Linux tools just show processes, the priorities shown there are the priorities of the main thread. Many offer a way to show threads too. For example in the top utility by pressing H.

Now we are able to answer the first sub-question. We now know that process priorities actually apply per thread, so we need to inspect all threads of a process. If the process in question has just one thread or we’re only interested in a specific thread, things become easy again.

Scheduling policy

POSIX requires that an operating system offers different scheduling policies, also sometimes called scheduling classes. By default all tasks in Linux use the default schedulung policy, called SCHED_NORMAL or SCHED_OTHER.

The scheduling algorithm behind SCHED_OTHER is not defined by POSIX and is implementation specific. Linux makes sure that all tasks within this scheduling policy are treated fairly. This means no task will starve and from the user’s point of view it seems like all programs are running all time, nothing will feel sluggish but heavy workloads still get stuff done in a timely manner. In this policy every task has a priority, it is called nice value. The nice value denotes how nice a task is to other tasks. The value ranges from 19 to -20. Where -20 means that a task is not nice at all and wants to run as often as possible, 19 signifies it is very nice to other tasks and is fine when the scheduler picks other tasks much more often. Default is 0, meaning the task is neither selfish nor too devoted. The current algorithm behind SCHED_OTHER is called Completely Fair Scheduling (CFS). CFS offers two more policies: SCHED_BATCH and SCHED_IDLE. They make sure that a thread either runs when possible in one batch, thus interactivity does not matter, or when SCHED_IDLE is selected that the thread shall run when no other task can run, so the very least priority. Please note that SCHED_IDLE has nothing to do with the idle thread. The idle thread is scheduled when absolutely nothing runable was found by the CPU scheduler. When SCHED_BATCH or SCHED_IDLE are selected CFS’s priority have no meanining, although some tools report a number.

Linux also implements the POSIX Realtime Extensions. Using these extensions time critical programs can run on Linux that have to meet strict timing constraints. It offers two policies: SCHED_FIFO and SCHED_RR. These policies allow deterministic scheduling between threads. Both policies have a priority between 1 and 99, where 1 is the lowest. The scheduler will pick a runnable task depending on the priority.

Linux offers another scheduling policy, SCHED_DEADLINE. This policy does not allow the setting of a priority, so it’s irrelevant here. As opposed to a priority it allows setting a deadline by which a given work would be completed. So the CPU scheduler has to make sure that the task runs for long enough before the deadline.

One might wonder how all these policies relate to each other since they can be mixed on the same system. The Linux CPU scheduler consists of five decision algorithms that select tasks, they get executed in the following order.

Decision algorithms for task selection in Linux CPU scheduler and their order

  1. Stop-Task: This algorithm can only be used by one internal kernel thread to shut down a given CPU. The sole purpose is to preempt the tasks of a given CPU before it can be shut down.

  2. Deadline: It implements the user visible SCHED_DEADLINE policy. If a SCHED_DEADLINE thread is present on the system it will run before all other user visible threads, always.

  3. Realtime: SCHED_FIFO and SCHED_RR policies are implemented here. So realtime threads always run before regular tasks and right after SCHED_DEADLINE threads, if present.

  4. Fair (best known as CFS): CFS implements SCHED_OTHER, SCHED_BATCH and SCHED_IDLE. Most threads present on a system are handled by this algorithm.

  5. Idle: If no algorithm finds a runnable thread the idle algorithm schedules the idle task on the CPU. The idle task is always runnable and puts the CPU in idle mode.

With this knowledge of scheduling policies we can answer the second sub-question. SCHED_OTHER has a priority between 19 and -20, where 19 is the lowest and -20 the highest. SCHED_FIFO and SCHED_RR threads use a priority between 1 and 99, where 1 is the lowest. Internally SCHED_IDLE and SCHED_BATCH tasks have a nice value too, but the scheduling algorithm does not honour it. Don’t get confused when a tool reports a nice value for SCHED_BATCH or SCHED_IDLE.

The ps tool is powerful and allows us to query this information. For example: ps axHo pid,lwp,comm,policy,nice,rtprio.

So we ask ps for a full list of all threads on the system, as well as their process id, thread id, short name, scheduling policy, nice value and realtime-priority.


Dealing with different priority representations

So far we saw that priorities of type nice range from 19 to -20 and realtime from 1 to 99. These are the ranges you can always expect from process tools like ps.

Things get complicated when raw kernel interfaces are used or kernel internal data is analyzed. You’ll face such situations when developing your own tools or using Linux’s kernel instrumentation facilities.

getpriority() system call

A prominent example is the getpriority() system call. It returns the nice value of a thread. If used in C, it will return an integer between -20 and 19 which is the expected nice value, but only if the C library performs a fixup. The raw kernel system call will return a shifted value within the range 1 to 40. The reason behind the shift is Linux’s system call ABI. Negative return values represent error numbers. So if your application uses system calls directly without a libc, this might be confusing.

shifted value nice value
1 -20
.. ..
21 0
.. ..
40 19

Using this system call makes only sense if the current scheduling policy is SCHED_OTHER. For all other policies the returned priority is meaningless.

Unified priorities

When you work with kernel instrumentation tools such as ftrace you will encounter the deep internals of Linux. Developers commonly use Linux’s sched_switch trace point to see which task is picked by the CPU scheduler. ftrace will report a line like this one:

[002] d... 565635.368831: sched_switch: prev_comm=kworker/u8:2 prev_pid=891 prev_prio=120 prev_state=R+ ==> next_comm=kworker/u9:4 next_pid=983 next_prio=100

We observe that the scheduler on CPU 2 schedules from PID 891 to PID 983. Please note that the PID in this context is not a process id as userspace would expect - it is in fact a thread id. Within the kernel, PID is the thread id. What userspace gets via the getpid() system call is actually the thread group id.

The alert reader will notice that this trace point has more than just the PID, it also shows the priorities of both tasks. However, they don’t make sense: prev_prio is 120 and next_prio is 100. Neither are in the range of a nice level or a real time priority. Additionally, the trace point does not indicate what scheduling policy is being used.

To decipher these values we need to know that Linux internally uses a unified priority among all scheduling policies.

It ranges from -1 to 139, where -1 is the highest and 139 the lowest priority. The unified priority is calcualted as table shows:

policy specific priority unified priority
RR/FIFO 99 0
RR/FIFO 98 1
.. .. ..
RR/FIFO 1 98
OTHER -20 100
.. .. ..
OTHER 0 120
.. .. ..
OTHER 19 139

With this information we can translate prev_prio to a nice value of 0, and next_prio to a nice value of -20.

A common pitfall when working with real time applications is to confuse real time priorities from kernel internals and reported by user space tools. Both are in the same range but reversed.

Attentive readers will notice that unified priority 99 is not claimed by any policy. This is an inconsistency within the Linux scheduling code and as of writing (v5.17-rc3) still present.

proc filesystem

The proc filesystem is the source of all process information tools such as ps display. For each thread it offers a stat file which contains, beside many other statisticts, all stats about the scheduling policy.

To learn the priority of a task, the 18th and 19th field of /proc/<pid>/task/<tid>/stat is of interest. If a task has the scheduling policy SCHED_OTHER the 19th field is non-zero and exhibits the nice value.

The 18th field is more generic and exposes the unified priority of a task, but shifted by -100.

shifted value unified priority
-101 -1
-100 0
.. ..
-51 49
.. ..
39 139

Executive summary

  • CPU scheduler decisions affect threads, not processes.
  • Each thread has a scheduling class, either SCHED_DEADLINE, SCHED_OTHER, SCHED_BATCH, SCHED_IDLE, SCHED_FIFO or SCHED_RR.
  • Scheduling classes SCHED_OTHER, SCHED_FIFO and SCHED_RR allow priorities.
  • SCHED_OTHER priority is the so-called nice value, ranging from -20 to 19.
  • SCHED_FIFO and SCHED_RR support priorities between 1 and 99.
  • Raw kernel interfaces and instrumentation facilities expose a unified priority, often in a normalised/shifted way. Here be dragons.