Linux Kernel – Process Scheduling

Process Scheduling

What ‘s process scheduler?

A sub system of kernel that puts processes to work. It decides which process runs, when and for how long.
The scheduler divides the resource of processor time between the runnable processes on a system

If there are more runnable processes than processors, some processes will not be running at a given moment. These processes are waiting to run

What’s multitasking?

A multitasking OS is one that can simultaneously interleave execution of more than one process.
Linux can have many processes in memory but only one in a runnable state

What’s preemptive multitasking?

Linux implements preemptive multitasking.
In preemptive multitasking, the scheduler decides when a process is to cease running and a new process is to begin running.

What’s preemption?

The action of involuntarily suspending a running process is called preemption

What’s timeslice?

The time a process runs before it’s preempted is usually predetermined and it’s called timeslice of the process

What’s the benefits of preemptive multitasking?

Managing the timeslice enables the scheduler to make global scheduling decisions for the system. It also prevents any one process from monopolizing the processor.

What’s cooperative multitasking?

A process does not stop running until voluntary decides to do so.

What’s yielding?

The act of a process voluntarily suspending itself is called yielding

What’s the bad thing of it?

Cannot make global decision
A process never yields can burn down the system

What’s the scheduler in kernel 2.6?

CFS Completely Fair Scheduler

What’s Policy?

Policy is the behavior of the scheduler that determines what runs when.
The scheduling policy must attempt to satisfy two goals:
– fast process response time(low latency)
– Maximal system utilization (high throughput)

What’s I/O-Bound process?

Runnable for only short duration because it eventually blocks waiting on more I/O (keyboard input, network I/O or disk I/O)

What’s process-Bound process?

Run less frequently but for longer durations e.g. matlab

What’s process priority?

Linux implements two priority ranks:
– Nice value: from -20 to +19. Larger nice values correspond to a lower priority- you are being nice to the other processes. Process with lower nice value receive a larger proportion of the system processor. In linux it’s a control of the timeslice proportion

  • Real-time priority: 0-99 larger value has high priority. All real-time processes are at a higher priority compared with normal processes

How does timeslice work?

The timeslice is the numeric value that represents how long a task can run until it is preempted.
Linux CFS scheduler assigns process a proportion of the processor. The proportion is affected by the nice value

In Linux the preemption decision is a function of how much a proportion of the processor the newly runnable process has consumed. If it has consumed a smaller proportion of the processor than the currently executing process, it runs immediately, preempting the current process. If not , it is scheduled to run at a later time

What’s Scheduler Classes?

Linux scheduler use scheduler classes to enable different algorithms to scheduler different types of processes.

Scheduler classes have priority. The base scheduler class has the highest priority. The highest schedule class that has a runnable process wins and select the next process

Linux CFS is a scheduler class for normal processes

What’s the Cons of the traditional Unix scheduler?

  1. Context switching
  2. The nice value is a relative term and the difference between 0,1 and 18, 19 is quite large
  3. The absolute mapping between timeslice and nice value can cause issues
  4. ….

What’s Perfect multitasking?

If we have 2 processes, we would run both processes simultaneously for the same time, each at 50% power.

How’s the Fair scheduling works?

CFS calculates how long a process should run as a function of the total runnable processes.
The nice value works as a weight regarding the proportion of the processor.
Each process runs for a “timeslice” proportional to its weight divided by the total weight of all runnable threads.

CFS set a minimum duration called targeted latency as a basis.
Targeted latency is defined by the number of processes. Therefore, it’s a trade off between better interaction and expensive context switch, worse throughput.

Therefore, if we have infinite processes, the targeted latency will be 0.
So we set a floor of the targeted latency called minimum granularity:1 millisecond

CFS is not perfect when we have too many processes but it works good for normal situations

How we account the time via CFS?

Linux is using struct sched_entity to save related info
sched_entity is in the process descriptor

What’s the virtual runtime?

Actual runtime normalized by the number of runnable processes
It’s the ideal estimated runtime for a process on multi processors system.

It’s updated by update_curr() which is invoked periodically by system timer and also whenever a process becomes runnable and unrunnable

Vruntime is an accurate measure of the runtime of a given process

How’s the process selection work?

CFS picks the process with the smallest vruntime to run next.

CFS use a red-black tree to manage the list of runnable processes and find the process with the smallest vruntime.
rbtree( red-black tree) is a self-balancing binary search tree

The key of the rbtree is the vruntime. So we just need to pick up the leftmost node in the tree.
In linux we don’t need to traverse the tree( O(height of the tree) O(LogN) if the tree has n node and is balanced) to find the leftmost node because the value is cached by rb_leftmost.

If leftmost node is null, there are no runnable processes, then CFS schedules the idle task

How do we add process to the tree?

The process will be added when becominng runnable or firstly created by fork()

Implemented by enqueue_entity()

When do we remove process from the tree?
CFS removes processes from the tree when the process blocks or terminates

Now What’s the workflow of the scheduler?

The main entry point is the function scheduler() which call pick_next_task()
This function will go through each scheduler class, starting with the highest priority and select the highest priority process in the highest priority class

CFS is a scheduler class for normal processes.

Where do we store unrunnable processes?

Wait queue: a simple list of processes waiting for an event to occur
Be careful of the implementation because it can lead to race conditions or dead lock

What’s the context switching?

The switch from one runnable task to another is handled by context_switch()
This function is called by schedule()
It does 2 things:
– Call switch_mm() to switch the virtual memory mapping from the previous process to the new process
– Call switch_to() to switch the process state from the previous state to current process’s state

Kernel use a flag to decide when to enable conext switch. Usually it happens when preemption or wake up

The flag is in the thread _info

Linux kernel is SMP-safe. What’s SMP-safe?

SMP is an acronym for Symmetric Multi Processing (meaning multiple CPUs of the same kind – pretty much.)
It means thread safe.
Here it means that it’s safe for multiple processors to avoid race condition from hardware level.

How’s the kernel process preemption happens?

Thread_info holds a preempt_count
If preempt_count is>0 , it means other processor is using it therefore it’s locked
If it’s 0, it’s safe to preempt this process and reschedule

What’s the real-time scheduling policies?

Linux has 2 real-time policies:
– SCHED_FIFO
– SCHED_RR

Those policies ‘re handled by a special real-time scheduler (schedule class same as CFS)

What’s the detail of SCHED_FIFO?

It’s a simple first in first out algorithm without timeslice
1. A runnable SCHED_FIFO task is always scheduled over any normal tasks.
2. When it runnable, it continues to run until it blocks or yields the processor
3. Same priority tasks will run round-robin
4. No timeslice and can run indefinitely.
5. Only a higher priority SCHED_FIFO and SCHED_RR can preempt it

What’s the detail of SCHED_RR?

Similar to SCHED_FIFO except it has timeslice
Real-time round-robin algorithm

Linux Kernel – Process Management

Process Management

This is a reading notes regarding the Linux Kernel Development

What is process and what is thread?

Process: is a program in the midst of execution, also include a set of resources

Thread: the objects of activity within the process. Includes: a unique program counter, process stack, Set of process registers

The kernel schedules threads not process.

To linux, a thread is implemented as a special kind of process

Process provides two virtualizations
– Virtualized processor (schedule): it gives the process the illusion that it alone monopolizes the system.
– Virtual memory (memory): let the process allocate the manage memory as if it alone owned all the memory in the system.

Threads share the virtual memory abstraction whereas each receives its own virtualized processor

Process is an active program
Two and more processes can exist that are executing the same program, sharing various resources

How the process begins?

The existing process will create another process by calling fork()
The existing process is parent
The new process is child
The parent process will resume execution and the child process will start the execution both when the fork() return

The fork() system call return twice from kernel: one in parent process another one in the child process

What happens after creation?

The process will execute a new, different program. The exec() family of function calls creates new address space and loads a new program into it.

How does the process end?

A program exits via exit() system call. This call terminates the process and free resources.
The parent process can inquire about the status of a terminated child via wait4() system call, which enables a process to wait for the termination of a specific process.
When the process exists, it’s placed into a special zombie state that represents terminated processes until the parent calls wait() or waitpid()

Another name for a process is a task!

How does the kernel store all those processes?

The kernel stores list of processes in a circular doubly linked list called Task List

What’s process descriptor?

Each element in the task list is a process descriptor. Each process descriptor is of the type: struct task_struct which is defined in <linux/sched.h>
The process descriptor contains all the info about a specific process.

The task_struct contains the data e.g. open files, the process’s address space, pending signals, the process’s state and etc

How to allocate the process descriptor

The task_struct is allocated via slab allocator

With the process descriptor dynamically created by slab allocator, a new structure thread_info will be generated and live at the bottom of the stack (for stack that grows down) or top of the stack (for stack that grows up)

thread_info has a pointer to the task_struct

Each task’s thread_info is allocated a the end of its stack.

How to get the current working process descriptor?

Each process has a unique id called PID saved in each porcess descriptor

Pid_max defines the max number of pid. A value to estimated concurrency

In the system, we can use current macro to directly use the current process
In x86, current is calculated via masking out 13 least-significant bits of the stack pointer to obtain the thread_info

This calculation is done within current_thread_info()
To get the current task_struct: current_thread_info()->task

What’s process state?

The state field in the process descriptor present the current condition of the process
There are 5 different states:
– TASK_RUNNING: the process is runnable. It’s either currently running or wait for running in the queue. It’s the only state for process executing in user-space. In kernel-space, it presents that the process is actively running
– TASK_INTERRPTIBLE: the process is sleeping(blocked) and wait for some conditions to exists. When it exists, the kernel will set the process to TASK_RUNNING. The process can also be awaken by signals
– TASK_UNINTERPTIBLE: it’s the same state of TASK_INTERRPTIBLE except that it will not be awaken by signals
– _TASK_TRACED: the process is been traced by other processes e.g. debugger via ptrace
– _TASK_STOPPED: process is stopped nor is it eligible to run. This occurs if the task receives SIGSTOP, SIGSTP,SIGTTIN or SIGTTOU signal or any signal while it’s being debugged

How to change the process state?

Call set_task_state(task, state)
set_current_state is sync to above function

What’s process context?

For each process it will read the execuable code from the execuable files and execute the code in the progress’s address space. Normal program is executed in user-space. If its’s called by system call or trigger an exception, then the process enters in the kernel space
At this point, the kernel is said to be ‘executing on behalf of the process’ and is in process context

After exiting the kernel, the process resumes execution in user-space

What’s process family tree?

All processes are descendants of the init process, whose pid is one.
The kernel starts init in the last step of the boot process and init process reads the system initscripts and execute more programs

Each process has one parent but 0-more children. Process that has same parent are siblings.
The relation is saved in process descriptor, with a pointer called parent
Another list pointers call children.
It’s easy to iterate over all processes in the system because task list is a circular doubly linked list
But expensive time cost

How to create a process in details?

Two steps in unix: fork() and exec()
– Fork() create a child process that is a copy of the current task except pid and certain resources and statistics
PPID is parent’s pid
– exec() loads a new execuable into the address space and executes it

Linux implement fork() via clone() system call via do_fork() via copy_process()
1. It calls dup_task_struct() to create new kernel stack, thread_info and task_struct. At this point the child and parent descriptor are the same
2. Check pid is within the max limit
3. Child process differentiate itself from the parent process
4. Call copy_flags() to update flag of task_struct
5. Child state is set to TASK_UNINTERRUPTIBLE to ensure that it doesn’t yet run
6. Call alloc_pid() to create a new pid
7. Either duplicate or share resources: open files, filesystem information, signal handlers, process address space and namespace. Those are shared between threads in a process or unique and copied for different process
8. copy_process() cleanup and return the caller a pointer to new process

What’s Copy-on-Write? (Known as COW in other system)

Rather than duplicate the process address space, the parent and the child can share a single copy.
The data is marked that if it is written to , a duplicate is made and each process receives a unique copy. COW delays the copying of each page in the address space until it is actually written to.

What’s kernel threads?

It’s useful for kernel to perform some operation in the background. It’s done via kernel threads.
It’s standard process that exist solely in kernel-space.
Kernel thread doesn’t have an address space. They never context switch into user-space

How to terminate a process?

When a process terminates , the kernel release the resources owned by the process and notifies the parent process
The process calls exit() to terminate itself.
The work is handled by do_exit():
1. Set the PF_EXITING flag of the task_struct
2. Call del_timer_sync() to remove any kernel timers
3. If BSD accounting is enabled, then release it
4. Call exit_mm() to release mm_struct held by this process. If no other process is holding it
5. Call exit_files() and exit_fs() to decrement the usage count of objects related to file descriptors and filesystem data. If the usage count is 0, then the object will be destoryed (GC)
6. It sets the exit_code in the task_struct
7. Call exit_notify() to send signals to the parent, reparents children process to another thread in their thread group or init process. Then set the exit state in the exit_state in task_struct to EXIT_ZOMBIE
8. do_exit() call schedule() to switch to a new process

All objects associated with the task are freed. The only memeory it holds is itskernel stack the thread_info and task_Struct. After the parent retrieves the info or notifies the kernel of the received. The remaining memory will be released