Linux Kernel – Process Scheduling

Process Scheduling

What ‘s process scheduler?

A sub system of kernel that puts processes to work. It decides which process runs, when and for how long.
The scheduler divides the resource of processor time between the runnable processes on a system

If there are more runnable processes than processors, some processes will not be running at a given moment. These processes are waiting to run

What’s multitasking?

A multitasking OS is one that can simultaneously interleave execution of more than one process.
Linux can have many processes in memory but only one in a runnable state

What’s preemptive multitasking?

Linux implements preemptive multitasking.
In preemptive multitasking, the scheduler decides when a process is to cease running and a new process is to begin running.

What’s preemption?

The action of involuntarily suspending a running process is called preemption

What’s timeslice?

The time a process runs before it’s preempted is usually predetermined and it’s called timeslice of the process

What’s the benefits of preemptive multitasking?

Managing the timeslice enables the scheduler to make global scheduling decisions for the system. It also prevents any one process from monopolizing the processor.

What’s cooperative multitasking?

A process does not stop running until voluntary decides to do so.

What’s yielding?

The act of a process voluntarily suspending itself is called yielding

What’s the bad thing of it?

Cannot make global decision
A process never yields can burn down the system

What’s the scheduler in kernel 2.6?

CFS Completely Fair Scheduler

What’s Policy?

Policy is the behavior of the scheduler that determines what runs when.
The scheduling policy must attempt to satisfy two goals:
– fast process response time(low latency)
– Maximal system utilization (high throughput)

What’s I/O-Bound process?

Runnable for only short duration because it eventually blocks waiting on more I/O (keyboard input, network I/O or disk I/O)

What’s process-Bound process?

Run less frequently but for longer durations e.g. matlab

What’s process priority?

Linux implements two priority ranks:
– Nice value: from -20 to +19. Larger nice values correspond to a lower priority- you are being nice to the other processes. Process with lower nice value receive a larger proportion of the system processor. In linux it’s a control of the timeslice proportion

  • Real-time priority: 0-99 larger value has high priority. All real-time processes are at a higher priority compared with normal processes

How does timeslice work?

The timeslice is the numeric value that represents how long a task can run until it is preempted.
Linux CFS scheduler assigns process a proportion of the processor. The proportion is affected by the nice value

In Linux the preemption decision is a function of how much a proportion of the processor the newly runnable process has consumed. If it has consumed a smaller proportion of the processor than the currently executing process, it runs immediately, preempting the current process. If not , it is scheduled to run at a later time

What’s Scheduler Classes?

Linux scheduler use scheduler classes to enable different algorithms to scheduler different types of processes.

Scheduler classes have priority. The base scheduler class has the highest priority. The highest schedule class that has a runnable process wins and select the next process

Linux CFS is a scheduler class for normal processes

What’s the Cons of the traditional Unix scheduler?

  1. Context switching
  2. The nice value is a relative term and the difference between 0,1 and 18, 19 is quite large
  3. The absolute mapping between timeslice and nice value can cause issues
  4. ….

What’s Perfect multitasking?

If we have 2 processes, we would run both processes simultaneously for the same time, each at 50% power.

How’s the Fair scheduling works?

CFS calculates how long a process should run as a function of the total runnable processes.
The nice value works as a weight regarding the proportion of the processor.
Each process runs for a “timeslice” proportional to its weight divided by the total weight of all runnable threads.

CFS set a minimum duration called targeted latency as a basis.
Targeted latency is defined by the number of processes. Therefore, it’s a trade off between better interaction and expensive context switch, worse throughput.

Therefore, if we have infinite processes, the targeted latency will be 0.
So we set a floor of the targeted latency called minimum granularity:1 millisecond

CFS is not perfect when we have too many processes but it works good for normal situations

How we account the time via CFS?

Linux is using struct sched_entity to save related info
sched_entity is in the process descriptor

What’s the virtual runtime?

Actual runtime normalized by the number of runnable processes
It’s the ideal estimated runtime for a process on multi processors system.

It’s updated by update_curr() which is invoked periodically by system timer and also whenever a process becomes runnable and unrunnable

Vruntime is an accurate measure of the runtime of a given process

How’s the process selection work?

CFS picks the process with the smallest vruntime to run next.

CFS use a red-black tree to manage the list of runnable processes and find the process with the smallest vruntime.
rbtree( red-black tree) is a self-balancing binary search tree

The key of the rbtree is the vruntime. So we just need to pick up the leftmost node in the tree.
In linux we don’t need to traverse the tree( O(height of the tree) O(LogN) if the tree has n node and is balanced) to find the leftmost node because the value is cached by rb_leftmost.

If leftmost node is null, there are no runnable processes, then CFS schedules the idle task

How do we add process to the tree?

The process will be added when becominng runnable or firstly created by fork()

Implemented by enqueue_entity()

When do we remove process from the tree?
CFS removes processes from the tree when the process blocks or terminates

Now What’s the workflow of the scheduler?

The main entry point is the function scheduler() which call pick_next_task()
This function will go through each scheduler class, starting with the highest priority and select the highest priority process in the highest priority class

CFS is a scheduler class for normal processes.

Where do we store unrunnable processes?

Wait queue: a simple list of processes waiting for an event to occur
Be careful of the implementation because it can lead to race conditions or dead lock

What’s the context switching?

The switch from one runnable task to another is handled by context_switch()
This function is called by schedule()
It does 2 things:
– Call switch_mm() to switch the virtual memory mapping from the previous process to the new process
– Call switch_to() to switch the process state from the previous state to current process’s state

Kernel use a flag to decide when to enable conext switch. Usually it happens when preemption or wake up

The flag is in the thread _info

Linux kernel is SMP-safe. What’s SMP-safe?

SMP is an acronym for Symmetric Multi Processing (meaning multiple CPUs of the same kind – pretty much.)
It means thread safe.
Here it means that it’s safe for multiple processors to avoid race condition from hardware level.

How’s the kernel process preemption happens?

Thread_info holds a preempt_count
If preempt_count is>0 , it means other processor is using it therefore it’s locked
If it’s 0, it’s safe to preempt this process and reschedule

What’s the real-time scheduling policies?

Linux has 2 real-time policies:

Those policies ‘re handled by a special real-time scheduler (schedule class same as CFS)

What’s the detail of SCHED_FIFO?

It’s a simple first in first out algorithm without timeslice
1. A runnable SCHED_FIFO task is always scheduled over any normal tasks.
2. When it runnable, it continues to run until it blocks or yields the processor
3. Same priority tasks will run round-robin
4. No timeslice and can run indefinitely.
5. Only a higher priority SCHED_FIFO and SCHED_RR can preempt it

What’s the detail of SCHED_RR?

Similar to SCHED_FIFO except it has timeslice
Real-time round-robin algorithm

Linux Kernel – Process Management

Process Management

This is a reading notes regarding the Linux Kernel Development

What is process and what is thread?

Process: is a program in the midst of execution, also include a set of resources

Thread: the objects of activity within the process. Includes: a unique program counter, process stack, Set of process registers

The kernel schedules threads not process.

To linux, a thread is implemented as a special kind of process

Process provides two virtualizations
– Virtualized processor (schedule): it gives the process the illusion that it alone monopolizes the system.
– Virtual memory (memory): let the process allocate the manage memory as if it alone owned all the memory in the system.

Threads share the virtual memory abstraction whereas each receives its own virtualized processor

Process is an active program
Two and more processes can exist that are executing the same program, sharing various resources

How the process begins?

The existing process will create another process by calling fork()
The existing process is parent
The new process is child
The parent process will resume execution and the child process will start the execution both when the fork() return

The fork() system call return twice from kernel: one in parent process another one in the child process

What happens after creation?

The process will execute a new, different program. The exec() family of function calls creates new address space and loads a new program into it.

How does the process end?

A program exits via exit() system call. This call terminates the process and free resources.
The parent process can inquire about the status of a terminated child via wait4() system call, which enables a process to wait for the termination of a specific process.
When the process exists, it’s placed into a special zombie state that represents terminated processes until the parent calls wait() or waitpid()

Another name for a process is a task!

How does the kernel store all those processes?

The kernel stores list of processes in a circular doubly linked list called Task List

What’s process descriptor?

Each element in the task list is a process descriptor. Each process descriptor is of the type: struct task_struct which is defined in <linux/sched.h>
The process descriptor contains all the info about a specific process.

The task_struct contains the data e.g. open files, the process’s address space, pending signals, the process’s state and etc

How to allocate the process descriptor

The task_struct is allocated via slab allocator

With the process descriptor dynamically created by slab allocator, a new structure thread_info will be generated and live at the bottom of the stack (for stack that grows down) or top of the stack (for stack that grows up)

thread_info has a pointer to the task_struct

Each task’s thread_info is allocated a the end of its stack.

How to get the current working process descriptor?

Each process has a unique id called PID saved in each porcess descriptor

Pid_max defines the max number of pid. A value to estimated concurrency

In the system, we can use current macro to directly use the current process
In x86, current is calculated via masking out 13 least-significant bits of the stack pointer to obtain the thread_info

This calculation is done within current_thread_info()
To get the current task_struct: current_thread_info()->task

What’s process state?

The state field in the process descriptor present the current condition of the process
There are 5 different states:
– TASK_RUNNING: the process is runnable. It’s either currently running or wait for running in the queue. It’s the only state for process executing in user-space. In kernel-space, it presents that the process is actively running
– TASK_INTERRPTIBLE: the process is sleeping(blocked) and wait for some conditions to exists. When it exists, the kernel will set the process to TASK_RUNNING. The process can also be awaken by signals
– TASK_UNINTERPTIBLE: it’s the same state of TASK_INTERRPTIBLE except that it will not be awaken by signals
– _TASK_TRACED: the process is been traced by other processes e.g. debugger via ptrace
– _TASK_STOPPED: process is stopped nor is it eligible to run. This occurs if the task receives SIGSTOP, SIGSTP,SIGTTIN or SIGTTOU signal or any signal while it’s being debugged

How to change the process state?

Call set_task_state(task, state)
set_current_state is sync to above function

What’s process context?

For each process it will read the execuable code from the execuable files and execute the code in the progress’s address space. Normal program is executed in user-space. If its’s called by system call or trigger an exception, then the process enters in the kernel space
At this point, the kernel is said to be ‘executing on behalf of the process’ and is in process context

After exiting the kernel, the process resumes execution in user-space

What’s process family tree?

All processes are descendants of the init process, whose pid is one.
The kernel starts init in the last step of the boot process and init process reads the system initscripts and execute more programs

Each process has one parent but 0-more children. Process that has same parent are siblings.
The relation is saved in process descriptor, with a pointer called parent
Another list pointers call children.
It’s easy to iterate over all processes in the system because task list is a circular doubly linked list
But expensive time cost

How to create a process in details?

Two steps in unix: fork() and exec()
– Fork() create a child process that is a copy of the current task except pid and certain resources and statistics
PPID is parent’s pid
– exec() loads a new execuable into the address space and executes it

Linux implement fork() via clone() system call via do_fork() via copy_process()
1. It calls dup_task_struct() to create new kernel stack, thread_info and task_struct. At this point the child and parent descriptor are the same
2. Check pid is within the max limit
3. Child process differentiate itself from the parent process
4. Call copy_flags() to update flag of task_struct
5. Child state is set to TASK_UNINTERRUPTIBLE to ensure that it doesn’t yet run
6. Call alloc_pid() to create a new pid
7. Either duplicate or share resources: open files, filesystem information, signal handlers, process address space and namespace. Those are shared between threads in a process or unique and copied for different process
8. copy_process() cleanup and return the caller a pointer to new process

What’s Copy-on-Write? (Known as COW in other system)

Rather than duplicate the process address space, the parent and the child can share a single copy.
The data is marked that if it is written to , a duplicate is made and each process receives a unique copy. COW delays the copying of each page in the address space until it is actually written to.

What’s kernel threads?

It’s useful for kernel to perform some operation in the background. It’s done via kernel threads.
It’s standard process that exist solely in kernel-space.
Kernel thread doesn’t have an address space. They never context switch into user-space

How to terminate a process?

When a process terminates , the kernel release the resources owned by the process and notifies the parent process
The process calls exit() to terminate itself.
The work is handled by do_exit():
1. Set the PF_EXITING flag of the task_struct
2. Call del_timer_sync() to remove any kernel timers
3. If BSD accounting is enabled, then release it
4. Call exit_mm() to release mm_struct held by this process. If no other process is holding it
5. Call exit_files() and exit_fs() to decrement the usage count of objects related to file descriptors and filesystem data. If the usage count is 0, then the object will be destoryed (GC)
6. It sets the exit_code in the task_struct
7. Call exit_notify() to send signals to the parent, reparents children process to another thread in their thread group or init process. Then set the exit state in the exit_state in task_struct to EXIT_ZOMBIE
8. do_exit() call schedule() to switch to a new process

All objects associated with the task are freed. The only memeory it holds is itskernel stack the thread_info and task_Struct. After the parent retrieves the info or notifies the kernel of the received. The remaining memory will be released

Reading Note – Cache and Downgrade solution for distribution system


there is a good explanation about multi-level cache

  • (L1) Level 1 Cache(2KB – 64KB) – Instructions are first searched in this cache. L1 cache very small in comparison to others, thus making it faster than the rest.
  • (L2) Level 2 Cache(256KB – 512KB) – If the instructions are not present in the L1 cache then it looks in the L2 cache, which is a slightly larger pool of cache, thus accompanied by some latency.
  • (L3) Level 3 Cache (1MB -8MB) – With each cache miss, it proceeds to the next level cache. This is the largest among the all the cache, even though it is slower, it’s still faster than the RAM.

Downgrade solution for distribution system


In order to make sure the main service is still available during extreme situations such as high pressure, server error, or breaking workflow, we can design the downgrade service. The downgrade service will only provide limited services.

The server can automatically downgrade the service regarding key variables. We can also implement a manual solution to switch on/off downgrade service.

Some services can not be downgraded.


  • automatically or manually
  • read service or write service
  • multiple level or single level


  • normal: service is not stable because of upgrade and internet. Use automatical solution
  • warning: the stability range from 95% to 100%。 Use automatical solution or manual solution
  • error: availability less than 90%. It may be caused by DB connection pool is full or request is too large. Use automatical solution or manual solution
  • serious error: manual solution


Reading Note – Interface development for getting products distributed services

Cache Design

  • The product data will be saved to cache (redis)
  • Provide the interface for refreshing interface
  • During the refreshing of redis, the front end should still have data
  • Data cache should have a version number. Every time the cache update should generate a new cache key
  • Save the cache key to the list which can filter the latest version by current time

API Design

  • interface parameter: version and page index
  • response data: data and current version
  • use sorted set to get the latest version number
  • the first time request will response the current latest version number
  • the client will cache the version number. The next page request will need the cached version number
  • return 304 if the client request the first page and local version number equal to the current version number. So the client can avoid re-rendering the page
  • if no version number within the request, Use the latest one


  • the infinite increase of version number: the version number will keep increasing. So we need to delete the version number of last day during the update. Use sorted set to filter the version number
  • when getting the latest version, capturing the version set for the last hour. Then use a filter to get the latest one. When getting the data for the product, find the version number which contains the data if the product does not exist in current version
  • SLB: Server Load Balancing


system design


Factors need to be considered:

  • The performance of the interface
  • The stability of the interface
  • Error Tolerance
  • Server side pressure: reduce the pressure of the server side
  • Service downgrade: during high pressure

System Design Practice (2)

Outline use cases, constraints, and assumptions

We need to ask following questions:

  • Who is going to use it?
  • How are they going to use it?
  • How money users are there?
  • What does the system do?
  • What are the inputs and outputs of the system?
  • How much data do we expect to handle?
  • How many requests per second do we expect?
  • What is the expected read to write ratio?


Sometimes we need to do back-of-the-envelope calculation

Back-of-the-envelope calculations are estimates you create using a combination of thought experiments and common performance numbers to a get a good feel for which designs will meet your requirements.

Common rules:

  • Memory is fast and disk is slow (yea obviously)
  • Writes are 40 times more expensive than reads (that’s why we may need to separate write db and read db)

Powers of two table

Power ExactValue ApproximateVale Bytes 2^7 128
2^8 256 2^10 1024 1k 1KB 2^16 65536 64KB 2^20 1048576 1m 1MB 2^30 1billion 1GB 2^32 4GB 2^40 1 trillion 1TB

System Design Practice (1)

Today I went through the basic knowledge about sys design. Actually, it includes a well-structured skill-tree. The majority of those technological terms is quite familiar for me. During my daily reading, I have already heard or used some aspects such as replication, master-slaver, load balance and federation. But I didn’t have time to review and combine my working experience with the best practice.

I have been using MongoDB for few months and I need to quick pick up the concept of SQL DB. There are some incorrect concepts regarding NoSQL design. But I still learned a lot from it

Tomorrow I will go through all topics and terms again and some methodologies about system design best practice. But we need to start with a handy and quick try. Then I can figure out my weakness and what I need to learn tomorrow.

Let’s begin and have some fun!

Step 1: Outline Use Cases and constraints

First, we need to have a clear understanding of what kind of situations and scenarios we need to handle. That’s why we need to abstract the use cases.

Previously, I worked on the Revel-Xero integration project.

Use cases

Here are some scoped use cases:

  • User register and connect with Revel and Xero account
  • Service extracts sales records from Revel account
    • Updates Daily
    • Categorizes sales orders by Product, Product Class, and Establishment
    • Analysis monthly spending by category
  • Service generate sales invoice and push to Xero account
    • Pushes Daily
    • Allow users to manually set the account mapping and pushing schedule
    • Sends notifications when approaching or fail to send
  • Service has high availability

Now we have three use cases. The real scenario is much more complex than this one. It also includes sales, purchase orders, payroll and item sync. The invoice also has multiple formats and the sales, payment and tax mapping should be flexible enough. But it has a similar workflow. Let’s focus on the current situation

Constraints and assumptions

(Question: what’s the best practice of calculating the constraints and assumptions? Need research)

State assumptions

  • Usually once the account setup and begin to work, user only come back on a monthly basis
  • There is no need for real-time update. Revel is not a strong consistency system so we need to delay 1-2 days and then sync the data
  • Revel only have around 1000 customers in AU. But our target is the entire Asia market. So let’s assume 10 thousand users
    • Usually, one user will only have 1 establishment. So 10k establishment
    • Each establishment usually will have around 1000 sales orders per day. 10 million transactions per day
    • 300 million transactions per month
    • one user has one revel account and one xero account. so 10k revel account and 10k xero account
    • 20k read request per month
    • 100:1 write to read ratio
    • write-heavy, user make transactions daily but few visit the site daily.

Calculate Usage

In case we forget: 1 English letter = 1 byte, 2^8 = 1 byte, 1 Chinese letter = 2 bytes

  • Size per transaction:
    • user_id: 8 bytes
    • created: 8 bytes
    • product: 32 bytes
    • product_class: 10 bytes
    • establishment: 12 bytes
    • amount: 5 bytes
    • Total: ~ 75 bytes
  • 20 GB of new transaction content per month, 240 GB per year
    • 720 GB of new transaction in 3 years
  • 116 transaction per second on average
  • 0.017 read request per second on average

Handy conversion guide:

  • 2.5 million seconds per month
  • 1 request per second = 2.5 million requests per month
  • 40 requests per second = 100 million requests per month
  • 400 requests per second = 1 billion requests per month

Step 2: Create a high-level design

Outline a high-level design with all important components.

(To be continued… It’s 12 am now so I will try to finish it tomorrow!)