Real-time programming with Linux: Part 1 - Part 2 - Part 3 - Part 4

As we have explored throughout this series, building a real-time (RT) system involves ensuring that the hardware, the operating system (OS), and the application all have bounded latency. In the last post, I described a few sources of unbounded latency caused by interactions between the OS and the application. In this post, we will write the boilerplate code necessary to avoid these sources of latency in C++, in four steps:

  1. Lock all memory pages into physical RAM.
  2. Set up RT process scheduling for the real-time thread(s).
  3. Run a loop at a predictable rate with low jitter.
  4. Ensure data can be safely passed from one thread to another without data races.

Since this code is required by basically all Linux RT applications, I have refactored it into a small RT application framework in my cactus-rt repository. All the examples in this post are also shown in full in that repository, along with a number of additional examples based on the refactored app framework.

Locking memory pages with mlockall

As noted in the previous post, code in the RT section needs to avoid page faults to ensure that memory access latency is not occasionally unbounded. This can be done by locking the application's entire virtual memory space into physical RAM via the mlockall(MCL_CURRENT | MCL_FUTURE) function call. Usually, this is done immediately upon application startup and before the creation of any threads, since all threads in a process share the same virtual memory space. The following code snippet shows how to do this:

 1 #include <cerrno>     // necessary for errno
 2 #include <cstring>    // necessary for strerror
 3 #include <stdexcept>
 4 #include <sys/mman.h> // necessary for mlockall
 5 
 6 void LockMemory() {
 7   int ret = mlockall(MCL_CURRENT | MCL_FUTURE);
 8   if (ret) {
 9     throw std::runtime_error{std::strerror(errno)};
10   }
11 }
12 
13 int main() {
14   LockMemory();
15 
16   // Start the RT thread... etc.
17 }

This code is straightforward: line 7 shows the usage of mlockall, followed by some error handling.

Setting up real-time threads with pthreads

By default, threads created on Linux are scheduled with a non-RT scheduler. The non-RT scheduler is not optimized for latency and thus cannot generally be used to satisfy RT constraints. To set up an RT thread, we need to inform the OS to schedule the thread with an RT scheduling policy. As of the time of this writing, there are three RT scheduling policies on Linux: SCHED_RR, SCHED_DEADLINE, and SCHED_FIFO. Generally, SCHED_RR should probably not be used as it is tricky to use correctly[1]. SCHED_DEADLINE is an interesting but advanced scheduler that I may cover at another time. For most applications, SCHED_FIFO is likely good enough. With this policy, if a thread is runnable (i.e. not blocked on a mutex, IO, sleep, etc.), it will run until it is done, blocked, or preempted (interrupted) by a higher-priority thread[2]. With the right system setup, SCHED_FIFO can be used to program an RT loop with relatively low jitter (0.05 - 0.2 ms depending on the hardware). This is something that you will know how to do by the end of this post.

In addition to configuring the thread with an RT scheduling policy, we also need to give it an RT priority level. If two threads are runnable, the higher-priority thread will run, even if it means preempting the lower-priority thread in the process. The priority of a normal Linux thread[3] is controlled by its nice value, which ranges from -20 to +19, with lower values taking higher priority. However, these values are not applicable to RT threads[4]. Instead, the RT priority values of a thread scheduled by an RT scheduling policy range from 1 to 99. Confusingly, in this system, a higher value takes a higher priority. Fortunately, nice values and RT priority values are on separate scales, and RT threads always have higher priority than non-RT threads. The scale for the nice and RT priority values is illustrated in Figure 1.
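
If you do not want to hard-code these bounds, POSIX provides the sched_get_priority_min and sched_get_priority_max functions to query the valid priority range for a given policy at runtime. Here is a minimal sketch:

 1 #include <cstdio>
 2 #include <sched.h> // necessary for sched_get_priority_min/max
 3 
 4 int main() {
 5   // On Linux, this prints "1 to 99" for SCHED_FIFO.
 6   std::printf("SCHED_FIFO priority range: %d to %d\n",
 7               sched_get_priority_min(SCHED_FIFO),
 8               sched_get_priority_max(SCHED_FIFO));
 9   return 0;
10 }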

On a typical Linux distribution with the PREEMPT_RT patch, there should not be any RT tasks running on the system except for a few built-in kernel tasks. The kernel interrupt request (IRQ) handlers, which handle interrupt requests originating from hardware devices, run with an RT priority value of 50. These are necessary for communication with the hardware and should generally not be changed[5]. Some critical kernel-internal tasks, such as the process migration tasks and the watchdog task, always run with an RT priority value of 99. To ensure that the RT application gets priority over the IRQ handlers, its RT priority is usually set to 80 as a reasonable default. Userspace RT applications should generally not set their RT priority to 99, to ensure kernel-critical tasks can still run. These processes are also marked on Figure 1.

/static/imgs/blog/2022/04-rt-priority.svg

Figure 1: Diagram depicting the ranges of priority levels on Linux (not to scale). SCHED_OTHER is a non-RT scheduling policy while SCHED_FIFO is an RT scheduling policy.

To set up the RT scheduling policy and priority, we can interact with the pthread API[6]. The C++ standard library defines the std::thread class as a cross-platform abstraction around OS-level threads. However, there is no C++-native way to set the scheduling policy and priority, as the OS-level APIs (such as pthread) are not standardized across platforms. Instead, std::thread has a native_handle() method that returns the underlying pthread_t on Linux. With the right API calls, it is possible to set the scheduling policy and priority after the creation of the thread. However, I find this to be a bit tedious and prefer to interact with the pthread API directly so that the thread is created with the right attributes. This code can then be wrapped into a Thread class for convenience (full code is shown here):

 1 // Other includes ...
 2 #include <pthread.h>
 3 
 4 class Thread {
 5   int priority_;
 6   int policy_;
 7 
 8   pthread_t thread_;
 9 
10   static void* RunThread(void* data) {
11     Thread* thread = static_cast<Thread*>(data);
12     thread->Run();
13     return NULL;
14   }
15 
16  public:
17   Thread(int priority, int policy)
18       : priority_(priority), policy_(policy) {}
19 
20   void Start() {
21     pthread_attr_t attr;
22 
23     // Initialize the pthread attribute
24     int ret = pthread_attr_init(&attr);
25     if (ret) {
26       throw std::runtime_error(std::strerror(ret));
27     }
28 
29     // Set the scheduler policy
30     ret = pthread_attr_setschedpolicy(&attr, policy_);
31     if (ret) {
32       throw std::runtime_error(std::strerror(ret));
33     }
34 
35     // Set the scheduler priority
36     struct sched_param param;
37     param.sched_priority = priority_;
38     ret = pthread_attr_setschedparam(&attr, &param);
39     if (ret) {
40       throw std::runtime_error(std::strerror(ret));
41     }
42 
43     // Make sure threads created using this attribute take their scheduling
44     // values from the attribute instead of inheriting them from the parent thread.
45     ret = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
46     if (ret) {
47       throw std::runtime_error(std::strerror(ret));
48     }
49 
50     // Finally create the thread
51     ret = pthread_create(&thread_, &attr, &Thread::RunThread, this);
52     if (ret) {
53       throw std::runtime_error(std::strerror(ret));
54     }
55   }
56 
57   int Join() {
58     return pthread_join(thread_, NULL);
59   }
60 
61   void Run() noexcept {
62     // Code here should run as RT
63   }
64 };
65 
66 void LockMemory() { /* See previous section */ }
67 
68 int main() {
69   LockMemory();
70 
71   Thread rt_thread(80, SCHED_FIFO);
72   rt_thread.Start();
73   rt_thread.Join();
74 
75   return 0;
76 }

The above code snippet defines the class Thread with three important methods:

  1. void Start(), which invokes the pthread API and starts an RT (or non-RT) thread.
  2. int Join(), which calls pthread_join and waits for the thread to finish.
  3. void Run() noexcept, which should contain the custom logic to be executed on the RT thread. As this is a demonstration, it is left empty. The method is marked noexcept as C++ exceptions are not real-time safe.

Most of the magic is contained in the Start() method. The scheduling policy is set on line 30 and the scheduling priority is set on lines 37 and 38. Note that policy_ = SCHED_FIFO and priority_ = 80 are set when the Thread object is constructed on line 71. The thread is finally started on line 51. This calls the method Thread::RunThread on the newly-created RT thread, which simply calls thread->Run(). This indirection is needed because pthread takes a function pointer with a specific signature, and the Run() method does not quite have the right signature. Code written within the Run() method will be scheduled with the SCHED_FIFO policy. As previously noted, this means it won't be interrupted unless preempted by a higher-priority thread. With this scaffolding (note that LockMemory is also included in the example above), we can start writing an RT application. Since RT applications generally loop at some predictable frequency, we will look at how the loop itself is programmed for RT in the next section.

If you compile and run the full code, you will likely encounter a permission error when the program starts. This is because Linux restricts the creation of RT threads to privileged users only. You'll either need to run this program as root, or edit your user's max rtprio value in /etc/security/limits.conf as per the man page[7].

[1]See 56:40 of this talk for more details about the problems of SCHED_RR.
[2]SCHED_FIFO is a bit more complex than this, but not that much more complex, especially for a case where there's only a single RT process. See the man page for sched(7) for more details.
[3]Threads, tasks, and processes are synonymous from the perspective of the OS scheduler.
[4]Nice values are technically related to the RT priority values. However, the actual formula is very confusing. See the kernel source for details.
[5]In some cases, you need to ensure some IRQ handlers can preempt your RT thread, which means you need to set these IRQ handlers' priority level to be higher than the application. For example, if the RT thread is waiting for network packets in a busy loop with higher priority than the network IRQ handler, it may be blocking the networking handler from receiving the packet being waited on. In other cases, stopping IRQ handlers from working for a long time may even crash the entire system.
[6]It is also possible to set RT priority via the chrt utility without having to write code, but I find it cleaner to set the RT scheduling policy and priority directly in the code to better convey intent.
[7]If you create the file /etc/security/limits.d/20-USERNAME-rtprio.conf with the content of USERNAME - rtprio 98, you may be able to run basic pthread program without using sudo. Your mileage may vary, so please consult with the man pages for limits.conf.

Looping with predictable frequency

/static/imgs/blog/2022/04-rt-loop-1.svg

Figure 2: Timeline view of a loop implemented with a) a constant sleep and b) a constant wake-up time.

If an RT program must execute some code at 1000 Hz, you can structure the loop in two different ways as shown in Figure 2. This figure shows the timeline view of two idealized loops executing and sleeping, shown with the green boxes and the double-ended arrows respectively. The simplest way to implement this loop would be to sleep for 1 millisecond at the end of every loop iteration, shown in Figure 2a. However, unless the code within the loop executes instantaneously, this approach would not be able to reach 1000 Hz exactly. Further, if the duration of each loop iteration changes, the loop frequency would vary over time. Obviously, this is not an ideal way to structure an RT loop. A better way to structure the loop is to calculate the time the code should wake up next and sleep until then. This is effectively illustrated in Figure 2b with the following sequence of events:

  1. At time = 0, the application starts the first loop iteration.
  2. At time = 0.25ms, the loop iteration code finishes.
  3. Since the application last woke up at t = 0, it calculates the next intended wake-up time to be 0 + 1 = 1ms.
  4. The application instructs the OS to sleep until time = 1ms via the clock_nanosleep function.
  5. At time = 1ms, the OS wakes up the application, which unblocks the clock_nanosleep function, and the loop advances to the next iteration.
  6. This time, the loop iteration code takes 0.375ms. The next wake-up time is calculated by adding 1ms to the last wake-up time, resulting in a new wake-up time of 1 + 1 = 2ms. The application goes to sleep until then and the loop repeats.

Since this workflow is generic, most of it can be refactored into Thread::Run() as introduced in the previous section. We can leave a Thread::Loop() method that actually contains the application logic as follows (full code is shown here):

 1 // Other includes omitted for brevity
 2 #include <ctime> // For timespec
 3 
 4 class Thread {
 5   // Other variables omitted for brevity
 6 
 7   int64_t period_ns_;
 8   struct timespec next_wakeup_time_;
 9 
10   // Other function definition omitted for brevity
11 
12   void Run() noexcept {
13     clock_gettime(CLOCK_MONOTONIC, &next_wakeup_time_);
14 
15     while (true) {
16       Loop();
17       next_wakeup_time_ = AddTimespecByNs(next_wakeup_time_, period_ns_);
18       clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next_wakeup_time_, NULL);
19     }
20   }
21 
22   void Loop() noexcept {
23     // RT loop iteration code here.
24   }
25 
26   struct timespec AddTimespecByNs(struct timespec ts, int64_t ns) {
27     ts.tv_nsec += ns;
28 
29     while (ts.tv_nsec >= 1000000000) {
30       ++ts.tv_sec;
31       ts.tv_nsec -= 1000000000;
32     }
33 
34     while (ts.tv_nsec < 0) {
35       --ts.tv_sec;
36       ts.tv_nsec += 1000000000;
37     }
38 
39     return ts;
40   }
41 };

The Run method is relatively simple with only 5 lines of code:

  1. On line 13, the current time is obtained via clock_gettime before the loop starts. It is stored into the instance variable next_wakeup_time_.
  2. On line 15, the loop starts.
  3. On line 16, the Loop() method is called, which should be filled with custom application logic (but is empty for demonstration purposes).
  4. On line 17, the code adds period_ns_ to next_wakeup_time_. Although not embedded directly in this post, the full code sets period_ns_ to 1,000,000 ns, or 1 millisecond.
    • The addition is performed with a helper method AddTimespecByNs, which performs simple arithmetic on the timespec struct based on its definition.
  5. On line 18, clock_nanosleep is called with the argument TIMER_ABSTIME[8], which instructs Linux to put the thread to sleep until the absolute time specified in next_wakeup_time_. When the thread is woken up again, clock_nanosleep returns and the code continues execution at line 15.

It is important to note the usage of CLOCK_MONOTONIC with clock_gettime and clock_nanosleep, which get the current time and sleep, respectively. These function calls ultimately result in system calls, which are handled by the OS kernel. The CLOCK_MONOTONIC argument instructs the kernel to perform the operations based on a "monotonic clock", which increases monotonically with the passage of time and usually has an epoch that coincides with the system boot time. This is not the same as the real-time clock (CLOCK_REALTIME), which can occasionally decrease in value due to clock adjustments such as those made for leap seconds. Sleeping until a particular time with the REALTIME clock can be very dangerous, as clock adjustments can cause the sleep interval to change, which may cause deadline misses. Thus, RT code should only use CLOCK_MONOTONIC for measuring time durations.
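
As a concrete illustration, here is one way to measure the duration of a loop iteration with the monotonic clock. The TimespecDiffNs helper is a hypothetical name created for this sketch, not a part of the Thread class above:

 1 #include <cstdint>
 2 #include <ctime>
 3 
 4 // Hypothetical helper: difference between two timespecs in nanoseconds.
 5 int64_t TimespecDiffNs(const struct timespec& a, const struct timespec& b) {
 6   return (a.tv_sec - b.tv_sec) * 1000000000LL + (a.tv_nsec - b.tv_nsec);
 7 }
 8 
 9 void TimedIteration() {
10   struct timespec start, end;
11   clock_gettime(CLOCK_MONOTONIC, &start);
12   // Loop(); // the RT iteration code goes here
13   clock_gettime(CLOCK_MONOTONIC, &end);
14 
15   // Durations computed from CLOCK_MONOTONIC cannot go negative, as the
16   // monotonic clock never jumps backwards due to clock adjustments.
17   int64_t elapsed_ns = TimespecDiffNs(end, start);
18   (void)elapsed_ns; // e.g. feed this into latency statistics
19 }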

Trick to deal with wake-up jitter

In part 1 and part 2 of this series, I discussed and demonstrated how Linux cannot instantaneously wake up your process at the desired time due to hardware + scheduling latency (a.k.a. wake-up latency). On a Raspberry Pi 4, I measured the wake-up latency to be up to 130 microseconds (0.13 ms). This means that when clock_nanosleep returns, it could be late by up to 130 microseconds. Although the wake-up latency is close to 0 the vast majority of the time, RT applications always need to account for the worst case. This was not considered in the previous example. The more realistic situation is shown in Figure 3a, where the gray boxes now denote the wake-up latency. As shown in the figure, the actual start time of the loop iteration may be delayed by up to the maximum wake-up latency. This may be unacceptable for RT systems that cannot tolerate high jitter on the wake-up time.

To reduce this jitter, we can employ the method shown in Figure 3b: instead of sleeping until the next millisecond, the code subtracts the wake-up latency from the sleep time. The thread thus wakes up at the beginning of the blue box at the earliest. When the thread wakes up, it busy-waits in a loop until the actual desired wake-up time at t = 1ms, before passing control to the Loop method. As long as the width of the blue box exceeds the worst-case wake-up latency, the process should always wake up before the actual desired wake-up time. In my experience, the actual wake-up time was kept within 10 microseconds of the target on a Raspberry Pi 4. That said, although the jitter is kept low, this approach uses significantly more CPU and requires accurate knowledge of the worst-case wake-up latency[9]. It is also somewhat more complex to implement correctly, so I will only show a simplified sketch after Figure 3; interested readers can look at the implementation of cactus_rt::CyclicFifoThread in the cactus-rt repository for the full details.

/static/imgs/blog/2022/04-rt-loop-2.svg

Figure 3: Timeline view of a loop affected by wake-up latency implemented with a) a constant wake-up time and b) premature wake-up and busy wait.
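
For reference, here is a simplified sketch of this idea, reusing AddTimespecByNs from the previous section (assumed here to be a free function). This is not the actual cactus_rt::CyclicFifoThread implementation, and the 130-microsecond margin is just an example value that must be tuned to exceed your hardware's worst-case wake-up latency:

 1 // Example margin; must exceed the worst-case wake-up latency of the system.
 2 constexpr int64_t kWakeupLatencyMarginNs = 130000;
 3 
 4 void SleepUntilWithBusyWait(struct timespec wakeup_time) {
 5   // Sleep until slightly *before* the desired wake-up time to absorb the
 6   // wake-up latency.
 7   struct timespec early = AddTimespecByNs(wakeup_time, -kWakeupLatencyMarginNs);
 8   clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &early, NULL);
 9 
10   // Busy-wait until the actual desired wake-up time is reached.
11   struct timespec now;
12   do {
13     clock_gettime(CLOCK_MONOTONIC, &now);
14   } while (now.tv_sec < wakeup_time.tv_sec ||
15            (now.tv_sec == wakeup_time.tv_sec &&
16             now.tv_nsec < wakeup_time.tv_nsec));
17 }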

At this point, you basically have everything you need to set up an RT application. However, I do not recommend using the code snippets presented in this post directly, as they are very barebones and do not provide a very nice base to build on. Instead, I recommend taking a look at my rt library as a part of the cactus-rt repository. In this library, I define cactus_rt::App, cactus_rt::Thread, and cactus_rt::CyclicFifoThread similar to the code introduced here. The library has more features, such as the ability to set CPU affinity, use busy wait to reduce jitter, and track latency statistics[10]. More features may also be added in the future with further development.

[8]The usage of clock_nanosleep is preferred over functions like usleep and std::this_thread::sleep_for as the latter cannot sleep until a particular time. The usage of std::this_thread::sleep_until might be OK if it is implemented via clock_nanosleep to ensure that high-resolution clocks are used. Personally, I prefer just using clock_nanosleep directly as I know that API is safe for RT.
[9]You also "lose" the CPU time spent in the busy wait permanently, which can be an issue.
[10]Some of these "advanced" configurations will be briefly discussed in the appendix below.

Passing data with a priority-inheriting mutex

Most RT applications require data to be passed between RT and non-RT threads. A simple example is the logging and display of data generated in RT threads. Since logging and displaying data is generally not real-time safe, it must be done in a non-RT thread so as to not block the RT threads. Usually, the data generated by an RT thread is collected by the non-RT thread, where it is logged into files and/or the terminal output. Data passing between concurrent threads is subject to data races, which must be avoided to ensure the correctness of the program's behavior. As noted in the previous post, there are two ways to safely pass data: (1) with lock-less programming and (2) with a priority-inheriting (PI) mutex. Although lock-less programming is a very appealing option for RT, it is too large of a topic to cover now (I will discuss it in the next post). Instead, the remainder of this post will demonstrate the safe usage of a mutex in RT, as this is likely good enough in most situations.

Much like std::thread, C++ defines std::mutex, which is a cross-platform implementation of mutexes. Also like std::thread, the standard C++ API does not offer any way to make a std::mutex priority-inheriting. While std::mutex also implements native_handle(), which returns the underlying pthread_mutex_t struct, the attributes of a pthread mutex cannot be changed after it is initialized. Thus, unlike std::thread, std::mutex is completely unusable for real-time and must be replaced with a different implementation. As a part of the rt library defined in the cactus-rt repository, I have created cactus_rt::mutex, which is a PI mutex (full code is shown here):

 1 #include <pthread.h>
 2 #include <cstring>
 3 #include <stdexcept>
 4 
 5 namespace cactus_rt {
 6 class mutex {
 7   pthread_mutex_t m_;
 8 
 9  public:
10   using native_handle_type = pthread_mutex_t*;
11 
12   mutex() {
13     pthread_mutexattr_t attr;
14 
15     int res = pthread_mutexattr_init(&attr);
16     if (res != 0) {
17       throw std::runtime_error{std::strerror(res)};
18     }
19 
20     res = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
21     if (res != 0) {
22       throw std::runtime_error{std::strerror(res)};
23     }
24 
25     res = pthread_mutex_init(&m_, &attr);
26     if (res != 0) {
27       throw std::runtime_error{std::strerror(res)};
28     }
29   }
30 
31   ~mutex() {
32     pthread_mutex_destroy(&m_);
33   }
34 
35   // Delete the copy constructor and assignment
36   mutex(const mutex&) = delete;
37   mutex& operator=(const mutex&) = delete;
38 
39   void lock() {
40     auto res = pthread_mutex_lock(&m_);
41     if (res != 0) {
42       throw std::runtime_error(std::strerror(res));
43     }
44   }
45 
46   void unlock() noexcept {
47     pthread_mutex_unlock(&m_);
48   }
49 
50   bool try_lock() noexcept {
51     return pthread_mutex_trylock(&m_) == 0;
52   }
53 
54   native_handle_type native_handle() noexcept {
55     return &m_;
56   };
57 };
58 }  // namespace cactus_rt

Most of this code is boilerplate to wrap the pthread mutex into a class that implements the BasicLockable and Lockable requirements, allowing it to be used by wrappers such as std::scoped_lock. This makes cactus_rt::mutex a drop-in replacement for std::mutex. The only line of interest is line 20, where the priority-inheritance protocol is set for the mutex. A toy example using the cactus_rt::mutex is given below (full code is shown here):

 1 cactus_rt::mutex mut;
 2 std::array<int, 3> a;
 3 
 4 void Write(int v) {
 5   std::scoped_lock lock(mut);
 6   a[0] = v;
 7   a[1] = 2 * v;
 8   a[2] = 3 * v;
 9 }
10 
11 int Read() {
12   std::scoped_lock lock(mut);
13   return a[0] + a[1] + a[2];
14 }

This just shows two functions that can read and write to the same array a without data races. As you can see, it is just as easy as std::mutex.

Although cactus_rt::mutex is safe for RT, simply converting normal mutexes into cactus_rt::mutex does not guarantee that the code is safe for RT. This is because the usage of a PI mutex causes the critical sections protected by the mutex on the non-RT thread to occasionally be elevated to run with RT priority, and this code may cause unbounded latency due to things such as dynamic memory allocation and blocking system calls (i.e. everything mentioned in the previous post). Thus, all code protected by the PI mutex must be written in an RT-safe way, as illustrated below. This is sometimes not feasible, in which case lock-less programming must be employed.
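
As an illustration, one pattern that keeps the critical section RT-safe is to only copy a small, trivially-copyable struct while the lock is held, and to do all of the non-RT-safe work (formatting, IO, allocation) outside of it. The Data struct and function names below are hypothetical, made up for this sketch:

 1 struct Data {
 2   int64_t iteration;
 3   double  value;
 4 };
 5 
 6 cactus_rt::mutex mut;
 7 Data shared_data;
 8 
 9 // Called from the RT thread: the critical section is a bounded,
10 // allocation-free copy of a small struct.
11 void PublishFromRT(const Data& d) {
12   std::scoped_lock lock(mut);
13   shared_data = d;
14 }
15 
16 // Called from the non-RT thread: copy under the lock, then format and
17 // log *outside* of it, so the non-RT-safe IO never runs at RT priority.
18 Data ConsumeForLogging() {
19   std::scoped_lock lock(mut);
20   return shared_data;
21 }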

Summary

In this post, I gave a tutorial on how to write an RT application with C++. Specifically, we went over the following steps:

  1. Locking memory with mlockall on the process level at application startup.
  2. Manually creating a pthread using the SCHED_FIFO scheduling policy with a default RT priority of 80 using the custom Thread class.
  3. Setting up an RT loop by calculating the next wake-up time and sleeping with clock_nanosleep.
  4. Safely passing data via a priority-inheriting mutex defined as the class cactus_rt::mutex, which is a drop-in replacement for std::mutex.

Along the way, we discussed:

  • The importance of using CLOCK_MONOTONIC as CLOCK_REALTIME does not increase monotonically and therefore could be dangerous for time duration calculations.
  • The usage of busy wait to minimize wake-up jitter.
  • The fact that PI mutexes cause code that is protected by the mutex on the non-RT thread to occasionally run with RT priority, which means that code needs to be RT-safe and avoid unbounded latency.

All of the examples in this post can be found here. In the next post, I will briefly highlight a few lock-less programming techniques and hopefully conclude this series.

Appendix: advanced configurations

One way to further reduce wake-up latency is to use a Linux feature known as isolcpus. This kernel boot parameter instructs the Linux kernel to not schedule any processes (other than some critical kernel tasks) on certain CPUs. It is then possible to pin the RT thread onto those CPUs via the CPU affinity feature. This can further reduce wake-up latency, as the kernel will rarely have to preempt another thread to schedule and switch to the pinned RT thread. This is implemented in my cactus_rt::Thread implementation in cactus-rt.
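
As a minimal sketch, pinning can be done with one more pthread attribute call in the Thread::Start() method from earlier (reusing its attr and ret variables), before pthread_create. The CPU number below is an arbitrary example that assumes the kernel was booted with isolcpus=3:

 1 // Pin the new thread to CPU 3. Note that pthread_attr_setaffinity_np is
 2 // a GNU extension and requires _GNU_SOURCE to be defined.
 3 cpu_set_t cpuset;
 4 CPU_ZERO(&cpuset);
 5 CPU_SET(3, &cpuset);
 6 
 7 ret = pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset);
 8 if (ret) {
 9   throw std::runtime_error(std::strerror(ret));
10 }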

In RT, memory allocation is to be avoided. In other words, all memory must be allocated before the start of the RT sections. Two additional things may be considered:

  1. Stack memory (where all the local variables live) has a limited size on Linux. By default, this is 2MB. Since variables are pushed onto the stack as the application code executes, a stack overflow can occur during execution if the stack variables become too large. This usually results in the process getting killed by the kernel, which is obviously undesirable. Since each thread has its own private stack, you may need to increase the stack size during thread creation via pthread_attr_setstacksize (a sketch follows this list). This is also implemented in cactus_rt::Thread.
  2. If an O(1) memory allocator implementation is used (i.e. malloc takes constant time, excluding the time needed for page faults), it may be OK to dynamically allocate memory during the RT sections if the memory allocator has already reserved the memory from the OS. However, reserved memory may be returned to the OS once free'd, which may result in page faults when new malloc calls are made, as the total amount of reserved memory is reduced. If an O(1) memory allocator is used, you should consider reserving a large pool of memory at program startup and disabling the memory allocator's ability to give memory back to the OS. This is currently partially implemented by cactus_rt::App in cactus-rt.
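
For the first point, setting the stack size is another pthread attribute call that could slot into Thread::Start() before pthread_create, again reusing its attr and ret variables. The 8MB size below is an arbitrary example value:

 1 // Reserve an 8MB stack for the new thread (example value). Since
 2 // mlockall(MCL_FUTURE) is in effect, the stack is locked into RAM too.
 3 constexpr size_t kStackSize = 8 * 1024 * 1024;
 4 
 5 ret = pthread_attr_setstacksize(&attr, kStackSize);
 6 if (ret) {
 7   throw std::runtime_error(std::strerror(ret));
 8 }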

Comments? Contact me via my email.