I/O Multiplexing in Linux

Ri Xu

8 years ago

I will talk about synchronous I/O and asynchronous I/O, blocking I/O and non-blocking I/O of network I/O in this article.

We need to understand a few concepts first:

User and Kernel Mode

In any modern operating system, the CPU is actually spending time in two very distinct modes:

Kernel Mode

In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in kernel mode are catastrophic; they will halt the entire PC.

User Mode

In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode.

Process Switching

A process switch is a operating system scheduler change from one running program to another. This requires saving all of the state of the currently executing program, including its register state, associated kernel state, and all of its virtual memory configuration.

A process switch will through the following changes:

Save the processor context, including program counters and other registers
Update the process control block (PCB) informatrion
Move the process of the PCB into the appropriate queue, such as ready queue, event block queue and other queues
Select another process to execute and update its PCB
Update the data structure in the memory
Restore the PCB context

Blocking Processes

A blocking process is usually waiting for an event such as a semaphore being released or a message arriving in its message queue. In multitasking systems, such processes are expected to notify the scheduler with a system call that it is to wait, so that they can be removed from the active scheduling queue until the event occurs. A process that continues to run while waiting (i.e., continuously polling for the event in a tight loop) is said to be busy-waiting, which is undesirable as it wastes clock cycles which could be used for other processes. When the process into the blocked state, is not occupied by CPU resources.

Buffered I/O

Buffered output streams will accumulate write results into an intermediate buffer, sending it to the OS file system only when enough data has accumulated (or flush() is requested). This reduces the number of file system calls. Since file system calls can be expensive on most platforms (compared to short memcpy), buffered output is a net win when performing a large number of small writes. Unbuffered output is generally better when you already have large buffers to send -- copying to an intermediate buffer will not reduce the number of OS calls further, and introduces additional work. Data in the transmission process needs the data copy operation, and the operation operation brings CPU memory overhead is very high.

File Descriptor (FD)

In Unix and related computer operating systems, a file descriptor (FD, less frequently fildes) is an abstract indicator (handle) used to access a file or other input/output resource, such as a pipe or network socket. File descriptors form part of the POSIX application programming interface. It is a non-negative index value, many of the underlying program often based on it.

When a read operation occurs, the data goes through two phases:

Waiting for the data to be ready
Copying the data from the kernel to the process

Because these two phases, Linux system produced the following 5 network model:

Blocking I/O

The most prevalent model for I/O is the blocking I/O model (which we have used for all our examples in the previous sections). By default, all sockets are blocking.

Nonblocking I/O

When a socket is set to be nonblocking, we are telling the kernel "when an I/O operation that I request cannot be completed without putting the process to sleep, do not put the process to sleep, but return an error instead".

I/O Multiplexing

With I/O multiplexing, we call select or poll and block in one of these two system calls, instead of blocking in the actual I/O system call.

Signal Driven I/O (uncommonly used)

The kernel raises a signal (SIGIO) when something happens to a file descriptor, first establish a signal handler for SIGIO, then set the socket owner, enable signal driven I/O for the socket at last

Asynchronous I/O

Asynchronous I/O is defined by the POSIX specification, and various differences in the real-time functions that appeared in the various standards which came together to form the current POSIX specification have been reconciled.

select, poll and epoll

select, poll, epoll are I/O multiplexing module.

select

#include <sys/select.h>
#include <sys/time.h>

int select (int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
/* Returns: positive count of ready descriptors, 0 on timeout, –1 on error */

select() overwrites the fd_set variables whose pointers are passed in as arguments readfds, writefds and exceptfds, telling it what to wait for. This makes a typical loop having to either have a backup copy of the variables, or even worse, do the loop to populate the bitmasks every time select() is to be called. Since select() uses bitmasks for file descriptor info with fixed size bitmasks (1024) it is much less convenient.

poll

#include <sys/poll.h>

int poll (struct pollfd *fdarray, unsigned long nfds, int timeout);

/* Returns: count of ready descriptors, 0 on timeout, –1 on error */

poll() provides functionality that is similar to select(), but use a pollfd pointer to instead 2 bitmap in the select().

struct pollfd {
    int fd; /* file descriptor */
    short events; /* requested events to watch */
    short revents; /* returned events witnessed */
};

poll() handles many file handles, like more than 1024 by default and without any particular work-arounds. poll() doesn't destroy the input data, so the same input array can be used over and over.

epoll

#include <sys/epoll.h>

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);

epoll() is the latest polling method in Linux (and only Linux). Well, it was actually added to kernel in 2002, so it is not so new. It differs both from poll and select in such a way that it keeps the information about the currently monitored descriptors and associated events inside the kernel, and exports the API to add/remove/modify those.

To use epoll, much more preparation is needed. A developer needs to:

Create the epoll descriptor by calling epoll_create
Initialize the struct epoll structure with the wanted events and the context data pointer. Context could be anything, epoll passes this value directly to the returned events structure. We store there a pointer to our Connection class.
Call epoll_ctl(... EPOLL_CTL_ADD) to add the descriptor into the monitoring set
Call epoll_wait() to wait for 20 events for which we reserve the storage space. Unlike previous

methods, this call receives an empty structure, and fills it up only with the triggered events. For example, if there are 200 descriptors and 5 of them have events pending, the epoll_wait will return 5, and only the first five members of the pevents structure will be initialized. If 50 descriptors have events pending, the first 20 would be copied and 30 would be left in queue, they won't get lost.

Iterate through the returned items. This will be a short iteration since the only events returned are those which are triggered.

Struct of epoll_event like this:

struct epoll_event {
  __uint32_t events;  /* Epoll events */
  epoll_data_t data;  /* User data variable */
};

epoll provides both edge-triggered and level-triggered modes. In edge-triggered mode, a call to epoll_wait will return only when a new event is enqueued with the epoll object, while in level-triggered mode, epoll_wait will return as long as the condition holds.

Polling with libevent

The libevent API provides a mechanism to execute a callback function when a specific event occurs on a file descriptor or after a timeout has been reached. Furthermore, libevent also support callbacks due to signals or regular timeouts.

Reference

Linux Programmer's Manual