Ri Xu Online

I/O Multiplexing in Linux

I will talk about synchronous I/O and asynchronous I/O, blocking I/O and non-blocking I/O of network I/O in this article.

We need to understand a few concepts first:

User and Kernel Mode

In any modern operating system, the CPU is actually spending time in two very distinct modes:

In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in kernel mode are catastrophic; they will halt the entire PC.

In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode.

Process Switching

A process switch is a operating system scheduler change from one running program to another. This requires saving all of the state of the currently executing program, including its register state, associated kernel state, and all of its virtual memory configuration.

A process switch will through the following changes:

Blocking Processes

A blocking process is usually waiting for an event such as a semaphore being released or a message arriving in its message queue. In multitasking systems, such processes are expected to notify the scheduler with a system call that it is to wait, so that they can be removed from the active scheduling queue until the event occurs. A process that continues to run while waiting (i.e., continuously polling for the event in a tight loop) is said to be busy-waiting, which is undesirable as it wastes clock cycles which could be used for other processes. When the process into the blocked state, is not occupied by CPU resources.

Buffered I/O

Buffered output streams will accumulate write results into an intermediate buffer, sending it to the OS file system only when enough data has accumulated (or flush() is requested). This reduces the number of file system calls. Since file system calls can be expensive on most platforms (compared to short memcpy), buffered output is a net win when performing a large number of small writes. Unbuffered output is generally better when you already have large buffers to send -- copying to an intermediate buffer will not reduce the number of OS calls further, and introduces additional work. Data in the transmission process needs the data copy operation, and the operation operation brings CPU memory overhead is very high.

File Descriptor (FD)

In Unix and related computer operating systems, a file descriptor (FD, less frequently fildes) is an abstract indicator (handle) used to access a file or other input/output resource, such as a pipe or network socket. File descriptors form part of the POSIX application programming interface. It is a non-negative index value, many of the underlying program often based on it.

When a read operation occurs, the data goes through two phases:

Because these two phases, Linux system produced the following 5 network model:

The most prevalent model for I/O is the blocking I/O model (which we have used for all our examples in the previous sections). By default, all sockets are blocking.

When a socket is set to be nonblocking, we are telling the kernel "when an I/O operation that I request cannot be completed without putting the process to sleep, do not put the process to sleep, but return an error instead".

With I/O multiplexing, we call select or poll and block in one of these two system calls, instead of blocking in the actual I/O system call.

The kernel raises a signal (SIGIO) when something happens to a file descriptor, first establish a signal handler for SIGIO, then set the socket owner, enable signal driven I/O for the socket at last

Asynchronous I/O is defined by the POSIX specification, and various differences in the real-time functions that appeared in the various standards which came together to form the current POSIX specification have been reconciled.

select, poll and epoll

select, poll, epoll are I/O multiplexing module.

select

#include <sys/select.h>
#include <sys/time.h>

int select (int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
/* Returns: positive count of ready descriptors, 0 on timeout, –1 on error */

select() overwrites the fd_set variables whose pointers are passed in as arguments readfds, writefds and exceptfds, telling it what to wait for. This makes a typical loop having to either have a backup copy of the variables, or even worse, do the loop to populate the bitmasks every time select() is to be called. Since select() uses bitmasks for file descriptor info with fixed size bitmasks (1024) it is much less convenient.

poll

#include <sys/poll.h>

int poll (struct pollfd *fdarray, unsigned long nfds, int timeout);

/* Returns: count of ready descriptors, 0 on timeout, –1 on error */

poll() provides functionality that is similar to select(), but use a pollfd pointer to instead 2 bitmap in the select().

struct pollfd {
    int fd; /* file descriptor */
    short events; /* requested events to watch */
    short revents; /* returned events witnessed */
};

poll() handles many file handles, like more than 1024 by default and without any particular work-arounds. poll() doesn't destroy the input data, so the same input array can be used over and over.

epoll

#include <sys/epoll.h>

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);

epoll() is the latest polling method in Linux (and only Linux). Well, it was actually added to kernel in 2002, so it is not so new. It differs both from poll and select in such a way that it keeps the information about the currently monitored descriptors and associated events inside the kernel, and exports the API to add/remove/modify those.

To use epoll, much more preparation is needed. A developer needs to:

Struct of epoll_event like this:

struct epoll_event {
  __uint32_t events;  /* Epoll events */
  epoll_data_t data;  /* User data variable */
};

epoll provides both edge-triggered and level-triggered modes. In edge-triggered mode, a call to epoll_wait will return only when a new event is enqueued with the epoll object, while in level-triggered mode, epoll_wait will return as long as the condition holds.

Polling with libevent

The libevent API provides a mechanism to execute a callback function when a specific event occurs on a file descriptor or after a timeout has been reached. Furthermore, libevent also support callbacks due to signals or regular timeouts.

Reference

Linux Programmer's Manual

Exit mobile version