From: Zinux (Linux Technician), Board: Embedded_system
Title: Making System Calls from Kernel Space
Posted at: 哈工大紫丁香 BBS (Fri, Oct 26 2001, 18:20:48), on-site post
Linux Magazine (http://www.linux-mag.com) November 2000
Copyright Linux Magazine © 2000
GEARHEADS
Making System Calls from Kernel Space
by Alessandro Rubini
Figure One: The steps involved in performing a call to read() from a
user space function.
One of the most renowned features of Unix is the clear distinction
between what occurs in "kernel space" and what occurs in "user space."
This column will describe how to invoke kernel system calls from
within kernel code. This is a first step towards understanding how to
build a kernel-resident application, such as a high-performance Web
server.
System Calls: the Facts
System calls have always been the means through which user space
programs can access kernel services. The Linux kernel implementation
is able to break the distinction between kernel space and user space
by allowing kernel code to invoke system calls as well. This allows
the kernel to perform tasks that have traditionally been reserved for
user applications, while retaining the same programming model.
The benefit of this approach is performance; the overhead of
scheduling a user application, and for that application to invoke
system calls back into the kernel, makes it undesirable for some
services to be performed in user space. For example, a
high-performance Web server may wish to reside in the kernel for
increased throughput and lower latency. However, there is also a
safety tradeoff; implementing complex services in the kernel can lead
to system crashes if those services are not extremely robust. For the
sake of maintaining, debugging, and porting code, what has always been
performed in user space should not be converted to run in kernel
space, unless that is absolutely necessary to meet performance or size
requirements.
To keep the discussion simple, throughout the article I'll refer to
the PC platform and to x86 processor features, disregarding for a
while any cross-platform issues. At the time of this writing, the
official kernel is version 2.4.0-test8, and that is what I refer to in
both the discussion and the code. Sample code is available as
ksyscall.tar.gz from ftp.linux.it/pub/People/rubini.
System Calls: the Mechanisms
To understand the speed benefits achieved by invoking system calls
from kernel space, we should first analyze the steps performed by a
normal system call, such as read. The function's role is copying data
from a source to buffers held in the application.
Figure One shows the steps involved in performing a call to read from
a user space function. You can verify the exact steps by running
objdump on compiled code for the user-space part and browsing kernel
source files for the kernel-space part.
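For instance, running something like "objdump -d" on the compiled test
program presented in the next section (assuming it is built as usystime)
and searching the output for __libc_read shows the int $0x80 instruction
at the heart of the C library's read wrapper.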
A system call is implemented by a "software interrupt" that transfers
control to kernel code; in Linux, for the x86, this is software
interrupt (also called a "gate") 0x80. The code for the specific
system call being invoked is stored in the EAX register, and its
arguments are held in other processor registers. In our example, the
code associated with read is __NR_read, which is defined in
<asm/unistd.h>.
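As a concrete illustration, the whole user-space side of the call can be
condensed into a few lines of inline assembly. This is only a minimal
sketch of the mechanism, not the C library implementation; the raw_read
name is mine:

/* Minimal sketch: invoke read() by hand through software interrupt
 * 0x80 on x86. EAX carries the system call number, EBX/ECX/EDX the
 * arguments; the result (or a negative error code) comes back in EAX. */
#include <asm/unistd.h>         /* __NR_read */

static inline long raw_read(int fd, char *buf, unsigned long count)
{
        long ret;
        asm volatile ("int $0x80"
                      : "=a" (ret)
                      : "0" (__NR_read), "b" (fd), "c" (buf), "d" (count)
                      : "memory");
        return ret;
}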
After the switch to kernel mode, the processor must save all of its
registers and dispatch execution to the proper kernel function, first
checking whether EAX is out of range. The system call we are looking
at is implemented in the sys_read function, and it must dispatch
execution to a file object. The file object itself must first be
looked up based on the file descriptor that the user application
passed to the system call. The read method for the file object finally
performs the data transfer, and all the previous steps are unwound up
to the calling user function.
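In other words, the kernel-side portion of the path boils down to
something like the following sketch (simplified from the real sys_read
in fs/read_write.c; permission, locking, and area checks are omitted,
and the function is renamed to avoid confusion with the real one):

/* Simplified sketch of the sys_read() logic: map the descriptor to a
 * struct file, then dispatch to the file object's read method. */
#include <linux/fs.h>
#include <linux/file.h>         /* fget(), fput() */

static ssize_t sketch_sys_read(unsigned int fd, char *buf, size_t count)
{
        struct file *file = fget(fd);   /* descriptor -> file object */
        ssize_t ret = -EBADF;

        if (file) {
                if (file->f_op && file->f_op->read)
                        ret = file->f_op->read(file, buf, count,
                                               &file->f_pos);
                else
                        ret = -EINVAL;
                fput(file);
        }
        return ret;
}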
Each arrow in the figure represents a jump in CPU instruction flow,
and each jump may require flushing the prefetch queue and possibly a
"cache miss" event. Transitions between user and kernel space are
especially important, since they take the most processing time and
prefetch behavior.
prefetch behavior.
Timing Execution
To add real-world figures to the theoretical discussion, let's look at
the overhead of an empty read system call -- that is, a call which
transfers no data. We'll invoke it on the stdin file descriptor, 0,
because stdin is always opened for reading. Moreover, it can be easily
redirected to check for differences according to the type of file
being read.
In order to measure overhead, we can use the Pentium timestamp
counter. This is a 64-bit register, which is incremented at each
processor clock tick and provides a very high-resolution timer. To
read the counter, the rdtsc assembly instruction is used. The header
file <asm/msr.h> includes the rdtsc(low,high) macro, which reads the
value of the counter into two 32-bit words provided by the caller. The
rdtscl(low) macro only retrieves the lower 32 bits of the counter,
which is sufficient for our purposes.
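The two macros boil down to a single instruction each; roughly (with
my_ prefixes to make clear these are illustrative rewrites, not the
header's exact text):

/* rdtsc leaves the 64-bit timestamp counter in the EDX:EAX pair. */
#define my_rdtsc(low, high) \
        asm volatile ("rdtsc" : "=a" (low), "=d" (high))
/* Lower 32 bits only; EDX is clobbered and discarded. */
#define my_rdtscl(low) \
        asm volatile ("rdtsc" : "=a" (low) : : "edx")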
The code sample that follows, which is part of the usystime.c sample
file, can be used to measure the number of clock ticks the processor
takes to execute a read call.
The code tries several times, and only the best figure is considered,
because process execution can be interrupted or delayed because of
processor scheduling, extra cache misses, or other unexpected events.
/* usystime.c (excerpt; the includes and NTRIALS are shown here for
   completeness and may differ slightly from the original file) */
#include <stdio.h>
#include <unistd.h>     /* read(), STDIN_FILENO */
#include <asm/msr.h>    /* rdtscl() */

#define NTRIALS 100     /* number of attempts; only the best one counts */

int main()
{
        unsigned long ini, end, now, best, tsc;
        int i;
        char buffer[4];

#define measure_time(code) \
        for (i = 0; i < NTRIALS; i++) { \
                rdtscl(ini); \
                code; \
                rdtscl(end); \
                now = end - ini; \
                if (now < best) best = now; \
        }

        /* time rdtsc (i.e. no code) */
        best = ~0;
        measure_time( 0 );
        tsc = best;

        /* time an empty read() */
        best = ~0;
        measure_time( read(STDIN_FILENO, buffer, 0) );

        /* report data */
        printf("rdtsc: %li ticks\nread(): %li ticks\n",
               tsc, best - tsc);
        return 0;
}
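Assuming it is saved as usystime.c, the program can be built with a
plain "gcc -O2 usystime.c -o usystime" and run with its standard input
redirected from different kinds of files, to observe differences like
those in the first row of Table One.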
Running the code on my 500 MHz box reports a count of 11 ticks for the
rdtsc instruction and 474 ticks for the empty system call; this
corresponds to about 0.95 microseconds. The same code executed on a
different processor takes 578 ticks (and 32 for reading the
timestamp).
Listing One shows the assembly code generated by the compiler for the
code shown above. This corresponds to the activities shown in the
first column of Figure One, with the exception of the pop arguments
operation, which the compiler moved after the last rdtsc instruction.
Listing One: Assembly Code Generated By the Compiler
This is the pair of consecutive rdtsc after compilation:
8048150: 0f 31 rdtsc
8048152: 89 c3 movl %eax,%ebx ; ini
8048154: 0f 31 rdtsc
8048156: 89 c1 movl %eax,%ecx ; end
And this is the system call wrapped by two rdtsc:
804817c: 0f 31 rdtsc
804817e: 89 c3 movl %eax,%ebx ; ini
8048180: 6a 00 pushl $0x0 ; arg 3 = 0
8048182: 8b 45 f4 movl 0xfffffff4(%ebp),%eax
8048185: 50 pushl %eax ; arg 2 = buffer
8048186: 6a 00 pushl $0x0 ; arg 1 = 0
8048188: e8 23 49 00 00 call 804cab0 <__libc_read>
804818d: 0f 31 rdtsc
804818f: 89 c1 movl %eax,%ecx ; end
Doing it in Kernel Space
Now let's consider issuing the same read system call from kernel
space. The easiest way to accomplish the task is by exploiting the
definitions of read and several other system calls that <asm/unistd.h>
exports if __KERNEL_SYSCALLS__ is defined. The sample code below declares
the macro before including any header.
Before calling the system call, however, a preparation step must be
performed. Like any other function that transfers data to/from user
space using a user-provided pointer, the system call checks whether or
not the provided buffer is a valid address. During normal operation,
an address that lies in the user address range (0 to 3 GB for a standard
kernel configuration) is considered valid, and an address that lies in
kernel address space (3 to 4 GB) is not. If the system call is invoked
from kernel space, however, we must prevent the usual check from
failing, because the virtual address of our destination buffer will be
in kernel space, above the 3 GB mark.
The field addr_limit in the task_struct structure is used to define
the highest virtual address that is to be considered valid; the macros
get_fs and set_fs can be used to read and write the value. The limit
that must be used when invoking system calls from kernel space (in
practice, the "no limit" case) is returned by the get_ds macro. See
the box in this page for an explanation of the names and meanings of
the macro calls.
So, kernel-to-kernel system calls must be wrapped by the following
code:
mm_segment_t fs;

fs = get_fs();      /* save previous value */
set_fs(get_ds());   /* use kernel limit */

/* system calls can be invoked */

set_fs(fs);         /* restore before returning to user space */
There's no need to wrap each individual system call, so several calls
can be performed between a single set_fs(get_ds())/set_fs(fs) pair. It's important,
however, that the original fs is restored before returning to user
space. Otherwise, the user program that executed this code will retain
permission to overwrite kernel memory by passing bogus pointers to
further read (or ioctl) system calls.
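In practice the pattern is easy to package into a small helper. The
sketch below assumes __KERNEL_SYSCALLS__ has been defined before the
includes, as in the sample module; the kread name and the errno
declaration are mine:

/* Sketch: a kernel-to-kernel read() wrapped by the required
 * address-limit juggling (Linux 2.4 style). */
#define __KERNEL_SYSCALLS__
#include <linux/types.h>
#include <linux/unistd.h>       /* the read() stub */
#include <asm/uaccess.h>        /* get_fs(), set_fs(), get_ds() */

static int errno;               /* the unistd.h stubs store errors here */

static ssize_t kread(int fd, char *buf, size_t count)
{
        mm_segment_t oldfs = get_fs();  /* save the current limit */
        ssize_t ret;

        set_fs(get_ds());               /* accept kernel-space pointers */
        ret = read(fd, buf, count);
        set_fs(oldfs);                  /* always restore before returning */
        return ret;
}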
Once equipped with these "grossly misnamed" tools, we can measure the
performance of a system call invoked from kernel space. The code shown
below is part of the ksystime.c source; it can be compiled into a
module that executes the code in kernel space (in init_module) and
then exits. Since the initialization of the module returns a failure
indication, you can reload the module to run the measurement again
without the need to unload it in advance.
/* time rdtsc (i.e. no code) */
best = ~0;
measure_time( 0 );
tsc = best;
ksys_print("tsc", tsc);
/* prepare to invoke a system call */
fs = get_fs();
set_fs (get_ds());
/* time an empty read() */
best = ~0;
measure_time( read(0 /* stdin */, buffer, 0) );
ksys_print("read()", best - tsc);
/* restore fs and make insmod fail */
set_fs (fs);
return -EINVAL;
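For reference, the module around those lines needs very little
scaffolding. A minimal sketch of the "fail on purpose" arrangement
described above might look like this (ksystime.c is organized along
these lines, though not necessarily identically):

/* Sketch: a measure-and-bail-out module for Linux 2.4. Since
 * init_module() always fails, insmod can be run repeatedly without
 * an intervening rmmod. */
#define __KERNEL_SYSCALLS__
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/unistd.h>
#include <asm/uaccess.h>

static int errno;               /* needed by the unistd.h syscall stubs */

int init_module(void)
{
        /* ... run the measurements shown above ... */
        return -EINVAL;         /* fail on purpose: nothing stays loaded */
}

void cleanup_module(void)
{
        /* never reached, because init_module() always fails */
}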
The code executed in kernel space reports 11 ticks for rdtsc (the same
reported in user space, as expected) and 424 ticks for the empty
system call -- a savings of 50 ticks.
Going Further
You may object that the reduced overhead of making system calls from
kernel space -- just 10 percent -- is not large enough to warrant such
an approach.
Actually, a quick look at the definition of the macro (in the header),
or at disassembled object code, shows that the implementation of read
as defined in <asm/unistd.h> still calls interrupt 0x80. The kernel
implementation of the system call is not optimized for speed, and is
only there for the convenience of a few kernel needs.
It's interesting to note that code for some Linux platforms invokes
kernel system calls by jumping to the sys_read (or equivalent) function
directly, thus skipping the overhead shown in the third column of
Figure One. This is not currently possible on the x86 platform
unless you resort to nasty hacks; with those hacks in place (included
in the sample code but not worth showing here) the call takes 216
ticks (54 percent less than the user-space case).
But if you are really interested in getting the best performance out
of your kernel system calls, the thing to do is directly invoke the
read file method after retrieving a pointer to the file structure
represented by the file descriptor (0 for stdin). This approach to
kernel-to-kernel system calls is the fastest possible: the call will
only incur the overhead associated with the last column of Figure One
(i.e. only the actual data transfer operation).
Listing Two shows the code that implements this technique in the
sample module ksystime.c. The set_fs and associated calls are not
shown, as they are the same as above.
Listing Two: Invoking read()
/* use the file operation directly */
file = fget(0 /* fd */);
if (file && file->f_op && file->f_op->read) {
        best = ~0;
        measure_time(
                file->f_op->read(file, buffer, 0, &file->f_pos)
        );
        ksys_print("f_op->read()", best - tsc);
}
if (file) fput(file);
The execution time of this code is reported as 175 clock ticks -- 63
percent (or 0.6 microseconds) less than the user space case. You may
even try to cache the two pointers being used in the call (f_op->read
and &file->f_pos); this is reported in the sample code as well.
Unfortunately, it makes no real difference and, in some cases, it can
even make execution slower because of the inappropriately small size
of the PC register set.
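For completeness, the cached-pointer variant mentioned above only hoists
the two dereferences out of the timed expression. A sketch, reusing the
variables of Listing Two, could be:

/* Sketch: cache f_op->read and &file->f_pos before timing the call. */
ssize_t (*read_fn)(struct file *, char *, size_t, loff_t *);
loff_t *ppos;

read_fn = file->f_op->read;
ppos = &file->f_pos;

best = ~0;
measure_time( read_fn(file, buffer, 0, ppos) );
ksys_print("cached_f_op_read()", best - tsc);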
This is what the output of the module looks like on my system (the
output is found in /var/log/kern.log or equivalent):
kernel: ksystime: 11 -- tsc
kernel: ksystime: 424 -- read()
kernel: ksystime: 216 -- sys_read()
kernel: ksystime: 175 -- f_op->read()
kernel: ksystime: 173 -- cached_f_op_read()
So What?
Until now, we have collected a few figures and have found that making
system calls from kernel space can significantly reduce the overhead
of the system call mechanism. It's time to step back for a while and
ponder the figures we collected.
How could we further reduce the 175 clock ticks of overhead
associated with the read system call?
The answer lies in the read file operation we are using: the
insmod program, whose standard input is being used, is connected to a
tty (specifically, a pseudo tty controlled by xterm in this case). If
the standard input of the test program is connected to a different
kind of file, we get completely different figures. Reading a disk
file, for example, is much faster (but it still depends on the
underlying filesystem), and reading /dev/null has almost no overhead
(seven clock ticks, but the read method of the file just returns
end-of-file). The numbers collected will also vary across processor
vendor and stepping, thus making all benchmarks almost pointless -- as
usual.
Table One shows the times I collected on my PC to give an
idea of the great difference among the various read file operations. It
shows that my CPU has an overhead of 50 ticks (0.1 usec) in crossing
the user/kernel threshold twice (the "user space" row minus the "kernel
space" row); it also spends about 210 ticks (0.4 usec) processing the
generic system call entry/exit and another 40-75 ticks in sys_read
itself.
Table One: Clock Ticks for Empty read() Invoked on Different Files

file type      /proc/  net-pty  local-pty  /dev/hda  /proc/sys  nfs  ext2fs  socket  /dev/zero  /dev/null
user space        570      507        474       411        402  353     329     320        324        313
kernel space      519      460        424       361        351  303     278     270        273        262
sys_read          307      244        216       150        140   91      67      59         63         52
file->f_op        263      170        175       105         98   49      23      15         21          8
Since actual data transfer takes two or three clock ticks per byte
(measured by copying a 64-byte buffer in the read calls), the overhead
that can be avoided by using kernel system calls is worth a data
transfer of 100-150 bytes. This is a non-trivial figure if performance
is your main concern and you transfer small data packets. On the other
hand, it may not be worth the effort for most applications.
While kernel-space system calls are an interesting tool, and playing
with them can teach you a lot about kernel internals, I still think
that their use should be as limited as possible.
For example, a device driver shouldn't read a configuration file using
kernel-space system calls; reading a file involves error management
and parsing of file contents -- not something suited for kernel code.
The best way to feed data to device drivers is through ioctl via a
user-space application.
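To illustrate the suggestion, the sketch below shows the driver side of
such an arrangement; the command number, structure layout, and every
mydrv_* name are hypothetical, invented only for the example:

/* Hypothetical example: a user-space tool reads and parses the
 * configuration file, then hands the result down as a binary
 * structure through ioctl() (Linux 2.4 prototype). */
#include <linux/fs.h>
#include <linux/ioctl.h>
#include <asm/uaccess.h>

struct mydrv_cfg {              /* hypothetical layout */
        int verbose;
        int timeout_ms;
};
#define MYDRV_IOC_SETCFG _IOW('k', 0x01, struct mydrv_cfg)

static struct mydrv_cfg mydrv_cfg;

static int mydrv_ioctl(struct inode *inode, struct file *filp,
                       unsigned int cmd, unsigned long arg)
{
        switch (cmd) {
        case MYDRV_IOC_SETCFG:
                if (copy_from_user(&mydrv_cfg, (void *)arg,
                                   sizeof(mydrv_cfg)))
                        return -EFAULT;
                return 0;
        default:
                return -ENOTTY;
        }
}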
After reading this column, you now know how to make system calls from
kernel space. Next month, we'll show you how to use kernel system
calls to build a kernel-resident Web server.
Why get_fs() is Called get_fs()
Once upon a time, when Linus was playing with his new 386 PC and Linux
wasn't even there, Linus said "Intel gave us the segments, let's use
the segments." And he used the segments.
A segment register, in 386 protected mode, acts mainly as an index
into a table of virtual-address descriptors, called the descriptor
table. Every memory access uses one of these registers (CS, DS, ES, or
FS) as its virtual-address space descriptor. CS is the code segment
and is the default descriptor for fetching instructions from memory.
DS is the data segment and is the default for most data-access
instructions. ES and FS are extra segments, which can be used by the
application or operating system in creative ways.
The first implementation of the Linux kernel-space memory map used
virtual addresses that mapped one-to-one to physical addresses. The
user-space memory map, on the other hand, was dictated by the binary
formats in use for executable files, and all of them use low virtual
addresses for executable and data pages. Therefore, executing system
calls required switching to a completely different memory map than the
one of user space. This was accomplished by using different
descriptors for the code and data segments in effect in user space and
in kernel space. Since several system calls
need to access the user address space, the FS register was reserved to
hold the user memory map while in kernel space.
This explains the name of the macros:
get_fs returns the current segment descriptor stored in FS.
get_ds returns the segment descriptor associated to kernel space,
currently stored in DS.
set_fs stores a descriptor into FS, so it will be used for data
transfer instructions.
This layout of virtual memory and segment descriptors remained in use
through version 2.0 of the kernel. The first great innovation brought
in by version 2.1 was the switch to a different approach, one
consistent with what other platforms were already doing. The user and
the kernel descriptors now share the lower 3 GB of the virtual address
space, resulting in faster access to user space from the kernel.
The FS segment register has been put to rest and user memory is now
accessed by the DS register, just like kernel memory. FS only survives
in the names of a few kernel macros (including get_fs and set_fs).
These macros still perform the same function, but the FS segment
register is no longer involved.
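For the record, on x86 in the 2.4 tree the three macros reduce, roughly
speaking, to accesses to the addr_limit field mentioned earlier. The
following is a paraphrase of <asm/uaccess.h>, not a verbatim quote:

/* The "segments" are nowadays just address limits kept per task. */
#define MAKE_MM_SEG(s)  ((mm_segment_t) { (s) })
#define KERNEL_DS       MAKE_MM_SEG(0xFFFFFFFF)   /* no limit */
#define USER_DS         MAKE_MM_SEG(PAGE_OFFSET)  /* 3 GB on a stock PC */

#define get_ds()        (KERNEL_DS)
#define get_fs()        (current->addr_limit)
#define set_fs(x)       (current->addr_limit = (x))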