The kernel is the part of the system that handles the hardware, allocates resources like memory pages and CPU cycles, and usually is responsible for the file system and network communication.
Linux kernels have a peculiar numbering system. After 0.01 and 0.02 that were very preliminary,
From: torvalds@klaava.Helsinki.FI (Linus Benedict Torvalds) Newsgroups: comp.os.minix Subject: Free minix-like kernel sources for 386-AT Date: 5 Oct 91 05:41:06 GMT Do you pine for the nice days of minix-1.1, when men were men and wrote their own device drivers? Are you without a nice project and just dying to cut your teeth on a OS you can try to modify for your needs? Are you finding it frustrating when everything works on minix? No more all- nighters to get a nifty program working? Then this post might be just for you :-) As I mentioned a month(?) ago, I'm working on a free version of a minix-lookalike for AT-386 computers. It has finally reached the stage where it's even usable (though may not be depending on what you want), and I am willing to put out the sources for wider distribution. It is just version 0.02 (+1 (very small) patch already), but I've successfully run bash/gcc/gnu-make/gnu-sed/compress etc under it. Sources for this pet project of mine can be found at nic.funet.fi (128.214.6.100) in the directory /pub/OS/Linux. ...
we got 0.10, 0.11, 0.12, already usable, next 0.95, almost there. But from 0.95 to 1.0 was a long way, and after 0.99 we got 0.99pl1, ..., 0.99pl15, where the later patch levels were again subdivided 0.99pl14a, ..., 0.99pl14z, 0.99pl15a, ..., 0.99pl15j. After the stable 1.0 the development series was called 1.1, and the next stable kernel was 1.2. At this point a rule was defined:
There are development kernels, that quite possibly may eat your disks. Never boot them without making sure that you have a good backup. These are the versions with an odd number in the middle, like 2.5.37. And there are "stable" versions, with an even number in the middle, like 2.0.39, or 2.2.20, or 2.4.19. No version is bugfree, so also the stable versions change slowly (e.g. from 2.4.0 to 2.4.19) but a more or less recent stable version should work fine on ordinary hardware. Vendors like SuSE and RedHat often have private kernel modifications, so system behaviour may change when you switch to an "official" kernel.
Since 2.6 the situation is a bit different: stable and development series have merged now. Patches are first tried out in the -mm series, release candidates are suffixed -rcN for some N, bug fixes after a stable release get a fourth digit (like 2.6.8.1 after 2.6.8).
The command
will tell you the version numbers of various kernel versions.finger linux@kernel.org
Make sure you have half a GB free somewhere.
Get the kernel by anonymous ftp from ftp://ftp.nl.kernel.org
(with nl
replaced by something else if this mirror doesn't have
the most recent stuff yet; de
and fi
do more frequent
updates).
For the most recent kernel versions, the appropriate directory
is /pub/linux/kernel/v2.6
. There are full distributions,
like linux-2.6.8.1.tar.gz
(44 MB), or patches
relative to the previous version, like patch-2.6.8.1.gz
.
Release candidates live in the testing
subdirectory, and have
names like patch-2.6.9-rc1.bz2
.
Very recent snapshots (as patch relative to the current kernel version)
live in the snapshots
subdirectory, and have names like
patch-2.6.9-rc1-bk9.bz2
.
Finally, the individual changesets still live in a v2.5
directory,
see
kernel/v2.5/testing/cset/.
filename | unpack command |
xxx.gz | gunzip xxx.gz |
xxx.bz2 | bzip2 -d xxx.bz2 |
xxx.tar.gz | tar xfz xxx.tar.gz |
xxx.tar.bz2 | tar xfj xxx.tar.bz2 |
will give you a kernel source tree (212 MB).% ncftp ftp.nl.kernel.org ncftp / > cd /pub/linux/kernel/v2.6 ncftp /pub/linux/kernel/v2.6 > get linux-2.6.0-test11.tar.bz2 ncftp /pub/linux/kernel/v2.6 > quit % tar xvjf linux-2.6.0-test11.tar.bz2
There are various other access methods, by http, rsync, git etc, see kernel.org.
Distribution vendors have their own kernel patches. On kernelnewbies one can browse recent vendor patches.
See lxr.linux.no.
In the beginning, the Linux source management tools were email, diff and patch. Since Feb 2002, Linus used Larry McVoy's bitkeeper distributed source management system. The fact that this is a commercial system, and not open source, was a source of a lot of controversy. In April 2005 Linus announced that he dropped bitkeeper. Within a few weeks he developed a new source management system, git.
Marcelo's old 2.4 tree is still visible in bitkeeper format at linux.bkbits.net.
Linus' current tree is visible as linux-2.6.git. The actual files.
Some notes on the passive use of git: First repository download:
This gets the kernel source tree, today 536 MB. Of this, the git metadata takes 218 MB, almost all of which belongs to% git-clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6
.git/objects/pack/pack-f91...99fc.idx
(15 MB) and
.git/objects/pack/pack-f91...99fc.pack
(200 MB).
What did we get? Compare with the result of getting linux-2.6.23.tar.bz2, upgrading to 2.6.24-rc3 using testing/patch-2.6.24-rc3.bz2, upgrading to 2.6.24-rc3-git6 using snapshots/patch-2.6.24-rc3-git6. Now
So this git clone produced roughly the same result as getting the tar file, except that the additional subdirectory .git contains the entire metadata and history since git was started.% diff -r tar/linux-2.6* git/linux-2.6 Only in git/linux-2.6: .git diff -r tar/linux-2.6*/Makefile git/linux-2.6/Makefile 4c4 < EXTRAVERSION = -rc3-git6 --- > EXTRAVERSION = -rc3 %
We can look at the history since 2005-04-16 (Linux-2.6.12-rc2) by
and see that there were 74426 commits in less than 1000 days, about 77 per day on average, with a maximum of 914 commits on Tue Oct 16 2007. A graphical version of the history is given by gitk.% git log | less
More history:
yields a repository with the history from 2002-02-05 to 2005-04-04, all data from the bitkeeper time.% git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/old-2.6-bkcvs.git old-2.6-bkcvs
One can browse a lot of git repositories at git.kernel.org.
Upgrading to current is done by
% cd git/linux-2.6 % git pull git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
Looking at the changes for a file, say fs/autofs/inode.c
is done by
(for the commit messages), or% git-whatchanged fs/autofs/inode.c
(for the actual changes).% git-whatchanged -p fs/autofs/inode.c
Do a cd
into the top source directory
(with arch
, Documentation
, etc.),
then do something like
make oldconfig make depend make bzImage
There is a number of configuration targets, like
xconfig
, menuconfig
, oldconfig
.
Pick one and specify what kind of a kernel you want.
(What hardware must be supported, what filesystem types, etc.)
If you are greedy and make a big kernel, you cannot boot from floppy.
If you are very greedy, you may not be able to boot at all,
so asking for "everything" may be counterproductive.
The target oldconfig
means "the same as last time",
and uses the file .config
that you hopefully still have.
When starting from scratch it may take a number of attempts
before you have a kernel that boots and supports your hardware.
Recent kernels may require recent compiler or binutils to compile,
and recent utilities to use. See Documentation/Changes
.
Probably you use some boot loader, like lilo
or grub
.
Or perhaps you want to boot from a floppy.
After compilation you have a file arch/i386/boot/bzImage
(assuming that was on a PC). For a bootfloppy, dd
this file to
an empty diskette. Otherwise, copy the file to /boot
,
add it to /etc/lilo.conf
or /boot/grub/grub.conf
or so, run lilo
and lilo -R
if you are a lilo user,
and reboot into your new kernel.
Grub allows you a menu with possible kernels to boot. That is nice.
Lilo has a much better feature: lilo -R
allows you to
set the kernel to boot into for the next time only.
So, for kernel development lilo is easier than grub: make a new
kernel, try a boot, probably something will fail, and the next reboot
is into the good old solid kernel again.
On the other hand, lilo has a disadvantage: you must
rerun lilo
after installing a new kernel, and very obscure
things will happen if you forget.
Never delete your old kernel. Maybe the new one doesn't work. (And if you use Appletalk only once a month, and Appletalk doesn't work in the new kernel, it'll take a month before you notice.)
What if the kernel doesn't boot? There are many possible explanations.
If the boot crashes very early, say, after
Uncompressing Linux... Ok, booting the kernel.then check that it was compiled for the right hardware. If the processor type was chosen correctly, check whether things go better with fewer options selected. A too big kernel will crash.
If the crash comes later, then hopefully the boot messages give some indication of what point in the boot process was reached when things went wrong.
A common mistake is to want to have as many modules as possible. If the driver needed to access disk (or partition, or filesystem) for the root filesystem is a module, then it must be loaded before it can be loaded from disk, and that is impossible. The typical reaction is the panic
Kernel panic: VFS: Unable to mount root from 08:07where the hex numbers indicate the device.
Example of a grub.conf
file:
# /boot/grub/grub.conf # default=0 timeout=10 splashimage=(hd0,1)/boot/grub/splash.xpm.gz title 2.4.18-pre7-ac3a-unclip-scsi root (hd0,1) kernel /boot/bzImage-2.4.18-pre7-ac3a-unclip-scsi ro root=/dev/hda2 title 2.4.20pre4bs root (hd0,1) kernel /boot/bzImage-2.4.20pre4bs ro root=/dev/hda2 title Red Hat Linux (2.4.7-10) root (hd0,1) kernel /boot/vmlinuz-2.4.7-10 ro root=/dev/hda2 initrd /boot/initrd-2.4.7-10.img title WINNT rootnoverify (hd1,1) chainloader +1 title DOS rootnoverify (hd0,0) chainloader +1
Example of a lilo.conf
file:
# /etc/lilo.conf # boot=/dev/hda map=/boot/map install=/boot/boot.b prompt timeout=50 image=/boot/vmlinuz-2.0.34-0.6 label=linux root=/dev/hda5 read-only # 2.2.1 plus disk output image=/boot/bzImage-test label=test root=/dev/hda5 read-only other=/dev/hda1 label=w95 table=/dev/hda
and another one:
# /etc/lilo.conf # boot = /dev/hda # disk = /dev/hda bios = 0x80 disk = /dev/hdb bios = 0x81 disk = /dev/hde bios = 0x82 disk = /dev/hdf bios = 0x83 # change-rules reset read-only menu-scheme = Wg:kw:Wg:Wg lba32 prompt timeout = 80 message = /boot/message image = /boot/bzImage-2.5.38a label = 2.5.38a root = /dev/hdb6 append = "rootfstype=ext3 hdc=ide-scsi" image = /boot/bzImage-2.4.19a label = 2.4.19a root = /dev/hdb6 append = "rootfstype=ext3 hdc=ide-scsi" # SuSE kernel; warning: eth0 and eth2 interchanged image = /boot/vmlinuz label = suse root = /dev/hdb6 vga = 791 initrd = /boot/initrd append = " ide=nodma apm=off acpi=off hdc=ide-scsi" image = /boot/memtest.bin label = memtest86
Exercise Get, compile and boot 2.4.17. It is stable.
It is possible to load kernel code at run time without rebooting.
Most hardware and filesystem drivers exist as module, and the
commands lsmod
, insmod
, rmmod
allow
you to manipulate them. Normally life is easier if you just
compile all you need into the kernel, but when developing a
new driver it is very useful to be able to insmod
the latest version, try it, rmmod
again, edit and
compile the driver, and repeat.
There is a kernel module loader that, when enabled, will find and load modules automatically when needed. Since modules come from disk, the code required to find them (disk driver, filesystem driver) cannot itself be a module.
During configuration there are the choices Y=yes, N=no, M=module.
The command make modules
compiles the modules.
It is very easy to make modules oneself. An example:
/* * demo-module.c * * Compile with * gcc -I/path-to-linux-tree/include -D__KERNEL__ -DMODULE -O2 \ * -Wall -Wstrict-prototypes -c -o demo-module.o demo-module.c */ #include <linux/init.h> #include <linux/module.h> static int __init demo_init(void) { printk("initializing..\n"); return 0; } static void __exit demo_exit(void) { printk("goodbye!\n"); } module_init(demo_init); module_exit(demo_exit); MODULE_LICENSE("GPL");
Now after compilation an insmod demo-module.o
will produce
the message initializing..
, while rmmod demo-module
will say goodbye!
. (Where are these messages? Wherever you
are sending kernel output. Maybe on some virtual console, or maybe
in a system log, like /var/log/messages
. Recent messages
are also visible in the output of the dmesg
command.)
Written like this, the module will work both when compiled into the kernel, and when inserted as a separate module.
Later we'll make less trivial modules.
Why this MODULE_LICENSE? After getting many kernel bug reports caused by the insertion of third party modules that were binary-only and hence cannot be fixed, developers made the kernel keep track of any non-open code encountered, so that bug reports involving non-open code could be simply disregarded. One says that the kernel was tainted.
Some docs on kernel modules live on www.faqs.org. Note - this is rather outdated material, written mostly for 2.0 and 2.2. Many details are a bit different now.
The above was good enough for normal purposes under Linux 2.4.
Things have become more complicated under Linux 2.5, and
it is now easiest not to invoke gcc by hand but to let
the kernel make system do the job.
Instead of invoking gcc
we make a 1-line
Makefile
:
# cat Makefile obj-m := demo-module.o #and then compile the module with
make -C /path-to-linux-tree SUBDIRS=$PWD modulesThis time the module is called
demo-module.ko
with .ko
instead of .o
. And everything works:
# insmod demo-module.ko initializing.. # rmmod demo-module goodbye!
Above we mentioned that the kernel is distributed under GPL. What about modules?
Copyright law protects a work and derived works.
Remains the question how far "derived" reaches in case of the kernel.
Are user space programs "derived works"? The file COPYING
says
making clear that no copyright on user space programs is claimed. Good. Are modules derived works? Generally the answer is claimed to be Yes. Here some fragments of conversation.NOTE! This copyright does *not* cover user programs that use kernel services by normal system calls - this is merely considered normal use of the kernel, and does *not* fall under the heading of "derived work".
The module programming guide (2001).
The kernel initialization code (after the architecture-specific part
has finished) is found in init
.
The process handling (fork, signal, exit) lives in kernel
.
The Unix filesystem interface (open, close, read, write, chdir, link, unlink,
stat, mount, umount) lives in the subdirectory fs
.
SysV interprocess communication in ipc
.
Next, the various filesystems. They live in subdirectories
below fs
(like adfs
, affs
, autofs
,
bfs
, coda
, cramfs
, devfs
, devpts
,
driverfs
, efs
, ext2
, ext3
,
fat
, hfs
, hpfs
, intermezzo
,
isofs
, jfs
, minix
, msdos
,
ncpfs
, nfs
, ntfs
, proc
,
qnx4
, ramfs
, reiserfs
, romfs
,
smbfs
, sysv
, udf
, ufs
,
vfat
, xfs
).
And so does the partition table reading code (in fs/partitions
).
Then the memory management, kmalloc()
, swapping, etc.
is found in mm
.
Then the network code (not the device drivers, but the various
protocols, such as TCP/IP), in net
.
The device drivers, say for ethernet or so, live in
drivers/net
.
Linux runs on lots of different architectures: Intel and non-Intel PC,
DEC Alpha, Apple MacIntosh, PPC, Atari, etc. etc.
All architecture-specific code is found under arch
(maybe with subdirectories alpha
, arm
, arm26
,
cris
, h8300
, i386
, ia64
, m68k
,
m68knommu
, mips
, mips64
, parisc
,
ppc
, ppc64
, s390
, s390x
, sh
,
sparc
, sparc64
, um
, v850
, x86_64
).
The I/O subsystem and device drivers live mostly under drivers
.
It has subdirectories scsi
, ide
, usb
etc.
The include files for the kernel source live under include
(and not under /usr/include
). It is a very bad idea to
make a symlink /usr/include/linux
pointing at some
development source tree.
Under lib
some useful library routines.
Under scripts
some scripts used for kernel compilation.
A random collection of docs lives under Documentation
,
some parts nicely formatted with DocBook markup.
The commands make htmldocs
and make psdocs
will take the specially formatted doc-comments from the kernel source,
producing kernel-api.html
and kernel-api.ps
,
describing the public interfaces to the various kernel subsystems.
The Linux kernel is written in C, and compiled by gcc. Various gcc extensions are used, so it is not easy to use a different compiler. A few architecture-specific fragments are in assembler.
The code is self-contained, no library is linked in.
In particular, standard routines like printf()
are not available. (But there is printk()
that does
more or less the same.) However, the kernel has its own library,
and things like strcpy()
exist.
The desired style should be obvious from looking at the main parts of the kernel. Some device drivers are painful to behold.
Linus commented on the desired coding style in the file
Documentation/CodingStyle
. A quote:
First off, I'd suggest printing out a copy of the GNU coding standards, and NOT read it. Burn them, it's a great symbolic gesture.
Generally, Kernigham & Ritchie style layout is used, with 8-space tabs.
Try indent -kr -i8
.
Note that goto
's are not avoided. Instead of
err = allocate_foo(); if (!err) { err = allocate_bar(); if (!err) { err = allocate_baz(); if (!err) { ... } else { deallocate_bar(); deallocate_foo(); } } else deallocate_foo(); } return err;or
err = allocate_foo(); if (err) return err; err = allocate_bar(); if (err) { deallocate_foo(); return err; } err = allocate_baz(); if (err) { deallocate_bar(); deallocate_foo(); return err; } ...one writes the much clearer
err = allocate_foo(); if (err) goto out; err = allocate_bar(); if (err) goto out1; err = allocate_baz(); if (err) goto out2; ... out2: deallocate_bar(); out1: deallocate_foo(); out: return err;Not only is this cleaner, it is also faster: the main path of the code is not cluttered with error-handling code.
Linus wrote a C parser meant to keep track of the distinction
between pointers to kernel and to user space. The latter are annotated
with __user
. The parser is known as sparse
,
and has a bitkeeper home. There is also a
non-bitkeeper source.
After installing this parser, one may run a check by doing
make C=1
(check the source files that get recompiled)
or make C=2
(check all). This is very incomplete, work in progress.
A standard idiom found throughout the kernel are the list
handling primitives list_add
, list_del
,
list_entry
, list_for_each
, list_for_each_entry
.
These primitives describe doubly linked circular lists, with nodes
struct list_head { struct list_head *next, *prev; };with the obvious
/* add new node after given node */ void list_add(struct list_head *new, struct list_head *before) { struct list_head *after = before->next; after->prev = new; new->next = after; new->prev = before; before->next = new; }and
void list_del(struct list_head *entry) { entry->prev->next = entry->next; entry->next->prev = entry->prev; }
Now list_for_each
runs over the list nodes different
from the starting node:
#define list_for_each(pos, head) \ for (pos = (head)->next; pos != (head); pos = pos->next)It is called with a local temp variable
pos
.
Usually a list_head is member of a larger structure.
For example, a struct inode
contains
struct list_head i_hash, i_list, i_dentry, i_devices
four such structs, so that each inode is on four cyclic lists.
If we walk such a cyclic list, then we find for example the
address of the i_dentry
field of the inode; in order
to get the address of the inode itself we have to subtract the
offset of the i_dentry
field in a struct inode.
This is done by the macro container_of
:
#define container_of(ptr, type, member) ({ \ const typeof( ((type *)0)->member ) *__mptr = (ptr); \ (type *)( (char *)__mptr - offsetof(type,member) );})Thus,
container_of(x, struct inode, i_dentry)
returns the address of the inode that has an i_dentry
field with address x
.
In this list handling context we have the alias list_entry
of container_of
:
/** * list_entry - get the struct for this entry * @ptr: the &struct list_head pointer. * @type: the type of the struct this is embedded in. * @member: the name of the list_struct within the struct. */ #define list_entry(ptr, type, member) \ container_of(ptr, type, member)
Now we can iterate over all structs containing the nodes
of a given cyclic list using list_for_each_entry
:
#define list_for_each_entry(pos, head, member) \ for (pos = list_entry((head)->next, typeof(*pos), member); \ &pos->member != (head); \ pos = list_entry(pos->member.next, typeof(*pos), member))Again, this visits all such structs except for the starting one.
Another idiom found all over the place is defined in err.h
:
ERR_PTR casts a long to a pointer, PTR_ERR casts a pointer to a long,
and IS_ERR applied to a pointer checks whether the numerical value
lies between -999 and -1 (inclusive) and hence by convention is the
negative of an error number.
Rusty's unreliable kernel hacking guide.
The kernel prints messages using the printk()
function.
We want to see them somewhere on a console, and also log
them to a file.
The function printk()
(see kernel/printk.c
)
is very similar to the well-known printf()
in user space.
One difference is that texts printed start with a priority level
in the form of a string like "<4>"
.
#define KERN_EMERG "<0>" /* system is unusable */ #define KERN_ALERT "<1>" /* action must be taken immediately */ #define KERN_CRIT "<2>" /* critical conditions */ #define KERN_ERR "<3>" /* error conditions */ #define KERN_WARNING "<4>" /* warning conditions */ #define KERN_NOTICE "<5>" /* normal but significant condition */ #define KERN_INFO "<6>" /* informational */ #define KERN_DEBUG "<7>" /* debug-level messages */
In /proc/sys/kernel/printk
we see four numbers:
console_loglevel
(7),
default_message_loglevel
(4),
minimum_console_loglevel
(1),
default_console_loglevel
(7).
Messages with priority larger than console_loglevel
are printed
to the console. (Higher priority means smaller priority level.)
Lines without explicit priority indication are
printed with priority default_message_loglevel
.
The smallest allowed value of console_loglevel
is
minimum_console_loglevel
. The default value of
console_loglevel
is default_console_loglevel
.
(Thus, by default, all is printed to the console, except for debug messages.
Distributions often set the console_loglevel
to 1, suppressing
messages to the console that might disturb users.)
These values can be changed using the syslog()
system call
(see below), or using the sysctl()
system call, or by
echoing new values to /proc/sys/kernel/printk
.
# cat /proc/sys/kernel/printk 1 4 1 7 # echo "2 4 3 7" > /proc/sys/kernel/printk # cat /proc/sys/kernel/printk 2 4 3 7 #The
sysctl()
version goes as follows:
% cat > printk_sysctl.c << EOF #include <stdio.h> #include <sys/sysctl.h> #define SIZE(a) (sizeof(a)/sizeof((a)[0])) int name[] = { CTL_KERN, KERN_PRINTK }; int printk_params[4]; int new_params[4]; int main(int argc, char **argv) { int paramlth = sizeof(printk_params); if (argc == 1) { /* report */ if (sysctl(name, SIZE(name), printk_params, ¶mlth, 0, 0)) { perror("sysctl"); exit(1); } printf("got %d bytes:\n", paramlth); printf("console_loglevel: %d\n", printk_params[0]); printf("default_message_loglevel: %d\n", printk_params[1]); printf("minimum_console_loglevel: %d\n", printk_params[2]); printf("default_console_loglevel: %d\n", printk_params[3]); } else if (argc == 5) { int i; for (i=0; i<4; i++) new_params[i] = atoi(argv[i+1]); /* set */ if (sysctl(name, SIZE(name), 0, 0, new_params, sizeof(new_params))) { perror("sysctl"); exit(1); } printf("set new printk parameters\n"); } else { fprintf(stderr, "Call: %s [N N N N]\n", argv[0]); exit(1); } return 0; } EOF % cc -o printk_sysctl printk_sysctl.c % ./printk_sysctl got 16 bytes: console_loglevel: 2 default_message_loglevel: 4 minimum_console_loglevel: 3 default_console_loglevel: 7 % ./printk_sysctl 1 4 1 7 sysctl: Operation not permitted % su # ./printk_sysctl 1 4 1 7 set new printk parameters
Messages are printed to a ring buffer, so that later messages overwrite earlier ones. The size of this buffer can be set at compile time. Long ago it was 4096. Today 16384 is the default, on some architectures up to 131072.
This ringbuffer is available for reading via /proc/kmsg
.
However, this is a read-once interface: data disappears as soon as
it is read. There is also the dmesg
command, that reads
nondestructively.
The console(s) printed to are determined at boot time (but see below).
A kernel boot parameter console=
gives a virtual or serial
console to be printed to, and one may have several such lines.
The format is console=device,option
, e.g.,
console=tty0
: print to the foreground VT (this is the default), or
console=tty12
: print to /dev/tty12
, or
console=ttyS1,2400
: print to the serial line /dev/ttyS1
at 2400 baud, or
console=lp0
: print to the printer.
The option for a serial line is a number specifying baud rate,
followed by a letter (one of n
,o
,e
)
specifying parity (none, odd, even),
followed by a number of bits. Default is 9600n8.
In order to use a serial console, serial port support (CONFIG_SERIAL_8250) and console on serial port (CONFIG_SERIAL_8250_CONSOLE) must be enabled in the kernel config.
The file /dev/console
refers to the last console mentioned,
or to /dev/tty0
by default.
A 2400 baud serial console.
Instead of an old terminal, one may also use another PC as serial console. Connect both machines with a null-modem cable. Run some communications program on the Serial Console machine, for example kermit.
(SC) # kermit C-kermit> set port /dev/ttyS1 C-kermit> set speed 38400 C-kermit> set carrier-watch off C-kermit> connect ...These commands can be put into
.kermrc
.
Kermit will become unhappy when programs on the other side fiddle with
the serial line. The typical result is "Communications disconnect".
Instead of convincing all programs that touch serial lines
(such as the setserial
boot script, and the X server)
to leave the line used alone, one can simply rename the device:
# mv /dev/ttyS0 /dev/ttyS0-sc
When the kernel crashes during the boot, one can log the messages via the netconsole to a different machine. On the other machine, run
% netcat -u -l 5555(this listens to incoming UDP packets on the specified port and prints the contents on stdout). On the booting machine make sure the kernel was compiled with
CONFIG_NETCONSOLE=y
and give kernel command line parameters
netconsole=4444@192.168.1.40/eth2,5555@192.168.1.50/01:23:45:67:89:ABthat is, the source port, source IP address, source ethernet device, destination port, destination IP address, destination MAC address. The port numbers are arbitrary.
If the console is a VT, then it is possible to redirect kernel
messages to a different, already existing, VT by use of the
TIOCLINUX
ioctl:
/* Send kernel messages to a given console */ /* (Abbreviated version of setlogcons.c) */ #include <stdio.h> #include <stdlib.h> #include <sys/ioctl.h> int main(int argc, char **argv){ int cons; struct { char fn, subarg; } arg; if (argc == 2) cons = atoi(argv[1]); else cons = 0; /* current console */ arg.fn = 11; /* redirect kernel messages */ arg.subarg = cons; /* to specified console */ if (ioctl(0, TIOCLINUX, &arg)) { perror("TIOCLINUX"); exit(1); } return 0; }
Using the TIOCCONS
ioctl one can redirect
console output to a pseudotty. This is what xterm -C
and xconsole
do.
A small demo:
/* * compile with * cc -o pseudocons pseudocons.c -lutil */ #include <stdio.h> #include <errno.h> #include <fcntl.h> #include <stdlib.h> #include <unistd.h> #include <string.h> #include <termios.h> #include <pty.h> #include <sys/ioctl.h> void die(char *s) { perror(s); exit(1); } int main() { int masterfd, slavefd, fd; char c; if (openpty(&masterfd, &slavefd, NULL, NULL, NULL) < 0) die("openpty"); printf("got master\n"); if (ioctl(slavefd, TIOCCONS, 0)) { if (errno != EBUSY) die("TIOCCONS"); printf("trying to steal console\n"); fd = open("/dev/tty0", O_WRONLY); if (fd < 0) die("open /dev/tty0 fails"); if (ioctl(fd, TIOCCONS, 0)) die("TIOCCONS tty0"); if (ioctl(slavefd, TIOCCONS, 0)) die("TIOCCONS"); } printf("got slave console\n"); while (read(masterfd, &c, 1) == 1) printf("%c\n", c); return 0; }with application:
% cc -Wall -o pseudocons pseudocons.c -lutil % ./pseudocons got master trying to steal console TIOCCONS tty0: Operation not permitted % su # ./pseudocons got master trying to steal console got slave console H o e r a . .where the output arose because of
# echo Hoera.. > /dev/consoledone in another window.
The syslog()
system call (not to be confused with the
library function of the same name) serves to set logging parameters
and give access to the ringbuffer. The glibc interface is called
klogctl()
. The prototype is
The first parameter specifies one of 10 possible subfunctions:int klogctl(int type, char *bufp, int len);
For details, see/* * Commands to do_syslog: * * 0 -- Close the log. Currently a NOP. * 1 -- Open the log. Currently a NOP. * 2 -- Read from the log. * 3 -- Read all messages remaining in the ring buffer. * 4 -- Read and clear all messages remaining in the ring buffer * 5 -- Clear ring buffer. * 6 -- Disable printk's to console * 7 -- Enable printk's to console * 8 -- Set level of messages printed to console * 9 -- Return number of unread characters in the log buffer */
syslog(2)
.
A common setup is to have a daemon klogd(8)
read the kernel
ringbuffer, and feed the messages to syslogd(8)
.
What happens next depends on the syslogd
configuration,
see syslog.conf(5)
. In the file /etc/syslog.conf
one can specify what mailboxes and consoles and files and internet
connections system log messages should be sent to. This is much more
than just kernel messages - many daemons report their things via
the syslog mechanism. For example, lines
kern.warn;*.err;authpriv.none /dev/tty10 *.*;mail.none;news.none -/var/log/messagescause messages in certain categories to be sent to
/dev/tty10
and to the file /var/log/messages
. The leading -
says that it is not necessary to sync()
after every write.
Under X one can start xconsole
.
It captures messages written to /dev/console
(or some
other specified file) and displays them in an X window.
A similar effect may be achieved by xterm -C
.
You may have to be root to get permission to open /dev/console
.
(Or you can make some login script invoke GiveConsole
.)
The mechanism is provided by the TIOCCONS ioctl discussed above.
Inserting calls to printk()
is safe on most places
in the kernel. Of course one should not put such calls in the
code that handles printk()
output, or an infinite amount
of output will be generated, or, in case such code acquires locks,
a deadlock will result. The variable oops_in_progress
can be set to 1 to break certain locks, and to make sure klogd
is not woken up.
If the kernel hangs or crashes early in the boot sequence, before the console has been initialized, one can trace what is happening by writing messages "by hand" to video memory. See videochar.txt.
It is possible to directly ask kernel info from the keyboard (when at a virtual console), also when no process is reading it. Sometimes it is possible in this way to collect some information when most of the kernel has crashed.
First of all Show_Registers (AltGr-ScrollLock), where AltGr is the right Alt key. It produces a register dump.
Pid: 0, comm: swapper EIP: 0060:[<c01088f4>] CPU: 0 EIP is at default_idle+0x24/0x30 EFLAGS: 00000246 Not tainted EAX: 00000000 EBX: c01088d0 ECX: 00000175 EDX: c12bef50 ESI: c0532000 EDI: c01088d0 EBP: 0008e000 DS: 007b ES: 007b CR0: 8005003b CR2: 084b496c CR3: 0f913000 CR4: 00000290 Call Trace: [<c0108972>] cpu_idle+0x32/0x50 [<c0105000>] _stext+0x0/0x20This dump is written using
printk
hence also goes
to wherever printk output goes (e.g. to /var/log/messages
).
Next is Show_State (Ctrl-ScrollLock). It produces a very long output, listing for every process in the system some data and a stack trace.
emacs S 00000082 4195332920 4988 4987 (NOTLB) Call Trace: [<c032a883>] normal_poll+0x113/0x12b [<c011db3f>] schedule_timeout+0x7f/0xa0 [<c011dab0>] process_timeout+0x0/0x10 [<c014e3cf>] do_select+0x1ef/0x230 [<c014e060>] __pollwait+0x0/0xa0 [<c014e759>] sys_select+0x319/0x470 [<c0109c85>] restore_sigcontext+0x115/0x140 [<c0119a13>] sys_gettimeofday+0x43/0xa0 [<c010a72b>] syscall_call+0x7/0xbSince the result probably scrolls off the screen, one may have to use the
dmesg
command, or inspect the syslog, in order to read the
output.
Finally Show_Memory (Shift-ScrollLock). It shows the memory situation.
Mem-info: DMA per-cpu: cpu 0 hot: low 2, high 6, batch 1 cpu 0 cold: low 0, high 2, batch 1 Normal per-cpu: cpu 0 hot: low 28, high 84, batch 14 cpu 0 cold: low 0, high 28, batch 14 HighMem per-cpu: empty Free pages: 15440kB (0kB HighMem) Active:31177 inactive:22639 dirty:6 writeback:0 free:3860 DMA free:6000kB min:128kB low:256kB high:384kB active:2316kB inactive:3008kB Normal free:9440kB min:1020kB low:2040kB high:3060kB active:122392kB inactive:87548kB HighMem free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB DMA: 134*4kB 71*8kB 40*16kB 27*32kB 17*64kB 2*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 6000kB Normal: 0*4kB 2*8kB 1*16kB 0*32kB 51*64kB 26*128kB 5*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 9440kB HighMem: empty Swap cache: add 0, delete 0, find 0/0, race 0+0 Free swap: 281044kB 65520 pages of RAM 0 pages of HIGHMEM 1891 reserved pages 31170 pages shared 0 pages swap cached
Also a few virtual console housekeeping commands are implemented via this same mechanism. There are keys for scroll up (Shift-PgUp) and scroll down (Shift-PgDn). Keys to change virtual console: Decr_Console (Alt-LArrow), Incr_Console (Alt-RArrow), Last_Console (unbound).
Finally there are three commands: Boot (Ctrl-Alt-Del), KeyboardSignal (Alt-UpArrow), SAK (unbound).
Key bindings can be changed using loadkeys
.
When so configured, there is one more keyboard mechanism. Build the kernel with CONFIG_MAGIC_SYSRQ enabled. In some kernel versions you then have an active Magic SysRequest key, but in other kernels you first have to do
echo 1 > /proc/sys/kernel/sysrqto activate this key. A command is given by pressing three keys simultaneously: Alt-SysRq-X, where X is the command letter. (On a serial console one gives a Break, followed by the command letter.)
Since Linux 2.4.21/2.5.66 it is also possible to do
echo X > /proc/sysrq-triggerwith the same effect as the keystroke.
One should make sure that this facility is always switched off in a production environment.
There are several facilities to see where the kernel spends its resources. A simple one is the profiling function, that stores the current EIP (instruction pointer) at each clock tick.
Boot the kernel with command line option profile=2
(or some other number instead of 2). This will cause
a file /proc/profile
to be created.
The number given after profile=
is the number of positions
EIP is shifted right when profiling. So a large number gives a
coarse profile.
The counters are reset by writing to /proc/profile
.
The utility readprofile
will output statistics for you.
It does not sort - you have to invoke sort
explicitly.
But given a memory map it will translate addresses to kernel symbols.
See kernel/profile.c
and fs/proc/proc_misc.c
and readprofile(1)
.
For example:
# echo > /proc/profile ... # readprofile -m System.map-2.5.59 | sort -nr | head -2 510502 total 0.1534 508548 default_idle 10594.7500
The first column gives the number of timer ticks. The last column gives the number of ticks divided by the size of the function.
The command readprofile -r
is equivalent to
echo > /proc/profile
.
A more advanced mechanism is given by oprofile.
Build a kernel (2.5.43 or later) with CONFIG_PROFILING=y
and CONFIG_OPROFILE=y. Now the kernel knows about the
oprofilefs
virtual filesystem. The utilities
mentioned below will mount it on /dev/oprofile
.
It is a good idea to add idle=poll
to the kernel
command line; this will make sure time spent in the idle thread
is properly accounted for.
Get the oprofile utility from
http://oprofile.sourceforge.net/.
Configure with ./configure --with-kernel-support
,
then make
and (as root) make install
.
This yields binaries and a man page:
% ls /usr/local/bin op_dump op_merge op_start op_time opcontrol oprofiled op_help op_session op_stop op_to_source oprof_start oprofpp % ls /usr/local/share/oprofile stl.pat % ls /usr/local/share/doc/oprofile oprofile.html % ls /usr/local/man/man1 op_help.1 op_time.1 opcontrol.1 oprofiled.1 op_merge.1 op_to_source.1 oprofile.1 oprofpp.1 %All man pages are links to the
oprofile
man page.
Thus, there are two sources of information:
% man oprofile % mozilla /usr/local/share/doc/oprofile/oprofile.html
Make sure you are root and your PATH contains /usr/local/bin
.
The first action required is invoking opcontrol --setup ...
.
This will create a setup file /root/.oprofile/daemonrc
.
You need the big kernel toplevel vmlinux
file (not bzImage
,
and not the smaller arch/i386/boot/compressed/vmlinux
).
# opcontrol --setup --vmlinux=/foo/vmlinux # cat /root/.oprofile/daemonrc IGNORE_MYSELF=0 SEPARATE_LIB_SAMPLES=0 SEPARATE_KERNEL_SAMPLES=0 VMLINUX=/foo/vmlinux BUF_SIZE=0 one_enabled=1 #The precise setup call needed depends on your hardware. The above example is for an old Pentium without hardware performance counters. For a P3, use
opcontrol --setup --vmlinux=/foo/vmlinux --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=100000For a P4, use
opcontrol --setup --vmlinux=/foo/vmlinux --ctr0-event=GLOBAL_POWER_EVENTS --ctr0-unit-mask=1 --ctr0-count=100000For an Athlon or x86-64, use
opcontrol --setup --vmlinux=/foo/vmlinux --ctr0-event=RETIRED_INSNS --ctr0-count=100000There are many other possible setup options. The command
op_help
will list them once you have done the setup :-).
(See also the sourceforge site for
Intel P6/PII/PIII,
Intel P4 and
AMD events.)
The general idea is that one specifies a certain type of event (such as CPU_CLK_UNHALTED, a CPU clock cycle), and a count (like the 100000 above). Now once every count occurrences of this event the value of EIP is recorded. Pick the value of count suitably: with a 400 MHz machine a count of 100000 for CPU_CLK_UNHALTED means 4000 interrupts/second. Too small a count and the machine dies an interrupt death. Too large a count and the profile will be very coarse.
First start the oprofile daemon.
# opcontrol --start-daemon Using log file /var/lib/oprofile/oprofiled.log Daemon started. # ps ax | grep oprofile ... /usr/local/bin/oprofiled ...
Next clear out old profiling data.
# opcontrol --reset
Next start measuring, do whatever should be measured, and stop measuring.
# opcontrol --start Profiler running. # do_something # opcontrol --stop Stopping profiling. #
One now has profiling data below /var/lib/oprofile/
.
See below for what to do with the data.
When no more profiling is needed, kill the daemon:
# opcontrol --shutdown Stopping profiling. Killing daemon. #
To clean up all data generated by oprofile (after generating any desired output):
# rm -r /var/lib/oprofile
A partial cleanup is done by the opcontrol --reset
mentioned above.
To get a printout of the data for /bin/foo
, use, e.g.
oprofpp -l -i /bin/foo
. The output will be boring,
unless /bin/foo
was not stripped after compilation,
so that it contains symbol information.
The programs and libraries that were invoked (and hence are suitable
arguments for oprofpp -l -i ...
) and the numbers of ticks spent
in each are given by the command op_time
.
With the -l
option the output will be split according to symbol.
# op_time | tail -3 3134 9.0531 0.0000 /lib/i686/libc-2.2.4.so 4813 13.9032 0.0000 /usr/bin/find 26212 75.7178 0.0000 /foo/vmlinux # op_time -l | tail -6 c012c7c0 703 2.04509 kmem_cache_alloc /foo/vmlinux c02146c0 786 2.28655 __copy_to_user_ll /foo/vmlinux c0169b70 809 2.35345 ext3_readdir /foo/vmlinux c01476f0 854 2.48436 link_path_walk /foo/vmlinux c016fcd0 1446 4.20655 ext3_find_entry /foo/vmlinux 00000000 4591 13.3556 (no symbol) /usr/bin/find #
To get statistics for the kernel only, use the vmlinux
name specified.
# oprofpp -l -i /foo/vmlinux | tail c012ca30 488 1.86174 kmem_cache_free c010e280 496 1.89226 mask_and_ack_8259A c010a61a 506 1.93041 restore_all c0119220 603 2.30047 do_softirq c0110b30 663 2.52938 delay_tsc c012c7c0 703 2.68198 kmem_cache_alloc c02146c0 786 2.99863 __copy_to_user_ll c0169b70 809 3.08637 ext3_readdir c01476f0 854 3.25805 link_path_walk c016fcd0 1446 5.51656 ext3_find_entry #
One can get disassembly or annotated source with indication on where the counts occurred.
# op_to_source -a -i /foo/vmlinux ... /* 424 1.618% */ c0169ac0 <ext3_check_dir_entry>: /* 6 0.02289% */ c0169ac0: push %edi /* 56 0.2136% */ c0169ac1: push %esi c0169ac2: push %ebx c0169ac3: mov 0x18(%esp,1),%esi /* 43 0.164% */ c0169ac7: xor %ebx,%ebx c0169ac9: mov 0x14(%esp,1),%edi c0169acd: movzwl 0x4(%esi),%ecx /* 23 0.08775% */ c0169ad1: cmp $0xb,%ecx c0169ad4: jg c0169ae0 <ext3_check_dir_entry+0x20> c0169ad6: mov $0xc032b160,%ebx ...
# op_to_source --source-dir=. --output-dir=/tmp -i /path/binary # diff -u ./source.c /tmp ... int get_line(register FILE *f, int *length) +/* get_line 24 54.55% */ ... + /* 5 11.36% */ c = Getc (f); ...
This profile with annotated source is available only for binaries
compiled with -g2
(and not stripped) so that they still
contain source line numbers.
There does exist a kernel debugger kdb
, maintained
as a patch on the kernel source. It can be obtained by anonymous
ftp from
ftp://oss.sgi.com
. For an introduction, see
ibm/developerworks
.
There are patches for Linux 2.2 and 2.4. I have not seen them for 2.5 yet.
Specifying the boot parameter initcall_debug
causes
the kernel to print the addresses of all initcalls it executes.
This may allow one to pinpoint the guilty part when the kernel
crashes at boot time. (Since 2.5.67.)
Read Documentation/SubmittingPatches
and perhaps
Documentation/SubmittingDrivers
. See also
The perfect patch
and
Linux kernel patch submission format.
The main forum is the mailing list known as lk or lkml or linux-kernel
with submission address linux-kernel@vger.kernel.org
.
Subscribe to the mailing list by sending the message
tosubscribe linux-kernel
Majordomo@vger.kernel.org
. Archives exist. There are many more
specialized lists, such as linux-net
or linux-fsdevel
.
There is a "kernelnewbies" IRC channel, and mailing list, and
website,
with a lot of useful information.
Lots of other places exist.
Brown paper bag bug - A bug one is deeply ashamed of.
Before the release of 2.2.0:
From: Linus Torvalds <torvalds@transmeta.com> Subject: (fwd) 2.2.0-final Date: 1999/01/22 ... In short, before you post a bug-report about 2.2.0-final, I'd like you to have the following simple guidelines: "Is this something Linus would be embarrassed enough about that he would wear a brown paper bag over his head for a month?" and "Is this something that normal people would ever really care deeply about?" If the answer to either question is "probably not", then please consider just politely discussing it as a curiosity on the kernel mailing lists rather than even sending email about it to me ...
After the release of 2.2.0:
From: Linus Torvalds <torvalds@transmeta.com> Subject: Linux-2.2.1 - the Brown Paper Bag release Date: 1999/01/28 ... The subject says it all. We did have a few paper-bag-inducing bugs in 2.2.0, so there's a 2.2.1 out there now, just a few days after 2.2.0. Oh, well. These things happen, Linus