
This work is licensed under a Creative Commons Attribution 2.5 License.
Any good kernel tries hard to keep in RAM any often referenced memory page,
for instance anything read from a slow device. Many modern Unix kernels use
a 'unified buffer-cache' approach: any unused RAM is a disk cache. On Linux
this is dubbed the 'buffercache'.
But any set of operations on a filesystem that uses, only once, an amount
of data comparable to the amount of unused RAM may defeat this caching: the
kernel will try to keep those data in RAM although they will not be used
again soon (the kernel does not 'know' that), probably by ejecting
('swapping out') pages which would soon have been more useful.
Moreover some software has a built-in cache for incoming data, but the
kernel is not aware of that and also caches the very same data in its
unified buffer-cache, thus wasting RAM.
All this 'kills' the cache by filling it up more or less uselessly, leading
to poor performance because of avoidable I/O, or even to thrashing.
To avoid this the kernel has, among other things, to somewhat predict:
- whether data read from or written to a slow device will be used again
  soon by any code running on the machine. For instance:
  - on a box used by a single user we need a way to let the kernel know
    that it must not use the unified buffer-cache to keep chunks of a
    file containing a motion picture read by a multimedia player,
  - the same goes for meta-data read by software doing a one-shot
    exhaustive exploration (for instance updatedb). Note: the sysadmin of
    a box periodically running many such crawling tools can schedule them
    to run simultaneously, so that they share the buffer-cache and reduce
    I/O.
- whether a single program uses the data and caches it itself. For
  instance: on any box we need a way to let the kernel know, maybe after
  some benchmarking, that it must never cache some data files managed by a
  database-server software which itself caches the data. Using a raw
  file system to store the databases is a good way to cope with this
  but often remains difficult.
'Look-ahead' (or "look ahead", "lookahead") is a device feature, for
example managed by logic integrated into most hard disks, which populates a
cache memory located in the drive and not even visible to the operating
system. It can usually only be tweaked with low-level, disk-specific
software, although some ('intelligent') controllers may be able to modify
this parameter.
'Read-ahead' (or "read ahead", "readahead") is a kernel feature. It works
at the block device level and is global (not process-dependent). To tweak
it one may use /sbin/blockdev. The data it reads populates the buffercache,
along with the data resulting from requested read operations.
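A small sketch of the units involved: 'blockdev --getra' and '--setra'
express read-ahead in 512-byte sectors, so converting to KiB makes a value
easier to compare with a cache size. The function name ra_sectors_to_kib and
the device name /dev/sda are assumptions made for the illustration.

```shell
# Convert a read-ahead value given in 512-byte sectors (the unit used by
# 'blockdev --getra' and '--setra') to KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# Typical use, as root (the device name /dev/sda is an assumption):
#   ra_sectors_to_kib "$(/sbin/blockdev --getra /dev/sda)"
#   /sbin/blockdev --setra 256 /dev/sda
```

With this helper, the 32768 sectors used later in this text come out as
16384 KiB, that is 16 MB.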
The disk-integrated cache can theoretically be totally useless, by storing
blocks also stored in the buffercache, but under normal system load this
situation cannot last: if those blocks are often read the buffercache will
try to keep them, therefore they will not be read again from the disk,
therefore they will be ejected from the disk internal cache by other read
operations. Moreover the drive-internal cache can cache writes, potentially
damaging data if the application software 'thinks' that some data is safely
stored when it only resides in the disk buffer, waiting to be written to
disk ("sync'ed"), and then disappears (for example in case of power
failure).
Read-ahead is a global parameter (active at the block device level, for all
processes), and by default active for each process. A piece of software
doing intensive random accesses to a set of data much larger than the
average buffercache size may run quicker and induce less system load if it
informs the kernel that, at least for a given period, it has to avoid
buffercaching (and therefore reading ahead). During some phases of its
activity the same software may need read-ahead but not the buffercache (for
example to burn a DVD-R from an ISO image), and so on.
Applications can inform the kernel
Various syscalls provide means for an application to inform the kernel (for
example the open(2) flags and the 'advise' syscalls mentioned below).
But many developers do not use them, and in some cases they are right
because adequate usage is context-dependent.
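From the shell, one way to see such a hint at work is GNU dd, whose
'iflag=nocache' asks the kernel to discard the cached data of the file being
read (the request is advisory). The sketch below assumes GNU coreutils; the
function name read_once is hypothetical, and it falls back to a plain 'cat'
when dd refuses the flag.

```shell
# Read FILE once to stdout while asking the kernel not to keep it cached.
# Assumes GNU dd; falls back to 'cat' when iflag=nocache is not accepted.
read_once() {
    # count=0 copies nothing: it only checks that the flag is usable here.
    if dd if="$1" iflag=nocache count=0 2>/dev/null; then
        dd if="$1" iflag=nocache bs=1M 2>/dev/null
    else
        cat -- "$1"
    fi
}

# Typical use (the file name is a placeholder):
#   read_once /path/to/big.iso > /dev/null
```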
For each volume (physical and logical: /dev/sd /dev/md...) find the optimal
read-ahead, the one which brings the best sequential access performance, on
both reads and writes because sometimes at least a read is needed
before/after a write (for example at the file system level and with RAID5).
To do so, use a sequential benchmarking tool (for example sdd), flush the
buffercache before each invocation (echo 3 > /proc/sys/vm/drop_caches) and
use 'blockdev' to set the read-ahead. First set it to the maximum (at least
32k sectors, maybe even more on certain setups, for example if the drive
internal cache size is bigger than 16 MB):
/sbin/blockdev --setra 32768 /dev/sd*
then test, for example using:
echo 3 > /proc/sys/vm/drop_caches ; time sdd -onull if=/dev/DeviceName bs=1g count=1 -t
(use 'bs=2g' or even more until the test takes more than 10 seconds), then
reduce the read-ahead as long as doing so does not reduce throughput. Also
test, using 'iseek', the blocks located near the end of the disk.
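The procedure can be sketched as a small plan generator. It only prints the
commands for each read-ahead value, halving from a maximum down to a
minimum, so that the plan can be reviewed before being run as root. The
function names, the halving strategy and the default bounds (32768 and 256
sectors) are assumptions, not a recommendation.

```shell
# Print read-ahead candidates in sectors, halving from MAX down to MIN.
ra_candidates() {
    [ "$2" -ge 1 ] || return 1      # a MIN of 0 would never terminate
    ra=$1
    while [ "$ra" -ge "$2" ]; do
        echo "$ra"
        ra=$(( ra / 2 ))
    done
}

# Print (not run) the commands measuring DEVICE at read-ahead RA.
ra_test_commands() {
    echo "/sbin/blockdev --setra $2 $1"
    echo "echo 3 > /proc/sys/vm/drop_caches"
    echo "time sdd -onull if=$1 bs=1g count=1 -t"
}

# Print the whole plan: ra_test_plan DEVICE [MAX [MIN]]
ra_test_plan() {
    for ra in $(ra_candidates "${2:-32768}" "${3:-256}"); do
        ra_test_commands "$1" "$ra"
    done
}
```

For example 'ra_test_plan /dev/sda' prints one group of three commands per
candidate value. Keep the smallest read-ahead which still gives the best
throughput.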
Beware: an adequate read-ahead (optimal sequential throughput) can lead to
awful performance for any badly-written random-accessing code (because each
access, even for a single block, will trigger read-ahead), therefore all
the code running on the machine has to adequately use the open(2) flags or
the adequate 'advise' syscall. You can spy on programs to check that (using
strace), modify their source code (don't forget to send a patch to the
author!) and even try to piggy-back (LD_PRELOAD) on dynamically-linked
binaries.
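Spying with strace can be sketched as a filter over the trace: it keeps the
lines showing an O_DIRECT open or a call to an 'advise'-style syscall. The
function name io_hints, the traced program 'myprog' and the exact list of
patterns are assumptions. The way strace names these syscalls can vary with
the architecture, so the list may need adjusting.

```shell
# Read an strace log on stdin and print the lines showing that the traced
# program hints the kernel: O_DIRECT opens, fadvise/madvise/readahead calls.
# Exit status is non-zero when no such line is found.
io_hints() {
    grep -E 'O_DIRECT|fadvise64|madvise\(|readahead\('
}

# Typical use (the traced program 'myprog' is a placeholder):
#   strace -f -o /tmp/myprog.trace myprog
#   io_hints < /tmp/myprog.trace || echo "no I/O hints seen"
```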
Adaptive read-ahead
There have been some attempts at adaptive read-ahead.
Write-back cache
The buffercache write-back effect is somewhat useless when the data written
by an application are sequential in nature (no optimization can be gained
by waiting in order to 'group' write requests and reduce head moves) and
neither read nor modified during any given run. A typical case is data
acquisition. In order to reduce or disable the write-back cache effect one
may use a dedicated file system mounted with an adequate (low) value for
the 'commit' option, or even with 'sync'. This does not announce that the
data will not be read again, therefore various other means presented here
may be useful.
Linux: swappiness
Under Linux, tuning 'swappiness' is tempting but it is a system-wide
setting.
There is an efficient 'nice' for the IO requests of a given process (as
opposed to the standard 'nice', which mainly deals with CPU scheduling):
'ionice', which relies on the 'CFQ' IO scheduler (Linux 2.6.12 or newer
required).
Beware: benchmark it before adopting it! On some hardware (especially RAID)
it can slow everything down.
Notes:
- the Debian package's name is 'schedutils'
- for a long time CFQ was only able to manage inputs (reads), not outputs
(writes), I don't know if this is fixed
To read about it please fetch a Linux
kernel and read Documentation/block/switching-sched.txt and
Documentation/block/ioprio.txt
Summary: to see which IO scheduler is active, invoke
cat /sys/block/DEVICE/queue/scheduler
where DEVICE is the device short name (for example 'hda').
To activate CFQ one must, as root, invoke
echo cfq > /sys/block/DEVICE/queue/scheduler
To associate a given process with a given IO scheduling class and priority
index:
sudo ionice -c2 -n2 -p PIDOfProcess
- -c2 means 'class 2' (normal). Class 1 is for 'realtime' (urgent: any IO
  request of the process will preempt any other pending "class 2" or
  "class 3" request, in fact it will be honoured as soon as possible) and
  class 3 is for 'idle' (low-priority: any IO request of the process will
  be preempted by any other pending request from class 1 or 2)
- -n2 means 'intra-class priority index: 2'. There are 8 priorities for
  classes 1 (realtime) and 2 (normal): from 0 (high priority) to 7 (low
  priority). There is no priority index for the 'idle' class
Let's list from the absolute highest priority to the lowest one: c1, n0 ;
c1, n1 ; c1, n2 ; (...) ; c1, n7 ; c2, n0 ; c2, n1 ; (...) ; c2, n7 ; c3
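The ordering above can be written as a tiny function mapping a class and a
priority index to a rank, 0 being served first and 16 last. The name
ioprio_rank is hypothetical and the rank itself is only an illustration of
the list, not something the kernel exposes.

```shell
# Map an ionice class (1..3) and priority index (0..7) to a rank:
# 0 is the absolute highest priority, 16 (class 3, idle) the lowest.
ioprio_rank() {
    case $1 in
        1) echo $(( $2 )) ;;        # realtime: ranks 0..7
        2) echo $(( 8 + $2 )) ;;    # normal: ranks 8..15
        3) echo 16 ;;               # idle: no priority index
        *) return 1 ;;
    esac
}
```

For instance 'ioprio_rank 1 7' gives a smaller rank than 'ioprio_rank 2 0':
the lowest realtime priority still preempts the highest normal one.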
Beware
Danger
The material in this section may be dangerous. Under Linux 2.6.16.1:
- a plain
rmmod cfq_iosched
crashed my box!
- after multiple modprobe/rmmod sequences, I could not modprobe it
anymore:
FATAL: Error inserting cfq_iosched
(/lib/modules/2.6.16/kernel/block/cfq-iosched.ko): Cannot allocate
memory
- even the kernel (2.6.17.4, on a Dedibox) didn't like it during the boot:
Jan 2 15:19:39 www kernel: oom-killer: gfp_mask=0x201d2, order=0
Jan 2 15:19:46 www kernel: <c013e243> out_of_memory+0x123/0x140 <c01401e3> __alloc_pages+0x2b3/0x320
Jan 2 15:19:46 www kernel: <c03e36ce> io_schedule+0xe/0x20 <c01416b6> __do_page_cache_readahead+0x116/0x230
Jan 2 15:19:46 www kernel: <c03e36ce> io_schedule+0xe/0x20 <c03e3cb9> __wait_on_bit_lock+0x59/0x70
Jan 2 15:19:46 www kernel: <c013a150> sync_page+0x0/0x50 <c014189d> max_sane_readahead+0x2d/0x50
Jan 2 15:19:46 www kernel: <c013c3ae> filemap_nopage+0x12e/0x3e0 <c0148916> __handle_mm_fault+0x1f6/0xb30
Jan 2 15:19:46 www kernel: <c0111537> do_page_fault+0x137/0x7af <c0111400> do_page_fault+0x0/0x7af
Jan 2 15:19:46 www kernel: <c0103d0f> error_code+0x4f/0x60
Everything seems to work better when those schedulers are built into the
kernel (i.e. not available as modules), and I did not encounter any problem
with 2.6.20.
Global performance
Such 'favorable' nicing often reduces global throughput because it coerces
the kernel into reacting more quickly (by means of scattered accesses to
the media), and this is often suboptimal in the 'long' term. In a word: the
favored tasks will be quicker by a given factor but most others will be
slowed down by a bigger one.
Setup
To compile an IO scheduler's kernel code invoke 'make menuconfig', choose
'Block layer' then 'IO Schedulers' (in the kernel's .config: search for
'IOSCHED').
Under Linux 2.6 (at least 2.6.12) you may explore by invoking
cat /sys/block/hda/queue/scheduler
Each available IO scheduler name is listed, and the name of the one
currently in use is between brackets.
Modify it (as root) by issuing an 'echo'. The running kernel must have the
scheduler's code loaded, therefore you may have to issue a 'modprobe'
first. Every IO scheduler module name is suffixed with '-iosched'.
Example: as root I invoked
modprobe cfq-iosched && echo cfq > /sys/block/hda/queue/scheduler
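Since the active scheduler is the name between brackets, a script can
extract it with a one-line filter. This is a minimal sketch; the function
name active_scheduler is hypothetical and 'hda' is just the sample device
used above.

```shell
# Read the content of a .../queue/scheduler file on stdin and print the
# name of the active scheduler (the one between brackets).
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Typical use:
#   active_scheduler < /sys/block/hda/queue/scheduler
```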
Using it
Various ideas:
- all shell scripts launching low-priority (cron...) processes will begin
  with an
  ionice -c3 -p$$
  line
- interactive processes will run in the realtime class, at very low
  priority:
  -c1 -n7
- it may be useful to let root's shells preempt even the interactive
  class (in case of emergency, an interactive process must not starve
  root), by placing
  ionice -c1 -n6 -p$$
  in root's '~/.bashrc'. This is not perfect as the bash startup may be
  starved before this line is executed
- best approach: a global setting may let root's login preempt everything,
  and all user processes run by default at -c2 -n4, any user remaining
  free to use ionice (there is a kernel capability)
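The first idea can also be wrapped in a helper which runs a single command
in the idle class instead of re-classing a whole script. This is only a
sketch: the name run_idle is hypothetical, and a failure of ionice (missing
binary, refusal by the kernel) is deliberately ignored so that the command
still runs.

```shell
# Run a command in the 'idle' IO class (-c3). The inner shell re-classes
# itself ($$ is its own PID there), then exec keeps that PID for the command.
run_idle() {
    sh -c 'ionice -c3 -p $$ 2>/dev/null; exec "$@"' sh "$@"
}

# Typical use (from a cron script, for instance):
#   run_idle updatedb
```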
CPU frequency scaling
Avoid using CPU frequency scaling (powernowd or cpufreqd, equivalent ACPI
functions, 'cpufreq_ondemand' kernel module...) if you need maximum
performance (response time).
One may additionally:
sudo renice -3 -p PIDOfProcess
sudo chrt -p 75 PIDOfProcess
Default setup
I don't use any X session manager and invoke the X server as 'x', which is
an alias:
alias x='{ startx -- -dpi 100 -nolisten tcp >/dev/null & } ; clear ; sleep 2 ; sudo ionice -c2 -n1 -p `pgrep xinit -u $USER` ; logout'
Therefore any command invoked in my X session is ionice'd "-c2
-n1".
Here is a quick hack for letting a process run 'interactively' (as
continuously as possible, this is useful for audio and video players).
Create a bash shell script named 'interactive':
#!/bin/bash
# ionice is described in /usr/src/linux/Documentation/block/ioprio.txt
for p in "$@"; do
    for pid in $(pidof "$p"); do
        sudo ionice -c1 -n7 -p "$pid"   # realtime IO class, lowest index
        sudo renice -3 -p "$pid"        # slightly favor CPU scheduling
        sudo chrt -r -p 75 "$pid"       # round-robin realtime policy, priority 75
    done
done
Run 'chmod +x interactive', then place it in a directory named in your
PATH.
If you use bash and "bash completions" you may find this useful in your
~/.bashrc file:
# bash completion for 'interactive'
_interactive()
{
    local cur
    COMPREPLY=()
    cur=${COMP_WORDS[COMP_CWORD]}
    COMPREPLY=( $( compgen -W '$( command ps --user $USER -o comm="" |grep -f ~/bin/utis/interactive_programs-names )' -- $cur ) )
}
complete -F _interactive interactive
In the ~/bin/utis/interactive_programs-names file I list the names of the
programs which are somewhat 'interactive':
mplayer
ogg123
mpg123
vlc
xmms
Usage: invoke the program, then type on a bash prompt "interactive"
followed by the TAB key.
syshint
Proposal: a new 'syshint' command would invoke a binary (the way 'strace'
does) while letting the user give the system some information about its
data-related behavior (for example: will it slurp a given file only once,
or access sparse chunks of it randomly?). 'syshint' would use LD_PRELOAD in
order to wrap some calls, especially open(), and make use of the pertinent
system calls.
A potential 'syshint' command may take care of those parameters.
Example:
syshint --iostrategy="~/mymusic/*/*,READ_ONE_TIME" --cpunice=-3 mpg123 ~/mymusic/*/*.mp3
Moreover a configuration file (/etc/syshint/file_access.conf) may declare
a list of file name glob patterns, the corresponding strategy and, if
necessary, the names of the programs concerned. Sample:
/var/cache/public/DVD/**.iso;O_DIRECT;ionice: -c1 -n4;/usr/bin/\(mplayer\|vlc\|xine\|totem\)
/var/lib/postgres/data/**;O_DIRECT;ionice: -c1 -n5;/usr/lib/postgresql/bin/postmaster
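To make the proposed format concrete, here is how a line of such a file
could be split. The sketch assumes four ';'-separated fields (glob pattern,
strategy, ionice settings, program names), which is my reading of the
sample above; the function names are hypothetical and the format itself is
only a proposal.

```shell
# Print field N (1..4) of a 'pattern;strategy;ionice;programs' line.
hint_field() {
    printf '%s\n' "$1" | cut -d';' -f"$2"
}

# Print the ionice arguments of a line (third field, 'ionice: ' removed).
hint_ionice_args() {
    hint_field "$1" 3 | sed 's/^ionice: *//'
}
```

For the postgres sample line, 'hint_field "$line" 2' yields O_DIRECT and
'hint_ionice_args "$line"' yields '-c1 -n5'.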
We may also/instead use some extended file attributes (setfattr, getfattr)
in order to 'mark' the files with those parameters. This would let those
marks stick to the files more reliably.
One may couple this with some 'loadwatch' or 'idlerun' utility (or even
hack those in order to let them use 'syshint', even safeguarding against
abusive processes by reducing their ionice/nice/...).
Case study: Linux RAID (3Ware and md)
Well, let's go back to the Earth. The road leading to optimization seems
pretty much unexplored.