This work is licensed under a Creative Commons Attribution 2.5 License.

Buffer-caching hints

Any good kernel tries hard to keep frequently referenced memory pages in RAM, for instance anything read from a slow device. Many modern Unix kernels use a 'unified buffer-cache' approach: any unused RAM serves as a disk cache. On Linux this is dubbed the 'buffercache'.

But any set of filesystem operations that use, only once, an amount of data comparable to the amount of unused RAM may defeat this caching: the kernel will try to keep that data in RAM even though it will not be used again soon (the kernel cannot 'know' that), probably by ejecting ('swapping out') pages which will soon be more useful.

Moreover some software has a built-in cache for incoming data, but the kernel is not aware of it and also caches the very same data in its unified buffer-cache, thus wasting RAM.

All this 'kills' the cache by filling it up more or less uselessly, leading to poor performance because of avoidable I/O, or even to thrashing.

To avoid this the kernel has, among other things, to somewhat predict which data will soon be needed:

Read-ahead

'Look-ahead' (or "look ahead", "lookahead") is a device feature, managed for example by logic integrated into most hard disks, which populates a cache memory located in the drive itself, not even visible to the operating system. It can usually only be tweaked with a low-level, disk-specific tool, and some 'intelligent' controllers may be able to modify this parameter.
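
For ATA/SATA disks, hdparm is such a low-level tool; for example (the device name is a placeholder, and some drives ignore the request):

hdparm -A /dev/sda       # show whether the drive's read look-ahead is enabled
hdparm -A0 /dev/sda      # disable it (rarely useful, mainly for testing)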

'Read-ahead' (or "read ahead", "readahead") is a kernel feature: it works at the block device level and is global (not process-dependent). To tweak it one may use /sbin/blockdev. Along with the data resulting from requested read operations, it populates the buffercache.
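
For example, to inspect and change it on a given disk (the value is expressed in 512-byte sectors; the device name is a placeholder):

/sbin/blockdev --getra /dev/sda          # current read-ahead, in 512-byte sectors
/sbin/blockdev --setra 1024 /dev/sda     # set it to 1024 sectors (512 KB)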

The disk-integrated cache could in theory be made totally useless by storing blocks which are also stored in the buffercache, but under normal system load this situation cannot last: if those blocks are often read the buffercache will try to keep them, therefore they will not be read again from the disk, therefore they will be ejected from the disk's internal cache by other read operations. Moreover the drive-internal cache can cache writes, potentially damaging data if the application 'thinks' that a piece of data is safely stored while it only resides in the disk buffer, waiting to be written to the platters ("sync'ed"), and then disappears (for example in case of a power failure).
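
To check whether a given ATA/SATA drive caches writes, and to disable that when data integrity matters more than write throughput (a sketch; the device name is a placeholder and some drives ignore the request):

hdparm -W /dev/sda       # show the state of the drive's write cache
hdparm -W0 /dev/sda      # disable it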

Read-ahead is a global parameter (active at the block device level, for all processes), and is by default active for each process. A piece of software doing intensive random accesses to a set of data much larger than the available buffercache may run quicker and induce less system load by informing the kernel that, at least for a given period, it should avoid buffercaching (and therefore reading ahead). During some phases of its activity the same software may need read-ahead but not the buffercache (for example when burning a DVD-R from an ISO image), and so on...

Applications can inform the kernel

Various syscalls provide means for an application to inform the kernel: the open(2) flags (such as O_DIRECT) and the 'advise' calls (posix_fadvise(2), madvise(2)). But many developers do not use them, and in some cases they are right not to, because adequate usage is context-dependent.
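
Even from the shell one can exercise some of these hints; for example a recent GNU dd exposes O_DIRECT ('direct' flag) and the 'drop this from the cache' advice ('nocache' flag). The file name below is a placeholder:

dd if=/path/to/big.iso of=/dev/null bs=1M iflag=direct     # read while bypassing the buffercache
dd if=/path/to/big.iso of=/dev/null bs=1M iflag=nocache    # read normally, then advise the kernel to drop the cached pages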

Adequate read-ahead

For each volume (physical and logical: /dev/sd*, /dev/md*...) find the optimal read-ahead, the one which brings the best sequential access performance, for both reads and writes, because sometimes at least a read is needed before/after a write (for example at the filesystem level or with RAID5). To do so, use a sequential benchmarking tool (for example sdd), flush the buffercache before each invocation (echo 3 > /proc/sys/vm/drop_caches) and use 'blockdev' to set the read-ahead. First set it to the maximum (at least 32k sectors, maybe even more on certain setups, for example if the drive's internal cache is bigger than 16 MB: /sbin/blockdev --setra 32768 /dev/sd*), then test, for example using 'echo 3 > /proc/sys/vm/drop_caches ; time sdd -onull if=/dev/DeviceName bs=1g count=1 -t' (use 'bs=2g' or even more so that the test runs for more than 10 seconds), then reduce the read-ahead as long as doing so does not reduce the throughput. Also test, using 'iseek', the blocks placed near the end of the disk. Check this sample run.
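
Here is a minimal sketch of such a search loop, assuming plain GNU dd instead of sdd and /dev/sdX as the device under test (run it as root, and adapt the candidate values and the amount read to your hardware):

#!/bin/bash
# measure sequential read throughput for decreasing read-ahead values
DEV=/dev/sdX                                 # device under test (placeholder)
for ra in 32768 16384 8192 4096 2048 1024 512 256; do
    /sbin/blockdev --setra $ra $DEV
    sync
    echo 3 > /proc/sys/vm/drop_caches        # flush the buffercache
    echo -n "readahead=$ra sectors: "
    dd if=$DEV of=/dev/null bs=1M count=4096 2>&1 | tail -n 1
done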

Beware: an adequate read-ahead (optimal sequential throughput) can lead to awful performance for any badly-written random-accessing code (because each access, even for a single block, will trigger read-ahead), therefore all the code running on the machine has to adequately use the open(2) flags or the adequate 'advise' syscall. You can spy on programs for that (using strace), modify their source code (don't forget to send a patch to the author!) and even try to piggyback (LD_PRELOAD) onto dynamically-linked binaries.
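
For example, to see whether a given program opens its files with O_DIRECT or calls the 'advise' syscall (a quick sketch; on x86_64 the syscall is named fadvise64, other architectures may differ, and the player/file names are placeholders):

strace -f -e trace=open,openat,fadvise64 mpg123 music.mp3 2>&1 | grep -E 'O_DIRECT|FADV'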

Adaptive read-ahead

There were some attempts, especially this one.

Write-back cache

The buffercache write-back effect is somewhat useless for some data written by some applications: data of a sequential nature (no optimization can be achieved by waiting in order to 'group' write requests and reduce head movement) which is neither read nor modified during a given run. A typical case is data acquisition. In order to reduce or disable the write-back cache effect one may use a dedicated filesystem mounted with an adequate (low) value for the 'commit' option, or even with 'sync'. This does not announce that the data will not be read again, therefore various other means presented here may remain useful.
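
For example, with an ext3/ext4 filesystem dedicated to acquired data (device and mount point are placeholders):

mount -o noatime,commit=1 /dev/sdb1 /data/acquisition    # commit to disk at most 1 second after a write
mount -o remount,sync /data/acquisition                  # more radical: synchronous writes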

Linux: swappiness

Under Linux, swappiness is tempting but system-wide.
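
For example (the value applies to all processes):

cat /proc/sys/vm/swappiness       # default is usually 60
sysctl -w vm.swappiness=10        # swap out anonymous pages less eagerly, at the expense of the buffercache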

Nice'ing IO

There is an efficient 'nice' for the IO requests of a given process (akin to the standard 'nice', which mainly deals with CPU scheduling): the 'CFQ' IO scheduler (Linux 2.6.12 or newer is required).

Beware: benchmark it before adopting it! On some hardware (especially RAID) it can slow everything down.

Note: the Debian package's name is 'schedutils'.

To read about it please fetch a Linux kernel and read Documentation/block/switching-sched.txt and Documentation/block/ioprio.txt
Summary: to see which IO scheduler is active, invoke cat /sys/block/DEVICE/queue/scheduler, where DEVICE is the device shortname (for example 'hda')
To activate it one must, as root, invoke echo cfq > /sys/block/DEVICE/queue/scheduler

To associate a given process with a given IO scheduling class and priority index: sudo ionice -c2 -n2 -p PIDOfProcess.

-c2
means 'class 2' (normal). Class 1 is 'realtime' (urgent: any IO request of the process will preempt any pending "class 2" or "class 3" request; in fact it will be honoured as soon as possible) and class 3 is 'idle' (low priority: any IO request of the process will be preempted by any pending request from class 1 or 2)
-n2
means 'intra-class priority index: 2'. There are 8 priority indexes for classes 1 (realtime) and 2 (normal): from 0 (high priority) to 7 (low priority). There is no priority index for the 'idle' class
Let's list from the absolute highest priority to the lowest one: c1,n0 ; c1,n1 ; c1,n2 ; (...) ; c1,n7 ; c2,n0 ; c2,n1 ; (...) ; c2,n7 ; c3. A small demonstration follows. Check also What the 'low_latency' knob does.
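
A quick way to feel the effect of the classes (a sketch: it assumes the CFQ scheduler is active on /dev/sdX, reads two distinct regions of the disk, and must be run as root):

ionice -c3 dd if=/dev/sdX of=/dev/null bs=1M count=1024 &                      # 'idle' class reader in the background
time ionice -c2 -n0 dd if=/dev/sdX of=/dev/null bs=1M count=1024 skip=20480    # normal-class reader: barely slowed down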

Global performance

Such 'favorable' nicing often reduces global throughput because it coerces the kernel into reacting more quickly (leading to more scattered accesses to the media), and this is often suboptimal in the 'long' term. In a word: the favored tasks will gain a given amount of time but most others will lose a bigger one.

Setup

To compile an IO scheduler's kernel code invoke 'make menuconfig', choose 'Block layer' then 'IO Schedulers' (in the kernel's .config: search for 'IOSCHED')

Under Linux 2.6 (at least 2.6.12) you may explore the available schedulers: invoke cat /sys/block/hda/queue/scheduler. Each available IO scheduler name is listed; the name of the currently used one is between brackets.

Modify it (as root) by issuing an 'echo'. The running kernel must have the scheduler's code loaded, therefore you may have to issue some 'modprobe'. Any IO scheduler module name is suffixed '-iosched'.

Example: as root I invoked modprobe cfq-iosched && echo cfq > /sys/block/hda/queue/scheduler

Using it

Various ideas:

CPU frequency scaling

Avoid using CPU frequency scaling (powernowd or cpufreqd, equivalent ACPI functions, 'cpufreq_ondemand' kernel module...) if you need maximum performance (response time).
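
For example, to pin every CPU to its highest frequency (a sketch, assuming the cpufreq sysfs interface is available; run as root):

for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $g
done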

Process scheduling attributes

One may additionally:
sudo renice -3 -p PIDOfProcess
sudo chrt -p 75 PIDOfProcess

Default setup

I don't use any X session manager and invoke the X server as 'x', which is an alias:
alias x='{ startx -- -dpi 100 -nolisten tcp >/dev/null & } ; clear ; sleep 2 ; sudo ionice -c2 -n1 -p `pgrep xinit -u $USER` ; logout'
Therefore any command invoked in my X session is ionice'd "-c2 -n1".

"interactive" script

Here is a quick hack for letting a process run 'interactively' (as continuously as possible; this is useful for audio and video players).

Create a bash shell script named 'interactive':


#!/bin/bash
# Give each named program's processes favourable IO, CPU and realtime priorities.
# ionice is documented in /usr/src/linux/Documentation/block/ioprio.txt
for p in "$@"; do
 for i in $(pidof "$p"); do
  sudo ionice -c1 -n7 -p "$i"   # IO: 'realtime' class, lowest index within it
  sudo renice -3 -p "$i"        # CPU: slightly raise the scheduling priority
  sudo chrt -r -p 75 "$i"       # CPU: realtime round-robin policy, priority 75
 done
done

Run chmod +x interactive, then place it in a directory listed in your PATH.

If you use bash and "bash completions" you may find this useful in your ~/.bashrc file:


# bash completion for 'interactive'

_interactive()
{
        local cur

        COMPREPLY=()
        cur=${COMP_WORDS[COMP_CWORD]}

        COMPREPLY=( $( compgen -W '$( command ps --user $USER  -o comm="" |grep -f ~/bin/utis/interactive_programs-names )' -- $cur ) )

}

complete -F _interactive interactive


In the ~/bin/utis/interactive_programs-names file I list the names of the programs which are somewhat 'interactive':
mplayer
ogg123
mpg123
vlc
xmms

Usage: invoke the program, then, at a bash prompt, type "interactive" followed by the TAB key.

Also check C. Kolivas' smart "toolsched" approach.

syshint

Proposal: a new 'syshint' command would invoke a binary (the way 'strace' does) while letting the user give the system some information about the program's data-related behavior (for example: will it slurp a given file only once, or access sparse chunks of it randomly?). 'syshint' would use LD_PRELOAD in order to re-implement some syscalls, especially open(), so as to make use of the pertinent system calls. It may cooperate with other utility software, such as cpulimit.

A potential 'syshint' command could take such parameters into account. Example: syshint --iostrategy="~/mymusic/*/*,READ_ONE_TIME" --cpunice=-3 mpg123 ~/mymusic/*/*.mp3

Moreover a configuration file (/etc/syshint/file_access.conf) could declare a list of filename glob patterns, the corresponding strategy and, if necessary, the names of the concerned programs. Sample:


/var/cache/public/DVD/**.iso;O_DIRECT;ionice: -c1 -n4;/usr/bin/\(mplayer\|vlc\|xine\|totem\)
/var/lib/postgres/data/**;O_DIRECT;ionice: -c1 -n5;/usr/lib/postgresql/bin/postmaster

We may also (or instead) use extended file attributes (setfattr, getfattr) in order to 'mark' the files with those parameters. This would let those marks stick better to the files.
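
For instance (a sketch: the 'user.syshint.*' attribute names are hypothetical, and the filesystem must support extended attributes):

setfattr -n user.syshint.iostrategy -v READ_ONE_TIME /var/cache/public/DVD/some.iso
getfattr -n user.syshint.iostrategy /var/cache/public/DVD/some.iso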

One may couple this to some 'loadwatch' or 'idlerun' utility (or even hack those in order to let them use 'syshint', even safeguarding against abusive processes by reducing their ionice/nice/...).

Work in progress

pagecache-tools, pagecache-management.

Case study: Linux RAID (3Ware and md)

Well, let's come back down to Earth. The road leading to optimization seems pretty much unexplored.
