
This work is licensed under a Creative Commons Attribution 2.5 License.
On the RAID side I compared 3ware integrated functions to the 'md' Linux driver (beware: 'md' is not 'dm', the device mapper, which among other things drives ATA "fake RAID" sets through dmraid).
This induced major performance-related problems. This document describes this process and offers hints, especially useful to the beginner, when it comes to measuring or enhancing disk performance under Linux.
Interested? Concerned? Comment or suggestion? Drop me a note!
Please let me know which problem you were able to fix, especially if it is not listed here.
It managed 6 IBM/Hitachi Deskstar HDS725050KLA360 spindles (7200 rpm, 16 MB cache, average seek time: 8.2 ms).
It was replaced by a 9650SE-16ML and we added 6 drives: Western Digital Caviar GP 1 TB WD10EACS (5400 rpm(?), 8.9 ms(?)).
In this document any information not starting with "9650" refers to the 9550SX-12.
I'm disappointed (this is mitigated by the evident willingness of the 3ware support service to help) and, for now, cannot recommend any 3ware controller. The controller may not be the culprit, but if some component (mainboard, cable, hard disk...) does not work optimally, why is this high-end controller unable to warn about it? Why is it able to obtain fairly good sequential performance?
Stripe size. The size of the data written to each disk drive in RAID unit levels that support striping.
'man md' (version 2.5.6) uses the term 'stripe' and explains:
The RAID0 driver assigns the first chunk of the array to the first device, the second chunk to the second device, and so on until all drives have been assigned one chunk. This collection of chunks forms a stripe. ((...)) When any block in a RAID4 array is modified the parity block for that stripe (i.e. the block in the parity device at the same device offset as the stripe) is also modified so that the parity block always contains the "parity" for the whole stripe.
Moreover many intensive I/Os suck the commanding processor into 'iowait' state. This is annoying, not critical.
After tweaking I obtained ~140 random read IOPS (iozone) on 64k blocks and no more iowait pits. Replacing the RAID5 by a RAID10 (on a 9650, but I bet that a 9550 offers the same) offered a tremendous improvement: ~310 IOPS (randomio mix), 300 MB/s sequential read, ~105 MB/s sequential write.
I now test 'md' on top of isolated disks ('single', in 3ware terminology: RAID is not done by the controller), obtaining with randomio mix (same parameters as above) a realistic ~450 (randomio, raw device) and ~320 (filesystem level) mixed (20% write) IOPS. With 4 more drives this RAID10 'md' made of 10 heterogeneous disks realistically (all requests served in less than 1/2 second, average ~27 ms) tops at ~730 IOPS (randomio mix, XFS).
The number of interrupts and context switches may induce CPU load, while bus load is not negligible because, on a software RAID1, every block written has to be sent to at least two spindles, while the RAID hardware controller needs only a single copy then writes it to as many spindles as necessary.
A 3ware controller is AFAIK optimized for streaming, but I fail to understand how this bias can hurt random I/O performance this much. The bottom line, according to some, is that 3ware was sold (in 2006) to AMCC, which may be less interested in the Linux market.
RAID5 slows write operations down because it must write the 'parity' block, which is an XOR of the contents of all data blocks of the stripe (Reed-Solomon-style codes come into play with RAID6, for its second redundancy block). In order to maintain it, even a single block write must be preceded by reading the old block and the existing parity block (this is the "small write problem"). It seems OK to me because my application mainly does read operations, moreover RAID5 eats up a smaller amount of disk space (for 'parity' information) than RAID1 (mirroring), leaving more disk space for data.
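The parity arithmetic described above can be illustrated with a small sketch. It is only a toy model: each "block" is reduced to a single integer, and the function names (`xor_parity`, `update_parity`) are made up for this illustration, they do not come from any RAID stack.

```shell
# Toy model: each block of a stripe is represented by one integer.

# Parity of a stripe: XOR of all its data blocks.
xor_parity() {
    p=0
    for block in "$@"; do
        p=$((p ^ block))
    done
    echo "$p"
}

# "Small write": the new parity only needs the old parity, the old data
# and the new data, i.e. two reads (old data, old parity) before two writes.
update_parity() {   # old_parity old_data new_data
    echo $(( $1 ^ $2 ^ $3 ))
}

# Example stripe made of three data blocks.
parity=$(xor_parity 165 60 15)
# Overwrite the second block (60 becomes 255) without reading blocks 1 and 3.
new_parity=$(update_parity "$parity" 60 255)
echo "parity=$parity new_parity=$new_parity"
```

The same XOR also rebuilds a lost block: `xor_parity 165 15 "$parity"` gives back the missing 60, which is what a degraded RAID5 has to do on every read that hits the failed spindle.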
RAID5 may not be a clever choice, many experts prefer RAID10 which can simultaneously serve up to a request per spindle and performs better in degraded mode (albeit some wrote that RAID5 already does that). The XFS filesystem is also often judged more adequate than ext3 for huge filesystems. Any information about this, especially about how the 3ware 9550 manages this, will be welcome.
AFAIK the larger the stripe size, up to the average size (or harmonic mean?) of data read per single request, the better it is for random access because the average number of spindles mobilized during a single read operation will be low, meaning that many spindles will be able to simultaneously seek for data. Allocating too lengthy a stripe size may lower IOPS since uselessly large blocks are then moved around.
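To make the reasoning about stripe size more concrete, here is a rough sketch counting how many spindles a single request mobilizes. It assumes a plain striped layout where consecutive chunks sit on consecutive disks (parity rotation and mirroring are ignored), and `spindles_touched` is a hypothetical helper name.

```shell
# Number of spindles a request touches on a plain striped layout.
# Arguments: request offset (KB), request size (KB), chunk size (KB), disks.
spindles_touched() {
    first=$(( $1 / $3 ))                 # index of the first chunk touched
    last=$(( ($1 + $2 - 1) / $3 ))       # index of the last chunk touched
    chunks=$(( last - first + 1 ))
    # A request cannot mobilize more spindles than the array has.
    if [ "$chunks" -gt "$4" ]; then chunks=$4; fi
    echo "$chunks"
}

spindles_touched 0 64 64 6     # aligned 64 KB read, 64 KB chunks
spindles_touched 0 64 16 6     # same read, 16 KB chunks
spindles_touched 32 64 64 6    # misaligned read, 64 KB chunks
```

With these assumptions the aligned 64 KB read stays on one spindle when the chunk is 64 KB, but occupies four spindles with 16 KB chunks, and a misaligned request spills over onto a second spindle.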
Is it that the 3ware controller checks parity on each access, leading to reading at least a block on each disk (a 'stripe' + parity) wherever a single-block read is sufficient? Mobilizing all disks for each read operation leads to an access time equal to the access time of the slowest drive (the 'mechanically slowest' or the one doing the longest seek), which may explain our results. I don't think so as there is no hint about this in the user's manual, moreover the corresponding checking (tw_cli 'show verify', 'start verify'... commands) would be only useful for sleeping data, as such faults will be detected on-the-fly during each read, but it can be a bug side-effect. A reply from the kind 3ware customer support service ("if any one of your disks in RAID has slower performance compared to other, it will result in poor performance") seems weird... I then asked if "does any read ... lead to a full-stripe read, even if the amount of data needed is contained in a single disk block? In such a case will switching to RAID10 lead to major gain in random read performance (IOPS)?" and the reply was "the above mentioned statement is common (not only for RAID5)".
At this point it appears that some bug forbids in any/most cases, at least with my (pretty common) setup, reading simultaneously different blocks on different spindles. I asked the support service: "Is the 9550 able, in RAID5 mode, to sort its request queue containing mainly single-disk-block read requests in order to do as many simultaneous accesses to different spindles as possible? If not, can it do that in RAID10 mode?". (Is anyone happy with the random I/O performance of a 3ware 9xxx under FreeBSD or MS-Windows? Which RAID level and hard disks do you use?)
The controller seems able to do much better on a RAID10 array, but my tests did not confirm this.
The existing data block was often already read by an application request, leaving it in a cache. But the common write-postponing strategy (for example when using a BBU: battery backup unit, which 'commits' into battery-backed memory instead of disk, leaving the real disk-write to a write-back cache) may induce a delay long enough for it to vanish from the cache, implying a new disk read.
Moreover I wonder if some efficient approaches are used by software or hardware RAID stacks.
Here are some details, as seen by Linux with the 9550 and 6 Hitachi on 3ware's RAID5:
As for the IO queue scheduler ("I/O elevator"), I prefer using CFQ because it theoretically enables the box to do database serving at high priority along with some low-priority batches also grinding the RAID (but ONLY when the database does not use it). Beware: use ionice if there are other processes requesting disk IO, for example by invoking your critical software with 'ionice -c1 -n7 ...', then check by 'ionice -p ((PID))'. Keep a root shell (for example under 'screen') at 'ionice -c1 -n4 ...', just to have a way to control things if a process in the realtime I/O scheduler class runs amok.
Use iostat, for example, with '-mx DeviceName 1'. All columns are useful (check '%util'!), read the manpage.
Devise your standard set of tests, with respect to your objectives. It can be a shell script invoking the test tools and doing the housekeeping work: dumping all current disk subsystem parameters, emptying the caches, keeping the results and a timestamp in a file... (not on the tested volume!). It may accept an argument disabling, for the current run, some types of tests.
You will run it, then tweak just a single parameter, then launch it again and assess the results, then tweak again, assess, tweak, assess... From time to time you will formulate a hypothesis and try to devise a sub-test validating it.
We don't want any other use of the disk subsystem during the test, therefore:
Linux always uses the RAM not used by programs
in order to maintain a disk cache. To shorten the running time of dedicated
testing tools we may reduce the buffercache size because we are interested
in DISK performance, not in buffercache efficiency (overall system
optimization is better done at application level), therefore the system has
to have as little RAM as possible. I use the kernel boot parameter "mem=192M"
(a reboot is needed. To check memory usage invoke 'free'). "192M" is
adequate for me, but the adequate value depends upon many parameters,
establish it empirically by checking that, when all filesystems to be
tested are mounted and without any test running, your system does not use
the swap and that "free" shows that 'free', 'buffers' and 'cached' cumulate
at least 20 MB. Beware: fiddling with this can hang your system or put it
into a thrashing nightmare, so start high then gradually test
lower values. When not enough RAM is available during boot some daemons
launched by init may lead to swap-intensive activity (thrashing), to the point of
apparent crash. Play it secure by disabling all memory consuming daemons
then testing with "mem" set at 90% of your RAM, rebooting, mounting all
filesystems then checking, then reducing again and testing, and so on.
Don't shrink accessible memory to the point of forbidding the system to
use some buffercache for metadata (invoke 'ls' two times in a row, if the
disk is physically accessed each time there is not enough RAM!). Moreover
be aware that such constrained memory may distort benchmarks (this is for
example true for the 3ware controller), so only use it to somewhat reduce
(reasonably!) available RAM.
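The "at least 20 MB of free + buffers + cached" check described above can be automated. The sketch below is one possible way to do it: it reads the figures from /proc/meminfo-style input rather than from the 'free' output, because the column layout of 'free' changes between versions. The helper names and the default threshold variable are assumptions made for this example.

```shell
# Sum MemFree + Buffers + Cached (in kB) from /proc/meminfo-style text on stdin.
cache_headroom_kb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { print sum + 0 }'
}

# Succeeds when the headroom read on stdin reaches the threshold
# (first argument, in kB; defaults to 20480 kB, i.e. the 20 MB rule of thumb).
enough_headroom() {
    min_kb=${1:-20480}
    kb=$(cache_headroom_kb)
    [ "$kb" -ge "$min_kb" ]
}

# Typical use, once all filesystems to be tested are mounted:
if [ -r /proc/meminfo ]; then
    echo "headroom: $(cache_headroom_kb < /proc/meminfo) kB"
fi
```

Run `enough_headroom < /proc/meminfo || echo "raise mem="` after each reboot with a lower "mem" value, and stop lowering it as soon as the check fails or swap gets used.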
The '/sbin/raw' command (Debian package 'util-linux') offers a way to bypass the buffercache (and therefore the read-ahead logic), but is not flexible and can be dangerous. I didn't use it for benchmarking purposes. Using the O_DIRECT flag for open(2) is much more efficient.
Disable the swap ('swapoff -a'), in order to avoid letting Linux use it to store pages allocated but rarely used (it does this in order to obtain memory space for the buffercache, and we don't want it!).
Disabling the controller cache (example with tw_cli, for the controller 0 - unit 1: "/c0/u1 set cache=off") is not realistic because its effect, especially for write operations (it groups the blocks), is of paramount importance.
03:00.0 RAID bus controller: 3ware Inc 9650SE SATA-II RAID (rev 01)
Subsystem: 3ware Inc 9650SE SATA-II RAID
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 16
Region 0: Memory at ca000000 (64-bit, prefetchable) [size=32M]
Region 2: Memory at c8200000 (64-bit, non-prefetchable) [size=4K]
Region 4: I/O ports at 2000 [size=256]
[virtual] Expansion ROM at c8220000 [disabled] [size=128K]
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/5 Enable-
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v1) Legacy Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <128ns, L1 <2us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal+ Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x8, ASPM L0s L1, Latency L0 <512ns, L1 <64us
ClockPM- Suprise- LLActRep+ BwNot-
LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
Capabilities: [100] Advanced Error Reporting <?>
Kernel driver in use: 3w-9xxx
Kernel modules: 3w-9xxx
You need to make sure that the nr_requests (kernel request queue) is at least twice queue_depth (hardware requests queue). Also the deadline or cfq i/o schedulers work a bit better for database-like workloads. Try something like this, replacing sda with the device name of your 3ware controller.
# Limit queue depth somewhat
echo 128 > /sys/block/sda/device/queue_depth
# Increase nr_requests
echo 256 > /sys/block/sda/queue/nr_requests
# Don't use 'as' (the anticipatory scheduler) for database-like loads
echo deadline > /sys/block/sda/queue/scheduler
CFQ seems to like larger nr_requests, so if you use CFQ, try 254 (maximum hardware size) for queue_depth and 512 or 1024 for nr_requests. Oh, remember, if you have just created a RAID array on the disks, wait with testing until the whole array has been rebuilt.
He also wrote:
There are a couple of things you should do:
1. Use the CFQ I/O scheduler, and increase nr_requests:
   echo cfq > /sys/block/hda/queue/scheduler
   echo 1024 > /sys/block/hda/queue/nr_requests
2. Make sure that your filesystem knows about the stripe size and number of disks in the array. E.g. for a raid5 array with a stripe size of 64K and 6 disks (effectively 5, because in every stripe-set there is one disk doing parity):
   # ext3 fs, 5 disks, 64K stripe, units in 4K blocks
   mkfs -t ext3 -E stride=$((64/4))
   # xfs, 5 disks, 64K stripe, units in 512 bytes
   mkfs -t xfs -d sunit=$((64*2)) -d swidth=$((5*64*2))
3. Don't use partitions. Partitions do not start on a multiple of the (stripe_size * nr_disks), so your I/O will be misaligned and the settings in (2) will have no or an adverse effect. If you must use partitions, either build them manually with sfdisk so that partitions do start on that multiple, or use LVM. ((editor's note: beware!))
4. Reconsider your stripe size for streaming large files. If you have say 4 disks, and a 64K stripe size, then a read of a block of 256K will busy all 4 disks. Many simultaneous threads reading blocks of 256K will result in thrashing disks as they all want to read from all 4 disks .. so in that case, using a stripe size of 256K will make things better. One read of 256K (in the ideal, aligned case) will just keep one disk busy. 4 reads can happen in parallel without thrashing. Esp. in this case, you need the alignment I talked about in (3).
5. Defragment the files. If the files are written sequentially, they will not be fragmented. But if they were stored by writing to thousands of them appending a few K at a time in round-robin fashion, you need to defragment.. in the case of XFS, run xfs_fsr every so often.
-=-=-=
The internal queue size of some 3ware controllers (queue_depth) is larger than the I/O scheduler's nr_requests so that the I/O scheduler doesn't get much chance to properly order and merge the requests.
I've sent patches to 3ware a couple of times to make queue_depth writable so that you can tune that as well, but they were refused for no good reason AFAICS. Very unfortunate - if you have 8 JBOD disks attached, you want to set queue_depth for each of them to (max_controller_queue_depth / 8) to prevent one disk from starving the other ones, but oh well.
Read the Linux Hardware RAID Howto
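The rule quoted above (nr_requests at least twice queue_depth) is easy to verify before each test run. Here is a minimal sketch; `check_queue_ratio` is a hypothetical helper, and its optional second argument (an alternate sysfs root) only exists to let the function be tried against a scratch directory instead of the live /sys.

```shell
# Check that the kernel queue (nr_requests) is at least twice as deep as
# the hardware queue (queue_depth) for one block device.
# Arguments: device name (e.g. sda), optional sysfs root (default /sys).
check_queue_ratio() {
    dev=$1
    root=${2:-/sys}
    depth=$(cat "$root/block/$dev/device/queue_depth") || return 2
    nr=$(cat "$root/block/$dev/queue/nr_requests") || return 2
    if [ "$nr" -ge $((2 * depth)) ]; then
        echo "$dev: nr_requests=$nr queue_depth=$depth: ok"
    else
        echo "$dev: nr_requests=$nr is less than twice queue_depth=$depth"
        return 1
    fi
}
```

For example `check_queue_ratio sda` returns non-zero (and says why) when the elevator has too little room to sort and merge requests before the controller grabs them.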
The 'su' parameter in XFS is equivalent to the "stripe" setting in 3ware's BIOS and tw_cli, so that a "stripe" setting of "64KB" in the 3ware setup is matched with "-d su=64k" on the mkfs.xfs command line. The 'sw' parameter is expressed as a multiple of 'su' and is the number of data disks within the RAID group: n-1 for a RAID5, n-2 for a RAID6 and n/2 for a RAID10 (with a 64KB stripe the resulting stripe width is therefore (n-1)*64 KB, (n-2)*64 KB and (n/2)*64 KB respectively). The man page for mkfs.xfs is a bit confusing as to how these settings work exactly, and the 'xfs_info' output is further confusing as it uses a different block size for displaying its units.
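The geometry arithmetic discussed in this section can be gathered in one place. The sketch below prints the filesystem options for a given array and checks a partition start against the full stripe; it assumes that mkfs.xfs takes 'sw' as a multiple of 'su' and 'sunit'/'swidth' in 512-byte sectors (as its man page documents), that RAID10 counts n/2 data disks, and that the "nr_disks" of the alignment rule means data disks. The function names are invented for this example.

```shell
# Number of data-bearing disks for a RAID level and a total disk count.
data_disks() {   # raid_level total_disks
    case "$1" in
        0)  echo "$2" ;;
        5)  echo $(( $2 - 1 )) ;;
        6)  echo $(( $2 - 2 )) ;;
        10) echo $(( $2 / 2 )) ;;
        *)  echo "unsupported RAID level: $1" >&2; return 1 ;;
    esac
}

# Print mkfs options matching the array geometry.
# Arguments: raid_level total_disks chunk_kb [fs_block_kb, default 4]
mkfs_geometry() {
    n=$(data_disks "$1" "$2") || return 1
    chunk=$3
    blk=${4:-4}
    echo "ext3: -E stride=$(( chunk / blk ))"
    echo "xfs: -d su=${chunk}k,sw=${n}"
    echo "xfs: -d sunit=$(( chunk * 2 )),swidth=$(( n * chunk * 2 ))"
}

# Succeeds when a partition starts on a full-stripe boundary.
# Arguments: start sector (512-byte units), chunk_kb, data disks.
partition_aligned() {
    stripe_bytes=$(( $2 * 1024 * $3 ))
    [ $(( $1 * 512 % stripe_bytes )) -eq 0 ]
}

mkfs_geometry 5 6 64
```

The start sector of an existing partition can be read from /sys/block/sda/sda1/start (adapt the names), so `partition_aligned "$(cat /sys/block/sda/sda1/start)" 64 5` tells whether the traditional start at sector 63 has bitten you.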
Most hardware-oriented hints published below are geared towards the 3ware (and probably apply to any similar hardware).
Let's proceed with the tweaking session.
# cd /dev
# ln -s md1 md_RAID10_6Hitachi
Caution: some tests write on the disk, therefore any device storing precious data must be preserved. Do a backup, verify it, then (if it is not on a volume that must be writable for the system to run) protect the device using 'blockdev --setro /dev/DeviceName', then mount it in read-only mode: 'mount -o ro ...'.
Beware: those tools alone are not adequate for hardware RAID (3ware) because they only show what Linux knows, and an intelligent controller acts behind the curtains. Therefore, at least when using hardware RAID, check the array status (output of "./tw_cli show": the column 'NotOpt' must contain 0; if it contains anything else gather intelligence using 'show all') in order to be quite sure that there is no housekeeping ('init', 'rebuild', 'verify'...) running (the INIT phase of a RAID5 made of six 500 GB spindles (2.3 TB available), doing nothing else, lasts ~7 hours).
If you use the 'md' software RAID check that all spindles are OK and that there is no rebuilding in action, for example with 'mdadm -D /dev/md0'. The best way to do it is to verify that all the drives blink when used, then to stay in front of the machine during the whole testing session and stop testing from time to time, but this is pretty inconvenient.
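For 'md' arrays the check can also be scripted from /proc/mdstat, which is less tiring than watching the LEDs. This is a rough sketch under the assumption that the kernel reports background work with lines such as "resync = 12.6%" or "recovery = ..." and marks a missing member with "_" in the "[UU_]" status field; the helper names are made up.

```shell
# Succeeds when the mdstat text on stdin shows background work
# (resync, recovery, reshape or check), whether running or merely queued.
md_busy() {
    grep -Eq '(resync|recovery|reshape|check) *='
}

# Succeeds when at least one array on stdin has a missing member ("_").
md_degraded() {
    grep -Eq '\[[U_]*_[U_]*\]'
}

# Typical use before launching a benchmark:
if [ -r /proc/mdstat ]; then
    md_busy < /proc/mdstat && echo "md housekeeping running, wait before testing"
    md_degraded < /proc/mdstat && echo "degraded array, results will not be representative"
fi
```

Calling both helpers at the start of the test script (and refusing to run when either succeeds) avoids benchmarking an array that is silently rebuilding.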
No data is read or written during a disk head move, therefore head movement reduces throughput. Serving immediately each request often leads to frequent head back-and-forth movements. An adequate disk subsystem therefore reduces this by retaining for a short period all pending requests in order to sort them, grouping requests concerning the same disk physical zone. In an ideal world requests will be retained for a very long time then the heads will do a single sweep through the platters surface, collecting data at full speed. Retaining requests while the disk is used raises the global throughput, however retaining them for too long adds service latency (the user is waiting!), and not enough retention leads to a disk often spinning without reading or writing data because its heads are in movement (this is the worst case: a few requests will be very quickly served while most will wait for quite a long time (head movements), moreover the global throughput will be bad). An adequate disk subsystem balances all this with respect to the current disk activity and nature of the requests (urgent ones are served as fast as possible, most are served reasonably fast and grouped, while some are only done if there is nothing else waiting).
Note that, for all this to work as efficiently as possible, some pieces of software must know the real disk geometry along with head location. The operating system elevator often cannot access this information.
Some applications are not able to saturate a given disk subsystem, for example because all needed data is always in some cache, or because their architecture limits their request throughput (they send a read request, wait for it to complete, send another one, wait... and so on). Most recent code (for example database-managing code), however, serves multiple concurrent requests thanks to multiprocessing (for example through forking a subprocess for each new connected client) or multithreading (which, for our present concerns, behaves like multi-processing) or aio (asynchronous I/O). It puts the disk subsystem under intense pressure.
When tuning for such applications we need to simulate this very type of system load. Read "Random I/O: saturating the disk subsystem" in the next section.
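The benefit of sorting pending requests can be shown with a toy computation of head travel. This is only an illustration: positions are abstract "track numbers", real elevators work on logical block addresses and also bound the waiting time, and `head_travel` is a name chosen for this sketch.

```shell
# Total head travel when requests are served in the order given.
# Arguments: starting head position, then the requested positions.
head_travel() {
    pos=$1; shift
    total=0
    for target in "$@"; do
        if [ "$target" -ge "$pos" ]; then
            d=$(( target - pos ))
        else
            d=$(( pos - target ))
        fi
        total=$(( total + d ))
        pos=$target
    done
    echo "$total"
}

# Five pending requests, head initially at position 50.
fifo=$(head_travel 50 10 90 20 80 30)
# Same requests, retained then served in a single ascending sweep.
sorted=$(head_travel 50 $(printf '%s\n' 10 90 20 80 30 | sort -n))
echo "arrival order: $fifo, sorted: $sorted"
```

In this example the sorted sweep moves the head over 120 positions instead of 300, at the price of making the request for position 90 wait until the end of the sweep: the throughput/latency trade-off described above.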
My application manages approx 800 GB (800000 MB)
The ratio is (800000.0 / 12900): approx 62
During tests: total core memory limited to 96 MB, controller cache 224 MB, 6 disks with 16 MB each: 416 MB total. One may also disable as many caches as possible:
Find the proper size for any testing program to be used. The 'iozone' test software, for example, offers a '-s' parameter. For a test to deliver a sound result it must run at the very least for a wallclock minute, in order to smooth out various small perturbations (mainly from the system and internal disk housekeeping). Use various values until the results stabilize across multiple runs:
#RAM available to Linux reduced thanks to the "mem" kernel parameter
$ free
total used free shared buffers cached
Mem: 92420 62812 29608 0 1524 16980
-/+ buffers/cache: 44308 48112
Swap: 0 0 0
Note: this particular test does not need multi-processing or
multi-threading as a single thread simulates adequately the load and saturates the disk subsystem.
#Using a 4 GB testfile:
$ time iozone -f iozone.tmp -w -i0 -i2 -p -e -o -r64k -m -O -S 2048 -s 4g
((...))
random random
KB reclen write rewrite read reread read write
4194304 64 738 1231 114 145
real 19m24.447s
user 0m5.876s
sys 0m42.395s
#This is realistic and repeatedly running it leads to very similar
#results. This will be our reference values. Let's see if we can reduce
#test duration while preserving results quality
#Using a 10 MB testfile:
$ iozone -f iozone.tmp -w -i0 -i2 -p -e -o -r64k -m -O -S 2048 -s 10m
((...))
random random
KB reclen write rewrite read reread read write
10240 64 788 775 13790 493
#Comment: WAY too different from the reference, this server just cannot
#(physically) give 13000 random read/second. The write performance is
#probably boosted by the controller disk cache ('-o' can't disable it), the
#read performance by the buffercache
#Using a 400 MB testfile:
$ time iozone -f iozone.tmp -w -i0 -i2 -p -e -o -r64k -m -O -S 2048 -s 400m
((...))
random random
KB reclen write rewrite read reread read write
409600 64 751 1300 138 173
real 1m37.455s
user 0m0.652s
sys 0m4.392s
#The test did not run long enough
#Using a 800 MB testfile:
$ time iozone -f iozone.tmp -w -i0 -i2 -p -e -o -r64k -m -O -S 2048 -s 800m
((...))
random random
KB reclen write rewrite read reread read write
819200 64 704 1252 116 152
real 3m42.477s
user 0m1.220s
sys 0m9.393s
#Seems good.
#Another run produced:
819200 64 729 1293 114 155
real 3m42.234s
user 0m1.180s
sys 0m9.397s
#Another one:
819200 64 711 1283 113 153
real 3m44.281s
user 0m1.120s
sys 0m9.629s
#800 MB is OK
Here are some dedicated tools:
Beware: on a RAID always use the '-t' parameter, in order to saturate the disk subsystem. Moreover iozone is picky about its argument format. Check that it understood your command line arguments by reading its output header (summary of the operation mode).
One may try using iozone's '-I' option in order to forbid using the OS buffercache but I'm not sure it works as intended because it seems to forbid some parallelization.
In order to preserve the parameter values I wrapped it in a stupid shell script 'test_io' which gathers the parameters then runs the test, enabling me to redirect its output to a file which will store the results. It was later automated with a lame Perl script testing all combinations of a given set of parameter values: 'test_io_go.pl'. Please only use them if you understand what they do and how. Note that they only run under an account authorized to 'sudo' anything without typing a password.
Here are the best results of a 3ware RAID5 (6 Hitachi), slightly edited for clarity: one, two, three. Here is one of the less convincing.
(TODO) (read-ahead parameter: 'test_io' queries it through /sbin/blockdev, but 'test_io_go.pl' sets it by writing into /sys/block/sda/queue/read_ahead_kb. They use different units! Adapt 'test_io')
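The unit mismatch mentioned in this TODO is a plain factor of two: 'blockdev --getra' reports read-ahead in 512-byte sectors while read_ahead_kb is in kilobytes. A possible pair of conversion helpers (hypothetical names, not part of 'test_io') is sketched here.

```shell
# 'blockdev --getra' value (512-byte sectors) to read_ahead_kb units.
ra_sectors_to_kb() {
    echo $(( $1 / 2 ))
}

# read_ahead_kb value to the unit expected by 'blockdev --setra'.
ra_kb_to_sectors() {
    echo $(( $1 * 2 ))
}

ra_sectors_to_kb 256    # the usual default of 256 sectors is 128 KB
```

For instance `ra_sectors_to_kb "$(/sbin/blockdev --getra /dev/sda)"` should print the same number as 'cat /sys/block/sda/queue/read_ahead_kb'.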
(TODO) As MySQL may do aio (async I/O) and pwrite/pread let's try the corresponding iozone options:
-H # Use POSIX async I/O with # async operations
-k # Use POSIX async I/O (no bcopy) with # async operations
... but which one? I dropped the idea because, while using them, Linux did not see
any aio (cat /proc/sys/fs/aio-nr)! Moreover it reduced the IOPS by approx
25%.
Published along with the How fast is your disk? LinuxInsight article.
I had to modify it, mainly to enable testing of huge devices and in order to enhance random coverage. Here are the diff and the source.
Compile then run:
$ gcc -O6 -Wall seekerNat.c -o seekerNat
$ sudo ./seekerNat /dev/sda
Results: best: 14.7 ms, average 20, up to 25. The drive's data sheet states an average seek time of 8.2 ms, therefore there is a quirk.
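When comparing such measurements with a data sheet, keep in mind that a random access costs a seek plus, on average, half a platter rotation. The sketch below computes this rough lower bound; it ignores command overhead and transfer time, and `avg_access_ms` is a name invented for the example.

```shell
# Rough average random access time: average seek + half a rotation.
# Arguments: average seek time (ms), rotation speed (rpm).
avg_access_ms() {
    awk -v seek="$1" -v rpm="$2" \
        'BEGIN { printf "%.1f\n", seek + 60000 / rpm / 2 }'
}

avg_access_ms 8.2 7200
```

For the Hitachi drives (8.2 ms, 7200 rpm) this gives about 12.4 ms, so the best result above (14.7 ms) is not far from the mechanical limit of a single drive, while the 20 to 25 ms figures remain unexplained by this simple model.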
During a test session don't forget the periodical checks.
Reduce the amount of RAM available to Linux ("mem" kernel parameter):
# free
total used free shared buffers cached
Mem: 254012 149756 104256 0 1652 86052
-/+ buffers/cache: 62052 191960
Swap: 7815580 120 7815460
Use a representative file (create it with dd):
# ls -lh testfile
-rw-rw---- 1 root root 278G 2008-02-27 12:24 testfile
# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/md0 2.3T 279G 2.1T 12% /mnt/md0
We will use the 'randomio' software tool, varying the number of threads and the volume of data for each I/O request ("I/O size"). You may also check, while the test runs (in another terminal), a continuously running 'iostat -mx 1', especially the last ('%util') column: when most disks are saturated (100%) there is no more room for parallelization. Example:
for t in 1 16 64 128 256 512 1024 2048 ; do sync ; echo 3 > /proc/sys/vm/drop_caches ; echo $t; randomio testfile $t 0.2 0.1 $((65536*10)) 60 1 ; done
The template used here is "randomio testfile ((Threads)) 0.2 0.1 ((I/O Size)) 60 1" (20% write, 10% write sync'ed, 60 seconds/run, 1 run). Check the columns "total iops" along with "latency avg", "latency max" and "sdev" (std deviation): the disk subsystem, as intended, maintains throughput by trading latency (instead of trying hard to keep latency low: this is physically impossible). Operational research to the rescue :-)
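One piece of operational research that applies directly here is Little's law: with N threads each waiting for its request to complete before sending the next one, throughput is about N divided by the mean latency. The sketch below applies it to one row of the results (md10fs, 512 threads, 512-byte requests: 887.5 read IOPS at 522.7 ms, 222.7 write IOPS at 202.3 ms); the helper names are assumptions made for this illustration.

```shell
# Mean latency (ms) over reads and writes, weighted by their respective IOPS.
# Arguments: read_iops read_avg_ms write_iops write_avg_ms
mean_latency_ms() {
    awk -v ri="$1" -v ra="$2" -v wi="$3" -v wa="$4" \
        'BEGIN { printf "%.1f\n", (ri * ra + wi * wa) / (ri + wi) }'
}

# Little's law for a closed loop of N threads: IOPS = N / mean latency.
# Arguments: threads mean_latency_ms
expected_iops() {
    awk -v n="$1" -v lat="$2" 'BEGIN { printf "%.0f\n", n * 1000 / lat }'
}

lat=$(mean_latency_ms 887.5 522.7 222.7 202.3)
echo "mean latency: $lat ms, expected IOPS: $(expected_iops 512 "$lat")"
```

The estimate (about 1117 IOPS) is very close to the measured total (1110.2), which illustrates why adding threads beyond saturation can only inflate latency: the IOPS ceiling is set by the spindles, so N and the mean latency grow together.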
        Threads I/O  total | read:  latency (ms)        | write: latency (ms)
                size iops  | iops   min avg max sdev    | iops   min avg max sdev
-----------------------------+-----------------------------------+----------------------------------
md10fs 1 512 112.3 | 89.1 1.8 11.2 270.9 7.5 | 23.2 0.1 0.2 17.7 0.5
3w10 1 512 93.4 | 74.1 3.5 13.4 517.6 10.1 | 19.3 0.1 0.1 1.3 0.1
3w10fs 1 512 131.0 | 104.4 0.2 9.5 498.0 10.6 | 26.6 0.1 0.2 38.6 1.0
3w5fs 1 512 133.1 | 106.1 0.2 9.4 102.7 5.3 | 27.0 0.1 0.2 7.4 0.4
#3ware dominates. Best throughput, lowest latency. Sadly, this test hardly
#simulates server-class load
md10fs 8 512 539.1 | 431.3 0.2 18.4 505.5 13.4 | 107.8 0.1 0.5 81.0 2.8
3w10fs 8 512 635.0 | 508.1 0.3 14.8 185.7 8.4 | 126.9 0.1 4.0 178.8 8.5
md10fs 16 512 792.3 | 634.4 0.2 24.8 384.2 20.5 | 157.9 0.1 1.6 131.1 7.6
3w10 16 512 670.4 | 536.3 2.7 27.8 380.6 22.1 | 134.0 0.1 8.3 412.4 29.3
3w10fs 16 512 811.0 | 649.5 1.7 22.3 126.1 13.8 | 161.5 0.1 9.2 101.0 14.2
3w5fs 16 512 525.9 | 421.0 0.2 33.1 343.6 27.3 | 104.9 0.1 19.5 328.8 30.6
#Light load. Max throughput is far ahead
md10fs 24 512 846.8 | 678.5 1.8 34.3 656.8 40.5 | 168.4 0.1 4.2 407.6 16.5
3w10fs 24 512 847.4 | 678.9 2.1 31.6 523.3 21.4 | 168.4 0.1 15.0 526.3 22.2
#Up to this point the disk subsystem is at ease
md10fs 32 512 877.6 | 702.7 0.2 43.9 1218.7 60.9 | 174.8 0.1 6.6 1144.6 34.2
3w10fs 32 512 872.7 | 698.8 0.3 40.7 180.0 23.8 | 173.8 0.1 20.5 149.0 23.4
#md: some rare requests need 1.2 seconds (average and std deviation remain low)
md10fs 48 512 928.5 | 743.4 0.2 61.5 1175.0 86.3 | 185.1 0.1 12.0 680.9 40.8
3w10fs 48 512 897.4 | 718.7 2.1 58.8 212.7 31.2 | 178.7 0.1 32.0 186.3 30.3
md10fs 64 512 968.2 | 774.8 0.9 78.0 1247.2 104.5 | 193.3 0.1 18.0 857.9 61.1
3w10 64 512 825.3 | 661.1 3.0 95.5 1798.8 131.0 | 164.2 0.1 4.4 711.4 45.1
3w10fs 64 512 913.7 | 731.6 3.2 77.0 269.0 36.9 | 182.0 0.1 41.7 264.1 35.6
3w5fs 64 512 615.6 | 492.7 1.9 112.8 426.6 58.4 | 122.9 0.1 68.0 473.6 57.6
md10fs 128 512 1039.1 | 830.7 2.0 142.4 1584.5 181.5 | 208.4 0.1 44.5 1214.8 118.8
3w10 128 512 891.0 | 715.3 2.9 166.1 2367.9 216.5 | 175.7 0.1 42.8 1693.1 210.1
3w10fs 128 512 968.6 | 774.6 3.7 147.7 410.0 60.4 | 194.0 0.1 68.6 374.0 57.6
3w5fs 128 512 653.5 | 522.5 3.5 217.7 644.3 95.5 | 131.0 0.1 106.6 641.1 91.3
#3ware superbly manages latency
#'md10fs' serves more requests in a given timeframe
md10fs 164 512 1032.2 | 824.6 0.2 173.4 1587.0 171.1 | 207.6 0.1 99.6 1195.1 167.7
md10fs 180 512 1079.1 | 862.9 1.9 185.7 1476.2 184.8 | 216.2 0.1 89.7 1073.8 147.0
md10fs 196 512 1097.8 | 877.8 2.5 200.3 1923.6 202.1 | 220.0 0.1 90.3 1115.6 151.9
md10fs 212 512 1105.9 | 884.4 0.2 220.2 1749.0 228.9 | 221.5 0.1 71.7 1273.0 145.9
md10fs 228 512 1111.1 | 888.1 2.3 236.2 2062.3 243.8 | 223.0 0.1 77.3 1098.2 147.2
md10fs 244 512 1044.5 | 834.9 2.3 257.1 1960.2 233.2 | 209.6 0.1 136.4 1241.2 202.9
#Latency gets worse as throughput peaks
md10fs 256 512 1100.9 | 880.4 3.0 259.0 1933.3 235.0 | 220.5 0.1 123.6 1254.5 188.0
3w10 256 512 908.1 | 726.6 3.7 316.8 2838.8 253.3 | 181.5 0.1 138.1 982.8 261.4
3w10fs 256 512 961.5 | 768.7 2.8 308.4 607.6 95.8 | 192.9 0.1 92.9 614.1 102.0
3w5fs 256 512 648.0 | 518.2 0.3 455.3 861.2 146.2 | 129.8 0.1 148.0 975.1 157.0
#3ware limits max-latency (maximum waiting time), lowering cumulated
#output. This is one way to cope with the usual trade-off: accepting more
#incoming requests and serving "on average" faster (with some requests
#waiting) or limiting both the amount of requests AND the worst experienced
#latency for each served request?
md10fs 384 512 1089.9 | 870.4 5.0 394.0 2642.9 267.5 | 219.5 0.3 181.4 1916.2 212.9
3w10fs 384 512 960.3 | 767.4 3.0 470.1 850.4 132.5 | 193.0 0.1 110.4 877.0 148.9
3w5fs 384 512 657.9 | 525.6 3.3 682.2 1271.0 201.2 | 132.4 0.1 176.5 1318.0 222.5
md10fs 512 512 1110.2 | 887.5 4.4 522.7 2701.5 300.0 | 222.7 0.4 202.3 1619.7 211.7
3w10 512 512 912.0 | 729.7 2.8 655.8 3388.7 400.4 | 182.2 0.1 164.4 976.8 283.8
3w10fs 512 512 975.9 | 779.2 3.1 620.6 932.0 168.6 | 196.7 0.1 129.1 996.7 188.6
3w5fs 512 512 656.7 | 524.2 4.6 917.2 1427.7 239.7 | 132.5 0.1 198.7 1437.6 281.9
#With 512 threads at work a request is, on average, served in approx
#1/2 second (with 1/3 s std deviation) and the slowest needs 2 seconds,
#this is unbearable in most OLTP contexts, and getting worse:
md10fs 1K 512 1075.2 | 857.7 3.6 1101.7 3187.2 413.2 | 217.5 0.2 298.2 2414.8 366.0
3w10 1K 512 905.7 | 722.5 2.9 1359.9 4097.0 870.3 | 183.2 0.1 160.1 1047.0 268.2
3w10fs 1K 512 999.9 | 796.6 4.1 1221.9 1694.8 318.1 | 203.2 0.1 186.4 1835.9 356.6
3w5fs 1K 512 662.1 | 527.0 3.3 1828.5 2544.8 468.8 | 135.1 0.1 307.7 2632.8 562.2
md10fs 2K 512 1074.1 | 853.4 4.8 2225.7 6274.2 689.4 | 220.7 0.3 470.2 4030.0 729.1
3w10fs 2K 512 978.0 | 776.3 2.7 2495.7 3207.4 633.2 | 201.7 0.1 313.3 3353.5 736.9
3w5fs 2K 512 667.1 | 527.6 3.9 3633.7 4533.4 912.0 | 139.5 0.1 496.4 4961.4 1113.5
#32K per data block
md10fs 1 32K 107.8 | 85.5 0.3 11.5 54.5 5.2 | 22.3 0.2 0.6 412.8 11.3
3w10 1 32K 90.5 | 71.7 2.6 13.9 413.5 7.8 | 18.8 0.1 0.1 1.1 0.1
3w10fs 1 32K 116.4 | 92.5 0.3 10.7 399.8 8.2 | 23.8 0.1 0.5 433.5 11.5
3w5fs 1 32K 115.4 | 91.7 0.3 10.9 381.2 8.0 | 23.7 0.1 0.2 4.9 0.4
md10fs 16 32K 735.1 | 588.1 0.3 26.7 459.4 22.0 | 147.0 0.2 2.0 160.6 9.0
3w10 16 32K 638.0 | 510.5 0.3 29.4 499.0 23.3 | 127.5 0.1 7.7 225.0 27.2
3w10fs 16 32K 613.3 | 490.8 3.1 29.1 260.8 19.1 | 122.4 0.1 13.8 259.9 22.5
3w5fs 16 32K 403.3 | 322.9 0.4 43.4 310.5 33.9 | 80.4 0.1 24.4 263.1 37.9
md10fs 64 32K 912.5 | 730.5 0.3 83.8 1448.7 117.9 | 181.9 0.2 14.8 828.1 55.9
3w10 64 32K 798.6 | 639.3 2.9 99.9 2132.5 131.6 | 159.3 0.1 0.1 8.9 0.2
3w10fs 64 32K 692.1 | 553.5 4.0 101.9 343.7 48.2 | 138.6 0.1 54.1 327.1 47.3
3w5fs 64 32K 483.5 | 387.0 2.0 144.3 473.7 70.3 | 96.5 0.1 83.8 428.9 70.7
md10fs 128 32K 994.5 | 795.3 0.4 150.6 1977.7 182.0 | 199.3 0.2 39.6 814.3 97.7
3w10 128 32K 849.1 | 680.0 4.4 173.5 2621.9 222.6 | 169.1 0.1 55.8 1655.2 243.4
3w10fs 128 32K 712.6 | 569.8 3.8 200.6 737.6 82.3 | 142.8 0.1 94.9 805.1 84.8
3w5fs 128 32K 504.8 | 404.1 3.7 283.4 698.9 114.2 | 100.7 0.1 131.9 755.6 115.9
md10fs 256 32K 1024.5 | 819.0 3.3 286.8 3253.0 290.4 | 205.5 0.2 98.6 1673.0 186.4
3w10 256 32K 855.6 | 686.3 4.1 336.0 2986.6 266.9 | 169.2 0.1 140.3 1129.2 285.5
3w10fs 256 32K 718.3 | 573.7 0.5 411.3 896.8 128.2 | 144.5 0.1 132.1 932.4 142.8
3w5fs 256 32K 488.6 | 391.0 4.4 601.9 1237.3 183.4 | 97.6 0.2 200.1 1355.1 216.9
md10fs 512 32K 1049.9 | 838.9 6.1 561.7 2463.6 352.4 | 211.0 0.3 169.6 2064.5 209.1
3w10 512 32K 867.9 | 693.5 5.0 669.5 3394.4 386.9 | 174.4 0.1 254.9 967.2 334.2
3w10fs 512 32K 716.1 | 571.5 4.5 846.6 1259.6 225.0 | 144.6 0.2 171.4 1368.3 261.1
3w5fs 512 32K 508.9 | 406.5 4.5 1182.8 1787.3 301.1 | 102.4 0.1 265.0 2161.9 379.1
md10fs 1K 32K 1062.1 | 846.4 5.3 1137.9 3138.1 442.1 | 215.7 0.4 235.6 2481.6 327.2
3w10 1K 32K 868.2 | 694.4 4.8 1372.3 4544.6 761.7 | 173.8 0.1 336.8 1774.4 384.2
3w10fs 1K 32K 713.6 | 568.1 5.5 1710.7 2418.6 433.0 | 145.5 0.1 257.0 2752.5 512.7
3w5fs 1K 32K 494.0 | 392.8 4.0 2444.1 3286.7 636.7 | 101.1 0.1 394.2 3865.4 761.6
#'md' throughput is barely reduced, although we now request 64 times more
#data, because our stripe size is larger than each set of data requested:
#disk head movement is the performance-limiting factor.
#3ware throughput seems too severely reduced, especially in fs mode; there
#is something to explore here
#64K per data block (stripe size)
md10fs 1 64K 102.3 | 81.2 2.2 12.2 390.9 10.4 | 21.1 0.2 0.3 1.4 0.1
3w10 1 64K 85.3 | 67.7 4.4 14.7 396.5 9.6 | 17.6 0.1 0.2 0.8 0.1
3w10fs 1 64K 107.1 | 85.0 0.5 11.7 376.8 9.7 | 22.1 0.1 0.2 2.1 0.2
md10fs 16 64K 699.9 | 559.7 0.5 28.1 416.8 22.0 | 140.2 0.2 1.9 125.8 8.7
3w10 16 64K 609.5 | 487.8 0.5 31.0 570.5 24.4 | 121.7 0.1 7.3 238.3 26.2
3w10fs 16 64K 492.6 | 394.3 2.3 36.2 288.1 23.1 | 98.3 0.1 17.5 249.2 27.1
md10fs 64 64K 878.2 | 703.3 0.5 86.5 1442.0 117.0 | 175.0 0.2 17.5 940.1 61.3
3w10 64 64K 761.9 | 610.0 4.1 103.8 1821.1 138.8 | 151.9 0.1 3.2 785.7 39.8
3w10fs 64 64K 553.5 | 442.7 5.0 127.2 519.3 60.0 | 110.8 0.1 68.3 428.3 60.5
md10fs 128 64K 930.2 | 744.5 0.5 163.2 2039.0 214.2 | 185.6 0.2 32.8 1454.2 110.4
3w10 128 64K 803.0 | 644.7 0.5 176.3 2384.0 223.7 | 158.3 0.1 78.3 1806.3 288.0
3w10fs 128 64K 565.0 | 451.9 5.2 254.2 661.5 99.1 | 113.1 0.1 114.3 719.4 102.7
md10fs 256 64K 955.1 | 763.8 1.5 303.7 2120.4 292.2 | 191.2 0.2 120.0 1632.9 199.0
3w10 256 64K 823.1 | 660.4 5.2 292.1 2956.8 266.4 | 162.8 0.1 378.4 2178.7 414.4
3w10fs 256 64K 573.9 | 458.6 6.1 514.9 929.3 149.5 | 115.3 0.1 162.1 1139.2 179.8
md10fs 512 64K 964.4 | 770.5 6.9 617.0 2639.6 409.1 | 193.9 0.5 172.7 2145.4 220.7
3w10 512 64K 823.0 | 659.9 5.9 686.0 3788.9 342.6 | 163.1 0.1 342.9 1816.0 389.4
3w10fs 512 64K 569.9 | 454.7 6.5 1054.9 1559.7 262.8 | 115.2 0.2 238.1 1912.6 330.1
md10fs 1K 64K 980.6 | 781.4 7.2 1229.1 3404.8 501.3 | 199.2 0.5 252.9 2399.1 365.9
3w10 1K 64K 827.2 | 659.4 5.6 1408.5 3897.9 591.0 | 167.8 0.1 478.3 1809.4 391.0
3w10fs 1K 64K 565.1 | 449.5 5.8 2144.3 3013.3 542.8 | 115.6 0.2 343.0 3461.2 673.7
#3ware performance sinks; there is definitely something weird
#128K per data block (2 times the stripe size)
md10fs 1 128K 74.0 | 58.7 4.2 16.8 855.2 17.2 | 15.3 0.4 1.0 415.4 13.7
3w10 1 128K 72.2 | 57.1 6.7 17.4 544.7 13.6 | 15.0 0.2 0.3 6.3 0.2
3w10fs 1 128K 94.1 | 74.7 0.6 13.3 374.5 9.9 | 19.4 0.2 0.3 9.9 0.4
3w5fs 1 128K 94.7 | 75.2 0.5 13.2 126.8 7.6 | 19.5 0.2 0.3 8.8 0.7
md10fs 16 128K 387.5 | 310.1 1.3 49.8 527.5 43.7 | 77.4 0.4 7.2 203.1 19.8
3w10 16 128K 329.7 | 263.4 6.8 56.6 651.1 48.3 | 66.2 0.2 16.1 645.1 58.3
3w10fs 16 128K 329.2 | 263.0 1.0 54.2 387.3 36.3 | 66.2 0.2 26.2 381.6 42.2
3w5fs 16 128K 259.5 | 207.3 0.6 69.0 656.6 47.0 | 52.3 0.2 32.4 335.2 52.8
md10fs 64 128K 448.1 | 359.0 5.5 162.3 1543.8 173.9 | 89.1 0.4 63.2 879.1 131.0
3w10 64 128K 389.1 | 311.2 8.3 205.0 2550.3 213.2 | 77.9 0.2 0.3 2.6 0.2
3w10fs 64 128K 353.8 | 282.9 7.1 200.0 621.2 87.7 | 70.9 0.2 103.2 562.1 91.5
3w5fs 64 128K 309.7 | 247.3 6.7 229.5 811.2 98.8 | 62.4 0.2 113.6 710.3 103.3
md10fs 128 128K 461.0 | 369.0 7.0 315.4 2334.6 294.0 | 92.1 0.4 118.7 1862.9 238.8
3w10 128 128K 414.4 | 332.8 10.7 345.0 2632.5 294.5 | 81.6 0.2 139.4 3194.3 520.5
3w10fs 128 128K 378.0 | 302.2 10.2 380.7 927.8 134.0 | 75.8 0.2 168.0 1025.2 151.5
3w5fs 128 128K 324.4 | 259.0 10.3 445.5 1013.4 158.4 | 65.3 0.2 186.7 1083.0 173.9
md10fs 256 128K 486.5 | 389.2 16.3 603.8 2689.6 426.6 | 97.4 0.4 196.6 1602.6 246.9
3w10 256 128K 431.3 | 346.7 11.8 483.5 3028.6 341.3 | 84.6 0.2 1014.5 2214.9 710.5
3w10fs 256 128K 366.6 | 292.5 10.1 806.7 1653.0 234.0 | 74.0 0.2 246.0 1613.7 275.6
3w5fs 256 128K 319.7 | 255.1 10.6 930.2 1589.8 245.1 | 64.6 0.3 263.9 1745.3 310.3
md10fs 512 128K 483.8 | 386.0 11.9 1218.7 4733.3 605.3 | 97.8 1.2 377.7 2961.9 397.6
3w10 512 128K 430.2 | 344.8 12.8 1249.7 3885.5 454.4 | 85.4 0.2 865.6 2648.8 758.4
3w10fs 512 128K 378.6 | 301.5 8.3 1580.2 2427.7 381.7 | 77.1 0.2 361.7 2491.6 500.4
3w5fs 512 128K 319.7 | 255.1 10.6 930.2 1589.8 245.1 | 64.6 0.3 263.9 1745.3 310.3
md10fs 1K 128K 479.6 | 381.5 20.4 2458.4 6277.9 799.1 | 98.1 1.5 647.9 4847.9 825.6
3w10 1K 128K 429.6 | 342.9 14.1 2638.7 5507.9 838.8 | 86.7 0.2 1108.8 2718.6 665.6
3w10fs 1K 128K 370.4 | 293.5 7.5 3246.2 4317.8 788.0 | 77.0 0.5 541.5 4642.4 1007.1
3w5fs 1K 128K 310.4 | 245.1 9.6 3877.3 5185.4 980.4 | 65.3 0.3 651.2 6024.0 1230.1
#'md' performance sinks. There may be (!) a relationship between the
#average size of requested blocks and stripe size
#640K (full stride, all ten disks mobilized by each request)
md10fs 1 640K 40.8 | 32.7 10.4 29.9 414.3 14.7 | 8.1 1.6 2.5 7.3 0.7
3w10 1 640K 48.3 | 38.4 9.5 25.8 65.4 8.2 | 9.9 0.7 1.1 2.4 0.6
3w10fs 1 640K 60.7 | 48.2 8.2 20.5 395.6 9.3 | 12.5 0.7 1.1 2.9 0.7
3w5fs 1 640K 60.2 | 47.7 8.5 20.6 297.9 12.2 | 12.5 0.7 1.3 6.4 1.2
md10fs 16 640K 84.5 | 67.1 24.6 192.8 913.4 110.3 | 17.4 1.5 174.4 933.5 187.5
3w10 16 640K 72.0 | 57.0 26.2 226.8 432.3 58.5 | 15.0 1.4 202.1 371.7 59.0
3w10fs 16 640K 103.9 | 82.5 16.0 161.3 516.6 64.3 | 21.4 0.7 124.7 381.4 82.2
3w5fs 16 640K 91.9 | 72.9 12.2 191.6 781.2 95.1 | 19.0 0.8 105.3 819.2 107.8
md10fs 64 640K 95.8 | 75.9 38.0 756.0 2688.8 335.0 | 19.9 1.5 313.0 2104.7 426.9
3w10 64 640K 71.9 | 56.9 32.8 996.9 1627.6 159.2 | 15.0 2.6 455.6 1090.4 131.6
3w10fs 64 640K 118.9 | 94.4 19.4 595.6 1328.0 182.3 | 24.5 0.9 301.5 1366.1 192.7
3w5fs 64 640K 92.1 | 72.9 22.8 769.9 1396.7 194.6 | 19.2 0.9 385.7 1399.3 241.1
md10fs 128 640K 100.1 | 79.2 195.6 1403.8 3796.1 481.6 | 20.9 4.9 750.1 3164.9 560.9
3w10fs 128 640K 126.8 | 100.7 28.4 1174.0 2353.6 278.0 | 26.1 2.7 339.3 1352.9 238.0
3w5fs 128 640K 97.6 | 77.2 65.7 1507.9 3046.4 302.2 | 20.4 0.8 480.1 2202.0 368.1
md10fs 256 640K 95.9 | 75.8 285.5 2766.4 9177.8 1118.6 | 20.2 3.9 2072.0 7038.4 1400.8
3w10fs 256 640K 127.1 | 100.7 29.4 2381.6 4947.3 487.4 | 26.5 1.7 390.2 4534.5 416.2
3w5fs 256 640K 96.4 | 75.7 39.6 3129.4 6538.3 615.5 | 20.7 1.1 592.5 4551.3 627.6
#1MB (1048576 bytes), 1.6 stride
md10fs 1 1M 36.0 | 28.7 10.2 33.8 390.0 14.3 | 7.2 2.7 4.1 8.2 0.7
3w10 1 1M 44.1 | 35.0 11.2 28.1 396.3 11.8 | 9.0 1.1 1.6 2.8 0.7
3w10fs 1 1M 55.2 | 43.8 9.9 22.3 239.6 7.5 | 11.4 1.1 1.8 3.3 0.8
3w5fs 1 1M 50.5 | 40.2 10.2 24.3 402.5 14.1 | 10.3 1.1 2.0 16.2 1.4
md10fs 16 1M 75.1 | 59.6 36.2 243.2 888.6 125.3 | 15.5 2.4 94.8 609.3 107.4
3w10 16 1M 66.0 | 52.3 46.1 248.2 399.9 59.7 | 13.7 2.8 217.3 343.9 58.6
3w10fs 16 1M 87.4 | 69.3 18.0 189.2 387.6 60.3 | 18.1 2.2 158.7 381.0 78.7
3w5fs 16 1M 69.6 | 55.1 16.4 243.0 1198.4 107.5 | 14.5 1.1 177.2 657.9 116.3
md10fs 64 1M 82.0 | 65.2 72.6 857.9 2250.5 303.5 | 16.9 2.4 450.3 2604.3 474.0
3w10 64 1M 66.1 | 52.3 20.1 1126.9 1621.1 145.5 | 13.8 1.3 320.3 830.7 95.8
3w10fs 64 1M 95.3 | 75.5 30.7 771.1 1607.2 165.8 | 19.8 4.0 271.1 1095.8 156.5
3w5fs 64 1M 73.7 | 58.3 30.6 995.0 2021.7 221.8 | 15.4 3.5 348.4 1311.2 220.1
md10fs 128 1M 85.5 | 68.0 221.8 1631.1 5353.9 605.5 | 17.5 2.5 885.9 3217.8 647.2
3w10 128 1M 66.1 | 52.1 37.1 2320.7 2900.5 333.0 | 14.0 1.4 329.5 2175.3 145.7
3w10fs 128 1M 95.1 | 75.1 59.0 1605.7 2485.2 267.3 | 20.0 1.5 279.8 1786.0 192.9
3w5fs 128 1M 73.6 | 58.0 45.2 2065.1 2894.8 421.6 | 15.6 6.7 383.9 2899.7 341.6
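The comments in the table above suspect a relationship between the size of each request and the stripe (chunk) size. A minimal sketch of the underlying arithmetic, assuming the 64K chunk used here: it only counts how many consecutive chunks a request overlaps. Which disks actually serve those chunks depends on the RAID level and layout, which this sketch does not model. The function name 'chunks_touched' is mine.

```shell
#!/bin/sh
# chunks_touched OFFSET SIZE CHUNK (all in bytes)
# Prints how many consecutive chunks a request starting at OFFSET
# and spanning SIZE bytes overlaps, for a chunk size of CHUNK.
chunks_touched() {
    offset=$1; size=$2; chunk=$3
    first=$(( offset / chunk ))
    last=$(( (offset + size - 1) / chunk ))
    echo $(( last - first + 1 ))
}

# Chunk-aligned requests of the sizes benchmarked above, 64K chunks
chunk=$(( 64 * 1024 ))
for kb in 32 64 128 640 1024; do
    printf '%5sK aligned request: %s chunk(s)\n' "$kb" \
        "$(chunks_touched 0 $(( kb * 1024 )) "$chunk")"
done
```

A request that is not chunk-aligned may overlap one more chunk than an aligned request of the same size, for example 'chunks_touched 32768 65536 65536'.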
In the next tests we will use:
BEWARE! The command below this line is DANGEROUS, it may ERASE DATA. BEWARE!
The benchmark used for sequential I/O is sdd
BEWARE! The 'write' command below this line is DANGEROUS, it may ERASE DATA. BEWARE!
read: sync ; echo 3 > /proc/sys/vm/drop_caches ; time sdd -onull if=/dev/DeviceName bs=1m count=10000 -t
write: sync ; echo 3 > /proc/sys/vm/drop_caches ; time sdd -inull of=/dev/DeviceName bs=1m count=10000 -t
Caution: using a high 'bs' value may be impossible because sdd allocates memory for it, or it may trigger swap activity if you did not 'swapoff'.
IO scheduler used: deadline and CFQ (no major performance hit)
Device names order (ghijkb, ghjkbi, ghkbij, hbijkg, hbjkgi) does not affect performance.
Linux md:
Test | total | read: latency (ms) | write: latency (ms)
| iops | iops min avg max sdev | iops min avg max sdev
----------+--------+-----------------------------------+----------------------------------
1_3w | 162.9 | 130.0 12.6 367.9 2713.5 305.9 | 32.8 0.1 0.7 280.5 9.7 real 2m1.135s, user 0m0.036s, sys 0m0.504s
----------+--------+-----------------------------------+-----------------------------------
5_3w_INIT| 75.7 | 60.5 2.7 617.6 6126.2 1759.7 | 15.2 0.1 622.7 6032.0 1786.2 real 2m0.744s, user 0m0.016s, sys 0m0.288s
5_3w | 78.1 | 62.5 0.5 598.1 6096.5 1711.1 | 15.6 0.1 608.8 6056.1 1740.1 real 2m0.591s, user 0m0.024s, sys 0m0.300s
----------+--------+-----------------------------------+-----------------------------------
5_md | 231.3 | 185.4 3.4 54.3 524.1 54.6 | 45.9 30.6 822.7 16356.0 705.0 real 2m0.404s, user 0m0.072s, sys 0m4.788s
----------+--------+-----------------------------------+-----------------------------------
10_3w 1 | 439.5 | 351.9 3.3 136.2 1838.7 140.9 | 87.5 0.1 0.2 3.5 0.1 real 2m0.202s, user 0m0.096s, sys 0m1.404s
10_3w 2 | 400.6 | 320.7 4.5 116.6 728.2 85.6 | 79.9 0.1 132.4 1186.3 269.0 real 2m0.533s, user 0m0.080s, sys 0m1.484s
10_3w 3 | 440.2 | 352.6 3.5 135.7 1765.0 139.9 | 87.7 0.1 0.9 458.4 14.8 real 2m0.680s, user 0m0.084s, sys 0m1.488s
10_3w 4 | 440.1 | 352.4 4.7 136.0 2200.3 139.3 | 87.6 0.1 0.3 192.6 4.4 real 2m0.320s, user 0m0.076s, sys 0m1.420s
----------+--------+-----------------------------------+-----------------------------------
10_md FAR | 454.6 | 364.0 4.6 126.8 1151.3 135.0 | 90.5 0.2 19.8 542.6 47.7 real 2m0.426s, user 0m0.092s, sys 0m1.660s
10_md OFF | 443.5 | 355.3 3.0 130.7 1298.2 136.3 | 88.2 0.2 17.2 522.8 45.7 real 2m0.340s, user 0m0.100s, sys 0m1.676s
10_md NEAR| 442.1 | 354.1 0.5 130.5 1299.5 137.5 | 87.9 0.2 19.7 617.3 50.0 real 2m0.309s, user 0m0.132s, sys 0m1.548s
3ware efficiently uses its write cache, reducing the apparent latency. It
doesn't boost the effective performance (IOPS), which is device-dependent.
Sequential:
| Type | Read throughput; CPU usage | Write throughput; CPU usage |
| 3ware RAID1 (mounted root fs on 2 Western Digital) | (read-ahead 16384): 84 MB/s; real 2m2.687s, user 0m0.060s, sys 0m13.597s | 63 MB/s at fs-level; real 2m48.616s, user 0m0.056s, sys 0m17.061s |
| 3ware RAID5 INIT | During INIT: 223 MB/s; real 0m50.939s, user 0m0.068s, sys 0m17.369s | During INIT: 264 MB/s; real 0m40.015s, user 0m0.020s, sys 0m19.969s |
| 3ware RAID5 | 254 MB/s; real 0m40.774s, user 0m0.004s, sys 0m18.633s | 268 MB/s; real 0m42.785s, user 0m0.020s, sys 0m17.193s |
| md RAID5 | 84 MB/s; real 2m2.019s, user 0m0.040s, sys 0m13.261s | 129 MB/s; real 1m19.632s, user 0m0.020s, sys 0m27.430s |
| 3ware RAID10 | 173 MB/s; real 0m59.907s, user 0m0.008s, sys 0m19.565s | 188 MB/s; real 0m59.907s, user 0m0.008s, sys 0m19.565s (242.8 MB/s; real 19m37.522s, user 0m0.484s, sys 10m20.635s on XFS) |
| md RAID10 FAR | 305 MB/s; real 0m30.548s, user 0m0.048s, sys 0m16.469s (read-ahead 0: 22 MB/s; real 7m45.582s, user 0m0.048s, sys 0m57.764s) | 118 MB/s; real 1m26.857s, user 0m0.012s, sys 0m21.689s |
| md RAID10 OFFSET | 185 MB/s; real 0m59.742s, user 0m0.056s, sys 0m20.197s | 156 MB/s; real 1m5.735s, user 0m0.024s, sys 0m22.585s |
| md RAID10 NEAR | 189 MB/s; real 0m59.046s, user 0m0.036s, sys 0m20.461s | 156 MB/s; real 1m6.124s, user 0m0.012s, sys 0m22.513s |
nr_request=128
Test | total | read: latency (ms) | write: latency (ms)
| iops | iops min avg max sdev | iops min avg max sdev
---------+-------+-----------------------------------+---------------------------------
CFQ | 155.8 | 124.5 6.5 348.5 1790.3 283.3 | 31.3 0.2 144.0 1490.1 278.5
noop | 155.0 | 123.8 8.0 350.0 1519.1 260.8 | 31.2 0.2 137.5 1583.8 308.4
deadline | 152.7 | 122.0 5.6 346.1 1594.8 278.8 | 30.7 0.2 180.1 1686.5 314.9
nr_request=4
Test | total | read: latency (ms) | write: latency (ms)
| iops | iops min avg max sdev | iops min avg max sdev
-----+-------+-----------------------------------+---------------------------------
noop | 128.6 | 102.7 4.5 222.5 1262.2 201.9 | 25.9 0.2 962.6 2205.7 489.4
CFQ | 127.6 | 102.0 5.9 316.6 1381.5 279.6 | 25.6 0.2 590.8 1808.3 490.4
nr_request=1024
Test | total | read: latency (ms) | write: latency (ms)
| iops | iops min avg max sdev | iops min avg max sdev
-----+-------+-----------------------------------+---------------------------------
noop | 154.7 | 123.5 6.9 343.7 1524.4 281.2 | 31.1 0.2 170.8 1901.4 330.5
CFQ | 154.4 | 123.3 6.3 363.1 1705.7 289.2 | 31.1 0.2 100.9 1217.8 220.4
nr_request=1024,queue_depth=1
Test | total | read: latency (ms) | write: latency (ms)
| iops | iops min avg max sdev | iops min avg max sdev
-----+-------+-----------------------------------+---------------------------------
noop | 151.9 | 121.4 4.8 339.2 1585.6 298.3 | 30.5 0.2 208.2 1869.3 363.1
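The 'nr_request' and 'queue_depth' values named in the tables above live in sysfs. Here is a minimal sketch to display and set them together, assuming the usual paths ('queue/nr_requests' and 'device/queue_depth' under /sys/block/DEVICE). The function names and the optional SYSFS_ROOT argument are mine; the latter only exists to try the functions against a fake tree.

```shell
#!/bin/sh
# show_queue DEV [SYSFS_ROOT]
# Prints the block-layer queue size (nr_requests) and the device queue
# depth (queue_depth) of DEV. SYSFS_ROOT defaults to /sys.
show_queue() {
    dev=$1; root=${2:-/sys}
    nr=$(cat "$root/block/$dev/queue/nr_requests")
    qd=$(cat "$root/block/$dev/device/queue_depth")
    echo "$dev nr_requests=$nr queue_depth=$qd"
}

# set_queue DEV NR_REQUESTS QUEUE_DEPTH [SYSFS_ROOT]
# Writes both values; needs root on a real system.
set_queue() {
    dev=$1; root=${4:-/sys}
    echo "$2" > "$root/block/$dev/queue/nr_requests"
    echo "$3" > "$root/block/$dev/device/queue_depth"
}

# Example (as root): set_queue sda 1024 1 ; show_queue sda
```

Recording the output of 'show_queue' next to each benchmark result helps to avoid mixing up runs made with different settings.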
$ mdadm -D /dev/md1
/dev/md1:
Version : 01.02.03
Creation Time : Wed Feb 20 01:17:06 2008
Raid Level : raid10
Array Size : 1464811776 (1396.95 GiB 1499.97 GB)
Device Size : 976541184 (465.65 GiB 499.99 GB)
Raid Devices : 6
Total Devices : 6
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Wed Feb 20 18:48:04 2008
State : clean
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0
Layout : near=1, far=2
Chunk Size : 64K
Name : diderot:500GB (local to host diderot)
UUID : 194106df:617845ca:ccffda8c:8e2af56a
Events : 2
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 8 96 1 active sync /dev/sdg
2 8 112 2 active sync /dev/sdh
3 8 128 3 active sync /dev/sdi
4 8 144 4 active sync /dev/sdj
5 8 160 5 active sync /dev/sdk
Note that the filesystem code is smart:
Let's find the adequate read-ahead by running some tests, just after a reboot (default parameters).
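Before changing anything it may help to record the current read-ahead of every member device. A minimal sketch, assuming 'blockdev --getra' (which reports the read-ahead in 512-byte sectors); the function name 'sum_readahead' and the overridable BLOCKDEV variable are mine.

```shell
#!/bin/sh
# Command used to query read-ahead; overridable so the function can be
# exercised without real block devices.
BLOCKDEV=${BLOCKDEV:-/sbin/blockdev}

# sum_readahead DEV...
# Prints each device's read-ahead (in 512-byte sectors, as reported by
# 'blockdev --getra') followed by the sum over all listed devices.
sum_readahead() {
    total=0
    for dev in "$@"; do
        ra=$("$BLOCKDEV" --getra "$dev") || return 1
        echo "$dev $ra"
        total=$(( total + ra ))
    done
    echo "total $total"
}

# Example (as root):
# sum_readahead /dev/sdb /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk
```

The total can then be compared with the value reported for the array itself ('blockdev --getra /dev/md1').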
Read throughput from the first blocks:
# sync ; echo 3 > /proc/sys/vm/drop_caches ; time sdd -onull if=/dev/md1 bs=1g count=1 -t
sdd: Read 1 records + 0 bytes (total of 1073741824 bytes = 1048576.00k).
sdd: Total time 3.468sec (302270 kBytes/sec)
real 0m4.276s, user 0m0.000s, sys 0m2.676s
Last volume blocks:
# sync ; echo 3 > /proc/sys/vm/drop_caches ; time sdd -onull if=/dev/md1 bs=1g iseek=1300g count=1 -t
sdd: Read 1 records + 0 bytes (total of 1073741824 bytes = 1048576.00k).
sdd: Total time 3.489sec (300537 kBytes/sec)
real 0m4.076s, user 0m0.000s, sys 0m2.680s
Sequential read of the raw underlying block device is 3 times slower when a filesystem created on it is mounted. Maybe because such an operation needs updates between the fs and the block device (? TODO: test with the raw device in read-only mode). That's not a problem because I will only use it at the fs level:
# mkfs.xfs -l version=2 -i attr=2 /dev/md1
# mount -t xfs -onoatime,logbufs=8,logbsize=256k /dev/md1 /mnt/md1
# sync ; echo 3 > /proc/sys/vm/drop_caches ; time sdd -onull if=/dev/md1 bs=1g count=1 -t
sdd: Read 1 records + 0 bytes (total of 1073741824 bytes = 1048576.00k).
sdd: Total time 11.109sec (94389 kBytes/sec)
real 0m11.359s, user 0m0.000s, sys 0m7.376s
XFS pumps faster than a direct access to the raw device, maybe thanks to its multithreaded code:
# sync ; echo 3 > /proc/sys/vm/drop_caches ; time sdd -onull if=testfile bs=1g count=24 -t
sdd: Read 24 records + 0 bytes (total of 25769803776 bytes = 25165824.00k).
sdd: Total time 73.936sec (340368 kBytes/sec)
real 1m14.036s, user 0m0.000s, sys 0m31.602s
Reading at the very end of the filesystem (which is AFAIK in fact in the middle of the platters, with 'md' far(?)) is slower but still OK:
# sync ; echo 3 > /proc/sys/vm/drop_caches ; time sdd -onull if=iozone.tmp bs=1g count=1 -t
sdd: Read 1 records + 0 bytes (total of 1073741824 bytes = 1048576.00k).
sdd: Total time 3.901sec (268727 kBytes/sec)
real 0m4.000s, user 0m0.000s, sys 0m1.648s
Read-ahead is, by default (on those 6 spindles), 256 for each underlying device and the cumulated value (1536) for /dev/md1.
IOPS are pretty good:
# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/md1 1.4T 1.3T 145G 90% /mnt/md1
# ls -lh testfile
-rw-r--r-- 1 root root 1,2T 2008-02-24 18:34 testfile
# sync ; echo 3 > /proc/sys/vm/drop_caches ; time randomio testfile 48 0.2 0.1 65536 120 1
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
--------+-----------------------------------+----------------------------------
261.5 | 209.3 6.1 218.7 1458.5 169.0 | 52.2 0.2 41.2 843.9 86.0
real 2m0.839s, user 0m0.072s, sys 0m1.540s
On one of the very first files stored:
# ls -lh iozone.DUMMY.4
-rw-r----- 1 root root 8,0G 2008-02-23 19:58 iozone.DUMMY.4
# sync ; echo 3 > /proc/sys/vm/drop_caches ; time randomio iozone.DUMMY.4 48 0.2 0.1 65536 120 1
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
---------+-------+-----------------------------------+----------------------------------
deadline | 312.2 | 249.8 3.7 184.1 1058.4 139.4 | 62.4 0.2 31.8 498.0 62.0
noop | 311.5 | 249.2 3.2 183.8 987.5 139.8 | 62.3 0.2 34.4 509.0 67.0
CFQ | 310.2 | 248.2 5.3 185.0 1113.5 144.6 | 62.0 0.2 32.2 879.9 71.0
Without NCQ ('qpolicy=off' for all underlying devices)
(those results are consistent with the raw-level tests):
noop | 280.5 | 224.5 6.5 200.8 971.5 126.1 | 56.1 0.2 51.5 1138.8 103.9
deadline | 278.0 | 222.4 3.0 205.2 1002.0 136.8 | 55.6 0.2 41.4 833.8 83.7
CFQ | 273.2 | 218.5 2.4 208.0 1194.9 144.5 | 54.7 0.2 45.1 845.3 93.2
Let's use CFQ, which enables us to 'nice' I/O and doesn't cost much.
Three processes will run, each in a different 'ionice' class, using
different files. Here is the model:
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
---------+-----------------------------------+----------------------------------
1 128.6 | 102.8 11.7 440.1 1530.2 256.3 | 25.7 0.2 103.4 1134.2 156.9 real 3m0.669s, user 0m0.076s, sys 0m0.944s
2 127.6 | 102.0 9.3 443.7 1684.2 258.7 | 25.6 0.2 103.9 1102.8 155.0 real 3m0.585s, user 0m0.076s, sys 0m0.836s
3 31.7 | 25.2 56.9 1880.9 8166.9 1052.1 | 6.5 0.3 86.9 909.8 135.3 real 3m0.190s, user 0m0.008s, sys 0m0.232s
Total: 287.9
CFQ enabled:
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
---------+-----------------------------------+----------------------------------
1 141.4 | 113.1 7.7 401.3 1350.0 240.9 | 28.3 0.2 89.1 919.1 127.3 real 3m1.529s, user 0m0.060s, sys 0m0.952s
2 138.3 | 110.6 11.8 410.2 1416.7 243.4 | 27.7 0.2 92.9 986.0 133.5 real 3m0.417s, user 0m0.040s, sys 0m0.924s
3 35.1 | 27.8 20.7 1702.1 7588.5 948.1 | 7.3 0.3 85.8 920.1 120.3 real 3m0.238s, user 0m0.004s, sys 0m0.192s
Total: 314.8
CFQ is efficient on those Hitachi drives, I enable it.
Ionices 1 and 2 simultaneously active:
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
---------+-----------------------------------+----------------------------------
1 160.8 | 128.7 6.8 361.7 1502.9 241.1 | 32.0 0.2 42.9 1028.2 109.3 real 3m1.143s, user 0m0.088s, sys 0m1.340s
2 160.4 | 128.4 6.1 362.6 1357.5 239.3 | 32.0 0.2 42.0 1001.3 102.2 real 3m0.749s, user 0m0.080s, sys 0m1.352s
Total: 321.2
1 and 3:
1 160.4 | 128.4 6.1 362.6 1357.5 239.3 | 32.0 0.2 42.0 1001.3 102.2 real 3m0.749s, user 0m0.080s, sys 0m1.352s
3 52.4 | 41.9 29.9 1128.5 5246.7 617.6 | 10.5 0.3 67.7 935.0 82.2 real 3m1.935s, user 0m0.032s, sys 0m0.400s
Total: 212.8
2 and 3:
2 256.2 | 205.1 7.3 213.9 1089.3 138.4 | 51.1 0.2 79.9 626.9 83.0 real 3m0.540s, user 0m0.076s, sys 0m2.136s
3 50.1 | 40.0 19.7 1176.0 5236.8 647.5 | 10.1 0.3 67.4 575.9 73.2 real 3m1.120s, user 0m0.048s, sys 0m0.292s
Total: 306.3
1 and 2, with this last one at -n7:
1 156.9 | 125.5 6.6 367.4 1479.0 246.3 | 31.3 0.2 58.1 1291.8 147.9 real 3m0.489s, user 0m0.032s, sys 0m1.356s
2 158.3 | 126.6 6.3 365.0 1422.9 247.7 | 31.7 0.2 54.2 1334.7 143.8 real 3m0.565s, user 0m0.064s, sys 0m1.324s
Total: 315.2
1 and 2, with this last one at -n0:
1 157.5 | 126.0 5.7 365.6 1681.5 239.3 | 31.5 0.2 58.8 1727.4 143.9 real 3m0.748s, user 0m0.080s, sys 0m1.412s
2 160.6 | 128.6 6.8 359.9 1440.6 237.7 | 32.0 0.2 51.3 1067.3 128.3 real 3m0.748s, user 0m0.080s, sys 0m1.412s
Total: 318.1
2 (best effort) -n0 (high priority) and 2 -n7:
-n0 161.4 | 129.2 6.4 358.9 1474.1 233.8 | 32.2 0.2 49.7 1101.9 120.7 real 3m1.348s, user 0m0.100s, sys 0m1.260s
-n7 159.6 | 127.7 6.2 363.0 1477.4 233.7 | 31.9 0.2 49.9 1150.0 116.9 real 3m1.867s, user 0m0.068s, sys 0m1.380s
Total: 321
Sequential read and random I/O, done simultaneously and on distant files:
# time ionice -c2 -n7 sdd -onull if=testfile bs=1g iseek=1t count=18 -t
along with
# time ionice -c2 -n0 randomio iozone.DUMMY.2 48 0.2 0.1 65536 180 1
sdd 86.8 MB/s; real 3m37.902s, user 0m0.000s, sys 0m22.353s
randomio 231.2 | 184.9 5.8 237.9 1270.8 186.5 | 46.3 0.2 85.2 956.0 104.0 real 3m0.560s, user 0m0.128s, sys 0m2.060s
Reciprocally (-n7 for randomio and -n0 for sdd):
sdd 95.9 MB/s; real 3m16.912s, user 0m0.000s, sys 0m22.121s
randomio 197.3 | 157.9 4.4 265.2 1412.6 210.4 | 39.4 0.2 154.5 1178.7 159.5 real 3m0.800s, user 0m0.076s, sys 0m1.728s
sdd under nice, randomio in default mode:
sdd 88.5 MB/s; real 3m33.447s, user 0m0.000s, sys 0m22.441s
randomio 222.6 | 178.1 4.2 243.9 1397.3 188.1 | 44.5 0.2 101.2 1239.2 122.1 real 3m0.277s, user 0m0.088s, sys 0m2.196s
sdd under nice, randomio in default mode (deadline):
sdd 89.7 MB/s; real 3m31.237s, user 0m0.000s, sys 0m21.613s
randomio 220.7 | 176.6 5.7 263.8 1485.4 200.7 | 44.1 0.2 30.9 952.7 107.5 real 3m2.273s, user 0m0.132s, sys 0m3.176s
The CFQ 'realtime' class seems useless, maybe because the underlying XFS has no 'realtime section'. I did not declare one because the underlying device has no such thing (a dedicated track associated to a fixed head? A solid state disk?).
The CFQ 'class data' ('-n' argument) seems nearly useless.
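For reference, here is a minimal sketch of how three concurrent randomio processes can be started, one per 'ionice' class (1: realtime, 2: best-effort, 3: idle). The randomio arguments are those used above; the file names (file1, file2, file3), the output file names and the DRY_RUN switch are placeholders of mine. The realtime class needs root.

```shell
#!/bin/sh
# Set DRY_RUN=0 to really run the commands; by default they are only printed.
DRY_RUN=${DRY_RUN:-1}

# run_in_class CLASS FILE
# Starts one randomio process on FILE in the given ionice scheduling
# class (1: realtime, 2: best-effort, 3: idle), in the background.
run_in_class() {
    cmd="ionice -c$1 randomio $2 48 0.2 0.1 65536 180 1"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        $cmd > "randomio.class$1.out" 2>&1 &
    fi
}

# One distinct test file per process (placeholder names)
run_all() {
    run_in_class 1 file1
    run_in_class 2 file2
    run_in_class 3 file3
    # when really running, wait for the three background processes
    [ "$DRY_RUN" = 1 ] || wait
}

# Example: DRY_RUN=1 run_all
```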
mdadm --create /dev/md0 --auto=md --metadata=1.2 --level=raid10 --chunk=64 --raid-devices=10 --spare-devices=0 --layout=f2 --assume-clean /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sd[bghijk]
# mdadm -D /dev/md0
/dev/md0:
Version : 01.02.03
Creation Time : Tue Feb 26 14:26:21 2008
Raid Level : raid10
Array Size : 2441352960 (2328.26 GiB 2499.95 GB)
Device Size : 976541184 (465.65 GiB 499.99 GB)
Raid Devices : 10
Total Devices : 10
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Tue Feb 26 14:26:21 2008
State : clean
Active Devices : 10
Working Devices : 10
Failed Devices : 0
Spare Devices : 0
Layout : near=1, far=2
Chunk Size : 64K
Name : diderot:0 (local to host diderot)
UUID : 51c89f70:70af0287:42c3f2cf:13cd20fd
Events : 0
Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
1 8 49 1 active sync /dev/sdd1
2 8 65 2 active sync /dev/sde1
3 8 81 3 active sync /dev/sdf1
4 8 16 4 active sync /dev/sdb
5 8 96 5 active sync /dev/sdg
6 8 112 6 active sync /dev/sdh
7 8 128 7 active sync /dev/sdi
8 8 144 8 active sync /dev/sdj
9 8 160 9 active sync /dev/sdk
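The sizes reported by 'mdadm -D' can be cross-checked with a little arithmetic. A minimal sketch, assuming (as the figures above suggest) that this mdadm version prints 'Device Size' in 512-byte sectors and 'Array Size' in KiB, and that the RAID10 keeps 2 copies of each block. The function name 'raid10_kib' is mine.

```shell
#!/bin/sh
# raid10_kib DEVICES COPIES DEVICE_SECTORS
# Prints the usable size, in KiB, of a RAID10 made of DEVICES members of
# DEVICE_SECTORS 512-byte sectors each, storing COPIES copies of the data.
raid10_kib() {
    echo $(( $1 * ($3 / 2) / $2 ))
}

raid10_kib 10 2 976541184   # 10 spindles: prints 2441352960
raid10_kib 6 2 976541184    # 6 spindles: prints 1464811776
```

Both results match the 'Array Size' lines of the two arrays described in this document.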
# time randomio /dev/DeviceName 48 0.2 0.1 65536 180 1
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
------------+-----------------------------------+----------------------------------
793.8 | 634.6 1.8 73.6 2710.1 114.0 | 159.2 0.1 8.0 673.3 29.9
3w10 726.0 | 580.3 0.5 82.2 1733.3 104.6 | 145.7 0.1 2.0 586.0 23.8; real 3m0.355s, user 0m0.272s, sys 0m3.944s
3w6 327.5 | 262.0 3.2 131.7 1967.8 145.9 | 65.5 0.1 204.5 1911.6 423.3; real 3m0.365s, user 0m0.144s, sys 0m1.860s
However the throughput on sequential read was mediocre: 342.7 MB/s; real
1m0.674s, user 0m0.000s, sys 0m40.571s (offset2 gave 278 MB/s, near2 206
MB/s)
3ware's RAID10 was 302.3 MB/s (real 1m8.464s, user 0m0.000s, sys 0m31.722s) and RAID6 (during INITIALIZATION!) was 121.4 MB/s (real 2m41.076s, user 0m0.000s, sys 0m31.786s)
Legend (from 'man iostat'):
rrqm/s wrqm/s    r/s    w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm  %util
  0.00  14.85 102.97  82.18  2.29  2.18    49.41     1.67  9.01  4.92  91.09
  1.98  86.14  58.42  84.16  1.14  1.80    42.28     3.00 18.86  6.75  96.24
  0.00  25.00  84.00  91.00  1.83  2.24    47.59     1.38  9.71  5.33  93.20
  0.00  66.34  69.31 157.43  1.64  6.17    70.53     1.78  7.81  3.97  89.90
  0.00  62.38  91.09 101.98  2.13  2.47    48.82     1.71  8.70  4.88  94.26
  0.00   4.00  25.00 147.00  0.47  3.27    44.56     2.55 15.02  5.86 100.80
  5.94  44.55 103.96  77.23  2.41  2.11    51.15     1.52  8.31  4.98  90.30
  0.00  75.25  59.41  69.31  1.21  1.70    46.28     3.75 25.42  7.35  94.65
  0.00   0.00  49.50 100.00  0.96  2.13    42.33     1.72 14.81  6.23  93.07
  0.00  50.51  61.62  98.99  1.45  2.15    45.89     1.96 12.28  5.96  95.76
  1.98  44.55  85.15  86.14  2.00  2.02    48.00     1.74  9.99  5.27  90.30
  0.00  66.34  51.49 123.76  1.36  2.44    44.47     2.12 12.27  5.06  88.71
  5.88   7.84  53.92  81.37  1.15  2.28    51.88     1.16  8.61  6.20  83.92
  0.00  58.00  73.00  81.00  1.56  1.63    42.49     2.38 15.30  5.97  92.00
  5.94   4.95  13.86 111.88  0.36  2.17    41.07     1.89 15.12  7.87  99.01
 10.00  59.00  82.00  73.00  1.55  2.16    48.98     1.31  8.49  5.32  82.40
  0.00  62.00  82.00 126.00  1.94  3.05    49.12     2.14 10.37  4.44  92.40
  0.00  36.63  81.19  81.19  1.86  1.86    46.93     2.12 12.80  5.78  93.86
  9.90  64.36  83.17 100.00  1.75  2.88    51.72     1.42  7.91  4.80  87.92
  0.00  59.60  79.80  70.71  1.63  1.85    47.30     1.29  7.89  6.07  91.31
  1.96   1.96  89.22  69.61  1.98  2.05    51.90     1.49  9.88  5.63  89.41
  0.00  72.00  12.00 129.00  0.36  2.27    38.18     2.93 20.88  7.09 100.00
 10.00  55.00  76.00  68.00  1.45  1.82    46.50     1.60 10.75  6.11  88.00
  0.00   4.00  38.00 109.00  0.92  1.89    39.13     2.64 18.39  6.59  96.80
The '%util' (last) column content seems pretty abnormal to me, given the modest workload.
As a sidenote: MySQL (which, under the 'CFQ' scheduler, is invoked here with 'ionice -c1 -n7') grinds through this with less than 2% of a single CPU (in fact this 'optimize' is an 'ALTER TABLE': reading, some reorganization, then writing). 'show innodb status' reveals:
FILE I/O
87.91 reads/s, 24576 avg bytes/read, 93.91 writes/s
BUFFER POOL AND MEMORY
133.87 reads/s, 8.99 creates/s, 164.84 writes/s
Buffer pool hit rate 994 / 1000
'rrqm/s' figures with the 'noop' I/O scheduler are similar to those of the other schedulers. How come? Does it mean that 'noop' merges requests, and therefore somewhat buffers them? Or that it uses a merge approach inducing no latency, maybe by reserving it to sequential requests saturating the device's queue? Or that 'CFQ' and 'deadline' (even when /sys/block/sda/queue/iosched/read_expire contains 10000, offering up to 10 seconds of latency to group read requests) are not able to merge efficiently, perhaps due to some queue saturation?
During a continuous and stable (homogeneous) usage scenario (I used the loooong MySQL 'optimize' run, even if it is not representative as it does many more writes than reads), launch as root:
#!/bin/bash
#(bash is needed for $RANDOM)
#name of the monitored device, as it appears under /sys/block
DEV=sda
DEVICENAME=/dev/$DEV
#pause duration, after establishing the IO scheduler
SLP=10
while true ; do
    #number of one-second samples collected during this round
    COLLECT_TIME=$((RANDOM%3600 + 30))
    #pick one of the schedulers at random
    case $((RANDOM%3)) in
        0) SCHED=cfq ;;
        1) SCHED=deadline ;;
        2) SCHED=noop ;;
    esac
    echo $SCHED > /sys/block/$DEV/queue/scheduler
    sleep $SLP
    #keep only the lines about the monitored device, without the device name column
    iostat -mx $DEVICENAME 1 $COLLECT_TIME | grep $DEV | cut -c10- >> ${SCHED}_sched.results
done
Trim the longest series (by using 'head -n NumberOfLines FileName >
NewFileName') so that all of them contain the same number of samples
(check: 'wc -l *_sched.results').
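This trimming step can be scripted. A minimal sketch: it finds the shortest series and writes a truncated copy of each file, leaving the originals untouched. The function name 'equalize' and the '.eq' suffix are mine; the copies must be renamed (or the following commands adapted) before injecting the data.

```shell
#!/bin/sh
# equalize FILE...
# Truncates every listed file to the line count of the shortest one,
# writing the result to FILE.eq, then prints that line count.
equalize() {
    min=
    for f in "$@"; do
        n=$(wc -l < "$f")
        n=$(( n ))   # strips the padding some 'wc' implementations add
        if [ -z "$min" ] || [ "$n" -lt "$min" ]; then min=$n; fi
    done
    for f in "$@"; do
        head -n "$min" "$f" > "$f.eq"
    done
    echo "$min"
}

# Example:
# equalize cfq_sched.results deadline_sched.results noop_sched.results
```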
Create a MySQL database and table for the time series:
create database ioperf;
use ioperf;
create table iostat ( scheduler ENUM('cfq', 'noop', 'deadline','anticipatory') NOT NULL,
rrqmps float, wrqmps float, rps float, wps float, rmbps float,wmbps float,avgrqsz float,
avgqusz float,await float,svctm float, percentutil float,index id_sche (scheduler));
Inject the data:
$ perl -pe '$_ =~ tr/ /\t/s; print "noop";' < noop_sched.results > iostat.txt
$ perl -pe '$_ =~ tr/ /\t/s; print "deadline";' < deadline_sched.results >> iostat.txt
$ perl -pe '$_ =~ tr/ /\t/s; print "cfq";' < cfq_sched.results >> iostat.txt
$ mysqlimport --local ioperf iostat.txt
Here is a way to conclude, using an arithmetic mean (provided by SQL 'avg'). It seems to be sufficient, although a harmonic mean may be a useful addition. My main concerns are the first columns: rps, rmbps, await...
# check that all time series are of same size
select scheduler,count(*) from iostat group by scheduler;
+-----------+----------+
| scheduler | count(*) |
+-----------+----------+
| cfq       |    12000 |
| noop      |    12000 |
| deadline  |    12000 |
+-----------+----------+
#major performance-related arithmetic means
select scheduler,avg(rps),avg(rmbps),avg(await),avg(wps),avg(wmbps),avg(percentutil) from iostat group by scheduler;
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+
| scheduler | avg(rps)        | avg(rmbps)      | avg(await)      | avg(wps)        | avg(wmbps)      | avg(percentutil) |
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+
| cfq       | 85.109504183431 | 1.9109841637902 | 13.676339167714 | 89.733314177394 | 2.2845941611628 |  95.672404343605 |
| noop      | 86.285021674693 | 1.9277283308546 | 12.093321669618 |  89.79774667867 | 2.2659291594663 |  96.018881834666 |
| deadline  | 86.597292512933 |  1.931128330753 |  12.50979499948 | 89.036291648348 | 2.2486824922271 |   96.12482605044 |
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+
#(less important) information about schedulers housekeeping, useful to
#spot weird differences
select scheduler,avg(svctm),avg(rrqmps),avg(wrqmps),avg(avgrqsz),avg(avgqusz) from iostat group by scheduler;
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+
| scheduler | avg(svctm)      | avg(rrqmps)     | avg(wrqmps)     | avg(avgrqsz)    | avg(avgqusz)    |
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+
| cfq       | 6.1323758338888 | 2.2023866616189 | 34.736365837912 | 48.764866647402 | 2.0277108325611 |
| noop      | 6.0292158377171 | 2.3308641605874 | 32.569892516136 | 48.432940823873 |  1.897218332231 |
| deadline  | 6.1189483366609 | 2.2282899945279 | 32.127461672723 | 48.399182485898 | 1.9355808324416 |
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+
Right now 'deadline' dominates. (TODO: update those results, refresh them under a more adequate and controlled load (iozone?))
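The harmonic mean mentioned above is not offered by SQL 'avg', but it is easy to obtain directly from the '*_sched.results' files. A minimal sketch using awk, assuming each line holds the eleven iostat columns in the order of the header shown earlier (so that 'await' is column 9). The function name 'col_means' is mine. Zero samples are left out of the harmonic mean, which is undefined for them.

```shell
#!/bin/sh
# col_means FILE COLUMN
# Prints the arithmetic mean of the given whitespace-separated column and,
# when at least one sample is positive, the harmonic mean of the positive
# samples.
col_means() {
    awk -v c="$2" '
        NF >= c { n++; sum += $c; if ($c > 0) { m++; inv += 1 / $c } }
        END {
            if (n == 0) exit 1
            printf "arithmetic=%.2f", sum / n
            if (m > 0) printf " harmonic=%.2f", m / inv
            printf "\n"
        }' "$1"
}

# Example: col_means noop_sched.results 9
```

Because the harmonic mean is dominated by the smallest samples, it is better suited to rates (r/s, rMB/s) than to latencies.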
For example, partition your databases (MySQL, see also the tech-resources) in order to scatter, across the fastest spindles, all the tables that are often random-read simultaneously.
We badly need HSM.
Modify the parameters of the last set of tests used, which delivered a performance judged adequate, with respect to the new amount of RAM available. Try to match your real-world application. Then re-run the test in order to check the performance gain. Then select the most adequate set of parameters by using all tests offered by your applications or devised by yourself.
Check SMART status periodically (some daemons just do this, for example 'smartclt' and 'mdadm' in 'monitor' mode) and automagically in order to detect and replace battered drives but keep on mind that this is not a magic bullet: it may only warn about very few potential failures.
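As a minimal sketch of such a periodic check, the helper below reads the output of 'smartctl -H' on its standard input and succeeds only if the ATA overall-health line reports PASSED. The function name is made up for this example and the device names in the usage comments are assumptions to adapt to your system. A PASSED verdict does not prove that the drive is healthy.

```shell
#!/bin/sh
# Sketch: succeed only when `smartctl -H` output (read on stdin) contains
# the ATA overall-health verdict PASSED. Function name is illustrative.
smart_health_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Usage (needs root and smartmontools; device names are examples):
#   smartctl -H /dev/sda | smart_health_passed || echo "check /dev/sda"
# Drive on port 0 behind a 3ware 9000-series controller:
#   smartctl -H -d 3ware,0 /dev/twa0 | smart_health_passed || echo "check port 0"
```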
# http://www.3ware.com/kb/article.aspx?id=11050k
echo 64 > /sys/block/sda/queue/max_sectors_kb

#Objective: saturating the HD's integrated cache by reading ahead
#during the period used by the kernel to prepare I/O.
#It may put in cache data which will be requested by the next read.
#Read-ahead'ing too much may kill random I/O on huge files
#if it uses potentially useful drive time or loads data beyond caches.
#As soon as the caches are saturated the hit ratio is proportional to
#(cache size/data volume) therefore beware if the data volume is
#much bigger than cumulated useful (not overlapping) caches sizes
/sbin/blockdev --setra 4096 /dev/sda

#Requests in controller's queue. 3ware max is 254(?)
echo 254 > /sys/block/sda/device/queue_depth

#requests in OS queue
#May slow operation down, if too low or high
#Is it per disk? Per controller? Grand total for all controllers?
#In which way does it interact with queue_depth?
#Beware of CPU usage (iowait) if too high
#Miquel van Smoorenburg wrote: CFQ seems to like larger nr_requests,
#so if you use it, try 254 (maximum hardware size) for queue_depth
#and 512 or 1024 for nr_requests.
echo 1024 > /sys/block/sda/queue/nr_requests

#Theoretically 'noop' is better with a smart RAID controller because
#Linux knows nothing about (physical) disks geometry, therefore it
#can be efficient to let the controller, well aware of disk geometry
#and servos locations, handle the requests as soon as possible.
#But 'deadline' and 'CFQ' seem to enhance performance
#even during random I/O. Go figure.
echo cfq > /sys/block/sda/queue/scheduler

#iff deadline
#echo 256 > /sys/block/sda/queue/iosched/fifo_batch
#echo 1 > /sys/block/sda/queue/iosched/front_merges
#echo 400 > /sys/block/sda/queue/iosched/read_expire
#echo 3000 > /sys/block/sda/queue/iosched/write_expire
#echo 2 > /sys/block/sda/queue/iosched/writes_starved

#avoid swapping, better recycle buffercache memory
echo 10 > /proc/sys/vm/swappiness

#64k per I/O operation in the swap
#page-cluster is a base-2 logarithm of a page count: 4 means 2^4 = 16 pages,
#i.e. 64k with 4k pages (the value 16 would request 2^16 pages)
echo 4 > /proc/sys/vm/page-cluster

#See also http://hep.kbfi.ee/index.php/IT/KernelTuning
#echo 500 > /proc/sys/vm/dirty_expire_centisecs
echo 20 > /proc/sys/vm/dirty_background_ratio
echo 60 > /proc/sys/vm/dirty_ratio
#echo 1 > /proc/sys/vm/vfs_cache_pressure

#export IRQBALANCE_BANNED_INTERRUPTS=24
#/etc/init.d/irqbalance stop
#/usr/bin/killall irqbalance
#/etc/init.d/irqbalance start

'/etc/sysctl.conf' contains (after Debian-established lines):
kernel.shmmax = 4294967295
kernel.shmall = 268435456
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=262143
net.core.wmem_default=262143
net.ipv4.tcp_rmem=8192 1048576 16777216
net.ipv4.tcp_wmem=4096 1048576 16777216
vm.min_free_kbytes=65536
net.ipv4.tcp_max_syn_backlog=4096
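Before and after applying the block-device settings listed earlier (scheduler, read-ahead), it helps to read the values back and check them. The two helpers below are a small sketch with made-up function names: the first extracts the active scheduler, which the kernel shows between square brackets in the 'scheduler' file, and the second converts a read-ahead value expressed in 512-byte sectors (the unit used by 'blockdev --setra' and '--getra') into KiB.

```shell
#!/bin/sh
# Sketch: print the active I/O scheduler from the contents of
# /sys/block/<dev>/queue/scheduler read on stdin, e.g. "noop deadline [cfq]".
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Sketch: convert a read-ahead value given in 512-byte sectors into KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# Usage (device name is an example):
#   active_scheduler < /sys/block/sda/queue/scheduler
#   ra_sectors_to_kib "$(/sbin/blockdev --getra /dev/sda)"
```

With the read-ahead of 4096 sectors used above, the second helper reports 2048 KiB, i.e. 2 MiB read ahead per request.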
Before running the tests, launch a dedicated terminal and invoke 'iostat -mx 5' to track performance variations, disk utilization percentage and iowait, and glance at it regularly during the tests.
Measure 'random access' performance with randomio:
sync
echo 3 > /proc/sys/vm/drop_caches
randomio WHERE THREADS WR_PROPORTION 0.1 AVG_RND_BLOCK_SIZE 60 1
For sequential read performance, use sdd:
sync
echo 3 > /proc/sys/vm/drop_caches
time sdd -onull if=WHERE bs=1m count=10000 -t
For sequential write performance:
sync
echo 3 > /proc/sys/vm/drop_caches
time sdd -inull of=WHERE bs=1m count=10000 -t
Replacing:
When the actors (application, operating system and controller) are unable to parallelize accesses (beware: a database server handling only one request at a time falls into this category), one may buy less disk space on faster devices (rotation speed: 15k rpm and up, hence less rotational latency) with better average access time (probably a smaller form factor, reducing mechanical latency).
If parallelization is effective, as with most contemporary server usages and software, buying spindles for a 'md' RAID10 is the way to go.
RAID5 is good at sequential access while preserving the ratio (usable space/total disk space). However it is awful at random access (especially the 3ware implementation!). 'md' RAID5 is good at random access... for a RAID5 (in a word: mediocre).
The CPU cost of 'md' is negligible on such a machine: at most ~28% over 3ware (the most stressful test (sequential read on a RAID10 made of 10 drives, 18 GB read in ~1 minute) cost 40.571s of 'system' time with 'md' and 31.722s with 3ware). TODO: test bus contention, by simultaneously running the disk test and some memory-intensive benchmark.
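The ~28% figure is simply the ratio of the two 'system' times. As a small worked example (the function names are made up for this sketch), the helpers below compute the relative overhead of one run over another and an average throughput. The throughput call assumes 1 GB = 1000 MB and exactly 60 seconds, whereas the test above only states "~1 minute".

```shell
#!/bin/sh
# Sketch: extra cost of the first run over the second, in percent.
overhead_percent() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'
}

# Sketch: average throughput in MB/s from a volume in MB and a duration in s.
throughput_mbps() {
    awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.0f\n", mb / s }'
}

overhead_percent 40.571 31.722   # 'md' versus 3ware 'system' time: prints 27.9
throughput_mbps 18000 60         # 18 GB in one minute (assumed exact): prints 300
```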
Caveats:
Beware! There is some controversy (see also here) about this, especially when nearly all data are in the cache (nearly no disk read) while the database writes nearly all the time.
See Sander Marechal's benchmark.