blktrace分析IO
前言
blktrace的原理
blktrace、blkparse和btt
利用btt分析blktrace数据

由于在iostat输出中，只能看到service time + wait time，因为对于评估一个磁盘或者云磁盘而言，service time才是衡量磁盘性能的核心指标和直接指标。很不幸iostat无法提供这个指标，但是blktrace可以。

blktrace是一柄神器，很多工具都是基于该神器的：ioprof，seekwatcher，iowatcher，这个工具基本可以满足我们的对块设备请求的所有了解。

blktrace的原理

一个I/O请求，从应用层到底层块设备，路径如下图所示：

blktrace分析IO
前言
blktrace的原理
blktrace、blkparse和btt
利用btt分析blktrace数据

从上图可以看出IO路径是很复杂的。这么复杂的IO路径我们是无法用短短一篇小博文介绍清楚的。我们将IO路径简化一下：

blktrace分析IO
前言
blktrace的原理
blktrace、blkparse和btt
利用btt分析blktrace数据

一个I/O请求进入block layer之后，可能会经历下面的过程：

Remap: 可能被DM(Device Mapper)或MD(Multiple Device, Software RAID) remap到其它设备
Split: 可能会因为I/O请求与扇区边界未对齐、或者size太大而被分拆(split)成多个物理I/O
Merge: 可能会因为与其它I/O请求的物理位置相邻而合并(merge)成一个I/O
被IO Scheduler依照调度策略发送给driver
被driver提交给硬件，经过HBA、电缆（光纤、网线等）、交换机（SAN或网络）、最后到达存储设备，设备完成IO请求之后再把结果发回。

blktrace 能够记录下IO所经历的各个步骤:

blktrace分析IO
前言
blktrace的原理
blktrace、blkparse和btt
利用btt分析blktrace数据

我们一起看下blktrace的输出长什么样子：

blktrace分析IO
前言
blktrace的原理
blktrace、blkparse和btt
利用btt分析blktrace数据

第一个字段：8,0 这个字段是设备号 major device ID和minor device ID。
第二个字段：3 表示CPU
第三个字段：11 序列号
第四个字段：0.009507758 Time Stamp是时间偏移
第五个字段：PID 本次IO对应的进程ID
第六个字段：Event，这个字段非常重要，反映了IO进行到了那一步
第七个字段：R表示 Read， W是Write，D表示block，B表示Barrier Operation
第八个字段：223490+56，表示的是起始block number 和 number of blocks，即我们常说的Offset 和 Size
第九个字段：进程名

其中第六个字段非常有用：每一个字母都代表了IO请求所经历的某个阶段。

Q – 即将生成IO请求
|
G – IO请求生成
|
I – IO请求进入IO Scheduler队列
|
D – IO请求进入driver
|
C – IO请求执行完毕

根据以上步骤对应的时间戳就可以计算出I/O请求在每个阶段所消耗的时间：

Q2G – 生成IO请求所消耗的时间，包括remap和split的时间；
G2I – IO请求进入IO Scheduler所消耗的时间，包括merge的时间；
I2D – IO请求在IO Scheduler中等待的时间；
D2C – IO请求在driver和硬件上所消耗的时间；
Q2C – 整个IO请求所消耗的时间(Q2I + I2D + D2C = Q2C)，相当于iostat的await。

如果I/O性能慢的话，以上指标有助于进一步定位缓慢发生的地方：
D2C可以作为硬件性能的指标；
I2D可以作为IO Scheduler性能的指标。

一个IO的生命周期大约是：

● I/O enters block layer – it can be:

– Remapped onto another device (MD, DM)

– Split into 2 separate I/Os (alignment, size, ...)

– Added to the request queue

– Merged with a previous entry on the queue All I/Os end up on a request queue at some point

● At some later time, the I/O is issued to a device driver, and submitted to a device

● Later, the I/O is completed by the device, and its driver

Q2Q — time between requests sent to the block layer
Q2G — time from a block I/O is queued to the time it gets a request allocated for it
G2I — time from a request is allocated to the time it is Inserted into the device's queue
Q2M — time from a block I/O is queued to the time it gets merged with an existing request
I2D — time from a request is inserted into the device's queue to the time it is actually issued to the device
M2D — time from a block I/O is merged with an exiting request until the request is issued to the device
D2C — service time of the request by the device
Q2C — total time spent in the block layer for a request

注意，整个IO路径，分成很多段，每一段开始的时候，都会有一个时间戳，根据上一段开始的时间和下一段开始的时间，就可以得到IO 路径各段花费的时间。

注意，我们心心念念的service time，也就是反应块设备处理能力的指标，就是从D到C所花费的时间，简称D2C。

而iostat输出中的await，即整个IO从生成请求到IO请求执行完毕，即从Q到C所花费的时间，我们简称Q2C。

我们知道Linux 有I/O scheduler，调度器的效率如何，I2D是重要的指标。

blktrace、blkparse和btt

我们接下来简单介绍这些工具的使用，其中这三个命令都是属于blktrace这个包的，他们是一家人。

使用blktrace需要挂载debugfs：

$ mount -t debugfs debugfs /sys/kernel/debug

首先通过如下命令，可以查看磁盘上的实时信息：

blktrace -d /dev/sdb -o – | blkparse -i –

这个命令会连绵不绝地出现很多输出，当你输入ctrl＋C的时候，会停止。

当然了，你也可以先用如下命令采集信息，待所有信息采集完毕后，统一分析所有采集到的数据。搜集信息的命令如下：

blktrace -d /dev/sdb

注意，这个命令并不是只输出一个文件，他会根据CPU的个数上，每一个CPU都会输出一个文件，你也可以用-o参数指定自己的输出文件名。如下所示：

-rw-r--r-- 1 manu manu  1.3M Jul  6 19:58 sdb.blktrace.0
-rw-r--r-- 1 manu manu  823K Jul  6 19:58 sdb.blktrace.1
-rw-r--r-- 1 manu manu  2.8M Jul  6 19:58 sdb.blktrace.10
-rw-r--r-- 1 manu manu  1.9M Jul  6 19:58 sdb.blktrace.11
-rw-r--r-- 1 manu manu  474K Jul  6 19:58 sdb.blktrace.12
-rw-r--r-- 1 manu manu  271K Jul  6 19:58 sdb.blktrace.13
-rw-r--r-- 1 manu manu  578K Jul  6 19:58 sdb.blktrace.14
-rw-r--r-- 1 manu manu  375K Jul  6 19:58 sdb.blktrace.15
-rw-r--r-- 1 manu manu  382K Jul  6 19:58 sdb.blktrace.16
-rw-r--r-- 1 manu manu  478K Jul  6 19:58 sdb.blktrace.17
-rw-r--r-- 1 manu manu  839K Jul  6 19:58 sdb.blktrace.18
-rw-r--r-- 1 manu manu  848K Jul  6 19:58 sdb.blktrace.19
-rw-r--r-- 1 manu manu  1.6M Jul  6 19:58 sdb.blktrace.2
-rw-r--r-- 1 manu manu  652K Jul  6 19:58 sdb.blktrace.20
-rw-r--r-- 1 manu manu  738K Jul  6 19:58 sdb.blktrace.21
-rw-r--r-- 1 manu manu  594K Jul  6 19:58 sdb.blktrace.22
-rw-r--r-- 1 manu manu  527K Jul  6 19:58 sdb.blktrace.23
-rw-r--r-- 1 manu manu 1005K Jul  6 19:58 sdb.blktrace.3
-rw-r--r-- 1 manu manu  1.2M Jul  6 19:58 sdb.blktrace.4
-rw-r--r-- 1 manu manu  511K Jul  6 19:58 sdb.blktrace.5
-rw-r--r-- 1 manu manu  2.3M Jul  6 19:58 sdb.blktrace.6
-rw-r--r-- 1 manu manu  1.3M Jul  6 19:58 sdb.blktrace.7
-rw-r--r-- 1 manu manu  2.1M Jul  6 19:58 sdb.blktrace.8
-rw-r--r-- 1 manu manu  1.1M Jul  6 19:58 sdb.blktrace.9

有了输出，我们可以通过blkparse -i sdb来分析采集的数据：

$ blktrace -d /dev/sdb
$ blkparse -i sdb
  # 第一个IO开始：
  8,16   1        1     0.000000000 18166  Q   R 0 + 1 [dd]
  8,16   1        0     0.000009827     0  m   N cfq18166S  /user.slice alloced
  8,16   1        2     0.000010451 18166  G   R 0 + 1 [dd]
  8,16   1        3     0.000011056 18166  P   N [dd]
  8,16   1        4     0.000012255 18166  I   R 0 + 1 [dd]
  8,16   1        0     0.000013477     0  m   N cfq18166S  /user.slice insert_request
  8,16   1        0     0.000014526     0  m   N cfq18166S  /user.slice add_to_rr
  8,16   1        5     0.000016643 18166  U   N [dd] 1
  8,16   1        0     0.000017522     0  m   N cfq workload slice:300
  8,16   1        0     0.000018880     0  m   N cfq18166S  /user.slice set_active wl_class:0 wl_type:2
  8,16   1        0     0.000020594     0  m   N cfq18166S  /user.slice fifo=          (null)
  8,16   1        0     0.000021462     0  m   N cfq18166S  /user.slice dispatch_insert
  8,16   1        0     0.000022898     0  m   N cfq18166S  /user.slice dispatched a request
  8,16   1        0     0.000023786     0  m   N cfq18166S  /user.slice activate rq, drv=1
  8,16   1        6     0.000023977 18166  D   R 0 + 1 [dd]
  8,16   0        1     0.014270153     0  C   R 0 + 1 [0]
  # 第一个IO结束。
  8,16   0        0     0.014278115     0  m   N cfq18166S  /user.slice complete rqnoidle 0
  8,16   0        0     0.014280044     0  m   N cfq18166S  /user.slice set_slice=100
  8,16   0        0     0.014282217     0  m   N cfq18166S  /user.slice arm_idle: 8 group_idle: 0
  8,16   0        0     0.014282728     0  m   N cfq schedule dispatch
  # 第二个IO开始：
  8,16   1        7     0.014298247 18166  Q   R 1 + 1 [dd]
  8,16   1        8     0.014300522 18166  G   R 1 + 1 [dd]
  8,16   1        9     0.014300984 18166  P   N [dd]
  8,16   1       10     0.014301996 18166  I   R 1 + 1 [dd]
  8,16   1        0     0.014303864     0  m   N cfq18166S  /user.slice insert_request
  8,16   1       11     0.014304981 18166  U   N [dd] 1
  8,16   1        0     0.014306368     0  m   N cfq18166S  /user.slice dispatch_insert
  8,16   1        0     0.014307793     0  m   N cfq18166S  /user.slice dispatched a request
  8,16   1        0     0.014308763     0  m   N cfq18166S  /user.slice activate rq, drv=1
  8,16   1       12     0.014308962 18166  D   R 1 + 1 [dd]
  8,16   0        2     0.014518615     0  C   R 1 + 1 [0]
  # 第二个IO结束。
  8,16   0        0     0.014523548     0  m   N cfq18166S  /user.slice complete rqnoidle 0
  8,16   0        0     0.014525334     0  m   N cfq18166S  /user.slice arm_idle: 8 group_idle: 0
  8,16   0        0     0.014525676     0  m   N cfq schedule dispatch
  # 第三个IO开始：
  8,16   1       13     0.014531022 18166  Q   R 2 + 1 [dd]
  ...

注意，blkparse仅仅是将blktrace输出的信息转化成人可以阅读和理解的输出，但是，信息太多，太杂，人完全没法得到关键信息。这时候btt就横空出世了，这个工具可以将blktrace采集回来的数据，进行分析，得到对人更有用的信息。事实上，btt也是我们的终点。

注：
在以上数据中，有一些记录的event类型是”m”，那是IO Scheduler的调度信息，对研究IO Scheduler问题有意义：

cfq18166S – cfq是IO Scheduler的名称，18166是进程号，”S”表示Sync(同步IO)，如果异步IO则用“A”表示(Async)；
它们的第三列sequence number都是0；

它们表示IO Scheduler内部的关键函数，上例中是cfq，代码参见block/cfq-iosched.c，以下是各关键字所对应的内部函数：

alloced <<< cfq_find_alloc_queue()
insert_request <<< cfq_insert_request()
add_to_rr <<< cfq_add_cfqq_rr()
cfq workload slice:300 <<< choose_wl_class_and_type()
set_active wl_class:0 wl_type:2 <<< __cfq_set_active_queue()
fifo= (null) <<< cfq_check_fifo()
dispatch_insert <<< cfq_dispatch_insert()
dispatched a request <<< cfq_dispatch_requests()
activate rq, drv=1 <<< cfq_activate_request()
complete rqnoidle 0 <<< cfq_completed_request()
set_slice=100 <<< cfq_set_prio_slice()
arm_idle: 8 group_idle: 0 <<< cfq_arm_slice_timer()
cfq schedule dispatch <<< cfq_schedule_dispatch()

利用btt分析blktrace数据

blkparse只是将blktrace数据转成可以人工阅读的格式，由于数据量通常很大，人工分析并不轻松。btt是对blktrace数据进行自动分析的工具。

btt不能分析实时数据，只能对blktrace保存的数据文件进行分析。使用方法：
把原本按CPU分别保存的文件合并成一个，合并后的文件名为sdb.blktrace.bin：
$ blkparse -i sdb -d sdb.blktrace.bin
执行btt对sdb.blktrace.bin进行分析：
$ btt -i sdb.blktrace.bin

下面是一个btt实例：

...
            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
 
Q2Q               0.000138923   0.000154010   0.014298247       94558
Q2G               0.000001154   0.000001661   0.000017313       94559
G2I               0.000000883   0.000001197   0.000012011       94559
I2D               0.000004722   0.000005761   0.000027286       94559
D2C               0.000118840   0.000129201   0.014246176       94558
Q2C               0.000125953   0.000137820   0.014270153       94558
 
==================== Device Overhead ====================
 
       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |   1.2050%   0.8688%   0.0000%   4.1801%  93.7461%
---------- | --------- --------- --------- --------- ---------
   Overall |   1.2050%   0.8688%   0.0000%   4.1801%  93.7461%

我们看到93.7461%的时间消耗在D2C，也就是硬件层，这是正常的，我们说过D2C是衡量硬件性能的指标，这里单个IO平均0.129201毫秒，已经是相当快了，单个IO最慢14.246176 毫秒，不算坏。Q2G和G2I都很小，完全正常。I2D稍微有点大，应该是cfq scheduler的调度策略造成的，你可以试试其它scheduler，比如deadline，比较两者的差异，然后选择最适合你应用特点的那个。

注:

1、一些状态的对应关系

Act Description                                 SetPosition
A  IO was remapped to a different device   blk-core.c/trace_block_remap
B  IO bounced                              bounce.c/trace_block_bio_bounce
C  IO completion                           blk-core.c/trace_block_rq_complete
D  IO issued to driver                     elevator.c/trace_block_rq_issue
F  IO front merged with request on queue   blk-core.c/trace_block_bio_frontmerge
G  Get request                             blk-core.c/trace_block_getrq
I  IO inserted onto request queue          elevator.c/trace_block_rq_insert
M  IO back merged with request on queue    blk-core.c/trace_block_bio_backmerge
P  Plug request                            blk-core.c/trace_block_plug
Q  IO handled by request queue code        blk-core.c/trace_block_bio_queue
S  Sleep request                           blk-core.c/trace_block_sleeprq
T  Unplug due to timeout                   blk-core.c/trace_block_unplug_timer
U  Unplug request                          blk-core.c/trace_block_unplug_io
X  Split                                   bio.c/trace_block_split

2、Trace actions：

C – complete A previously issued request has been completed. The output
will detail the sector and size of that request, as well as the success or
failure of it.
D – issued A request that previously resided on the block layer queue or in
the io scheduler has been sent to the driver.
I – inserted A request is being sent to the io scheduler for addition to the
internal queue and later service by the driver. The request is fully formed
at this time.
Q – queued This notes intent to queue io at the given location. No real re-
quests exists yet.
B – bounced The data pages attached to this bio are not reachable by the
hardware and must be bounced to a lower memory location. This causes
a big slowdown in io performance, since the data must be copied to/from
kernel buffers. Usually this can be fixed with using better hardware -
either a better io controller, or a platform with an IOMMU.
m – message Text message generated via kernel call to blk add trace msg.
M – back merge A previously inserted request exists that ends on the bound-
ary of where this io begins, so the io scheduler can merge them together.
F – front merge Same as the back merge, except this io ends where a previ-
ously inserted requests starts.
G – get request To send any type of request to a block device, a struct request
container must be allocated first.
S – sleep No available request structures were available, so the issuer has to
wait for one to be freed.
P – plug When io is queued to a previously empty block device queue, Linux
will plug the queue in anticipation of future ios being added before this
data is needed.
U – unplug Some request data already queued in the device, start sending
requests to the driver. This may happen automatically if a timeout period
has passed (see next entry) or if a number of requests have been added to
the queue.
T – unplug due to timer If nobody requests the io that was queued after
plugging the queue, Linux will automatically unplug it after a defined
period has passed.
X – split On raid or device mapper setups, an incoming io may straddle a
device or internal zone and needs to be chopped up into smaller pieces
for service. This may indicate a performance problem due to a bad setup
of that raid/dm device, but may also just be part of normal boundary
conditions. dm is notably bad at this and will clone lots of io.
A – remap For stacked devices, incoming io is remapped to device below it in
the io stack. The remap action details what exactly is being remapped to what.

参考:

http://bean-li.github.io/blktrace-to-report/

http://linuxperf.com/?p=161

http://bbs.chinaunix.net/thread-1976867-1-1.html

http://www.cnblogs.com/cobbliu/p/7002933.html

/usr/share/doc/blktrace-1.0.1/blktrace.pdf
/usr/share/doc/blktrace-1.0.1/btt.pdf

http://blog.yufeng.info/archives/751

http://duch.mimuw.edu.pl/~lichota/09-10/Optymalizacja-open-source/Materialy/10%20-%20Dysk/gelato_ICE06apr_blktrace_brunelle_hp.pdf

http://fibrevillage.com/storage/539-blktrace-and-btt-example-to-debug-and-tune-disk-i-o-on-linux

blktrace分析IO 前言 blktrace的原理 blktrace、blkparse和btt 利用btt分析blktrace数据

blktrace的原理

blktrace、blkparse和btt

利用btt分析blktrace数据

相关推荐