blktrace分析IO 前言 blktrace的原理 blktrace、blkparse和btt 利用btt分析blktrace数据

blktrace分析IO
前言
blktrace的原理
blktrace、blkparse和btt
利用btt分析blktrace数据

由于在iostat输出中,只能看到service time + wait time,因为对于评估一个磁盘或者云磁盘而言,service time才是衡量磁盘性能的核心指标和直接指标。很不幸iostat无法提供这个指标,但是blktrace可以。

blktrace是一柄神器,很多工具都是基于该神器的:ioprof,seekwatcher,iowatcher,这个工具基本可以满足我们的对块设备请求的所有了解。

blktrace的原理

一个I/O请求,从应用层到底层块设备,路径如下图所示:

blktrace分析IO
前言
blktrace的原理
blktrace、blkparse和btt
利用btt分析blktrace数据

从上图可以看出IO路径是很复杂的。这么复杂的IO路径我们是无法用短短一篇小博文介绍清楚的。我们将IO路径简化一下:

blktrace分析IO
前言
blktrace的原理
blktrace、blkparse和btt
利用btt分析blktrace数据

一个I/O请求进入block layer之后,可能会经历下面的过程:

  • Remap: 可能被DM(Device Mapper)或MD(Multiple Device, Software RAID) remap到其它设备
  • Split: 可能会因为I/O请求与扇区边界未对齐、或者size太大而被分拆(split)成多个物理I/O
  • Merge: 可能会因为与其它I/O请求的物理位置相邻而合并(merge)成一个I/O
  • 被IO Scheduler依照调度策略发送给driver
  • 被driver提交给硬件,经过HBA、电缆(光纤、网线等)、交换机(SAN或网络)、最后到达存储设备,设备完成IO请求之后再把结果发回。

blktrace 能够记录下IO所经历的各个步骤:

blktrace分析IO
前言
blktrace的原理
blktrace、blkparse和btt
利用btt分析blktrace数据

我们一起看下blktrace的输出长什么样子:

blktrace分析IO
前言
blktrace的原理
blktrace、blkparse和btt
利用btt分析blktrace数据

  • 第一个字段:8,0 这个字段是设备号 major device ID和minor device ID。
  • 第二个字段:3 表示CPU
  • 第三个字段:11 序列号
  • 第四个字段:0.009507758 Time Stamp是时间偏移
  • 第五个字段:PID 本次IO对应的进程ID
  • 第六个字段:Event,这个字段非常重要,反映了IO进行到了那一步
  • 第七个字段:R表示 Read, W是Write,D表示block,B表示Barrier Operation
  • 第八个字段:223490+56,表示的是起始block number 和 number of blocks,即我们常说的Offset 和 Size
  • 第九个字段: 进程名

其中第六个字段非常有用:每一个字母都代表了IO请求所经历的某个阶段。

Q – 即将生成IO请求
|
G – IO请求生成
|
I – IO请求进入IO Scheduler队列
|
D – IO请求进入driver
|
C – IO请求执行完毕

根据以上步骤对应的时间戳就可以计算出I/O请求在每个阶段所消耗的时间:


Q2G – 生成IO请求所消耗的时间,包括remap和split的时间;
G2I – IO请求进入IO Scheduler所消耗的时间,包括merge的时间;
I2D – IO请求在IO Scheduler中等待的时间;
D2C – IO请求在driver和硬件上所消耗的时间;
Q2C – 整个IO请求所消耗的时间(Q2I + I2D + D2C = Q2C),相当于iostat的await。


如果I/O性能慢的话,以上指标有助于进一步定位缓慢发生的地方:
D2C可以作为硬件性能的指标;
I2D可以作为IO Scheduler性能的指标。

一个IO的生命周期大约是:
● I/O enters block layer – it can be:
– Remapped onto another device (MD, DM)
– Split into 2 separate I/Os (alignment, size, ...)
– Added to the request queue
– Merged with a previous entry on the queue All I/Os end up on a request queue at some point
● At some later time, the I/O is issued to a device driver, and submitted to a device
● Later, the I/O is completed by the device, and its driver
Q2Q — time between requests sent to the block layer
Q2G — time
from a block I/O is queued to the time it gets a request allocated for it
G2I — time f
rom a request is allocated to the time it is Inserted into the device's queue
Q2M — time
from a block I/O is queued to the time it gets merged with an existing request
I2D — time
from a request is inserted into the device's queue to the time it is actually issued to the device
M2D — time from
a block I/O is merged with an exiting request until the request is issued to the device
D2C — service time of the request by the device
Q2C — total time spent in the block layer
for a request

注意,整个IO路径,分成很多段,每一段开始的时候,都会有一个时间戳,根据上一段开始的时间和下一段开始的时间,就可以得到IO 路径各段花费的时间。

注意,我们心心念念的service time,也就是反应块设备处理能力的指标,就是从D到C所花费的时间,简称D2C。

而iostat输出中的await,即整个IO从生成请求到IO请求执行完毕,即从Q到C所花费的时间,我们简称Q2C。

我们知道Linux 有I/O scheduler,调度器的效率如何,I2D是重要的指标。

blktrace、blkparse和btt

我们接下来简单介绍这些工具的使用,其中这三个命令都是属于blktrace这个包的,他们是一家人。

使用blktrace需要挂载debugfs:

$ mount -t debugfs debugfs /sys/kernel/debug

首先通过如下命令,可以查看磁盘上的实时信息:

blktrace -d /dev/sdb -o – | blkparse -i –

这个命令会连绵不绝地出现很多输出,当你输入ctrl+C的时候,会停止。

当然了,你也可以先用如下命令采集信息,待所有信息采集完毕后,统一分析所有采集到的数据。搜集信息的命令如下:

blktrace -d /dev/sdb

注意,这个命令并不是只输出一个文件,他会根据CPU的个数上,每一个CPU都会输出一个文件,你也可以用-o参数指定自己的输出文件名。如下所示:

-rw-r--r-- 1 manu manu  1.3M Jul  6 19:58 sdb.blktrace.0
-rw-r--r-- 1 manu manu  823K Jul  6 19:58 sdb.blktrace.1
-rw-r--r-- 1 manu manu  2.8M Jul  6 19:58 sdb.blktrace.10
-rw-r--r-- 1 manu manu  1.9M Jul  6 19:58 sdb.blktrace.11
-rw-r--r-- 1 manu manu  474K Jul  6 19:58 sdb.blktrace.12
-rw-r--r-- 1 manu manu  271K Jul  6 19:58 sdb.blktrace.13
-rw-r--r-- 1 manu manu  578K Jul  6 19:58 sdb.blktrace.14
-rw-r--r-- 1 manu manu  375K Jul  6 19:58 sdb.blktrace.15
-rw-r--r-- 1 manu manu  382K Jul  6 19:58 sdb.blktrace.16
-rw-r--r-- 1 manu manu  478K Jul  6 19:58 sdb.blktrace.17
-rw-r--r-- 1 manu manu  839K Jul  6 19:58 sdb.blktrace.18
-rw-r--r-- 1 manu manu  848K Jul  6 19:58 sdb.blktrace.19
-rw-r--r-- 1 manu manu  1.6M Jul  6 19:58 sdb.blktrace.2
-rw-r--r-- 1 manu manu  652K Jul  6 19:58 sdb.blktrace.20
-rw-r--r-- 1 manu manu  738K Jul  6 19:58 sdb.blktrace.21
-rw-r--r-- 1 manu manu  594K Jul  6 19:58 sdb.blktrace.22
-rw-r--r-- 1 manu manu  527K Jul  6 19:58 sdb.blktrace.23
-rw-r--r-- 1 manu manu 1005K Jul  6 19:58 sdb.blktrace.3
-rw-r--r-- 1 manu manu  1.2M Jul  6 19:58 sdb.blktrace.4
-rw-r--r-- 1 manu manu  511K Jul  6 19:58 sdb.blktrace.5
-rw-r--r-- 1 manu manu  2.3M Jul  6 19:58 sdb.blktrace.6
-rw-r--r-- 1 manu manu  1.3M Jul  6 19:58 sdb.blktrace.7
-rw-r--r-- 1 manu manu  2.1M Jul  6 19:58 sdb.blktrace.8
-rw-r--r-- 1 manu manu  1.1M Jul  6 19:58 sdb.blktrace.9

有了输出,我们可以通过blkparse -i sdb来分析采集的数据:

$ blktrace -d /dev/sdb
$ blkparse -i sdb
  # 第一个IO开始:
  8,16   1        1     0.000000000 18166  Q   R 0 + 1 [dd]
  8,16   1        0     0.000009827     0  m   N cfq18166S  /user.slice alloced
  8,16   1        2     0.000010451 18166  G   R 0 + 1 [dd]
  8,16   1        3     0.000011056 18166  P   N [dd]
  8,16   1        4     0.000012255 18166  I   R 0 + 1 [dd]
  8,16   1        0     0.000013477     0  m   N cfq18166S  /user.slice insert_request
  8,16   1        0     0.000014526     0  m   N cfq18166S  /user.slice add_to_rr
  8,16   1        5     0.000016643 18166  U   N [dd] 1
  8,16   1        0     0.000017522     0  m   N cfq workload slice:300
  8,16   1        0     0.000018880     0  m   N cfq18166S  /user.slice set_active wl_class:0 wl_type:2
  8,16   1        0     0.000020594     0  m   N cfq18166S  /user.slice fifo=          (null)
  8,16   1        0     0.000021462     0  m   N cfq18166S  /user.slice dispatch_insert
  8,16   1        0     0.000022898     0  m   N cfq18166S  /user.slice dispatched a request
  8,16   1        0     0.000023786     0  m   N cfq18166S  /user.slice activate rq, drv=1
  8,16   1        6     0.000023977 18166  D   R 0 + 1 [dd]
  8,16   0        1     0.014270153     0  C   R 0 + 1 [0]
  # 第一个IO结束。
  8,16   0        0     0.014278115     0  m   N cfq18166S  /user.slice complete rqnoidle 0
  8,16   0        0     0.014280044     0  m   N cfq18166S  /user.slice set_slice=100
  8,16   0        0     0.014282217     0  m   N cfq18166S  /user.slice arm_idle: 8 group_idle: 0
  8,16   0        0     0.014282728     0  m   N cfq schedule dispatch
  # 第二个IO开始:
  8,16   1        7     0.014298247 18166  Q   R 1 + 1 [dd]
  8,16   1        8     0.014300522 18166  G   R 1 + 1 [dd]
  8,16   1        9     0.014300984 18166  P   N [dd]
  8,16   1       10     0.014301996 18166  I   R 1 + 1 [dd]
  8,16   1        0     0.014303864     0  m   N cfq18166S  /user.slice insert_request
  8,16   1       11     0.014304981 18166  U   N [dd] 1
  8,16   1        0     0.014306368     0  m   N cfq18166S  /user.slice dispatch_insert
  8,16   1        0     0.014307793     0  m   N cfq18166S  /user.slice dispatched a request
  8,16   1        0     0.014308763     0  m   N cfq18166S  /user.slice activate rq, drv=1
  8,16   1       12     0.014308962 18166  D   R 1 + 1 [dd]
  8,16   0        2     0.014518615     0  C   R 1 + 1 [0]
  # 第二个IO结束。
  8,16   0        0     0.014523548     0  m   N cfq18166S  /user.slice complete rqnoidle 0
  8,16   0        0     0.014525334     0  m   N cfq18166S  /user.slice arm_idle: 8 group_idle: 0
  8,16   0        0     0.014525676     0  m   N cfq schedule dispatch
  # 第三个IO开始:
  8,16   1       13     0.014531022 18166  Q   R 2 + 1 [dd]
  ...

注意,blkparse仅仅是将blktrace输出的信息转化成人可以阅读和理解的输出,但是,信息太多,太杂,人完全没法得到关键信息。 这时候btt就横空出世了,这个工具可以将blktrace采集回来的数据,进行分析,得到对人更有用的信息。事实上,btt也是我们的终点。

注:
在以上数据中,有一些记录的event类型是”m”,那是IO Scheduler的调度信息,对研究IO Scheduler问题有意义:

    • cfq18166S – cfq是IO Scheduler的名称,18166是进程号,”S”表示Sync(同步IO),如果异步IO则用“A”表示(Async);
    • 它们的第三列sequence number都是0;
    • 它们表示IO Scheduler内部的关键函数,上例中是cfq,代码参见block/cfq-iosched.c,以下是各关键字所对应的内部函数:
      alloced <<< cfq_find_alloc_queue()
      insert_request <<< cfq_insert_request()
      add_to_rr <<< cfq_add_cfqq_rr()
      cfq workload slice:300 <<< choose_wl_class_and_type()
      set_active wl_class:0 wl_type:2 <<< __cfq_set_active_queue()
      fifo= (null) <<< cfq_check_fifo()
      dispatch_insert <<< cfq_dispatch_insert()
      dispatched a request <<< cfq_dispatch_requests()
      activate rq, drv=1 <<< cfq_activate_request()
      complete rqnoidle 0 <<< cfq_completed_request()
      set_slice=100 <<< cfq_set_prio_slice()
      arm_idle: 8 group_idle: 0 <<< cfq_arm_slice_timer()
      cfq schedule dispatch <<< cfq_schedule_dispatch()

利用btt分析blktrace数据

blkparse只是将blktrace数据转成可以人工阅读的格式,由于数据量通常很大,人工分析并不轻松。btt是对blktrace数据进行自动分析的工具。

btt不能分析实时数据,只能对blktrace保存的数据文件进行分析。使用方法:
把原本按CPU分别保存的文件合并成一个,合并后的文件名为sdb.blktrace.bin:
$ blkparse -i sdb -d sdb.blktrace.bin
执行btt对sdb.blktrace.bin进行分析:
$ btt -i sdb.blktrace.bin

下面是一个btt实例:

...
            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
 
Q2Q               0.000138923   0.000154010   0.014298247       94558
Q2G               0.000001154   0.000001661   0.000017313       94559
G2I               0.000000883   0.000001197   0.000012011       94559
I2D               0.000004722   0.000005761   0.000027286       94559
D2C               0.000118840   0.000129201   0.014246176       94558
Q2C               0.000125953   0.000137820   0.014270153       94558
 
==================== Device Overhead ====================
 
       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |   1.2050%   0.8688%   0.0000%   4.1801%  93.7461%
---------- | --------- --------- --------- --------- ---------
   Overall |   1.2050%   0.8688%   0.0000%   4.1801%  93.7461%

我们看到93.7461%的时间消耗在D2C,也就是硬件层,这是正常的,我们说过D2C是衡量硬件性能的指标,这里单个IO平均0.129201毫秒,已经是相当快了,单个IO最慢14.246176 毫秒,不算坏。Q2G和G2I都很小,完全正常。I2D稍微有点大,应该是cfq scheduler的调度策略造成的,你可以试试其它scheduler,比如deadline,比较两者的差异,然后选择最适合你应用特点的那个。

注:

1、一些状态的对应关系

Act Description                                 SetPosition
A  IO was remapped to a different device   blk-core.c/trace_block_remap
B  IO bounced                              bounce.c/trace_block_bio_bounce
C  IO completion                           blk-core.c/trace_block_rq_complete
D  IO issued to driver                     elevator.c/trace_block_rq_issue
F  IO front merged with request on queue   blk-core.c/trace_block_bio_frontmerge
G  Get request                             blk-core.c/trace_block_getrq
I  IO inserted onto request queue          elevator.c/trace_block_rq_insert
M  IO back merged with request on queue    blk-core.c/trace_block_bio_backmerge
P  Plug request                            blk-core.c/trace_block_plug
Q  IO handled by request queue code        blk-core.c/trace_block_bio_queue
S  Sleep request                           blk-core.c/trace_block_sleeprq
T  Unplug due to timeout                   blk-core.c/trace_block_unplug_timer
U  Unplug request                          blk-core.c/trace_block_unplug_io
X  Split                                   bio.c/trace_block_split

2、Trace actions:

C – complete A previously issued request has been completed. The output
will detail the sector and size of that request, as well as the success or
failure of it.
D – issued A request that previously resided on the block layer queue or
in the io scheduler has been sent to the driver.
I – inserted A request is being sent to the io scheduler
for addition to the internal queue and later service by the driver. The request is fully formed at this time.
Q – queued This notes intent to queue io at the given location. No real re
- quests exists yet.
B – bounced The data pages attached to this bio are not reachable by the hardware and must be bounced to a lower memory location. This causes a big slowdown
in io performance, since the data must be copied to/from kernel buffers. Usually this can be fixed with using better hardware - either a better io controller, or a platform with an IOMMU.
m – message Text message generated via kernel call to blk add trace msg.
M – back merge A previously inserted request exists that ends on the bound
- ary of where this io begins, so the io scheduler can merge them together.
F – front merge Same as the back merge, except this io ends where a previ
- ously inserted requests starts.
G – get request To send any type of request to a block device, a struct request container must be allocated first.
S
sleep No available request structures were available, so the issuer has to wait for one to be freed.
P – plug When io is queued to a previously empty block device queue, Linux will plug the queue
in anticipation of future ios being added before this data is needed.
U – unplug Some request data already queued
in the device, start sending requests to the driver. This may happen automatically if a timeout period has passed (see next entry) or if a number of requests have been added to the queue.
T – unplug due to timer If nobody requests the io that was queued after plugging the queue, Linux will automatically unplug it after a defined period has passed.
X
split On raid or device mapper setups, an incoming io may straddle a device or internal zone and needs to be chopped up into smaller pieces for service. This may indicate a performance problem due to a bad setup of that raid/dm device, but may also just be part of normal boundary conditions. dm is notably bad at this and will clone lots of io.
A – remap For stacked devices, incoming io is remapped to device below it
in the io stack. The remap action details what exactly is being remapped to what.

参考:

http://bean-li.github.io/blktrace-to-report/

http://linuxperf.com/?p=161

http://bbs.chinaunix.net/thread-1976867-1-1.html

http://www.cnblogs.com/cobbliu/p/7002933.html

/usr/share/doc/blktrace-1.0.1/blktrace.pdf
/usr/share/doc/blktrace-1.0.1/btt.pdf

http://blog.yufeng.info/archives/751

http://duch.mimuw.edu.pl/~lichota/09-10/Optymalizacja-open-source/Materialy/10%20-%20Dysk/gelato_ICE06apr_blktrace_brunelle_hp.pdf

http://fibrevillage.com/storage/539-blktrace-and-btt-example-to-debug-and-tune-disk-i-o-on-linux