I've recently become interested in snapshots and did a first pass over the dm thin-provisioning code. The overall design feels quite good; the main weakness is performance, since a lot of data gets written to the metadata device.
Still, the overall approach is well worth studying.
thin-provision is a device mapper target that provides a particular kind of mapping for storage devices. Such devices have the following properties:
(1) Multiple virtual devices can live in the same data volume, so they share data and save space.
(2) Snapshots of arbitrary depth are supported. The previous implementation was O(n) in the snapshot depth; the new implementation uses a separate data structure so performance no longer degrades as the snapshot depth grows.
(3) Metadata can be stored on a separate device, so it can be placed on a mirrored device or on a faster SSD.
There are two ways to create a dm thin-provisioned device: with the dmsetup tool, or with the lvm management tools.
Method 1: with dmsetup
a: Create the pool
# dmsetup create pool \
--table "0 20971520 thin-pool $metadata_dev $data_dev \
$data_block_size $low_water_mark"
# dmsetup create yy_thin_pool --table '0 409600 thin-pool /dev/loop6 /dev/loop7 128 0'
# dmsetup table /dev/mapper/yy_thin_pool
0 409600 thin-pool 7:6 7:7 128 0 0
b: Create a thin volume
# dmsetup message /dev/mapper/yy_thin_pool 0 "create_thin 0"
# dmsetup create thin --table "0 40960 thin /dev/mapper/yy_thin_pool 0"
# dmsetup table /dev/mapper/thin
0 40960 thin 253:3 0
c: Create a snapshot
# dmsetup suspend /dev/mapper/thin
# dmsetup message /dev/mapper/yy_thin_pool 0 "create_snap 1 0"
# dmsetup resume /dev/mapper/thin
# dmsetup create snap --table "0 40960 thin /dev/mapper/yy_thin_pool 1"
Method 2: with lvm
a: Create the thin pool
# dd if=/dev/zero of=lvm0.img bs=1024k count=256
# losetup /dev/loop7 lvm0.img
# pvcreate /dev/loop7
Physical volume "/dev/loop7" successfully created
# vgcreate vg_test /dev/loop7
Volume group "vg_test" successfully created
# lvcreate -L 200M -T vg_test/mythinpool
Logical volume "lvol0" created
Logical volume "mythinpool" created
# ls /dev/mapper/* |grep mythin
/dev/mapper/vg_test-mythinpool
/dev/mapper/vg_test-mythinpool_tdata
/dev/mapper/vg_test-mythinpool_tmeta
/dev/mapper/vg_test-mythinpool-tpool
b: Create a thin volume
# lvcreate -T vg_test/mythinpool -V 300M -n lvol1
Logical volume "lvol1" created
c: Create a snapshot
# lvcreate -s --name mysnapshot1 vg_test/lvol1
Logical volume "mysnapshot1" created
thin-provision records the mapping for every block that is written; the mapping itself is stored on the metadata device. The block size is between 64KB and 1GB, is passed in when the pool is created, and is validated in pool_ctr:
/*
* The block size of the device holding pool data must be
* between 64KB and 1GB.
*/
#define DATA_DEV_BLOCK_SIZE_MIN_SECTORS (64 * 1024 >> SECTOR_SHIFT)
#define DATA_DEV_BLOCK_SIZE_MAX_SECTORS (1024 * 1024 * 1024 >> SECTOR_SHIFT)
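As a quick sanity check, here is a small user-space snippet (my own, not from the kernel) that evaluates the two limits above, assuming the kernel's standard 512-byte sector, i.e. SECTOR_SHIFT = 9. It also shows why the data_block_size of 128 passed in the dmsetup example earlier means 64KB blocks.

#include <stdio.h>

#define SECTOR_SHIFT 9  /* 512-byte sectors, as in the kernel */

int main(void)
{
	unsigned long min_sectors = (64UL * 1024) >> SECTOR_SHIFT;            /* 128 sectors */
	unsigned long max_sectors = (1024UL * 1024 * 1024) >> SECTOR_SHIFT;   /* 2097152 sectors */

	printf("min data block size: %lu sectors = %lu KB\n",
	       min_sectors, (min_sectors << SECTOR_SHIFT) / 1024);
	printf("max data block size: %lu sectors = %lu MB\n",
	       max_sectors, (max_sectors << SECTOR_SHIFT) / (1024 * 1024));
	return 0;
}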
Regardless of whether the request is a read or a write, the first step is to look up the block in the metadata, e.g. thin_bio_map -> dm_thin_find_block. If a read finds no block, nothing has ever been written to that location, so the request should return all zeros.
If a write finds no block, a new entry can be inserted into the metadata btree, so that the next lookup returns the correct block information.
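To make that concrete, here is a toy user-space model (my own sketch, not kernel code, with an array standing in for the metadata btree): a read of an unmapped virtual block returns zeros, and a write allocates a data block and records a new mapping so later lookups succeed.

#include <stdio.h>
#include <string.h>

#define NR_VIRT_BLOCKS 8
#define BLOCK_SIZE     4                 /* bytes per block, tiny for the demo */

static int map[NR_VIRT_BLOCKS];          /* virtual block -> data block, -1 = unmapped */
static char data[NR_VIRT_BLOCKS][BLOCK_SIZE];
static int next_free_block;

static void read_block(int vblock, char *buf)
{
	if (map[vblock] < 0) {
		memset(buf, 0, BLOCK_SIZE);  /* never written: return zeros */
		return;
	}
	memcpy(buf, data[map[vblock]], BLOCK_SIZE);
}

static void write_block(int vblock, const char *buf)
{
	if (map[vblock] < 0)
		map[vblock] = next_free_block++;  /* insert a new mapping */
	memcpy(data[map[vblock]], buf, BLOCK_SIZE);
}

int main(void)
{
	char buf[BLOCK_SIZE];

	memset(map, -1, sizeof(map));

	read_block(3, buf);                  /* unmapped: all zeros */
	printf("before write: %d\n", buf[0]);

	write_block(3, "abcd");              /* allocates data block 0, maps 3 -> 0 */
	read_block(3, buf);
	printf("after write:  %c\n", buf[0]);
	return 0;
}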
A snapshot is itself a thin device, just like the thin device it was taken from; the difference is that part of it can be shared. So what happens when a snapshot is created for a thin device? (A toy model of the resulting time check is sketched after the list below.)
a: Creating a snapshot creates a new device with a new thin_id, plus a time value recording the current snapshotted_time. If the block found by a lookup has a time smaller than the device's time, the block is shared, i.e. it is an older version.
b: If the block is shared, the lookup has already produced the block number; a read can be remapped to it directly, while a write must first break the sharing by copying to a freshly allocated block.
c: If no block is found at all, a new block is allocated and mapped to the current bio, and the resulting mapping is written back to the metadata device.
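Here is a toy model (again my own sketch, not kernel code) of the time check described above: each mapping records the pool time at which it was created, taking a snapshot bumps the device's snapshotted_time, any older mapping is therefore shared, and a write to a shared block first has to be copied to a freshly allocated block.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

struct mapping {
	uint64_t block;   /* data block backing this virtual block */
	uint32_t time;    /* pool time when the mapping was created */
};

static uint32_t pool_time = 1;

/* Same idea as __snapshotted_since() in dm-thin.c. */
static bool block_is_shared(const struct mapping *m, uint32_t snapshotted_time)
{
	return m->time < snapshotted_time;
}

int main(void)
{
	struct mapping m = { .block = 7, .time = pool_time };
	uint32_t origin_snapshotted_time;

	/* Taking a snapshot: bump the pool time and remember it on the
	 * origin device (and on the new snapshot device). */
	pool_time++;
	origin_snapshotted_time = pool_time;

	if (block_is_shared(&m, origin_snapshotted_time))
		printf("block %llu is shared: a write must break sharing "
		       "by copying it to a new block\n",
		       (unsigned long long)m.block);

	/* After breaking sharing, the new mapping carries the current time,
	 * so it is no longer considered shared. */
	m.block = 8;
	m.time = pool_time;
	printf("shared after copy? %d\n",
	       block_is_shared(&m, origin_snapshotted_time));
	return 0;
}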
struct pool_c represents the pool that was created first:
struct pool_c {
	struct dm_target *ti;
	struct pool *pool;
	struct dm_dev *data_dev;
	struct dm_dev *metadata_dev;
	struct dm_target_callbacks callbacks;

	dm_block_t low_water_blocks;
	struct pool_features requested_pf; /* Features requested during table load */
	struct pool_features adjusted_pf;  /* Features used after adjusting for constituent devices */
};
struct thin_c represents a thin device that has been created:
struct thin_c {
	struct list_head list;
	struct dm_dev *pool_dev;
	struct dm_dev *origin_dev;
	sector_t origin_size;
	dm_thin_id dev_id;

	struct pool *pool;
	struct dm_thin_device *td;
	struct mapped_device *thin_md;

	bool requeue_mode:1;
	spinlock_t lock;
	struct list_head deferred_cells;
	struct bio_list deferred_bio_list;
	struct bio_list retry_on_resume_list;
	struct rb_root sort_bio_list; /* sorted list of deferred bios */

	/*
	 * Ensures the thin is not destroyed until the worker has finished
	 * iterating the active_thins list.
	 */
	atomic_t refcount;
	struct completion can_destroy;
};
The btrees kept in the metadata (fields of struct dm_pool_metadata):
/*
 * Two-level btree.
 * First level holds thin_dev_t.
 * Second level holds mappings.
 */
struct dm_btree_info info;         /* two-level tree: devices on top, mappings below */

/*
 * Non-blocking version of the above.
 */
struct dm_btree_info nb_info;      /* version that does not block on I/O */

/*
 * Just the top level for deleting whole devices.
 */
struct dm_btree_info tl_info;      /* top level only: the device tree */

/*
 * Just the bottom level for creating new devices.
 */
struct dm_btree_info bl_info;      /* bottom level only: the block mapping tree */

/*
 * Describes the device details btree.
 */
struct dm_btree_info details_info; /* tree describing per-device details */

pool_message -> process_create_thin_mesg -> dm_pool_create_thin -> __create_thin -> dm_btree_insert -> insert -> btree_insert_raw
pool_message -> process_create_snap_mesg -> dm_pool_create_snap -> __create_snap -> dm_btree_insert -> __set_snapshot_details
The snapshot-creation path really ought to reuse the thin-creation path, ideally by just passing an extra parameter; many of the functions are identical, which is duplicated code.
The I/O path, starting from the block device layer:
Request-based I/O takes the following path:
table_load -> dm_setup_md_queue -> dm_init_request_based_queue -> dm_request_fn -> queue_kthread_work ->
kthread_worker_fn -> work->func()
The tio work is set up in init_tio via init_kthread_work(&tio->work, map_tio_request);
map_tio_request -> ti->type->map_rq(ti, clone, &tio->info);
When a request is mapped, the clone is obtained first and then the target's map_rq is called.
Bio-based I/O takes the following path:
table_load -> dm_setup_md_queue -> blk_queue_make_request(md->queue, dm_make_request);
dm_make_request -> __split_and_process_bio -> __split_and_process_non_flush -> __clone_and_map_data_bio
__map_bio -> ti->type->map(ti, clone);
thin_bio_map -> dm_thin_find_block -> dm_btree_lookup
Check whether the current block was found; if so, fill in the block's state:
result->block = exception_block;
result->shared = __snapshotted_since(td, exception_time);
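Where exception_block and exception_time come from: each leaf of the mapping btree stores the data block number and the creation time packed into a single 64-bit value (block in the upper bits, a 24-bit time in the lower bits), looked up with the two keys { device id, virtual block }. Below is a small user-space restatement of pack_block_time()/unpack_block_time() from dm-thin-metadata.c (simplified, my paraphrase):

#include <stdio.h>
#include <stdint.h>

static uint64_t pack_block_time(uint64_t block, uint32_t time)
{
	return (block << 24) | time;
}

static void unpack_block_time(uint64_t v, uint64_t *block, uint32_t *time)
{
	*block = v >> 24;
	*time = v & ((1 << 24) - 1);
}

int main(void)
{
	uint64_t value = pack_block_time(1234, 3);  /* data block 1234, created at time 3 */
	uint64_t exception_block;
	uint32_t exception_time;

	unpack_block_time(value, &exception_block, &exception_time);
	printf("block=%llu time=%u\n",
	       (unsigned long long)exception_block, exception_time);
	return 0;
}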
__snapshotted_since compares this time against the device's own time; each time a snapshot is taken, the device's time value is incremented by one.
/*
 * Check whether @time (of block creation) is older than @td's last snapshot.
 * If so then the associated block is shared with the last snapshot device.
 * Any block on a device created *after* the device last got snapshotted is
 * necessarily not shared.
 */
static bool __snapshotted_since(struct dm_thin_device *td, uint32_t time)
{
	return td->snapshotted_time > time;
}
The shared state matters a lot: if the lookup result comes back shared, the two devices can share that data.
If the result is not shared, the bio can be remapped straight to the block that was found (when it is shared, a write first has to break the sharing by copying to a newly allocated block):
build_data_key(tc->td, result.block, &key);
if (bio_detain(tc->pool, &key, bio, &data_cell)) {
	cell_defer_no_holder(tc, virt_cell);
	return DM_MAPIO_SUBMITTED;
}
The remap itself just combines the block address with the offset of the current request into a new address and passes the bio down:
static void remap(struct thin_c *tc, struct bio *bio, dm_block_t block)
{
	struct pool *pool = tc->pool;
	sector_t bi_sector = bio->bi_sector;

	bio->bi_bdev = tc->pool_dev->bdev;
	if (block_size_is_power_of_two(pool))
		bio->bi_sector = (block << pool->sectors_per_block_shift) |
				 (bi_sector & (pool->sectors_per_block - 1));
	else
		bio->bi_sector = (block * pool->sectors_per_block) +
				 sector_div(bi_sector, pool->sectors_per_block);
}
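A worked example of the address calculation in remap(), assuming 64KB data blocks (sectors_per_block = 128): a bio aimed at virtual sector 643 (virtual block 5, offset 3) whose lookup returned data block 9 gets re-issued at sector 9 * 128 + 3 on the data device.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t sectors_per_block = 128;  /* 64KB data blocks */
	uint64_t block = 9;                /* result.block from the lookup */
	uint64_t bi_sector = 643;          /* original bio->bi_sector: virtual block 5, offset 3 */

	uint64_t new_sector = block * sectors_per_block +
			      bi_sector % sectors_per_block;

	printf("remapped sector on the data device: %llu\n",
	       (unsigned long long)new_sector);  /* prints 1155 */
	return 0;
}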
If no block was found, whether a new block has to be allocated before the bio is sent down depends on the situation:
case -ENODATA:
case -EWOULDBLOCK:
	thin_defer_cell(tc, virt_cell);
	return DM_MAPIO_SUBMITTED;
thin_defer_cell -> wake_worker -> do_worker
static void do_worker(struct work_struct *ws)
{
	struct pool *pool = container_of(ws, struct pool, worker);

	throttle_work_start(&pool->throttle);
	dm_pool_issue_prefetches(pool->pmd);
	throttle_work_update(&pool->throttle);
	process_prepared(pool, &pool->prepared_mappings, &pool->process_prepared_mapping);
	throttle_work_update(&pool->throttle);
	process_prepared(pool, &pool->prepared_discards, &pool->process_prepared_discard);
	throttle_work_update(&pool->throttle);
	process_deferred_bios(pool);
	throttle_work_complete(&pool->throttle);
}
What do_worker does is set up the new mapping and then send the bio down:
process_prepared_mapping -> dm_thin_insert_block -> remap_and_issue
The core is the remap function described above; the issue path is fairly simple:
static void issue(struct thin_c *tc, struct bio *bio)
{
	struct pool *pool = tc->pool;
	unsigned long flags;

	if (!bio_triggers_commit(tc, bio)) {
		generic_make_request(bio);
		return;
	}

	/*
	 * Complete bio with an error if earlier I/O caused changes to
	 * the metadata that can't be committed e.g, due to I/O errors
	 * on the metadata device.
	 */
	if (dm_thin_aborted_changes(tc->td)) {
		bio_io_error(bio);
		return;
	}

	/*
	 * Batch together any bios that trigger commits and then issue a
	 * single commit for them in process_deferred_bios().
	 */
	spin_lock_irqsave(&pool->lock, flags);
	bio_list_add(&pool->deferred_flush_bios, bio);
	spin_unlock_irqrestore(&pool->lock, flags);
}
dm-thin-provisioning records request mappings in a two-level btree keyed by (thin_id, LBA) on the metadata device. When a snapshot is taken of a thin device, a time value records that snapshot, and during remapping this time value is compared to decide whether the current block is shared. This squeezes the most out of the block space and is well suited to container-like workloads.
The downside is that exporting incremental changes is likely to be awkward.