内存管理篇————内存管理机制

linux 的内存管理是行进了相当长的一段时间的，此期间我们需要理解的有 bootmem, memblock, buddy system, slab/slub。这些机制也分别出现在内核初始化的不同阶段。

bootmem: 早期启动阶段的内存管理机制，后来被 memblock 取代了
memblock: 早期启动阶段的内存管理机制，负责内存初始化时的内存管理
buddy system: 内核运行期间的物理页分配器，是页级分配的核心
slab/slub: 相比于 buddy system，这些是非页级的连续字节内存分配的核心机制

1 bootmem

在内核初始化的初期，linux 建立了一个非常简单的内存管理机制，bootmem。在此之前，并没有我们所说的页分配机制。

据查看文档可知，v4.20-rc1 版本开始，该机制就被 memblock 代替了。

LWN 上的链接: https://lwn.net/Articles/764197/

这里看 https://elixir.bootlin.com/linux/v4.19.325/A/ident/bootmem_data，觉得可能是 bootmem 的最后一舞。

这里笔者觉得知道有这个机制就可以了，因为知识点已经过时了，作为一个 TODO。

2 memblock

该机制由 Yinghai Lu 在 2010 年提出，https://lkml.org/lkml/2010/7/13/114。

该机制主要将内存分为了两部分: memblock.memory和memblock.reserved。

物理上可用的内存区域都会被添加到memblock.memory。而被分配或者被系统占用的区域则会添加到memblock.reserved。但是呢？假如某块内存被分配了它也不会从 memblock.memory 当中被剔除掉。

memblock 的初始化需要依赖一些硬件，因为要对内存进行控制，必然就需要获取内存的硬件信息。这里不同架构的实现不太一致，比如 x86 会依赖 E820/EFI, 而 arm64 依赖 Device Tree/UEFI(或者说ACPI)，而 riscv64 则依赖 Device Tree/UEFI。

2.1 memblock 对外的 API

memblock_add, memblock_remove: 这两个函数只改变 memblock.memory
memblock_alloc, memblock_free: 这两个函数只改变 memblock.reserved

memblock_add_range: 添加一个区域到 memblock 当中(memory/reserved)

情况一: 此时 rbase 和 rend 是某个区域的内存，base 和 end 是我们将要添加的区域，区域之间不重叠
情况二: 我们要插入的区域被包含在 rbase 和 rend 之间
情况三: 此时我们要插入的区域在左侧与 rbase 和 rend 部分重叠
情况四: 此时我们要插入的区域在右侧与 rbase 和 rend 部分重叠
情况五: 此时我们要插入的区域包含了旧的区域

函数解析: https://elixir.bootlin.com/linux/v7.1-rc7/source/mm/memblock.c#L611，

static int __init_memblock memblock_add_range(struct memblock_type *type,
    phys_addr_t base, phys_addr_t size,
    int nid, enum memblock_flags flags)
{
    bool insert = false;
    phys_addr_t obase = base;
    phys_addr_t end = base + memblock_cap_size(base, &size);
    int idx, nr_new, start_rgn = -1, end_rgn;
    struct memblock_region *rgn;

    if (!size)
        return 0;

    /* special case for empty array */
    if (type->regions[0].size == 0) {
        WARN_ON(type->cnt != 0 || type->total_size);
        type->regions[0].base = base;
        type->regions[0].size = size;
        type->regions[0].flags = flags;
        memblock_set_region_node(&type->regions[0], nid);
        type->total_size = size;
        type->cnt = 1;
        return 0;
    }

    /*
    * The worst case is when new range overlaps all existing regions,
    * then we'll need type->cnt + 1 empty regions in @type. So if
    * type->cnt * 2 + 1 is less than or equal to type->max, we know
    * that there is enough empty regions in @type, and we can insert
    * regions directly.
    */
    if (type->cnt * 2 + 1 <= type->max)
        insert = true;

repeat:
    /*
    * The following is executed twice.  Once with %false @insert and
    * then with %true.  The first counts the number of regions needed
    * to accommodate the new area.  The second actually inserts them.
    */
    base = obase;
    nr_new = 0;

    for_each_memblock_type(idx, type, rgn) {
        phys_addr_t rbase = rgn->base;
        phys_addr_t rend = rbase + rgn->size;

        if (rbase >= end)
            break;
        if (rend <= base)
            continue;
        /*
        * @rgn overlaps.  If it separates the lower part of new
        * area, insert that portion.
        */
        if (rbase > base) {
#ifdef CONFIG_NUMA
            WARN_ON(nid != memblock_get_region_node(rgn));
#endif
            WARN_ON(flags != MEMBLOCK_NONE && flags != rgn->flags);
            nr_new++;
            if (insert) {
                if (start_rgn == -1)
                    start_rgn = idx;
                end_rgn = idx + 1;
                memblock_insert_region(type, idx++, base,
                            rbase - base, nid,
                            flags);
            }
        }
        /* area below @rend is dealt with, forget about it */
        base = min(rend, end);
    }
    bool insert = false;
    phys_addr_t obase = base;
    phys_addr_t end = base + memblock_cap_size(base, &size);
    int idx, nr_new, start_rgn = -1, end_rgn;
    struct memblock_region *rgn;

    if (!size)
        return 0;

    /* special case for empty array */
    if (type->regions[0].size == 0) {
        WARN_ON(type->cnt != 0 || type->total_size);
        type->regions[0].base = base;
        type->regions[0].size = size;
        type->regions[0].flags = flags;
        memblock_set_region_node(&type->regions[0], nid);
        type->total_size = size;
        type->cnt = 1;
        return 0;
    }

    /*
    * The worst case is when new range overlaps all existing regions,
    * then we'll need type->cnt + 1 empty regions in @type. So if
    * type->cnt * 2 + 1 is less than or equal to type->max, we know
    * that there is enough empty regions in @type, and we can insert
    * regions directly.
    */
    if (type->cnt * 2 + 1 <= type->max)
        insert = true;

repeat:
    /*
    * The following is executed twice.  Once with %false @insert and
    * then with %true.  The first counts the number of regions needed
    * to accommodate the new area.  The second actually inserts them.
    */
    base = obase;
    nr_new = 0;

    for_each_memblock_type(idx, type, rgn) {
        phys_addr_t rbase = rgn->base;
        phys_addr_t rend = rbase + rgn->size;

        if (rbase >= end)
            break;
        if (rend <= base)
            continue;
        /*
        * @rgn overlaps.  If it separates the lower part of new
        * area, insert that portion.
        */
        if (rbase > base) {
#ifdef CONFIG_NUMA
            WARN_ON(nid != memblock_get_region_node(rgn));
#endif
            WARN_ON(flags != MEMBLOCK_NONE && flags != rgn->flags);
            nr_new++;
            if (insert) {
                if (start_rgn == -1)
                    start_rgn = idx;
                end_rgn = idx + 1;
                memblock_insert_region(type, idx++, base,
                            rbase - base, nid,
                            flags);
            }
        }
        /* area below @rend is dealt with, forget about it */
        base = min(rend, end);
    }

    /* insert the remaining portion */
    if (base < end) {
        nr_new++;
        if (insert) {
            if (start_rgn == -1)
                start_rgn = idx;
            end_rgn = idx + 1;
            memblock_insert_region(type, idx, base, end - base,
                        nid, flags);
        }
    }

    if (!nr_new)
        return 0;

    /*
    * If this was the first round, resize array and repeat for actual
    * insertions; otherwise, merge and return.
    */
    if (!insert) {
        while (type->cnt + nr_new > type->max)
            if (memblock_double_array(type, obase, size) < 0)
                return -ENOMEM;
        insert = true;
        goto repeat;
    } else {
        memblock_merge_regions(type, start_rgn, end_rgn);
        return 0;
    }
}

    /* insert the remaining portion */
    if (base < end) {
        nr_new++;
        if (insert) {
            if (start_rgn == -1)
                start_rgn = idx;
            end_rgn = idx + 1;
            memblock_insert_region(type, idx, base, end - base,
                        nid, flags);
        }
    }

    if (!nr_new)
        return 0;

    /*
    * If this was the first round, resize array and repeat for actual
    * insertions; otherwise, merge and return.
    */
    if (!insert) {
        while (type->cnt + nr_new > type->max)
            if (memblock_double_array(type, obase, size) < 0)
                return -ENOMEM;
        insert = true;
        goto repeat;
    } else {
        memblock_merge_regions(type, start_rgn, end_rgn);
        return 0;
    }
}

memblock_find_in_range_node: 通过 for_each_free_mem_range_reverse 找到空闲空间
memblock_alloc_range_nid: 通过 memblock_find_in_range_node 找到空闲空间
for_each_mem_pfn_range: 遍历 memory 区域
for_each_free_mem_range：同时遍历 memory 和 reserved 区域，但是找出在 memory 当中但是不在 reserved 的

2.2 debug

在 kernel 的启动参数中加入 memblock=debug

2.3 memblock 的结构

我们将 memblock 的整体结构进行一个简单的呈现，大概如下。

struct memblock {
	bool bottom_up;  /* is bottom up direction? */
	phys_addr_t current_limit;
	struct memblock_type memory;
	struct memblock_type reserved;
};

struct memblock_type {
	unsigned long cnt;
	unsigned long max;
	phys_addr_t total_size;
	struct memblock_region *regions; // 很关键，该结构用来划分不同的区域，是一个数组结构
	char *name;
};

struct memblock_region {
	phys_addr_t base; // 物理地址，区域的开头
	phys_addr_t size; // 物理地址，区域的大小
	enum memblock_flags flags; // 区域的相关属性
#ifdef CONFIG_NUMA
	int nid;
#endif
};

enum memblock_flags {
	MEMBLOCK_NONE		= 0x0,	/* No special request */
	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
	MEMBLOCK_DRIVER_MANAGED = 0x8,	/* always detected via a driver */
	MEMBLOCK_RSRV_NOINIT	= 0x10,	/* don't initialize struct pages */
	MEMBLOCK_RSRV_KERN	= 0x20,	/* memory reserved for kernel use */
	MEMBLOCK_KHO_SCRATCH	= 0x40,	/* scratch memory for kexec handover */
};

static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_MEMORY_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS] __initdata_memblock;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
static struct memblock_region memblock_physmem_init_regions[INIT_PHYSMEM_REGIONS];
#endif

struct memblock memblock __initdata_memblock = {
	.memory.regions		= memblock_memory_init_regions,
	.memory.max		    = INIT_MEMBLOCK_MEMORY_REGIONS,
	.memory.name		= "memory",

	.reserved.regions	= memblock_reserved_init_regions,
	.reserved.max		= INIT_MEMBLOCK_RESERVED_REGIONS,
	.reserved.name		= "reserved",

	.bottom_up		    = false,
	.current_limit		= MEMBLOCK_ALLOC_ANYWHERE,
};

#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
struct memblock_type physmem = {
	.regions		= memblock_physmem_init_regions,
	.max			= INIT_PHYSMEM_REGIONS,
	.name			= "physmem",
};
#endif

memblock 的大小最开始是有一定限度的，我们可以通过 memblock_double_array 函数将数组放大。此时的策略会用到 slab/slub，如果此时 slab/slub 已经初始化完成，那么我们将使用 kmalloc 来扩充数组的空间，否则需要使用 memblock 来扩充空间。

alt text

3 buddy system

对于 buddy system，前阵子在刷视频的时候看到了一个很通俗易懂的评论:

”伙伴分配器就是由一系列表示不同层级的链表构成的，比如0到10层，第0层就是单个page页链接而成的链表，第一层就是两个连续page链接的链表，第2层就是4个，以此类推，当用户需要一个page的时候，优先从第0层找，如果第0层不够了就去第一层，这个时候分配器就会把第1层的两个连续page拆成两个，一个给用户，一个下放到第0层链表。当用户释放内存页时，先放到第0层，如果分配器发现有两个物理页号相连，并且合并后的起始地址是合并后块大小的整数倍，就会合并起来放到第一层，如果再在第一层发现相连，就合并放到第二层，以此类推。“

这是笔者对 buddy system 初次最直观的感受。

buddy system 是基于 zone 来管理系统的，之前在“内存模型”篇章提到过 node 与 cpu 相对应，node 内部又被划分为很多 zone, 而现在每个 zone 的内部都对应着一个 buddy system。

buddy system 的主要思想就是将所有的空闲页帧分为 NR_PAGE_ORDERS 组链表，依次表示 1，2，4，8，16，32，64，128，256……个连续的页帧组成的页框，每一个页框的物理地址是该块的整数倍，例如 32 个页帧的页框，其起始地址是 32 * 2 ^ 12 的倍数.

分配内存: 找到最小的足够适配的内存块进行分配，如果不够则找到更大的块，并进行分裂操作
释放内存: 检查它的伙伴块是否也空闲，如果空闲则合并，直到不能再合并为止

根据 https://elixir.bootlin.com/linux/v7.1-rc7/source/include/linux/mmzone.h#L967，我们得知了 zone 的结构，其中有一个链表结构为:

/* Free memory management - zoned buddy allocator.  */
#ifndef CONFIG_ARCH_FORCE_MAX_ORDER
#define MAX_PAGE_ORDER 10
#else
#define MAX_PAGE_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
#endif
#define MAX_ORDER_NR_PAGES (1 << MAX_PAGE_ORDER)
......
#define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)

struct zone {
    ......
    struct free_area	free_area[NR_PAGE_ORDERS]; // 在之前的版本可能该宏为 MAX_ORDER, 这里一般情况下可以看见为 11
    ......
}

struct free_area {
	struct list_head	free_list[MIGRATE_TYPES]; // 此处涉及到了迁徙类型（migration types），将内存划分为不同的类型，不同的类型单独呈现成一串链表，TODO
	unsigned long		nr_free;
};

buddy system 的结构大致如下，对每个 zone，内核都会维护一个 free_area[NR_PAGE_ORDERS] 数组。

其中：

free_area[0]：管理 2^0 = 1 页大小的空闲块
free_area[1]：管理 2^1 = 2 页大小的空闲块
free_area[2]：管理 2^2 = 4 页大小的空闲块
…
free_area[n]：管理 2^n 页大小的空闲块

因此，free_area 可以理解为：按不同规格组织的空闲页块链表集合。

alt text

3.1 内存分裂

假设现在情况如下:

free_area[3] 存在一块空闲块，并且现在这个空闲块覆盖 0~7 的页帧
目前我想申请 2 块页大小的内存

由于 2^0 是 1, 2^1 是 2, 现在优先查找 free_area[1], 但是发现没有可用的块，于是查找到 free_area[3] 发现了存在一块大小合适的区域。此时会将 free_area[3] 拆分成 [0-3] 和 [4-7]，此时这两块互为 buddy，其中一个挂载到了 free_area[2]，另一个继续拿去拆分。此时拆分成 [0-1] 和 [2-3]，这两块也互为 buddy，[0-1] 分配给请求者，[2-3] 挂给 free_area[1]。

这里如果申请的页数不是 2^n 呢？比如说申请 3 块页大小，那么就会向上取到 2 的幂数个页，这里就为 4。

图示为分裂前。

alt text

图示为分裂后。

alt text

3.2 内存合并

假设 3.1 当中的 [0-1] 被释放了，内核首先会找到它的 buddy，此处它的 buddy 为 [2-3],如果 [2-3] 此时是空闲的，那么就会合并成 [0-3]，此处变成了规格为 2 的块，然后继续往上走，如果它们的 buddy [4-7] 空闲，就继续合并，以此类推。

3.3 实例

我们查看当前系统当中 buddy system 的情况。

$ cat /proc/buddyinfo 
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      2      2 
Node 0, zone    DMA32   5253   2129    925    592    257    120     74     46     16      9      3 
Node 0, zone   Normal   2486   3197   1063    507    438    146    168     55     60     18     16

其中横坐标的数字我们可以看到刚好是 NR_PAGE_ORDERS 的数目，这依次代表 buddy system 在对应规格还剩下的空闲块数。

4 slab/slub

前面提到了 buddy system 的分配单位对应的是页，有时候我们需要更为细致的方式来对内存进行管理。于是基于 buddy system，我们进一步拓展出了一个更小颗粒度的分配器。

这个分配机制是 slab/slub，此处 slub 是对 slab 的进一步优化，还有一个成为 slob 的分配器，这是用于嵌入式场景下的。

slab 的层级关系如下，对于不同类型的对象，我们转门为其在内存中分配了一个 slab cache，其维护着很多个 slab，而 slab 是从 buddy system 当中申请的以页为单位的空间，每个 slab 下面又存储着多个 object，每个 object 都可以分配出去。

alt text

对于 slab/slub，我们始终要维系着一个称为 object 的概念。

这里的 object 指的是具有某些专有用途，大小一致的一类抽象概念。因此从逻辑上来讲，slab/slub 分配器管理的就是一个以用途和大小为组织关系的数据结构集合，这类集合就是一个一个的 object。具体的记录我们可以通过以下方式查看。

$ sudo cat /proc/slabinfo 
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nf_conntrack_expect      0      0    208   39    2 : tunables    0    0    0 : slabdata      0      0      0
nf_conntrack         355    480    256   32    2 : tunables    0    0    0 : slabdata     15     15      0
QIPCRTR               39     39    832   39    8 : tunables    0    0    0 : slabdata      1      1      0
i915_dependency      256    256    128   32    1 : tunables    0    0    0 : slabdata      8      8      0
execute_cb             0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
i915_request         894    989    704   23    4 : tunables    0    0    0 : slabdata     43     43      0
drm_i915_gem_object   1472   2184   1152   28    8 : tunables    0    0    0 : slabdata     78     78      0
intel_context        210    210    768   21    4 : tunables    0    0    0 : slabdata     10     10      0
kvm_vcpu               3      3   9216    3    8 : tunables    0    0    0 : slabdata      1      1      0
x86_emulator          12     12   2672   12    8 : tunables    0    0    0 : slabdata      1      1      0
ovl_inode             69     69    712   23    4 : tunables    0    0    0 : slabdata      3      3      0
ext4_groupinfo_4k   7568   7568    184   22    1 : tunables    0    0    0 : slabdata    344    344      0
fsverity_info          0      0    264   31    2 : tunables    0    0    0 : slabdata      0      0      0
fscrypt_inode_info      0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
MPTCPv6                0      0   2176   15    8 : tunables    0    0    0 : slabdata      0      0      0
ip6-frags              0      0    184   22    1 : tunables    0    0    0 : slabdata      0      0      0
PINGv6                 0      0   1216   26    8 : tunables    0    0    0 : slabdata      0      0      0
......
kmalloc_buckets       36     36    112   36    1 : tunables    0    0    0 : slabdata      1      1      0
kmalloc-cg-8k         32     32   8192    4    8 : tunables    0    0    0 : slabdata      8      8      0
kmalloc-cg-4k        224    240   4096    8    8 : tunables    0    0    0 : slabdata     30     30      0
kmalloc-cg-2k       1014   1152   2048   16    8 : tunables    0    0    0 : slabdata     72     72      0
kmalloc-cg-1k        793    832   1024   32    8 : tunables    0    0    0 : slabdata     26     26      0
kmalloc-cg-512      1335   1344    512   32    4 : tunables    0    0    0 : slabdata     42     42      0
kmalloc-cg-256       352    352    256   32    2 : tunables    0    0    0 : slabdata     11     11      0
kmalloc-cg-128       352    352    128   32    1 : tunables    0    0    0 : slabdata     11     11      0
kmalloc-cg-64       1344   1344     64   64    1 : tunables    0    0    0 : slabdata     21     21      0
kmalloc-cg-32       1408   1408     32  128    1 : tunables    0    0    0 : slabdata     11     11      0
kmalloc-cg-16       8446   8448     16  256    1 : tunables    0    0    0 : slabdata     33     33      0
kmalloc-cg-8        4096   4096      8  512    1 : tunables    0    0    0 : slabdata      8      8      0
kmalloc-cg-192      1134   1134    192   21    1 : tunables    0    0    0 : slabdata     54     54      0
......
kmalloc-rnd-01-8    4096   4096      8  512    1 : tunables    0    0    0 : slabdata      8      8      0
kmalloc-rnd-01-192    861    861    192   21    1 : tunables    0    0    0 : slabdata     41     41      0
kmalloc-rnd-01-96   2898   2898     96   42    1 : tunables    0    0    0 : slabdata     69     69      0
kmalloc-8k            32     32   8192    4    8 : tunables    0    0    0 : slabdata      8      8      0
kmalloc-4k           104    104   4096    8    8 : tunables    0    0    0 : slabdata     13     13      0
kmalloc-2k           384    384   2048   16    8 : tunables    0    0    0 : slabdata     24     24      0
kmalloc-1k           256    256   1024   32    8 : tunables    0    0    0 : slabdata      8      8      0
kmalloc-512          832    832    512   32    4 : tunables    0    0    0 : slabdata     26     26      0
kmalloc-256          288    288    256   32    2 : tunables    0    0    0 : slabdata      9      9      0
kmalloc-128          256    256    128   32    1 : tunables    0    0    0 : slabdata      8      8      0
kmalloc-64           512    512     64   64    1 : tunables    0    0    0 : slabdata      8      8      0
kmalloc-32          1664   1664     32  128    1 : tunables    0    0    0 : slabdata     13     13      0
kmalloc-16          2048   2048     16  256    1 : tunables    0    0    0 : slabdata      8      8      0
kmalloc-8           4096   4096      8  512    1 : tunables    0    0    0 : slabdata      8      8      0
kmalloc-192          357    357    192   21    1 : tunables    0    0    0 : slabdata     17     17      0
kmalloc-96           336    336     96   42    1 : tunables    0    0    0 : slabdata      8      8      0
kmem_cache_node      704    704     64   64    1 : tunables    0    0    0 : slabdata     11     11      0
kmem_cache           544    544    256   32    2 : tunables    0    0    0 : slabdata     17     17      0

当中我们可以看到内核中常用的 kmalloc 实际上就是一个个固定大小的 slab/slub，为了应对各种大小的 kmalloc，我们也划分了多种规格的 kmalloc。

我们来看看其他属性:

active_objs 表示 slab cache 中已经被分配出去的对象个数
num_objs 表示 slab cache 中容纳的对象总数
objsize 表示 slab 中对象的 object size ，单位为字节
objperslab 表示 slab 中可以容纳的对象个数
pagesperslab 表示 slab 所需要的物理内存页个数

我们也可以通过下面的指令来查看所有的 slab cache 在内存当中的占用总量。

1
2
3

$ sudo cat /proc/meminfo 
......
Slab:            2094372 kB

也可以通过 slabtop 命令查看系统当中各个 slab cache 的内存占用量。

slab/slub 的工作流程如下:

首先会从 buddy system 当中申请出一个或多个物理 page
根据对象的尺寸，计算出对象在内存当中的真实布局从而划分出多个内存块

4.1 slab 结构

大多数的 blog 通过 slab 来描述整个算法，但是由于笔者查看的是最新的源码，slab 已经完全被 slub 替换掉了，此处决定跟随时代，直接描述 slub。

// 这里我们查看的是 slub，现代的主线已经没有了 slab，只留下了 slub
// https://elixir.bootlin.com/linux/v7.1-rc7/source/mm/slab.h#L198

struct slab {
	memdesc_flags_t flags;

	struct kmem_cache *slab_cache;
	union {
		struct {
			struct list_head slab_list;
			/* Double-word boundary */
			struct freelist_counters;
		};
		struct rcu_head rcu_head;
	};

	unsigned int __page_type;
	atomic_t __page_refcount;
#ifdef CONFIG_SLAB_OBJ_EXT
	unsigned long obj_exts;
#endif
};

struct kmem_cache {
	struct slub_percpu_sheaves __percpu *cpu_sheaves; // SLUB fastpath 机制
	/* Used for retrieving partial slabs, etc. */
	slab_flags_t flags; // 这里给对象设置了一些属性
	unsigned long min_partial;
	unsigned int size;	// 对象在内存中的真实大小，包括对齐填充之后的
	unsigned int object_size;	// 对象在内存当中的真实大小，不包含填充之后的
	struct reciprocal_value reciprocal_size;
	unsigned int offset;		/* Free pointer offset */
	unsigned int sheaf_capacity;
	struct kmem_cache_order_objects oo; // 高 16 位存储 slab 所需要的物理内存页个数，低 16 位存储 slab 所能容纳的对象总数

	/* Allocation and freeing of slabs */
	struct kmem_cache_order_objects min; // 当 oo 的 slab 申请紧张的时候会采用 min 申请
	gfp_t allocflags;		/* gfp flags to use on each alloc */
	int refcount;			/* Refcount for slab cache destroy */
	void (*ctor)(void *object);	// 用于创建 slab 对象池中的对象
	unsigned int inuse;		/* Offset to metadata */
	unsigned int align;		/* Alignment */
	unsigned int red_left_pad;	/* Left redzone padding size */
	const char *name;		/* Name (only for display!) */
	struct list_head list;	// 串联所有类型的 slab cache
#ifdef CONFIG_SYSFS
	struct kobject kobj;		/* For sysfs */
#endif
#ifdef CONFIG_SLAB_FREELIST_HARDENED
	unsigned long random;
#endif

#ifdef CONFIG_NUMA
	/*
	 * Defragmentation by allocating from a remote node.
	 */
	unsigned int remote_node_defrag_ratio;
#endif

#ifdef CONFIG_SLAB_FREELIST_RANDOM
	unsigned int *random_seq;
#endif

#ifdef CONFIG_KASAN_GENERIC
	struct kasan_cache kasan_info;
#endif

#ifdef CONFIG_HARDENED_USERCOPY
	unsigned int useroffset;	/* Usercopy region offset */
	unsigned int usersize;		/* Usercopy region size */
#endif

#ifdef CONFIG_SLUB_STATS
	struct kmem_cache_stats __percpu *cpu_stats;
#endif

	struct kmem_cache_per_node_ptrs per_node[MAX_NUMNODES]; // Node 管理 slab 页面
};

// 以下是 cpu 为基础的 slab 管理

struct slub_percpu_sheaves {
	local_trylock_t lock;
	struct slab_sheaf *main; /* never NULL when unlocked */
	struct slab_sheaf *spare; /* empty or full, may be NULL */
	struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
};

struct slab_sheaf {
	union {
		struct rcu_head rcu_head;
		struct list_head barn_list;
		/* only used for prefilled sheafs */
		struct {
			unsigned int capacity;
			bool pfmemalloc;
		};
	};
	struct kmem_cache *cache;
	unsigned int size;
	int node; /* only used for rcu_sheaf */
	void *objects[];
};

// 以下是 node 为基础的 slab 管理

struct kmem_cache_per_node_ptrs {
	struct node_barn *barn;
	struct kmem_cache_node *node;
};

struct node_barn {
	spinlock_t lock;
	struct list_head sheaves_full; // 满的 sheaf
	struct list_head sheaves_empty; // 空的 sheaf
	unsigned int nr_full;
	unsigned int nr_empty;
};

struct kmem_cache_node {
	spinlock_t list_lock;
	unsigned long nr_partial; // slab 页面数量
	struct list_head partial; // slab 页面链表
#ifdef CONFIG_SLUB_DEBUG
	atomic_long_t nr_slabs;
	atomic_long_t total_objects;
	struct list_head full;
#endif
};

见图:

alt text

4.2 slab 对外 API

kmem_cache_create:

创建一个用于管理 slab 缓存的 kmem_cache 结构，并对该结构体进行初始化，最终添加到全局链表中。kmem_cache结构体初始化，包括了上文中的 slub_percpu_sheaves 和 kmem_cache_per_node_ptrs 两个字段结构。

kmem_cache_alloc:

该接口用于从 slab cache 中分配对象。

kmem_cache_free:

该接口用于从 slab cache 中释放对象。

我们来做一个 demo:

#include <linux/atomic.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/slab.h>
#include <linux/string.h>

#define SLUB_DEMO_MAX_ALLOCS 8
#define SLUB_DEMO_MAGIC 0x51ab51ab

static unsigned int alloc_count = 3;
module_param(alloc_count, uint, 0444);
MODULE_PARM_DESC(alloc_count, "How many objects to allocate from the custom kmem_cache");

static unsigned int kmalloc_request = 130;
module_param(kmalloc_request, uint, 0444);
MODULE_PARM_DESC(kmalloc_request, "Byte size used for the kmalloc bucket demo");

struct slub_demo_obj {
	u32 magic;
	u32 ctor_serial;
	u32 last_allocation_round;
	u32 payload_len;
	char payload[96];
};

static struct kmem_cache *demo_cache;
static struct slub_demo_obj *demo_objs[SLUB_DEMO_MAX_ALLOCS];
static struct slub_demo_obj *reused_obj;
static void *kmalloc_buf;
static atomic_t ctor_calls = ATOMIC_INIT(0);

static void slub_demo_ctor(void *object)
{
	struct slub_demo_obj *obj = object;

	obj->magic = SLUB_DEMO_MAGIC;
	obj->ctor_serial = atomic_inc_return(&ctor_calls);
	obj->last_allocation_round = 0;
	obj->payload_len = sizeof(obj->payload);
	memset(obj->payload, 0x5a, sizeof(obj->payload));
}

static void dump_demo_obj(const char *tag, struct slub_demo_obj *obj)
{
	if (!obj) {
		pr_info("%s: object is NULL\n", tag);
		return;
	}

	pr_info("%s: obj=%px magic=0x%x ctor_serial=%u alloc_round=%u first_payload_byte=0x%x\n",
		tag, obj, obj->magic, obj->ctor_serial,
		obj->last_allocation_round, (unsigned char)obj->payload[0]);
}

static int __init slub_demo_init(void)
{
	unsigned int i;
	unsigned int free_index;
	struct slub_demo_obj *freed_obj;

	if (alloc_count == 0 || alloc_count > SLUB_DEMO_MAX_ALLOCS)
		return -EINVAL;

	pr_info("slub_demo: init alloc_count=%u kmalloc_request=%u\n",
		alloc_count, kmalloc_request);

    // 我们创建一个名为 slub_demo_cache 的 slab cache
	demo_cache = kmem_cache_create("slub_demo_cache",
				       sizeof(struct slub_demo_obj),
				       0, SLAB_HWCACHE_ALIGN,
				       slub_demo_ctor);
	if (!demo_cache)
		return -ENOMEM;

	pr_info("slub_demo: sizeof(object)=%zu kmem_cache_size=%u\n",
		sizeof(struct slub_demo_obj), kmem_cache_size(demo_cache));
	pr_info("slub_demo: ctor runs when SLUB prepares new objects, not on every alloc\n");

    // 这里开始分配，分配 alloc_count 个 objects
	for (i = 0; i < alloc_count; i++) {
		demo_objs[i] = kmem_cache_alloc(demo_cache, GFP_KERNEL);
		if (!demo_objs[i])
			return -ENOMEM;

		demo_objs[i]->last_allocation_round = i + 1;
		dump_demo_obj("custom-cache alloc", demo_objs[i]);
	}

	pr_info("slub_demo: ctor_calls after %u allocations = %d\n",
		alloc_count, atomic_read(&ctor_calls));

	free_index = alloc_count / 2;
	freed_obj = demo_objs[free_index];
	pr_info("slub_demo: freeing slot %u to observe freelist reuse\n", free_index);
	dump_demo_obj("custom-cache free target", freed_obj);

    // 这里释放我们上面分配的 object[free_index]
	kmem_cache_free(demo_cache, freed_obj);
	demo_objs[free_index] = NULL;

    // 这里再次分配一个 object, 这里测试的是会不会复用之前释放的 object
	reused_obj = kmem_cache_alloc(demo_cache, GFP_KERNEL);
	if (!reused_obj)
		return -ENOMEM;

	reused_obj->last_allocation_round = alloc_count + 1;
	dump_demo_obj("custom-cache re-allocation", reused_obj);
	if (freed_obj == reused_obj)
		pr_info("slub_demo: allocator returned the same address immediately after free\n");
	else
		pr_info("slub_demo: allocator returned a different address; same-cache reuse can still occur later\n");

	pr_info("slub_demo: ctor_calls after re-allocation = %d\n",
		atomic_read(&ctor_calls));

    // 这里尝试使用 kmalloc 进行分配
	kmalloc_buf = kmalloc(kmalloc_request, GFP_KERNEL);
	if (!kmalloc_buf)
		return -ENOMEM;

	memset(kmalloc_buf, 0x23, kmalloc_request);
	pr_info("slub_demo: kmalloc request=%u actual_bucket=%zu ptr=%px\n",
		kmalloc_request, ksize(kmalloc_buf), kmalloc_buf);
	pr_info("slub_demo: for small requests, kmalloc is usually serving from a SLUB size cache rather than directly from buddy\n");

	return 0;
}

static void __exit slub_demo_exit(void)
{
	unsigned int i;

	if (kmalloc_buf)
		kfree(kmalloc_buf);

	if (reused_obj)
		kmem_cache_free(demo_cache, reused_obj);

	for (i = 0; i < alloc_count; i++) {
		if (demo_objs[i])
			kmem_cache_free(demo_cache, demo_objs[i]);
	}

	if (demo_cache)
		kmem_cache_destroy(demo_cache);

	pr_info("slub_demo: exit ctor_calls=%d\n", atomic_read(&ctor_calls));
}

module_init(slub_demo_init);
module_exit(slub_demo_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Jvle");
MODULE_DESCRIPTION("Educational SLUB demo for kmem_cache, constructors, and object reuse");

makefile:

KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

obj-m := slub_demo.o

all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean

观察日志情况:

[89656.300782] slub_demo: loading out-of-tree module taints kernel.
[89656.300803] slub_demo: module verification failed: signature and/or required key missing - tainting kernel
[89656.305610] slub_demo: init alloc_count=3 kmalloc_request=130
[89656.305690] slub_demo: sizeof(object)=112 kmem_cache_size=112
[89656.305695] slub_demo: ctor runs when SLUB prepares new objects, not on every alloc
[89656.305706] custom-cache alloc: obj=ffff8b372d2d9080 magic=0x51ab51ab ctor_serial=1 alloc_round=1 first_payload_byte=0x5a
[89656.305716] custom-cache alloc: obj=ffff8b372d2d9c80 magic=0x51ab51ab ctor_serial=2 alloc_round=2 first_payload_byte=0x5a
[89656.305723] custom-cache alloc: obj=ffff8b372d2d9400 magic=0x51ab51ab ctor_serial=3 alloc_round=3 first_payload_byte=0x5a
[89656.305728] slub_demo: ctor_calls after 3 allocations = 32
[89656.305733] slub_demo: freeing slot 1 to observe freelist reuse
[89656.305736] custom-cache free target: obj=ffff8b372d2d9c80 magic=0x51ab51ab ctor_serial=2 alloc_round=2 first_payload_byte=0x5a
[89656.305743] custom-cache re-allocation: obj=ffff8b372d2d9c80 magic=0x51ab51ab ctor_serial=2 alloc_round=4 first_payload_byte=0x5a
[89656.305749] slub_demo: allocator returned the same address immediately after free
[89656.305752] slub_demo: ctor_calls after re-allocation = 32
[89656.305757] slub_demo: kmalloc request=130 actual_bucket=192 ptr=ffff8b3697fa06c0
[89656.305762] slub_demo: for small requests, kmalloc is usually serving from a SLUB size cache rather than directly from buddy

再次查看情况是否和想象的一样。

$ sudo cat /proc/slabinfo | grep slub_demo
slub_demo_cache       32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
$ sudo rmmod slub_demo.ko 
$ sudo cat /proc/slabinfo | grep slub_demo
$ sudo dmesg | tail -n 200
[89717.021448] slub_demo: exit ctor_calls=32

4.3 slub 详解函数

TODO

References

一篇还不错的 blog: https://richardweiyang-2.gitbook.io/kernel-exploring/nei-cun-guan-li/00-memory_a_bottom_up_view/02-memblock
对于 buddy system 有不错的解释: https://blog.csdn.net/yhb1047818384/article/details/114454299
slab 的详细思路: https://people.eecs.berkeley.edu/~kubitron/courses/cs194-24-S13/hand-outs/bonwick_slab.pdf
关于 slab 的详细参数: https://www.cnblogs.com/binlovetech/p/17288990.html