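Memory Management: The Memory Model

With some free time this summer, I decided to consolidate my understanding of the Linux kernel by keeping notes on what I learn, whether from reading documentation, reading code, or debugging code. Everything goes here, starting with Linux's memory management subsystem.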

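1 Physical Memory

1.1 Abstractions

Linux needs an architecture-independent abstraction to describe physical memory. Physical memory is divided into banks, each called a node, and Linux represents each node with a struct pglist_data (https://elixir.bootlin.com/linux/v7.0.10/source/include/linux/mmzone.h#L1381).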

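Machines fall into two memory architectures, NUMA and UMA. Which one applies is determined by the hardware; the operating system merely detects it.

The CPUs in a machine form a topology. NUMA introduces the notion of "distance", which describes the relative cost for a given processor to access a given region of memory (the actual cost is determined at the hardware level; I do not understand this part deeply yet, TODO).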
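
To make "distance" concrete, here is a minimal userspace sketch, not kernel code. It assumes a made-up two-node table in which local access costs 10 (the same value numactl reports for node 0 later in this post) and remote access costs 21. The function name nearest_node_with_memory and the table example_dist are hypothetical and exist only for this illustration.

```c
#include <assert.h>

#define MAX_NODES 2   /* hypothetical two-node machine */

/*
 * Return the node with memory that is cheapest to reach from `from`,
 * or -1 if no node has memory. dist[i][j] is the relative cost for a
 * CPU on node i to access memory on node j.
 */
static int nearest_node_with_memory(int from,
                                    const int dist[MAX_NODES][MAX_NODES],
                                    const int has_memory[MAX_NODES])
{
    int best = -1;

    assert(from >= 0 && from < MAX_NODES);
    for (int nid = 0; nid < MAX_NODES; nid++) {
        if (!has_memory[nid])
            continue;
        if (best < 0 || dist[from][nid] < dist[from][best])
            best = nid;
    }
    return best;
}

/* Made-up distances: 10 for the local node, 21 for the remote one. */
static const int example_dist[MAX_NODES][MAX_NODES] = {
    { 10, 21 },
    { 21, 10 },
};
```

Calling nearest_node_with_memory(1, example_dist, has_memory) with memory on both nodes picks node 1; if only node 0 has memory, the same call falls back to node 0 and pays the remote cost.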

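A node represents one bank of memory. More specifically, each node is subdivided into regions called zones, which capture the properties of the different ranges inside the node. Each zone is represented by a struct zone.

We can inspect the physical memory information of the local machine.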

$ lscpu | grep NUMA
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 15651 MB
node 0 free: 683 MB
node distances:
node 0
0: 10
$ cat /sys/devices/system/node/node*/meminfo
Node 0 MemTotal: 16027208 kB
Node 0 MemFree: 685932 kB
Node 0 MemUsed: 15341276 kB
Node 0 SwapCached: 0 kB
Node 0 Active: 7153124 kB
Node 0 Inactive: 4680456 kB
Node 0 Active(anon): 4219400 kB
Node 0 Inactive(anon): 0 kB
Node 0 Active(file): 2933724 kB
Node 0 Inactive(file): 4680456 kB
Node 0 Unevictable: 532076 kB
Node 0 Mlocked: 132 kB
Node 0 Dirty: 2396 kB
Node 0 Writeback: 0 kB
Node 0 FilePages: 8212204 kB
Node 0 Mapped: 1680792 kB
Node 0 AnonPages: 4146872 kB
Node 0 Shmem: 598024 kB
Node 0 KernelStack: 24048 kB
Node 0 PageTables: 68108 kB
Node 0 SecPageTables: 2928 kB
Node 0 NFS_Unstable: 0 kB
Node 0 Bounce: 0 kB
Node 0 WritebackTmp: 0 kB
Node 0 KReclaimable: 2229864 kB
Node 0 Slab: 2634832 kB
Node 0 SReclaimable: 2229864 kB
Node 0 SUnreclaim: 404968 kB
Node 0 AnonHugePages: 6144 kB
Node 0 ShmemHugePages: 0 kB
Node 0 ShmemPmdMapped: 0 kB
Node 0 FileHugePages: 0 kB
Node 0 FilePmdMapped: 0 kB
Node 0 Unaccepted: 0 kB
Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
Node 0 HugePages_Surp: 0

My machine has only one node, so we can look at the details for that node.

$ cat /proc/zoneinfo
Node 0, zone DMA
per-node stats
nr_inactive_anon 0
nr_active_anon 1054333
nr_inactive_file 1170182
nr_active_file 733431
nr_unevictable 134124
nr_slab_reclaimable 557477
nr_slab_unreclaimable 101234
nr_isolated_anon 0
nr_isolated_file 0
workingset_nodes 0
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
nr_anon_pages 1036294
nr_mapped 420232
nr_file_pages 2054200
nr_dirty 174
nr_writeback 0
nr_writeback_temp 0
nr_shmem 150583
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_file_hugepages 0
nr_file_pmdmapped 0
nr_anon_transparent_hugepages 3
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 560659
nr_written 543621
nr_throttled_written 0
nr_kernel_misc_reclaimable 0
nr_foll_pin_acquired 1581
nr_foll_pin_released 1581
nr_kernel_stack 24096
nr_page_table_pages 17031
nr_sec_page_table_pages 732
nr_iommu_pages 732
nr_swapcached 0
pgpromote_success 0
pgpromote_candidate 0
pgdemote_kswapd 0
pgdemote_direct 0
pgdemote_khugepaged 0
nr_hugetlb 0
pages free 3328
boost 0
min 16
low 20
high 24
promo 28
spanned 4095
present 3998
managed 3840
cma 0
protection: (0, 1633, 15636, 15636, 15636)
nr_free_pages 3328
nr_zone_inactive_anon 0
nr_zone_active_anon 0
nr_zone_inactive_file 0
nr_zone_active_file 0
nr_zone_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_bounce 0
nr_zspages 0
nr_free_cma 0
nr_unaccepted 0
numa_hit 1
numa_miss 0
numa_foreign 0
numa_interleave 1
numa_local 1
numa_other 0
pagesets
cpu: 0
count: 0
high: 5
batch: 1
high_min: 4
high_max: 60
vm stats threshold: 8
cpu: 1
count: 0
high: 0
batch: 1
high_min: 4
high_max: 60
vm stats threshold: 8
cpu: 2
count: 0
high: 0
batch: 1
high_min: 4
high_max: 60
vm stats threshold: 8
cpu: 3
count: 0
high: 0
batch: 1
high_min: 4
high_max: 60
vm stats threshold: 8
cpu: 4
count: 0
high: 0
batch: 1
high_min: 4
high_max: 60
vm stats threshold: 8
cpu: 5
count: 0
high: 0
batch: 1
high_min: 4
high_max: 60
vm stats threshold: 8
cpu: 6
count: 0
high: 0
batch: 1
high_min: 4
high_max: 60
vm stats threshold: 8
cpu: 7
count: 0
high: 0
batch: 1
high_min: 4
high_max: 60
vm stats threshold: 8
node_unreclaimable: 0
start_pfn: 1
Node 0, zone DMA32
pages free 138839
boost 0
min 1689
low 2111
high 2533
promo 2955
spanned 1044480
present 434615
managed 418068
cma 0
protection: (0, 0, 14003, 14003, 14003)
nr_free_pages 138839
nr_zone_inactive_anon 0
nr_zone_active_anon 221943
nr_zone_inactive_file 33938
nr_zone_active_file 1958
nr_zone_unevictable 15729
nr_zone_write_pending 22
nr_mlock 4
nr_bounce 0
nr_zspages 0
nr_free_cma 0
nr_unaccepted 0
numa_hit 4049687
numa_miss 0
numa_foreign 0
numa_interleave 1
numa_local 4049687
numa_other 0
pagesets
cpu: 0
count: 206
high: 263
batch: 63
high_min: 263
high_max: 6227
vm stats threshold: 40
cpu: 1
count: 74
high: 263
batch: 63
high_min: 263
high_max: 6227
vm stats threshold: 40
cpu: 2
count: 245
high: 263
batch: 63
high_min: 263
high_max: 6227
vm stats threshold: 40
cpu: 3
count: 283
high: 779
batch: 63
high_min: 263
high_max: 6227
vm stats threshold: 40
cpu: 4
count: 263
high: 263
batch: 63
high_min: 263
high_max: 6227
vm stats threshold: 40
cpu: 5
count: 210
high: 263
batch: 63
high_min: 263
high_max: 6227
vm stats threshold: 40
cpu: 6
count: 276
high: 326
batch: 63
high_min: 263
high_max: 6227
vm stats threshold: 40
cpu: 7
count: 175
high: 263
batch: 63
high_min: 263
high_max: 6227
vm stats threshold: 40
node_unreclaimable: 0
start_pfn: 4096
Node 0, zone Normal
pages free 32681
boost 6656
min 21845
low 25642
high 29439
promo 33236
spanned 3667968
present 3667968
managed 3584894
cma 0
protection: (0, 0, 0, 0, 0)
nr_free_pages 32681
nr_zone_inactive_anon 0
nr_zone_active_anon 832390
nr_zone_inactive_file 1136244
nr_zone_active_file 731473
nr_zone_unevictable 118395
nr_zone_write_pending 152
nr_mlock 29
nr_bounce 0
nr_zspages 0
nr_free_cma 0
nr_unaccepted 0
numa_hit 27017726
numa_miss 0
numa_foreign 0
numa_interleave 4775
numa_local 27017726
numa_other 0
pagesets
cpu: 0
count: 562
high: 42857
batch: 63
high_min: 2373
high_max: 55976
vm stats threshold: 64
cpu: 1
count: 2530
high: 8218
batch: 63
high_min: 2373
high_max: 55976
vm stats threshold: 64
cpu: 2
count: 369
high: 10685
batch: 63
high_min: 2373
high_max: 55976
vm stats threshold: 64
cpu: 3
count: 321
high: 42857
batch: 63
high_min: 2373
high_max: 55976
vm stats threshold: 64
cpu: 4
count: 2213
high: 6865
batch: 63
high_min: 2373
high_max: 55976
vm stats threshold: 64
cpu: 5
count: 1436
high: 46782
batch: 63
high_min: 2373
high_max: 55976
vm stats threshold: 64
cpu: 6
count: 5754
high: 51670
batch: 63
high_min: 2373
high_max: 55976
vm stats threshold: 64
cpu: 7
count: 2500
high: 11711
batch: 63
high_min: 2373
high_max: 55976
vm stats threshold: 64
node_unreclaimable: 0
start_pfn: 1048576
Node 0, zone Movable
pages free 0
boost 0
min 32
low 32
high 32
promo 32
spanned 0
present 0
managed 0
cma 0
protection: (0, 0, 0, 0, 0)
Node 0, zone Device
pages free 0
boost 0
min 0
low 0
high 0
promo 0
spanned 0
present 0
managed 0
cma 0
protection: (0, 0, 0, 0, 0)

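1.2 Memory Models

Linux has three memory models, but a kernel uses only one of them, and the choice is fixed when the kernel is built. They are:

  • FLATMEM: In the ideal case, physical memory is one contiguous address range, so the PFNs are contiguous as well. This model suits UMA machines. It uses a single global mem_map (https://elixir.bootlin.com/linux/v7.0.10/source/mm/mm_init.c#L45) array that maps the whole of physical memory.

    ARCH_PFN_OFFSET is the page frame number at which the array starts. On some architectures physical memory does not begin at page frame 0, and this offset accounts for that.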

    /* file: include/asm-generic/memory_model.h */

    #if defined(CONFIG_FLATMEM)
    #define __pfn_to_page(pfn) (mem_map + ((pfn)-ARCH_PFN_OFFSET))
    #define __page_to_pfn(page) ((unsigned long)((page)-mem_map) + ARCH_PFN_OFFSET)
    #endif
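  • DISCONTIGMEM: When memory is not contiguous, FLATMEM becomes awkward. The mem_map array above is indexed by PFN, so gaps in the address space leave large holes in the array. In addition, on NUMA each node has its own memory range, and tracking all of it through one global variable is a poor fit. (DISCONTIGMEM was removed from mainline in v5.14, so current kernels choose between FLATMEM and SPARSEMEM.)

    In this model the global mem_map is effectively moved into each node's structure as a field, and every node manages its own memory. Translating between a PFN and a page therefore takes one extra step: finding the node first.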

    /* file: include/linux/mmzone.h */

    typedef struct pglist_data {
    #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
        struct page *node_mem_map;
    #endif
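  • SPARSEMEM: The most general memory model, and the one that supports several advanced features. It divides memory into sections. Each section covers a contiguous address range and is represented by an entry in the mem_section (https://elixir.bootlin.com/linux/v7.0.10/source/mm/sparse.c#L27) array, and each section is managed on its own. struct mem_section has a section_mem_map field that points to a contiguous array of page objects. This model is an upgrade over DISCONTIGMEM.

  • ZONE_DEVICE: TODO

Here we focus on the first three models.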
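
The difference between a flat array and per-section arrays is easier to see in a toy model. The following is a userspace sketch, not kernel code: struct page is a stand-in, the sizes are tiny, and every name other than ARCH_PFN_OFFSET, mem_section and section_mem_map is hypothetical. The real kernel also stores section_mem_map in an encoded form, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

struct page { unsigned long flags; };   /* stand-in, not the kernel's struct page */

/* FLATMEM: one global array indexed by (pfn - ARCH_PFN_OFFSET). */
#define ARCH_PFN_OFFSET 16UL            /* hypothetical first valid PFN */
#define FLAT_NR_PAGES   64UL

static struct page flat_mem_map[FLAT_NR_PAGES];

static struct page *flat_pfn_to_page(unsigned long pfn)
{
    assert(pfn >= ARCH_PFN_OFFSET && pfn < ARCH_PFN_OFFSET + FLAT_NR_PAGES);
    return flat_mem_map + (pfn - ARCH_PFN_OFFSET);
}

static unsigned long flat_page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - flat_mem_map) + ARCH_PFN_OFFSET;
}

/* SPARSEMEM (without vmemmap): one page array per populated section. */
#define PFN_SECTION_SHIFT 4             /* 16 pages per section, far smaller than real sections */
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define NR_SECTIONS       8UL

struct mem_section { struct page *section_mem_map; };  /* NULL marks a hole */

static struct mem_section sections[NR_SECTIONS];
static struct page sec0_pages[PAGES_PER_SECTION];
static struct page sec5_pages[PAGES_PER_SECTION];

/* Populate sections 0 and 5 only; sections 1-4 and 6-7 stay holes. */
static void sparse_init_example(void)
{
    sections[0].section_mem_map = sec0_pages;
    sections[5].section_mem_map = sec5_pages;
}

static int sparse_pfn_valid(unsigned long pfn)
{
    unsigned long nr = pfn >> PFN_SECTION_SHIFT;

    return nr < NR_SECTIONS && sections[nr].section_mem_map != NULL;
}

static struct page *sparse_pfn_to_page(unsigned long pfn)
{
    assert(sparse_pfn_valid(pfn));
    return sections[pfn >> PFN_SECTION_SHIFT].section_mem_map
           + (pfn & (PAGES_PER_SECTION - 1));
}
```

In the flat model a hole would still cost array entries, whereas the sparse model spends nothing on sections 1 to 4: sparse_pfn_valid(16) is simply false, and PFN 85 resolves to entry 5 of section 5.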

2 NODE

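The kernel maintains a quick lookup table for nodes, used mainly to test node properties. The table is an array of nodemask_t (https://elixir.bootlin.com/linux/v7.0.10/source/mm/page_alloc.c#L224), with one mask per state:

  • N_POSSIBLE: The node could exist in principle, or could be brought online in the future. It does not mean the node is usable now, only that it is part of the system's possible topology.
  • N_ONLINE: The node is currently online and usable. This is the "can it take part in the running system right now" state.
  • N_NORMAL_MEMORY: The node has normal memory, that is, regular memory that is always directly usable.
  • N_HIGH_MEMORY: The node has normal memory or high memory. This mostly relates to the old 32-bit high memory model. Without CONFIG_HIGHMEM it is essentially the same as N_NORMAL_MEMORY.
  • N_MEMORY: The node has memory of any kind. This is the broadest state and covers:
    • normal memory
    • high memory
    • movable memory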
$ cat /sys/devices/system/node/possible
0
$ cat /sys/devices/system/node/online
0
$ cat /sys/devices/system/node/has_cpu
0
$ cat /sys/devices/system/node/has_memory
0
$ cat /sys/devices/system/node/has_normal_memory
0

These commands print the node masks; the 0 here means that node 0 is set in the mask.
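
On a machine with more nodes these files contain a list with ranges, such as 0-3 or 0-1,4. As a rough illustration, the userspace sketch below turns such a string into a bitmask. The function name parse_node_list is hypothetical, and the sketch assumes every node id fits into one unsigned long.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse a list such as "0", "0-3" or "0-1,4" into a bitmask in which
 * bit n stands for node n. Returns 0 on success and -1 on malformed
 * input or on node ids that do not fit into the mask.
 */
static int parse_node_list(const char *s, unsigned long *mask)
{
    const long max_bits = (long)(sizeof(*mask) * 8);

    assert(s != NULL && mask != NULL);
    *mask = 0;
    while (*s && *s != '\n') {
        char *end;
        long lo = strtol(s, &end, 10);
        long hi = lo;

        if (end == s)
            return -1;          /* no digits where a number was expected */
        if (*end == '-') {      /* range "lo-hi" */
            s = end + 1;
            hi = strtol(s, &end, 10);
            if (end == s)
                return -1;
        }
        if (lo < 0 || hi < lo || hi >= max_bits)
            return -1;
        for (long n = lo; n <= hi; n++)
            *mask |= 1UL << n;
        if (*end == ',')        /* skip the separator before the next item */
            end++;
        s = end;
    }
    return 0;
}
```

For the single-node output shown above, parsing "0" followed by a newline yields the mask 0x1.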

How does kernel code typically use nodes?

/* or for_each_node(nid) to visit every possible node */
for_each_online_node(nid) {
	pg_data_t *pgdat = NODE_DATA(nid);

	foo(pgdat);
}

This walks all online nodes, fetches each node's pg_data_t, and operates on it.

We can also go through node_states directly.

node_states[N_POSSIBLE]
node_states[N_ONLINE]
node_states[N_NORMAL_MEMORY]
node_states[N_HIGH_MEMORY]
node_states[N_MEMORY]
node_states[N_CPU]
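
To show how one mask per state can answer "is node n in state s" and drive an iteration, here is a small userspace model. It is only a sketch: the state names mirror the kernel's, but the types, the model_ prefixed helpers, and the eight-node limit are assumptions made for this example, and N_HIGH_MEMORY is left out for brevity.

```c
#include <assert.h>

/* State names mirror the kernel's; everything else is a userspace model. */
enum node_state_model {
    N_POSSIBLE,
    N_ONLINE,
    N_NORMAL_MEMORY,
    N_MEMORY,
    N_CPU,
    NR_NODE_STATES
};

#define MODEL_MAX_NODES 8

/* One bitmask per state; bit n set means node n is in that state. */
static unsigned long node_states_model[NR_NODE_STATES];

static void model_node_set_state(int nid, enum node_state_model state)
{
    assert(nid >= 0 && nid < MODEL_MAX_NODES);
    node_states_model[state] |= 1UL << nid;
}

static int model_node_state(int nid, enum node_state_model state)
{
    assert(nid >= 0 && nid < MODEL_MAX_NODES);
    return (int)((node_states_model[state] >> nid) & 1UL);
}

/* Return the first node >= start that is in `state`, or MODEL_MAX_NODES if none. */
static int model_next_node(int start, enum node_state_model state)
{
    for (int nid = start; nid < MODEL_MAX_NODES; nid++)
        if (model_node_state(nid, state))
            return nid;
    return MODEL_MAX_NODES;
}

/* Count the nodes in `state` by walking the mask from one set bit to the next. */
static int model_num_node_state(enum node_state_model state)
{
    int count = 0;

    for (int nid = model_next_node(0, state); nid < MODEL_MAX_NODES;
         nid = model_next_node(nid + 1, state))
        count++;
    return count;
}
```

The loop in model_num_node_state has the same shape as an iteration over online nodes: start at the first set bit, then jump to the next set bit until the mask is exhausted.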

3 ZONE

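The zone types are:

  1. ZONE_DMA: A small range of memory suitable for legacy or otherwise restricted DMA devices. Such devices cannot address all of physical memory, only a low address range, so the kernel sets this zone aside for them.
  2. ZONE_DMA32: Similar to ZONE_DMA, but it usually covers the memory addressable with 32-bit DMA. It is common on 64-bit platforms, where some devices run in a 64-bit system yet can only DMA to addresses below 4 GB.
  3. ZONE_NORMAL: The regular and most important memory region. The kernel can always access this memory directly, and many core memory management operations depend on it, so it is usually the most critical zone.
  4. ZONE_HIGHMEM: High memory. It appears mainly on some 32-bit architectures, where this part of physical memory has no permanent mapping in the kernel address space and the kernel must set up a temporary mapping to access it. Modern 64-bit systems generally do not need it.
  5. ZONE_MOVABLE: Movable memory. The contents of most pages here can be migrated between physical pages, which makes the zone useful for page migration, memory hotplug, and reducing fragmentation. It looks like normal memory but puts the emphasis on contents being relocatable.
  6. ZONE_DEVICE: Device memory. It describes memory that is not ordinary RAM but comes from a device, such as PMEM or GPU memory. Its purpose is not to treat such memory as fully equivalent to normal memory, but to let the kernel provide struct page and the related management facilities for these address ranges.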
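
As a quick check against the /proc/zoneinfo listing above, the sketch below classifies a PFN the way the zone boundaries fall on this x86-64 machine: ZONE_DMA below 16 MiB and ZONE_DMA32 below 4 GiB, which with 4 KiB pages correspond to the start_pfn values 4096 and 1048576 in that listing. This is a userspace illustration; the enum and function names are hypothetical, and other architectures draw these boundaries differently.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KiB pages, as on this machine */

enum zone_model { MODEL_ZONE_DMA, MODEL_ZONE_DMA32, MODEL_ZONE_NORMAL };

/* Classify a page frame number by the x86-64 zone boundaries. */
static enum zone_model zone_for_pfn(uint64_t pfn)
{
    const uint64_t dma_end_pfn   = (16ULL << 20) >> PAGE_SHIFT;  /* 16 MiB -> PFN 4096 */
    const uint64_t dma32_end_pfn = (4ULL << 30) >> PAGE_SHIFT;   /* 4 GiB -> PFN 1048576 */

    if (pfn < dma_end_pfn)
        return MODEL_ZONE_DMA;
    if (pfn < dma32_end_pfn)
        return MODEL_ZONE_DMA32;
    return MODEL_ZONE_NORMAL;
}
```

PFN 4095 is the last frame of ZONE_DMA, PFN 4096 is the first frame of ZONE_DMA32, and anything from PFN 1048576 upward lands in ZONE_NORMAL.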

4 Code Experiments

4.1 NODE and ZONE Example

The following kernel module inspects the memory layout of the local machine.

obj-m += pgdat_inspect.o

KDIR ?= /lib/modules/$(shell uname -r)/build
PWD := $(CURDIR)

all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean

#include <linux/atomic.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mmzone.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/nodemask.h>

static int target_nid = -1;
module_param(target_nid, int, 0444);
MODULE_PARM_DESC(target_nid, "Only dump this NUMA node id; -1 dumps all online nodes");

static bool show_empty_zones;
module_param(show_empty_zones, bool, 0444);
MODULE_PARM_DESC(show_empty_zones, "Also print zones with present_pages == 0");

static void dump_zone_lowmem_reserve(const struct zone *zone)
{
	int i;

	pr_info(" lowmem_reserve:");
	for (i = 0; i < MAX_NR_ZONES; i++)
		pr_cont(" %ld", zone->lowmem_reserve[i]);
	pr_cont("\n");
}

static void dump_zone(struct zone *zone)
{
	if (!show_empty_zones && !populated_zone(zone))
		return;

	pr_info(" zone[%lu] %s\n", zone_idx(zone), zone->name);
	pr_info(" start_pfn=%lu present=%lu managed=%ld spanned=%lu\n",
		zone->zone_start_pfn,
		zone->present_pages,
		atomic_long_read(&zone->managed_pages),
		zone->spanned_pages);
	pr_info(" watermarks: min=%lu low=%lu high=%lu promo=%lu boost=%lu\n",
		zone->_watermark[WMARK_MIN],
		zone->_watermark[WMARK_LOW],
		zone->_watermark[WMARK_HIGH],
		zone->_watermark[WMARK_PROMO],
		zone->watermark_boost);
	pr_info(" pageset: high_min=%d high_max=%d batch=%d\n",
		zone->pageset_high_min,
		zone->pageset_high_max,
		zone->pageset_batch);
	dump_zone_lowmem_reserve(zone);
#ifdef CONFIG_CMA
	pr_info(" cma_pages=%lu\n", zone->cma_pages);
#endif
	pr_info(" flags=0x%lx initialized=%d contiguous=%d\n",
		zone->flags, zone->initialized, zone->contiguous);
}

static void dump_fallback_zonelist(pg_data_t *pgdat)
{
	struct zoneref *zref;
	int entry = 0;

	pr_info(" fallback zonelist (node_zonelists[%d])\n", ZONELIST_FALLBACK);

	for (zref = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
	     zonelist_zone(zref);
	     zref++, entry++) {
		struct zone *zone = zonelist_zone(zref);

		pr_info(" [%d] node=%d zone=%s zone_idx=%d\n",
			entry,
			zonelist_node_idx(zref),
			zone->name,
			zonelist_zone_idx(zref));
	}
}

static void dump_pgdat(pg_data_t *pgdat)
{
	int zid;

	pr_info("pgdat for node %d\n", pgdat->node_id);
	pr_info(" nr_zones=%d node_start_pfn=%lu present_pages=%lu spanned_pages=%lu\n",
		pgdat->nr_zones,
		pgdat->node_start_pfn,
		pgdat->node_present_pages,
		pgdat->node_spanned_pages);
	pr_info(" totalreserve_pages=%lu kswapd_order=%d kswapd_highest_zoneidx=%d failures=%d\n",
		pgdat->totalreserve_pages,
		pgdat->kswapd_order,
		pgdat->kswapd_highest_zoneidx,
		pgdat->kswapd_failures);
#ifdef CONFIG_NUMA
	pr_info(" min_unmapped_pages=%lu min_slab_pages=%lu\n",
		pgdat->min_unmapped_pages,
		pgdat->min_slab_pages);
#endif
#ifdef CONFIG_FLATMEM
	pr_info(" node_mem_map=%px\n", pgdat->node_mem_map);
#ifdef CONFIG_PAGE_EXTENSION
	pr_info(" node_page_ext=%px\n", pgdat->node_page_ext);
#endif
#endif
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
	pr_info(" first_deferred_pfn=%lu\n", pgdat->first_deferred_pfn);
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	pr_info(" deferred_split_queue_len=%lu\n",
		pgdat->deferred_split_queue.split_queue_len);
#endif
	pr_info(" flags=0x%lx\n", pgdat->flags);

	for (zid = 0; zid < MAX_NR_ZONES; zid++)
		dump_zone(&pgdat->node_zones[zid]);

	dump_fallback_zonelist(pgdat);
}

static int __init pgdat_inspect_init(void)
{
	int nid;
	bool dumped = false;

	pr_info("pgdat_inspect: loading (target_nid=%d show_empty_zones=%d)\n",
		target_nid, show_empty_zones);

	for_each_online_node(nid) {
		if (target_nid >= 0 && nid != target_nid)
			continue;

		dump_pgdat(NODE_DATA(nid));
		dumped = true;
	}

	if (!dumped) {
		pr_warn("pgdat_inspect: no matching online node for target_nid=%d\n",
			target_nid);
		return -ENODEV;
	}

	return 0;
}

static void __exit pgdat_inspect_exit(void)
{
	pr_info("pgdat_inspect: unloading\n");
}

module_init(pgdat_inspect_init);
module_exit(pgdat_inspect_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("OpenAI");
MODULE_DESCRIPTION("Inspect pg_data_t and zone fields for learning Linux memory management");

Finally, load the module into the kernel and observe the result.

make
sudo insmod pgdat_inspect.ko
dmesg | tail -n 200
sudo rmmod pgdat_inspect

To look at only one node:

sudo insmod pgdat_inspect.ko target_nid=0

To print empty zones as well:

sudo insmod pgdat_inspect.ko show_empty_zones=1

With no parameters, the output is as follows.

For an explanation of the fields, see https://www.kernel.org/doc/html/latest/translations/zh_CN/mm/physical_memory.html#id4.

[112504.823310] pgdat_inspect: loading out-of-tree module taints kernel.
[112504.823316] pgdat_inspect: module verification failed: signature and/or required key missing - tainting kernel
[112504.825889] pgdat_inspect: loading (target_nid=-1 show_empty_zones=0)
[112504.825892] pgdat for node 0
[112504.825893] nr_zones=3 node_start_pfn=1 present_pages=4106581 spanned_pages=4716543
[112504.825894] totalreserve_pages=43159 kswapd_order=0 kswapd_highest_zoneidx=5 failures=0
[112504.825895] min_unmapped_pages=39848 min_slab_pages=199244
[112504.825896] deferred_split_queue_len=0
[112504.825897] flags=0x0
[112504.825897] zone[0] DMA
[112504.825898] start_pfn=1 present=3998 managed=3840 spanned=4095
[112504.825899] watermarks: min=16 low=20 high=24 promo=28 boost=0
[112504.825900] pageset: high_min=4 high_max=60 batch=1
[112504.825901] lowmem_reserve: 0 1633 15636 15636 15636
[112504.825905] flags=0x4 initialized=1 contiguous=1
[112504.825906] zone[1] DMA32
[112504.825907] start_pfn=4096 present=434615 managed=418068 spanned=1044480
[112504.825908] watermarks: min=1689 low=2111 high=2533 promo=2955 boost=0
[112504.825909] pageset: high_min=263 high_max=6227 batch=63
[112504.825910] lowmem_reserve: 0 0 14003 14003 14003
[112504.825913] flags=0x0 initialized=1 contiguous=0
[112504.825914] zone[2] Normal
[112504.825914] start_pfn=1048576 present=3667968 managed=3584894 spanned=3667968
[112504.825915] watermarks: min=15189 low=18986 high=22783 promo=26580 boost=0
[112504.825916] pageset: high_min=2373 high_max=55976 batch=63
[112504.825917] lowmem_reserve: 0 0 0 0 0
[112504.825920] flags=0x0 initialized=1 contiguous=1
[112504.825921] fallback zonelist (node_zonelists[0])
[112504.825922] [0] node=0 zone=Normal zone_idx=2
[112504.825923] [1] node=0 zone=DMA32 zone_idx=1
[112504.825924] [2] node=0 zone=DMA zone_idx=0
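
The fallback order and the lowmem_reserve values work together. The sketch below is a heavily simplified userspace model of an order-0 watermark check: a zone is acceptable only if its free pages exceed its low watermark plus the reserve it keeps against requests that could have used a higher zone. The free, low, and protection numbers are copied from the /proc/zoneinfo listing earlier in this post. The struct, the function names, and the choice of the low watermark are assumptions of this sketch, and the real allocator considers much more.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ZONES 3   /* DMA, DMA32, Normal on this machine */

struct zone_model {
    const char *name;
    long free;                      /* free pages */
    long wmark_low;                 /* low watermark */
    long lowmem_reserve[NR_ZONES];  /* indexed by the request's highest usable zone */
};

/* Simplified order-0 check: free pages must exceed watermark plus reserve. */
static int zone_ok(const struct zone_model *z, int highest_zoneidx)
{
    return z->free > z->wmark_low + z->lowmem_reserve[highest_zoneidx];
}

/* Try zones from highest_zoneidx downwards, as the fallback zonelist does. */
static const struct zone_model *pick_zone(const struct zone_model zones[NR_ZONES],
                                          int highest_zoneidx)
{
    assert(highest_zoneidx >= 0 && highest_zoneidx < NR_ZONES);
    for (int i = highest_zoneidx; i >= 0; i--)
        if (zone_ok(&zones[i], highest_zoneidx))
            return &zones[i];
    return NULL;
}

/* Values taken from the /proc/zoneinfo output shown earlier. */
static const struct zone_model example_zones[NR_ZONES] = {
    { "DMA",    3328,   20,    { 0, 1633, 15636 } },
    { "DMA32",  138839, 2111,  { 0, 0, 14003 } },
    { "Normal", 32681,  25642, { 0, 0, 0 } },
};
```

With these numbers a request that may use Normal memory is served from Normal. If Normal ran low it would fall back to DMA32, but never to DMA in this snapshot, because DMA holds back 15636 pages from such requests and has only 3328 free. A request restricted to DMA sees a reserve of 0 and succeeds.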

4.2 Memory Model Example

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/gfp.h>

static unsigned long target_pfn = ~0UL;
module_param(target_pfn, ulong, 0444);
MODULE_PARM_DESC(target_pfn, "Optional PFN to inspect with pfn_to_page");

static unsigned int probe_count = 4;
module_param(probe_count, uint, 0444);
MODULE_PARM_DESC(probe_count, "How many PFNs to probe starting from the allocated page");

static struct page *exp_page;

static const char *memmodel_name(void)
{
#if defined(CONFIG_FLATMEM)
	return "FLATMEM";
#elif defined(CONFIG_SPARSEMEM)
	return "SPARSEMEM";
#elif defined(CONFIG_DISCONTIGMEM)
	return "DISCONTIGMEM";
#else
	return "UNKNOWN";
#endif
}

static void dump_page_info(const char *tag, struct page *page)
{
	unsigned long pfn;
	phys_addr_t phys;
	struct zone *zone;

	if (!page) {
		pr_info("%s: page is NULL\n", tag);
		return;
	}

	pfn = page_to_pfn(page);
	phys = page_to_phys(page);
	zone = page_zone(page);

	pr_info(
		"%s: page=%px pfn=%lu phys=%pa nid=%d zone=%s refcnt=%d\n",
		tag, page, pfn, &phys, page_to_nid(page),
		zone ? zone->name : "unknown", page_count(page));
}

static void probe_single_pfn(unsigned long pfn)
{
	struct page *page;
	phys_addr_t phys;

	if (!pfn_valid(pfn)) {
		pr_info("probe: pfn=%lu invalid on this machine\n", pfn);
		return;
	}

	page = pfn_to_page(pfn);
	phys = PFN_PHYS(pfn);

	pr_info("probe: pfn=%lu valid page=%px phys=%pa round_trip_pfn=%lu\n",
		pfn, page, &phys, page_to_pfn(page));
}

static int __init memory_model_init(void)
{
	unsigned long pfn;
	unsigned int i;

	pr_info("memory_model: init\n");
	pr_info("memory_model: model=%s sizeof(struct page)=%zu\n",
		memmodel_name(), sizeof(struct page));

#ifdef CONFIG_SPARSEMEM_VMEMMAP
	pr_info("memory_model: CONFIG_SPARSEMEM_VMEMMAP enabled\n");
#endif

	exp_page = alloc_page(GFP_KERNEL);
	if (!exp_page)
		return -ENOMEM;

	dump_page_info("allocated", exp_page);

	pfn = page_to_pfn(exp_page);
	pr_info("round-trip: pfn_to_page(page_to_pfn(page)) == page ? %d\n",
		pfn_to_page(pfn) == exp_page);

	for (i = 0; i < probe_count; i++)
		probe_single_pfn(pfn + i);

	if (target_pfn != ~0UL) {
		pr_info("memory_model: probing target_pfn=%lu\n", target_pfn);
		probe_single_pfn(target_pfn);
	}

	return 0;
}

static void __exit memory_model_exit(void)
{
	if (exp_page)
		__free_page(exp_page);

	pr_info("memory_model: exit\n");
}

module_init(memory_model_init);
module_exit(memory_model_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("OpenAI Codex");
MODULE_DESCRIPTION("Minimal experiment for Linux memory model and PFN/page mapping");

KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

obj-m := memory_model.o

all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean

Output.

To explain the fields: pfn is the page frame number, page is the address of the struct page, and phys is the corresponding physical address.

[195192.121075] memory_model: init
[195192.121085] memory_model: model=SPARSEMEM sizeof(struct page)=64
[195192.121092] memory_model: CONFIG_SPARSEMEM_VMEMMAP enabled
[195192.121097] allocated: page=fffff13305ca0900 pfn=1517604 phys=0x0000000172824000 nid=0 zone=Normal refcnt=1
[195192.121107] round-trip: pfn_to_page(page_to_pfn(page)) == page ? 1
[195192.121111] probe: pfn=1517604 valid page=fffff13305ca0900 phys=0x0000000172824000 round_trip_pfn=1517604
[195192.121117] probe: pfn=1517605 valid page=fffff13305ca0940 phys=0x0000000172825000 round_trip_pfn=1517605
[195192.121122] probe: pfn=1517606 valid page=fffff13305ca0980 phys=0x0000000172826000 round_trip_pfn=1517606
[195192.121126] probe: pfn=1517607 valid page=fffff13305ca09c0 phys=0x0000000172827000 round_trip_pfn=1517607
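
The numbers in this log can be checked by hand. With 4 KiB pages the physical address is the PFN shifted left by 12, and with CONFIG_SPARSEMEM_VMEMMAP the struct page for a PFN sits at a fixed base plus pfn * sizeof(struct page), which is why consecutive PFNs above are exactly 0x40 (64) bytes apart. The userspace sketch below redoes that arithmetic. The function names are hypothetical, and the base address is merely derived from the one sample in this log, so it will differ on other boots and machines.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT       12      /* 4 KiB pages */
#define STRUCT_PAGE_SIZE 64ULL   /* sizeof(struct page) reported in the log */

/* Physical address of the first byte of a page frame. */
static uint64_t pfn_to_phys(uint64_t pfn)
{
    return pfn << PAGE_SHIFT;
}

/* Recover the base of the struct page array from one (page, pfn) sample. */
static uint64_t vmemmap_base_from_sample(uint64_t page_addr, uint64_t pfn)
{
    assert(page_addr >= pfn * STRUCT_PAGE_SIZE);
    return page_addr - pfn * STRUCT_PAGE_SIZE;
}

/* Address of the struct page for a PFN, given the array base. */
static uint64_t vmemmap_pfn_to_page(uint64_t base, uint64_t pfn)
{
    return base + pfn * STRUCT_PAGE_SIZE;
}
```

Feeding in the first sample (page=fffff13305ca0900, pfn=1517604) gives a base of 0xfffff13300000000, and applying that base to PFN 1517605 reproduces the fffff13305ca0940 printed on the next line of the log.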

References

  1. Overview: https://www.kernel.org/doc/html/latest/mm/index.html

  2. NUMA (Wikipedia): https://en.wikipedia.org/wiki/Non-uniform_memory_access

  3. Memory management: https://www.kernel.org/doc/html/latest/admin-guide/mm/index.html

  4. Memory model documentation: https://docs.kernel.org/mm/memory-model.html

  5. A blog post that explains the sparse memory model very well: https://www.cnblogs.com/LoyenWang/p/11523678.html