perrynzhou

Focused on systems component development


Basic Introduction

System Call Example

#include <stdio.h>
#include <time.h>

int main(void) {
    time_t tt;
    struct tm *t;

    //the time() library call is the convenient way to do this
    tt = time(NULL);

    //replace the time system call with inline assembly (32-bit x86 ABI)
    asm volatile (
        //clear ebx: the first argument, tloc = NULL
        "mov $0,%%ebx\n\t"
        //the syscall number is passed in eax; time(2) is number 13 on 32-bit x86
        "mov $0xd,%%eax\n\t"
        //int 0x80 traps into the kernel and performs the system call
        "int $0x80\n\t"
        //save the result from eax into the variable tt
        "mov %%eax,%0\n\t"
        : "=m" (tt)
        : /* no inputs */
        : "eax", "ebx"   /* tell the compiler which registers we clobber */
    );

    t = localtime(&tt);
    fprintf(stdout, "%s", asctime(t));
    return 0;
}
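
The example above uses the 32-bit int $0x80 ABI. As a point of comparison, a minimal sketch of the same trick on x86-64 (assuming a Linux x86-64 toolchain) is shown below; there the syscall number for time(2) is 201, the argument goes in %rdi, and the kernel is entered with the syscall instruction, which clobbers %rcx and %r11.

#include <stdio.h>
#include <time.h>

int main(void) {
    time_t tt;

    /* x86-64 variant: __NR_time is 201; first argument in rdi,
     * return value in rax; syscall clobbers rcx and r11 */
    asm volatile (
        "mov $201,%%rax\n\t"
        "xor %%rdi,%%rdi\n\t"   /* tloc = NULL */
        "syscall\n\t"
        "mov %%rax,%0\n\t"
        : "=r" (tt)
        :
        : "rax", "rdi", "rcx", "r11", "memory"
    );

    struct tm *t = localtime(&tt);
    fprintf(stdout, "%s", asctime(t));
    return 0;
}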

Fault Symptoms

  • The glusterfs client cannot connect to one node's brick; RPC connection failures keep recurring
  • The kernel reports hung_task_timeout_secs timeouts

Fault Analysis

Kernel-level analysis
The kernel reported a SYN flood, caused by clients repeatedly retrying connections after being disconnected. Later the call trace contains the hung_task_timeout_secs keyword, which means asynchronous flushing of I/O to disk did not complete within hung_task_timeout_secs, so the kernel reports this error. The fix is to tune vm.dirty_ratio and vm.dirty_background_ratio, which control the thresholds at which dirty pages are flushed to disk; if the backend disks are very slow and flushing exceeds the kernel's default timeout, this call trace appears. Suggested settings: vm.dirty_ratio=5 and vm.dirty_background_ratio=5 (a minimal sketch of applying them follows).
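
A minimal sketch of applying these two settings programmatically, assuming the standard /proc/sys paths (equivalent to `sysctl -w vm.dirty_ratio=5 vm.dirty_background_ratio=5`, or persistent entries in /etc/sysctl.conf):

#include <stdio.h>

/* Write a value into a /proc/sys tunable; returns 0 on success. */
static int write_sysctl(const char *path, const char *value) {
    FILE *fp = fopen(path, "w");
    if (!fp) {
        perror(path);
        return -1;
    }
    int rc = (fprintf(fp, "%s\n", value) > 0) ? 0 : -1;
    fclose(fp);
    return rc;
}

int main(void) {
    /* values suggested in the analysis above; must run as root */
    write_sysctl("/proc/sys/vm/dirty_ratio", "5");
    write_sysctl("/proc/sys/vm/dirty_background_ratio", "5");
    return 0;
}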

Client log analysis
The logs repeatedly report that the brick corresponding to 0-speech_vol_v2-client-55 cannot establish an RPC connection (telnet to the host and port shows no network problem). Judging from the logs, this is likely caused by the SYN flood: the client cannot connect to this glusterfsd.

Server log analysis
The server-side logs also clearly show read/write warnings.

Resolution

  • Set vm.dirty_ratio=5 and vm.dirty_background_ratio=10 in the kernel (this only silences the kernel call trace)
  • Restart the affected node's glusterd process, or reboot the node (to address the SYN flood problem)

glusterfs version

  • Server version
[root@ai-storage-prd-141-165-65-141.v-bj-4.vivo.lan:/root]
# gluster --version
glusterfs 7.2
  • Client version
[root@storage-center-prd-10-194-8-133.v-bj-4.vivo.lan:/var/log/glusterfs]
# rpm -qa|grep gluster
glusterfs-7.2-1.el7.x86_64
glusterfs-fuse-7.2-1.el7.x86_64
glusterfs-libs-7.2-1.el7.x86_64

cluster volume

  • volume info

    # gluster volume info  speech_vol_v2     

    Volume Name: speech_vol_v2
    Type: Distributed-Replicate
    Volume ID: e1b28e82-0d0b-4c87-bb6a-45306ec4a568
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 24 x 3 = 72
    Transport-type: tcp
    Bricks:
    Brick1: 141.165.65.141:/data1/brick
    Brick2: 141.165.65.142:/data1/brick
    Brick3: 141.165.65.144:/data1/brick
    Brick4: 141.165.65.141:/data2/brick
    Brick5: 141.165.65.142:/data2/brick
    Brick6: 141.165.65.144:/data2/brick
    Brick7: 141.165.65.141:/data3/brick
    Brick8: 141.165.65.142:/data3/brick
    Brick9: 141.165.65.144:/data3/brick
    Brick10: 141.165.65.141:/data4/brick
    Brick11: 141.165.65.142:/data4/brick
    Brick12: 141.165.65.144:/data4/brick
    Brick13: 141.165.65.141:/data5/brick
    Brick14: 141.165.65.142:/data5/brick
    Brick15: 141.165.65.144:/data5/brick
    Brick16: 141.165.65.141:/data6/brick
    Brick17: 141.165.65.142:/data6/brick
    Brick18: 141.165.65.144:/data6/brick
    Brick19: 141.165.65.141:/data7/brick
    Brick20: 141.165.65.142:/data7/brick
    Brick21: 141.165.65.144:/data7/brick
    Brick22: 141.165.65.141:/data8/brick
    Brick23: 141.165.65.142:/data8/brick
    Brick24: 141.165.65.144:/data8/brick
    Brick25: 141.165.65.141:/data9/brick
    Brick26: 141.165.65.142:/data9/brick
    Brick27: 141.165.65.144:/data9/brick
    Brick28: 141.165.65.141:/data10/brick
    Brick29: 141.165.65.142:/data10/brick
    Brick30: 141.165.65.144:/data10/brick
    Brick31: 141.165.65.141:/data11/brick
    Brick32: 141.165.65.142:/data11/brick
    Brick33: 141.165.65.144:/data11/brick
    Brick34: 141.165.65.141:/data12/brick
    Brick35: 141.165.65.142:/data12/brick
    Brick36: 141.165.65.144:/data12/brick
    Brick37: 141.165.181.146:/data1/brick
    Brick38: 141.165.181.147:/data1/brick
    Brick39: 141.165.181.148:/data1/brick
    Brick40: 141.165.181.146:/data2/brick
    Brick41: 141.165.181.147:/data2/brick
    Brick42: 141.165.181.148:/data2/brick
    Brick43: 141.165.181.146:/data3/brick
    Brick44: 141.165.181.147:/data3/brick
    Brick45: 141.165.181.148:/data3/brick
    Brick46: 141.165.181.146:/data4/brick
    Brick47: 141.165.181.147:/data4/brick
    Brick48: 141.165.181.148:/data4/brick
    Brick49: 141.165.181.146:/data5/brick
    Brick50: 141.165.181.147:/data5/brick
    Brick51: 141.165.181.148:/data5/brick
    Brick52: 141.165.181.146:/data6/brick
    Brick53: 141.165.181.147:/data6/brick
    Brick54: 141.165.181.148:/data6/brick
    Brick55: 141.165.181.146:/data7/brick
    Brick56: 141.165.181.147:/data7/brick
    Brick57: 141.165.181.148:/data7/brick
    Brick58: 141.165.181.146:/data8/brick
    Brick59: 141.165.181.147:/data8/brick
    Brick60: 141.165.181.148:/data8/brick
    Brick61: 141.165.181.146:/data9/brick
    Brick62: 141.165.181.147:/data9/brick
    Brick63: 141.165.181.148:/data9/brick
    Brick64: 141.165.181.146:/data10/brick
    Brick65: 141.165.181.147:/data10/brick
    Brick66: 141.165.181.148:/data10/brick
    Brick67: 141.165.181.146:/data11/brick
    Brick68: 141.165.181.147:/data11/brick
    Brick69: 141.165.181.148:/data11/brick
    Brick70: 141.165.181.146:/data12/brick
    Brick71: 141.165.181.147:/data12/brick
    Brick72: 141.165.181.148:/data12/brick
    Options Reconfigured:
    network.ping-timeout: 120
    cluster.read-hash-mode: 3
    storage.fips-mode-rchecksum: on
    nfs.disable: on
  • volume status

# gluster volume status   speech_vol_v2     
Status of volume: speech_vol_v2
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 141.165.65.141:/data1/brick 49152 0 Y 16432
Brick 141.165.65.142:/data1/brick 49152 0 Y 16355
Brick 141.165.65.144:/data1/brick 49152 0 Y 32212
Brick 141.165.65.141:/data2/brick 49153 0 Y 20613
Brick 141.165.65.142:/data2/brick 49153 0 Y 16376
Brick 141.165.65.144:/data2/brick 49153 0 Y 32232
Brick 141.165.65.141:/data3/brick 49154 0 Y 20627
Brick 141.165.65.142:/data3/brick 49154 0 Y 16402
Brick 141.165.65.144:/data3/brick 49154 0 Y 32252
Brick 141.165.65.141:/data4/brick 49155 0 Y 10020
Brick 141.165.65.142:/data4/brick 49155 0 Y 16433
Brick 141.165.65.144:/data4/brick 49155 0 Y 32272
Brick 141.165.65.141:/data5/brick 49156 0 Y 10026
Brick 141.165.65.142:/data5/brick 49156 0 Y 16457
Brick 141.165.65.144:/data5/brick 49156 0 Y 32292
Brick 141.165.65.141:/data6/brick 49157 0 Y 20620
Brick 141.165.65.142:/data6/brick 49157 0 Y 16485
Brick 141.165.65.144:/data6/brick 49157 0 Y 32312
Brick 141.165.65.141:/data7/brick 49158 0 Y 10055
Brick 141.165.65.142:/data7/brick 49158 0 Y 16505
Brick 141.165.65.144:/data7/brick 49158 0 Y 32332
Brick 141.165.65.141:/data8/brick 49159 0 Y 15551
Brick 141.165.65.142:/data8/brick 49159 0 Y 16530
Brick 141.165.65.144:/data8/brick 49159 0 Y 32352
Brick 141.165.65.141:/data9/brick 49160 0 Y 10075
Brick 141.165.65.142:/data9/brick 49160 0 Y 16552
Brick 141.165.65.144:/data9/brick 49160 0 Y 32372
Brick 141.165.65.141:/data10/brick 49161 0 Y 10086
Brick 141.165.65.142:/data10/brick 49161 0 Y 16573
Brick 141.165.65.144:/data10/brick 49161 0 Y 32394
Brick 141.165.65.141:/data11/brick 49162 0 Y 10099
Brick 141.165.65.142:/data11/brick 49162 0 Y 16595
Brick 141.165.65.144:/data11/brick 49162 0 Y 32415
Brick 141.165.65.141:/data12/brick 49163 0 Y 10106
Brick 141.165.65.142:/data12/brick 49163 0 Y 16616
Brick 141.165.65.144:/data12/brick 49163 0 Y 32435
Brick 141.165.181.146:/data1/brick 49152 0 Y 177222
Brick 141.165.181.147:/data1/brick 49164 0 Y 215532
Brick 141.165.181.148:/data1/brick 49152 0 Y 182183
Brick 141.165.181.146:/data2/brick 49153 0 Y 177242
Brick 141.165.181.147:/data2/brick 49165 0 Y 215541
Brick 141.165.181.148:/data2/brick 49153 0 Y 182203
Brick 141.165.181.146:/data3/brick 49154 0 Y 177263
Brick 141.165.181.147:/data3/brick 49166 0 Y 215552
Brick 141.165.181.148:/data3/brick 49154 0 Y 182224
Brick 141.165.181.146:/data4/brick 49155 0 Y 177283
Brick 141.165.181.147:/data4/brick 49167 0 Y 215563
Brick 141.165.181.148:/data4/brick 49155 0 Y 182244
Brick 141.165.181.146:/data5/brick 49156 0 Y 177303
Brick 141.165.181.147:/data5/brick 49168 0 Y 215574
Brick 141.165.181.148:/data5/brick 49156 0 Y 182264
Brick 141.165.181.146:/data6/brick 49157 0 Y 177323
Brick 141.165.181.147:/data6/brick 49169 0 Y 215585
Brick 141.165.181.148:/data6/brick 49157 0 Y 182284
Brick 141.165.181.146:/data7/brick 49158 0 Y 177343
Brick 141.165.181.147:/data7/brick 49158 0 Y 176773
Brick 141.165.181.148:/data7/brick 49158 0 Y 182304
Brick 141.165.181.146:/data8/brick 49159 0 Y 177363
Brick 141.165.181.147:/data8/brick 49170 0 Y 215596
Brick 141.165.181.148:/data8/brick 49159 0 Y 182324
Brick 141.165.181.146:/data9/brick 49160 0 Y 177383
Brick 141.165.181.147:/data9/brick 49171 0 Y 215607
Brick 141.165.181.148:/data9/brick 49160 0 Y 182346
Brick 141.165.181.146:/data10/brick 49161 0 Y 177403
Brick 141.165.181.147:/data10/brick 49172 0 Y 215618
Brick 141.165.181.148:/data10/brick 49161 0 Y 182366
Brick 141.165.181.146:/data11/brick 49162 0 Y 177423
Brick 141.165.181.147:/data11/brick 49173 0 Y 215629
Brick 141.165.181.148:/data11/brick 49162 0 Y 182386
Brick 141.165.181.146:/data12/brick 49163 0 Y 177443
Brick 141.165.181.147:/data12/brick 49174 0 Y 215639
Brick 141.165.181.148:/data12/brick 49163 0 Y 182406

Information about the brick the client cannot connect to

//this information appears in the "Final graph:" section of /var/log/glusterfs/mnt-{mount name}.log
volume speech_vol_v2-client-55
type protocol/client
option ping-timeout 120
option remote-host 141.165.181.147
option remote-subvolume /data7/brick
option transport-type socket
option transport.socket.ssl-enabled off
option transport.tcp-user-timeout 0
option transport.socket.keepalive-time 20
option transport.socket.keepalive-interval 2
option transport.socket.keepalive-count 9
option send-gids true
end-volume

Client Mount Debug Logs

OS kernel log (/var/log/messages)
May 26 12:20:18 storage-center-prd-141-165-181-147 kernel: [4036004.156726] TCP: request_sock_TCP: Possible SYN flooding on port 24007. Sending cookies.  Check SNMP counters.
May 26 16:52:13 storage-center-prd-141-165-181-147 systemd[1]: Starting Cleanup of Temporary Directories...
May 26 16:52:13 storage-center-prd-141-165-181-147 systemd[1]: Started Cleanup of Temporary Directories.
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.524551] INFO: task kswapd0:266 blocked for more than 120 seconds.
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.531180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539187] kswapd0 D 0000000007496801 0 266 2 0x00000000
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539195] ffff881ffcd3b8f0 0000000000000046 ffff881ffeab9fa0 ffff881ffcd3bfd8
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539198] ffff881ffcd3bfd8 ffff881ffcd3bfd8 ffff881ffeab9fa0 ffff8804e82b4860
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539200] 7fffffffffffffff ffff8804e82b4858 ffff881ffeab9fa0 0000000007496801
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539202] Call Trace:
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539216] [<ffffffff816a94c9>] schedule+0x29/0x70
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539228] [<ffffffff816a6fd9>] schedule_timeout+0x239/0x2c0
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539235] [<ffffffff812fab24>] ? blk_finish_plug+0x14/0x40
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539271] [<ffffffffc01f9354>] ? _xfs_buf_ioapply+0x334/0x460 [xfs]
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539273] [<ffffffff816a987d>] wait_for_completion+0xfd/0x140
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539277] [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539292] [<ffffffffc01fb434>] ? xfs_bwrite+0x24/0x60 [xfs]
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539305] [<ffffffffc01fb026>] xfs_buf_submit_wait+0xe6/0x1d0 [xfs]
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539316] [<ffffffffc01fb434>] xfs_bwrite+0x24/0x60 [xfs]
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539330] [<ffffffffc0203b71>] xfs_reclaim_inode+0x331/0x360 [xfs]
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539346] [<ffffffffc0203e07>] xfs_reclaim_inodes_ag+0x267/0x390 [xfs]
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539348] [<ffffffff810c4593>] ? try_to_wake_up+0x183/0x340
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539351] [<ffffffff8121d67a>] ? evict+0x10a/0x180
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539353] [<ffffffff810c4765>] ? wake_up_process+0x15/0x20
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539366] [<ffffffffc0204df3>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539380] [<ffffffffc02146f5>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539383] [<ffffffff81203888>] prune_super+0xe8/0x170
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539387] [<ffffffff81195413>] shrink_slab+0x163/0x330
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539390] [<ffffffff811f7537>] ? vmpressure+0x87/0x90
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539392] [<ffffffff81199081>] balance_pgdat+0x4b1/0x5e0
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539394] [<ffffffff81199323>] kswapd+0x173/0x440
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539398] [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539401] [<ffffffff811991b0>] ? balance_pgdat+0x5e0/0x5e0
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539402] [<ffffffff810b098f>] kthread+0xcf/0xe0
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539404] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539408] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
May 27 00:30:14 storage-center-prd-141-165-181-147 kernel: [4079697.539409] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
May 27 00:32:14 storage-center-prd-141-165-181-147 kernel: [4079817.259262] INFO: task kswapd0:266 blocked for more than 120 seconds.
Client mount debug log details
[2020-05-27 03:47:34.303569] D [logging.c:1718:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2020-05-27 03:47:52.040397] D [socket.c:3053:socket_event_handler] 0-transport: EPOLLERR - disconnecting (sock:7) (non-SSL)
[2020-05-27 03:47:52.040442] D [MSGID: 0] [client.c:2334:client_rpc_notify] 0-speech_vol_v2-client-55: got RPC_CLNT_DISCONNECT
[2020-05-27 03:47:52.040639] D [rpc-clnt-ping.c:96:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7fb22321e8ea] (--> /lib64/libgfrpc.so.0(+0x12eeb)[0x7fb222fc9eeb] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x92)[0x7fb222fc5922] (--> /lib64/libgfrpc.so.0(+0xf4e8)[0x7fb222fc64e8] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fb222fc2a93] ))))) 0-: 141.165.181.147:24007: ping timer event already removed
[2020-05-27 03:47:55.040804] D [name.c:171:client_fill_address_family] 0-speech_vol_v2-client-55: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: 141.165.181.147)
[2020-05-27 03:47:55.043219] D [MSGID: 0] [common-utils.c:532:gf_resolve_ip6] 0-resolver: returning ip-141.165.181.147 (port-24007) for hostname: 141.165.181.147 and port: 24007
[2020-05-27 03:47:55.043244] D [socket.c:3384:socket_fix_ssl_opts] 0-speech_vol_v2-client-55: disabling SSL for portmapper connection
[2020-05-27 03:47:55.043408] D [MSGID: 0] [client.c:2323:client_rpc_notify] 0-speech_vol_v2-client-55: got RPC_CLNT_CONNECT
[2020-05-27 03:47:55.043611] D [rpc-clnt-ping.c:96:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7fb22321e8ea] (--> /lib64/libgfrpc.so.0(+0x12eeb)[0x7fb222fc9eeb] (--> /lib64/libgfrpc.so.0(+0x13691)[0x7fb222fca691] (--> /lib64/libgfrpc.so.0(rpc_clnt_submit+0x3b4)[0x7fb222fc6914] (--> /usr/lib64/glusterfs/7.2/xlator/protocol/client.so(+0x135b2)[0x7fb2146675b2] ))))) 0-: 141.165.181.147:24007: ping timer event already removed
[2020-05-27 03:47:55.043662] D [MSGID: 0] [client-handshake.c:1399:server_has_portmap] 0-speech_vol_v2-client-55: detected portmapper on server
[2020-05-27 03:47:55.043783] D [rpc-clnt-ping.c:195:rpc_clnt_ping_cbk] 0-speech_vol_v2-client-55: Ping latency is 0ms
[2020-05-27 03:47:55.043812] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-speech_vol_v2-client-55: changing port to 49158 (from 0)
[2020-05-27 03:47:55.043844] D [socket.c:3053:socket_event_handler] 0-transport: EPOLLERR - disconnecting (sock:6) (non-SSL)
[2020-05-27 03:47:55.043855] D [MSGID: 0] [client.c:2394:client_rpc_notify] 0-speech_vol_v2-client-55: disconnected (skipped notify)
[2020-05-27 03:47:55.043876] D [name.c:171:client_fill_address_family] 0-speech_vol_v2-client-55: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: 141.165.181.147)
[2020-05-27 03:49:34.303646] D [logging.c:1756:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present
[2020-05-27 03:47:55.043854] D [MSGID: 0] [client.c:2334:client_rpc_notify] 0-speech_vol_v2-client-55: got RPC_CLNT_DISCONNECT
[2020-05-27 03:47:55.046202] D [MSGID: 0] [common-utils.c:532:gf_resolve_ip6] 0-resolver: returning ip-141.165.181.147 (port-24007) for hostname: 141.165.181.147 and port: 24007
[2020-05-27 03:49:34.303898] D [logging.c:1718:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2020-05-27 03:50:02.344385] D [socket.c:3053:socket_event_handler] 0-transport: EPOLLERR - disconnecting (sock:7) (non-SSL)
[2020-05-27 03:50:02.344418] D [MSGID: 0] [client.c:2334:client_rpc_notify] 0-speech_vol_v2-client-55: got RPC_CLNT_DISCONNECT
[2020-05-27 03:50:02.344582] D [rpc-clnt-ping.c:96:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7fb22321e8ea] (--> /lib64/libgfrpc.so.0(+0x12eeb)[0x7fb222fc9eeb] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x92)[0x7fb222fc5922] (--> /lib64/libgfrpc.so.0(+0xf4e8)[0x7fb222fc64e8] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fb222fc2a93] ))))) 0-: 141.165.181.147:24007: ping timer event already removed
[2020-05-27 03:50:05.344677] D [name.c:171:client_fill_address_family] 0-speech_vol_v2-client-55: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: 141.165.181.147)
[2020-05-27 03:50:05.347099] D [MSGID: 0] [common-utils.c:532:gf_resolve_ip6] 0-resolver: returning ip-141.165.181.147 (port-24007) for hostname: 141.165.181.147 and port: 24007
[2020-05-27 03:50:05.347126] D [socket.c:3384:socket_fix_ssl_opts] 0-speech_vol_v2-client-55: disabling SSL for portmapper connection
[2020-05-27 03:50:05.347336] D [MSGID: 0] [client.c:2323:client_rpc_notify] 0-speech_vol_v2-client-55: got RPC_CLNT_CONNECT
[2020-05-27 03:50:05.347542] D [rpc-clnt-ping.c:96:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7fb22321e8ea] (--> /lib64/libgfrpc.so.0(+0x12eeb)[0x7fb222fc9eeb] (--> /lib64/libgfrpc.so.0(+0x13691)[0x7fb222fca691] (--> /lib64/libgfrpc.so.0(rpc_clnt_submit+0x3b4)[0x7fb222fc6914] (--> /usr/lib64/glusterfs/7.2/xlator/protocol/client.so(+0x135b2)[0x7fb2146675b2] ))))) 0-: 141.165.181.147:24007: ping timer event already removed
[2020-05-27 03:50:05.347591] D [MSGID: 0] [client-handshake.c:1399:server_has_portmap] 0-speech_vol_v2-client-55: detected portmapper on server
[2020-05-27 03:50:05.347633] D [rpc-clnt-ping.c:195:rpc_clnt_ping_cbk] 0-speech_vol_v2-client-55: Ping latency is 0ms
[2020-05-27 03:50:05.347701] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-speech_vol_v2-client-55: changing port to 49158 (from 0)
[2020-05-27 03:50:05.347735] D [socket.c:3053:socket_event_handler] 0-transport: EPOLLERR - disconnecting (sock:6) (non-SSL)
[2020-05-27 03:50:05.347749] D [MSGID: 0] [client.c:2394:client_rpc_notify] 0-speech_vol_v2-client-55: disconnected (skipped notify)
[2020-05-27 03:50:05.347765] D [name.c:171:client_fill_address_family] 0-speech_vol_v2-client-55: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: 141.165.181.147)
[2020-05-27 03:51:34.303977] D [logging.c:1756:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present
[2020-05-27 03:50:05.347748] D [MSGID: 0] [client.c:2334:client_rpc_notify] 0-speech_vol_v2-client-55: got RPC_CLNT_DISCONNECT
[2020-05-27 03:50:05.350156] D [MSGID: 0] [common-utils.c:532:gf_resolve_ip6] 0-resolver: returning ip-141.165.181.147 (port-24007) for hostname: 141.165.181.147 and port: 24007
[2020-05-27 03:51:34.304035] D [logging.c:1718:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2020-05-27 03:52:12.648388] D [socket.c:3053:socket_event_handler] 0-transport: EPOLLERR - disconnecting (sock:7) (non-SSL)
[2020-05-27 03:52:12.648424] D [MSGID: 0] [client.c:2334:client_rpc_notify] 0-speech_vol_v2-client-55: got RPC_CLNT_DISCONNECT
[2020-05-27 03:52:12.648591] D [rpc-clnt-ping.c:96:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7fb22321e8ea] (--> /lib64/libgfrpc.so.0(+0x12eeb)[0x7fb222fc9eeb] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x92)[0x7fb222fc5922] (--> /lib64/libgfrpc.so.0(+0xf4e8)[0x7fb222fc64e8] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fb222fc2a93] ))))) 0-: 141.165.181.147:24007: ping timer event already removed
[2020-05-27 03:52:15.648683] D [name.c:171:client_fill_address_family] 0-speech_vol_v2-client-55: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: 141.165.181.147)
[2020-05-27 03:52:15.651027] D [MSGID: 0] [common-utils.c:532:gf_resolve_ip6] 0-resolver: returning ip-141.165.181.147 (port-24007) for hostname: 141.165.181.147 and port: 24007
[2020-05-27 03:52:15.651052] D [socket.c:3384:socket_fix_ssl_opts] 0-speech_vol_v2-client-55: disabling SSL for portmapper connection
[2020-05-27 03:52:15.651271] D [MSGID: 0] [client.c:2323:client_rpc_notify] 0-speech_vol_v2-client-55: got RPC_CLNT_CONNECT
[2020-05-27 03:52:15.651501] D [rpc-clnt-ping.c:96:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7fb22321e8ea] (--> /lib64/libgfrpc.so.0(+0x12eeb)[0x7fb222fc9eeb] (--> /lib64/libgfrpc.so.0(+0x13691)[0x7fb222fca691] (--> /lib64/libgfrpc.so.0(rpc_clnt_submit+0x3b4)[0x7fb222fc6914] (--> /usr/lib64/glusterfs/7.2/xlator/protocol/client.so(+0x135b2)[0x7fb2146675b2] ))))) 0-: 141.165.181.147:24007: ping timer event already removed
[2020-05-27 03:52:15.651557] D [MSGID: 0] [client-handshake.c:1399:server_has_portmap] 0-speech_vol_v2-client-55: detected portmapper on server
[2020-05-27 03:52:15.651693] D [rpc-clnt-ping.c:195:rpc_clnt_ping_cbk] 0-speech_vol_v2-client-55: Ping latency is 0ms
[2020-05-27 03:52:15.651734] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-speech_vol_v2-client-55: changing port to 49158 (from 0)
[2020-05-27 03:52:15.651758] D [socket.c:3053:socket_event_handler] 0-transport: EPOLLERR - disconnecting (sock:6) (non-SSL)
[2020-05-27 03:52:15.651770] D [MSGID: 0] [client.c:2394:client_rpc_notify] 0-speech_vol_v2-client-55: disconnected (skipped notify)
[2020-05-27 03:52:15.651785] D [name.c:171:client_fill_address_family] 0-speech_vol_v2-client-55: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: 141.165.181.147)
[2020-05-27 03:53:34.304123] D [logging.c:1756:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present
[2020-05-27 03:52:15.651768] D [MSGID: 0] [client.c:2334:client_rpc_notify] 0-speech_vol_v2-client-55: got RPC_CLNT_DISCONNECT
[2020-05-27 03:52:15.654154] D [MSGID: 0] [common-utils.c:532:gf_resolve_ip6] 0-resolver: returning ip-141.165.181.147 (port-24007) for hostname: 141.165.181.147 and port: 24007
[2020-05-27 03:53:34.304179] D [logging.c:1718:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
Server-side logs for 0-speech_vol_v2-client-55 (the /data7/brick on node .147)
[2020-05-27 01:02:22.421765] W [socket.c:774:__socket_rwv] 0-tcp.speech_vol_v2-server: readv on 141.165.32.25:47645 failed (No data available)
[2020-05-27 01:02:22.421795] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-speech_vol_v2-server: disconnecting connection from CTX_ID:de438bc8-1b52-446c-817f-60d646af0ba6-GRAPH_ID:2-PID:84971-HOST:ai-vtraining-prd-141-165-32-25.v-bj-4.vivo.lan-PC_NAME:speech_vol_v2-client-55-RECON_NO:-0
[2020-05-27 01:02:26.918776] W [socket.c:774:__socket_rwv] 0-tcp.speech_vol_v2-server: readv on 141.165.32.25:47562 failed (No data available)
[2020-05-27 01:02:26.918807] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-speech_vol_v2-server: disconnecting connection from CTX_ID:36965041-2046-4413-88a4-08bc06fc1de4-GRAPH_ID:2-PID:87591-HOST:ai-vtraining-prd-141-165-32-25.v-bj-4.vivo.lan-PC_NAME:speech_vol_v2-client-55-RECON_NO:-0
[2020-05-27 01:03:13.270262] W [socket.c:774:__socket_rwv] 0-tcp.speech_vol_v2-server: readv on 141.165.86.129:48212 failed (No data available)
[2020-05-27 01:03:13.270295] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-speech_vol_v2-server: disconnecting connection from CTX_ID:2dafdf38-ff96-45cc-99c2-1238764b59ae-GRAPH_ID:0-PID:75668-HOST:ai-vtraining-prd-141-165-86-129.v-bj-4.vivo.lan-PC_NAME:speech_vol_v2-client-55-RECON_NO:-0
[2020-05-27 01:03:13.285400] W [socket.c:774:__socket_rwv] 0-tcp.speech_vol_v2-server: readv on 141.165.86.129:48162 failed (No data available)
[2020-05-27 01:03:13.285422] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-speech_vol_v2-server: disconnecting connection from CTX_ID:4251ffc2-e1c5-4357-a5d2-7c92e8221fad-GRAPH_ID:0-PID:75625-HOST:ai-vtraining-prd-141-165-86-129.v-bj-4.vivo.lan-PC_NAME:speech_vol_v2-client-55-RECON_NO:-0
[2020-05-27 01:36:05.410946] W [socket.c:774:__socket_rwv] 0-tcp.speech_vol_v2-server: readv on 10.196.20.133:47621 failed (No data available)
[2020-05-27 01:36:05.410974] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-speech_vol_v2-server: disconnecting connection from CTX_ID:57fbf79a-bae3-4bcd-8670-70d559956cac-GRAPH_ID:0-PID:15841-HOST:ai-vtraining-gpu-prd-10-196-20-133.v-bj-4.vivo.lan-PC_NAME:speech_vol_v2-client-55-RECON_NO:-0
[2020-05-27 01:36:06.056913] W [socket.c:774:__socket_rwv] 0-tcp.speech_vol_v2-server: readv on 10.196.20.133:47331 failed (No data available)
[2020-05-27 01:36:06.056941] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-speech_vol_v2-server: disconnecting connection from CTX_ID:b68c4c8e-e8d8-4d44-97a3-a503791d6177-GRAPH_ID:0-PID:16094-HOST:ai-vtraining-gpu-prd-10-196-20-133.v-bj-4.vivo.lan-PC_NAME:speech_vol_v2-client-55-RECON_NO:-0
[2020-05-27 02:10:34.719707] W [socket.c:774:__socket_rwv] 0-tcp.speech_vol_v2-server: readv on 141.165.30.45:48603 failed (No data available)
[2020-05-27 02:10:34.719738] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-speech_vol_v2-server: disconnecting connection from CTX_ID:0ed18a02-3363-4cba-9313-93f7b221ede4-GRAPH_ID:10-PID:6869-HOST:ai-vtraining-prd-141-165-30-45.v-bj-4.vivo.lan-PC_NAME:speech_vol_v2-client-55-RECON_NO:-0
[2020-05-27 02:13:50.989206] I [addr.c:54:compare_addr_and_update] 0-/data7/brick: allowed = "*", received addr = "141.165.30.45"
[2020-05-27 02:13:50.989234] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-speech_vol_v2-server: accepted client from CTX_ID:a327610d-0acf-4785-a912-17e87c126ba9-GRAPH_ID:0-PID:67045-HOST:ai-vtraining-prd-141-165-30-45.v-bj-4.vivo.lan-PC_NAME:speech_vol_v2-client-55-RECON_NO:-0 (version: 6.5) with subvol /data7/brick
[2020-05-27 02:17:13.967294] W [socket.c:774:__socket_rwv] 0-tcp.speech_vol_v2-server: readv on 141.165.30.45:48606 failed (No data available)
[2020-05-27 02:39:55.104419] W [glusterfsd.c:1596:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dd5) [0x7f9d087f1dd5] -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x55f6769bb625] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55f6769bb48b] ) 0-: received signum (15), shutting down
[2020-05-27 02:39:55.104502] W [socket.c:774:__socket_rwv] 0-glusterfs: writev on 141.165.181.147:24007 failed (Broken pipe)

Glusterfs Fundamentals

Glusterfs basic principles
Glusterfs is a FUSE-based distributed storage system that supports three layouts: distributed, 3-replica, and EC (erasure coding). Glusterfs uses a stacked architecture; both server and client are built from translators.
In GlusterFS terminology, the complete functional stack made up of a series of translators is called a volume; a local filesystem assigned to a volume is called a brick; and a brick that has been processed by at least one translator is called a subvolume. The client loads translators according to the volume type, and the server does the same. At mount time the client (glusterfs) is given a node IP address, talks to the management daemon on that node to fetch the brick metadata and the configuration the client needs to load, and initializes its xlators from that configuration; subsequent I/O passes through each xlator's fop functions in graph order and is then exchanged directly with the corresponding glusterfsd processes. glusterfsd works the same way: it initializes the server-side xlators from the server configuration file, runs each xlator's fops, and finally performs the actual system I/O calls. The node's management service (glusterd) loads only a management xlator and handles requests from glusterfs/gluster; it does not handle I/O. (A conceptual sketch of the stacked translator chain follows.)
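
The following minimal, self-contained C sketch is only a conceptual model of that chain (none of these types or functions are GlusterFS APIs): each layer's fop does its own work and then "winds" the call down to its child, with the bottom layer standing in for the protocol/client that talks to glusterfsd.

#include <stdio.h>

/* Conceptual model of a translator (xlator) stack: each layer exposes a
 * fop that may do work and then forwards the call to its child. */
struct xlator {
    const char *name;
    struct xlator *child;
    void (*writev_fop)(struct xlator *this, const char *data);
};

static void passthrough_writev(struct xlator *this, const char *data) {
    printf("[%s] writev fop\n", this->name);
    if (this->child)
        this->child->writev_fop(this->child, data);   /* wind to child */
    else
        printf("[%s] performing the real write: %s\n", this->name, data);
}

int main(void) {
    /* a tiny client-side-like graph: io-cache -> write-behind -> protocol/client */
    struct xlator client   = { "protocol/client", NULL,    passthrough_writev };
    struct xlator wb       = { "write-behind",    &client, passthrough_writev };
    struct xlator io_cache = { "io-cache",        &wb,     passthrough_writev };

    io_cache.writev_fop(&io_cache, "hello");
    return 0;
}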

Glusterfs Environment

1. Glusterfs version

glusterfs 7.5

2. volume info

Volume Name: rep3_vol
Type: Replicate
Volume ID: 73360fc2-e105-4fd4-9b92-b5fa333ba75d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.193.189.153:/debug/glusterfs/rep3_vol/brick
Brick2: 10.193.189.154:/debug/glusterfs/rep3_vol/brick
Brick3: 10.193.189.155:/debug/glusterfs/rep3_vol/brick
Options Reconfigured:
diagnostics.brick-log-level: INFO
performance.client-io-threads: off
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
diagnostics.client-log-level: DEBUG

Mount Commands

1. Mount via the mount command

mount -t glusterfs -o acl 10.193.189.153:/rep3_vol /mnt/rep3_vol2

2. Mount directly with the glusterfs binary

/usr/local/sbin/glusterfs --acl --process-name fuse --volfile-server=10.193.189.153 --volfile-id=rep3_vol /mnt/rep3_vol

The Complete Glusterfs Mount Flow

Glusterfs interaction architecture

(figure: glusterfs-init)

Glusterfs Client Implementation Analysis
  • 1. The main() entry point is in glusterfsd.c

    int main(int argc, char *argv[])
    {
    create_fuse_mount(ctx);
    glusterfs_volumes_init(ctx);
    }
  • 2. create_fuse_mount initializes the mount/fuse module; concretely it loads /usr/local/lib/glusterfs/2020.05.12/xlator/mount/fuse.so and executes the init method inside fuse.so (a rough dlopen sketch appears at the end of this walkthrough)

    int create_fuse_mount(glusterfs_ctx_t *ctx)
    {
    xlator_set_type(master, "mount/fuse")
    xlator_init(master);
    }
  • 3. After the glusterfs fuse module is loaded, glusterfs_volumes_init runs; on the client this function mainly initializes the xlators the client needs for operating on the volume

    int glusterfs_volumes_init(glusterfs_ctx_t *ctx)
    {

    glusterfs_mgmt_init(ctx);
    glusterfs_process_volfp(ctx, fp);
    }
  • 4. Since the volfile_server argument we pass is the IP of one node, the code reaches glusterfs_mgmt_init, which fetches the spec once the connection is established

    int glusterfs_mgmt_init(glusterfs_ctx_t *ctx)
    {
    rpc_clnt_register_notify(rpc, mgmt_rpc_notify, THIS);
    rpcclnt_cbk_program_register(rpc, &mgmt_cbk_prog, THIS);
    ctx->notify = glusterfs_mgmt_notify;
    ret = rpc_clnt_start(rpc);
    }
    static int mgmt_rpc_notify(struct rpc_clnt *rpc, void *mydata, rpc_clnt_event_t event,
    void *data)
    {
    switch (event) {
    case RPC_CLNT_DISCONNECT:
    //
    case RPC_CLNT_CONNECT:
    ret = glusterfs_volfile_fetch(ctx);
    }
    //the fetched spec is shown in the appendix below
  • 5. glusterfs_volfile_fetch calls glusterfs_volfile_fetch_one to fetch the spec
int glusterfs_volfile_fetch(glusterfs_ctx_t *ctx)
{
return glusterfs_volfile_fetch_one(ctx, ctx->cmd_args.volfile_id);
}
  • 6. glusterfs_volfile_fetch_one is the function that actually fetches the client's brick metadata and translator list; mgmt_getspec_cbk is the callback that processes this information once it arrives
    static int glusterfs_volfile_fetch_one(glusterfs_ctx_t *ctx, char *volfile_id)
    {
    ret = mgmt_submit_request(&req, frame, ctx, &clnt_handshake_prog,
    GF_HNDSK_GETSPEC, mgmt_getspec_cbk,
    (xdrproc_t)xdr_gf_getspec_req);
    }
    //mgmt_submit_request invokes the server-side handler registered in gluster_handshake_actors[GETSPEC]; once the spec is received, the mgmt_getspec_cbk callback is invoked.
    //With --volfile-server=10.193.189.154 in the gdb args, attach gdb to the glusterd process on that node and set a breakpoint on server_getspec; when the client sends its request, glusterd on 10.193.189.154 responds in server_getspec.
    rpcsvc_actor_t gluster_handshake_actors[GF_HNDSK_MAXVALUE] = {
    [GF_HNDSK_GETSPEC] = {"GETSPEC", GF_HNDSK_GETSPEC, server_getspec, NULL, 0, DRC_NA}
    };

    //re-fetch the spec and initialize the client translators: this function builds the xlator graph and initializes each xlator
    int
    mgmt_getspec_cbk(struct rpc_req *req, struct iovec *iov, int count,
    void *myframe)
    {
    glusterfs_volfile_reconfigure(tmpfp, ctx);
    glusterfs_process_volfp(ctx, tmpfp);
    }
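
Before moving on to debugging, here is the dlopen sketch referenced in step 2. It is only a simplified illustration: the real xlator_set_type/xlator_init path in GlusterFS resolves and validates far more than a single symbol, and the path below is simply the one quoted above.

/* build: gcc load_xlator.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* path taken from step 2 above; adjust to your installation */
    const char *path =
        "/usr/local/lib/glusterfs/2020.05.12/xlator/mount/fuse.so";

    void *handle = dlopen(path, RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* a translator module exposes an init entry point; we only probe for
     * the symbol here instead of calling it, since the real init expects
     * a fully constructed xlator object */
    void *init_sym = dlsym(handle, "init");
    printf("init symbol %s\n", init_sym ? "found" : "not found");

    dlclose(handle);
    return 0;
}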

Debugging Method

1. Notes on debugging
glusterfs, glusterd, and glusterfsd are the same binary; the only difference is that glusterfsd and glusterfs are symlinks of glusterd. The single binary forks or takes different code paths depending on its role to provide the glusterfs/glusterfsd/glusterd functionality (a small sketch of this invoked-name dispatch pattern follows).
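
As a hedged illustration of this single-binary pattern (not the actual GlusterFS startup code), a program can dispatch on the name it was invoked under:

#include <stdio.h>
#include <string.h>
#include <libgen.h>

int main(int argc, char *argv[]) {
    (void)argc;
    /* basename of argv[0] tells us which symlink was used to invoke us */
    const char *prog = basename(argv[0]);

    if (strcmp(prog, "glusterd") == 0)
        printf("running as the management daemon\n");
    else if (strcmp(prog, "glusterfsd") == 0)
        printf("running as a brick server process\n");
    else
        printf("running as the fuse client (glusterfs)\n");

    return 0;
}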

2. Detailed debugging steps

  • 1. Run gdb /usr/local/sbin/glusterfs
  • 2. In the gdb interactive prompt, set the arguments as follows
    gdb /usr/local/sbin/glusterfs
    set args --acl --process-name fuse --volfile-server=10.193.189.153 --volfile-id=rep3_vol /mnt/rep3_vol
    (gdb) set print pretty on
    (gdb) br main
    (gdb) br create_fuse_mount
  • 3. When the breakpoint at create_fuse_mount is hit, this function loads the mount/fuse translator via xlator_init.
    (gdb) 
    Detaching after fork from child process 37741.
    Breakpoint 3, create_fuse_mount (ctx=0x63e010) at glusterfsd.c:719
    //intermediate output omitted
    (gdb)
    Detaching after fork from child process 39472.
    770 if (ret) {
    }
  • 4. After create_fuse_mount has finished, set gdb's fork-following mode so the debugger does not stay stuck in the parent process; it is the child process that talks to the node whose IP was given in the arguments and fetches the brick information and the translators the client needs to load
    Detaching after fork from child process 39472.
    770 if (ret) {//}
    (gdb) set follow-fork-mode child
    (gdb) set detach-on-fork off
    (gdb)
    main (argc=7, argv=0x7fffffffe368) at glusterfsd.c:2875
    2875 if (ret)
    (gdb)
    2878 ret = daemonize(ctx);
    (gdb)
    [New process 39570]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    [New Thread 0x7fffeee4d700 (LWP 39573)]
    [New Thread 0x7fffee64c700 (LWP 39574)]
    [Switching to Thread 0x7ffff7fe74c0 (LWP 39570)]
    main (argc=7, argv=0x7fffffffe368) at glusterfsd.c:2879
    2879 if (ret)
    Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.3.x86_64 libuuid-2.23.2-59.el7.x86_64 openssl-libs-1.0.2k-16.el7.x86_64 zlib-1.2.7-18.el7.x86_64
    (gdb)
    2887 mem_pools_init();
    (gdb)

    (gdb) set print elements 0
    (gdb) show print elements
  • 5. In the child process, set breakpoints on the functions that implement the client-side logic; over RPC they pull the brick metadata and the translator list the client needs to load from the server-side glusterd
    (gdb) br mgmt_getspec_cbk
    (gdb) br glusterfs_volfile_fetch_one
    (gdb) br glusterfs_volfile_fetch
    (gdb) br glusterfs_mgmt_init
    (gdb) br glusterfs_volumes_init
    (gdb) br glusterfs_process_volfp
    (gdb) br glusterfs_graph_construct

Appendix

volume rep3_vol-client-0
type protocol/client
option send-gids true
option transport.socket.keepalive-count 9
option transport.socket.keepalive-interval 2
option transport.socket.keepalive-time 20
option transport.tcp-user-timeout 0
option transport.socket.ssl-enabled off
option password 7e9a1877-0837-4563-b73d-aa4cde754c91
option username bdb0a45d-e70d-445d-8fe6-76118dfdb738
option transport.address-family inet
option transport-type tcp
option remote-subvolume /debug/glusterfs/rep3_vol/brick
option remote-host 10.193.189.153
option ping-timeout 42
end-volume

volume rep3_vol-client-1
type protocol/client
option send-gids true
option transport.socket.keepalive-count 9
option transport.socket.keepalive-interval 2
option transport.socket.keepalive-time 20
option transport.tcp-user-timeout 0
option transport.socket.ssl-enabled off
option password 7e9a1877-0837-4563-b73d-aa4cde754c91
option username bdb0a45d-e70d-445d-8fe6-76118dfdb738
option transport.address-family inet
option transport-type tcp
option remote-subvolume /debug/glusterfs/rep3_vol/brick
option remote-host 10.193.189.154
option ping-timeout 42
end-volume

volume rep3_vol-client-2
type protocol/client
option send-gids true
option transport.socket.keepalive-count 9
option transport.socket.keepalive-interval 2
option transport.socket.keepalive-time 20
option transport.tcp-user-timeout 0
option transport.socket.ssl-enabled off
option password 7e9a1877-0837-4563-b73d-aa4cde754c91
option username bdb0a45d-e70d-445d-8fe6-76118dfdb738
option transport.address-family inet
option transport-type tcp
option remote-subvolume /debug/glusterfs/rep3_vol/brick
option remote-host 10.193.189.155
option ping-timeout 42
end-volume

volume rep3_vol-replicate-0
type cluster/replicate
option use-compound-fops off
option afr-pending-xattr rep3_vol-client-0,rep3_vol-client-1,rep3_vol-client-2
subvolumes rep3_vol-client-0 rep3_vol-client-1 rep3_vol-client-2
end-volume

volume rep3_vol-dht
type cluster/distribute
option force-migration off
option lock-migration off
subvolumes rep3_vol-replicate-0
end-volume

volume rep3_vol-utime
type features/utime
option noatime on
subvolumes rep3_vol-dht
end-volume

volume rep3_vol-write-behind
type performance/write-behind
subvolumes rep3_vol-utime
end-volume

volume rep3_vol-read-ahead
type performance/read-ahead
subvolumes rep3_vol-write-behind
end-volume

volume rep3_vol-readdir-ahead
type performance/readdir-ahead
option rda-cache-limit 10MB
option rda-request-size 131072
option parallel-readdir off
subvolumes rep3_vol-read-ahead
end-volume

volume rep3_vol-io-cache
type performance/io-cache
subvolumes rep3_vol-readdir-ahead
end-volume

volume rep3_vol-open-behind
type performance/open-behind
subvolumes rep3_vol-io-cache
end-volume

volume rep3_vol-quick-read
type performance/quick-read
subvolumes rep3_vol-open-behind
end-volume

volume rep3_vol-md-cache
type performance/md-cache
subvolumes rep3_vol-quick-read
end-volume

volume rep3_vol
type debug/io-stats
option global-threading off
option count-fop-hits off
option latency-measurement off
option threads 16
option log-level DEBUG
subvolumes rep3_vol-md-cache
end-volume

OpenCAS Memory Exhaustion Problem

1. Symptoms

  • OpenCAS was initialized with 3.5 TB SSDs on a host with 120 GB of memory; initialization succeeded, but after running for a while all of the memory was consumed.

2. OpenCAS version information

opencas version:19.9
Linux kernel: 3.10.0-957.el7.x86_64
CentOS Linux release 7.6.1810 (Core)

3. OpenCAS logs when the problem occurred

use command:  grep vmalloc /proc/vmallocinfo |grep cas_cache
0xffffae2a4dffe000-0xffffae2a4e000000 8192 ocf_metadata_hash_ctrl_init+0x23/0xe0 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4ebb2000-0xffffae2a4ebb4000 8192 ocf_metadata_hash_ctrl_init+0x23/0xe0 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4ebb4000-0xffffae2a4ebb6000 8192 _raw_ram_init+0x2d/0x70 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4ebb6000-0xffffae2a4ebb8000 8192 _raw_ram_init+0x2d/0x70 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4ebb8000-0xffffae2a4ebc5000 53248 _raw_ram_init+0x2d/0x70 [cas_cache] pages=12 vmalloc N0=12
0xffffae2a4ebc5000-0xffffae2a4ebc7000 8192 _raw_ram_init+0x2d/0x70 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4ebc7000-0xffffae2a4ebd1000 40960 raw_dynamic_init+0x38/0x90 [cas_cache] pages=9 vmalloc N0=9
0xffffae2a4ebd1000-0xffffae2a4ebd3000 8192 _cache_mngt_cache_priv_init+0x2e/0x60 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4ebf1000-0xffffae2a4ebf3000 8192 _ocf_mngt_attach_cache_device+0x23/0x1b0 [cas_cache] pages=1 vmalloc N1=1
0xffffae2a4ebf3000-0xffffae2a4ebf5000 8192 ocf_volume_init+0x99/0x100 [cas_cache] pages=1 vmalloc N1=1
0xffffae2a4ebfc000-0xffffae2a4ebfe000 8192 ocf_freelist_init+0x25/0x100 [cas_cache] pages=1 vmalloc N1=1
0xffffae2a4ebfe000-0xffffae2a4ec00000 8192 ocf_freelist_init+0x64/0x100 [cas_cache] pages=1 vmalloc N1=1
0xffffae2a4edda000-0xffffae2a4edfb000 135168 _raw_ram_init+0x2d/0x70 [cas_cache] pages=32 vmalloc N0=32
0xffffae2a4edfb000-0xffffae2a4edfd000 8192 ocf_freelist_init+0x74/0x100 [cas_cache] pages=1 vmalloc N1=1
0xffffae2a4edfe000-0xffffae2a4ee00000 8192 _raw_ram_init+0x2d/0x70 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4ef01000-0xffffae2a4ef7b000 499712 _raw_ram_init+0x2d/0x70 [cas_cache] pages=121 vmalloc N0=121
0xffffae2a4ef7b000-0xffffae2a4ef87000 49152 _raw_ram_init+0x2d/0x70 [cas_cache] pages=11 vmalloc N1=11
0xffffae2a4ef87000-0xffffae2a4ef8f000 32768 _raw_ram_init+0x2d/0x70 [cas_cache] pages=7 vmalloc N1=7
0xffffae2a4ef8f000-0xffffae2a4ef9b000 49152 _raw_ram_init+0x2d/0x70 [cas_cache] pages=11 vmalloc N1=11
0xffffae2a4ef9b000-0xffffae2a4efab000 65536 _raw_ram_init+0x2d/0x70 [cas_cache] pages=15 vmalloc N1=15
0xffffae2a4efab000-0xffffae2a4efb9000 57344 ocf_metadata_concurrency_attached_init+0x3c/0x190 [cas_cache] pages=13 vmalloc N1=13
0xffffae2a4efbe000-0xffffae2a4efc0000 8192 _raw_ram_init+0x2d/0x70 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4efc0000-0xffffae2a4efe1000 135168 _raw_ram_init+0x2d/0x70 [cas_cache] pages=32 vmalloc N0=32
0xffffae2a4efe1000-0xffffae2a4efee000 53248 _raw_ram_init+0x2d/0x70 [cas_cache] pages=12 vmalloc N0=12
0xffffae2a4efee000-0xffffae2a4eff0000 8192 _raw_ram_init+0x2d/0x70 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4eff0000-0xffffae2a4effa000 40960 raw_dynamic_init+0x38/0x90 [cas_cache] pages=9 vmalloc N0=9
0xffffae2a4effa000-0xffffae2a4effc000 8192 _cache_mngt_cache_priv_init+0x2e/0x60 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4f101000-0xffffae2a4f17b000 499712 _raw_ram_init+0x2d/0x70 [cas_cache] pages=121 vmalloc N0=121
0xffffae2a4f17b000-0xffffae2a4f17d000 8192 _ocf_mngt_attach_cache_device+0x23/0x1b0 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4f17d000-0xffffae2a4f17f000 8192 ocf_volume_init+0x99/0x100 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4f17f000-0xffffae2a4f18b000 49152 _raw_ram_init+0x2d/0x70 [cas_cache] pages=11 vmalloc N0=11
0xffffae2a4f192000-0xffffae2a4f19a000 32768 _raw_ram_init+0x2d/0x70 [cas_cache] pages=7 vmalloc N0=7
0xffffae2a4f19a000-0xffffae2a4f1a6000 49152 _raw_ram_init+0x2d/0x70 [cas_cache] pages=11 vmalloc N0=11
0xffffae2a4f1a6000-0xffffae2a4f1b6000 65536 _raw_ram_init+0x2d/0x70 [cas_cache] pages=15 vmalloc N0=15
0xffffae2a4f1b6000-0xffffae2a4f1c4000 57344 ocf_metadata_concurrency_attached_init+0x3c/0x190 [cas_cache] pages=13 vmalloc N0=13
0xffffae2a4f1c4000-0xffffae2a4f1c6000 8192 ocf_freelist_init+0x25/0x100 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4f1c6000-0xffffae2a4f1c8000 8192 ocf_freelist_init+0x64/0x100 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4f1c8000-0xffffae2a4f1ca000 8192 ocf_freelist_init+0x74/0x100 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4f1ca000-0xffffae2a4f1d4000 40960 ocf_cache_line_concurrency_init+0x48/0x190 [cas_cache] pages=9 vmalloc N0=9
0xffffae2a4f1d4000-0xffffae2a4f1d9000 20480 _ocf_realloc_with_cp+0x158/0x1b0 [cas_cache] pages=4 vmalloc N0=4
0xffffae2a4f1d9000-0xffffae2a4f1de000 20480 _ocf_realloc_with_cp+0x158/0x1b0 [cas_cache] pages=4 vmalloc N1=4
0xffffae2a4f1df000-0xffffae2a4f1e1000 8192 ocf_promotion_init+0x2e/0xc0 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4f1e5000-0xffffae2a4f1ef000 40960 ocf_cache_line_concurrency_init+0x48/0x190 [cas_cache] pages=9 vmalloc N1=9
0xffffae2a4f1f5000-0xffffae2a4f1f7000 8192 ocf_promotion_init+0x2e/0xc0 [cas_cache] pages=1 vmalloc N1=1
0xffffae2a4f301000-0xffffae2a4f34b000 303104 ocf_metadata_concurrency_attached_init+0x57/0x190 [cas_cache] pages=73 vmalloc N0=73
0xffffae2a4f34b000-0xffffae2a4f395000 303104 ocf_metadata_concurrency_attached_init+0x57/0x190 [cas_cache] pages=73 vmalloc N1=73
0xffffae2a4f503000-0xffffae2a4f505000 8192 _raw_dynamic_get_item.isra.10+0xbc/0x160 [cas_cache] pages=1 vmalloc N0=1
0xffffae2a4f5fa000-0xffffae2a4f5fc000 8192 _raw_dynamic_get_item.isra.10+0xbc/0x160 [cas_cache] pages=1 vmalloc N1=1
0xffffae2a4fe90000-0xffffae2a4ff9a000 1089536 ocf_mngt_cache_start+0x1b0/0x7a0 [cas_cache] pages=265 vmalloc N0=265
0xffffae2a50b02000-0xffffae2a50c28000 1204224 _raw_ram_init+0x2d/0x70 [cas_cache] pages=293 vmalloc N0=293
0xffffae2a50c28000-0xffffae2a50d32000 1089536 ocf_mngt_cache_start+0x1b0/0x7a0 [cas_cache] pages=265 vmalloc N0=265
0xffffae2a50d32000-0xffffae2a50e58000 1204224 _raw_ram_init+0x2d/0x70 [cas_cache] pages=293 vmalloc N0=293
0xffffae2a50e58000-0xffffae2a52313000 21737472 _raw_ram_init+0x2d/0x70 [cas_cache] pages=5306 vmalloc vpages N0=5306
0xffffae2a52313000-0xffffae2a530e2000 14479360 _raw_ram_init+0x2d/0x70 [cas_cache] pages=3534 vmalloc vpages N0=3534
0xffffae2a530e2000-0xffffae2a5459d000 21737472 _raw_ram_init+0x2d/0x70 [cas_cache] pages=5306 vmalloc vpages N0=5306
0xffffae2a5459d000-0xffffae2a56311000 30883840 _raw_ram_init+0x2d/0x70 [cas_cache] pages=7539 vmalloc vpages N0=7539
0xffffae2a56311000-0xffffae2a564cc000 1814528 _raw_ram_init+0x2d/0x70 [cas_cache] pages=442 vmalloc N0=442
0xffffae2a564cc000-0xffffae2a57cf6000 25337856 ocf_metadata_concurrency_attached_init+0x3c/0x190 [cas_cache] pages=6185 vmalloc vpages N0=6185
0xffffae2a57cf6000-0xffffae2a58cf8000 16785408 ocf_cache_line_concurrency_init+0x48/0x190 [cas_cache] pages=4097 vmalloc vpages N0=4097
0xffffae2a58cf8000-0xffffae2a593e0000 7241728 _ocf_realloc_with_cp+0x158/0x1b0 [cas_cache] pages=1767 vmalloc vpages N0=1767
0xffffae2a5c278000-0xffffae2a5d733000 21737472 _raw_ram_init+0x2d/0x70 [cas_cache] pages=5306 vmalloc vpages N1=5306
0xffffae2a5d733000-0xffffae2a5e502000 14479360 _raw_ram_init+0x2d/0x70 [cas_cache] pages=3534 vmalloc vpages N1=3534
0xffffae2a5e502000-0xffffae2a5f9bd000 21737472 _raw_ram_init+0x2d/0x70 [cas_cache] pages=5306 vmalloc vpages N1=5306
0xffffae2a5f9bd000-0xffffae2a5fb78000 1814528 _raw_ram_init+0x2d/0x70 [cas_cache] pages=442 vmalloc N1=442
0xffffae2a70001000-0xffffae2d073a0000 11127091200 _raw_ram_init+0x2d/0x70 [cas_cache] pages=2716574 vmalloc vpages N0=2716574
0xffffae2d073a0000-0xffffae2ec0f22000 7410819072 _raw_ram_init+0x2d/0x70 [cas_cache] pages=1809281 vmalloc vpages N0=1809281
0xffffae2ec0f22000-0xffffae31582c1000 11127091200 _raw_ram_init+0x2d/0x70 [cas_cache] pages=2716574 vmalloc vpages N0=2716574
0xffffae31582c1000-0xffffae3506818000 15809736704 _raw_ram_init+0x2d/0x70 [cas_cache] pages=3859798 vmalloc vpages N0=3859798
0xffffae3506818000-0xffffae353db8a000 926359552 _raw_ram_init+0x2d/0x70 [cas_cache] pages=226161 vmalloc vpages N0=226161
0xffffae353db8a000-0xffffae3842bac000 12968927232 ocf_metadata_concurrency_attached_init+0x3c/0x190 [cas_cache] pages=3166241 vmalloc vpages N0=3166241
0xffffae3842bac000-0xffffae384bcc2000 152133632 ocf_metadata_concurrency_attached_init+0x57/0x190 [cas_cache] pages=37141 vmalloc vpages N0=37141
0xffffae384bcc2000-0xffffae3928a84000 3705413632 _ocf_realloc_with_cp+0x158/0x1b0 [cas_cache] pages=904641 vmalloc vpages N0=904641
0xffffae3928a84000-0xffffae3bbfe23000 11127091200 _raw_ram_init+0x2d/0x70 [cas_cache] pages=2716574 vmalloc vpages N1=2716574
0xffffae3bbfe23000-0xffffae3d799a5000 7410819072 _raw_ram_init+0x2d/0x70 [cas_cache] pages=1809281 vmalloc vpages N1=1809281
0xffffae3d799a5000-0xffffae4010d44000 11127091200 _raw_ram_init+0x2d/0x70 [cas_cache] pages=2716574 vmalloc vpages N1=2716574
0xffffae4010d44000-0xffffae43bf29b000 15809736704 _raw_ram_init+0x2d/0x70 [cas_cache] pages=3859798 vmalloc vpages N1=3859798
0xffffae43bf29b000-0xffffae43c100f000 30883840 _raw_ram_init+0x2d/0x70 [cas_cache] pages=7539 vmalloc vpages N1=7539
0xffffae43c100f000-0xffffae43f8381000 926359552 _raw_ram_init+0x2d/0x70 [cas_cache] pages=226161 vmalloc vpages N1=226161
0xffffae43f8381000-0xffffae46fd3a3000 12968927232 ocf_metadata_concurrency_attached_init+0x3c/0x190 [cas_cache] pages=3166241 vmalloc vpages N1=3166241
0xffffae46fd3a3000-0xffffae46febcd000 25337856 ocf_metadata_concurrency_attached_init+0x3c/0x190 [cas_cache] pages=6185 vmalloc vpages N1=6185
0xffffae46febcd000-0xffffae4707ce3000 152133632 ocf_metadata_concurrency_attached_init+0x57/0x190 [cas_cache] pages=37141 vmalloc vpages N1=37141
0xffffae4707ce3000-0xffffae4708ce5000 16785408 ocf_cache_line_concurrency_init+0x48/0x190 [cas_cache] pages=4097 vmalloc vpages N1=4097
0xffffae4708ce5000-0xffffae47e5aa7000 3705413632 _ocf_realloc_with_cp+0x158/0x1b0 [cas_cache] pages=904641 vmalloc vpages N1=904641
0xffffae47e5aa7000-0xffffae47e618f000 7241728 _ocf_realloc_with_cp+0x158/0x1b0 [cas_cache] pages=1767 vmalloc vpages N1=1767
0xffffae47e6eb3000-0xffffae47e6eec000 233472 cleaning_policy_acp_initialize+0x3e/0x330 [cas_cache] pages=56 vmalloc N1=56
0xffffae47e6eec000-0xffffae47e6ef3000 28672 cleaning_policy_acp_add_core+0x7c/0x160 [cas_cache] pages=6 vmalloc N1=6
0xffffae47e712a000-0xffffae47e7163000 233472 cleaning_policy_acp_initialize+0x3e/0x330 [cas_cache] pages=56 vmalloc N1=56
0xffffae47e7163000-0xffffae47e7169000 24576 cleaning_policy_acp_add_core+0x7c/0x160 [cas_cache] pages=5 vmalloc N1=5
0xffffae47e7692000-0xffffae47e8238000 12214272 cleaning_policy_acp_add_core+0x7c/0x160 [cas_cache] pages=2981 vmalloc vpages N1=2981
0xffffae47e8238000-0xffffae47e8af5000 9162752 cleaning_policy_acp_add_core+0x7c/0x160 [cas_cache] pages=2236 vmalloc vpages N1=2236

use command grep vmalloc /proc/vmallocinfo |grep cas_cache | awk '{total+=$2}; END {print total}'
126764556288
[root@szdpl1491 ~]# free -h
total used free shared buff/cache available
Mem: 125G 123G 967M 5.1M 864M 194M
Swap: 31G 7.9G 24G

The cas_cache uses 118G; on another server that has run OpenCAS for 1 month, the cas_cache uses 59G.

4. OpenCAS configuration

  • cache mode: WT

  • cache config

    casadm -L

    type id disk status write policy device
    cache 1 /dev/sda Running wt -
    └core 1 /dev/sdd Active - /dev/cas1-1
    cache 2 /dev/sdb Running wt -
    └core 1 /dev/sde Active - /dev/cas2-1
    casadm -P -i 1

    Cache Id 1
    Cache Size 926351414 [4KiB Blocks] / 3533.75 [GiB]
    Cache Device /dev/sda
    Core Devices 1
    Inactive Core Devices 0
    Write Policy wt
    Eviction Policy lru
    Cleaning Policy alru
    Promotion Policy always
    Cache line size 4 [KiB]
    Metadata Memory Footprint 46.7 [GiB]
    Dirty for 0 [s] / Cache clean
    Metadata Mode normal
    Status Running

    ╔══════════════════╤═══════════╤══════╤═════════════╗
    ║ Usage statistics │ Count │ % │ Units ║
    ╠══════════════════╪═══════════╪══════╪═════════════╣
    ║ Occupancy │ 500156640 │ 54.0 │ 4KiB blocks ║
    ║ Free │ 426194774 │ 46.0 │ 4KiB blocks ║
    ║ Clean │ 500156640 │ 54.0 │ 4KiB blocks ║
    ║ Dirty │ 0 │ 0.0 │ 4KiB blocks ║
    ╚══════════════════╧═══════════╧══════╧═════════════╝

    ╔══════════════════════╤════════════╤═══════╤══════════╗
    ║ Request statistics │ Count │ % │ Units ║
    ╠══════════════════════╪════════════╪═══════╪══════════╣
    ║ Read hits │ 2628479218 │ 61.6 │ Requests ║
    ║ Read partial misses │ 0 │ 0.0 │ Requests ║
    ║ Read full misses │ 64 │ 0.0 │ Requests ║
    ║ Read total │ 2628479282 │ 61.6 │ Requests ║
    ╟──────────────────────┼────────────┼───────┼──────────╢
    ║ Write hits │ 1579075690 │ 37.0 │ Requests ║
    ║ Write partial misses │ 4237509 │ 0.1 │ Requests ║
    ║ Write full misses │ 58309934 │ 1.4 │ Requests ║
    ║ Write total │ 1641623133 │ 38.4 │ Requests ║
    ╟──────────────────────┼────────────┼───────┼──────────╢
    ║ Pass-Through reads │ 0 │ 0.0 │ Requests ║
    ║ Pass-Through writes │ 0 │ 0.0 │ Requests ║
    ║ Serviced requests │ 4270102415 │ 100.0 │ Requests ║
    ╟──────────────────────┼────────────┼───────┼──────────╢
    ║ Total requests │ 4270102415 │ 100.0 │ Requests ║
    ╚══════════════════════╧════════════╧═══════╧══════════╝

    ╔══════════════════════════════════╤═════════════╤═══════╤═════════════╗
    ║ Block statistics │ Count │ % │ Units ║
    ╠══════════════════════════════════╪═════════════╪═══════╪═════════════╣
    ║ Reads from core(s) │ 356 │ 0.0 │ 4KiB blocks ║
    ║ Writes to core(s) │ 2954907028 │ 100.0 │ 4KiB blocks ║
    ║ Total to/from core(s) │ 2954907384 │ 100.0 │ 4KiB blocks ║
    ╟──────────────────────────────────┼─────────────┼───────┼─────────────╢
    ║ Reads from cache │ 0 │ 0.0 │ 4KiB blocks ║
    ║ Writes to cache │ 0 │ 0.0 │ 4KiB blocks ║
    ║ Total to/from cache │ 0 │ 0.0 │ 4KiB blocks ║
    ╟──────────────────────────────────┼─────────────┼───────┼─────────────╢
    ║ Reads from exported object(s) │ 12708231250 │ 81.1 │ 4KiB blocks ║
    ║ Writes to exported object(s) │ 2954907028 │ 18.9 │ 4KiB blocks ║
    ║ Total to/from exported object(s) │ 15663138278 │ 100.0 │ 4KiB blocks ║
    ╚══════════════════════════════════╧═════════════╧═══════╧═════════════╝

    ╔════════════════════╤═══════╤═════╤══════════╗
    ║ Error statistics │ Count │ % │ Units ║
    ╠════════════════════╪═══════╪═════╪══════════╣
    ║ Cache read errors │ 0 │ 0.0 │ Requests ║
    ║ Cache write errors │ 0 │ 0.0 │ Requests ║
    ║ Cache total errors │ 0 │ 0.0 │ Requests ║
    ╟────────────────────┼───────┼─────┼──────────╢
    ║ Core read errors │ 0 │ 0.0 │ Requests ║
    ║ Core write errors │ 0 │ 0.0 │ Requests ║
    ║ Core total errors │ 0 │ 0.0 │ Requests ║
    ╟────────────────────┼───────┼─────┼──────────╢
    ║ Total errors │ 0 │ 0.0 │ Requests ║
    ╚════════════════════╧═══════╧═════╧══════════╝
  • cache line size

    # dmesg |grep "Cache line size"
    2751 [ 1505.783016] cache1: Hash offset : 44427904 kiB
    2752 [ 1505.783017] cache1: Hash size : 904644 kiB
    2753 [ 1505.783018] cache1: Cache line size: 4 kiB
    2754 [ 1505.783019] cache1: Metadata capacity: 47803 MiB
    2755 [ 1521.649327] cache1: OCF metadata self-test PASSED
    2756 [ 1527.389763] Thread cas_clean_cache1 started

    2893 [ 1823.699193] cache2: Hash offset : 44427904 kiB
    2894 [ 1823.699194] cache2: Hash size : 904644 kiB
    2895 [ 1823.699195] cache2: Cache line size: 4 kiB
    2896 [ 1823.699197] cache2: Metadata capacity: 47803 MiB
    2897 [ 1839.660385] cache2: OCF metadata self-test PASSED
    2898 [ 1845.359600] Thread cas_clean_cache2 started

5. Solution

  • When initializing OpenCAS, casadm should be given the --cache-line-size parameter (the default is 4 KiB). The host memory consumed by metadata is roughly: memory = SSD_size / cache_line_size * 70 bytes (a worked sketch follows this list).
  • The upstream maintainers' brief explanation:
    It looks quite normal to me. You have two huge cache devices (2 x 3.5TB) and size of CAS metadata is proportional to number of cache lines. CAS allocates about 70 bytes of metadata per cache line, so in your case it is about 60GiB of metadata per single cache, giving ~120GiB in total. That matches pretty well with your numbers.

    You can decrease memory consumption by choosing bigger cache line size. You can select cache line size up to 64kiB, which would decrease memory usage by factor of 16.

    I'd also recommend you, if it's possible, to switch to CAS v20.3.
    CAS v19.9 was tested only with basic set of tests, while v20.3 was thoroughly validated with extensive set of tests, thus it's much more stable than any previous version.
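
To make the formula concrete, here is a minimal C sketch (an addition for illustration, not part of the Open CAS tooling) that estimates the metadata RAM for one 3.5 TiB cache device at each supported cache line size, using the ~70 bytes-per-line figure from the reply quoted above:

    #include <stdio.h>

    // Estimate Open CAS metadata RAM: cache_size / cache_line_size * ~70 bytes per line.
    // The 3.5 TiB device size and the ~70-byte figure come from the maintainer's reply above.
    int main(void) {
        const double bytes_per_line = 70.0;
        const double cache_bytes = 3.5 * 1024 * 1024 * 1024 * 1024; /* one 3.5 TiB cache device */
        const int line_sizes_kib[] = {4, 8, 16, 32, 64};

        for (size_t i = 0; i < sizeof(line_sizes_kib) / sizeof(line_sizes_kib[0]); i++) {
            double lines = cache_bytes / (line_sizes_kib[i] * 1024.0);
            double meta_gib = lines * bytes_per_line / (1024.0 * 1024.0 * 1024.0);
            printf("cache line %2d KiB -> ~%.1f GiB metadata per cache device\n",
                   line_sizes_kib[i], meta_gib);
        }
        return 0;
    }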

6. Official Open CAS documentation on cache line size

Why does Open CAS Linux use some DRAM space?

Open CAS Linux uses a portion of system memory for metadata, which tells us where data resides. The amount of memory needed is proportional to the size of the cache space. This is true for any caching software solution. However with Open CAS Linux this memory footprint can be decreased using a larger cache line size set by the parameter --cache-line-size, which may be useful in high density servers with many large HDDs.

Configuration Tool Details

The Open CAS Linux product includes a user-level configuration tool that provides complete control of the caching software. The commands and parameters available with this tool are detailed in this chapter.

To access help from the CLI, type the -H or --help parameter for details. You can also view the man page for this product by entering the following command:

# man casadm

Usage: casadm --start-cache --cache-device <DEVICE> [option...]

Example:

# casadm --start-cache --cache-device /dev/sdc

or

# casadm -S -d /dev/sdc

Description: Prepares a block device to be used as device for caching other block devices. Typically the cache devices are SSDs or other NVM block devices or RAM disks. The process starts a framework for device mappings pertaining to a specific cache ID. The cache can be loaded with an old state when using the -l or --load parameter (previous cache metadata will not be marked as invalid) or with a new state as the default (previous cache metadata will be marked as invalid).

Required Parameters:

[-d, --cache-device <DEVICE>]: Caching device to be used. This is an SSD or any NVM block device or RAM disk shown in the /dev directory. <DEVICE> needs to be the complete path describing the caching device to be used, for example /dev/sdc.

Optional Parameters:

[-i, --cache-id <ID>]: Cache ID to create; <1 to 16384>. The ID may be specified, or by default the command will use the lowest available number first.

[-l, --load]: Load existing cache metadata from caching device. If the cache device has been used previously and then disabled (like in a reboot) and it is determined that the data in the core device has not changed since the cache device was used, this option will allow continuing the use of the data in the cache device without the need to re-warm the cache with data.

  • Caution: You must ensure that the last shutdown followed the instructions in section Stopping Cache Instances. If there was any change in the core data prior to enabling the cache, the data would not be synced correctly and will be corrupted.

[-f, --force]: Forces creation of a cache even if a file system exists on the cache device. This is typically used for devices that have been previously utilized as a cache device.

  • Caution: This will delete the file system and any existing data on the cache device.

[-c, --cache-mode <NAME>]: Sets the cache mode for a cache instance the first time it is started or created. The mode can be one of the following:

wt: (default mode) Turns write-through mode on. When using this parameter, the write-through feature is enabled which allows the acceleration of only read intensive operations.

wb: Turns write-back mode on. When using this parameter, the write-back feature is enabled which allows the acceleration of both read and write intensive operations.

  • Caution: A failure of the cache device may lead to the loss of data that has not yet been flushed to the core device.

wa: Turns write-around mode on. When using this parameter, the write-around feature is enabled which allows the acceleration of reads only. All write locations that do not already exist in the cache (i.e. the locations have not been read yet or have been evicted) are written directly to the core drive, bypassing the cache. If the location being written already exists in cache, then both the cache and the core drive will be updated.

pt: Starts cache in pass-through mode. Caching is effectively disabled in this mode. This allows the user to associate all their desired core devices to be cached prior to actually enabling caching. Once the core devices are associated, the user would dynamically switch to their desired caching mode (see '-Q | --set-cache-mode' for details).

wo: Turns write-only mode on. When using this parameter, the write-only feature is enabled which allows the acceleration of write intensive operations primarily.

  • Caution: A failure of the cache device may lead to the loss of data that has not yet been flushed to the core device.

[-x, --cache-line-size <SIZE>]: Set the cache line size in KiB {4 (default), 8, 16, 32, 64}. The cache line size can only be set when starting the cache and cannot be changed after the cache is started.

chapter 1. lustre deployment

0. Node information

node                          role
centos1 (mgs_node/mds_node)   mgs/mds
centos2 (oss_node)            oss
centos3 (client_node)         client

1. Kernel version information

[root@CentOS1 ~]# cat /etc/redhat-release 
CentOS Linux release 7.7.1908 (Core)
[root@CentOS1 ~]# uname -r
3.10.0-1062.el7.x86_64

2. Configure the offline lustre package repositories

// the lustre-2.13.0 kernel exactly matches kernel 3.10.0-1062.el7.x86_64
[root@CentOS1 lustre]# pwd
/root/lustre
[root@CentOS1 lustre]# ls
repo.conf
[root@CentOS1 lustre]# cat repo.conf
[lustre-server]
name=lustre-server
baseurl=https://downloads.whamcloud.com/public/lustre/lustre-2.13.0/el7.7.1908/server
gpgcheck=0


[patchless-ldiskfs-server]
name=patchless-ldiskfs-server
baseurl=https://downloads.whamcloud.com/public/lustre/lustre-2.13.0/el7.7.1908/patchless-ldiskfs-server
gpgcheck=0

[lustre-client]
name=lustre-client
baseurl=https://downloads.whamcloud.com/public/lustre/lustre-2.13.0/el7.7.1908/client
gpgcheck=0

[e2fsprogs-wc]
name=e2fsprogs-wc
baseurl=https://downloads.whamcloud.com/public/e2fsprogs/latest/el7
gpgcheck=0

3. Download the lustre packages locally

[root@CentOS1 ~]# cd ~/lustre
[root@CentOS1 ~]# yum groupinstall "Development Tools" -y
[root@CentOS1 ~]# yum install epel-release quilt libselinux-devel python-docutils xmlto asciidoc elfutils-libelf-devel elfutils-devel zlib-devel rng-tools binutils-devel python-devel sg3_utils newt-devel perl-ExtUtils-Embed audit-libs-devel lsof hmaccalc -y
[root@CentOS1 ~]# systemctl stop firewalld.service
[root@CentOS1 ~]# systemctl disable firewalld.service
[root@CentOS1 ~]# yum install yum-utils dkms -y
[root@CentOS1 ~]# reposync -c repo.conf -n \
-r lustre-server \
-r lustre-client \
-r patchless-ldiskfs-server \
-r e2fsprogs-wc
[root@CentOS1 ~]# cp ~/lustre/repo.conf /etc/yum.repos.d/lustre.repo

4. Install the lustre kernel

// after downloading all the lustre packages
[root@CentOS1 lustre]# ls
e2fsprogs-wc lustre-client lustre-server patchless-ldiskfs-server repo.conf

[root@CentOS1 lustre]# cd e2fsprogs-wc/RPMS/x86_64/ && pwd
/root/lustre/e2fsprogs-wc/RPMS/x86_64
[root@CentOS1 x86_64]# yum --nogpgcheck --disablerepo=* --enablerepo=e2fsprogs-wc install e2fsprogs-1.45.2.wc1-0.el7.x86_64.rpm

[root@CentOS1 lustre]# cd lustre-server/RPMS/x86_64/ && pwd
/root/lustre/lustre-server/RPMS/x86_64


yum --nogpgcheck --disablerepo=* --enablerepo=lustre-server install kernel-devel-3.10.0-1062.1.1.el7_lustre.x86_64.rpm kernel-headers-3.10.0-1062.1.1.el7_lustre.x86_64.rpm
[root@CentOS1 ~] reboot

5. Install zfs

[root@CentOS1 lustre]# cd lustre-server/RPMS/x86_64/ && pwd 
/root/lustre/lustre-server/RPMS/x86_64

[root@CentOS1 x86_64]# yum --nogpgcheck --disablerepo=* --enablerepo=lustre-server install zfs-0.7.13-1.el7.x86_64.rpm zfs-dkms-0.7.13-1.el7.noarch.rpm zfs-dracut-0.7.13-1.el7.x86_64.rpm

[root@CentOS1 x86_64]# reboot
[root@CentOS1 x86_64]# modprobe zfs
[root@CentOS1 x86_64]# lsmod |grep zfs
zfs 3564425 3
zunicode 331170 1 zfs
zavl 15236 1 zfs
icp 270148 1 zfs
zcommon 73440 1 zfs
znvpair 89131 2 zfs,zcommon
spl 102412 4 icp,zfs,zcommon,znvpair

6. Install the lustre server core packages

[root@CentOS1 ~] yum --nogpgcheck --disablerepo=*  --enablerepo=lustre-server install kmod-lustre-2.13.0-1.el7.x86_64.rpm kmod-lustre-osd-ldiskfs-2.13.0-1.el7.x86_64.rpm lustre-2.13.0-1.el7.x86_64.rpm lustre-osd-ldiskfs-mount-2.13.0-1.el7.x86_64.rpm lustre-osd-zfs-mount-2.13.0-1.el7.x86_64.rpm lustre-resource-agents-2.13.0-1.el7.x86_64.rpm

[root@CentOS1 ~]# reboot
[root@CentOS1 ~]# modprobe lustre
[root@CentOS1 ~]# lsmod |grep lustre
lustre 875887 0
lmv 191957 1 lustre
mdc 247683 1 lustre
lov 320485 1 lustre
ptlrpc 2287996 7 fid,fld,lmv,mdc,lov,osc,lustre
obdclass 2649116 8 fid,fld,lmv,mdc,lov,osc,lustre,ptlrpc
lnet 595547 6 lmv,osc,lustre,obdclass,ptlrpc,ksocklnd
libcfs 388506 11 fid,fld,lmv,mdc,lov,osc,lnet,lustre,obdclass,ptlrpc,ksocklnd

[root@CentOS1 ~]# modprobe lnet
[root@CentOS1 ~]# lsmod |grep lnet
lnet 595547 6 lmv,osc,lustre,obdclass,ptlrpc,ksocklnd
libcfs 388506 11 fid,fld,lmv,mdc,lov,osc,lnet,lustre,obdclass,ptlrpc,ksocklnd

7. Install the lustre client core packages (on the client node)

// the lustre server and lustre client packages conflict with each other, so the client must be installed on a different node
[root@CentOS3 ~]# yum install epel-release -y
[root@CentOS3 ~]# yum install dkms -y
[root@CentOS3 ~]# cd lustre/
[root@CentOS3 lustre]# cd lustre-client/RPMS/x86_64/
[root@CentOS3 x86_64]# ls
kmod-lustre-client-2.13.0-1.el7.x86_64.rpm lustre-client-debuginfo-2.13.0-1.el7.x86_64.rpm lustre-iokit-2.13.0-1.el7.x86_64.rpm
kmod-lustre-client-tests-2.13.0-1.el7.x86_64.rpm lustre-client-dkms-2.13.0-1.el7.noarch.rpm
lustre-client-2.13.0-1.el7.x86_64.rpm lustre-client-tests-2.13.0-1.el7.x86_64.rpm
[root@CentOS3 x86_64]# yum --nogpgcheck --disablerepo=* --enablerepo=lustre-client install kmod-lustre-client-2.13.0-1.el7.x86_64.rpm lustre-client-2.13.0-1.el7.x86_64.rpm lustre-client-dkms-2.13.0-1.el7.noarch.rpm

8. deploy mgs/mgt

[root@CentOS1 ~]# fdisk -l|grep sd
Disk /dev/sda: 68.7 GB, 68719476736 bytes, 134217728 sectors
/dev/sda1 * 2048 2099199 1048576 83 Linux
/dev/sda2 2099200 134217727 66059264 8e Linux LVM
Disk /dev/sdc: 25.8 GB, 25769803776 bytes, 50331648 sectors
Disk /dev/sdb: 25.8 GB, 25769803776 bytes, 50331648 sectors
[root@CentOS1 ~]# mkfs.lustre --mgs /dev/sdb
[root@CentOS1 ~]# mkdir /mgt
[root@CentOS1 ~]# mount.lustre /dev/sdb /mgt/
[root@CentOS1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 905M 0 905M 0% /dev
tmpfs 917M 0 917M 0% /dev/shm
tmpfs 917M 9.0M 908M 1% /run
tmpfs 917M 0 917M 0% /sys/fs/cgroup
/dev/mapper/centos-root 41G 2.9G 39G 7% /
/dev/sda1 1014M 193M 822M 20% /boot
/dev/mapper/centos-home 20G 33M 20G 1% /home
tmpfs 184M 0 184M 0% /run/user/0
/dev/sdb 24G 46M 23G 1% /mgt

9.deploy mds/mdt

//mkfs.lustre --fsname=lustrefs --mgsnode=mgs_node@tcp0 --mdt --index=0 /dev/sdc
[root@CentOS1 ~]# mkfs.lustre --fsname=lustrefs --mgsnode=10.211.55.3@tcp0 --mdt --index=0 /dev/sdc

Permanent disk data:
Target: lustrefs:MDT0000
Index: 0
Lustre FS: lustrefs
Mount type: ldiskfs
Flags: 0x61
(MDT first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=10.211.55.3@tcp

checking for existing Lustre data: not found
device size = 24576MB
formatting backing filesystem ldiskfs on /dev/sdc
target name lustrefs:MDT0000
kilobytes 25165824
options -J size=983 -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,ea_inode,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustrefs:MDT0000 -J size=983 -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,ea_inode,flex_bg -E lazy_journal_init -F /dev/sdc 25165824k
Writing CONFIGS/mountdata
[root@CentOS1 ~]# mkdir /mdt
[root@CentOS1 ~]# mount.lustre /dev/sdc /mdt

10.deploy oss/ost

//mkfs.lustre --ost --fsname=lustrefs --mgsnode=mgs_node@tcp0 --index=1 /dev/sdb

[root@CentOS2 ~]# mkfs.lustre --ost --fsname=lustrefs --mgsnode=10.211.55.3@tcp0 --index=1 /dev/sdb

Permanent disk data:
Target: lustrefs:OST0001
Index: 1
Lustre FS: lustrefs
Mount type: ldiskfs
Flags: 0x62
(OST first_time update )
Persistent mount opts: ,errors=remount-ro
Parameters: mgsnode=10.211.55.3@tcp

checking for existing Lustre data: not found
device size = 24576MB
formatting backing filesystem ldiskfs on /dev/sdb
target name lustrefs:OST0001
kilobytes 25165824
options -J size=983 -I 512 -i 69905 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E resize="4290772992",lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustrefs:OST0001 -J size=983 -I 512 -i 69905 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E resize="4290772992",lazy_journal_init -F /dev/sdb 25165824k
Writing CONFIGS/mountdata
[root@CentOS2 ~]# mkdir /ost
[root@CentOS2 ~]# mount.lustre /dev/sdb /ost/

11.mount on client node

[root@CentOS3 ~]# mkdir /mnt/lustrefs
[root@CentOS3 ~]# mount -t lustre 10.211.55.3@tcp0:/lustrefs /mnt/lustrefs
[root@CentOS3 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 906M 0 906M 0% /dev
tmpfs 917M 0 917M 0% /dev/shm
tmpfs 917M 9.0M 908M 1% /run
tmpfs 917M 0 917M 0% /sys/fs/cgroup
/dev/mapper/centos-root 41G 2.8G 39G 7% /
/dev/sda1 1014M 149M 866M 15% /boot
/dev/mapper/centos-home 20G 33M 20G 1% /home
tmpfs 184M 0 184M 0% /run/user/0
10.211.55.3@tcp:/lustrefs 23G 46M 22G 1% /mnt/lustrefs

chapter 2. lustre process

//show lustre mgs info
[root@CentOS1 ~]# ps -ef|grep mgs
root 2275 2 0 13:29 ? 00:00:00 [mgs_params_noti]
root 2276 2 0 13:29 ? 00:00:00 [ll_mgs_0000]
root 2277 2 0 13:29 ? 00:00:00 [ll_mgs_0001]
root 2278 2 0 13:29 ? 00:00:00 [ll_mgs_0002]
root 2375 2 0 13:31 ? 00:00:00 [mgs_lustrefs_no]
root 2548 2105 0 13:49 pts/0 00:00:00 grep --color=auto mgs

//show lustre mdt process
[root@CentOS1 ~]# ps -ef|grep mdt
root 2357 2 0 13:31 ? 00:00:00 [mdt00_000]
root 2358 2 0 13:31 ? 00:00:00 [mdt00_001]
root 2359 2 0 13:31 ? 00:00:01 [mdt00_002]
root 2360 2 0 13:31 ? 00:00:00 [mdt_rdpg00_000]
root 2361 2 0 13:31 ? 00:00:00 [mdt_rdpg00_001]
root 2362 2 0 13:31 ? 00:00:00 [mdt_attr00_000]
root 2363 2 0 13:31 ? 00:00:00 [mdt_attr00_001]
root 2364 2 0 13:31 ? 00:00:00 [mdt_out00_000]
root 2365 2 0 13:31 ? 00:00:00 [mdt_out00_001]
root 2366 2 0 13:31 ? 00:00:00 [mdt_seqs_0000]
root 2367 2 0 13:31 ? 00:00:00 [mdt_seqs_0001]
root 2368 2 0 13:31 ? 00:00:00 [mdt_seqm_0000]
root 2369 2 0 13:31 ? 00:00:00 [mdt_seqm_0001]
root 2370 2 0 13:31 ? 00:00:00 [mdt_fld_0000]
root 2371 2 0 13:31 ? 00:00:00 [mdt_fld_0001]
root 2372 2 0 13:31 ? 00:00:00 [mdt_io00_000]
root 2373 2 0 13:31 ? 00:00:00 [mdt_io00_001]
root 2374 2 0 13:31 ? 00:00:00 [mdt_io00_002]
root 2540 2 0 13:48 ? 00:00:00 [mdt00_003]
root 2552 2105 0 13:49 pts/0 00:00:00 grep --color=auto mdt

//show ost process
[root@CentOS2 ~]# ps -ef|grep ost
root 1854 1 0 13:09 ? 00:00:00 /usr/libexec/postfix/master -w
postfix 1859 1854 0 13:09 ? 00:00:00 pickup -l -t unix -u
postfix 1860 1854 0 13:09 ? 00:00:00 qmgr -l -t unix -u
root 2431 2 0 13:47 ? 00:00:00 [ll_ost00_000]
root 2432 2 0 13:47 ? 00:00:00 [ll_ost00_001]
root 2433 2 0 13:47 ? 00:00:00 [ll_ost00_002]
root 2434 2 0 13:47 ? 00:00:00 [ll_ost_create00]
root 2435 2 0 13:47 ? 00:00:00 [ll_ost_create00]
root 2436 2 0 13:47 ? 00:00:00 [ll_ost_io00_000]
root 2437 2 0 13:47 ? 00:00:00 [ll_ost_io00_001]
root 2438 2 0 13:47 ? 00:00:00 [ll_ost_io00_002]
root 2439 2 0 13:47 ? 00:00:00 [ll_ost_seq00_00]
root 2440 2 0 13:47 ? 00:00:00 [ll_ost_seq00_00]
root 2441 2 0 13:47 ? 00:00:00 [ll_ost_out00_00]
root 2442 2 0 13:47 ? 00:00:00 [ll_ost_out00_00]
root 2473 2111 0 13:49 pts/0 00:00:00 grep --color=auto ost

chapter 3. script

  • start mgs/mdt
modprobe  zfs
modprobe lustre
modprobe lnet
//execute on mgs/mdt node
mount.lustre /dev/sdc /mdt
mount.lustre /dev/sdb /mgt/
  • start ost
modprobe  zfs
modprobe lustre
modprobe lnet
mount.lustre /dev/sdb /ost/
  • mount
mount -t lustre 10.211.55.3@tcp0:/lustrefs /mnt/lustrefs

A glusterd peer showing Disconnected state in a glusterfs cluster slows down cluster IO requests

Cluster state as seen from node 172.25.78.16


[root@szdpl1491 ~]# gluster pool list
UUID Hostname State
bde7b9e2-af2a-419e-8242-6a8f8e18bb8a 172.25.78.242 Disconnected
54fa8ec0-3617-4c8e-968a-7402a80a9017 172.25.78.240 Connected
588bb497-5d9f-4bcf-b0b1-c9d364afb084 172.25.78.241 Connected
a8522deb-ee16-44a3-a800-9525cb8d64fa localhost Connected
  • The state of node 172.25.78.242 as seen from 172.25.78.16 is clearly wrong, so check the peer information recorded for node 242 on the other nodes.

Cluster state as seen from node 172.25.78.240

[root@szdpl1543 ~]# gluster pool list
UUID Hostname State
588bb497-5d9f-4bcf-b0b1-c9d364afb084 172.25.78.241 Connected
bde7b9e2-af2a-419e-8242-6a8f8e18bb8a 172.25.78.242 Connected
a8522deb-ee16-44a3-a800-9525cb8d64fa 172.25.78.16 Connected
54fa8ec0-3617-4c8e-968a-7402a80a9017 localhost Connected
[root@szdpl1543 ~]# cd /var/lib/glusterd/peers/
[root@szdpl1543 peers]# ls
588bb497-5d9f-4bcf-b0b1-c9d364afb084 a8522deb-ee16-44a3-a800-9525cb8d64fa bde7b9e2-af2a-419e-8242-6a8f8e18bb8a
[root@szdpl1543 peers]# cat bde7b9e2-af2a-419e-8242-6a8f8e18bb8a
uuid=bde7b9e2-af2a-419e-8242-6a8f8e18bb8a
state=3
hostname1=172.25.78.242
hostname2=bogon
[root@szdpl1543 peers]# vi bde7b9e2-af2a-419e-8242-6a8f8e18bb8a
uuid=bde7b9e2-af2a-419e-8242-6a8f8e18bb8a
state=3
hostname1=172.25.78.242

Cluster state as seen from node 172.25.78.16

[root@szdpl1491 ~]# gluster pool list
UUID Hostname State
bde7b9e2-af2a-419e-8242-6a8f8e18bb8a 172.25.78.242 Disconnected
54fa8ec0-3617-4c8e-968a-7402a80a9017 172.25.78.240 Connected
588bb497-5d9f-4bcf-b0b1-c9d364afb084 172.25.78.241 Connected
a8522deb-ee16-44a3-a800-9525cb8d64fa localhost Connected
[root@szdpl1491 ~]# cd /var/lib/glusterd/peers/
[root@szdpl1491 peers]# ls
54fa8ec0-3617-4c8e-968a-7402a80a9017 588bb497-5d9f-4bcf-b0b1-c9d364afb084 bde7b9e2-af2a-419e-8242-6a8f8e18bb8a
[root@szdpl1491 peers]# cat bde7b9e2-af2a-419e-8242-6a8f8e18bb8a
uuid=bde7b9e2-af2a-419e-8242-6a8f8e18bb8a
state=3
hostname1=bogon
hostname2=172.25.78.242
  • On every node, edit /var/lib/glusterd/peers/bde7b9e2-af2a-419e-8242-6a8f8e18bb8a and set hostname1 to the real IP address, then restart the glusterd service on each node (systemctl restart glusterd).

1. glusterfs7 yum repo

  • Add the following repo definition to /etc/yum.repos.d/glusterfs.repo
    [centos-gluster7]
    gpgcheck = 0
    mirrorlist = http://mirrorlist.centos.org?arch=$basearch&release=$releasever&repo=storage-gluster-7
    name = CentOS-$releasever - Gluster 7

    [centos-gluster6]
    gpgcheck = 0
    mirrorlist = http://mirrorlist.centos.org?arch=$basearch&release=$releasever&repo=storage-gluster-6
    name = CentOS-$releasever - Gluster 6

    2. Update the system yum metadata

    yum check-update

    3. Installation script (glusterfs_install.sh)

  • Script contents
    #!/bin/bash

    version="$1"
    [[ "$version" == "" ]] && echo "version not provided" && exit 1

    rpm_packages=(
    glusterfs-server
    glusterfs-events
    glusterfs-extra-xlators
    glusterfs-geo-replication
    glusterfs-libs
    glusterfs-rdma
    glusterfs
    glusterfs-api
    glusterfs-api-devel
    glusterfs-cli
    glusterfs-client-xlators
    glusterfs-fuse
    python2-gluster
    )
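    # build a versioned name for every package above, e.g. glusterfs-server-7.2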

    for index in ${!rpm_packages[@]}; do
    rpm_version_packages[$index]=${rpm_packages[$index]}"-"$version
    done


    yum install -y ${rpm_version_packages[*]}
  • Running the script
    // 7.2 is the glusterfs version number to install
    ./glusterfs_install.sh 7.2

  • go access c struct
  // go run cgo.go
package main

/*
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <assert.h>
typedef struct student {
int age;
char *name;
}student;

void student_init(void **ptr,char *name,int age) {
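// allocate the student on the C heap, copy name into fresh memory, and hand it back through the out-parameter ptr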
size_t len = strlen(name);
student *st = (student *)calloc(1,sizeof(student));
assert(st!=NULL);
st->age = age;
st->name = (char *)calloc(1,len+1);
memcpy(st->name,name,len);
*ptr = st;
fprintf(stdout,"...call student_init...\n");
}
student *student_new(char *name,int age) {
size_t len = strlen(name);
student *st = (student *)calloc(1,sizeof(student));
assert(st!=NULL);
st->age = age;
st->name = (char *)calloc(1,len+1);
memcpy(st->name,name,len);

fprintf(stdout,"...call student_new...\n");
return st;
}
void student_destroy(void *ptr) {
student *st = (student *)ptr;
if(st !=NULL)
{
free(st->name);
free(st);
st=NULL;
fprintf(stdout,"...call student_destroy...\n");
}
}
void student_print(void *ptr) {
student *st = (student *)ptr;
fprintf(stdout,"student addr=%p,name=%p,age=%p\n",st,st->name,&st->age);
fprintf(stdout," student {name=%s,age=%d}\n",st->name,st->age);
}
*/
import "C"
import (
"fmt"
"unsafe"
)

func main() {
var st1 unsafe.Pointer
name := C.CString("perrynzhou")
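// C.CString copies the Go string into C-allocated memory; it must be released with C.free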
C.student_init(&st1, name, 30)
C.student_print(st1)
C.student_destroy(st1)
C.free(unsafe.Pointer(name))

var st2 *C.student
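// st2 points at memory owned by C, but its fields can be read and written directly from Go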
name2 := C.CString("hello")
st2 = C.student_new(name2, 100)
fmt.Printf("init student st2 {age:%d,name:%s}\n", st2.age, C.GoString(st2.name))
C.student_print(unsafe.Pointer(st2))

C.free(unsafe.Pointer(st2.name))
name3 := C.CString("join")
st2.name = name3
st2.age = 67
fmt.Printf("after change student st2 {age:%d,name:%s}\n", st2.age, C.GoString(st2.name))
C.student_print(unsafe.Pointer(st2))
C.student_destroy(unsafe.Pointer(st2))

}

  • go access c memory
    //    go build  -o cgo_test cgo.go 

    package main

    /*

    #include <stdlib.h>

    void *alloc() {
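    // each call returns a fresh C-heap allocation holding an incrementing counter; the Go side must free it later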
    static int count = 0;
    void *d = malloc(sizeof(int));
    *((int *)d) = count++;
    return d;
    }
    */
    import "C"
    import (
    "fmt"
    "runtime"
    "sync"
    "time"
    "unsafe"
    )

    type CStruct struct {
    sync.Mutex
    name string
    allocCnt int
    memory unsafe.Pointer
    }

    func (cs *CStruct) alloc(name string, id int) {
    cs.name = fmt.Sprintf("CStruct-%s-%d", name, id)
    cs.Lock()
    defer cs.Unlock()
    cs.allocCnt++
    fmt.Printf("%s begin with alloc,count=%d\n", cs.name, cs.allocCnt)
    cs.memory = C.alloc()
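    // register free as a finalizer so the C allocation is released when the GC collects this CStruct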
    runtime.SetFinalizer(cs, free)
    }
    func free(cs *CStruct) {
    C.free(unsafe.Pointer(cs.memory))
    cs.Lock()
    defer cs.Unlock()
    cs.allocCnt--
    fmt.Printf("%s end with free count=%d\n", cs.name, cs.allocCnt)
    }
    func CStructTest(i int) {
    var c1, c2 CStruct
    c1.alloc("c1", i)
    c2.alloc("c2", i)
    }
    func main() {
    for i := 0; i < 10; i++ {
    CStructTest(i)
    time.Sleep(time.Second)
    }
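    // force a collection so the finalizers registered above get a chance to run before the program exits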
    runtime.GC()
    time.Sleep(time.Second)
    fmt.Println("done..")
    }

  • shell implementation (in C)
    /*************************************************************************
    > File Name: shell.c
    > Author:perrynzhou
    > Mail:perrynzhou@gmail.com
    > Created Time: Thu 20 Jun 2019 09:15:59 PM CST
    ************************************************************************/

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <glob.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    static const char *delimiter = " \t\n";
    typedef struct
    {
    glob_t gt;
    int (*cmd_cd_fn)(char **argv);
    int (*cmd_exit_fn)(char **argv);
    int (*cmd_help_fn)(char **argv);
    } cmd_t;
    static void prompt()
    {
    fprintf(stdout, "zsh-0.1$ ");
    }
    void parsed_cmd(char *line, cmd_t *cmd)
    {
    char *token;
    int flag = 0;
    while (1)
    {
    token = strsep(&line, delimiter);
    if (token == NULL)
    {
    break;
    }
    if (*token == '\0')
    {
    continue;
    }
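    // flag is 0 for the first token, so GLOB_APPEND*flag is 0 and glob() starts a fresh result list; later tokens are appended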
    glob(token, GLOB_NOCHECK | GLOB_APPEND * flag, NULL, &cmd->gt);
    flag = 1;
    }
    }
    int main(int argc, char *argv[])
    {
    char *line = NULL;
    size_t line_size = 0;
    cmd_t cmd;
    pid_t pid;
    while (1)
    {
    prompt();
    if (getline(&line, &line_size, stdin) < 0)
    {
    break;
    }
    parsed_cmd(line, &cmd);
    pid = fork();
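    // child: exec the expanded command; parent: wait for it to finish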
    switch (pid)
    {
    case -1:
    perror("fork()");
    exit(1);
    case 0:
    execvp(cmd.gt.gl_pathv[0], cmd.gt.gl_pathv);
    // execvp only returns on failure
    perror("execvp()");
    exit(1);
    default:
    wait(NULL);
    }
    }
    }

Why is there such a thing as byte order?

  • Byte order does not exist for a single-byte processor. Once a CPU works on values wider than one byte, its registers are wider than one byte as well, so the question arises of how those bytes are ordered when stored in memory.

Big-endian and little-endian byte order

  • Big-endian: the high-order byte is stored at the low address, the low-order byte at the high address.
  • Little-endian: the high-order byte is stored at the high address, the low-order byte at the low address.
     For example, take int x = 0x0001 on a 64-bit CPU and assume it is stored at address 0xff13. Reading memory upward from that address, there are two possible byte orders:
    0xff13: 0 0 0 1 ---> case 1: big endian
    0xff13: 1 0 0 0 ---> case 2: little endian
    In the value 0x0001, reading left to right goes from the high-order byte to the low-order byte: 00 is the high byte and 01 is the low byte.

    How can we verify the CPU's byte order?

  • C has the union construct: all members share the same storage starting at the lowest address, so by overlaying an int and a char we can observe which byte is stored first and thus determine the CPU's byte order.
    /*************************************************************************
    > File Name: byte_order.c
    > Author:perrynzhou
    > Mail:perrynzhou@gmail.com
    > Created Time: Fri 15 Nov 2019 03:42:37 PM CST
    ************************************************************************/

    #include <stdio.h>
    typedef union object_t {
    int a;
    char b;
    } object;
    enum
    {
    big_endian_type = 0,
    small_endian_type,
    };
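    // b overlays the byte of a stored at the lowest address, so its value reveals which end is stored first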
    int checkCpu(object *obj)
    {
    return obj->b == 1 ? small_endian_type : big_endian_type;
    }
    int main()
    {
    int v = 0x0001;
    object obj;
    obj.a = v;
    fprintf(stdout, "value:%d,value_string:%s,object address:%p\n", v, "0x0001", &obj);
    if (checkCpu(&obj) == big_endian_type)
    {
    fprintf(stdout, "store plan: |%d|%d|%d|%d|,big endian\n", 0, 0, 0, 1);
    }
    else
    {
    fprintf(stdout, "store plan: |%d|%d|%d|%d|,small endian\n", 1, 0, 0, 0);
    }
    return 0;
    }
    [perrynzhou@localhost ~/Source/vivo/linux_kernel_study/chapter_01]$ ./test 
    value:1,value_string:0x0001,object address:0x7ffcaf058a08
    store plan: |1|0|0|0|,small endian
    [perrynzhou@localhost ~/Source/vivo/linux_kernel_study/chapter_01]$
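
As a cross-check, the same layout can also be observed by walking the bytes of the integer with an unsigned char pointer. This is a small illustrative sketch added here, separate from byte_order.c above:

    #include <stdio.h>
    #include <stddef.h>

    // Print the bytes of an int from the lowest address to the highest.
    // On a little-endian x86_64 machine this prints 01 00 00 00 for 0x00000001.
    int main(void) {
        int v = 0x00000001;
        unsigned char *p = (unsigned char *)&v;
        for (size_t i = 0; i < sizeof(v); i++) {
            printf("byte %zu at %p = %02x\n", i, (void *)(p + i), p[i]);
        }
        return 0;
    }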