Running Log of Tuning a 4-Node Cluster

A 4-node cluster was performing below expectations, so I tested and tuned it.

The cluster is 4 networked nodes. Check the CPU information with lscpu:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             2
NUMA node(s):          2
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7K62 48-Core Processor
Stepping:              0
CPU MHz:               1500.000
CPU max MHz:           2600.0000
CPU min MHz:           1500.0000
BogoMIPS:              5200.00
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-47
NUMA node1 CPU(s):     48-95
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca
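
As a quick sanity check on the lscpu output, the socket, core, and thread counts should multiply out to the reported number of logical CPUs. The sketch below parses a few lines copied from the output above; the function names are made up for illustration.

```python
# Lines copied from the lscpu output above (trimmed to the relevant fields).
LSCPU_SAMPLE = """\
CPU(s):                96
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             2
NUMA node(s):          2
"""


def parse_lscpu(text):
    """Turn 'Key:   value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def logical_cpus(fields):
    """Sockets x cores per socket x threads per core."""
    return (int(fields["Socket(s)"])
            * int(fields["Core(s) per socket"])
            * int(fields["Thread(s) per core"]))


fields = parse_lscpu(LSCPU_SAMPLE)
print(logical_cpus(fields), "logical CPUs expected,",
      fields["CPU(s)"], "reported")
```

With one thread per core, every logical CPU here is a physical core, and each of the two NUMA nodes maps to one 48-core socket.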

Test the IP network and the IB (InfiniBand) network separately.

On the IP network, I tested ping and ssh. The average ping time to node03 is over 2 ms, which is on the high side.
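
To make the ping comparison repeatable, the average round-trip time can be pulled out of each node's ping summary line and checked against a threshold. This is a minimal sketch: it assumes the `rtt min/avg/max/mdev = ... ms` summary format printed by the common Linux ping, the 2 ms threshold is taken from the observation above, and the node names and numbers in the sample are made up for illustration, not measurements from this cluster.

```python
import re

# Summary line format of the common Linux ping, for example:
#   rtt min/avg/max/mdev = 0.1/0.2/0.3/0.05 ms
RTT_RE = re.compile(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms")


def avg_rtt_ms(ping_output):
    """Return the average RTT in ms, or None if no summary line is found."""
    match = RTT_RE.search(ping_output)
    return float(match.group(2)) if match else None


def slow_nodes(outputs, threshold_ms=2.0):
    """Names of nodes whose average RTT exceeds threshold_ms."""
    slow = []
    for node, output in outputs.items():
        avg = avg_rtt_ms(output)
        if avg is not None and avg > threshold_ms:
            slow.append(node)
    return sorted(slow)


# Made-up sample values, only to exercise the parser.
SAMPLE = {
    "nodeA": "rtt min/avg/max/mdev = 0.180/0.210/0.260/0.030 ms",
    "nodeB": "rtt min/avg/max/mdev = 1.900/2.400/3.100/0.400 ms",
}
print(slow_nodes(SAMPLE))
```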

Configure frp port forwarding.

ibutil is installed correctly, but ibping does not get through:

$ ibping node01
ibwarn: [56768] mad_rpc_open_port: can't open UMAD port ((null):0)
ibping: iberror: failed: Failed to open '(null)' port '0'
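
The "can't open UMAD port" message comes from the userspace MAD layer, which reaches the HCA through the `/dev/infiniband/umad*` device nodes. One possible cause, which is an assumption and not confirmed anywhere in this log, is that those nodes are missing or not accessible to the unprivileged user behind the `$` prompt; the later ibnodes and iblinkinfo runs use a root prompt. The sketch below only checks whether the device nodes exist and are readable and writable for the current user; the function name and return strings are made up for illustration.

```python
import glob
import os


def check_umad(pattern="/dev/infiniband/umad*"):
    """Classify access to the umad device nodes for the current user."""
    devices = sorted(glob.glob(pattern))
    if not devices:
        # No device nodes at all, e.g. the ib_umad module may not be loaded.
        return "missing"
    if all(os.access(dev, os.R_OK | os.W_OK) for dev in devices):
        return "ok"
    # Nodes exist but this user cannot open them read/write.
    return "no-permission"


print(check_umad())
```

Separately, ibping addresses its target by LID (or by port GUID with `-G`) and needs a responder started with `ibping -S` on the far side, so a hostname such as node01 may not work even once the port can be opened.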

ifconfig also reports a problem on the IB ports:

$ ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 222.199.132.59  netmask 255.255.255.0  broadcast 222.199.132.255
        inet6 2001:da8:20c:a133::efd4  prefixlen 128  scopeid 0x0<global>
        inet6 fe80::e092:20cf:1f15:65b0  prefixlen 64  scopeid 0x20<link>
        ether 3c:ec:ef:71:97:88  txqueuelen 1000  (Ethernet)
        RX packets 187011950  bytes 33023832426 (30.7 GiB)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 43565873  bytes 31451560266 (29.2 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 3c:ec:ef:71:97:89  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 1.0.0.1  netmask 255.255.0.0  broadcast 1.0.255.255
        inet6 fe80::ac0b:e92c:2446:5d13  prefixlen 64  scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 403252956  bytes 746504437315 (695.2 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 49504976  bytes 900446939075 (838.6 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ib1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 4092
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband A0:00:03:00:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 6045989368  bytes 334618451075 (311.6 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6045989368  bytes 334618451075 (311.6 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Take a look with the ip a command:

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:ec:ef:71:97:88 brd ff:ff:ff:ff:ff:ff
    inet 222.199.132.59/24 brd 222.199.132.255 scope global noprefixroute dynamic eno1
       valid_lft 6993sec preferred_lft 6993sec
    inet6 2001:da8:20c:a133::efd4/128 scope global noprefixroute dynamic
       valid_lft 6496sec preferred_lft 6196sec
    inet6 fe80::e092:20cf:1f15:65b0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:ec:ef:71:97:89 brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband a0:00:02:20:fe:80:00:00:00:00:00:00:00:02:c9:03:00:a0:93:41 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 1.0.0.1/16 brd 1.0.255.255 scope global noprefixroute ib0
       valid_lft forever preferred_lft forever
    inet6 fe80::ac0b:e92c:2446:5d13/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
5: ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc mq state DOWN group default qlen 256
    link/infiniband a0:00:03:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:a0:93:42 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
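
The link/infiniband field printed by ip a is the full 20-byte IPoIB link-layer address, whereas the ifconfig output earlier shows only the first 8 bytes followed by zeros. Per the IPoIB addressing layout (1 flags byte, a 3-byte queue pair number, then a 16-byte GID made of an 8-byte subnet prefix and the 8-byte port GUID), the address can be split as sketched below. The function name and dictionary keys are made up for illustration.

```python
def parse_ipoib_hwaddr(addr):
    """Split a 20-byte IPoIB hardware address into its parts."""
    octets = bytes(int(part, 16) for part in addr.split(":"))
    if len(octets) != 20:
        raise ValueError("expected 20 bytes, got %d" % len(octets))
    return {
        "flags": octets[0],                          # reserved/flags byte
        "qpn": int.from_bytes(octets[1:4], "big"),   # queue pair number
        "subnet_prefix": octets[4:12].hex(),         # upper half of the GID
        "port_guid": "0x" + octets[12:20].hex(),     # lower half of the GID
    }


# ib0 address copied from the ip a output above.
info = parse_ipoib_hwaddr(
    "a0:00:02:20:fe:80:00:00:00:00:00:00:00:02:c9:03:00:a0:93:41")
print(info["port_guid"])
```

The last 8 bytes of the ib0 address are the same value that ibstat later reports as the Port GUID of port 1, which is a handy way to match a netdev to an HCA port.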

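The error is gone. The likely explanation is that the IB hardware address (20 bytes) is too long for ifconfig to handle, so ip a should be used to view it instead. Next, check the two IB ports, ib0 and ib1, with ethtool: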

$ ./ethtool ib0
Settings for ib0:
        Supported ports: [ ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 56000Mb/s
        Duplex: Full
        Port: Other
        PHYAD: 255
        Transceiver: internal
        Auto-negotiation: on
Cannot get wake-on-lan settings: Operation not permitted
        Link detected: yes
$ ./ethtool ib1
Settings for ib1:
        Supported ports: [ ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: Unknown!
        Duplex: Unknown! (255)
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        MDI-X: Unknown
Cannot get wake-on-lan settings: Operation not permitted
        Link detected: no

Check ibstat as well:

$ ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.42.5000
        Hardware version: 1
        Node GUID: 0x0002c90300a09340
        System image GUID: 0x0002c90300a09343
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251486a
                Port GUID: 0x0002c90300a09341
                Link layer: InfiniBand
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02514868
                Port GUID: 0x0002c90300a09342
                Link layer: InfiniBand

Check ibnodes too:

# ibnodes
Ca      : 0xf4521403006cba20 ports 2 "node03 HCA-1"
Ca      : 0x0002c90300a4a920 ports 2 "node02 HCA-1"
Ca      : 0x0002c90300a09340 ports 2 "node01 HCA-1"
Ca      : 0xe41d2d0300231ac0 ports 2 "node04 HCA-1"
Switch  : 0x0002c90300721600 ports 36 "SwitchX -  Mellanox Technologies" base port 0 lid 2 lmc 0

iblinkinfo

# iblinkinfo
CA: node04 HCA-1:
      0xe41d2d0300231ac1      6    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       2    4[  ] "SwitchX -  Mellanox Technologies" ( )
CA: node02 HCA-1:
      0x0002c90300a4a921      5    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       2    2[  ] "SwitchX -  Mellanox Technologies" ( )
CA: node03 HCA-1:
      0xf4521403006cba21      3    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       2    3[  ] "SwitchX -  Mellanox Technologies" ( )
Switch: 0x0002c90300721600 SwitchX -  Mellanox Technologies:
           2    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       1    1[  ] "node01 HCA-1" ( )
           2    2[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       5    1[  ] "node02 HCA-1" ( )
           2    3[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       3    1[  ] "node03 HCA-1" ( )
           2    4[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       6    1[  ] "node04 HCA-1" ( )
           2    5[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2    6[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2    7[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2    8[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2    9[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   10[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   11[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   12[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   13[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   14[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   15[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   16[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   17[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   18[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   19[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   20[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   21[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   22[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   23[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   24[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   25[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   26[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   27[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   28[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   29[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   30[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   31[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   32[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   33[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   34[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   35[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           2   36[  ] ==(                Down/ Polling)==>             [  ] "" ( )
CA: node01 HCA-1:
      0x0002c90300a09341      1    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       2    1[  ] "SwitchX -  Mellanox Technologies" ( )
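
Each active link in the iblinkinfo output is reported as 4X at 14.0625 Gbps per lane, which is the FDR signaling rate and agrees with the 56000Mb/s from ethtool and the Rate: 56 from ibstat. A small sketch of the arithmetic follows, assuming FDR's 64b/66b line encoding for the usable data rate; the function name is made up for illustration.

```python
import re

# Matches the "( 4X   14.0625 Gbps Active/" part of an iblinkinfo line.
LINK_RE = re.compile(r"\(\s*(\d+)X\s+([\d.]+)\s+Gbps\s+Active/")


def link_rates(line):
    """Return (signaling_gbps, data_gbps) for an active link, else None."""
    match = LINK_RE.search(line)
    if not match:
        return None
    lanes = int(match.group(1))
    per_lane_gbps = float(match.group(2))
    signaling = lanes * per_lane_gbps
    data = signaling * 64 / 66  # assumes 64b/66b encoding (FDR)
    return signaling, data


# node01 line copied from the iblinkinfo output above.
NODE01_LINK = ('      0x0002c90300a09341      1    1[  ] ==( 4X       '
               '14.0625 Gbps Active/  LinkUp)==>       2    1[  ] '
               '"SwitchX -  Mellanox Technologies" ( )')
signaling, data = link_rates(NODE01_LINK)
print("signaling %.2f Gbps, data about %.2f Gbps" % (signaling, data))
```

All four node links negotiate the same width and speed, so the fabric itself does not show an obviously degraded link.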

Pinging the IPoIB addresses of the IB cards, 1.0.0.1 through 1.0.0.4, works for all of them.

The cause may be that opensm is running on every node and the instances conflict. The evidence is as follows:

[root@node01 ~]# opensm -v
-------------------------------------------------
OpenSM 5.7.2.MLNX20201014.9378048
Command Line Arguments:
 Verbose option -v (log flags = 0x7)
 Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 5.7.2.MLNX20201014.9378048

Using default GUID 0x2c90300a09341
Entering DISCOVERING state


Error from osm_opensm_bind (0x2A)
Perhaps another instance of OpenSM is already running
Exiting SM
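
One way to cross-check the opensm message is the ibstat output from node01 shown earlier: "SM lid" is the LID of the current master subnet manager, and on node01 it equals the port's own "Base lid" (both 1). That suggests the master SM is already running on node01 itself, which is consistent with a second opensm failing to bind there. It does not show by itself whether the other nodes also run opensm. The sketch below does the comparison on lines copied from that ibstat output; the function names are made up for illustration.

```python
import re

# Port 1 lines copied from the ibstat output on node01.
IBSTAT_PORT1 = """\
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 1
                LMC: 0
                SM lid: 1
"""


def lid_fields(text):
    """Return (base_lid, sm_lid) from one port's ibstat section."""
    base = re.search(r"Base lid:\s*(\d+)", text)
    sm = re.search(r"SM lid:\s*(\d+)", text)
    return int(base.group(1)), int(sm.group(1))


def sm_is_local(text):
    """True if the master SM's LID is this port's own LID."""
    base_lid, sm_lid = lid_fields(text)
    return base_lid != 0 and base_lid == sm_lid


print("master SM on this port:", sm_is_local(IBSTAT_PORT1))
```

Running the same comparison on the other three nodes would show SM lid 1 with a different Base lid, since iblinkinfo lists them at LIDs 5, 3, and 6.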

Author: 常恭