Our research group has new hardware, and this time we are trying something different: two servers, each fitted with NVIDIA BlueField-2 DPU SmartNICs supporting 200 Gb/s InfiniBand networking. Between them, the two servers carry four AMD EPYC 7773X processors, for more aggregate compute than a dual-socket 9684X. Will this two-node parallel setup outperform a single dual-socket 9684X machine? We shall see. Today's goals: 1) set up LVM on the drives; 2) install the SmartNIC drivers; 3) update the SmartNIC firmware; 4) benchmark the SmartNICs and the servers.
First, a glamour shot of the server.


This server comes from ASUS, like the one we tested earlier (https://www.cfdem.cn/amd9654es-benchmark/), and looks much the same.
After racking, it was time for configuration and testing. Once the basics were done (IP addresses, users, SSH, package mirrors, and so on), we installed drivers for the two Samsung PM1733 7.68 TB drives and combined them into an LVM logical volume. Both drives sit in server 1 and are shared with the other server over NFS. The details follow this ChatGPT conversation (https://chatgpt.com/share/1d25edce-44f7-406c-943f-91044324a64e), so we won't repeat them here.
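For reference, the LVM steps look roughly like this. This is a sketch with assumed device names (/dev/nvme0n1, /dev/nvme1n1), volume names, and mount point /data; adapt them to your system before running anything, as these commands are destructive.

```shell
# Register both PM1733 drives as LVM physical volumes
pvcreate /dev/nvme0n1 /dev/nvme1n1
# Pool them into one volume group
vgcreate vg_data /dev/nvme0n1 /dev/nvme1n1
# One logical volume striped across both drives (-i 2), using all free space
lvcreate -n lv_data -i 2 -l 100%FREE vg_data
mkfs.ext4 /dev/vg_data/lv_data
mkdir -p /data && mount /dev/vg_data/lv_data /data
# Share it with the second server over NFS: add a line like
#   /data 169.254.0.0/16(rw,async,no_subtree_check)
# to /etc/exports, then reload the export table
exportfs -ra
```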
Installing the NIC driver and updating the firmware was a bumpy ride, so we document it in detail here in case it helps others. This NIC is an engineering sample: NVIDIA does not publish firmware for this exact model, and a Bilibili uploader warns against flashing firmware from similar retail products, which can easily brick the card. The card also does not appear to support IB mode. It defaults to Ethernet mode, and with no configuration at all Ubuntu 20.04 recognizes it as a 200 Gbps Ethernet NIC, so it works out of the box. But we weren't satisfied and wanted to unlock IB mode. Running lspci | grep Mellanox lists the NICs' PCI addresses:
21:00.0 Ethernet controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller (rev 01)
21:00.1 DMA controller: Mellanox Technologies MT42822 BlueField-2 SoC Management Interface (rev 01)
61:00.0 Ethernet controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller (rev 01)
61:00.1 DMA controller: Mellanox Technologies MT42822 BlueField-2 SoC Management Interface (rev 01)
Querying the NIC's configuration parameters with mstconfig -d 21:00.0 q gives:
Device #1:
----------
Device type: BlueField2
Name: MBF2M345A-VENOT_ES_Ax
Description: NVIDIA BlueField-2 E-Series Eng. sample DPU; 200GbE single-port QSFP56; PCIe Gen4 x16; Secure Boot Disabled; Crypto Enabled; 16GB on-board DDR; 1GbE OOB management
Device: 21:00.0
Configurations: Next Boot
MEMIC_BAR_SIZE 0
MEMIC_SIZE_LIMIT _256KB(1)
HOST_CHAINING_MODE DISABLED(0)
HOST_CHAINING_CACHE_DISABLE False(0)
HOST_CHAINING_DESCRIPTORS Array[0..7]
HOST_CHAINING_TOTAL_BUFFER_SIZE Array[0..7]
INTERNAL_CPU_MODEL EMBEDDED_CPU(1)
FLEX_PARSER_PROFILE_ENABLE 0
PROG_PARSE_GRAPH False(0)
FLEX_IPV4_OVER_VXLAN_PORT 0
ROCE_NEXT_PROTOCOL 254
ESWITCH_HAIRPIN_DESCRIPTORS Array[0..7]
ESWITCH_HAIRPIN_TOT_BUFFER_SIZE Array[0..7]
PF_BAR2_SIZE 3
NON_PREFETCHABLE_PF_BAR False(0)
VF_VPD_ENABLE False(0)
PF_NUM_PF_MSIX_VALID False(0)
PER_PF_NUM_SF False(0)
STRICT_VF_MSIX_NUM False(0)
VF_NODNIC_ENABLE False(0)
NUM_PF_MSIX_VALID True(1)
NUM_OF_VFS 8
NUM_OF_PF 1
PF_BAR2_ENABLE True(1)
HIDE_PORT2_PF False(0)
SRIOV_EN True(1)
PF_LOG_BAR_SIZE 5
VF_LOG_BAR_SIZE 0
NUM_PF_MSIX 63
NUM_VF_MSIX 11
INT_LOG_MAX_PAYLOAD_SIZE AUTOMATIC(0)
PCIE_CREDIT_TOKEN_TIMEOUT 0
ACCURATE_TX_SCHEDULER False(0)
PARTIAL_RESET_EN False(0)
RESET_WITH_HOST_ON_ERRORS False(0)
NVME_EMULATION_ENABLE False(0)
NVME_EMULATION_NUM_VF 0
NVME_EMULATION_NUM_PF 1
NVME_EMULATION_VENDOR_ID 5555
NVME_EMULATION_DEVICE_ID 24577
NVME_EMULATION_CLASS_CODE 67586
NVME_EMULATION_REVISION_ID 0
NVME_EMULATION_SUBSYSTEM_VENDOR_ID 0
NVME_EMULATION_SUBSYSTEM_ID 0
NVME_EMULATION_NUM_MSIX 0
NVME_EMULATION_MAX_QUEUE_DEPTH 0
PCI_SWITCH_EMULATION_NUM_PORT 0
PCI_SWITCH_EMULATION_ENABLE False(0)
VIRTIO_NET_EMULATION_ENABLE False(0)
VIRTIO_NET_EMULATION_NUM_VF 0
VIRTIO_NET_EMULATION_NUM_PF 0
VIRTIO_NET_EMU_SUBSYSTEM_VENDOR_ID 6900
VIRTIO_NET_EMULATION_SUBSYSTEM_ID 1
VIRTIO_NET_EMULATION_NUM_MSIX 2
VIRTIO_BLK_EMULATION_ENABLE False(0)
VIRTIO_BLK_EMULATION_NUM_VF 0
VIRTIO_BLK_EMULATION_NUM_PF 0
VIRTIO_BLK_EMU_SUBSYSTEM_VENDOR_ID 6900
VIRTIO_BLK_EMULATION_SUBSYSTEM_ID 2
VIRTIO_BLK_EMULATION_NUM_MSIX 2
PCI_DOWNSTREAM_PORT_OWNER Array[0..15]
CQE_COMPRESSION BALANCED(0)
IP_OVER_VXLAN_EN False(0)
MKEY_BY_NAME False(0)
PRIO_TAG_REQUIRED_EN False(0)
UCTX_EN True(1)
REAL_TIME_CLOCK_ENABLE False(0)
RDMA_SELECTIVE_REPEAT_EN False(0)
PCI_ATOMIC_MODE PCI_ATOMIC_DISABLED_EXT_ATOMIC_ENABLED(0)
TUNNEL_ECN_COPY_DISABLE False(0)
LRO_LOG_TIMEOUT0 6
LRO_LOG_TIMEOUT1 7
LRO_LOG_TIMEOUT2 8
LRO_LOG_TIMEOUT3 13
LOG_TX_PSN_WINDOW 7
LOG_MAX_OUTSTANDING_WQE 7
TUNNEL_IP_PROTO_ENTROPY_DISABLE False(0)
ICM_CACHE_MODE DEVICE_DEFAULT(0)
TLS_OPTIMIZE False(0)
TX_SCHEDULER_BURST 0
ROCE_CC_LEGACY_DCQCN True(1)
LOG_DCR_HASH_TABLE_SIZE 11
DCR_LIFO_SIZE 16384
ROCE_CC_PRIO_MASK_P1 255
CLAMP_TGT_RATE_AFTER_TIME_INC_P1 True(1)
CLAMP_TGT_RATE_P1 False(0)
RPG_TIME_RESET_P1 300
RPG_BYTE_RESET_P1 32767
RPG_THRESHOLD_P1 1
RPG_MAX_RATE_P1 0
RPG_AI_RATE_P1 5
RPG_HAI_RATE_P1 50
RPG_GD_P1 11
RPG_MIN_DEC_FAC_P1 50
RPG_MIN_RATE_P1 1
RATE_TO_SET_ON_FIRST_CNP_P1 0
DCE_TCP_G_P1 1019
DCE_TCP_RTT_P1 1
RATE_REDUCE_MONITOR_PERIOD_P1 4
INITIAL_ALPHA_VALUE_P1 1023
MIN_TIME_BETWEEN_CNPS_P1 4
CNP_802P_PRIO_P1 6
CNP_DSCP_P1 48
LLDP_NB_DCBX_P1 False(0)
LLDP_NB_RX_MODE_P1 OFF(0)
LLDP_NB_TX_MODE_P1 OFF(0)
DCBX_IEEE_P1 True(1)
DCBX_CEE_P1 True(1)
DCBX_WILLING_P1 True(1)
KEEP_ETH_LINK_UP_P1 True(1)
KEEP_IB_LINK_UP_P1 False(0)
KEEP_LINK_UP_ON_BOOT_P1 False(0)
KEEP_LINK_UP_ON_STANDBY_P1 False(0)
DO_NOT_CLEAR_PORT_STATS_P1 False(0)
AUTO_POWER_SAVE_LINK_DOWN_P1 False(0)
NUM_OF_VL_P1 _4_VLs(3)
NUM_OF_TC_P1 _8_TCs(0)
NUM_OF_PFC_P1 8
VL15_BUFFER_SIZE_P1 0
DUP_MAC_ACTION_P1 LAST_CFG(0)
UNKNOWN_UPLINK_MAC_FLOOD_P1 False(0)
SRIOV_IB_ROUTING_MODE_P1 LID(1)
IB_ROUTING_MODE_P1 LID(1)
PF_TOTAL_SF 0
PF_SF_BAR_SIZE 0
PF_NUM_PF_MSIX 63
ROCE_CONTROL ROCE_ENABLE(2)
PCI_WR_ORDERING per_mkey(0)
MULTI_PORT_VHCA_EN False(0)
PORT_OWNER True(1)
ALLOW_RD_COUNTERS True(1)
RENEG_ON_CHANGE True(1)
TRACER_ENABLE True(1)
IP_VER IPv4(0)
BOOT_UNDI_NETWORK_WAIT 0
UEFI_HII_EN True(1)
BOOT_DBG_LOG False(0)
UEFI_LOGS DISABLED(0)
BOOT_VLAN 1
LEGACY_BOOT_PROTOCOL PXE(1)
BOOT_RETRY_CNT NONE(0)
BOOT_INTERRUPT_DIS False(0)
BOOT_LACP_DIS True(1)
BOOT_VLAN_EN False(0)
BOOT_PKEY 0
P2P_ORDERING_MODE DEVICE_DEFAULT(0)
EXP_ROM_VIRTIO_NET_PXE_ENABLE True(1)
EXP_ROM_VIRTIO_NET_UEFI_x86_ENABLE True(1)
EXP_ROM_VIRTIO_BLK_UEFI_x86_ENABLE True(1)
EXP_ROM_NVME_UEFI_x86_ENABLE True(1)
ATS_ENABLED False(0)
DYNAMIC_VF_MSIX_TABLE False(0)
EXP_ROM_UEFI_ARM_ENABLE True(1)
EXP_ROM_UEFI_x86_ENABLE True(1)
EXP_ROM_PXE_ENABLE True(1)
ADVANCED_PCI_SETTINGS False(0)
SAFE_MODE_THRESHOLD 10
SAFE_MODE_ENABLE True(1)
For background on the QSFP56 form factor, see https://community.fs.com/cn/article/introduction-to-qsfp56-form-factor.html.
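Notably, the configuration dump above contains no LINK_TYPE_P1 entry at all. On VPI-capable retail ConnectX/BlueField cards, that parameter is what switches the port between InfiniBand and Ethernet, roughly as follows; we show it only for reference, since its absence here is consistent with Ethernet-only ES firmware and the command is expected to fail on this card.

```shell
# On VPI-capable firmware: set port 1 to InfiniBand (IB(1); ETH(2) is Ethernet),
# then power-cycle the host for the new link type to take effect.
mstconfig -d 21:00.0 set LINK_TYPE_P1=1
```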
Running ibstat shows the following adapter information:
CA 'mlx5_0'
CA type: MT41686
Number of ports: 1
Firmware version: 24.31.0356
Hardware version: 1
Node GUID: 0xb8cef60300fd0d80
System image GUID: 0xb8cef60300fd0d80
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xbacef6fffefd0d80
Link layer: Ethernet
CA 'mlx5_1'
CA type: MT41686
Number of ports: 1
Firmware version: 24.31.0356
Hardware version: 1
Node GUID: 0xb8cef60300f661e6
System image GUID: 0xb8cef60300f661e6
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xbacef6fffef661e6
Link layer: Ethernet
NVIDIA's latest BlueField drivers ship as part of the DOCA development kit: download the deb package for your release and install it with dpkg and apt-get. DOCA has dependency conflicts with an existing OFED install, so remove the old drivers before installing DOCA:
for f in $( dpkg --list | grep doca | awk '{print $2}' ); do echo $f ; apt remove --purge $f -y ; done
/usr/sbin/ofed_uninstall.sh --force
After installation, systemctl status rshim confirms whether the driver is running properly:
● rshim.service - rshim driver for BlueField SoC
Loaded: loaded (/lib/systemd/system/rshim.service; disabled; vendor preset: enabled)
Active: active (running) since Wed 2024-07-17 15:46:33 CST; 1s ago
Docs: man:rshim(8)
Process: 86070 ExecStart=/usr/sbin/rshim $OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 86071 (rshim)
Tasks: 6 (limit: 154304)
Memory: 3.2M
CPU: 811ms
CGroup: /system.slice/rshim.service
└─86071 /usr/sbin/rshim
7月 17 15:46:33 ps systemd[1]: Started rshim driver for BlueField SoC.
7月 17 15:46:33 ps rshim[86071]: Probing pcie-0000:61:00.1(vfio)
7月 17 15:46:33 ps rshim[86071]: Create rshim pcie-0000:61:00.1
7月 17 15:46:33 ps rshim[86071]: Fall-back to uio
7月 17 15:46:33 ps rshim[86071]: rshim pcie-0000:61:00.1 enable
7月 17 15:46:34 ps rshim[86071]: rshim0 attached
7月 17 15:46:34 ps rshim[86071]: Probing pcie-0000:21:00.1(vfio)
7月 17 15:46:34 ps rshim[86071]: Create rshim pcie-0000:21:00.1
7月 17 15:46:34 ps rshim[86071]: Fall-back to uio
7月 17 15:46:34 ps rshim[86071]: rshim pcie-0000:21:00.1 enable
If it is not running properly, a reboot may fix it. Run the same steps on the other server to install doca-all. This package normally installs smoothly, but we misread the OS version and wasted time on it, so a warning: DOCA versions are named by release date, just like Ubuntu's, so take care to distinguish the DOCA version from the Ubuntu version before downloading. Installing the wrong one produces tangled dependency errors, and the mistaken version is hard to uninstall completely.
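A trivial check that would have saved us the detour: print the OS version explicitly and compare it against the DOCA download page before installing.

```shell
# Print the Ubuntu release string (e.g. "22.04") to match against the
# DOCA download page, and list any DOCA packages already installed.
. /etc/os-release
echo "$VERSION_ID"
dpkg -l | grep doca || true
```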
The default firmware on this ES card does not support InfiniBand. Similar retail models do, so reflashing might unlock it, but the risk is high; we'll start with simple tests and add a firmware-flashing section later. 200G Ethernet is already plenty fast, so we'll first run MPI in parallel over Ethernet. With a network this fast, the bottleneck is usually latency rather than bandwidth, which is exactly where InfiniBand has the edge over Ethernet.
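For the MPI runs over this link, a launch line along these lines should work. This is a sketch: the hostnames and slot counts are placeholders, and we assume an OpenMPI build with UCX support so that the mlx5 device is used for transport (RoCE where available, TCP otherwise).

```shell
# 64 ranks per node across the two servers, using the UCX point-to-point layer
mpirun -np 128 --host node1:64,node2:64 \
    --mca pml ucx --mca osc ucx \
    ./my_mpi_app
```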
With IP addresses configured, we measured throughput with iperf3. Start the server on one machine:
iperf3 -s
and the client on the other:
iperf3 -c 169.254.174.32 -P 16 -t 30
An excerpt of the results:
Connecting to host 169.254.174.32, port 5201
[ 5] local 169.254.16.66 port 59382 connected to 169.254.174.32 port 5201
[ 7] local 169.254.16.66 port 59394 connected to 169.254.174.32 port 5201
[ 9] local 169.254.16.66 port 59408 connected to 169.254.174.32 port 5201
[ 11] local 169.254.16.66 port 59422 connected to 169.254.174.32 port 5201
[ 13] local 169.254.16.66 port 59434 connected to 169.254.174.32 port 5201
[ 15] local 169.254.16.66 port 59436 connected to 169.254.174.32 port 5201
[ 17] local 169.254.16.66 port 59440 connected to 169.254.174.32 port 5201
[ 19] local 169.254.16.66 port 59454 connected to 169.254.174.32 port 5201
[ 21] local 169.254.16.66 port 59460 connected to 169.254.174.32 port 5201
[ 23] local 169.254.16.66 port 59464 connected to 169.254.174.32 port 5201
[ 25] local 169.254.16.66 port 59472 connected to 169.254.174.32 port 5201
[ 27] local 169.254.16.66 port 59480 connected to 169.254.174.32 port 5201
[ 29] local 169.254.16.66 port 59496 connected to 169.254.174.32 port 5201
[ 31] local 169.254.16.66 port 59504 connected to 169.254.174.32 port 5201
[ 33] local 169.254.16.66 port 59514 connected to 169.254.174.32 port 5201
[ 35] local 169.254.16.66 port 59526 connected to 169.254.174.32 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 157 MBytes 1.32 Gbits/sec 0 314 KBytes
[ 7] 0.00-1.00 sec 161 MBytes 1.35 Gbits/sec 0 482 KBytes
[ 9] 0.00-1.00 sec 160 MBytes 1.35 Gbits/sec 0 520 KBytes
[ 11] 0.00-1.00 sec 159 MBytes 1.34 Gbits/sec 0 454 KBytes
[ 13] 0.00-1.00 sec 159 MBytes 1.33 Gbits/sec 0 452 KBytes
[ 15] 0.00-1.00 sec 161 MBytes 1.35 Gbits/sec 0 650 KBytes
[ 17] 0.00-1.00 sec 160 MBytes 1.34 Gbits/sec 0 513 KBytes
[ 19] 0.00-1.00 sec 156 MBytes 1.31 Gbits/sec 0 303 KBytes
[ 21] 0.00-1.00 sec 158 MBytes 1.33 Gbits/sec 0 646 KBytes
[ 23] 0.00-1.00 sec 158 MBytes 1.33 Gbits/sec 0 400 KBytes
[ 25] 0.00-1.00 sec 161 MBytes 1.35 Gbits/sec 0 561 KBytes
[ 27] 0.00-1.00 sec 158 MBytes 1.32 Gbits/sec 0 410 KBytes
[ 29] 0.00-1.00 sec 158 MBytes 1.32 Gbits/sec 0 397 KBytes
[ 31] 0.00-1.00 sec 160 MBytes 1.34 Gbits/sec 0 467 KBytes
[ 33] 0.00-1.00 sec 157 MBytes 1.32 Gbits/sec 0 419 KBytes
[ 35] 0.00-1.00 sec 158 MBytes 1.32 Gbits/sec 0 444 KBytes
[SUM] 0.00-1.00 sec 2.48 GBytes 21.3 Gbits/sec 0
- - - - - - - - - - - - - - - - - - - - - - - - -
...
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 3.74 GBytes 1.07 Gbits/sec 0 sender
[ 5] 0.00-30.02 sec 3.74 GBytes 1.07 Gbits/sec receiver
[ 7] 0.00-30.00 sec 3.76 GBytes 1.08 Gbits/sec 0 sender
[ 7] 0.00-30.02 sec 3.75 GBytes 1.07 Gbits/sec receiver
[ 9] 0.00-30.00 sec 3.76 GBytes 1.08 Gbits/sec 0 sender
[ 9] 0.00-30.02 sec 3.75 GBytes 1.07 Gbits/sec receiver
[ 11] 0.00-30.00 sec 3.75 GBytes 1.07 Gbits/sec 0 sender
[ 11] 0.00-30.02 sec 3.74 GBytes 1.07 Gbits/sec receiver
[ 13] 0.00-30.00 sec 3.75 GBytes 1.07 Gbits/sec 0 sender
[ 13] 0.00-30.02 sec 3.74 GBytes 1.07 Gbits/sec receiver
[ 15] 0.00-30.00 sec 3.76 GBytes 1.08 Gbits/sec 0 sender
[ 15] 0.00-30.02 sec 3.76 GBytes 1.07 Gbits/sec receiver
[ 17] 0.00-30.00 sec 3.76 GBytes 1.08 Gbits/sec 0 sender
[ 17] 0.00-30.02 sec 3.75 GBytes 1.07 Gbits/sec receiver
[ 19] 0.00-30.00 sec 3.74 GBytes 1.07 Gbits/sec 0 sender
[ 19] 0.00-30.02 sec 3.73 GBytes 1.07 Gbits/sec receiver
[ 21] 0.00-30.00 sec 3.75 GBytes 1.07 Gbits/sec 0 sender
[ 21] 0.00-30.02 sec 3.74 GBytes 1.07 Gbits/sec receiver
[ 23] 0.00-30.00 sec 3.75 GBytes 1.07 Gbits/sec 0 sender
[ 23] 0.00-30.02 sec 3.74 GBytes 1.07 Gbits/sec receiver
[ 25] 0.00-30.00 sec 3.76 GBytes 1.08 Gbits/sec 0 sender
[ 25] 0.00-30.02 sec 3.75 GBytes 1.07 Gbits/sec receiver
[ 27] 0.00-30.00 sec 3.75 GBytes 1.07 Gbits/sec 0 sender
[ 27] 0.00-30.02 sec 3.74 GBytes 1.07 Gbits/sec receiver
[ 29] 0.00-30.00 sec 3.75 GBytes 1.07 Gbits/sec 0 sender
[ 29] 0.00-30.02 sec 3.74 GBytes 1.07 Gbits/sec receiver
[ 31] 0.00-30.00 sec 3.76 GBytes 1.08 Gbits/sec 0 sender
[ 31] 0.00-30.02 sec 3.75 GBytes 1.07 Gbits/sec receiver
[ 33] 0.00-30.00 sec 3.74 GBytes 1.07 Gbits/sec 0 sender
[ 33] 0.00-30.02 sec 3.74 GBytes 1.07 Gbits/sec receiver
[ 35] 0.00-30.00 sec 3.74 GBytes 1.07 Gbits/sec 0 sender
[ 35] 0.00-30.02 sec 3.74 GBytes 1.07 Gbits/sec receiver
[SUM] 0.00-30.00 sec 60.0 GBytes 17.2 Gbits/sec 0 sender
[SUM] 0.00-30.02 sec 59.9 GBytes 17.1 Gbits/sec receiver
iperf Done.
The 30-second average bitrate is 17.2 Gbps, far below the nominal 200 Gbps. Raising the number of parallel streams to 32 (-P 32; note this sets iperf3 streams, not CPU cores) only nudges it up to 19.9 Gbps. The bottleneck may lie in other hardware or in iperf's own methodology, so we set the iperf numbers aside for now and looked at RDMA application performance and OpenMPI speed instead.
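As a sanity check on those numbers, the per-stream figures from the final iperf3 report do add up to the reported sum, and show how little of the line rate plain TCP reaches here:

```python
# Back-of-envelope check using the figures from the iperf3 report above.
streams = 16
per_stream_gbps = 1.07            # per-stream 30 s average from the report
aggregate_gbps = streams * per_stream_gbps   # matches the ~17.2 Gbps [SUM] line
line_rate_gbps = 200.0
utilization = aggregate_gbps / line_rate_gbps
print(f"aggregate ~ {aggregate_gbps:.1f} Gbit/s ({utilization:.0%} of line rate)")
```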
Similarly, we tested RDMA write performance with ib_write_bw, with the following result:
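For reference, perftest's ib_write_bw runs as a server/client pair; the device name and address below are taken from the setup above.

```shell
# On server 1 (waits for a connection):
ib_write_bw -d mlx5_0
# On server 2, pointing at server 1's link-local address:
ib_write_bw -d mlx5_0 169.254.174.32
```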
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0026 PSN 0x48d983 RKey 0x184ded VAddr 0x007afee78fb000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:169:254:16:66
remote address: LID 0000 QPN 0x0026 PSN 0xedbefd RKey 0x201dbd VAddr 0x00781cdea40000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:169:254:174:32
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 1500.000000 != 3518.545000. CPU Frequency is not max.
65536 5000 17078.53 16826.73 0.269228
---------------------------------------------------------------------------------------
The peak bandwidth converts to a bitrate of 136.63 Gbps, on the same order as the nominal 200 Gbps, which is much more reasonable.
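For completeness, the unit conversion from perftest's MB/sec (decimal megabytes) to Gbit/s is a factor of 8/1000:

```python
# Convert the ib_write_bw figures above from MB/sec to Gbit/s.
peak_mb_s = 17078.53
avg_mb_s = 16826.73
peak_gbps = peak_mb_s * 8 / 1000   # ~136.63 Gbit/s
avg_gbps = avg_mb_s * 8 / 1000     # ~134.61 Gbit/s
print(f"peak {peak_gbps:.2f} Gbit/s, average {avg_gbps:.2f} Gbit/s")
```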
The BlueField-2 has a lot of room to play with: it offers serious communication capability for both CPU-parallel and GPU-parallel workloads. Today was just basic setup and performance testing; more detailed explorations to come.