Dear Mellanox Gurus,
Would you please advise on the pros/cons of Omni-Path vs. Mellanox Ethernet vs. InfiniBand?
Thanks in advance + Happy Monday
Henry
Hi Jae-Hoon,
Thank you for your reply. I contacted support and got the switches upgraded to the latest firmware. For others looking for the latest (and, since the product is EOL, probably final) firmware version, it is EFM_PPC_M405EX EFM_1.1.3004.
Hello, everyone,
I want to understand the LED status on the Mellanox SN2100. Today I used 100G cables in the switch, connecting port 1 to port 3; the link comes up, but the LED flaps (blinks) about once per second. When I use a 100G SR4 optical transceiver instead, it blinks about once every 30 seconds. This is the first time I have used the SN2100, so is this normal? What should the LED status be? Thanks for your help in advance.
Good afternoon Colleagues!
I ask for your help connecting Cisco switches to Mellanox NEO.
After provisioning starts, it waits and then fails with a timeout error. Here is what's in the logs:
2017-11-15 20:54:06.273 job INFO performAction created a new job (20) for Provisioning
2017-11-15 20:54:06.300 job INFO performAction created sub-job (20.1) for device: 10.10.0.4
2017-11-15 20:54:06.301 job INFO Preparing job notification for job (20 - Provisioning), status:(New), progress: (0)
2017-11-15 20:54:06.301 job INFO job: (20.1), status: (New), progress: (0), device: (10.10.0.4)
2017-11-15 20:54:06.301 zmq INFO Send Message Topic:notification, category:notifications/jobs
2017-11-15 20:54:06.355 netservice INFO Performing action run_cli on devices
2017-11-15 20:54:06.355 netservice INFO commandline: [u'show running-config']
2017-11-15 20:54:06.355 netservice INFO arguments: {"globals": {}, "devices": {}}
2017-11-15 20:54:07.103 cli-facility INFO running : /opt/neo/providers/common/bin/providers/common/tools/clifacility/cli_facility.pyo --hosts /tmp/tmpMADHzY/devices.csv --listen-port 60374 --file /tmp/tmpMADHzY/commands.txt --pool-size 30 --operation-timeout 120
2017-11-15 20:54:07.115 cli-facility INFO Starting thread pool: size=30
2017-11-15 20:54:07.119 cli-facility INFO Handling Host: 10.10.0.4
2017-11-15 20:56:07.121 cli-facility WARNING Killing thread for context: 10.10.0.4
2017-11-15 20:56:07.121 cli-facility WARNING Timeout operation for Host(s): 10.10.0.4
2017-11-15 20:56:07.121 cli-facility INFO Result is ready...
2017-11-15 20:56:07.132 cli-facility INFO Sending result to client...
2017-11-15 20:56:07.132 job WARNING job 20.1 failed: Timeout while communicating with: 10.10.0.4
2017-11-15 20:56:07.133 job INFO updateJobStatus updated job (20.1) with status 32772
2017-11-15 20:56:07.133 job INFO Preparing job notification for job (20 - Provisioning), status:(Completed With Errors), progress: (100)
2017-11-15 20:56:07.133 job INFO job: (20.1), status: (Completed With Errors - Timeout while communicating with: 10.10.0.4), progress: (100), device: (10.10.0.4)
2017-11-15 20:56:07.135 cli-facility INFO Stopping thread pool
2017-11-15 20:56:07.134 zmq INFO Send Message Topic:notification, category:notifications/jobs
Thank you in advance!
Hi Ben,
When you say flap, do you mean the LED blinking?
Blinking once per second is probably STP BPDUs.
Blinking once every 30 seconds might be LLDP.
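A quick way to confirm either theory from the switch side would be something along these lines (a sketch; I am assuming the MLNX-OS CLI here, and the exact syntax may differ between releases, so treat the commands as an assumption):
# check whether spanning tree is running on those ports
show spanning-tree
# check whether a peer is being seen via LLDP on the port in question
show lldp interfaces ethernet 1/1 remote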
Hi,
I am noticing that when my ConnectX-4 25GbE card is connected to the 10GbE ports of an Arista 7150 switch, the link is not detected.
Based on the documents I have read, my understanding is that auto-negotiation should successfully bring the link up at 10Gbps.
I also tried disabling auto-negotiation and setting the port speed to 10Gbps, but it still does not work.
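For anyone reproducing this, forcing the speed on the host side was roughly the following (a sketch; eth1 is the interface name from my output below, and the exact values are an example rather than a verified recipe):
# disable auto-negotiation and force 10G full duplex on the ConnectX-4 port
ethtool -s eth1 autoneg off speed 10000 duplex full
# then re-check the link state
ethtool eth1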
The details:
Arista switch:
localhost>show version
Software image version: 4.12.7.1
Architecture: i386
Host:
linux-6cof:~ # cat /etc/SuSE-release
SUSE Linux Enterprise Server 12 (x86_64)
VERSION = 12
PATCHLEVEL = 3
# This file is deprecated and will be removed in a future service pack or release.
# Please check /etc/os-release for details about this release.
linux-6cof:~ # uname -r
4.4.73-5-default
linux-6cof:~ # ethtool eth1
Settings for eth1:
Supported ports: [ FIBRE Backplane ]
Supported link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Advertised link modes: 10000baseKR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: No
Speed: Unknown!
Duplex: Unknown! (255)
Port: FIBRE
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: d
Wake-on: d
Link detected: no
linux-6cof:~ # modinfo mlx5_core
filename: /lib/modules/4.4.73-5-default/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
version: 3.0-1
license: Dual BSD/GPL
description: Mellanox Connect-IB, ConnectX-4 core driver
author: Eli Cohen <eli@mellanox.com>
Hi Eddie,
Yes, by flap I mean the LED blinking, and it is very regular. I will ping between the two switches and watch for the STP BPDUs; that may help us analyse the LED status. I will let you know the result when I finish. Thanks so much.
Hi,
I'm trying to get a better understanding of how to achieve near line-rate 40Gbps on the following adapter card:
[root@compute8 scripts]# lspci -vv -s 07:00.0 | grep "Part number" -A 3
[PN] Part number: MCX354A-FCBT
[EC] Engineering changes: A4
[SN] Serial number: MT1334U01416
[V0] Vendor specific: PCIe Gen3 x8.
Some system info
[root@compute8 scripts]# cat /etc/centos-release
CentOS Linux release 7.3.1611 (Core)
[root@compute8 scripts]# ofed_info -s
MLNX_OFED_LINUX-4.1-1.0.2.0:
[root@compute8 scripts]# uname -r
3.10.0-514.26.2.el7.x86_64
2 identical HP ProLiant DL360p Gen8 servers, each equipped with two quad-core Intel(R) Xeon(R) E5-2609 0 @ 2.40GHz CPUs and 32GB RAM. The performance profile on both servers is network-throughput (set via tuned).
The ConnectX-3 cards are connected back to back (no switch) with a 1m Mellanox FDR copper cable. They have been put into Ethernet mode, and I've followed the recommended optimization guide, Performance Tuning for Mellanox Adapters.
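(For completeness, switching the VPI ports to Ethernet mode can be done with mstconfig along these lines; this is a sketch using the PCI address from the lspci output above, where LINK_TYPE 2 means Ethernet, and a reboot or driver restart is needed afterwards:)
# force both ports of the ConnectX-3 into Ethernet mode
mstconfig -d 07:00.0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2
# confirm the stored configuration
mstconfig -d 07:00.0 q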
The problem is achieving anything near the line speed of 40Gbps. I've tested with iperf2, since the "iperf, iperf2, iperf3" article recommends it (and advises against using iperf3):
Server side:
[root@compute7 ~]# iperf -v
..
..
Client side:
[root@compute8 scripts]# iperf -c 192.168.100.1 -P2
------------------------------------------------------------
Client connecting to 192.168.100.1, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.100.2 port 54430 connected with 192.168.100.1 port 5001
[ 4] local 192.168.100.2 port 54432 connected with 192.168.100.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 17.1 GBytes 14.7 Gbits/sec
[ 4] 0.0-10.0 sec 17.1 GBytes 14.7 Gbits/sec
[SUM] 0.0-10.0 sec 34.1 GBytes 29.3 Gbits/sec
[root@compute8 scripts]# iperf -c 192.168.100.1 -P3
------------------------------------------------------------
Client connecting to 192.168.100.1, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 5] local 192.168.100.2 port 54438 connected with 192.168.100.1 port 5001
[ 4] local 192.168.100.2 port 54434 connected with 192.168.100.1 port 5001
[ 3] local 192.168.100.2 port 54436 connected with 192.168.100.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.0 sec 14.4 GBytes 12.4 Gbits/sec
[ 4] 0.0-10.0 sec 15.2 GBytes 13.1 Gbits/sec
[ 3] 0.0-10.0 sec 15.3 GBytes 13.1 Gbits/sec
[SUM] 0.0-10.0 sec 44.9 GBytes 38.6 Gbits/sec
[root@compute8 scripts]# iperf -c 192.168.100.1 -P4
------------------------------------------------------------
Client connecting to 192.168.100.1, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 6] local 192.168.100.2 port 54446 connected with 192.168.100.1 port 5001
[ 4] local 192.168.100.2 port 54440 connected with 192.168.100.1 port 5001
[ 5] local 192.168.100.2 port 54444 connected with 192.168.100.1 port 5001
[ 3] local 192.168.100.2 port 54442 connected with 192.168.100.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 6] 0.0-10.0 sec 11.0 GBytes 9.47 Gbits/sec
[ 4] 0.0-10.0 sec 12.4 GBytes 10.6 Gbits/sec
[ 5] 0.0-10.0 sec 13.0 GBytes 11.2 Gbits/sec
[ 3] 0.0-10.0 sec 8.09 GBytes 6.95 Gbits/sec
[SUM] 0.0-10.0 sec 44.5 GBytes 38.2 Gbits/sec
So it seems a minimum of 3 threads is needed to get close to line speed; increasing the thread count above 3 doesn't improve anything. While running the above tests, I observed a lot of activity in /proc/interrupts for ens2 (port 1 of the ConnectX-3), which means interrupts are being generated to request CPU time. This should not happen when RDMA is in use, and I've confirmed RDMA is working using some of the tools (ib_send_bw, rping, udaddy, rdma_server, etc.) mentioned in HowTo Enable, Verify and Troubleshoot RDMA.
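(For reference, watching the interrupt activity during a run is as simple as something like the following; ens2 is port 1 of the ConnectX-3 as noted above:)
# refresh the per-CPU interrupt counters for the ConnectX-3 port every second
watch -n 1 'grep ens2 /proc/interrupts'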
Why do these Mellanox utilities perform as intended? Is the answer their built-in RDMA support?
Further, using perf gives me some details:
[root@compute8 scripts]# perf stat -e cpu-migrations,context-switches,task-clock,cycles,instructions,cache-references,cache-misses iperf -c 192.168.100.1 -P4
------------------------------------------------------------
Client connecting to 192.168.100.1, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 6] local 192.168.100.2 port 54470 connected with 192.168.100.1 port 5001
[ 4] local 192.168.100.2 port 54464 connected with 192.168.100.1 port 5001
[ 3] local 192.168.100.2 port 54466 connected with 192.168.100.1 port 5001
[ 5] local 192.168.100.2 port 54468 connected with 192.168.100.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 6] 0.0-10.0 sec 10.6 GBytes 9.08 Gbits/sec
[ 4] 0.0-10.0 sec 11.5 GBytes 9.85 Gbits/sec
[ 3] 0.0-10.0 sec 12.4 GBytes 10.7 Gbits/sec
[ 5] 0.0-10.0 sec 10.1 GBytes 8.69 Gbits/sec
[SUM] 0.0-10.0 sec 44.6 GBytes 38.3 Gbits/sec
Performance counter stats for 'iperf -c 192.168.100.1 -P4':
126 cpu-migrations # 0.005 K/sec
11,934 context-switches # 0.446 K/sec
26730.400620 task-clock (msec) # 2.666 CPUs utilized
63,926,425,845 cycles # 2.392 GHz
25,417,772,891 instructions # 0.40 insn per cycle
1,786,983,037 cache-references # 66.852 M/sec
446,840,327 cache-misses # 25.005 % of all cache refs
10.025755759 seconds time elapsed
For instance, I observe a high number of CPU context switches, which are very costly.
But after some research I discovered http://ftp100.cewit.stonybrook.edu/rperf. Using the rperf server and client, I was able to achieve near line speed without any further effort or additional threads:
Server side:
root@compute7 ~]# rperf -s -p 5001 -l 500M -H
...
Client side:
[root@compute8 scripts]# perf stat -e cpu-migrations,context-switches,task-clock,cycles,instructions,cache-references,cache-misses rperf -c $IP -p 5001 -H -G pw -l 500M -i 2
------------------------------------------------------------
RDMA Client connecting to 192.168.100.1, TCP port 5001
TCP window size: -1.00 Byte (default)
------------------------------------------------------------
[ 4] local 192.168.100.2 port 40580 connected with 192.168.100.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0- 2.0 sec 8.79 GBytes 37.7 Gbits/sec
[ 4] 2.0- 4.0 sec 9.28 GBytes 39.8 Gbits/sec
[ 4] 4.0- 6.0 sec 8.79 GBytes 37.7 Gbits/sec
[ 4] 6.0- 8.0 sec 9.28 GBytes 39.8 Gbits/sec
[ 4] 8.0-10.0 sec 9.28 GBytes 39.8 Gbits/sec
[ 4] 0.0-10.1 sec 45.9 GBytes 39.1 Gbits/sec
Performance counter stats for 'rperf -c 192.168.100.1 -p 5001 -H -G pw -l 500M -i 2':
13 cpu-migrations # 0.007 K/sec
1,348 context-switches # 0.734 K/sec
1836.487997 task-clock (msec) # 0.152 CPUs utilized
4,393,238,230 cycles # 2.392 GHz
9,201,275,892 instructions # 2.09 insn per cycle
26,855,320 cache-references # 14.623 M/sec
23,419,862 cache-misses # 87.208 % of all cache refs
12.084867922 seconds time elapsed
Note that the number of CPU context switches is very low compared to the iperf run, as is the CPU utilization (task-clock in msec). Monitoring /proc/loadavg also showed low CPU utilization.
Another important observation I made is that only a few interrupts are generated, as seen in /proc/interrupts, and, more importantly, the transmitted and received packet counters in /proc/net/dev on the client side do not change while rperf is running. This clearly indicates that RDMA is being used here to move data from the application directly into the server, bypassing the kernel.
Also, I've found that tuning the MTU to 9000 is vital; with default settings even rperf performs pretty badly!
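Setting the MTU was just the usual ip command on both hosts (a sketch; the interface name is mine, and you still need to make it persistent in your distro's network configuration):
# enable jumbo frames on both ends of the back-to-back link
ip link set dev ens2 mtu 9000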
According to Vangelis' post Cannot get 40Gbps on Ethernet mode with ConnectX-3 VPI, he was able to achieve near line speed using iperf2 without needing multiple threads. Why am I no longer able to? Has RDMA support been removed from iperf2 (and was it there at some point earlier)?
What matters in the end is not the benchmarks but the actual applications the systems will run on Linux with the setup described above. How, and will, they be able to take advantage of RDMA and achieve anything close to line speed? Does each running application have to explicitly support RDMA in order to reach these high speeds?
Hello,
I realize I may be better off posting on the HP forum, but frankly I have always found this forum the most knowledgeable when it comes to anything InfiniBand.
I have a number of HPE BL460c blades in which I have installed HP FDR IB 545M adapters (702213-B21). I have installed the HPE IB drivers v5.35. Strangely, I could not find an installation guide or a manual for this card.
The adapters show up as Unknown Devices in Windows; I tried Windows Server 2012 R2 and 2016 with the same results.
I have updated the firmware on the adapters to the current 10.16.1058.
I have been working on this for a couple days and cannot get the cards to work.
I thought someone here might be able to shed some insight on what I might try.
Thanks,
Todd
With Windows Server 2016, documentation is available on SET with Mellanox cards in 40GbE Ethernet mode, with multiple VLANs and RDMA support. Is the same possible with a Mellanox card in IB mode running IPoIB with multiple PKeys, if the infrastructure is IB-based and an IB-to-IP gateway is available? part_man.exe creates a new NIC for each PKey specified. Do you create multiple SETs, one per PKey, based on those "virtual" NICs created by part_man.exe? Will the SET still have RDMA capability?
Hello,
I recently got a ConnectX-4, and I found that I cannot set the RX/TX channels using ethtool as I could with the ConnectX-3 Pro. If I check, I see:
# ethtool -l ens5f0
Channel parameters for ens5f0:
Pre-set maximums:
RX: 0
TX: 0
Other: 512
Combined: 36
# ethtool -L ens5f0 tx 3
Cannot set device channel parameters: Invalid argument
which seems a bit weird. I had a look at the documentation and found nothing in this respect. Is this expected? I am using 4.1.1.0.2, and:
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Firmware version: 12.20.1030
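One more data point, in case it is relevant: my guess is that mlx5 only exposes channels as "combined" rather than as separate RX/TX queues, so something of the following form may be the intended way to change the queue count (an assumption on my part, not something I have verified against the documentation):
# adjust the number of combined channels instead of rx/tx separately
ethtool -L ens5f0 combined 16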
Thanks a lot for any help!
L3D
Sorry if it's a noob question: can I connect an Intel Omni-Path host interface card (PCIe) to a Mellanox SX6012 switch?
I think I configured everything on the workstation, but when I plug the cable into the Mellanox switch, I never see the lights come on.
The status of the OPA interface card shows it is either "polling" or "disconnected".
Thanks!
You can change to ETH + ETH mode via firmware tools from Mellanox...:)
BR,
Jae-Hoon Choi
IHAC with 2 questions:
1. The SX36xx series management modules include different Mellanox OS versions. Can the higher OS version be copied to the module running the lower version?
2. I need the release notes for Mellanox OS 3.5.1006-000, plus the supporting documents for CRA; please send me a link for CRA.
Thanks,
Regards,
tetsu
Hello Mellanox people
We have built an OpenStack implementation on a Cisco ACI network: Cisco 9Ks in a spine/leaf topology. All the cloud nodes are configured with dual-port ConnectX-3 NICs at 40Gb. The ports are LACP-bonded to separate leaf switches and cabled with twinax. We plan on growing this particular cloud's compute base by >400%, meaning we will add another 100+ compute nodes, more 9Ks, and more Mellanox NICs. We are concerned about this because we have so many problems with what we have right now.
We are having huge problems, to the point where many applications have stopped; they either crash or time out waiting on I/O. Storage for our cloud is over the network, in the form of a Ceph cluster, so network latency is important.
Host configuration:
NIC: Mellanox Ethernet Controller MT27500 - ConnectX-3 Dual Port 40Gbe QSFP+ - Device 0079
Red Hat 7.2 kernel 3.10.0-514 (which is in the 7.3 tree)
Mellanox driver: Stock driver 2.2-1 from red hat kernel package
/etc/modprobe.d/mlx4.conf left as is and tried:
options mlx4_en pfctx=3 pfcrx=3
Symptoms:
Network latency, and possibly even packet loss. I can't prove this yet, but I believe packets are disappearing. This causes outages and outright failures of client applications and services.
Ceph (our storage cluster) has huge problems: random 5-10 second outages. I believe it's packet loss. Red Hat says our Ceph problems are caused by network problems and won't support it until that is fixed, so they agree with me!
Cloud hosts drop millions of packets; the packet drop rate is directly proportional to data rates.
Cloud hosts send a lot of pause frames, even at low data rates (50-100/s). Is this normal?
Cloud hosts receive no pause frames.
Our network support people say they are seeing a ton of pause frames and a large number of buffer drops on the switch uplinks.
Question:
Does anyone have any experience with Mellanox cards in a Cisco ACI environment?
I tried to enable priority-based flow control via the driver configuration, but I couldn't tell whether it was enabled or not. The cards send out pause frames, but I can't tell what type. Below are the settings:
cat /sys/module/mlx4_en/parameters/pfctx
0
cat /sys/module/mlx4_en/parameters/pfcrx
0
My theory is that we have a flow control problem and both the switch and the cards are confused. My only concern is that even at low data rates the cards are sending out 50-100 pause frames a second, and hundreds per second at higher rates. Is this normal?
When under load, say 5-7 Gb/s bursts of traffic, we get 100-200 dropped packets a second. One system can have 100 million dropped packets on the bond and/or the physical NICs.
If it's not a flow control problem, what could be another root cause, so to speak? Sorry, this is a big question with a lot of factors.
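A way to watch the pause and drop counters on the hosts is along these lines (a sketch; the interface names are placeholders, and the exact counter names exposed by mlx4_en are an assumption, so grep broadly):
# per-NIC pause-frame and drop counters from the driver statistics
ethtool -S ens2f0 | grep -Ei 'pause|drop'
# bond-level and interface-level drops as seen by the kernel
ip -s link show bond0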
Any help is a huge help. We will be growing this environment, but we don't want to commit to these specific switches/NICs until we can get this working.
Cheers
Rocke
Hello
Release 2.2-1 of the Red Hat driver shows support for 802.1Qbb priority-based flow control. Is the default LLFC or port-based flow control? Which flow control is enabled by default? I have tried making changes to /etc/modprobe.d/mlx4.conf to set PFC.
My entry:
options mlx4_en pfctx=3 pfcrx=3
Will this work?
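In case it matters, the way I intend to apply and verify the setting is roughly this (a sketch; it assumes the interface is not in use while the module is reloaded, otherwise a reboot is needed):
# reload mlx4_en so the new module options take effect
modprobe -r mlx4_en && modprobe mlx4_en
# confirm the values the driver actually picked up
cat /sys/module/mlx4_en/parameters/pfctx /sys/module/mlx4_en/parameters/pfcrx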
Is there any progress with SR-IOV? We already have ESXi 6.5 U1 and it still does not work :/. The native driver still has no support for max_vfs (esxcli system module parameters list -m nmlx4_core); tested on ConnectX-3 EN:
enable_64b_cqe_eqe | int | Enable 64 byte CQEs/EQEs when the the FW supports this |
Values : 1 - enabled, 0 - disabled
Default: 0
enable_dmfs | int | Enable Device Managed Flow Steering |
Values : 1 - enabled, 0 - disabled
Default: 1
enable_qos | int | Enable Quality of Service support in the HCA |
Values : 1 - enabled, 0 - disabled
Default: 0
enable_rocev2 | int | Enable RoCEv2 mode for all devices |
Values : 1 - enabled, 0 - disabled
Default: 0
enable_vxlan_offloads int | Enable VXLAN offloads when supported by NIC |
Values : 1 - enabled, 0 - disabled
Default: 1
log_mtts_per_seg | int | Log2 number of MTT entries per segment |
Values : 1-7
Default: 3
log_num_mgm_entry_size int | Log2 MGM entry size, that defines the number of QPs per MCG, for example: value 10 results in 248 QP per MGM entry |
Values : 9-12
Default: 12
msi_x | int | Enable MSI-X |
Values : 1 - enabled, 0 - disabled
Default: 1
mst_recovery | int | Enable recovery mode(only NMST module is loaded) |
Values : 1 - enabled, 0 - disabled
Default: 0
rocev2_udp_port | int | Destination port for RoCEv2 |
Values : 1-65535 for RoCEv2
Default: 4791
I have a Ubuntu 16 server with a single-port ConnectX-3 HCA. I have enabled SR-IOV with 8 VFs on the HCA and configured the kernel with 'intel_iommu=on'. /etc/modprobe.d/mlx4.conf is configured to load and probe all eight VFs. The VFs are listed by lspci, but the system does not probe them and create the virtual interfaces; dmesg indicates "Skipping virtual function" for all the VFs. All the instructions for configuring VFs that I have read indicate my configuration should probe the VFs. Can anyone point me towards what else needs to be configured to enable them?
Below are the details of the system and HCA configuration:
# uname -a
Linux cfmm-h2 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
# lspci | grep Mellanox
05:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
05:00.1 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.2 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.3 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.4 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.5 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.6 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:00.7 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
05:01.0 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
# cat /etc/modprobe.d/mlx4.conf
options mlx4_core num_vfs=8 probe_vf=8 port_type_array=1
# mlx4_core gets automatically loaded, load mlx4_en also (LP: #1115710)
softdep mlx4_core post: mlx4_en
# mstflint -d 05:00.0 q
Image type: FS2
FW Version: 2.36.5000
Product Version: 02.36.50.00
Rom Info: type=PXE version=3.4.718 devid=4099
Device ID: 4099
Description: Node Port1 Port2 Sys image
GUIDs: 248a070300ba8e20 248a070300ba8e21 248a070300ba8e22 248a070300ba8e23
MACs: 248a07ba8e21 248a07ba8e22
VSD:
PSID: DEL1100001019
# mstconfig -d 05:00.0 q
Device #1:
----------
Device type: ConnectX3
PCI device: 05:00.0
Configurations: Current
SRIOV_EN 1
NUM_OF_VFS 8
LINK_TYPE_P1 3
LINK_TYPE_P2 3
LOG_BAR_SIZE 3
BOOT_PKEY_P1 0
BOOT_PKEY_P2 0
BOOT_OPTION_ROM_EN_P1 0
BOOT_VLAN_EN_P1 0
BOOT_RETRY_CNT_P1 0
LEGACY_BOOT_PROTOCOL_P1 0
BOOT_VLAN_P1 1
BOOT_OPTION_ROM_EN_P2 0
BOOT_VLAN_EN_P2 0
BOOT_RETRY_CNT_P2 0
LEGACY_BOOT_PROTOCOL_P2 0
BOOT_VLAN_P2 1
# lspci -vv -s 05:00.0
05:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Subsystem: Mellanox Technologies MT27500 Family [ConnectX-3]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 83
Region 0: Memory at 92400000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at 38000800000 (64-bit, prefetchable) [size=8M]
Expansion ROM at <ignored> [disabled]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Product Name: CX353A - ConnectX-3 QSFP
Read-only fields:
[PN] Part number: 079DJ3
[EC] Engineering changes: A03
[SN] Serial number: IL079DJ37403172G0033
[V0] Vendor specific: PCIe Gen3 x8
[RV] Reserved: checksum good, 0 byte(s) reserved
Read/write fields:
[V1] Vendor specific: N/A
[YA] Asset tag: N/A
[RW] Read-write area: 104 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 252 byte(s) free
End
Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [148 v1] Device Serial Number 24-8a-07-03-00-ba-8e-20
Capabilities: [154 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
Capabilities: [18c v1] #19
Capabilities: [108 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy+
IOVSta: Migration-
Initial VFs: 8, Total VFs: 8, Number of VFs: 8, Function Dependency Link: 00
VF offset: 1, stride: 1, Device ID: 1004
Supported Page Size: 000007ff, System Page Size: 00000001
Region 2: Memory at 0000038001000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Kernel driver in use: mlx4_core
Kernel modules: mlx4_core
# dmesg (edited for mlx4 related lines)
[ 0.000000] Command line: BOOT_IMAGE=/ROOT/ubuntu@/boot/vmlinuz-4.4.0-98-generic root=ZFS=rpool/ROOT/ubuntu ro swapaccount=1 intel_iommu=on
[ 3.540033] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[ 3.546709] mlx4_core: Initializing 0000:05:00.0
[ 9.511353] mlx4_core 0000:05:00.0: Enabling SR-IOV with 8 VFs
[ 9.620597] pci 0000:05:00.1: [15b3:1004] type 00 class 0x028000
[ 9.627586] pci 0000:05:00.1: Max Payload Size set to 256 (was 128, max 512)
[ 9.637048] iommu: Adding device 0000:05:00.1 to group 49
[ 9.643872] mlx4_core: Initializing 0000:05:00.1
[ 9.650716] mlx4_core 0000:05:00.1: enabling device (0000 -> 0002)
[ 9.658389] mlx4_core 0000:05:00.1: Skipping virtual function:1
[ 9.665756] pci 0000:05:00.2: [15b3:1004] type 00 class 0x028000
[ 9.672748] pci 0000:05:00.2: Max Payload Size set to 256 (was 128, max 512)
[ 9.682097] iommu: Adding device 0000:05:00.2 to group 50
[ 9.688952] mlx4_core: Initializing 0000:05:00.2
[ 9.695814] mlx4_core 0000:05:00.2: enabling device (0000 -> 0002)
[ 9.703545] mlx4_core 0000:05:00.2: Skipping virtual function:2
[ 9.710846] pci 0000:05:00.3: [15b3:1004] type 00 class 0x028000
[ 9.717871] pci 0000:05:00.3: Max Payload Size set to 256 (was 128, max 512)
[ 9.727364] iommu: Adding device 0000:05:00.3 to group 51
[ 9.734288] mlx4_core: Initializing 0000:05:00.3
[ 9.741154] mlx4_core 0000:05:00.3: enabling device (0000 -> 0002)
[ 9.748832] mlx4_core 0000:05:00.3: Skipping virtual function:3
[ 9.756277] pci 0000:05:00.4: [15b3:1004] type 00 class 0x028000
[ 9.763268] pci 0000:05:00.4: Max Payload Size set to 256 (was 128, max 512)
[ 9.772810] iommu: Adding device 0000:05:00.4 to group 52
[ 9.779724] mlx4_core: Initializing 0000:05:00.4
[ 9.786588] mlx4_core 0000:05:00.4: enabling device (0000 -> 0002)
[ 9.794256] mlx4_core 0000:05:00.4: Skipping virtual function:4
[ 9.801504] pci 0000:05:00.5: [15b3:1004] type 00 class 0x028000
[ 9.808499] pci 0000:05:00.5: Max Payload Size set to 256 (was 128, max 512)
[ 9.817815] iommu: Adding device 0000:05:00.5 to group 53
[ 9.824644] mlx4_core: Initializing 0000:05:00.5
[ 9.831359] mlx4_core 0000:05:00.5: enabling device (0000 -> 0002)
[ 9.838894] mlx4_core 0000:05:00.5: Skipping virtual function:5
[ 9.846081] pci 0000:05:00.6: [15b3:1004] type 00 class 0x028000
[ 9.853086] pci 0000:05:00.6: Max Payload Size set to 256 (was 128, max 512)
[ 9.862328] iommu: Adding device 0000:05:00.6 to group 54
[ 9.868962] mlx4_core: Initializing 0000:05:00.6
[ 9.875601] mlx4_core 0000:05:00.6: enabling device (0000 -> 0002)
[ 9.883002] mlx4_core 0000:05:00.6: Skipping virtual function:6
[ 9.890082] pci 0000:05:00.7: [15b3:1004] type 00 class 0x028000
[ 9.897070] pci 0000:05:00.7: Max Payload Size set to 256 (was 128, max 512)
[ 9.906218] iommu: Adding device 0000:05:00.7 to group 55
[ 9.912806] mlx4_core: Initializing 0000:05:00.7
[ 9.919380] mlx4_core 0000:05:00.7: enabling device (0000 -> 0002)
[ 9.926935] mlx4_core 0000:05:00.7: Skipping virtual function:7
[ 9.934160] pci 0000:05:01.0: [15b3:1004] type 00 class 0x028000
[ 9.941145] pci 0000:05:01.0: Max Payload Size set to 256 (was 128, max 512)
[ 9.950497] iommu: Adding device 0000:05:01.0 to group 56
[ 9.957354] mlx4_core: Initializing 0000:05:01.0
[ 9.964295] mlx4_core 0000:05:01.0: enabling device (0000 -> 0002)
[ 9.972187] mlx4_core 0000:05:01.0: Skipping virtual function:8
[ 9.979753] mlx4_core 0000:05:00.0: Running in master mode
[ 9.986885] mlx4_core 0000:05:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[ 9.994178] mlx4_core 0000:05:00.0: PCIe link width is x8, device supports x8
[ 10.178999] mlx4_core: Initializing 0000:05:00.1
[ 10.186463] mlx4_core 0000:05:00.1: enabling device (0000 -> 0002)
[ 10.194814] mlx4_core 0000:05:00.1: Skipping virtual function:1
[ 10.202620] mlx4_core: Initializing 0000:05:00.2
[ 10.210054] mlx4_core 0000:05:00.2: enabling device (0000 -> 0002)
[ 10.218256] mlx4_core 0000:05:00.2: Skipping virtual function:2
[ 10.225917] mlx4_core: Initializing 0000:05:00.3
[ 10.233189] mlx4_core 0000:05:00.3: enabling device (0000 -> 0002)
[ 10.241256] mlx4_core 0000:05:00.3: Skipping virtual function:3
[ 10.248961] mlx4_core: Initializing 0000:05:00.4
[ 10.256108] mlx4_core 0000:05:00.4: enabling device (0000 -> 0002)
[ 10.264085] mlx4_core 0000:05:00.4: Skipping virtual function:4
[ 10.271598] mlx4_core: Initializing 0000:05:00.5
[ 10.278675] mlx4_core 0000:05:00.5: enabling device (0000 -> 0002)
[ 10.286606] mlx4_core 0000:05:00.5: Skipping virtual function:5
[ 10.294012] mlx4_core: Initializing 0000:05:00.6
[ 10.301002] mlx4_core 0000:05:00.6: enabling device (0000 -> 0002)
[ 10.308782] mlx4_core 0000:05:00.6: Skipping virtual function:6
[ 10.316133] mlx4_core: Initializing 0000:05:00.7
[ 10.323060] mlx4_core 0000:05:00.7: enabling device (0000 -> 0002)
[ 10.330772] mlx4_core 0000:05:00.7: Skipping virtual function:7
[ 10.338002] mlx4_core: Initializing 0000:05:01.0
[ 10.344802] mlx4_core 0000:05:01.0: enabling device (0000 -> 0002)
[ 10.352404] mlx4_core 0000:05:01.0: Skipping virtual function:8
[ 10.366153] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb 2014)
[ 23.574588] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v2.2-1 (Feb 2014)
[ 23.585484] <mlx4_ib> mlx4_ib_add: counter index 0 for port 1 allocated 0
[ 23.664067] mlx4_core 0000:05:00.0: mlx4_ib: multi-function enabled
[ 23.679324] mlx4_core 0000:05:00.0: mlx4_ib: initializing demux service for 128 qp1 clients
The issue appears to be that the module parameters need to be in /etc/modprobe.d/mlx4_core.conf and not /etc/modprobe.d/mlx4.conf. After moving the parameters into the correct file, the VFs are probed as expected on boot.
# cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core num_vfs=8 probe_vf=8 port_type_array=1
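For anyone else hitting this, a quick way to confirm the VFs were actually probed after the change (a sketch):
# the "Skipping virtual function" messages should be gone after reboot
dmesg | grep mlx4
# and each probed VF should now appear as its own network interface
ip link show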
Hi. OmniPath (OPA) is an Intel proprietary protocol and is only implemented by OPA switches. You didn't mention what type of cable you're using, but OPA cables are also proprietary, and not compatible with Ethernet or InfiniBand.