Has anyone figured this out, or can anyone from Mellanox comment on whether this is something that is coming back in the next driver release?
Re: Port 1 IB Port 2 Ethernet - ESXi 5.1 (5.5)
Factory Reset Voltaire ISR9024m without console access?
I purchased one of these units from eBay and just got confirmation from Wireshark that the unit was never reset prior to being sold. Is there a way to reset the unit to factory defaults without console access? It looks like I'm locked out of it at this time.
Re: Voltaire ISR 9024 Console Cable
Looks like the easiest way to make one is a USB-to-serial camera cable and then take a screwdriver to it. Sound like a plan?
Re: Voltaire ISR 9024 Console Cable
Which pins are used on the mini USB end?
Re: Cannot direct ping between two ConnectX-3 cards
I think you have IP addresses on IB ports that are down.
Looks like the other ports are up, gpu0-ib0 and gpu1-ib1. Run ibdev2netdev to make sure.
If you start opensm on gpu1, you need to specify the GUID of port 2 (ib1); otherwise opensm will bind to port 1.
You can start opensm like this: "opensm -g 0xf4521403007f6082".
Or you can use opensmd, but first you need to update the GUID in /etc/opensm/opensm.conf (create the conf with "opensm -c /etc/opensm/opensm.conf" if it does not exist).
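As a concrete sketch (using the example GUID above; substitute the Port GUID reported for ib1 on your system), the whole flow looks roughly like this:
ibstat mlx4_0 2                      # note the "Port GUID" line for port 2 (ib1)
opensm -g 0xf4521403007f6082         # run opensm bound to that port
# or, for the opensmd service:
opensm -c /etc/opensm/opensm.conf    # generate the default config if it does not exist
# then edit /etc/opensm/opensm.conf and set:  guid 0xf4521403007f6082
service opensmd start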
Use RRAS to nat Infiniband connections to Virtual Machines?
Has anyone tried this? I've gotten about halfway there, but I'm stuck on getting communication to and from the IPoIB fabric. My suspicion is that RRAS is trying to route to the fabric, but since there is no gateway on the actual fabric, it can't figure out where to route the traffic. Has anyone else tried? I was originally going to use Network Virtualization, but that will only send the virtual traffic over the fabric; it won't allow the machines to access physical resources without a gateway device, and that adds too much complexity for my comfort. I'm trying to map InfiniBand fabric resources to Hyper-V guests on Server 2012 R2 and I've had no luck, since InfiniBand doesn't support bridging connections.
Mellanox with RHEL 5 & 6 Performance Test
Hello Team Mellanox,
First of all, thank you so much and congratulations on "Singapore Exchange", the fastest trading on earth!
I have a few questions:
1. Where can I get performance test papers/white papers for RHEL 5.x and 6.x?
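In the meantime, if you just want to measure raw RDMA numbers yourself on RHEL, the perftest tools that ship with OFED are an easy baseline (a sketch; "server-host" is a placeholder):
ib_write_bw                    # on the server side, waits for a client
ib_write_bw server-host        # on the client side, runs the bandwidth test
ib_write_lat server-host       # same idea for latency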
Thanks and Regards
Rajat
Re: Upgrading 4036, media full
Hi inbusiness,
Thank you for your help in the matter. Unfortunately, the U-Boot does not contain the build_flash command. The best I could find there was tftpboot; however, I could not figure out the correct syntax. Neither the manual nor the help menu mentions it.
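(For reference, generic U-Boot tftpboot usage usually looks roughly like the following; the addresses and file name below are placeholders and the 4036's U-Boot may differ.)
setenv ipaddr 192.168.1.10          # management IP of the switch
setenv serverip 192.168.1.1         # IP of the TFTP server
tftpboot 0x2000000 4036-image.img   # load address, then the file name on the TFTP server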
Yes, I have a console cable and I figured out the 38400 speed myself, so +1 there.
Magnus
Re: Upgrading 4036, media full
hi Rian,
Could you be so kind as to clarify where to link these symbolic files? Basically, we need to know where the file should go. We can only find traces of upgrade files pertaining to firmware upgrades.
Deleting the /dev/nul only freed 1.7 MB, so no luck there.
Much appreciated
Magnus
How to install MLNX_OFED_LINUX on Fedora 18?
How do I install MLNX_OFED_LINUX on Fedora 18? Does the driver support Fedora 18?
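For what it's worth, the usual MLNX_OFED install flow on an RPM-based distro looks roughly like this (a sketch; whether Fedora 18 is officially supported depends on the release notes, and --add-kernel-support can rebuild the packages for a kernel that is not listed):
mount -o ro,loop MLNX_OFED_LINUX-*.iso /mnt
cd /mnt
./mlnxofedinstall --add-kernel-support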
Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
Unfortunately nothing has changed; I still have connectivity issues with ib_srp.
2014-01-03T05:57:25.944Z stratus203 vmkwarning: cpu39:9479)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:3 (driver name: ib_srp) - Message repeated 20 times
2014-01-03T05:57:25.944Z stratus203 vmkernel: cpu39:9479)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:3 (driver name: ib_srp) - Message repeated 20 times
2014-01-03T05:57:25.955Z stratus203 vmkernel: cpu35:8227)ScsiDeviceIO: 2311: Cmd(0x412580aae440) 0x2a, CmdSN 0x194e1d from world 9479 to dev "eui.3632656331666463" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T06:57:39.529Z stratus203 vmkwarning: cpu8:8200)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 25 times
2014-01-03T06:57:39.529Z stratus203 vmkernel: cpu8:8200)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 25 times
2014-01-03T06:57:39.529Z stratus203 vmkernel: cpu9:11701)ScsiDeviceIO: 2311: Cmd(0x4124c0fcafc0) 0x2a, CmdSN 0x800e0021 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T07:25:05.061Z stratus203 vmkwarning: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 9 times
2014-01-03T07:25:05.061Z stratus203 vmkernel: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 9 times
2014-01-03T07:25:05.063Z stratus203 vmkernel: cpu13:962303)ScsiDeviceIO: 2311: Cmd(0x4124c0c8dd00) 0x2a, CmdSN 0x800e0041 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x24 0x0.
2014-01-03T07:43:15.573Z stratus203 vmkwarning: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 5 times
2014-01-03T07:43:15.573Z stratus203 vmkernel: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 5 times
2014-01-03T07:43:15.575Z stratus203 vmkernel: cpu11:8203)ScsiDeviceIO: 2311: Cmd(0x4124c0fc90c0) 0x2a, CmdSN 0x800e006a from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T08:09:20.121Z stratus203 vmkwarning: cpu14:673009)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 9 times
2014-01-03T08:09:20.121Z stratus203 vmkernel: cpu14:673009)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 9 times
2014-01-03T08:09:20.122Z stratus203 vmkernel: cpu14:673009)ScsiDeviceIO: 2311: Cmd(0x4124c0bad8c0) 0x2a, CmdSN 0x800e0042 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T08:24:50.148Z stratus203 vmkwarning: cpu9:8201)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 10 times
2014-01-03T08:24:50.148Z stratus203 vmkernel: cpu9:8201)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 10 times
2014-01-03T08:24:50.149Z stratus203 vmkernel: cpu9:8201)ScsiDeviceIO: 2311: Cmd(0x4124c0eb3b80) 0x2a, CmdSN 0x800e006e from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T08:41:06.731Z stratus203 vmkwarning: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 15 times
2014-01-03T08:41:06.731Z stratus203 vmkernel: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 15 times
2014-01-03T08:41:06.732Z stratus203 vmkernel: cpu11:8203)ScsiDeviceIO: 2311: Cmd(0x4124c00fa7c0) 0x2a, CmdSN 0x800e0028 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T08:56:42.696Z stratus203 vmkwarning: cpu10:8202)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 7 times
2014-01-03T08:56:42.696Z stratus203 vmkernel: cpu10:8202)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 7 times
2014-01-03T08:56:42.696Z stratus203 vmkernel: cpu15:11702)ScsiDeviceIO: 2311: Cmd(0x4124c13ae5c0) 0x2a, CmdSN 0x800e0015 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T09:11:44.233Z stratus203 vmkwarning: cpu9:8201)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 13 times
2014-01-03T09:11:44.233Z stratus203 vmkernel: cpu9:8201)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 13 times
2014-01-03T09:11:44.234Z stratus203 vmkernel: cpu9:8201)ScsiDeviceIO: 2311: Cmd(0x4124c121b540) 0x2a, CmdSN 0x800e0003 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T09:33:48.769Z stratus203 vmkwarning: cpu12:8204)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 5 times
2014-01-03T09:33:48.769Z stratus203 vmkernel: cpu12:8204)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 5 times
2014-01-03T09:33:48.770Z stratus203 vmkernel: cpu12:8204)ScsiDeviceIO: 2311: Cmd(0x4124c0bab1c0) 0x2a, CmdSN 0x800e000f from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T09:51:31.157Z stratus203 vmkwarning: cpu50:8344)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:1 (driver name: ib_srp) - Message repeated 33 times
2014-01-03T09:51:31.157Z stratus203 vmkernel: cpu50:8344)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:1 (driver name: ib_srp) - Message repeated 33 times
2014-01-03T09:51:31.158Z stratus203 vmkernel: cpu54:8246)ScsiDeviceIO: 2311: Cmd(0x4126009b0980) 0x2a, CmdSN 0x3bb009 from world 8344 to dev "eui.623233346565652d" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T09:51:31.550Z stratus203 vmkernel: cpu51:8243)ScsiDeviceIO: 2311: Cmd(0x41260097df80) 0x2a, CmdSN 0x3bb0dd from world 8344 to dev "eui.623233346565652d" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T09:51:31.565Z stratus203 vmkernel: cpu54:8246)ScsiDeviceIO: 2311: Cmd(0x412600d83f40) 0x2a, CmdSN 0x3bb0e4 from world 8344 to dev "eui.623233346565652d" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T10:02:36.464Z stratus203 vmkernel: cpu8:8200)ScsiDeviceIO: 2311: Cmd(0x4124c07ee1c0) 0x2a, CmdSN 0x800e0023 from world 13211 to dev "eui.3435613932663332" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T10:02:36.465Z stratus203 vmkernel: cpu8:8200)ScsiDeviceIO: 2311: Cmd(0x4124c13acbc0) 0x2a, CmdSN 0x800e000f from world 13211 to dev "eui.3435613932663332" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T10:02:36.465Z stratus203 vmkernel: cpu8:8200)ScsiSched: 2147: Reduced the queue depth for device eui.3435613932663332 to 28, due to queue full/busy conditions. The queue depth could be reduced further if the condition persists.
2014-01-03T10:07:01.466Z stratus203 vmkwarning: cpu16:8310)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:2 (driver name: ib_srp) - Message repeated 124 times
2014-01-03T10:07:01.466Z stratus203 vmkernel: cpu16:8310)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:2 (driver name: ib_srp) - Message repeated 124 times
2014-01-03T10:07:01.467Z stratus203 vmkernel: cpu19:8211)ScsiDeviceIO: 2311: Cmd(0x4125009f6700) 0x2a, CmdSN 0x597ab3 from world 8310 to dev "eui.3138383164363939" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x25 0x0.
2014-01-03T10:11:54.815Z stratus203 vmkernel: cpu15:8207)ScsiDeviceIO: 2311: Cmd(0x4124c0c29580) 0x2a, CmdSN 0x800e004f from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T10:11:54.816Z stratus203 vmkernel: cpu15:8207)ScsiDeviceIO: 2311: Cmd(0x4124c00e6300) 0x2a, CmdSN 0x800e0056 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T10:11:54.817Z stratus203 vmkernel: cpu15:8207)ScsiSched: 2147: Reduced the queue depth for device eui.3731346538376162 to 1, due to queue full/busy conditions. The queue depth could be reduced further if the condition persists.
Re: Any success with OFED 2.0-3.0.0 on RHEL 6.5/CentOS 6.5?
I've got OFED 2.0-3.0.0 to build against RHEL 6.5 (2.6.32-431.1.2.0.1.el6.x86_64) with minor tweaks to the source RPM.
However, someone (from Mellanox?) in another thread mentioned an ETA of a couple of weeks for 2.1 with RHEL 6.5 compatibility.
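For anyone wanting to attempt the same rebuild, the general source-RPM flow is roughly this (a sketch; the SRPM path is a placeholder and the actual tweaks needed for 2.6.32-431 are not shown):
yum install gcc rpm-build "kernel-devel-$(uname -r)"
rpmbuild --rebuild /path/to/ofed-kernel-modules.src.rpm   # placeholder name for the kernel SRPM in the bundle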
Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
The ESXi host with the queue problems has now been in Maintenance Mode for 4 hours, and after I issue an "esxcli storage core adapter rescan --all" command, this pops up in the vmkernel.log file:
2014-01-03T14:10:11.389Z cpu56:14437)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:6 (driver name: ib_srp) - Message repeated 561 times
2014-01-03T14:11:31.334Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c0c0b300) 0x28, CmdSN 0x2b from world 1067687 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T14:11:31.366Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c084bf40) 0x28, CmdSN 0x2f from world 1067687 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x25 0x0.
2014-01-03T14:11:31.388Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c00e4fc0) 0x28, CmdSN 0x30 from world 1067687 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-01-03T14:11:31.755Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c0d69000) 0x28, CmdSN 0x32 from world 1067687 to dev "eui.3632656331666463" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x25 0x0.
It looks like the ib_srp driver is stuck on some SCSI commands, which are retried even after the VM world is no longer there.
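A couple of generic esxcli checks that may help narrow this down (a sketch, nothing ib_srp-specific; the device ID is the one from the log above):
esxcli storage core adapter list                         # confirm the state of the vmhba_mlx4 adapter
esxcli storage core device list -d eui.3731346538376162  # per-device status and current queue depth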
Re: MLNX_OFED_LINUX installation Error: Error: The current MLNX_OFED_LINUX is intended for rhel6.4
ibv_devices should show something like this, regardless of how the ports are configured:
device node GUID
------ ----------------
mlx4_0 0002c9030....
Do you see the adapter with this command?
lspci -d 15b3:
If yes, can you check whether the mlx modules are loaded? lsmod | grep mlx
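If they are not loaded, bringing them up by hand is a quick sanity check (a sketch; module names assume the mlx4 driver stack):
modprobe mlx4_core
modprobe mlx4_ib
modprobe ib_uverbs     # needed by ibv_devices
modprobe ib_umad       # needed by MAD-based diagnostics (ibnetdiscover, ibping, ...)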
Re: Any success with OFED 2.0-3.0.0 on RHEL 6.5/CentOS 6.5?
The 2.1 version with RHEL 6.5 support should be available from the mellanox.com web site on 1/8 or 1/9.
Re: Does ConnectX-3 support Amphenol's QSFP+ transceiver?
Hi,
If a valid cable or module (QSFP, SFP+, or SFP with an EEPROM in the cable/module) is connected, the ConnectX-3 will read the cable information from the EEPROM on the transceiver.
Did you try another Amphenol QSFP+ or just the one, and did you check it on other equipment?
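If the port is running in Ethernet mode and the driver supports it, dumping the module EEPROM from the host is one way to see whether the transceiver is being read at all (a sketch; the interface name is a placeholder and -m support depends on the driver/kernel version):
ethtool -m eth2        # dump the QSFP+ module EEPROM, if the driver exposes it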
Re: Vmware ESXi 5.1 & ConnectX SRP Support
Sorry for the late reply; I hadn't gotten around to looking into InfiniBand until just recently.
My CX-1 HCA was originally an HP one:
C:\>flint -d mt25418_pciconf0 query full
Image type: ConnectX
FW Version: 2.6.0
Device ID: 25418
Description: Node Port1 Port2 Sys image
GUIDs: 001a4bffff0c6214 001a4bffff0c6215 001a4bffff0c6216 001a4bffff0c6217
MACs: 001a4b0c6215 001a4b0c6216
VSD:
PSID: HP_09D0000001
But then I changed it over to the Mellanox one:
C:\Program Files\Mellanox\WinMFT>flint -d mt25418_pciconf0 query
Image type: ConnectX
FW Version: 2.9.1000
Device ID: 25418
Description: Node Port1 Port2 Sys image
GUIDs: 001a4bffff0c6214 001a4bffff0c6215 001a4bffff0c6216 001a4bffff0c6217
MACs: 001a4b0c6215 001a4b0c6216
VSD:
PSID: MT_04A0110002
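(For anyone wanting to do the same reflash, the burn step would look roughly like this; the firmware image file name below is a placeholder, and moving from the HP PSID to the Mellanox one requires explicitly allowing the PSID change.)
flint -d mt25418_pciconf0 -i fw-25408-2_9_1000.bin -allow_psid_change burn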
Issues with SRP and unexplainable "ibping" behaviour.
Hello everyone,
I'm chasing a bit of assistance in troubleshooting SRP over an InfiniBand setup which I have at home. Essentially, I'm not seeing the disk I/O performance I was expecting between my SRP initiator and target and want to work out where the problem could be. I wanted to start with the InfiniBand infrastructure and work up from there. If I can verify that my InfiniBand is set up correctly and performing as it should, I can start to troubleshoot the additional technologies and protocols involved.
Some basic information first:
SRP Target: Oracle Solaris v11.1 server with ZFS pools as LUs (Logical Units).
SRP Initiator: VMware ESXi v5.5.
Mellanox MHGH28-XTC (MT25418) cards are being used in both the Infiniband devices above. A CX4 cable is used to directly connect between them.
Now, to the best of my knowledge, the drivers, VIBs and configuration have all been done correctly, and I'm at the point where my ESXi v5.5 can actually see the LU, mount it and store data on it. At this stage, it seems to be purely a performance issue which I'm trying to resolve.
Some CLI outputs below:
STORAGE-SERVER
STORAGE-SERVER:/# ibstat
CA 'mlx4_0'
CA type: 0
Number of ports: 2
Firmware version: 2.9.1000
Hardware version: 160
Node GUID: 0x001a4bffff0c6214
System image GUID: 0x001a4bffff0c6217
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 2
LMC: 0
SM lid: 1
Capability mask: 0x00000038
Port GUID: 0x001a4bffff0c6215
Link layer: IB
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00000038
Port GUID: 0x001a4bffff0c6216
Link layer: IB
VM-HYPER:
/opt/opensm/bin # ./ibstat
CA 'mlx4_0'
CA type: MT25418
Number of ports: 2
Firmware version: 2.7.0
Hardware version: a0
Node GUID: 0x001a4bffff0cb178
System image GUID: 0x001a4bffff0cb17b
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x0251086a
Port GUID: 0x001a4bffff0cb179
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 8
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x0251086a
Port GUID: 0x001a4bffff0cb17a
Link layer: InfiniBand
The "LIDs" in the above outputs indicate that the SM (Subnet Manager) is working as far as I'm aware.
From the SRP target, I can see the other Infiniband host:
STORAGE-SERVER:/# ibhosts
Ca : 0x001a4bffff0cb178 ports 2 "****************** HCA-1"
Ca : 0x001a4bffff0c6214 ports 2 "MT25408 ConnectX Mellanox Technologies"
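For reference, a few other low-level checks I could use to verify the fabric before moving further up the stack (a sketch):
sminfo         # confirms which LID is acting as SM and its state
iblinkinfo     # per-link width/speed; both ends should report 4X DDR (20 Gb/s) here
ibdiagnet      # full fabric sweep, reports per-port errors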
I thought I'd start by using the "ibping" utility to verify InfiniBand connectivity. This is where I got some really strange results:
Firstly, I could not get the ibping daemon running on the SRP initiator (ESXi) at all. The command would execute, but then just return to the shell:
/opt/opensm/bin # ./ibping -S
/opt/opensm/bin #
So I tried switching to running the ibping daemon on the SRP target (Oracle Solaris) instead, which seemed to work as it should; it appeared to be waiting for pings to come through. Great! Now, going back to the SRP initiator, I ran the ibping utility with the LID of the SRP target. But it was unsuccessful:
/opt/opensm/bin # ./ibping -L 2
ibwarn: [3502756] _do_madrpc: recv failed: Resource temporarily unavailable
ibwarn: [3502756] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2)
ibwarn: [3502756] _do_madrpc: recv failed: Resource temporarily unavailable
ibwarn: [3502756] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2)
ibwarn: [3502756] _do_madrpc: recv failed: Resource temporarily unavailable
...
..
.
--- (Lid 2) ibping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9360 ms
rtt min/avg/max = 0.000/0.000/0.000 ms
OK, let's try the Port GUID of the SRP target instead of the LID:
/opt/opensm/bin # ./ibping -G 0x001a4bffff0c6215
ibwarn: [3504924] _do_madrpc: recv failed: Resource temporarily unavailable
ibwarn: [3504924] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 1)
ibwarn: [3504924] ib_path_query_via: sa call path_query failed
./ibping: iberror: failed: can't resolve destination port 0x001a4bffff0c6215
I restarted the ibping daemon on the SRP target with one level of debugging and re-ran the pings from the client (SRP initiator). I can see that the pings are actually reaching the SRP target and a reply is being sent:
STORAGE-SERVER:/# ibping -S -d
ibdebug: [11188] ibping_serv: starting to serve...
ibdebug: [11188] ibping_serv: Pong: STORAGE-SERVER
ibwarn: [11188] mad_respond_via: dest Lid 1
ibwarn: [11188] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [11188] ibping_serv: Pong: STORAGE-SERVER
ibwarn: [11188] mad_respond_via: dest Lid 1
ibwarn: [11188] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [11188] ibping_serv: Pong: STORAGE-SERVER
ibwarn: [11188] mad_respond_via: dest Lid 1
ibwarn: [11188] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [11188] ibping_serv: Pong: STORAGE-SERVER
ibwarn: [11188] mad_respond_via: dest Lid 1
ibwarn: [11188] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [11188] ibping_serv: Pong: STORAGE-SERVER
ibwarn: [11188] mad_respond_via: dest Lid 1
ibwarn: [11188] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
The strangest observation is yet to come, however. If I run ibping on the client with two levels of debug, I get a few replies in the final statistics output when ibping is terminated (this does not happen with a single level of debugging, in my experience):
/opt/opensm/bin # ./ibping -L -dd 2
...
..
.
ibdebug: [3508744] ibping: Ping..
ibwarn: [3508744] ib_vendor_call_via: route Lid 2 data 0x3ffcebc7aa0
ibwarn: [3508744] ib_vendor_call_via: class 0x132 method 0x1 attr 0x0 mod 0x0 datasz 216 off 40 res_ex 1
ibwarn: [3508744] mad_rpc_rmpp: rmpp (nil) data 0x3ffcebc7aa0
ibwarn: [3508744] umad_set_addr: umad 0x3ffcebc7570 dlid 2 dqp 1 sl 0, qkey 80010000
ibwarn: [3508744] _do_madrpc: >>> sending: len 256 pktsz 320
send buf
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0001 8001 0000 0002 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0132 0101 0000 0000 0000 0000 4343 c235
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 1405 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
ibwarn: [3508744] umad_send: fd 3 agentid 1 umad 0x3ffcebc7570 timeout 1000
ibwarn: [3508744] umad_recv: fd 3 umad 0x3ffcebc7170 timeout 1000
ibwarn: [3508744] umad_recv: mad received by agent 1 length 320
ibwarn: [3508744] _do_madrpc: rcv buf:
rcv buf
0132 0181 0000 0000 0000 00ac 4343 c234
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 1405 6763 2d73 746f 7261
6765 312e 6461 726b 7265 616c 6d2e 696e
7465 726e 616c 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
ibwarn: [3508744] umad_recv: fd 3 umad 0x3ffcebc7170 timeout 1000
ibwarn: [3508744] umad_recv: mad received by agent 1 length 320
ibwarn: [3508744] _do_madrpc: rcv buf:
rcv buf
0132 0181 0000 0000 0000 00ac 4343 c235
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 1405 6763 2d73 746f 7261
6765 312e 6461 726b 7265 616c 6d2e 696e
7465 726e 616c 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
ibwarn: [3508744] mad_rpc_rmpp: data offs 40 sz 216
rmpp mad data
6763 2d73 746f 7261 6765 312e 6461 726b
7265 616c 6d2e 696e 7465 726e 616c 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000
Pong from STORAGE-SERVER (Lid 2): time 7.394 ms
ibdebug: [3508744] report: out due signal 2
--- STORAGE-SERVER (Lid 2) ibping statistics ---
10 packets transmitted, 3 received, 70% packet loss, time 9556 ms
rtt min/avg/max = 7.394/12.335/15.344 ms
I'm stumped. Anyone have any ideas on what is going on or how to troubleshoot further?
Re: Issues with SRP and unexplainable "ibping" behaviour.
Actually, looking at the Level 2 debugs a bit further, it seems that the replies are indeed making their way back to the ibping client (ESXi); you can see this in the receive buffers and the hex dump. But the following message seems to indicate something is amiss on the ESXi server:
ibwarn: [3511788] _do_madrpc: recv failed: Resource temporarily unavailable
On a side note, I'm seeing a lot of references to the word "mad" in all the debugging information. I wonder if someone is hinting at something.
Re: Issues with SRP and unexplainable "ibping" behaviour.
And here is some additional information on the Mellanox VIBs installed on the ESXi 5.5 server:
~ # esxcli software vib list | egrep Mellanox
net-ib-cm 1.8.2.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2013-12-22
net-ib-core 1.8.2.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2013-12-22
net-ib-ipoib 1.8.2.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2013-12-22
net-ib-mad 1.8.2.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2013-12-22
net-ib-sa 1.8.2.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2013-12-22
net-ib-umad 1.8.2.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2013-12-22
net-mlx4-core 1.8.2.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2013-12-22
net-mlx4-ib 1.8.2.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2013-12-22
scsi-ib-srp 1.8.2.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2013-12-22
~ # esxcli software vib list | egrep opensm
ib-opensm 3.3.15 Intel VMwareAccepted 2013-12-22