I am in the process of setting up an iSCSI storage system for VMware using IPoIB as the high-speed interconnect. The switch is a 4036 running 3.9.1. The iSCSI cluster has two nodes, each with one dual-port ConnectX-3 HCA. The second port of each HCA is cross-connected directly, without a switch. The IB SM is running on a CentOS box with the following partition definitions:
Default=0xffff , ipoib, mtu=5: ALL=full;
vmotion20=0x8014 , ipoib, mtu=5, defmember=full: ALL=full;
iscsi40=0x8028 , ipoib, mtu=5, defmember=full: ALL=full;
iscsi50=0x8032 , ipoib, mtu=5, defmember=full: ALL=full;
On node A of the storage cluster, I ran "part_man add IPoIB#1 ipoib_8028 8028". IPoIB#1 is configured with 10.0.0.1/24, and the new virtual interface for PKey 0x8028 has 192.168.40.1/24.
On node B of the storage cluster, I ran "part_man add IPoIB#1 ipoib_8032 8032". IPoIB#1 is configured with 10.0.0.2/24, and the new virtual interface for PKey 0x8032 has 192.168.50.1/24.
On the SM (CentOS), ib0 is configured with 10.0.0.254/24, ib0.8028 has 192.168.40.254/24, and ib0.8032 has 192.168.50.254/24.
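For reference, here is a small Python sketch of how the values above break down. It assumes the standard InfiniBand encoding, where the high bit of a 16-bit P_Key is the membership bit (1 = full) and the mtu value is an enum rather than a byte count. The names decode_pkey and IB_MTU_BYTES are just illustrative helpers, not part of any tool.

```python
# Standard IB MTU enum -> payload size in bytes.
IB_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def decode_pkey(pkey):
    """Split a 16-bit P_Key into (15-bit base value, full-membership flag)."""
    return pkey & 0x7FFF, bool(pkey & 0x8000)


# Partition names and P_Keys as listed in the partition definitions above.
partitions = [("vmotion20", 0x8014), ("iscsi40", 0x8028), ("iscsi50", 0x8032)]

for name, pkey in partitions:
    base, full = decode_pkey(pkey)
    print(f"{name}: pkey=0x{pkey:04x} base=0x{base:04x} ({base}) "
          f"full={full} mtu={IB_MTU_BYTES[5]} bytes")
```

So 0x8028 is base P_Key 0x0028 (decimal 40) with full membership, 0x8032 is base 0x0032 (decimal 50), and mtu=5 corresponds to 4096 bytes.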
The problem I am having is that connectivity to the virtual interface on the Windows server is sporadic after a system restart/reboot. Pings among 10.0.0.1, 10.0.0.2 and 10.0.0.254 are always successful. ibping and ibtracert between all three nodes are always successful. Pings on 192.168.40.0/24 and 192.168.50.0/24 are hit and miss. Sometimes a restart of the SM will reestablish connectivity. When connectivity is lost, tcpdump and Wireshark on the virtual interface show the ARP who-has packet leaving the ping originator, but it never shows up on the other end.
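Since the ARP who-has is a broadcast, on IPoIB it should travel over the per-partition broadcast multicast group. Below is a rough Python sketch of the MGID I would expect for each partition, assuming the RFC 4391 layout (ff1<scope>:401b:<P_Key>::ffff:ffff) and link-local scope (2), which I believe is the usual default. The function name is hypothetical. I have not confirmed that a missing group join is actually the cause here.

```python
def ipoib_broadcast_mgid(pkey, scope=0x2):
    """Expected IPv4 broadcast MGID for an IPoIB partition (RFC 4391 layout).

    scope=0x2 (link-local) is an assumption, not something read from the fabric.
    """
    pkey |= 0x8000  # the P_Key embedded in the MGID carries the full-membership bit
    groups = [0xFF10 | scope, 0x401B, pkey, 0, 0, 0, 0xFFFF, 0xFFFF]
    return ":".join(f"{g:04x}" for g in groups)


for pkey in (0xFFFF, 0x8028, 0x8032):
    print(f"pkey 0x{pkey:04x} -> {ipoib_broadcast_mgid(pkey)}")
```

These values could be compared against the multicast groups the SM reports (for example with saquery -g on the CentOS box) when the pings are failing versus when they work.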
However, if the SM is cross-connected directly to port #1 on a storage cluster node (bypassing the 4036), I am not able to reproduce the problem. Of course, by doing this, the SM sees the link go down and then up, which pretty much triggers an event similar to an SM restart. As indicated above, restarting the SM often seems to fix connectivity on the virtual interfaces (PKey partitions).
Any help pointing me in the right direction will be greatly appreciated!
Thanks!