Hi,
Thanks for replying. From what I understand it is not, and even if it were, I would expect it to still function, since the communication for RSF-1 is over Ethernet. The storage network uses InfiniBand: IPoIB for iSCSI on Windows, as SRP support was removed in 2012, and NFS for ESXi, due to the VMs being exposed in the snapshots, which makes VM recovery from snapshots easier. SRP was also tested, but it didn't provide much greater performance overall.
The storage nodes have a management network over Ethernet, which is also used for heartbeats; they also have a serial link for heartbeat and a set of quorum disks. Failover in these tests was manually initiated, which unmounts the pool, removes the network configuration, remounts the pool on the other node and reconfigures the network.
The infiniband network config and subnet manager could be the problem.
Each system has 2 ports, and 1 port from each is connected to an independent switch (no link between them), so port 1 on all systems goes to switch 1 and port 2 to switch 2.
Port 1 is on subnet 10.200.46.0/25 and port 2 is on 10.200.46.128/25 (IPoIB).
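As a quick sanity check on the addressing, here is a minimal Python sketch showing that the two /25 networks are the non-overlapping halves of 10.200.46.0/24. The helper name `port_for` and the sample host addresses are made up for illustration only.

```python
import ipaddress

# The two IPoIB subnets described above.
subnet1 = ipaddress.ip_network("10.200.46.0/25")    # port 1 / switch 1
subnet2 = ipaddress.ip_network("10.200.46.128/25")  # port 2 / switch 2

# Both are halves of the same /24 and must not overlap.
parent = ipaddress.ip_network("10.200.46.0/24")
halves = list(parent.subnets(prefixlen_diff=1))


def port_for(addr):
    """Return which port/switch an address should live on (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    if ip in subnet1:
        return "port 1 / switch 1"
    if ip in subnet2:
        return "port 2 / switch 2"
    return "not on the storage network"


if __name__ == "__main__":
    print("overlap:", subnet1.overlaps(subnet2))
    # Example host addresses, chosen only for illustration.
    for addr in ("10.200.46.10", "10.200.46.140"):
        print(addr, "->", port_for(addr))
```

An address that resolves to the wrong port after a failover would point at the ARP/route issue mentioned in the release notes.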
However, both subnets have the same pkey. The IPoIB release notes say that different subnets need different pkeys if they are on the same switch, otherwise ARP updates may produce an incorrect route. These are not on the same switch, but I thought it could be the problem. Checking the ARP updates on ESXi, though, appears to show the IP addresses moving over to the correct MAC and using the correct interface.
There are 2 ESXi NFS datastores running over the different subnets: datastore 1 over subnet 1 and datastore 2 over subnet 2.
The subnet manager could also be a problem. Reading different configs appears to show different notations, so I'm not sure which is correct. The current partition config is: Default=0xffff,ipoib,rate=7,x mtu=5,defmember=full:ALL; The subnet manager logs also produce some errors multiple times:
583281 [23991700] 0x01 -> __osm_mcmr_rcv_join_mgrp: ERR 1B10: Provided Join State != FullMember - required for create, MGID: ff12:401b:ffff::2 from port 0x0002c903002af1cf (MT25408 ConnectX Mellanox Technologies)
558044 [23190700] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:9 GID:fe80::2:c903:2a:f16f
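To make the MGID in the ERR 1B10 line easier to read, here is a rough Python sketch that splits it according to the IPoIB multicast mapping as I understand it from RFC 4391 (signature 0x401b for IPv4, then the 16-bit P_Key, with the low 28 bits carrying the IPv4 group). The function name `decode_ipoib_mgid` is my own, and the field layout should be checked against the RFC before relying on it.

```python
import ipaddress


def decode_ipoib_mgid(mgid):
    """Split an IPoIB multicast GID into its fields (layout assumed from RFC 4391)."""
    raw = ipaddress.IPv6Address(mgid).packed
    signature = int.from_bytes(raw[2:4], "big")
    pkey = int.from_bytes(raw[4:6], "big")
    result = {
        "scope": raw[1] & 0x0F,     # 2 = link-local
        "signature": signature,     # 0x401b = IPv4, 0x601b = IPv6
        "pkey": pkey,
        "ipv4_group": None,
    }
    if signature == 0x401B:
        # The low 28 bits map onto 224.0.0.0/4.
        low28 = int.from_bytes(raw[12:16], "big") & 0x0FFFFFFF
        result["ipv4_group"] = str(ipaddress.IPv4Address(0xE0000000 | low28))
    return result


if __name__ == "__main__":
    info = decode_ipoib_mgid("ff12:401b:ffff::2")
    print("pkey: 0x%04x" % info["pkey"])
    print("IPv4 group:", info["ipv4_group"])
```

If that reading is right, the join being rejected is for an IPv4 multicast group on the default 0xffff partition, which ties it back to the partition membership settings rather than to the switches.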
I have set all the systems to use an MTU of 4K. The subnet manager, on Debian, didn't want to go above 2K until it was set into connected mode. I was going to set everything back to 2K, as it's the default, to be sure that's not causing a problem.
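For reference, the mtu= and rate= values in the partition line are encoded values rather than sizes. The small Python sketch below decodes them using the standard InfiniBand encodings as I know them; the 4-byte IPoIB header figure applies to datagram mode, and the table and function names are just for illustration.

```python
# InfiniBand MTU codes (the N in mtu=N) mapped to bytes.
IB_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}

# InfiniBand rate codes (the N in rate=N) mapped to Gb/s.
IB_RATE_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}

# IPoIB datagram mode adds a 4-byte encapsulation header.
IPOIB_HEADER_BYTES = 4


def ipoib_datagram_mtu(mtu_code):
    """IP MTU available in datagram mode for a given IB MTU code."""
    return IB_MTU_BYTES[mtu_code] - IPOIB_HEADER_BYTES


def pkey_is_full_member(pkey):
    """The top bit of a P_Key marks full (1) versus limited (0) membership."""
    return bool(pkey & 0x8000)


if __name__ == "__main__":
    print("mtu=5 ->", IB_MTU_BYTES[5], "bytes, IPoIB datagram MTU", ipoib_datagram_mtu(5))
    print("mtu=4 ->", IB_MTU_BYTES[4], "bytes, IPoIB datagram MTU", ipoib_datagram_mtu(4))
    print("rate=7 ->", IB_RATE_GBPS[7], "Gb/s")
    print("0xffff full member:", pkey_is_full_member(0xFFFF))
```

So mtu=5 corresponds to the 4K setting and mtu=4 would be the 2K default I am planning to go back to.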
I have 2 setups that I was going to try. The first is to keep the layout the same as now, but with 2 pkeys, 1 for each subnet, and change the MTU back to the default to see if that works. The next is to use only 1 subnet and pkey and link the 2 switches together, again with a 2K MTU. The latter is not the recommended config for an iSCSI network with multiple paths, so it isn't really wanted.
Any additional help with this would be appreciated, and any additional info you may need I will try to provide.
Thanks