Channel: Mellanox Interconnect Community: Message List

Issues with SRP and unexplainable "ibping" behaviour.


Hello everyone,

 

I'm after some help troubleshooting SRP over an InfiniBand setup I have at home. Essentially, I'm not seeing the disk I/O performance I was expecting between my SRP initiator and target, and I want to work out where the problem is. My plan is to start at the InfiniBand layer and work up from there: if I can verify that the fabric is set up correctly and performing as it should, I can then move on to the other technologies and protocols involved.

 

Some basic information first:

 

SRP Target: Oracle Solaris v11.1 Server with ZFS pools as LU (Logical Units).

SRP Initiator: VMware ESXi v5.5.

 

Mellanox MHGH28-XTC (MT25418) cards are used in both of the hosts above, and the two cards are connected directly to each other with a CX4 cable.
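
As a rough sanity check on what "expected performance" could mean for this link, the arithmetic below estimates the payload ceiling. It is only a sketch: it assumes the link runs at 4X DDR (which matches the "Rate: 20" that ibstat reports further down) and that DDR uses 8b/10b line encoding. It ignores protocol overhead, PCIe limits, and the disks themselves, so real SRP throughput will be lower.

```python
# Rough upper bound on payload bandwidth for a 4X DDR InfiniBand link.
# Assumptions (not stated in the post): 4 lanes, 5 Gb/s signalling per lane,
# 8b/10b line encoding. Protocol and PCIe overheads are ignored.
LANES = 4
GBPS_PER_LANE = 5.0
ENCODING_EFFICIENCY = 8 / 10

signalling_gbps = LANES * GBPS_PER_LANE
data_gbps = signalling_gbps * ENCODING_EFFICIENCY
data_gbytes_per_s = data_gbps / 8  # bits to bytes

print(f"signalling: {signalling_gbps:.0f} Gb/s")
print(f"data rate:  {data_gbps:.0f} Gb/s (~{data_gbytes_per_s:.1f} GB/s)")
```

Under those assumptions, the wire itself tops out at roughly 2 GB/s, which gives a ceiling to compare measured disk I/O against.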

 

To the best of my knowledge, the drivers, VIBs and configuration have all been set up correctly: my ESXi v5.5 host can see the LU, mount it, and store data on it. At this stage it seems to be purely a performance issue that I'm trying to resolve.

 

Some CLI outputs below:

 

STORAGE-SERVER:

STORAGE-SERVER:/# ibstat
CA 'mlx4_0'
        CA type: 0
        Number of ports: 2
        Firmware version: 2.9.1000
        Hardware version: 160
        Node GUID: 0x001a4bffff0c6214
        System image GUID: 0x001a4bffff0c6217
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x00000038
                Port GUID: 0x001a4bffff0c6215
                Link layer: IB
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00000038
                Port GUID: 0x001a4bffff0c6216
                Link layer: IB

 

VM-HYPER:

/opt/opensm/bin # ./ibstat
CA 'mlx4_0'
        CA type: MT25418
        Number of ports: 2
        Firmware version: 2.7.0
        Hardware version: a0
        Node GUID: 0x001a4bffff0cb178
        System image GUID: 0x001a4bffff0cb17b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251086a
                Port GUID: 0x001a4bffff0cb179
                Link layer: InfiniBand
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 8
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x0251086a
                Port GUID: 0x001a4bffff0cb17a
                Link layer: InfiniBand


Both active ports have a LID assigned (1 and 2) and both report "SM lid: 1", which, as far as I'm aware, indicates that the SM (Subnet Manager) is working.
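
The same check can be scripted. The sketch below parses ibstat-style text and reports, per port, whether it is active, has a LID, knows an SM, and advertises itself as the SM. The sample text is abridged from the VM-HYPER output above; the helper names are made up for illustration, and the IsSM bit position (bit 1 of the PortInfo capability mask) is an assumption taken from the InfiniBand spec rather than something ibstat prints.

```python
import re

# Abridged from the VM-HYPER ibstat output above (port 1 only).
IBSTAT_SAMPLE = """\
CA 'mlx4_0'
        Firmware version: 2.7.0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 1
                SM lid: 1
                Capability mask: 0x0251086a
"""

# Assumption: bit 1 of the PortInfo CapabilityMask is IsSM.
IS_SM_BIT = 1 << 1


def parse_ports(text):
    """Return {port_number: {field: value}} from ibstat-style text."""
    ports, current = {}, None
    for line in text.splitlines():
        m = re.match(r"\s*Port (\d+):\s*$", line)
        if m:
            current = ports.setdefault(int(m.group(1)), {})
            continue
        # Only collect "key: value" lines once we are inside a port section.
        if current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return ports


def port_summary(fields):
    """Condense one port's fields into a few yes/no health indicators."""
    mask = int(fields["Capability mask"], 16)
    return {
        "active": fields["State"] == "Active",
        "has_lid": int(fields["Base lid"]) != 0,
        "sm_known": int(fields["SM lid"]) != 0,
        "hosts_sm": bool(mask & IS_SM_BIT),
    }


ports = parse_ports(IBSTAT_SAMPLE)
print(port_summary(ports[1]))
```

If the bit assignment holds, the ESXi port (mask 0x0251086a) has IsSM set while the Solaris port (mask 0x00000038) does not, which would fit the SM running on the ESXi side at LID 1.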

 

From the SRP target, I can see the other Infiniband host:

 

STORAGE-SERVER:/# ibhosts
Ca      : 0x001a4bffff0cb178 ports 2 "****************** HCA-1"
Ca      : 0x001a4bffff0c6214 ports 2 "MT25408 ConnectX Mellanox Technologies"

 

I thought I'd start by using the "ibping" utility to verify InfiniBand connectivity. This is where I got some really strange results:

 

Firstly, I could not get the ibping server running on the SRP initiator (ESXi) at all. The command would execute, but then immediately return to the shell:

 

/opt/opensm/bin # ./ibping -S
/opt/opensm/bin #

 

So I switched to running the ibping server on the SRP target (Oracle Solaris) instead. That seemed to work as it should, and it appeared to be waiting for pings to come through. Great! Back on the SRP initiator, I ran ibping against the LID of the SRP target, but it was unsuccessful:

 

/opt/opensm/bin # ./ibping -L 2
ibwarn: [3502756] _do_madrpc: recv failed: Resource temporarily unavailable
ibwarn: [3502756] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2)
ibwarn: [3502756] _do_madrpc: recv failed: Resource temporarily unavailable
ibwarn: [3502756] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2)
ibwarn: [3502756] _do_madrpc: recv failed: Resource temporarily unavailable
...
..
.
---  (Lid 2) ibping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9360 ms
rtt min/avg/max = 0.000/0.000/0.000 ms
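
For comparing runs, the summary line can be pulled apart with a few lines of Python. This is a minimal sketch: the regular expression is written against the summary format shown above, and `parse_summary` is a hypothetical helper, not part of any InfiniBand tooling.

```python
import re

# Pattern written against the ibping summary line shown above.
SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) received, "
    r"(\d+)% packet loss, time (\d+) ms"
)


def parse_summary(line):
    """Extract the counters from an ibping summary line."""
    m = SUMMARY_RE.search(line)
    if m is None:
        raise ValueError(f"not an ibping summary line: {line!r}")
    sent, received, loss_pct, elapsed_ms = map(int, m.groups())
    return {
        "sent": sent,
        "received": received,
        "loss_pct": loss_pct,
        "elapsed_ms": elapsed_ms,
    }


run = parse_summary(
    "10 packets transmitted, 0 received, 100% packet loss, time 9360 ms"
)
# Recompute the loss from the counters as a consistency check.
computed_loss = 100 * (run["sent"] - run["received"]) // run["sent"]
ms_per_ping = run["elapsed_ms"] / run["sent"]
print(run, computed_loss, ms_per_ping)
```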

 

OK, let's try the Port GUID of the SRP target instead of the LID:

 

/opt/opensm/bin # ./ibping -G 0x001a4bffff0c6215
ibwarn: [3504924] _do_madrpc: recv failed: Resource temporarily unavailable
ibwarn: [3504924] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 1)
ibwarn: [3504924] ib_path_query_via: sa call path_query failed
./ibping: iberror: failed: can't resolve destination port 0x001a4bffff0c6215

 

I restarted the ibping server on the SRP target with one level of debugging and re-ran the pings from the client (SRP initiator). I can see that the pings are actually reaching the SRP target and that a reply is being sent:

 

STORAGE-SERVER:/# ibping -S -d
ibdebug: [11188] ibping_serv: starting to serve...
ibdebug: [11188] ibping_serv: Pong: STORAGE-SERVER
ibwarn: [11188] mad_respond_via: dest Lid 1
ibwarn: [11188] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [11188] ibping_serv: Pong: STORAGE-SERVER
ibwarn: [11188] mad_respond_via: dest Lid 1
ibwarn: [11188] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [11188] ibping_serv: Pong: STORAGE-SERVER
ibwarn: [11188] mad_respond_via: dest Lid 1
ibwarn: [11188] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [11188] ibping_serv: Pong: STORAGE-SERVER
ibwarn: [11188] mad_respond_via: dest Lid 1
ibwarn: [11188] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [11188] ibping_serv: Pong: STORAGE-SERVER
ibwarn: [11188] mad_respond_via: dest Lid 1
ibwarn: [11188] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000

 

The strangest observation is yet to come, however. If I run ibping on the client with two levels of debug, I get a few replies in the final statistics output when ibping is terminated (in my experience this does not happen with a single level of debugging):

 

/opt/opensm/bin # ./ibping -L -dd 2
...
..
.
ibdebug: [3508744] ibping: Ping..
ibwarn: [3508744] ib_vendor_call_via: route Lid 2 data 0x3ffcebc7aa0
ibwarn: [3508744] ib_vendor_call_via: class 0x132 method 0x1 attr 0x0 mod 0x0 datasz 216 off 40 res_ex 1
ibwarn: [3508744] mad_rpc_rmpp: rmpp (nil) data 0x3ffcebc7aa0
ibwarn: [3508744] umad_set_addr: umad 0x3ffcebc7570 dlid 2 dqp 1 sl 0, qkey 80010000
ibwarn: [3508744] _do_madrpc: >>> sending: len 256 pktsz 320
send buf
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0001 8001 0000 0002 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0132 0101 0000 0000 0000 0000 4343 c235
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 1405 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
ibwarn: [3508744] umad_send: fd 3 agentid 1 umad 0x3ffcebc7570 timeout 1000
ibwarn: [3508744] umad_recv: fd 3 umad 0x3ffcebc7170 timeout 1000
ibwarn: [3508744] umad_recv: mad received by agent 1 length 320
ibwarn: [3508744] _do_madrpc: rcv buf:
rcv buf
0132 0181 0000 0000 0000 00ac 4343 c234
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 1405 6763 2d73 746f 7261
6765 312e 6461 726b 7265 616c 6d2e 696e
7465 726e 616c 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
ibwarn: [3508744] umad_recv: fd 3 umad 0x3ffcebc7170 timeout 1000
ibwarn: [3508744] umad_recv: mad received by agent 1 length 320
ibwarn: [3508744] _do_madrpc: rcv buf:
rcv buf
0132 0181 0000 0000 0000 00ac 4343 c235
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 1405 6763 2d73 746f 7261
6765 312e 6461 726b 7265 616c 6d2e 696e
7465 726e 616c 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
ibwarn: [3508744] mad_rpc_rmpp: data offs 40 sz 216
rmpp mad data
6763 2d73 746f 7261 6765 312e 6461 726b
7265 616c 6d2e 696e 7465 726e 616c 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000
Pong from STORAGE-SERVER (Lid 2): time 7.394 ms
ibdebug: [3508744] report: out due signal 2

--- STORAGE-SERVER (Lid 2) ibping statistics ---
10 packets transmitted, 3 received, 70% packet loss, time 9556 ms
rtt min/avg/max = 7.394/12.335/15.344 ms
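
One way to read the hex dumps above is to decode the first 16 bytes of each MAD and compare transaction IDs between the request and the replies. The sketch below assumes the standard MAD common header layout (base version, management class, class version, method, status, class-specific field, then a 64-bit transaction ID). It also assumes that the MAD in the send dump starts on the fifth line, after a 64-byte umad header (which would fit "len 256 pktsz 320"), and that only the low 32 bits of the transaction ID are chosen by ibping. The header lines are copied verbatim from the dumps; `parse_mad_header` is a hypothetical helper.

```python
import struct

# First 16 bytes of each MAD, copied from the dumps above.
SEND_HDR = "0132 0101 0000 0000 0000 0000 4343 c235"
RECV_HDRS = [
    "0132 0181 0000 0000 0000 00ac 4343 c234",
    "0132 0181 0000 0000 0000 00ac 4343 c235",
]


def parse_mad_header(hex_line):
    """Decode the start of a MAD common header (assumed layout, big-endian)."""
    raw = bytes.fromhex(hex_line.replace(" ", ""))
    (base_ver, mgmt_class, class_ver, method,
     status, class_specific, tid) = struct.unpack(">BBBBHHQ", raw[:16])
    return {
        "mgmt_class": mgmt_class,
        "method": method,
        "status": status,
        "tid": tid,
        # Assumption: the upper 32 bits are filled in below the application.
        "tid_low32": tid & 0xFFFFFFFF,
    }


request = parse_mad_header(SEND_HDR)
replies = [parse_mad_header(h) for h in RECV_HDRS]

print(f"request: method=0x{request['method']:02x} "
      f"tid_low32=0x{request['tid_low32']:08x}")
for reply in replies:
    matches = reply["tid_low32"] == request["tid_low32"]
    print(f"reply:   method=0x{reply['method']:02x} "
          f"tid_low32=0x{reply['tid_low32']:08x} matches_request={matches}")
```

If those layout assumptions hold, the first MAD received carries a transaction ID one lower than the request just sent, and only the second one matches it. That would be worth keeping in mind when interpreting the "recv failed" timeouts in the non-debug runs.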

 

 

I'm stumped. Does anyone have any ideas about what is going on, or how to troubleshoot further?

