Mellanox Interconnect Community: Message List

ibv_reg_mr gets "File exists" error when using nv_peer_mem


Hi, everyone

 

I would like to test GPUDirect RDMA, so I am using a ConnectX-3 adapter and an NVIDIA K80 for the experiment. The environment is listed below:

kernel-4.8.7

cuda-drivers: 384.66

cuda-toolkit: 375.26

nv_peer_mem: 1.0.5

 

 

I use the perftest tools for the experiment.

server1: ./ib_write_bw -a -F -n10000 --use_cuda

server2: ./ib_write_bw -a -F -n10000 server1

 

But server1 outputs these errors:

Couldn't allocate MR
failed to create mr
Failed to create MR

 

Finally, I printed out the error and errno: the error code is 14, and errno reads "Bad address".

 

Can anyone help me and tell me what the problem might be? Thank you very much.
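
For reference, here is a minimal sketch of the kind of registration I believe --use_cuda performs; this is my own approximation, not the actual perftest code, and the buffer size is an arbitrary assumption. It registers a cudaMalloc'ed buffer with ibv_reg_mr and prints errno on failure, which is how I obtained the values above:

/* Sketch: register a GPU buffer for RDMA (assumed to need nv_peer_mem); not the real perftest code. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);      /* first device; adjust as needed */
    if (!ctx) { fprintf(stderr, "ibv_open_device failed: %s\n", strerror(errno)); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) { fprintf(stderr, "ibv_alloc_pd failed: %s\n", strerror(errno)); return 1; }

    void *gpu_buf = NULL;
    size_t len = 64 * 1024;                                   /* arbitrary buffer size */
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) { fprintf(stderr, "cudaMalloc failed\n"); return 1; }

    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        fprintf(stderr, "ibv_reg_mr failed: errno=%d (%s)\n", errno, strerror(errno));
    else
        ibv_dereg_mr(mr);

    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

I build this with nvcc and link against libibverbs. Depending on whether nv_peer_mem is loaded, the ibv_reg_mr call on the device pointer fails here with a different errno, which is how I got the values above.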


Installing the MLNX driver fails


When I run "./mlnxofedinstall", the installation fails.

The error message is: "Error: One or more packages depends on MLNX_OFED. Those packages should be removed before uninstalling MLNX_OFED: ibutils-libs"

Do you know the reason for this failure?

Re: Installing the MLNX driver fails


Hi,

You need to remove ibutils-libs first.

Thanks

Re: Installing the MLNX driver fails


After manually uninstalling the relevant rpm packages, the installation succeeds. But what is the root cause of this error?

Re: Installing the MLNX driver fails


It seems an inbox-driver-related package is installed, and it cannot be removed automatically by the install process.

Re: Installing the MLNX driver fails


Uninstalling the relevant rpm package does work around the problem, but afterwards the error cannot be reproduced, so we still need to find the root cause.

Are there any other suspects, or a reliable way to reproduce it?

Re: ConnectX-3 WinOF 5.35 on Win2016 Multiple Partitions


Following up on this topic... 

 

I reverted the firmware to 2.36.5150 and the driver to 5.25.12665, and all the issues mentioned above are gone.

Re: Installing the MLNX driver fails


Is there any way to avoid this error, such as adding parameters or something similar?


Re: ConnectX VPI (MT26418) NIC and SFP modules

Re: ibv_reg_mr gets "File exists" error when using nv_peer_mem


Hi Haizhu,

 

Thank you for contacting the Mellanox Community.

 

For your test, please install the latest Mellanox OFED version and redo the test with ib_send_bw WITHOUT CUDA to check whether RDMA itself is working properly, including the option to define the device you want to use.

Example without CUDA

Server:

# ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits

Client:

# ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits <ip-address-server>

 

 

Example with CUDA

Server:

# ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda

Client:

# ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda <ip-address-server>

 

We also recommend following the benchmark test from the GPUDirect User Manual (http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf), Section 3.

 

For further support, we recommend opening a support case with Mellanox Support.

 

Thanks.

 

Cheers,

~Martijn

Re: ibv_reg_mr gets "File exists" error when using nv_peer_mem


Hi Martijn,

Thank you for your reply about this issue.

 

I didn't describe the issue clearly; the hardware and software environment is listed below:

1. Hardware:

ConnectX-3 (Mellanox Technologies MT27500 Family [ConnectX-3])

Nvidia K80

2. Software:

ubuntu-16.04, kernel 4.8.7

nvidia-driver: nvidia-diag-driver-local-repo-ubuntu1604-384.66_1.0-1_amd64.deb (download site: NVIDIA DRIVERS Tesla Driver for Ubuntu 16.04)

cuda-toolkit: cuda_8.0.61_375.26_linux.run (CUDA Toolkit Download | NVIDIA Developer )

MLNX_OFED: MLNX_OFED_SRC-debian-4.1-1.0.2.0.tgz (http://www.mellanox.com/downloads/ofed/MLNX_OFED-4.1-1.0.2.0/MLNX_OFED_SRC-debian-4.1-1.0.2.0.tgz)

nv_peer_mem: 1.0.5

 

I have two servers, one of which has a K80 GPU. I want to use perftest to test RDMA and GPUDirect. Following the referenced guide, I installed nv_peer_mem on the server with the K80 GPU.

When I don't use --use_cuda, ib_write_bw works well, but when I use --use_cuda it fails. I printed the error message: ib_write_bw runs into ibv_reg_mr and then gets the error "File exists". If I don't insmod nv_peer_mem, ibv_reg_mr instead gets the error "Bad address".
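
As a side note, assuming those strings come straight from strerror(), "File exists" corresponds to EEXIST and "Bad address" to EFAULT; a tiny sketch that prints the constants:

/* Sketch: print the errno constants behind the two messages (assumes perftest reports strerror(errno)). */
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
    printf("EEXIST (%d): %s\n", EEXIST, strerror(EEXIST)); /* "File exists" - seen with nv_peer_mem loaded    */
    printf("EFAULT (%d): %s\n", EFAULT, strerror(EFAULT)); /* "Bad address" - seen without nv_peer_mem loaded */
    return 0;
}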

 

For background: I had run the same experiment successfully before, with kernel 4.4.0 and MLNX_OFED 4.0-2.0.0.1, and without NVMe over Fabrics installed. Then my workmate installed kernel 4.8.7 and NVMe over Fabrics. Since then, ib_write_bw with --use_cuda has never run correctly.

 

Is there anything wrong with my experiment or environment? And another question: can a single ConnectX-3 support NVMe over Fabrics and GPUDirect RDMA at the same time?

 

 

 

Thanks very much again, and I'm looking forward to your reply.

 

Yours

Haizhu Shao

 

Re: Win server 2016 Switch Embedded Teaming (SET) and SR-IOV

RDS-TOOLS PACKAGE ON MLNX_OFED_LINUX-4.1-1.0.2.0-fc24-x86_64.iso ?


Hello

 

I have installed Mellanox OFED for Linux software version 4.1-1.0.2.0 (MLNX_OFED_LINUX-4.1-1.0.2.0-fc24-x86_64.iso) on Fedora 24, but I cannot find the rds-tools package. Can you tell me where I can find the rds-tools package for OFED Linux software version 4.1-1.0.2.0? I'm looking for this package in order to use the rds-ping and rds-stress tools.

 

Here is the content of my RPMS directory:

 

[root@aigle CDROM]# cd RPMS

 

[root@aigle RPMS]# ll *rds*

ls: cannot access '*rds*': No such file or directory

 

[root@aigle RPMS]# rpm -qa *.rpm | grep rds

[root@aigle RPMS]#

 

[root@aigle RPMS]# ll

total 119846

-r--r--r--. 1 abdel abdel   151642 Jun 27 18:05 ar_mgr-1.0-0.34.g9bd7c9a.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    63946 Jun 27 18:04 cc_mgr-1.0-0.33.g9bd7c9a.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   272414 Jun 27 17:56 dapl-2.1.10mlnx-OFED.3.4.2.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    46998 Jun 27 17:56 dapl-devel-2.1.10mlnx-OFED.3.4.2.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   151326 Jun 27 17:56 dapl-devel-static-2.1.10mlnx-OFED.3.4.2.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   138054 Jun 27 17:56 dapl-utils-2.1.10mlnx-OFED.3.4.2.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    13098 Jun 27 18:04 dump_pr-1.0-0.29.g9bd7c9a.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel  4678882 Jun 27 18:15 hcoll-3.8.1649-1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    84054 Jun 27 17:52 ibacm-41mlnx1-OFED.4.1.0.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel     9918 Jun 27 17:52 ibacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel  1207010 Jun 27 17:42 ibdump-5.0.0-1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    70838 Jun 27 17:52 ibsim-0.6mlnx1-0.8.g9d76581.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel  1801990 Jun 27 18:03 ibutils2-2.1.1-0.91.MLNX20170612.g2e0d52a.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   364950 Jun 27 18:05 infiniband-diags-1.6.7.MLNX20170511.7595646-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    34762 Jun 27 18:05 infiniband-diags-compat-1.6.7.MLNX20170511.7595646-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    27702 Jun 27 18:05 infiniband-diags-guest-1.6.7.MLNX20170511.7595646-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    29574 Jun 27 17:50 iser-4.0-OFED.4.1.1.0.2.1.gc22af88.kver.4.5.5_300.fc24.x86_64.x86_64.rpm

-r--r--r--. 1 abdel abdel    13454 Jun 27 17:50 kernel-mft-4.7.0-41.kver.4.5.5_300.fc24.x86_64.x86_64.rpm

-r--r--r--. 1 abdel abdel    69162 Jun 27 17:49 knem-1.1.2.90mlnx2-OFED.4.0.1.6.3.1.g4faa297.kver.4.5.5_300.fc24.x86_64.x86_64.rpm

-r--r--r--. 1 abdel abdel    22730 Jun 27 17:52 libibcm-41mlnx1-OFED.4.1.0.1.0.41102.i686.rpm

-r--r--r--. 1 abdel abdel    22042 Jun 27 17:52 libibcm-41mlnx1-OFED.4.1.0.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    17662 Jun 27 17:52 libibcm-devel-41mlnx1-OFED.4.1.0.1.0.41102.i686.rpm

-r--r--r--. 1 abdel abdel    17502 Jun 27 17:52 libibcm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    69438 Jun 27 17:52 libibmad-1.3.13.MLNX20170511.267a441-0.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel    68510 Jun 27 17:52 libibmad-1.3.13.MLNX20170511.267a441-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    17262 Jun 27 17:52 libibmad-devel-1.3.13.MLNX20170511.267a441-0.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel    17238 Jun 27 17:52 libibmad-devel-1.3.13.MLNX20170511.267a441-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    36510 Jun 27 17:52 libibmad-static-1.3.13.MLNX20170511.267a441-0.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel    38106 Jun 27 17:52 libibmad-static-1.3.13.MLNX20170511.267a441-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   510193 Jun 27 18:25 libibprof-1.1.41-1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    68994 Jun 27 17:52 libibumad-13.10.2.MLNX20170511.dcc9f7a-0.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel    68658 Jun 27 17:52 libibumad-13.10.2.MLNX20170511.dcc9f7a-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    12830 Jun 27 17:52 libibumad-devel-13.10.2.MLNX20170511.dcc9f7a-0.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel    12814 Jun 27 17:52 libibumad-devel-13.10.2.MLNX20170511.dcc9f7a-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    20014 Jun 27 17:52 libibumad-static-13.10.2.MLNX20170511.dcc9f7a-0.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel    20882 Jun 27 17:52 libibumad-static-13.10.2.MLNX20170511.dcc9f7a-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    91114 Jun 27 17:39 libibverbs-41mlnx1-OFED.4.1.0.1.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel    87870 Jun 27 17:38 libibverbs-41mlnx1-OFED.4.1.0.1.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   175950 Jun 27 17:39 libibverbs-devel-41mlnx1-OFED.4.1.0.1.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel   175918 Jun 27 17:38 libibverbs-devel-41mlnx1-OFED.4.1.0.1.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    51542 Jun 27 17:39 libibverbs-devel-static-41mlnx1-OFED.4.1.0.1.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel    51414 Jun 27 17:38 libibverbs-devel-static-41mlnx1-OFED.4.1.0.1.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   122938 Jun 27 17:39 libibverbs-utils-41mlnx1-OFED.4.1.0.1.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel   124446 Jun 27 17:38 libibverbs-utils-41mlnx1-OFED.4.1.0.1.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    68870 Jun 27 17:50 libmlx4-41mlnx1-OFED.4.1.0.1.0.41102.i686.rpm

-r--r--r--. 1 abdel abdel    66642 Jun 27 17:50 libmlx4-41mlnx1-OFED.4.1.0.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    58786 Jun 27 17:50 libmlx4-devel-41mlnx1-OFED.4.1.0.1.0.41102.i686.rpm

-r--r--r--. 1 abdel abdel    60682 Jun 27 17:50 libmlx4-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   130934 Jun 27 17:51 libmlx5-41mlnx1-OFED.4.1.0.1.5.41102.i686.rpm

-r--r--r--. 1 abdel abdel   128794 Jun 27 17:51 libmlx5-41mlnx1-OFED.4.1.0.1.5.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   129798 Jun 27 17:51 libmlx5-devel-41mlnx1-OFED.4.1.0.1.5.41102.i686.rpm

-r--r--r--. 1 abdel abdel   134494 Jun 27 17:51 libmlx5-devel-41mlnx1-OFED.4.1.0.1.5.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    70494 Jun 27 17:53 librdmacm-41mlnx1-OFED.4.1.0.1.0.41102.i686.rpm

-r--r--r--. 1 abdel abdel    72106 Jun 27 17:53 librdmacm-41mlnx1-OFED.4.1.0.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   119514 Jun 27 17:53 librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.i686.rpm

-r--r--r--. 1 abdel abdel   127738 Jun 27 17:53 librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    84530 Jun 27 17:53 librdmacm-utils-41mlnx1-OFED.4.1.0.1.0.41102.i686.rpm

-r--r--r--. 1 abdel abdel    83722 Jun 27 17:53 librdmacm-utils-41mlnx1-OFED.4.1.0.1.0.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    24714 Jun 27 17:51 librxe-41mlnx1-OFED.4.1.0.1.7.41102.i686.rpm

-r--r--r--. 1 abdel abdel    23854 Jun 27 17:51 librxe-41mlnx1-OFED.4.1.0.1.7.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    12318 Jun 27 17:51 librxe-devel-static-41mlnx1-OFED.4.1.0.1.7.41102.i686.rpm

-r--r--r--. 1 abdel abdel    11982 Jun 27 17:51 librxe-devel-static-41mlnx1-OFED.4.1.0.1.7.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   706362 Jun 27 18:29 libvma-8.3.7-1.x86_64.rpm

-r--r--r--. 1 abdel abdel    15254 Jun 27 18:29 libvma-devel-8.3.7-1.x86_64.rpm

-r--r--r--. 1 abdel abdel    37686 Jun 27 18:29 libvma-utils-8.3.7-1.x86_64.rpm

-r--r--r--. 1 abdel abdel 65966777 Jun 21 09:18 mft-4.7.0-41.x86_64.rpm

-r--r--r--. 1 abdel abdel   114746 Jun 27 18:34 mlnx-ethtool-4.2-1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel 10578239 Jun 27 21:34 mlnx-fw-updater-4.1-1.0.2.0.x86_64.rpm

-r--r--r--. 1 abdel abdel   100742 Jun 27 17:49 mlnx-ofa_kernel-4.1-OFED.4.1.1.0.2.1.gc22af88.fc24.x86_64.rpm

-r--r--r--. 1 abdel abdel  4801566 Jun 27 17:49 mlnx-ofa_kernel-devel-4.1-OFED.4.1.1.0.2.1.gc22af88.fc24.x86_64.rpm

-r--r--r--. 1 abdel abdel   948986 Jun 27 17:49 mlnx-ofa_kernel-modules-4.1-OFED.4.1.1.0.2.1.gc22af88.kver.4.5.5_300.fc24.x86_64.x86_64.rpm

-r--r--r--. 1 abdel abdel     6693 Jun 27 21:33 mlnx-ofed-all-4.5.5-300.fc24.x86_64-4.1-1.0.2.0.noarch.rpm

-r--r--r--. 1 abdel abdel     4835 Jun 27 21:33 mlnx-ofed-basic-4.5.5-300.fc24.x86_64-4.1-1.0.2.0.noarch.rpm

-r--r--r--. 1 abdel abdel    55356 Jun 27 21:33 mlnxofed-docs-4.1-1.0.2.0.noarch.rpm

-r--r--r--. 1 abdel abdel     4869 Jun 27 21:33 mlnx-ofed-dpdk-4.5.5-300.fc24.x86_64-4.1-1.0.2.0.noarch.rpm

-r--r--r--. 1 abdel abdel     6135 Jun 27 21:34 mlnx-ofed-guest-4.5.5-300.fc24.x86_64-4.1-1.0.2.0.noarch.rpm

-r--r--r--. 1 abdel abdel     6226 Jun 27 21:34 mlnx-ofed-hpc-4.5.5-300.fc24.x86_64-4.1-1.0.2.0.noarch.rpm

-r--r--r--. 1 abdel abdel     6263 Jun 27 21:34 mlnx-ofed-hypervisor-4.5.5-300.fc24.x86_64-4.1-1.0.2.0.noarch.rpm

-r--r--r--. 1 abdel abdel     4537 Jun 27 21:34 mlnx-ofed-kernel-only-4.5.5-300.fc24.x86_64-4.1-1.0.2.0.noarch.rpm

-r--r--r--. 1 abdel abdel     6541 Jun 27 21:34 mlnx-ofed-vma-4.5.5-300.fc24.x86_64-4.1-1.0.2.0.noarch.rpm

-r--r--r--. 1 abdel abdel     6245 Jun 27 21:34 mlnx-ofed-vma-eth-4.5.5-300.fc24.x86_64-4.1-1.0.2.0.noarch.rpm

-r--r--r--. 1 abdel abdel     6577 Jun 27 21:34 mlnx-ofed-vma-vpi-4.5.5-300.fc24.x86_64-4.1-1.0.2.0.noarch.rpm

-r--r--r--. 1 abdel abdel    28750 Jun 27 18:10 mpi-selector-1.0.3-1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   448546 Jun 27 18:35 mpitests_openmpi-3.2.19-acade41.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   991474 Jun 27 17:40 mstflint-4.7.0-1.6.g26037b7.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel  3782129 Jun 27 18:07 mxm-3.6.3102-1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    45270 Jun 27 17:38 ofed-scripts-4.1-OFED.4.1.1.0.2.x86_64.rpm

-r--r--r--. 1 abdel abdel 13966862 Jun 27 18:25 openmpi-2.1.2a1-1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   818378 Jun 27 17:55 opensm-4.9.0.MLNX20170607.280b8f7-0.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel   824930 Jun 27 17:54 opensm-4.9.0.MLNX20170607.280b8f7-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   228774 Jun 27 17:55 opensm-devel-4.9.0.MLNX20170607.280b8f7-0.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel   228778 Jun 27 17:54 opensm-devel-4.9.0.MLNX20170607.280b8f7-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    76098 Jun 27 17:55 opensm-libs-4.9.0.MLNX20170607.280b8f7-0.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel    71238 Jun 27 17:54 opensm-libs-4.9.0.MLNX20170607.280b8f7-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    68354 Jun 27 17:55 opensm-static-4.9.0.MLNX20170607.280b8f7-0.1.41102.i686.rpm

-r--r--r--. 1 abdel abdel    67522 Jun 27 17:54 opensm-static-4.9.0.MLNX20170607.280b8f7-0.1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   254870 Jun 27 17:57 perftest-4.1-0.4.g16dbf63.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    63946 Jun 27 18:05 qperf-0.4.9-9.41102.x86_64.rpm

dr-xr-xr-x. 2 abdel abdel     2048 Jun 27 21:34 repodata

-r--r--r--. 1 abdel abdel  1674339 Jun 27 18:10 sharp-1.3.1.MLNX20170625.859dc24-1.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel   857178 Jun 27 18:33 sockperf-3.1-14.gita9f6056282ef.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel    40270 Jun 27 17:50 srp-4.0-OFED.4.1.1.0.2.1.gc22af88.kver.4.5.5_300.fc24.x86_64.x86_64.rpm

-r--r--r--. 1 abdel abdel    51066 Jun 27 17:57 srptools-41mlnx1-4.41102.x86_64.rpm

-r--r--r--. 1 abdel abdel  2156449 Jun 27 18:08 ucx-1.2.2947-1.41102.x86_64.rpm

 

thanks

Re: ConnectX-3 Pro connecting at 10g instead of 40g


The cable was the issue. I got a cable from the list provided, did the loopback test, and the speed went up to 40G. It also increased the speed to 40G between the Windows and VMware servers.

This is the cable I used: link

Thank you.

How to test RDMA traffic congestion


Hi. We're trying to debug issues we see periodically with Lustre Networking on top of CX-3 and CX-4 based RoCE (v1) fabrics, using SR-IOV for connections from Lustre clients running as KVM guests (the servers are bare-metal). When we hit these errors we see drop/error counters going up on the hosts.

 

So far all simple IB tests between host pairs look OK; now we want to test congestion scenarios, e.g. two hosts sending to one host. However, we've discovered that while ib_write_bw has an option to specify more than one QP, it doesn't actually support it! Is there a simple way to engineer such a test, or are we going to have to write something ourselves or move to an MPI-based test suite...?


Question about ESXi 6.5 iSER driver with PFC port configuration


Hi!

I saw that a Mellanox iSER driver for ESXi 6.5 has been released.

But this driver seems to require Global Pause configuration on each port, as described in the driver's manual.

 

Is there any solution using PFC configuration on the switch ports instead?

 

Best regards,

 

P.S. Why does the iSER storage adapter disappear after every ESXi 6.5 host reboot?

 

P.S. 2: This Global Pause based iSER initiator can't connect to an SCST Ethernet iSER target.

How can I resolve this problem?

Tested iSER targets:

01. LIO

02. SCST

03. StarWind vSAN

LAG problems


My team is currently standing up a new cluster that has an SN2700 core Ethernet switch on our boot network. LAG links are working fine between this core and the leaf switches in the new cluster. We also have an older cluster with an SX1036 Ethernet switch serving as its core switch. LAG links are also working fine between that older core switch and the older leaf switches in that cluster. Several of us have tried to get LAG working between the SX1036 and the SN2700, and we can't get a working link (a single link works fine). We've done typical troubleshooting, looking for bad cables/ports, etc. We can find no differences when comparing the configuration and status of working LAG links and the failing link.

 

The SX1036 is a PPC switch and is running a much older firmware:

 

Product name:      MLNX-OS

Product release:   3.4.3002

Build ID:          #1-dev

Build date:        2015-07-30 20:13:15

Target arch:       ppc

Target hw:         m460ex

Built by:          jenkins@fit74

Version summary:   PPC_M460EX 3.4.3002 2015-07-30 20:13:15 ppc

 

Product model:     ppc

 

than the SN2700 (X86):

 

Product name:      MLNX-OS

Product release:   3.6.3200

Build ID:          #1-dev

Build date:        2017-03-09 17:55:58

Target arch:       x86_64

Target hw:         x86_64

Built by:          jenkins@e3f42965d5ee

Version summary:   X86_64 3.6.3200 2017-03-09 17:55:58 x86_64

 

Product model:     x86onie

 

The obvious thing to try is updating the firmware on the SX1036, but this cluster is in production and our team is nervous about messing with that core switch, as it's pretty critical to our infrastructure. Would a firmware mismatch cause this behavior?

 

I have seen documentation indicating that MLAG doesn't work between PPC and X86 switches.  I sure hope that's not the case for LAG...

iSER driver for ESXi 6.5 - does it support ConnectX-3 or not?


I saw your iSER driver 1.0.0.1 release notes.

 

 

nmlx4-core 3.16.0.0 is the ESXi 6.5 inbox driver.

After installing this iSER driver 1.0.0.1 for ESXi 6.5, is there any more configuration needed?

Issues with setting up Storage Spaces Direct


Hello everyone,

I am working on setting up an S2D cluster and have run into an issue where I am unable to get my nodes to communicate via RDMA. I have used the Test-RDMA.ps1 script and DISKSPD provided in another post in the Mellanox Community.

Here is my hardware configuration:

Configuration: 4 nodes, each configured as follows:

Hardware:
  Intel R2224WTTYSR server systems
  256GB Samsung DDR4 LRDIMMs
  2x Intel E5-2620 v4 Xeon CPUs
  1x Mellanox ConnectX-4 - MCX414A-BCAT
  1x Broadcom LSI 3805-24i HBA
  2x Intel DC P3700 800GB for journal/cache drives
  4x Seagate 2TB SAS HDDs for capacity drives

Networking:
  1x Netgear 10GbE network switch for VMs
  2x Mellanox SX1012 12-port QSFP28 switches for RDMA/cluster traffic
  8x MC2210128-003 Mellanox LinkX cables

We are not utilizing SET teams and are only using the ConnectX-4 NICs for RoCE storage traffic.

 

All nodes are set up with this configuration for the RDMA-enabled NICs:

Attached is the configuration of my Mellanox SX1012 switch.

 

Any help pointing me in the right direction is very much appreciated.

 

Thanks!

ceph + rdma error: ibv_open_device failed


I followed this doc:

Bring Up Ceph RDMA - Developer's Guide

But the mon could not start, failing with this error:

7f5acb890700 -1 Infiniband Device open rdma device failed. (2) No such file or directory

 

I checked the ceph code:

116  name = ibv_get_device_name(device);

117  ctxt = ibv_open_device(device);

118  if (ctxt == NULL) {

119    lderr(cct) << __func__ << " open rdma device failed. " << cpp_strerror(errno) << dendl;

120    ceph_abort();

121  }

 

Then the gdb info:

Breakpoint 1, Device::Device (this=0x55555f3597e0, cct=0x55555eec01c0, d=<optimized out>)

    at /usr/src/debug/ceph-12.2.0/src/msg/async/rdma/Infiniband.cc:116

116  name = ibv_get_device_name(device);

$7 = {ops = {alloc_context = 0x0, free_context = 0x0}, node_type = IBV_NODE_CA, transport_type = IBV_TRANSPORT_IB,    <-- output of: print *(struct ibv_device *) device

  name = "mlx4_0", '\000' <repeats 57 times>, dev_name = "uverbs0", '\000' <repeats 56 times>,

  dev_path = "/sys/class/infiniband_verbs/uverbs0", '\000' <repeats 220 times>,

  ibdev_path = "/sys/class/infiniband/mlx4_0", '\000' <repeats 227 times>}

 

Breakpoint 2, Device::Device (this=0x55555f3597e0, cct=0x55555eec01c0, d=<optimized out>)

    at /usr/src/debug/ceph-12.2.0/src/msg/async/rdma/Infiniband.cc:117

117  ctxt = ibv_open_device(device);

Cannot access memory at address 0x646f6e2f305f3478    <-- output of: print *(struct CephContext *) ctxt

 

It seems that ibv_open_device failed.

 

# ibstat

CA 'mlx4_0'

CA type: MT26428

Number of ports: 1

Firmware version: 2.9.1000

Hardware version: b0

Node GUID: 0x0002c90300589efc

System image GUID: 0x0002c90300589eff

Port 1:

State: Active

Physical state: LinkUp

Rate: 40

Base lid: 33

LMC: 0

SM lid: 23

Capability mask: 0x0251086a

Port GUID: 0x0002c90300589efd

Link layer: InfiniBand

 

Is there any problem with the data in struct ibv_device?
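
To narrow this down, here is a minimal standalone sketch (my own test idea, not taken from the Ceph guide) that enumerates the verbs devices and calls ibv_open_device directly, printing errno on failure. If this also fails with (2) "No such file or directory", the problem is probably below ceph (e.g. the /dev/infiniband/uverbs* nodes or their permissions); if it succeeds, the problem is more likely inside ceph's RDMA messenger:

/* Sketch: open every verbs device directly, outside ceph, and report errno on failure. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) {
        fprintf(stderr, "ibv_get_device_list failed: %s\n", strerror(errno));
        return 1;
    }

    for (int i = 0; i < num; ++i) {
        const char *name = ibv_get_device_name(devs[i]);
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx) {
            fprintf(stderr, "ibv_open_device(%s) failed: errno=%d (%s)\n",
                    name, errno, strerror(errno));
            continue;
        }
        printf("opened %s OK (dev_path %s)\n", name, devs[i]->dev_path);
        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}

I compile it with gcc and link with -libverbs.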
