But from earlier comments, it looks like nv_peer_mem is loaded on vega2.
Actually, for some time it did not even load on vega2, as you can see in post #1 of this thread. At some point it did load and showed up in the lsmod output (post #3), but the nv_peer_mem service has always failed to start. The fact that insmod sometimes works and sometimes does not is quite strange in itself, isn't it?
Also check the contents of the /etc/modprobe.d folder on both systems.
It is exactly the same on both systems:
[root@vega2 boot]# ll /etc/modprobe.d
total 44
-rw-r--r--. 1 root root 52 Dec 2 2015 anaconda.conf
-rw-r--r--. 1 root root 884 Oct 16 2014 blacklist.conf
-rw-r--r--. 1 root root 382 Oct 15 2014 dist-alsa.conf
-rw-r--r--. 1 root root 5596 Oct 15 2014 dist.conf
-rw-r--r--. 1 root root 473 Oct 15 2014 dist-oss.conf
-rw-r--r--. 1 root root 26 Dec 2 2015 ib_ipoib.conf
-rw-r--r--. 1 root root 46 Jul 13 2015 ib_sdp.conf
-rw-r--r--. 1 root root 49 Jul 13 2015 mlnx.conf
-rw-r--r--. 1 root root 76 Dec 3 2015 nvidia-installer-disable-nouveau.conf
-rw-r--r--. 1 root root 30 Oct 10 2009 openfwwf.conf
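The listing only shows names, sizes and dates. To confirm that the file contents match as well, a small sketch like the one below could be run on each host and the two outputs compared with diff. The function name checksum_dir and the /tmp output paths are only illustrative.

```shell
# Print "md5  filename" for every regular file directly inside a directory.
# The glob expands in sorted order, so the output of two hosts can be diffed.
# Hidden files (names starting with a dot) are not matched by the glob.
checksum_dir() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        sum=$(md5sum < "$f") || return 1
        # md5sum prints "<hash>  -" when reading stdin; keep only the hash.
        printf '%s  %s\n' "${sum%% *}" "$(basename "$f")"
    done
}

# Possible usage (run on vega1 and on vega2, then diff the two files):
#   checksum_dir /etc/modprobe.d > "/tmp/modprobe.$(hostname).md5"
```

If diff prints nothing for the two result files, the directory contents are identical and not just the names.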
Run bash -x /etc/init.d/nv_peer_mem start and see where it fails and what is in dmesg.
I already checked that: the modprobe command fails, and a few lines about a duplicate symbol appear in dmesg:
[root@vega2 boot]# bash -x /etc/init.d/nv_peer_mem start
+ CONFIG=/etc/infiniband/nv_peer_mem.conf
+ modname=nv_peer_mem
+ reqmods='ib_core nvidia'
+ '[' '!' -f /etc/infiniband/nv_peer_mem.conf ']'
+ . /etc/infiniband/nv_peer_mem.conf
++ ONBOOT=yes
++ pwd
+ CWD=/boot
+ cd /etc/infiniband
++ pwd
+ WD=/etc/infiniband
+ modprobe=/sbin/modprobe
+ /sbin/modprobe -c
+ grep -q '^allow_unsupported_modules *0'
+ ACTION=start
+ shift
+ '[' Xyes '!=' Xyes ']'
+ RC=0
+ case $ACTION in
+ start
+ local RC=0
+ echo -n 'starting... '
starting... + for mod in '$reqmods'
+ is_module ib_core
+ local RC
+ /sbin/lsmod
+ grep -w ib_core
+ RC=0
+ return 0
+ continue
+ for mod in '$reqmods'
+ is_module nvidia
+ local RC
+ grep -w nvidia
+ /sbin/lsmod
+ RC=0
+ return 0
+ continue
+ load_module nv_peer_mem
+ local module=nv_peer_mem
++ modinfo nv_peer_mem
++ grep filename
++ awk '{print $NF}'
+ filename=/lib/modules/2.6.32-504.el6.x86_64/extra/nv_peer_mem.ko
+ '[' '!' -n /lib/modules/2.6.32-504.el6.x86_64/extra/nv_peer_mem.ko ']'
+ /sbin/modprobe nv_peer_mem
FATAL: Error inserting nv_peer_mem (/lib/modules/2.6.32-504.el6.x86_64/extra/nv_peer_mem.ko): Invalid module format
+ RC=1
+ '[' 1 -eq 0 ']'
+ echo 'Failed to load nv_peer_mem'
Failed to load nv_peer_mem
+ log_msg 'Failed to load nv_peer_mem'
+ logger -i 'nv_peer_mem: Failed to load nv_peer_mem'
+ return 1
+ RC=1
+ exit 1
[root@vega2 boot]# dmesg
...
nvidia 0000:03:00.0: irq 164 for MSI/MSI-X
nvidia_uvm: Unregistered the UVM driver
nvidia 0000:03:00.0: PCI INT A disabled
nvidia 0000:03:00.0: PCI INT A -> GSI 32 (level, low) -> IRQ 32
nvidia 0000:03:00.0: setting latency timer to 64
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 352.39 Fri Aug 14 18:09:10 PDT 2015
nvidia_uvm: Loaded the UVM driver, major device number 245
nv_p2p_dummy: exports duplicate symbol nvidia_p2p_free_page_table (owned by nvidia)
nv_p2p_dummy: exports duplicate symbol nvidia_p2p_free_page_table (owned by nvidia)
nvidia 0000:03:00.0: irq 164 for MSI/MSI-X
nvidia 0000:03:00.0: irq 164 for MSI/MSI-X
nvidia 0000:03:00.0: irq 164 for MSI/MSI-X
nvidia 0000:03:00.0: irq 164 for MSI/MSI-X
nvidia 0000:03:00.0: irq 164 for MSI/MSI-X
nvidia 0000:03:00.0: irq 164 for MSI/MSI-X
nv_p2p_dummy: exports duplicate symbol nvidia_p2p_free_page_table (owned by nvidia)
nv_p2p_dummy: exports duplicate symbol nvidia_p2p_free_page_table (owned by nvidia)
nv_p2p_dummy: exports duplicate symbol nvidia_p2p_free_page_table (owned by nvidia)
ip_tables: (C) 2000-2006 Netfilter Core Team
nv_p2p_dummy: exports duplicate symbol nvidia_p2p_free_page_table (owned by nvidia)
nv_p2p_dummy: exports duplicate symbol nvidia_p2p_free_page_table (owned by nvidia)
nv_p2p_dummy: exports duplicate symbol nvidia_p2p_free_page_table (owned by nvidia)
nv_p2p_dummy: exports duplicate symbol nvidia_p2p_free_page_table (owned by nvidia)
nv_p2p_dummy: exports duplicate symbol nvidia_p2p_free_page_table (owned by nvidia)
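The dmesg lines show nv_p2p_dummy trying to export a symbol that nvidia already owns. One way to check whether modprobe pulls nv_p2p_dummy in as a dependency of nv_peer_mem is to look the module up in modules.dep. The sketch below assumes the usual "path/module.ko: dep1 dep2" layout with uncompressed .ko files; deps_of is a hypothetical helper name, and whether nv_p2p_dummy is really in the dependency chain is exactly what it is meant to find out.

```shell
# Print the dependency list recorded for a module in a modules.dep-style file.
# $1 = module name without ".ko", $2 = path to the modules.dep file.
# Prints an empty line if the module is listed with no dependencies,
# and nothing at all if the module is not listed.
deps_of() {
    awk -F': *' -v m="$1" '
        {
            n = split($1, parts, "/")          # last path component is the file name
            if (parts[n] == (m ".ko")) print $2
        }
    ' "$2"
}

# Possible usage on vega2:
#   deps_of nv_peer_mem "/lib/modules/$(uname -r)/modules.dep"
#   modinfo -F depends nv_peer_mem     # dependencies recorded in the module itself
#   modinfo -F vermagic nv_peer_mem    # kernel the module was built for
```

Running the same commands on vega1 and comparing the results would show whether the two hosts resolve the module differently.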
Be sure you are using exactly the same kernel version; you can run md5sum on the vmlinuz and initramfs images.
That is a good idea, I had not thought of it! In fact the initramfs file differs, while the other files are identical. What could be the cause? Could this be the issue? Here is the output:
[root@vega1 boot]# md5sum vmlinuz-2.6.32-504.el6.x86_64
0805f85b126ebc6adf84b6ead56a080b vmlinuz-2.6.32-504.el6.x86_64
[root@vega1 boot]# md5sum initramfs-2.6.32-504.el6.x86_64.img
744d9c3ae08cb795e1b2142250d51c74 initramfs-2.6.32-504.el6.x86_64.img
[root@vega1 boot]# md5sum System.map-2.6.32-504.el6.x86_64
f9fda70c10eb7a2e3bedac7c73606519 System.map-2.6.32-504.el6.x86_64
[root@vega2 boot]# md5sum vmlinuz-2.6.32-504.el6.x86_64
0805f85b126ebc6adf84b6ead56a080b vmlinuz-2.6.32-504.el6.x86_64
[root@vega2 boot]# md5sum initramfs-2.6.32-504.el6.x86_64.img
df9afba8ad789256ccec2e715f514d02 initramfs-2.6.32-504.el6.x86_64.img
[root@vega2 boot]# md5sum System.map-2.6.32-504.el6.x86_64
f9fda70c10eb7a2e3bedac7c73606519 System.map-2.6.32-504.el6.x86_64
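Since an initramfs image is generated locally on each host, a different checksum alone may not mean the contents differ. To see whether the two images really contain different files, their file listings could be compared. The sketch below assumes the images are gzip-compressed cpio archives (an assumption, not verified on these hosts); only_in_one is a hypothetical helper name.

```shell
# Print the entries that appear in only one of two listings (one path per line).
# Lines found only in the first file are printed as-is; lines found only in
# the second file are prefixed with a tab (standard "comm -3" output).
only_in_one() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    sort -u "$1" > "$tmp_a"
    sort -u "$2" > "$tmp_b"
    comm -3 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Possible usage: create a listing on each host, copy one over, then compare.
#   zcat /boot/initramfs-2.6.32-504.el6.x86_64.img | cpio -it > "/tmp/initramfs.$(hostname).list"
#   only_in_one /tmp/initramfs.vega1.list /tmp/initramfs.vega2.list
```

An empty result would mean both images contain the same set of file names, although individual files could still differ in content.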
Thanks and bye,
Stefano