Hi alkx,
What happens if host1 is rebooted?
Host1 has been rebooted several times, and nv_peer_mem service still starts successfully every time.
Are you able to unload and load the nvidia modules on host1?
Yes. On Host1 I first unloaded (rmmod) nv_peer_mem and nvidia_uvm, then the nvidia module. I then loaded (insmod) all three modules again and started nv_peer_mem, which worked.
On Host2 I performed the same steps. The nv_peer_mem module loads with insmod, but starting it with the "service" command fails:
[root@vega2 ~]# insmod /lib/modules/2.6.32-504.el6.x86_64/extra/nv_peer_mem.ko
[root@vega2 ~]#
[root@vega2 ~]# lsmod |grep nvidia
nvidia_uvm 71579 0
nvidia 8594822 2 nv_peer_mem,nvidia_uvm
i2c_core 29964 5 nvidia,nouveau,drm_kms_helper,drm,i2c_algo_bit
[root@vega2 ~]#
[root@vega2 ~]# service nv_peer_mem status
nv_peer_mem module is loaded.
[root@vega2 ~]# service nv_peer_mem start
starting... FATAL: Error inserting nv_peer_mem (/lib/modules/2.6.32-504.el6.x86_64/extra/nv_peer_mem.ko): Invalid module format
Failed to load nv_peer_mem
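For reference, one check I know of for "Invalid module format" is to compare the kernel release recorded in the module's vermagic with the running kernel, and to read the kernel log right after the failed insert. A minimal sketch follows; vermagic_matches is a hypothetical helper name of mine, and I am assuming the first word of vermagic is the kernel release the module was built for:

```shell
# Hypothetical helper: succeed when the kernel release embedded in a module's
# vermagic string (its first word) equals the given kernel release.
vermagic_matches() {
    vermagic=$1
    kernel=$2
    # Keep only the first whitespace-separated word of the vermagic string.
    module_kernel=${vermagic%% *}
    [ "$module_kernel" = "$kernel" ]
}

# Example use on a host (commented out so the snippet has no side effects):
# vermagic_matches "$(modinfo -F vermagic nv_peer_mem)" "$(uname -r)" \
#     || echo "nv_peer_mem was built for a different kernel"
# dmesg | tail    # the kernel log usually states why the insert was rejected
```

A mismatch here would only be one possible cause; the dmesg output is the more direct source for the actual reason.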
Did you try to recompile the nv_peer_mem module?
Yes, I did, for both versions (from the source RPM, as stated in the instructions): the original 1.0.0, which is also installed on Host1, and the new 1.0.1 available on the Mellanox web site. The result didn't change.
What is the output of the "modinfo <MODULE> | grep srcversion" command for each related module on the hosts involved?
Same output for both hosts:
[root@vega2 ~]# modinfo nv_peer_mem |grep srcversion
srcversion: CE9FD4B496BA8CAF40A6E95
[root@vega2 ~]#
[root@vega2 ~]# modinfo nvidia |grep srcversion
[root@vega2 ~]#
[root@vega2 ~]# modinfo nvidia_uvm |grep srcversion
srcversion: A347F556C35EE8E88DF9DEB
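To make the comparison between the two hosts less error-prone than reading the values by eye, something like the following could be run on each host and the results diffed. It is only a sketch; srcversion_of is a hypothetical helper name, and it prints "(none)" when a module carries no srcversion field, as with the nvidia module above:

```shell
# Hypothetical helper: read `modinfo` output on stdin and print the srcversion,
# or "(none)" when the field is absent.
srcversion_of() {
    awk '$1 == "srcversion:" { print $2; found = 1 }
         END { if (!found) print "(none)" }'
}

# Example use on each host, then diff the two result files:
# for m in nv_peer_mem nvidia nvidia_uvm; do
#     printf '%s %s\n' "$m" "$(modinfo "$m" | srcversion_of)"
# done > "/tmp/srcversion.$(hostname)"
```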
Did you check the timestamps of the related modules? Maybe one of them has a fresh timestamp?
Only nv_peer_mem on Host2, because I recompiled it. All the others have their original timestamps.
Thanks for all the hints. I have no idea where else to look.
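The timestamp check can also be done mechanically by listing every module file that is newer than some reference file. This is a sketch only; modules_newer_than is a hypothetical helper name, and using the kernel image under /boot as the reference is my assumption, not something from the instructions:

```shell
# Hypothetical helper: list .ko files under directory $1 whose modification
# time is strictly newer than that of reference file $2.
modules_newer_than() {
    find "$1" -type f -name '*.ko' -newer "$2"
}

# Example use (commented out; the reference file is an assumption):
# modules_newer_than "/lib/modules/$(uname -r)" "/boot/vmlinuz-$(uname -r)"
```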
Stefano