Dear all,
I have a Connect-IB adapter, firmware version 10.10.5020, connected via a PCIe switch to a host running CentOS 7 with the Mellanox drivers MLNX_OFED_LINUX-2.4-1.0.4-rhel7.0-x86_64.
Just after boot, I see the following messages:
localhost kernel: mlx5_core 0000:03:00.0: device's health compromised
localhost kernel: mlx5_core 0000:03:00.0: assert_var[0] 0x0000007a
localhost kernel: mlx5_core 0000:03:00.0: assert_var[1] 0x0000006e
localhost kernel: mlx5_core 0000:03:00.0: assert_var[2] 0x00000000
localhost kernel: mlx5_core 0000:03:00.0: assert_var[3] 0x00000000
localhost kernel: mlx5_core 0000:03:00.0: assert_var[4] 0x00000000
localhost kernel: mlx5_core 0000:03:00.0: assert_exit_ptr 0x006a013c
localhost kernel: mlx5_core 0000:03:00.0: assert_callra 0x006a0c9c
localhost kernel: mlx5_core 0000:03:00.0: fw_ver 0xa00a139c
localhost kernel: mlx5_core 0000:03:00.0: hw_id 0x000001ff
localhost kernel: mlx5_core 0000:03:00.0: irisc_index 0
localhost kernel: mlx5_core 0000:03:00.0: synd 0x10: High temprature
localhost kernel: mlx5_core 0000:03:00.0: ext_synd 0x0000
localhost kernel: mlx5_core 0000:03:00.0: handling bad device here
PCIe device 03:00.0 is the Connect-IB card.
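In case it helps anyone read the dump, here is a minimal Python sketch that pulls the numeric fields out of these log lines. The function names are my own. The 4/12/16-bit split of fw_ver is only an assumption on my part: it happens to reproduce the firmware version 10.10.5020 quoted above, but I have not confirmed it against any driver documentation.

```python
import re

# Matches lines such as
# "localhost kernel: mlx5_core 0000:03:00.0: fw_ver 0xa00a139c".
# Lines without a numeric value (e.g. "handling bad device here") are skipped.
LINE_RE = re.compile(
    r"mlx5_core (?P<bdf>\S+): (?P<key>[\w\[\]]+) (?P<value>0x[0-9a-fA-F]+|\d+)"
)

# Three lines copied from the dump above, used as sample input.
SAMPLE = """\
localhost kernel: mlx5_core 0000:03:00.0: device's health compromised
localhost kernel: mlx5_core 0000:03:00.0: fw_ver 0xa00a139c
localhost kernel: mlx5_core 0000:03:00.0: synd 0x10: High temprature
"""


def parse_health_dump(text):
    """Return {field name: integer value} for every numeric field in the log."""
    fields = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        raw = m.group("value")
        base = 16 if raw.lower().startswith("0x") else 10
        fields[m.group("key")] = int(raw, base)
    return fields


def decode_fw_ver(raw):
    """Split fw_ver into major.minor.subminor.

    ASSUMPTION: 4 bits major, 12 bits minor, 16 bits subminor. This layout is
    a guess that matches the 10.10.5020 firmware reported for this card.
    """
    major = (raw >> 28) & 0xF
    minor = (raw >> 16) & 0xFFF
    subminor = raw & 0xFFFF
    return f"{major}.{minor}.{subminor}"
```

Calling decode_fw_ver(parse_health_dump(SAMPLE)["fw_ver"]) on the sample gives a string that matches the installed firmware version, which at least suggests the dump comes from the running firmware and not from stale data.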
The system has run without problems so far, and the ports link up with both QDR and FDR cables.
However, I am worried about the health of the system.
Does anybody know specifically what the messages reported above mean?
Many thanks.