Hi
I have the same issue in a 9-node cluster with almost only multicast traffic. Two of the nodes suffer from
a high number of rx_dropped (and the same number of rx_over_errors). On both of these nodes I can see that CPU1 is running at 100%, but with 0% in user and system mode, i.e. it is 100% occupied servicing interrupts.
Thus I suspect non-optimal interrupt coalescing;
ethtool -c says Adaptive RX: on.
Should I try using manual configuration instead? Any suggested values?
Any other suggestions?
The strange thing is that the other 7 nodes seem to be coping with the same load without problems.
I have checked the PCI affinity and it seems OK: the ConnectX-3 card is on the PCI bus connected to the same socket as CPU1.
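
To double-check which CPU actually services the card's interrupts, something like the following could be used. It is only a rough sketch: it parses text in the /proc/interrupts format and sums the counts per CPU for the lines whose description contains a given substring. The substring "mlx4" and the sample numbers are assumptions for illustration, not output from my nodes.

```python
def irq_counts_per_cpu(text, match):
    """Sum interrupt counts per CPU for lines containing `match`.

    `text` is expected to look like the contents of /proc/interrupts:
    a header line naming the CPUs, then one line per IRQ.
    """
    lines = text.strip().splitlines()
    cpus = lines[0].split()              # header: CPU0 CPU1 ...
    totals = {cpu: 0 for cpu in cpus}
    for line in lines[1:]:
        if match not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ label ("64:"), then one count per CPU.
        counts = fields[1:1 + len(cpus)]
        for cpu, count in zip(cpus, counts):
            if count.isdigit():
                totals[cpu] += int(count)
    return totals


# Made-up sample in /proc/interrupts layout (hypothetical values).
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  64:         10    9000000          0          5   PCI-MSI-edge      mlx4-comp-0
  65:          0    1000000          2          0   PCI-MSI-edge      mlx4-comp-1
  70:        500          0        300          0   PCI-MSI-edge      eth0
"""

if __name__ == "__main__":
    # On a real node, pass open("/proc/interrupts").read() instead of SAMPLE.
    print(irq_counts_per_cpu(SAMPLE, "mlx4"))
```

If nearly all of the counts land on a single CPU, as in the made-up sample, that would match the picture of one core being saturated by interrupts.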
One data point is that I have more receivers on some of the nodes. Could that affect the issue?
The Mellanox picture describes rx_over_errors as relating to the hardware buffer on the card,
so I cannot quite see why the number of consumers should matter...