Hi Mikyung,
Are you sure your issue is not related to other virtualisation factors? E.g., are you pinning your VMs to CPU cores and exposing the host NUMA topology to them? If your VMs have memory accesses that cross NUMA nodes (e.g., need to cross QPI), that would explain why performance degrades as the message size increases: the benefit of the CPU caches shrinks and the cost becomes dominated by memory access.
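If it helps, here is a rough sketch of one way to check whether a set of pinned cores spans more than one NUMA node. It assumes a Linux host that exposes /sys/devices/system/node/node*/cpulist. The function names and the example core layout are made up for illustration, and you would need to supply the real pinning from your hypervisor yourself. It only looks at CPU placement, not at where the guest memory is actually allocated.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def host_numa_nodes(sysfs="/sys/devices/system/node"):
    """Read {node_id: set_of_cpus} from sysfs (assumes a Linux host)."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs, "node[0-9]*")):
        node_id = int(re.sub(r"\D", "", os.path.basename(path)))
        with open(os.path.join(path, "cpulist")) as f:
            nodes[node_id] = parse_cpulist(f.read())
    return nodes


def nodes_spanned(pinned_cpus, nodes):
    """Return the sorted NUMA node ids that contain at least one pinned CPU."""
    return sorted(n for n, cpus in nodes.items() if cpus & set(pinned_cpus))


if __name__ == "__main__":
    # Hypothetical two-socket layout; replace with host_numa_nodes() on a real host.
    example_nodes = {0: parse_cpulist("0-7"), 1: parse_cpulist("8-15")}
    print(nodes_spanned({2, 3}, example_nodes))  # [0]
    print(nodes_spanned({6, 9}, example_nodes))  # [0, 1]
```

If the result lists more than one node for a single VM, its vCPUs straddle sockets and cross-node memory traffic is at least possible.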
Good luck!