We’re experimenting with our RDMA software, built on the libibverbs library, and have run into a puzzling problem.
Our 19-node cluster is unusual: it has two Mellanox 100Gbps switches side by side, one for InfiniBand and one for RoCE. This lets us run identical software on RDMA/IB and then again on RDMA/RoCE.
What we are seeing is that the two fabrics behave identically with the test suite (ib_send_bw, for example), but with our software we get 97Gbps on RDMA/IB and only about 25Gbps on RDMA/RoCE. We’ve pinned this down to long send pauses: the transfer rate is quite high, then suddenly both the sender and the receiver see about half a second of delay, and then the high-rate transfers resume. We’ve observed the identical problem with two different software systems we’re developing at Cornell, both of which work well and have been used fairly widely on RDMA/IB setups.
The pause problem is easy to reproduce under the following conditions:
1) The sender and receiver use RDMA/RoCE to transfer data.
2) The sender and receiver hosts are connected through the Mellanox SN2700 switch (NOT back-to-back).
3) The sender posts one RDMA request at a time (ibv_post_send()) and waits for the completion/ACK (ibv_poll_cq()) before posting the next request. Under these conditions, with a low but non-negligible probability (roughly 1%), we see a long delay before ibv_poll_cq() returns; a minimal sketch of this post-then-poll pattern is shown right below. The length of the delay depends on the timer settings of the queue pair.
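To make condition (3) concrete, here is a minimal sketch of the post-then-poll loop. The function name send_one_and_wait and the qp, cq, mr, buf, and len arguments are placeholders of ours for resources created elsewhere; this illustrates only the posting discipline, not our full test code.
===========================================
#include <infiniband/verbs.h>
#include <stdint.h>

static int send_one_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                             struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,   /* we need a completion to poll for */
    };
    struct ibv_send_wr *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Busy-poll until the completion arrives; this is where the occasional
     * long stall (~500 ms, or 5~8 s with larger timers) shows up on RoCE
     * through the switch. */
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;
    return 0;
}
===========================================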
We saw ~500ms delays with the following settings:
===========================================
static int modify_qp_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 4;  // 65.536 usec
    attr.retry_cnt     = 6;
    attr.rnr_retry     = 6;
    attr.sq_psn        = 0;
    attr.max_rd_atomic = 1;
    flags = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
            IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;
    rc = ibv_modify_qp(qp, &attr, flags);
    if (rc)
        fprintf(stderr, "failed to modify QP state to RTS. ERRNO=%d\n", rc);
    return rc;
}
===========================================
and 5~8 second delays with the following settings:
===========================================
static int modify_qp_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 20;  // 4294967 usec
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.sq_psn        = 0;
    attr.max_rd_atomic = 1;
    flags = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
            IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;
    rc = ibv_modify_qp(qp, &attr, flags);
    if (rc)
        fprintf(stderr, "failed to modify QP state to RTS. ERRNO=%d\n", rc);
    return rc;
}
===========================================
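As a side note on where the numbers in those comments come from: the verbs timeout field is an exponent, and the local ACK timeout it selects is 4.096 usec * 2^timeout. The tiny helper below is just our illustration of that arithmetic, not part of the test code.
===========================================
#include <stdio.h>

/* The QP "timeout" attribute is an exponent:
 * local ACK timeout = 4.096 usec * 2^timeout. */
static double qp_timeout_usec(unsigned int timeout_exp)
{
    return 4.096 * (double)(1u << timeout_exp);
}

int main(void)
{
    printf("timeout=4  -> %.3f usec\n", qp_timeout_usec(4));   /* 65.536 usec */
    printf("timeout=20 -> %.0f usec\n", qp_timeout_usec(20));  /* ~4294967 usec, ~4.3 s */
    return 0;
}
===========================================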
Note that we don't see the same problem if
1) we use a direct back-to-back connection (the sender host is connected to the receiver host by one cable without a switch), or
2) the sender posts more than one request at a time (see the sketch after this list), or
3) we test with the "ib_send_bw" tool, even with the tx and rx depth set to 1.
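For completeness, condition (2) refers to a pattern like the following sketch, in which a small chain of work requests is posted in a single ibv_post_send() call so that several requests are in flight at once (again, qp, mr, buf, and len are placeholders of ours). With this pattern we do not see the pauses.
===========================================
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define BATCH_DEPTH 4   /* illustrative batch size */

static int post_batch(struct ibv_qp *qp, struct ibv_mr *mr,
                      void *buf, uint32_t len)
{
    struct ibv_sge sge[BATCH_DEPTH];
    struct ibv_send_wr wr[BATCH_DEPTH];
    struct ibv_send_wr *bad_wr = NULL;
    int i;

    memset(wr, 0, sizeof(wr));
    for (i = 0; i < BATCH_DEPTH; i++) {
        sge[i].addr   = (uintptr_t)buf;
        sge[i].length = len;
        sge[i].lkey   = mr->lkey;

        wr[i].wr_id      = (uint64_t)i;
        wr[i].sg_list    = &sge[i];
        wr[i].num_sge    = 1;
        wr[i].opcode     = IBV_WR_SEND;
        wr[i].send_flags = IBV_SEND_SIGNALED;
        wr[i].next       = (i + 1 < BATCH_DEPTH) ? &wr[i + 1] : NULL;
    }

    /* All BATCH_DEPTH requests are posted in one call and remain
     * outstanding together; completions are reaped from the CQ afterwards. */
    return ibv_post_send(qp, wr, &bad_wr);
}
===========================================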
So having the Mellanox 100Gbps RoCE switch in the path seems to trigger the issue. We logged into the switch and tried both enabling and disabling Priority Flow Control (PFC/FC), the mechanism by which a receiver can generate a MAC control frame and send a PAUSE request to the sender when it anticipates buffer overflow; that had no impact of any kind.
Can anyone suggest ways that we might pin down the issue here? Has anyone seen this kind of problem in their own experiments?