We’re experimenting with our RDMA software, built on the libibverbs library, and have run into a puzzling problem.
Our 19-node cluster is unusual: it has two Mellanox 100Gbps switches side by side, one for InfiniBand and one for RoCE. This lets us run identical software on RDMA/IB and then again on RDMA/RoCE.
What we are seeing is that the two fabrics behave identically with the test suite (ib_send_bw, for example), but with our software we get 97Gbps on RDMA/IB and only about 25Gbps on RDMA/RoCE. We’ve pinned this down to long send pauses: the transfer rate is quite high, then suddenly both the sender and the receiver see about half a second of delay, and then the high-rate transfers resume. We’ve observed the identical problem with two different software systems we’re developing at Cornell, both of which work well and have been used fairly widely on RDMA/IB setups.
The pause problem is easy to reproduce under the following conditions:
1) The sender and receiver use RDMA/RoCE to transfer data.
2) The sender and receiver hosts are connected through the Mellanox SN2700 switch (NOT back-to-back).
3) The sender posts one RDMA request at a time (ibv_post_send()) and waits for the completion/ACK (ibv_poll_cq()) before posting the next request. Under these conditions, with a low but non-negligible probability (roughly 1%), we see a long delay before ibv_poll_cq() returns; a minimal sketch of this post-then-poll pattern is shown right below. The length of the delay depends on the timer settings of the queue pair.
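To make condition (3) concrete, here is a minimal sketch of the post-then-poll loop. The function name send_one_and_wait and the qp, cq, mr, buf, and len arguments are placeholders of ours for resources created elsewhere; this illustrates only the posting discipline, not our full test code.
===========================================
#include <infiniband/verbs.h>
#include <stdint.h>

static int send_one_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                             struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,   /* we need a completion to poll for */
    };
    struct ibv_send_wr *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Busy-poll until the completion arrives; this is where the occasional
     * long stall (~500 ms, or 5~8 s with larger timers) shows up on RoCE
     * through the switch. */
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;
    return 0;
}
===========================================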
We saw ~500ms delays with the following settings:
===========================================
static int modify_qp_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 4;  // 65.536 usec
    attr.retry_cnt     = 6;
    attr.rnr_retry     = 6;
    attr.sq_psn        = 0;
    attr.max_rd_atomic = 1;
    flags = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
            IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;
    rc = ibv_modify_qp(qp, &attr, flags);
    if (rc)
        fprintf(stderr, "failed to modify QP state to RTS. ERRNO=%d\n", rc);
    return rc;
}
===========================================
and 5~8 second delays with the following settings:
===========================================
static int modify_qp_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 20;  // 4294967 usec
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.sq_psn        = 0;
    attr.max_rd_atomic = 1;
    flags = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
            IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;
    rc = ibv_modify_qp(qp, &attr, flags);
    if (rc)
        fprintf(stderr, "failed to modify QP state to RTS. ERRNO=%d\n", rc);
    return rc;
}
===========================================
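As a side note on where the numbers in those comments come from: the verbs timeout field is an exponent, and the local ACK timeout it selects is 4.096 usec * 2^timeout. The tiny helper below is just our illustration of that arithmetic, not part of the test code.
===========================================
#include <stdio.h>

/* The QP "timeout" attribute is an exponent:
 * local ACK timeout = 4.096 usec * 2^timeout. */
static double qp_timeout_usec(unsigned int timeout_exp)
{
    return 4.096 * (double)(1u << timeout_exp);
}

int main(void)
{
    printf("timeout=4  -> %.3f usec\n", qp_timeout_usec(4));   /* 65.536 usec */
    printf("timeout=20 -> %.0f usec\n", qp_timeout_usec(20));  /* ~4294967 usec, ~4.3 s */
    return 0;
}
===========================================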
Note that we don't see the same problem if
1) we use a direct back-to-back connection (the sender host is connected to the receiver host by one cable without a switch), or
2) the sender posts more than one request at a time (see the sketch after this list), or
3) we test with the "ib_send_bw" tool, even with the tx and rx depth set to 1.
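For completeness, condition (2) refers to a pattern like the following sketch, in which a small chain of work requests is posted in a single ibv_post_send() call so that several requests are in flight at once (again, qp, mr, buf, and len are placeholders of ours). With this pattern we do not see the pauses.
===========================================
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define BATCH_DEPTH 4   /* illustrative batch size */

static int post_batch(struct ibv_qp *qp, struct ibv_mr *mr,
                      void *buf, uint32_t len)
{
    struct ibv_sge sge[BATCH_DEPTH];
    struct ibv_send_wr wr[BATCH_DEPTH];
    struct ibv_send_wr *bad_wr = NULL;
    int i;

    memset(wr, 0, sizeof(wr));
    for (i = 0; i < BATCH_DEPTH; i++) {
        sge[i].addr   = (uintptr_t)buf;
        sge[i].length = len;
        sge[i].lkey   = mr->lkey;

        wr[i].wr_id      = (uint64_t)i;
        wr[i].sg_list    = &sge[i];
        wr[i].num_sge    = 1;
        wr[i].opcode     = IBV_WR_SEND;
        wr[i].send_flags = IBV_SEND_SIGNALED;
        wr[i].next       = (i + 1 < BATCH_DEPTH) ? &wr[i + 1] : NULL;
    }

    /* All BATCH_DEPTH requests are posted in one call and remain
     * outstanding together; completions are reaped from the CQ afterwards. */
    return ibv_post_send(qp, wr, &bad_wr);
}
===========================================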
So having the Mellanox 100Gbps RoCE switch in the path seems to trigger the issue. We logged into the switch and tried both enabling and disabling Priority Flow Control (PFC/FC), the mechanism by which a receiver can generate a MAC control frame and send a PAUSE request to the sender when it anticipates buffer overflow; that had no impact of any kind.
Can anyone suggest ways that we might pin down the issue here? Has anyone seen this kind of problem in their own experiments?