prov/verbs: Allow for large TX queues with limited (or no) inline data #9940

Closed
sydidelot wants to merge 3 commits

sydidelot
Member

Using large TX queues with the verbs provider would cause fi_getinfo() to return an empty list of verbs adapters because the call to ibv_create_qp() executed as part of fi_getinfo() would fail with EINVAL.

The failure happens because the code allocates the QP with the maximum amount of inline data supported by the adapter, which is empirically determined by vrb_find_max_inline(). The problem is that using inline data limits the TX queue size that can be allocated.

The fix implemented in this patch is to set max_inline_data = 0 when the QP is created, then update info->tx_attr->inject_size with the value returned by vrb_find_max_inline() after the QP is created. The code in vrb_find_max_inline() guarantees that the calculated inline value is correct as it is also tested with a fake QP creation.

Signed-off-by: Sylvain Didelot [email protected]

@ghost

ghost commented Mar 25, 2024

TX WR size is affected by both the inline data size and the # of SGEs (see mlx5_calc_send_wqe, and grep for inline_data_size_to_quanta in the irdma provider). You can get the max inline data size if you limit the # of SGEs, and you can get the max # of SGEs if you limit inline data; this is per WR. So there is no good way to figure out a valid combination, as the application can vary this per WR. The only guarantee is failure if a WR uses the max inline data size AND the max # of SGEs as reported by the device.

@sydidelot
Member Author

@chien-intel Right, I am aware of this limitation and I agree with you. The problem with using the maximum inline size is that the provider detects no network adapter when large TX queues are requested using export FI_VERBS_TX_SIZE=<x>. Any subsequent call to fi_getinfo() would return ENODATA as a consequence. And because the environment variable FI_VERBS_TX_SIZE is read only once (on the first call to fi_getinfo), the application cannot reduce the value of FI_VERBS_TX_SIZE for a subsequent attempt.

I think fi_getinfo() should just return the max values supported by the hardware and let the application decide the Send Queue size, the inline data size and the # of SGEs. The provider shouldn't assume that the application will always run with the max # of SGEs and the max inline data size as it's not always true.

@shefty
Member

shefty commented Mar 25, 2024

I think there's something else off about this entire path. In vrb_get_qp_cap(), why is the code using the MIN(requested tx_size, default tx_size), rather than the requested size? Why wasn't the default applied earlier in the SW flow? It seems like that's the underlying issue. The QP being created to get the max inline is smaller than requested.

For mlx NICs, max inline is proportional to the max sge's. That too is requested using MIN(), versus the requested size.

@ghost

ghost commented Mar 26, 2024

@chien-intel Right, I am aware of this limitation and I agree with you. The problem with using the maximum inline size is that the provider detects no network adapter when large TX queues are requested using export FI_VERBS_TX_SIZE=<x>.

What's your test case? The value of FI_VERBS_TX_SIZE and the adapter you are using? I agree with Sean in that something else is going on. Inline data size affects WR size and shouldn't impact # of WR in TX queue (AFAIK), even if WR is variable size. The provider should be able to provision a queue that's max WR size (max_sge and max inline data size) * max_qp_wr.

@sydidelot
Member Author

@chien-intel The environment uses some fairly old HCAs:

  • lspci
06:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
	Subsystem: Mellanox Technologies MT27600 [Connect-IB]
83:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
	Subsystem: Mellanox Technologies MT27600 [Connect-IB]
  • ibv_devinfo -v
hca_id:	ibp6s0
	transport:			InfiniBand (0)
	fw_ver:				10.16.1200
	node_guid:			f452:1403:006e:54f0
	sys_image_guid:			f452:1403:006e:54f0
	vendor_id:			0x02c9
	vendor_part_id:			4113
	hw_ver:				0x0
	board_id:			MT_1230110019
	phys_port_cnt:			1
	max_mr_size:			0xffffffffffffffff
	page_size_cap:			0xfffffffffffff000
	max_qp:				262144
	max_qp_wr:			32768
	device_cap_flags:		0x00301c36
					BAD_PKEY_CNTR
					BAD_QKEY_CNTR
					AUTO_PATH_MIG
					CHANGE_PHY_PORT
					PORT_ACTIVE_EVENT
					SYS_IMAGE_GUID
					RC_RNR_NAK_GEN
					XRC
					MEM_MGT_EXTENSIONS
	max_sge:			30
	max_sge_rd:			30
	max_cq:				16777216
	max_cqe:			4194303
	max_mr:				16777216
	max_pd:				16777216
	max_qp_rd_atom:			16
	max_ee_rd_atom:			0
	max_res_rd_atom:		4194304
	max_qp_init_rd_atom:		16
	max_ee_init_rd_atom:		0
	atomic_cap:			ATOMIC_NONE (0)
	max_ee:				0
	max_rdd:			0
	max_mw:				0
	max_raw_ipv6_qp:		0
	max_raw_ethy_qp:		0
	max_mcast_grp:			2097152
	max_mcast_qp_attach:		48
	max_total_mcast_qp_attach:	100663296
	max_ah:				2147483647
	max_fmr:			0
	max_srq:			8388608
	max_srq_wr:			32767
	max_srq_sge:			31
	max_pkeys:			128
	local_ca_ack_delay:		16
	general_odp_caps:
					ODP_SUPPORT
	rc_odp_caps:
					SUPPORT_SEND
					SUPPORT_RECV
					SUPPORT_WRITE
					SUPPORT_READ
	uc_odp_caps:
					NO SUPPORT
	ud_odp_caps:
					SUPPORT_SEND
	xrc_odp_caps:
					NO SUPPORT
	completion timestamp_mask:			0x7fffffffffffffff
	hca_core_clock:			156250kHZ
	device_cap_flags_ex:		0x301C36
	tso_caps:
		max_tso:			0
	rss_caps:
		max_rwq_indirection_tables:			0
		max_rwq_indirection_table_size:			0
		rx_hash_function:				0x0
		rx_hash_fields_mask:				0x0
	max_wq_type_rq:			0
	packet_pacing_caps:
		qp_rate_limit_min:	0kbps
		qp_rate_limit_max:	0kbps
	tag matching not supported

	cq moderation caps:
		max_cq_count:	65535
		max_cq_period:	4095 us

	num_comp_vectors:		32
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			3
			port_lid:		21
			port_lmc:		0x00
			link_layer:		InfiniBand
			max_msg_sz:		0x40000000
			port_cap_flags:		0x22516848
			port_cap_flags2:	0x0000
			max_vl_num:		4 (3)
			bad_pkey_cntr:		0x0
			qkey_viol_cntr:		0x0
			sm_sl:			0
			pkey_tbl_len:		128
			gid_tbl_len:		8
			subnet_timeout:		18
			init_type_reply:	0
			active_width:		4X (2)
			active_speed:		14.0 Gbps (16)
			phys_state:		LINK_UP (5)
			GID[  0]:		fe80:0000:0000:0000:f452:1403:006e:54f0

hca_id:	ibp131s0
	transport:			InfiniBand (0)
	fw_ver:				10.16.1200
	node_guid:			f452:1403:006e:56e0
	sys_image_guid:			f452:1403:006e:56e0
	vendor_id:			0x02c9
	vendor_part_id:			4113
	hw_ver:				0x0
	board_id:			MT_1230110019
	phys_port_cnt:			1
	max_mr_size:			0xffffffffffffffff
	page_size_cap:			0xfffffffffffff000
	max_qp:				262144
	max_qp_wr:			32768
	device_cap_flags:		0x00301c36
					BAD_PKEY_CNTR
					BAD_QKEY_CNTR
					AUTO_PATH_MIG
					CHANGE_PHY_PORT
					PORT_ACTIVE_EVENT
					SYS_IMAGE_GUID
					RC_RNR_NAK_GEN
					XRC
					MEM_MGT_EXTENSIONS
	max_sge:			30
	max_sge_rd:			30
	max_cq:				16777216
	max_cqe:			4194303
	max_mr:				16777216
	max_pd:				16777216
	max_qp_rd_atom:			16
	max_ee_rd_atom:			0
	max_res_rd_atom:		4194304
	max_qp_init_rd_atom:		16
	max_ee_init_rd_atom:		0
	atomic_cap:			ATOMIC_NONE (0)
	max_ee:				0
	max_rdd:			0
	max_mw:				0
	max_raw_ipv6_qp:		0
	max_raw_ethy_qp:		0
	max_mcast_grp:			2097152
	max_mcast_qp_attach:		48
	max_total_mcast_qp_attach:	100663296
	max_ah:				2147483647
	max_fmr:			0
	max_srq:			8388608
	max_srq_wr:			32767
	max_srq_sge:			31
	max_pkeys:			128
	local_ca_ack_delay:		16
	general_odp_caps:
					ODP_SUPPORT
	rc_odp_caps:
					SUPPORT_SEND
					SUPPORT_RECV
					SUPPORT_WRITE
					SUPPORT_READ
	uc_odp_caps:
					NO SUPPORT
	ud_odp_caps:
					SUPPORT_SEND
	xrc_odp_caps:
					NO SUPPORT
	completion timestamp_mask:			0x7fffffffffffffff
	hca_core_clock:			156250kHZ
	device_cap_flags_ex:		0x301C36
	tso_caps:
		max_tso:			0
	rss_caps:
		max_rwq_indirection_tables:			0
		max_rwq_indirection_table_size:			0
		rx_hash_function:				0x0
		rx_hash_fields_mask:				0x0
	max_wq_type_rq:			0
	packet_pacing_caps:
		qp_rate_limit_min:	0kbps
		qp_rate_limit_max:	0kbps
	tag matching not supported

	cq moderation caps:
		max_cq_count:	65535
		max_cq_period:	4095 us

	num_comp_vectors:		32
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			3
			port_lid:		33
			port_lmc:		0x00
			link_layer:		InfiniBand
			max_msg_sz:		0x40000000
			port_cap_flags:		0x22516848
			port_cap_flags2:	0x0000
			max_vl_num:		4 (3)
			bad_pkey_cntr:		0x0
			qkey_viol_cntr:		0x0
			sm_sl:			0
			pkey_tbl_len:		128
			gid_tbl_len:		8
			subnet_timeout:		18
			init_type_reply:	0
			active_width:		4X (2)
			active_speed:		10.0 Gbps (8)
			phys_state:		LINK_UP (5)
			GID[  0]:		fe80:0000:0000:0000:f452:1403:006e:56e0
  • apt list --installed | grep verbs
ibverbs-providers/jammy,now 39.0-1 amd64 [installed,automatic]
ibverbs-utils/jammy,now 39.0-1 amd64 [installed]
libibverbs-dev/jammy,now 39.0-1 amd64 [installed]
libibverbs1/jammy,now 39.0-1 amd64 [installed,automatic]

I can easily reproduce the error with fi_pingpong:

FI_LOG_LEVEL=Debug FI_VERBS_TX_SIZE=<X> fi_pingpong -p verbs -e msg

If I export FI_VERBS_TX_SIZE to a large value (say 4096), I see the following failures in the logs:

libfabric:159952:1711473094::verbs:fabric:vrb_get_qp_cap():483<warn> ibv_create_qp: Invalid argument (22)
libfabric:159952:1711473094::verbs:fabric:vrb_get_qp_cap():483<warn> ibv_create_qp: Invalid argument (22)
libfabric:159952:1711473094::verbs:fabric:vrb_get_qp_cap():483<warn> ibv_create_qp: Invalid argument (22)
libfabric:159952:1711473094::verbs:fabric:vrb_get_qp_cap():483<warn> ibv_create_qp: Invalid argument (22)
libfabric:159952:1711473094::verbs:fabric:vrb_get_qp_cap():483<warn> ibv_create_qp: Invalid argument (22)
libfabric:159952:1711473094::verbs:fabric:vrb_get_qp_cap():483<warn> ibv_create_qp: Invalid argument (22)

If I run the same test with the following patch, it completes successfully:

diff --git a/prov/verbs/src/verbs_info.c b/prov/verbs/src/verbs_info.c
index e4def4c6d..a11e6d4aa 100644
--- a/prov/verbs/src/verbs_info.c
+++ b/prov/verbs/src/verbs_info.c
@@ -475,7 +475,7 @@ static inline int vrb_get_qp_cap(struct ibv_context *ctx,
                init_attr.cap.max_recv_sge = MIN(vrb_gl_data.def_rx_iov_limit,
                                                 info->rx_attr->iov_limit);
        }
-       init_attr.cap.max_inline_data = vrb_find_max_inline(pd, ctx, qp_type);
+       init_attr.cap.max_inline_data = 0;
        init_attr.qp_type = qp_type;
 
        qp = ibv_create_qp(pd, &init_attr);

There is a clear relation between the Send Queue size and the inline data size (at least for the environment I am using).

@shefty
Member

shefty commented Mar 27, 2024

I'm trying to figure out what vrb_get_qp_cap() is supposed to do. It looks like its sole purpose is to do this:

info->tx_attr->inject_size = init_attr.cap.max_inline_data;

at least that's the only thing I can find being returned from the function.

vrb_find_max_inline() creates and destroys QPs in a loop to find the max inline size, only it uses some minimal set of QP attributes, rather than the requested size. I don't understand why it did that. After vrb_find_max_inline() returns, the code creates yet another QP, which it immediately destroys. I also don't understand why it did that. But it's the output of that call that's being returned.

vrb_get_qp_cap() basically exists to allocate the pd and cq which are passed to vrb_find_max_inline(). But why don't we have:

init_attr.cap.max_send_wr = info->tx_attr->size;

? Why do we want the min() of that and the default size? And why create a QP in vrb_get_qp_cap() at all?
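
To make the probing described in this comment concrete, here is a minimal sketch of finding the largest accepted inline size by repeated trial creation. It is not the provider's code: find_max_inline, probe_fn and fake_probe are hypothetical names, the probe callback stands in for an ibv_create_qp()/ibv_destroy_qp() pair, and the binary search assumes that an inline size of 0 is always accepted and that acceptance is monotonic (if a size works, every smaller size works too).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: returns true if a QP could be created with the
 * given max_inline_data. In real code this would wrap ibv_create_qp()
 * followed by ibv_destroy_qp(). */
typedef bool (*probe_fn)(uint32_t inline_size, void *arg);

/* Largest value in [0, limit] accepted by probe. Assumes 0 is accepted
 * and that acceptance is monotonic in the inline size. */
uint32_t find_max_inline(probe_fn probe, void *arg, uint32_t limit)
{
	uint32_t lo = 0, hi = limit;

	assert(probe != NULL);

	while (lo < hi) {
		/* Upper midpoint, written so it cannot overflow. */
		uint32_t mid = hi - (hi - lo) / 2;

		if (probe(mid, arg))
			lo = mid;	/* mid works, look higher */
		else
			hi = mid - 1;	/* mid fails, look lower */
	}
	return lo;
}

/* Stand-in device for illustration: accepts sizes up to *arg. */
bool fake_probe(uint32_t inline_size, void *arg)
{
	return inline_size <= *(const uint32_t *)arg;
}
```

Whatever QP attributes the probe uses (queue depth, SGE count) determine the answer, which is the point raised in this comment: a probe run with minimal attributes may report a larger inline size than a QP of the requested size can support.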

@shefty
Member

shefty commented Mar 27, 2024

Digging through the code more, the fi_info being constructed is defining the maximum attributes supported by the device. I don't think it's safe to assume that maximizing all parameters is guaranteed to work, but there are too many variables to really do anything else. These are the changes that I think are needed:

  • The ibv_create_qp() seems to conflict with what vrb_find_max_inline() is intended to do. I would remove it. If that causes an issue, then vrb_find_max_inline() isn't working as expected.

  • I would remove all MIN() calls in vrb_get_qp_cap() and use the info->..attr values. The info is set to the max supported by the device, so I vote for max_inline to be relative to the other max values.

  • I find the environment variables confusing. E.g. FI_VERBS_TX_SIZE is the "default maximum size", which I don't understand the meaning of. Based on the code, these are "maximum default sizes". It's the default size unless the value is reduced based on device max. In any case, I would remove the word "maximum" and just label the variables as a "default size" to avoid confusion over whether these are a default value or a maximum value.

The default values are applied later when the app calls fi_getinfo(). So all we're trying to do here is figure out the device max values in a convoluted way because of a software emulation of verbs, which no one would ever run anyway.

@a-szegel
Contributor

bot:aws:retest

@sydidelot
Member Author

@shefty Thanks for your input, I appreciate your help on this.

I 100% agree with your first 2 points: to my understanding, the call to ibv_create_qp() in vrb_get_qp_cap() is there only to make sure the provider can allocate an endpoint with the default values. I suggest that we remove the function vrb_get_qp_cap() and move info->tx_attr->inject_size = vrb_find_max_inline() directly into vrb_get_device_attrs().

As for your last point to remove the word "maximum" from the environment variables, I'm not sure we should do this.

I find the environment variables confusing. E.g. FI_VERBS_TX_SIZE is the "default maximum size", which I don't understand the meaning of. Based on the code, these are "maximum default sizes". It's the default size unless the value is reduced based on device max.

I think it works the other way around: if the device max is higher than the environment variable, it would be reduced down to the value of this environment variable (performed by vrb_set_default_info()). I.e., if the application requests a TX size higher than FI_VERBS_TX_SIZE, fi_getinfo() would return ENODATA.

@shefty
Member

shefty commented Apr 2, 2024

@sydidelot

if the application requests a TX size higher than FI_VERBS_TX_SIZE, fi_getinfo() would return ENODATA.

Can you confirm this? It looks like the user hints are compared against saved fi_info's which store the maximum values. The default from the environment variable then overrides that value. I wonder if this is being handled correctly.

In either case, I still can't tell if the env var is intended as the maximum or the default. The code suggests to me the latter, which may be incorrectly being used as a max.

@sydidelot
Member Author

@shefty My understanding of the code was incorrect. You are right and the default from the environment variable defines the queue size if the hints do not provide it.

I wrote this short program to verify what the verbs provider does:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <rdma/fabric.h>

int main(int argc, char **argv)
{
        struct fi_info *info;
        struct fi_info *hints = fi_allocinfo();
        if (hints == NULL)
                return EXIT_FAILURE;

        /* Ask for a connected (MSG) endpoint from the verbs provider only. */
        hints->ep_attr->type = FI_EP_MSG;
        hints->caps = FI_MSG;
        hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_ALLOCATED;
        hints->fabric_attr->prov_name = strdup("verbs");

        /* Optional first argument: requested TX queue size (0 = unset). */
        if (argc > 1)
                hints->tx_attr->size = atoi(argv[1]);

        /* tx_attr->size is a size_t, so print it with %zu. */
        fprintf(stdout, "hints->tx_attr->size=%zu\n", hints->tx_attr->size);

        int ret = fi_getinfo(FI_VERSION(FI_MAJOR_VERSION, FI_MINOR_VERSION),
                             NULL, NULL, 0, hints, &info);

        if (ret) {
                fprintf(stderr, "fi_getinfo failed: %d\n", ret);
                fi_freeinfo(hints);
                return EXIT_FAILURE;
        }

        /* Walk the returned list and print the TX size of each entry. */
        struct fi_info *cur = info;
        while (cur) {
                fprintf(stderr, "Domain=%s prov=%s tx_size=%zu\n",
                        cur->domain_attr->name,
                        cur->fabric_attr->prov_name,
                        cur->tx_attr->size);
                cur = cur->next;
        }

        fi_freeinfo(info);
        fi_freeinfo(hints);
        return EXIT_SUCCESS;
}

The first command-line argument defines the value of hints->tx_attr->size. If unspecified, the default is selected by the provider.

  • Scenario 1: no hints provided. The tx_size returned is the default set by the provider, i.e., 384
$ ./test 
hints->tx_attr->size=0
Domain=ibp6s0 prov=verbs tx_size=384
Domain=ibp6s0 prov=verbs tx_size=384
Domain=ibp131s0 prov=verbs tx_size=384
Domain=ibp131s0 prov=verbs tx_size=384

Default tx_size can be controlled with FI_VERBS_TX_SIZE:

$ FI_VERBS_TX_SIZE=1024 ./test 
hints->tx_attr->size=0
Domain=ibp6s0 prov=verbs tx_size=1024
Domain=ibp6s0 prov=verbs tx_size=1024
Domain=ibp131s0 prov=verbs tx_size=1024
Domain=ibp131s0 prov=verbs tx_size=1024
  • Scenario 2: if hints are provided, the tx_size returned by fi_getinfo() is the same as the hints (regardless of the value of FI_VERBS_TX_SIZE):
$ ./test 1000 
hints->tx_attr->size=1000
Domain=ibp6s0 prov=verbs tx_size=1000
Domain=ibp6s0 prov=verbs tx_size=1000
Domain=ibp131s0 prov=verbs tx_size=1000
Domain=ibp131s0 prov=verbs tx_size=1000

Conclusion: it makes sense to remove the word "maximum" from environment variables since it defines the "default" size if no hints are provided 👍
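
The two scenarios can be summarised as a small model of how the reported TX size appears to be chosen. This is only a sketch of the observed behaviour, not the provider's logic: resolve_tx_size and SKETCH_ENODATA are hypothetical names, and the handling of the device maximum (clamping the default, rejecting a larger hint) is an assumption drawn from the discussion in this thread rather than something the test program exercised.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ENODATA 61	/* illustrative stand-in for FI_ENODATA */

/*
 * hint        - hints->tx_attr->size, 0 when the application left it unset
 * env_default - provider default, possibly overridden by FI_VERBS_TX_SIZE
 * dev_max     - maximum the device supports (assumed to bound both cases)
 *
 * Returns 0 and stores the chosen size in *out, or a negative value when
 * the hint cannot be satisfied.
 */
int resolve_tx_size(size_t hint, size_t env_default, size_t dev_max,
		    size_t *out)
{
	assert(out != NULL);

	if (hint == 0) {
		/* No hint: use the default, reduced to the device maximum. */
		*out = env_default < dev_max ? env_default : dev_max;
		return 0;
	}
	if (hint > dev_max)
		return -SKETCH_ENODATA;	/* assumed: no matching fi_info */

	/* A hint within range wins over the environment default. */
	*out = hint;
	return 0;
}
```

With a device maximum of 32768 (the max_qp_wr reported earlier, used here only as an example bound), the calls resolve_tx_size(0, 384, ...), resolve_tx_size(0, 1024, ...) and resolve_tx_size(1000, 1024, ...) yield 384, 1024 and 1000, matching the output of the test program.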

@shefty
Member

shefty commented Apr 3, 2024

@sydidelot - Thanks for verifying this!

@sydidelot force-pushed the inline_data branch 2 times, most recently from e6f34f8 to 76b468a (April 4, 2024 20:10)
@sydidelot
Member Author

@shefty I have updated the PR with the changes you suggested in your previous comment. Thanks!

@@ -544,6 +548,10 @@ int vrb_find_max_inline(struct ibv_pd *pd, struct ibv_context *context,
		ibv_destroy_cq(cq);
	}

	if (pd) {
		ibv_dealloc_pd(pd);
	}
Member

@shefty shefty Apr 4, 2024

I see that you're following the existing code, but the existing code is goofy. Assert that create succeeded, use the values, then check whether create succeeded prior to calling destroy?

Can you replace the assert(pd) / assert(cq) above with checks, and just return on a failure? I'm surprised that coverity didn't already complain about the code structure.

Member Author

Thanks for the review! I have replaced the assertion failures with checks as you suggested.
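
As an illustration of the structure requested in this review (checks with an early return instead of assert() followed by a redundant NULL test before destroy), here is a minimal sketch. It deliberately avoids libibverbs so that it stands alone: fake_pd, fake_cq and the fake_* helpers are hypothetical stand-ins for struct ibv_pd, struct ibv_cq and their alloc/create/dealloc/destroy calls, and probe_with_checks is not a function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct ibv_pd / struct ibv_cq. */
struct fake_pd { int unused; };
struct fake_cq { int unused; };

int fail_pd, fail_cq;	/* knobs to force an allocation failure */
int live_objects;	/* counts objects not yet released */

static struct fake_pd *fake_alloc_pd(void)
{
	struct fake_pd *pd = fail_pd ? NULL : calloc(1, sizeof(*pd));

	if (pd)
		live_objects++;
	return pd;
}

static struct fake_cq *fake_create_cq(void)
{
	struct fake_cq *cq = fail_cq ? NULL : calloc(1, sizeof(*cq));

	if (cq)
		live_objects++;
	return cq;
}

static void fake_dealloc_pd(struct fake_pd *pd)
{
	assert(pd != NULL);	/* callers only get here with a valid PD */
	free(pd);
	live_objects--;
}

static void fake_destroy_cq(struct fake_cq *cq)
{
	assert(cq != NULL);	/* callers only get here with a valid CQ */
	free(cq);
	live_objects--;
}

/* Returns 1 if probing could run, 0 if a resource could not be created. */
int probe_with_checks(void)
{
	struct fake_pd *pd;
	struct fake_cq *cq;
	int ran = 0;

	pd = fake_alloc_pd();
	if (!pd)
		return 0;	/* nothing to clean up yet */

	cq = fake_create_cq();
	if (!cq)
		goto free_pd;	/* release only what was created */

	ran = 1;		/* the QP probing loop would go here */

	fake_destroy_cq(cq);
free_pd:
	fake_dealloc_pd(pd);
	return ran;
}
```

Because each failure path releases exactly what was created before it, the cleanup code no longer needs its own `if (pd)` / `if (cq)` tests.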

So we are not setting init_attr.cap.max_inline_data? If HW is worth anything it would enforce that when an inject call comes in using inline data part of the WR that's not set when QP was created.
Try this experiment, set # SGE to 1 and do a large inject call (384 bytes) and see what happens. If it passes, I will be quiet.

Member

The function is setting max_inline_data. It does it repeatedly to find the actual maximum that's supported, which is captured as the max inject size. I'm not sure what change you're referring to.


@ghost ghost Apr 19, 2024

And even worse, this can lead to SQ over-run as each inline WR takes up more room on the queue than indicated in init_attr.

The code change does not match commit message. I still don't see any issues with vrb_get_qp_cap. The whole thing smells.

there shouldn't be a difference between the QP create in vrb_get_qp_cap at getinfo time vs QP created by application at runtime. If application is indeed using inject_size for max_inline_data (which should be the case). So why would this change to get by getinfo error but not fail when application creates its QP, again using the same inject size/max_inline_data?

Member Author

@chien-intel Please find below my replies to your concerns/questions:

And even worse, this can lead to SQ over-run as each inline WR takes up more room on the queue than indicated in init_attr.

I don't quite understand your concern here. The QP is created using the hints provided by the caller or based on the environment variables (FI_VERBS_TX_SIZE, FI_VERBS_TX_IOV_LIMIT, FI_VERBS_INLINE_SIZE, etc...). See vrb_set_default_info() for more info.
There is absolutely no intention of overrunning the QP resources set at creation time.

I still don't see any issues with vrb_get_qp_cap. The whole thing smells.

I think the real question here is: what's the purpose of vrb_get_qp_cap() and why should we keep it? To my understanding, it's a function that validates that the QP can be created with the default attributes and max_inline_data. IMHO, testing the default attributes is meaningless, as the application will likely request attributes that differ from the defaults. Finally, the value for max_inline_data is known to work, as vrb_find_max_inline() repeatedly creates QPs to discover the actual max value supported.

there shouldn't be a difference between the QP create in vrb_get_qp_cap at getinfo time vs QP created by application at runtime.

There is a difference: an application may want to create a QP at runtime with a lower inline size than the actual maximum in order to create larger TX queues. This is the whole purpose of this PR.

If application is indeed using inject_size for max_inline_data (which should be the case)

I don't get your point. Do you mean that applications are expected to use inject_size = max_inline_data? If so, this assumption is incorrect.

Try this experiment, set # SGE to 1 and do a large inject call (384 bytes) and see what happens. If it passes, I will be quiet.

  • Server
FI_VERBS_TX_IOV_LIMIT=1 FI_VERBS_RX_IOV_LIMIT=1 fi_rdm_pingpong -p verbs  -j 384 -S 384
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
384     10k     7.3m        0.06s    118.60       3.24       0.31
  • Client
FI_VERBS_TX_IOV_LIMIT=1 FI_VERBS_RX_IOV_LIMIT=1 fi_rdm_pingpong -p verbs 10.1.1.3 -j 384 -S 384   
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
384     10k     7.3m        0.06s    118.60       3.24       0.31

It works. But to be honest, I don't see why this wouldn't work.

It works. But to be honest, I don't see why this wouldn't work.
Yeah, it will always work, posting 1 WR at a time. :-) Posting a queue-depth WRs is a different story.

Here is a realistic hypothetical example to demonstrate SQ overrun.
One key piece of information that I think you are not getting is inline data and SGEs overlap in a WR. This is why when calculating amount of memory for a SQ, both sizes are used in the calculation.

Let's assume a device that's capable of max 108 bytes of inline data and max 13 SGEs.
Assume 20 bytes header for a SQ operation, this would come out to max 128 bytes for a WR (20 header + 108 inline or 20 header + 13 SGE * 8).

Now take my test case, using your test parameter of 4096 queue depth, 0 inline data and 1 SGE. WR size may round up to 32 or 64 bytes; we can assume 64 bytes. Memory allocated to the SQ comes out to 64 * 4096 = 262,144 bytes, or 256k. Now, posting a queue-depth of WRs with the max inline data size would come out to 4096 * 128-byte WRs = 524,288 bytes. This is the SQ over-run.

fi_getinfo should return valid numbers, period. Something that the device is capable of supporting, which vrb_get_qp_cap is trying to validate. If the application is not interested in sending inline data, perhaps the code needs to be modified to use the lowered inline data size and subsequently set inject_size accordingly.

Bypassing vrb_get_qp_cap instead of root-causing and fixing the code is wrong, and subsequently setting inject_size to the max inline data size allows the application to do the wrong thing and makes SQ over-run possible.
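
The arithmetic in the hypothetical example above can be checked with a short sketch. The model is an assumption taken from that example only (a fixed header plus the larger of the inline payload and the SGE list, rounded up to a power of two, with a minimum slot size); real devices compute WQE sizes differently, and wr_bytes, sq_bytes and round_up_pow2 are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two >= v (v must fit without overflowing size_t). */
static size_t round_up_pow2(size_t v)
{
	size_t p = 1;

	assert(v <= SIZE_MAX / 2 + 1);
	while (p < v)
		p <<= 1;
	return p;
}

/*
 * Assumed per-WR footprint: inline data and the SGE list overlap, so only
 * the larger of the two counts, on top of a fixed header. The result is
 * rounded up to a power of two and never smaller than min_slot.
 */
size_t wr_bytes(size_t hdr, size_t inline_bytes, size_t num_sge,
		size_t sge_bytes, size_t min_slot)
{
	size_t sgl = num_sge * sge_bytes;
	size_t payload = inline_bytes > sgl ? inline_bytes : sgl;
	size_t sz = round_up_pow2(hdr + payload);

	return sz < min_slot ? min_slot : sz;
}

/* Total send-queue memory for a given depth and per-WR footprint. */
size_t sq_bytes(size_t depth, size_t per_wr)
{
	assert(depth == 0 || per_wr <= SIZE_MAX / depth);
	return depth * per_wr;
}
```

With the numbers from the example (20-byte header, 8-byte SGEs, 64-byte minimum slot), a WR with 0 inline bytes and 1 SGE occupies 64 bytes and a WR with 108 inline bytes occupies 128 bytes, so a 4096-deep queue sized for the former holds 262,144 bytes while 4096 of the latter would need 524,288 bytes.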

It's better to return from vrb_find_max_inline() if the CQ cannot
be created.

Signed-off-by: Sylvain Didelot <[email protected]>

Using large TX queues with the verbs provider would cause fi_getinfo()
to return an empty list of verbs adapters because the call to
ibv_create_qp() executed as part of fi_getinfo() would fail with EINVAL.

The failure happens because the code allocates the QP with the maximum
amount of inline data supported by the adapter, which is empirically
determined by vrb_find_max_inline(). The problem is that using inline
data limits the TX queue size that can be allocated.

The patch removes vrb_get_qp_cap(), whose sole purpose is to set
the maximum inline data size returned by vrb_find_max_inline(). This
operation can be done in vrb_get_device_attrs() directly.

Signed-off-by: Sylvain Didelot <[email protected]>

These environment variables define the default values when no hints
are provided. Remove the word "maximum" to avoid confusion over
whether these are a default value or a maximum value.

Signed-off-by: Sylvain Didelot <[email protected]>

Member

@shefty shefty left a comment

Thanks - changes look good from my view

@shefty
Member

shefty commented Apr 23, 2024

bot:aws:retest

@shefty
Member

shefty commented Apr 23, 2024

@sydidelot - AWS doesn't test verbs AFAIK, so the failure should be unrelated.

@j-xiong
Contributor

j-xiong commented Oct 18, 2024

@sydidelot Do you still want this in? If so, please rebase and resolve the conflicts.

@sydidelot
Member Author

I am closing the PR.

@sydidelot sydidelot closed this Nov 13, 2024

4 participants