prov/verbs: Allow for large TX queues with limited (or no) inline data #9940
Conversation
TX WR size is affected by both inline data size and # of SGEs (see mlx5_calc_send_wqe, and grep for inline_data_size_to_quanta in the irdma provider). You can get the max inline data size if you limit the # of SGEs, and you can get the max # of SGEs if you limit inline data; this is per WR. So there is no good way to figure out a valid combination, as the application can vary this per WR. The only guarantee is failure if you use the max inline data size AND the max # of SGEs, as reported by the device, in a single WR.
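As a rough illustration of that relationship, here is a simplified model of the per-WR footprint; the 20-byte header and 8-byte SGE descriptor are assumed values for illustration only, not the actual mlx5/irdma layout:

```c
/*
 * Simplified model of the per-WR footprint discussed above. The 20-byte
 * header and 8-byte SGE descriptor are illustrative assumptions, not the
 * real mlx5/irdma layout. Inline data and the SGE list overlap in a WR,
 * so the WR consumes whichever of the two is larger, plus a fixed header.
 */
#include <stddef.h>

#define WR_HDR_BYTES   20	/* assumed control/header segment size */
#define SGE_DESC_BYTES  8	/* assumed size of one SGE descriptor */

static size_t approx_wr_bytes(size_t inline_bytes, size_t num_sge)
{
	size_t sge_bytes = num_sge * SGE_DESC_BYTES;
	size_t data = inline_bytes > sge_bytes ? inline_bytes : sge_bytes;

	return WR_HDR_BYTES + data;
}
```

Under this model, maximizing both the inline size and the SGE count in one WR always produces the largest possible WR, which is why that combination is the one guaranteed to fail.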
@chien-intel Right, I am aware of this limitation and I agree with you. The problem with using the maximum inline size is that, I think, the provider detects no network adapter when large TX queues are requested using
I think there's something else off about this entire path. In vrb_get_qp_cap(), why is the code using MIN(requested tx_size, default tx_size) rather than the requested size? Why wasn't the default applied earlier in the SW flow? It seems like that's the underlying issue. The QP being created to get the max inline is smaller than requested. For mlx NICs, max inline is proportional to the max SGEs. That too is requested using MIN(), versus the requested size.
What's your test case? The value of FI_VERBS_TX_SIZE and the adapter you are using? I agree with Sean in that something else is going on. Inline data size affects WR size and shouldn't impact the # of WRs in the TX queue (AFAIK), even if the WR is variable size. The provider should be able to provision a queue that's max WR size (max_sge and max inline data size) * max_qp_wr.
@chien-intel The environment uses some quite old HCAs:
I can easily reproduce the error with
If I export
If I run the same test with the following patch, it completes successfully:
There is a clear relation between the Send Queue size and the inline data size (at least for the environment I am using).
I'm trying to figure out what vrb_get_qp_cap() is supposed to do. It looks like its sole purpose is to do this:
at least that's the only thing I can find being returned from the function. vrb_find_max_inline() creates and destroys QPs in a loop to find the max inline size, only it uses some minimal set of QP attributes, rather than the requested size. I don't understand why it did that. After vrb_find_max_inline() returns, the code creates yet another QP, which it immediately destroys. I also don't understand why it did that. But it's the output of that call that's being returned. vrb_get_qp_cap() basically exists to allocate the pd and cq which are passed to vrb_find_max_inline(). But why don't we have:
? Why do we want the min() of that and the default size? And why create a QP in vrb_get_qp_cap() at all?
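For reference, the probing loop in vrb_find_max_inline() can be sketched roughly as follows; the RC QP type, the 64 KB starting candidate, and the halving search are assumptions for illustration and differ from the provider's exact code:

```c
/*
 * Rough sketch of the kind of probing vrb_find_max_inline() performs:
 * try to create a throwaway QP with a candidate max_inline_data and step
 * the value down until creation succeeds. The QP type, starting point,
 * and search strategy here are assumptions, not the actual implementation.
 */
#include <infiniband/verbs.h>
#include <stdint.h>

static uint32_t probe_max_inline(struct ibv_pd *pd, struct ibv_cq *cq)
{
	uint32_t candidate = 1 << 16;	/* assumed upper bound to start from */

	while (candidate) {
		struct ibv_qp_init_attr attr = {
			.send_cq = cq,
			.recv_cq = cq,
			.qp_type = IBV_QPT_RC,
			.cap = {
				.max_send_wr = 1,
				.max_recv_wr = 1,
				.max_send_sge = 1,
				.max_recv_sge = 1,
				.max_inline_data = candidate,
			},
		};
		struct ibv_qp *qp = ibv_create_qp(pd, &attr);

		if (qp) {
			ibv_destroy_qp(qp);
			return candidate;	/* device accepted this size */
		}
		candidate >>= 1;		/* too large, try a smaller value */
	}
	return 0;
}
```

Note the minimal queue attributes used during probing, which is exactly the mismatch being discussed: the QP used to find the max inline size is not the QP the application will ask for.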
Digging through the code more, the fi_info being constructed is defining the maximum attributes supported by the device. I don't think it's safe to assume that maximizing all parameters is guaranteed to work, but there are too many variables to really do anything else. These are the changes that I think are needed:
The default values are applied later when the app calls fi_getinfo(). So all we're trying to do here is figure out the device max values in a convoluted way because of a software emulation of verbs, which no one would ever run anyway. |
bot:aws:retest
@shefty Thanks for your input, I appreciate your help on this. I 100% agree with your first 2 points: to my understanding, the call to
As for your last point, to remove the word "maximum" from the environment variables, I'm not sure if we should do this.
I think it works the other way around: if the device max is higher than the environment variable, it would be reduced down to the value of this environment variable (performed by |
Can you confirm this? It looks like the user hints are compared against saved fi_info's which store the maximum values. The default from the environment variable then overrides that value. I wonder if this is being handled correctly. In either case, I still can't tell if the env var is intended as the maximum or the default. The code suggests to me the latter, which may be incorrectly being used as a max.
@shefty My understanding of the code was incorrect. You are right and the default from the environment variable defines the queue size if the hints do not provide it. I wrote this short program to verify what the verbs provider does:
First command line argument defines the value of
Default tx_size can be controlled with
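The program itself is not reproduced in this thread; a minimal sketch of that kind of check might look like this, where the API version, RDM endpoint type, and argument handling are assumptions:

```c
/*
 * Minimal sketch (not the exact program used above): ask the verbs
 * provider for an RDM endpoint with an optional tx_size hint and print
 * what fi_getinfo() returns. Version, endpoint type, and argument
 * handling are assumptions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rdma/fabric.h>

int main(int argc, char **argv)
{
	struct fi_info *hints = fi_allocinfo();
	struct fi_info *info = NULL;
	int ret;

	hints->fabric_attr->prov_name = strdup("verbs");
	hints->ep_attr->type = FI_EP_RDM;
	if (argc > 1)				/* optional tx_size hint */
		hints->tx_attr->size = (size_t) atol(argv[1]);

	ret = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);
	if (ret) {
		fprintf(stderr, "fi_getinfo failed: %d\n", ret);
		return 1;
	}

	printf("tx_attr->size = %zu, inject_size = %zu\n",
	       info->tx_attr->size, info->tx_attr->inject_size);

	fi_freeinfo(info);
	fi_freeinfo(hints);
	return 0;
}
```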
Conclusion: it makes sense to remove the word "maximum" from the environment variables since they define the "default" size if no hints are provided 👍
@sydidelot - Thanks for verifying this! |
e6f34f8 to 76b468a (Compare)
@shefty I have updated the PR with the changes you suggested in your previous comment. Thanks!
prov/verbs/src/verbs_init.c (Outdated)
@@ -544,6 +548,10 @@ int vrb_find_max_inline(struct ibv_pd *pd, struct ibv_context *context,
 		ibv_destroy_cq(cq);
 	}
 
+	if (pd) {
+		ibv_dealloc_pd(pd);
+	}
I see that you're following the existing code, but the existing code is goofy. Assert that create succeeded, use the values, then check whether create succeeded prior to calling destroy?
Can you replace the assert(pd) / assert(cq) above with checks, and just return on a failure? I'm surprised that Coverity didn't already complain about the code structure.
Thanks for the review! I have replaced the assertion failures with checks as you suggested.
So we are not setting init_attr.cap.max_inline_data? If the HW is worth anything, it would enforce that when an inject call comes in that uses the inline data part of the WR, which wasn't set when the QP was created.
Try this experiment: set the # of SGEs to 1 and do a large inject call (384 bytes) and see what happens. If it passes, I will be quiet.
The function is setting max_inline_data. It does it repeatedly to find the actual maximum that's supported, which is captured as the max inject size. I'm not sure what change you're referring to.
And even worse, this can lead to SQ over-run as each inline WR takes up more room on the queue than indicated in init_attr.
The code change does not match the commit message. I still don't see any issues with vrb_get_qp_cap. The whole thing smells.
There shouldn't be a difference between the QP created in vrb_get_qp_cap at getinfo time and the QP created by the application at runtime, if the application is indeed using inject_size for max_inline_data (which should be the case). So why would this change get past the getinfo error but not fail when the application creates its QP, again using the same inject size/max_inline_data?
@chien-intel Please find below my replies to your concerns/questions:
And even worse, this can lead to SQ over-run as each inline WR takes up more room on the queue than indicated in init_attr.
I don't quite understand your concern here. The QP is created using the hints provided by the caller or based on the environment variables (FI_VERBS_TX_SIZE, FI_VERBS_TX_IOV_LIMIT, FI_VERBS_INLINE_SIZE, etc.). See vrb_set_default_info() for more info.
There is absolutely no intention of overrunning the QP resources set at creation time.
I still don't see any issues with vrb_get_qp_cap. The whole thing smells.
I think the real question here is: what's the purpose of vrb_get_qp_cap() and why should we keep it? To my understanding, it's a function that validates that the QP can be created with default attributes and max_inline_data. IMHO, testing default attributes is meaningless as the application will likely request different attributes than the defaults. Finally, the value for max_inline_data is known to work, as vrb_find_max_inline() repeatedly creates QPs to discover the actual maximum value supported.
there shouldn't be a difference between the QP create in vrb_get_qp_cap at getinfo time vs QP created by application at runtime.
There is a difference: an application may want to create a QP at runtime with a lower inline size than the actual maximum in order to create larger TX queues. This is the whole purpose of this PR.
If application is indeed using inject_size for max_inline_data (which should be the case)
I don't get your point. Do you mean that applications are expected to use inject_size = max_inline_data? If so, this assumption is incorrect.
Try this experiment, set # SGE to 1 and do a large inject call (384 bytes) and see what happens. If it passes, I will be quiet.
- Server
FI_VERBS_TX_IOV_LIMIT=1 FI_VERBS_RX_IOV_LIMIT=1 fi_rdm_pingpong -p verbs -j 384 -S 384
bytes iters total time MB/sec usec/xfer Mxfers/sec
384 10k 7.3m 0.06s 118.60 3.24 0.31
- Client
FI_VERBS_TX_IOV_LIMIT=1 FI_VERBS_RX_IOV_LIMIT=1 fi_rdm_pingpong -p verbs 10.1.1.3 -j 384 -S 384
bytes iters total time MB/sec usec/xfer Mxfers/sec
384 10k 7.3m 0.06s 118.60 3.24 0.31
It works. But to be honest, I don't see why this wouldn't work.
It works. But to be honest, I don't see why this wouldn't work.
Yeah, it will always work, posting 1 WR at a time. :-) Posting queue-depth WRs is a different story.
Here is a realistic hypothetical example to demonstrate SQ overrun.
One key piece of information that I think you are not getting is that inline data and SGEs overlap in a WR. This is why, when calculating the amount of memory for an SQ, both sizes are used in the calculation.
Let's assume a device that's capable of max 108 bytes of inline data and max 13 SGEs.
Assume a 20-byte header for an SQ operation; this comes out to a max of 128 bytes for a WR (20 header + 108 inline, or 20 header + 13 SGEs * 8).
Now take my test case, using your test parameter of 4096 queue depth, 0 inline data, and 1 SGE. The WR size may round up to 32 or 64 bytes; assume 64 bytes. Memory allocated to the SQ comes out to 64 * 4096 = 262,144 bytes, or 256k. Posting queue-depth WRs with the max inline data size would come out to 4096 * 128-byte WRs = 524,288 bytes. This is the SQ over-run.
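The same arithmetic, checked under the assumptions above (64-byte minimal WR, 128-byte worst-case WR, 4096-deep queue):

```c
#include <stdio.h>

int main(void)
{
	unsigned depth  = 4096;	/* requested TX queue depth */
	unsigned min_wr = 64;	/* 0 inline data, 1 SGE, rounded up */
	unsigned max_wr = 128;	/* 20-byte header + 108 bytes inline */

	printf("SQ memory provisioned:  %u bytes\n", depth * min_wr); /* 262144 */
	printf("Worst-case inline need: %u bytes\n", depth * max_wr); /* 524288 */
	return 0;
}
```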
fi_getinfo should return valid numbers, period. Something that the device is capable of supporting, which is what vrb_get_qp_cap is trying to validate. If the application is not interested in sending inline data, perhaps the code needs to be modified to use the lowered inline data size and subsequently set inject_size accordingly.
Bypassing vrb_get_qp_cap instead of root-causing and fixing the code is wrong, and subsequently setting inject_size to the max inline data size allows the application to do the wrong thing and makes SQ over-run possible.
It's better to return from vrb_find_max_inline() if the CQ cannot be created. Signed-off-by: Sylvain Didelot <[email protected]>
Using large TX queues with the verbs provider would cause fi_getinfo() to return an empty list of verbs adapters because the call to ibv_create_qp() executed as part of fi_getinfo() would fail with EINVAL. The failure happens because the code allocates the QP with the maximum amount of inline data supported by the adapter, which is empirically determined by vrb_find_max_inline(). The problem is that using inline data limits the TX queue size that can be allocated. The patch removes vrb_get_qp_cap(), whose sole purpose is to set the maximum inline data size returned by vrb_find_max_inline(). This operation can be done in vrb_get_device_attrs() directly. Signed-off-by: Sylvain Didelot <[email protected]>
These environment variables define the default values when no hints are provided. Remove the word "maximum" to avoid confusion over whether these are a default value or a maximum value. Signed-off-by: Sylvain Didelot <[email protected]>
Thanks - changes look good from my view
bot:aws:retest
@sydidelot - AWS doesn't test verbs AFAIK, so the failure should be unrelated.
@sydidelot Do you still want this in? If so, please rebase and resolve the conflicts.
I am closing the PR. |
Using large TX queues with the verbs provider would cause fi_getinfo() to return an empty list of verbs adapters because the call to ibv_create_qp() executed as part of fi_getinfo() would fail with EINVAL.
The failure happens because the code allocates the QP with the maximum amount of inline data supported by the adapter, which is empirically determined by vrb_find_max_inline(). The problem is that using inline data limits the TX queue size that can be allocated.
The fix implemented in this patch is to set max_inline_data = 0 when the QP is created, then update info->tx_attr->inject_size with the value returned by vrb_find_max_inline() after the QP is created. The code in vrb_find_max_inline() guarantees that the calculated inline value is correct as it is also tested with a fake QP creation.
Signed-off-by: Sylvain Didelot [email protected]
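For illustration, a minimal sketch of the getinfo-time check this description implies, with assumed attribute values rather than the actual patch: the probe QP is created with max_inline_data set to 0 so a large send queue can be validated independently of the device's inline limit, while inject_size is reported from the separately probed maximum.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Sketch: validate a deep send queue without any inline data requirement.
 * The attribute values are assumptions chosen to mirror the description
 * above, not the provider's actual code. */
static int validate_large_sq(struct ibv_pd *pd, struct ibv_cq *cq,
			     uint32_t tx_size)
{
	struct ibv_qp_init_attr attr = {
		.send_cq = cq,
		.recv_cq = cq,
		.qp_type = IBV_QPT_RC,
		.cap = {
			.max_send_wr = tx_size,	/* e.g. 4096 */
			.max_recv_wr = 1,
			.max_send_sge = 1,
			.max_recv_sge = 1,
			.max_inline_data = 0,	/* key point: no inline data */
		},
	};
	struct ibv_qp *qp = ibv_create_qp(pd, &attr);

	if (!qp)
		return -1;	/* device cannot support this queue depth */
	ibv_destroy_qp(qp);
	return 0;
}
```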