fix(bitswap): wantlist overflow handling #629
Perhaps premature optimization (it depends on how large people set their limits and how frequently they're hit). It seems like we could spend a lot of time sorting here.
For the overflow list this is probably fine (it shouldn't be that big anyway and it's related to message size), but if nodes spend significant time near the limit they'll be doing that sort for basically every message that comes in (see the sketch below).
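For illustration, here is roughly the kind of per-message cost being described, as a hedged sketch (the type and function names are assumptions, not the PR's actual code):

```go
package overflow

import "sort"

// wantEntry is an illustrative stand-in for a queued want; only the field
// needed for ordering is shown.
type wantEntry struct {
	Priority int32
}

// sortByPriorityDesc is the kind of O(n log n) work a node sitting near its
// wantlist limit would repeat for nearly every incoming message.
func sortByPriorityDesc(wants []wantEntry) {
	sort.Slice(wants, func(i, j int) bool {
		return wants[i].Priority > wants[j].Priority
	})
}
```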
Can the PeerRequestQueue help us out here since it's already storing a prioritized queue of what needs to be done? It might not due to how the locking/concurrency works but could save a bunch of pain.
The PeerRequestQueue cannot really help out, since it maintains a heap, so getting an ordered list would require iterating the heap. I do not see a much less expensive alternative, other than just clearing the peer's wantlist at some point, such as when overflow is happening too frequently.
Maybe if overflow happens 5 times in a row for a particular peer, then clear that peer's message queue? WDYT?
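A minimal sketch of that suggestion, assuming a hypothetical per-peer counter (none of these names exist in the engine today; the threshold of 5 is just the number floated above):

```go
package overflow

// consecutiveOverflowLimit is a hypothetical threshold for clearing a peer's
// queued wants; 5 matches the number suggested in this thread.
const consecutiveOverflowLimit = 5

// peerState is an illustrative per-peer record; queued wants, ledger entries,
// etc. are omitted.
type peerState struct {
	overflowStreak int
}

// noteOverflow is called when a peer's incoming wants exceed the limit.
// It reports whether the peer's queued wants should be cleared.
func (p *peerState) noteOverflow() bool {
	p.overflowStreak++
	if p.overflowStreak >= consecutiveOverflowLimit {
		p.overflowStreak = 0
		return true // caller clears this peer's message queue / ledger
	}
	return false
}

// noteFit is called when a message is accepted without overflowing,
// breaking the streak.
func (p *peerState) noteFit() {
	p.overflowStreak = 0
}
```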
I'm not sure I get what the idea is here, whether it is necessary, or whether we can make this much cheaper.
In either case, it seems like we could add some extra data to the in-memory structs here rather than going to the blockstore to see if we have the data (and being at the mercy of whatever caching, bloom filters, etc. are used there).
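One way to read that suggestion, as a hedged sketch with hypothetical names (the real ledger types may look quite different): keep a per-want flag in the in-memory ledger so overflow handling never has to consult the blockstore.

```go
package overflow

import "github.com/ipfs/go-cid"

// ledgerEntry is a hypothetical in-memory record of a want received from a peer.
type ledgerEntry struct {
	c         cid.Cid
	priority  int32
	haveBlock bool // set when the engine learns the block is locally available
}

// skippableInOverflow reports whether this want can be ignored when deciding
// which existing wants to evict, without touching the blockstore.
func (e *ledgerEntry) skippableInOverflow() bool {
	return e.haveBlock
}
```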
The idea here is that there is a DONT_HAVE message queued for the peer, but it has not been sent yet and is blocking new messages from being queued for the peer. So, cancel the unsent DONT_HAVE and try to enqueue something possibly more important. Either a delayed HAVE message will replace the pending DONT_HAVE, or the peer can ask again later. This should keep messages moving, even if there is some backup sending DONT_HAVE messages to peers.
This also handles the case where a DONT_HAVE message has been sent but is not removed from the queue. Once a message is sent, the want is removed from the message queue and peer ledger only when blocks or block presence have been sent. If a DONT_HAVE was sent, the want remains in the queue and peer ledger as a placeholder in case a block arrives later, and this is what stops new wants from being accepted. This is what the 5th bullet in #527 is referring to.
So, in short, it handles both cases.
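For concreteness, a toy sketch of that behavior, heavily simplified (the real engine works through a peer ledger and task queue, not a flat map; all names here are hypothetical):

```go
package overflow

import "github.com/ipfs/go-cid"

type responseKind int

const (
	respBlock responseKind = iota
	respHave
	respDontHave
)

// pendingResponse is an illustrative record of what is queued for a peer.
type pendingResponse struct {
	kind responseKind
	sent bool
}

// makeRoom tries to free a slot in a full per-peer queue by cancelling an
// unsent DONT_HAVE, returning the CID that was dropped, if any.
func makeRoom(queue map[cid.Cid]pendingResponse) (cid.Cid, bool) {
	for c, p := range queue {
		if p.kind == respDontHave && !p.sent {
			// Drop the placeholder; the peer can ask for this CID again later,
			// or a delayed HAVE can be queued in its place.
			delete(queue, c)
			return c, true
		}
	}
	return cid.Undef, false
}
```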
Yes, the wants for which a block is found can be recorded in the peer ledger so that these can be ignored in overflow handling. However, that would need to be done in every call to engine.MessageReceived, and seems less preferable than doing something more expensive only during the exceptional case.
The task queue does already have this info, but using it would require locking the taskqueue and the peer tracker for each overflow want CID to look at. Alternatively, it would require a new taskqueue API to get a list of wants with HaveBlock set to true for a given peer. This last option might be less expensive than looking at the blockstore, but I was not comfortable with that amount of new plumbing for handling this bitswap exceptional case. WDYT?
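If that plumbing were ever added, the new taskqueue API being described might look something like the following hedged sketch (purely hypothetical; no such method exists in go-peertaskqueue today):

```go
package overflow

import (
	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p/core/peer"
)

// haveBlockLister sketches the extra taskqueue plumbing discussed above:
// a way to ask which of a peer's queued wants already have a local block,
// without touching the blockstore. This interface is hypothetical.
type haveBlockLister interface {
	// WantsWithBlock returns the CIDs queued for p whose tasks were created
	// with HaveBlock set to true.
	WantsWithBlock(p peer.ID) []cid.Cid
}
```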
Related to the above: is this about blocks that aren't present, or about subscriptions?
Note: the reason I'm pushing on the difference is that from my perspective subscriptions are much more expensive by virtue of occupying memory for an indefinite amount of time rather than a transient "while I'm sending out a response". Not sure if that's enough to justify different lists, but it's how I'm thinking in my review here (but lmk if you disagree or think I'm missing the point).
From the engine perspective, I do not think there is any need to distinguish between subscription and request-response, since I think that only determines how long a peer stays in the task queue/ledger.
Overall, it probably does make more sense to only do this overflow handling for subscriptions. I was thinking/hoping this would handle itself, since subscriptions are the ones primarily affected and so the ones needing overflow handling in the first place. I think some real-world use is necessary to determine this. I will add logging that can be used to determine when overflow handling is happening.
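The logging could be a single structured line wherever overflow handling runs, for example (a sketch only; the logger name and fields are assumptions):

```go
package overflow

import logging "github.com/ipfs/go-log/v2"

var log = logging.Logger("engine")

// logOverflow emits one structured line when overflow handling runs, so its
// real-world frequency can be measured. Parameter names are illustrative.
func logOverflow(peerID string, wants, overflow int) {
	log.Infow("wantlist overflow", "peer", peerID, "wants", wants, "overflow", overflow)
}
```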
In practice I expect this never to happen, given that (IIUC) the boxo client, which is the most widely used one, just decreases the priority over time:
boxo/bitswap/client/internal/messagequeue/messagequeue.go, line 305 at 733fa55
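Paraphrasing the referenced client behavior as a hedged, self-contained sketch (this is not the actual messagequeue code): each newly added want gets the next lower priority, so later wants never outrank earlier ones.

```go
package overflow

import "github.com/ipfs/go-cid"

// wantSender is a simplified stand-in for the client-side message queue.
type wantSender struct {
	priority int32
	wants    map[cid.Cid]int32
}

func newWantSender(initialPriority int32) *wantSender {
	return &wantSender{priority: initialPriority, wants: make(map[cid.Cid]int32)}
}

// addWant gives each newly requested CID the next lower priority, so a later
// want never outranks an earlier one.
func (w *wantSender) addWant(c cid.Cid) {
	w.wants[c] = w.priority
	w.priority--
}
```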
Given that this is the case it'd be good to:
I also think it is probably unlikely to happen, given that the client decreases priority over time. The thinking here was looking more toward a future where priority may be set by path distance, where items closer to a DAG root have a higher priority. In that case it seems more likely that, as new wants are requested, some do have a high priority because they are root or near-root items.
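A small sketch of that future direction (hypothetical; nothing in bitswap computes this today): derive a want's priority from its distance to the DAG root, so roots and near-root blocks sort first.

```go
package overflow

import "math"

// priorityFromDepth maps a block's distance from the DAG root to a priority:
// the root (depth 0) gets the highest priority and deeper blocks get
// progressively lower ones.
func priorityFromDepth(depth int) int32 {
	p := int64(math.MaxInt32) - int64(depth)
	if p < 0 {
		p = 0
	}
	return int32(p)
}
```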
This might be fine; however, do we want to do this here, given that there could be duplicates once we actually look at the wantlist?
I think it makes sense to truncate the list, but sorting can be avoided.
Truncation makes sense: the incoming wants are already unique, so even if there are no existing wants, or all possible wants have duplicates, there is still no way more than the limit can be used. If all those not added to the message queue are included in the overflow, they will end up getting dropped anyway because there will not be enough existing wants available to replace. Dropping them early also reduces the overflow that handleOverflow has to process.
Truncation without sorting: this will potentially lose higher-priority wants, but it avoids sorting an incoming wantlist of unknown size. In the usual case sorting is not needed anyway, so not sorting will avoid a performance hit.
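A sketch of truncation without sorting, under the assumption of a per-peer limit and arrival-ordered incoming wants (hypothetical names, not the PR's code):

```go
package overflow

// truncateWithoutSorting keeps at most limit incoming wants in arrival order
// and drops the rest. Compared with sorting first, this may lose some
// higher-priority wants during overflow, but slicing is O(1) instead of
// O(n log n), and in the usual (non-overflow) case nothing is dropped at all.
func truncateWithoutSorting[T any](incoming []T, limit int) []T {
	if limit < 0 {
		limit = 0
	}
	if len(incoming) <= limit {
		return incoming
	}
	return incoming[:limit]
}
```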
True, but then it is necessary to examine all the wants (a possibly excessive amount) to see if they can be added directly to the message queue, or need to be handled as overflow.