fix(iroh, iroh-relay)!: Optimise the relay datagram path through the MagicSock #3062
Conversation
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3062/docs/iroh/ (last updated: 2025-01-03T16:24:25Z)
@matheus23 this is still a draft, but you might already have opinions on the channel created.
The RelayActor needs to process all sent datagrams and pass them on to the correct ActiveRelayActor. But it also needs to process a few other inbox messages. The other inbox messages are higher priority, however, since they can heal the connected relays. So this adds a separate datagram send queue, so that the inbox is not flooded by datagrams.
This makes sure to send as many datagrams as quickly as possible to the right ActiveRelayActor, while at the same time not blocking the RelayActor itself. This ensures it can still process other inbox items and instruct the ActiveRelayActors appropriately.
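A minimal sketch of that split, assuming tokio channels; the message and item types here are illustrative placeholders, not the actual iroh types:

```rust
use tokio::sync::mpsc;

// Placeholder control messages that can heal relay connections.
enum RelayActorMessage {
    MaybeCloseRelaysOnRebind,
    SetHome { url: String },
}

// Placeholder for a datagram destined for a relay.
struct RelaySendItem {
    datagram: Vec<u8>,
}

async fn run(
    mut inbox: mpsc::Receiver<RelayActorMessage>,
    mut datagram_send_queue: mpsc::Receiver<RelaySendItem>,
) {
    loop {
        tokio::select! {
            // `biased` polls branches top-to-bottom, so the inbox is
            // always checked before the much busier datagram queue.
            biased;
            msg = inbox.recv() => {
                let Some(msg) = msg else { break };
                // Handle the control message (reconnects, home relay, ...).
                let _ = msg;
            }
            item = datagram_send_queue.recv() => {
                let Some(item) = item else { break };
                // Forward to the right ActiveRelayActor without blocking
                // the loop on a full downstream queue.
                let _ = item;
            }
        }
    }
}
```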
Force-pushed from 0abb367 to ef97b5a.
…#3068)

## Description

This removes an actor by making it a stream. The main (temporary) downside is that a read error no longer shuts down the Conn WriterTasks. This will not be an issue for long, as the WriterTasks are going away next and then the Client can manage this.

The RelayDatagramRecvQueue is grown to 512 datagrams. We used to keep this many frames in the per-relay stream, though that's potentially an awful lot. For datagrams we can assume that they will at most settle close to 1500 bytes each, so this buffer will end up being 750KiB max. That seems somewhat reasonable, though we could probably double it still. There will effectively no longer be a per-relay buffer, other than inside the relay's TCP stream.

## Breaking Changes

## Notes & open questions

This does some other renaming, e.g. `ConnReader` is now a `ConnFrameStream`, which is a bit more coherent. It also moves a bunch of code around in the `conn.rs` file to give it some more structure. This makes the diff harder to read than it needs to be. `process_incoming_frame` has not changed, and the main other change is that the reader task is no longer created and instead is moved into the `Stream` impl of the new `ConnMessageStream`.

This is open for review, though whether or not we want to merge it is another matter. I'll see how the next few PRs on top of this look to gauge the real impact of the WriterTasks not being shut down. Maybe we'll end up merging several PRs together before merging to main. But logically this is a nice dividing point to review.

## Change checklist

- [X] Self-review.
- [X] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant.
- [X] Tests if relevant.
- [X] All breaking changes documented.

---------

Co-authored-by: Friedel Ziegelmayer <[email protected]>
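A rough before/after sketch of the actor-to-stream move described in that commit message; `Frame` and both function names are hypothetical, not the actual iroh-relay API:

```rust
use futures_util::stream::{Stream, StreamExt};
use tokio::sync::mpsc;

// Stand-in for the relay protocol frames.
type Frame = Vec<u8>;

// Before (sketch): a spawned reader actor forwards frames over a
// channel, costing a task, a channel and an extra buffer.
fn spawn_reader(
    mut frames: impl Stream<Item = Frame> + Unpin + Send + 'static,
) -> mpsc::Receiver<Frame> {
    let (tx, rx) = mpsc::channel(64);
    tokio::spawn(async move {
        while let Some(frame) = frames.next().await {
            if tx.send(frame).await.is_err() {
                break; // receiver dropped, stop reading
            }
        }
    });
    rx
}

// After (sketch): callers poll the frame source directly as a Stream.
// The task and channel disappear, and backpressure comes from the
// underlying TCP stream itself.
fn as_frame_stream(
    frames: impl Stream<Item = Frame> + Unpin,
) -> impl Stream<Item = Frame> + Unpin {
    frames
}
```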
```rust
/// This waker is used by [`IoPoller`] and the [`RelayActor`] to signal when more
/// datagrams can be sent to the relays.
relay_send_waker: Arc<std::sync::Mutex<Option<Waker>>>,
relay_datagram_recv_queue: Arc<RelayDatagramRecvQueue>,
```
Seems inconsistent, one needs an `Arc` and one doesn't?
Well, one is a `Sender` which is `Clone`. The other is a queue. I don't think this is a big deal, just an artifact of how the channels/queues are built internally.
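For illustration, assuming tokio's `mpsc` and the `concurrent-queue` crate: a sender is a cloneable handle to internally shared state, while a bare queue is the state itself and needs an explicit `Arc` to be shared.

```rust
use std::sync::Arc;

use concurrent_queue::ConcurrentQueue;
use tokio::sync::mpsc;

fn main() {
    // A Sender already wraps shared state internally, so handing it to
    // another task is just a clone.
    let (tx, _rx) = mpsc::channel::<u8>(16);
    let tx2 = tx.clone();

    // A bare queue has no such wrapper; sharing it requires an Arc.
    let queue = Arc::new(ConcurrentQueue::<u8>::bounded(16));
    let queue2 = Arc::clone(&queue);

    let _ = (tx2, queue2);
}
```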
```diff
 /// Creates a new, empty queue with a fixed size bound of 128 items.
 fn new() -> Self {
     Self {
-        queue: ConcurrentQueue::bounded(128),
+        queue: ConcurrentQueue::bounded(512),
```
why this number? also the comment is now outdated
This is based on the previous per-client queues, which were this size IIRC.
At 1500 bytes/packet this is 750KiB of buffer. I think ideally the datagram queues would all have a capacity based on the number of datagram bytes queued rather than the number of items. I'm using 1500 bytes/packet to estimate, but with GSO etc. it could also be much larger, as the max packet size is 64KiB; being that big is probably not common though.
But I think basing buffer capacity on buffered bytes should be a separate effort, and it isn't very high priority right now.
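A hypothetical sketch of that idea (not code from this PR): a queue that bounds the total buffered bytes instead of the item count, so a few 64KiB GSO batches consume the same budget as many 1500-byte packets.

```rust
use std::collections::VecDeque;

/// Hypothetical byte-bounded datagram queue.
struct ByteBoundedQueue {
    items: VecDeque<Vec<u8>>,
    buffered_bytes: usize,
    max_bytes: usize, // e.g. 750 * 1024 to match the 512-item estimate
}

impl ByteBoundedQueue {
    fn new(max_bytes: usize) -> Self {
        Self {
            items: VecDeque::new(),
            buffered_bytes: 0,
            max_bytes,
        }
    }

    /// Rejects the datagram if it would exceed the byte budget.
    fn push(&mut self, datagram: Vec<u8>) -> Result<(), Vec<u8>> {
        if self.buffered_bytes + datagram.len() > self.max_bytes {
            return Err(datagram);
        }
        self.buffered_bytes += datagram.len();
        self.items.push_back(datagram);
        Ok(())
    }

    fn pop(&mut self) -> Option<Vec<u8>> {
        let datagram = self.items.pop_front()?;
        self.buffered_bytes -= datagram.len();
        Some(datagram)
    }
}
```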
```rust
            break;
        };
        match self
            .run_connected(client)
```
I don't know if it makes this code more complex, but I am wondering if it would be better to model the state like this:

```rust
enum ActiveRelayActor {
    Dialing {
        // ..
    },
    Connected {
        // ..
    },
}
```
Currently the state in the `ActiveRelayActor` itself is used in all 3 states the actor can be in. The `ConnectedRelayState` is only shared between the sending and connected states. I don't feel like trying to put this state in the actor itself will help much; you'd still have `run_sending` implemented in the disconnected state, but this is already alleviated by it needing the `ConnectedRelayState`. What might work is to create `ConnectedRelay` and `DisconnectedRelay` and impl the methods on there. They could probably take a ref to the data in the `ActiveRelayActor`; a rough sketch follows below.
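A rough sketch of that shape; all fields and method bodies are placeholders, not the actual iroh types:

```rust
// Data used by the actor in all of its states.
struct SharedState {}

// State shared only by the sending and connected states.
struct ConnectedRelayState {}

struct DisconnectedRelay<'a> {
    shared: &'a mut SharedState,
}

struct ConnectedRelay<'a> {
    shared: &'a mut SharedState,
    state: ConnectedRelayState,
}

impl<'a> DisconnectedRelay<'a> {
    /// Dials until a connection is up, then hands over to the
    /// connected representation.
    async fn run(self) -> ConnectedRelay<'a> {
        ConnectedRelay {
            shared: self.shared,
            state: ConnectedRelayState {},
        }
    }
}

impl ConnectedRelay<'_> {
    /// Runs until the connection breaks; the caller then drops back
    /// into a DisconnectedRelay.
    async fn run(self) {}
}
```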
Didn't take a super close look at relay_actor.rs yet, given there's another stacked PR open about it right now.

I disagree with the `Conn::Derp` -> `Conn::Relay` rename and related. I've commented on that.

Otherwise only small stuff, some comments, etc. I had a hard time trying to figure out if this is a true refactor because of the changes, so I gave up on that and mostly looked at what the code looks like now, which tbh is so much better.

That's just a draft to explore some possible futures; I don't really intend to merge that in here if folks are happy enough with the current code structure (I am).

I'm pretty agnostic to that. Not what I care about here. Can we leave these for now, or do you fear it will introduce too much churn?

What do you mean "true refactor"? For iroh this is all internal and not visible. For iroh-relay this involves public API changes. But it still is a refactor and not a feature, I'd say.
LGTM. We can leave the derp -> relay changes.

> What do you mean "true refactor"?

Ah right. I categorize changes into "refactor"/"not refactor" somewhat strictly, in the sense that everything that doesn't change program behavior observably, but changes code structure, is a ("true") refactor.
Something that's supposed to be a refactor but actually changes behavior I'd say is not a "true refactor".

Yeah ok. Probably not a true refactor in that case. Another question is whether this should really be classified as a fix? It should close #3008, #2971 and #2951? Oh well, I'll rename this fix :)
Description

This refactors how datagrams flow from the MagicSock (`AsyncUdpSocket`) to the relay server and back. It also vastly simplifies the actors involved in communicating with a relay server.

- The `RelayActor` manages all connections to relay servers.
  - It starts a new `ActiveRelayActor` for each relay server needed.
  - The `ActiveRelayActor` will exit when unused.
  - Each `ActiveRelayActor` uses a relay `Client`.
  - The relay `Client` is now a `Stream` and `Sink` directly connected to the `TcpStream` connected to the relay server. This eliminates several actors previously used here in the `Client` and `Conn`.
- Each `ActiveRelayActor` will try and maintain a connection with the relay server.
- When the `AsyncUdpSocket` needs to send datagrams:
  - They are put on a queue to the `RelayActor`.
  - The `RelayActor` ensures the correct `ActiveRelayActor` is running and forwards datagrams to it.
  - The `ActiveRelayActor` sends datagrams directly to the relay server.
- On the receive path:
  - Whenever the `ActiveRelayActor` is connected it reads from the underlying `TcpStream`.
  - Received datagrams do not interact with the `RelayActor` and go straight to the `AsyncUdpSocket` interface (a sketch of this queue follows after this list).
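As a rough illustration of that receive path (a simplification assuming the `concurrent-queue` crate, not the PR's actual implementation): the `ActiveRelayActor` pushes received datagrams onto a shared queue, and the `AsyncUdpSocket` side polls them out, parking a waker when the queue is empty.

```rust
use std::sync::Mutex;
use std::task::{Context, Poll, Waker};

use concurrent_queue::ConcurrentQueue;

struct RelayDatagramRecvQueue {
    queue: ConcurrentQueue<Vec<u8>>,
    waker: Mutex<Option<Waker>>,
}

impl RelayDatagramRecvQueue {
    fn new() -> Self {
        Self {
            queue: ConcurrentQueue::bounded(512),
            waker: Mutex::new(None),
        }
    }

    /// Called by the ActiveRelayActor with a freshly received datagram.
    fn try_send(&self, datagram: Vec<u8>) -> Result<(), Vec<u8>> {
        self.queue.push(datagram).map_err(|err| err.into_inner())?;
        // Wake the socket if it is parked waiting for datagrams.
        if let Some(waker) = self.waker.lock().unwrap().take() {
            waker.wake();
        }
        Ok(())
    }

    /// Called from the AsyncUdpSocket's poll_recv implementation,
    /// bypassing the RelayActor entirely.
    fn poll_recv(&self, cx: &mut Context<'_>) -> Poll<Vec<u8>> {
        if let Ok(datagram) = self.queue.pop() {
            return Poll::Ready(datagram);
        }
        // Park, then re-check to close the race with a concurrent push.
        *self.waker.lock().unwrap() = Some(cx.waker().clone());
        match self.queue.pop() {
            Ok(datagram) => Poll::Ready(datagram),
            Err(_) => Poll::Pending,
        }
    }
}
```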
Along the way many bugs are fixed. Some of them:

- The `AsyncUdpSocket` behaves better, no longer returning `Poll::Pending` more than necessary (#3067).
- The `RelayActor` now avoids blocking. This means it can still react to events when the datagram queues are full, and reconnect relay servers etc. as needed to unblock.
- The `ActiveRelayActor` also avoids blocking, allowing it to react to connection breakage and the need to reconnect at any time.
- The `ActiveRelayActor` now correctly handles connection errors and retries with backoff.
- The `ActiveRelayActor` will no longer queue unsent datagrams forever, but flushes them every 400ms (sketched below). This stops the `RelayActor` from completely blocking.
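The 400ms flush behaviour can be pictured with this hypothetical loop (names and structure are placeholders, not the PR's code): the deadline is armed when the first datagram is buffered and is not postponed by later arrivals.

```rust
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::{sleep_until, Instant};

async fn send_loop(mut datagrams: mpsc::Receiver<Vec<u8>>) {
    let mut pending: Vec<Vec<u8>> = Vec::new();
    // Flush deadline for the oldest buffered datagram, if any.
    let mut deadline: Option<Instant> = None;

    loop {
        // Dummy far-future instant while nothing is buffered; that
        // select! branch is disabled in that case anyway.
        let flush_at =
            deadline.unwrap_or_else(|| Instant::now() + Duration::from_secs(3600));
        tokio::select! {
            maybe = datagrams.recv() => {
                let Some(datagram) = maybe else { break };
                if pending.is_empty() {
                    // Arm the deadline for the first buffered item only,
                    // so steady traffic cannot postpone it forever.
                    deadline = Some(Instant::now() + Duration::from_millis(400));
                }
                pending.push(datagram);
                // A real implementation would first try a non-blocking
                // write to the relay connection here.
            }
            _ = sleep_until(flush_at), if deadline.is_some() => {
                // Deadline reached: force out everything buffered.
                // flush_to_relay(&mut pending).await;  (hypothetical)
                pending.clear();
                deadline = None;
            }
        }
    }
}
```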
Breaking Changes

iroh-relay:

- `Conn` is no longer public.
- `Client` is completely changed. See module docs.

Notes & open questions
- `Conn` and `Client` don't need to be two separate things now? Though `Client` is convenient, as it only implements one `Sink` interface, while `Conn` is also a `Frame` sink. This means on `Conn` you often have to use very annoying syntax when calling things like `.flush()` or `.close()` etc.
- The `ActiveRelayActor` can be moved back into the relay `Client`, though that would probably require some gymnastics. The current structure of the `ActiveRelayActor` is fairly reasonable and handles things correctly, though it does have a lot of client knowledge baked in. Being able to reason about the client as a stream + sink is what enabled me to write the good `ActiveRelayActor` though, so I'm fairly happy that this code makes sense as it is.

If all goes well this should:
Closes #3008
Closes #2971
Closes #2951
Change checklist