Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(iroh, iroh-relay)!: Optimise the relay datagram path through the MagicSock #3062

Merged
merged 18 commits into from
Jan 3, 2025

Conversation

flub
Copy link
Contributor

@flub flub commented Dec 18, 2024

Description

This refactors how datagrams flow from the MagicSock (AsyncUdpSocket) to relay server and back. It also vastly simplifies the actors involved in communicating with a relay server.

  • The RelayActor managed all connections to relay servers.
    • It starts a new ActiveRelayActor for each relay server needed.
    • The ActiveRelayActor will exit when unused.
      • Unless it is for the home relay, this one never exits.
    • Each ActiveRelayActor uses a relay Client.
      • The relay Client is now a Stream and Sink directly connected to the TcpStream connected to the relay server. This eliminates several actors previously used here in the Client and Conn.
    • Each ActiveRelayActor will try and maintain a connection with the relay server.
      • If connections fail, exponential backoff is used for reconnections.
  • When AsyncUdpSocket needs to send datagrams:
    • It (now) puts them on a queue to the RelayActor.
    • The RelayActor ensures the correct ActiveRelayActor is running and forwards datagrams to it.
    • The ActiveRelayActor sends datagrams directly to the relay server.
  • The relay receive path is now:
    • Whenever ActiveRelayActor is connected it reads from the underlying TcpStream.
    • Received datagrams are placed on an mpsc channel that now bypasses the RelayActor and goes straight to the AsyncUpdSocket interface.

Along the way many bugs are fixed. Some of them:

  • The relay datagrams send and receive queue now behave more correctly when they are full. So the AsyncUdpSocket behaves better.
  • The RelayActor now avoids blocking. This means it can still react to events when the datagrams queues are full and reconnect relay servers etc as needed to unblock.
  • The ActiveRelayActor also avoids blocking. Allowing it to react to connection breakage and the need to reconnect at any time.
  • The ActiveRelayActor now correctly handles connection errors and retries with backoff.
  • The ActiveRleayActor will no longer queue unsent datagrams forever, but flush them every 400ms.
    • This also stops the send queue into the RelayActor from completely blocking.

Breaking Changes

iroh-relay

  • Conn is no longer public.
  • The Client is completely changed. See module docs.

Notes & open questions

  • Potentially the relay Conn and Client don't need to be two separate things now? Though Client is convenient as it only implements one Sink interface, while Conn is also a Frame sink. This means on Conn you often have to use very annoying syntax when calling things like .flush() or .close() etc.
  • Maybe a few items from the ActiveRelayActor can be moved back into the relay Client, though that would probably require some gymnastics. The current structure of ActiveRelayActor is fairly reasonable and handles things correctly. Though it does have a lot of client knowledge baked in. Being able to reason about the client as a stream + sink is what enabled me to write the good ActiveRelayActor though, so I'm fairly happy that this code makes sense as it is.

If all goes well this should:
Closes #3008
Closes #2971
Closes #2951

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.

Copy link

github-actions bot commented Dec 18, 2024

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3062/docs/iroh/

Last updated: 2025-01-03T16:24:25Z

Copy link

github-actions bot commented Dec 18, 2024

Netsim report & logs for this PR have been generated and is available at: LOGS
This report will remain available for 3 days.

Last updated for commit: 0dbcaf1

@flub flub requested a review from matheus23 December 18, 2024 17:21
@flub
Copy link
Contributor Author

flub commented Dec 18, 2024

@matheus23 this is still draft but you might already have opinions on the channel created (RelayDatagramSendChannelSender and RelayDatagramSendChannelReceiver), which are -pending review- final already.

flub added 6 commits December 19, 2024 17:52
The RelayActor needs to process all sent datagrams and pass them onto
the correct ActiveRelayActor.  But it also needs to process a few
other inbox messages.

The other inbox messages are higher priority however, since they can
heal the connected relays.  So this adds a separate datagram send
queue, so that the inbox is not flooded by datagrams.
This makes sure to send as many datagrams as quickly as possible to
the right ActiveRelayActor but at the same time not block the
RelayActor itself.  This ensures it can still process other inbox
items and instruct the ActiveRelayActors appropriately.
@flub flub force-pushed the flub/relay-actor-datagram-send-queue branch from 0abb367 to ef97b5a Compare December 19, 2024 16:53
@flub flub changed the base branch from flub/relay-actor-inactive-timeout to main December 20, 2024 10:27
@flub flub marked this pull request as ready for review December 20, 2024 10:44
@dignifiedquire dignifiedquire added this to the v0.31.0 milestone Jan 2, 2025
…#3068)

## Description

This removes an actor by making it a stream.

The main (temporary) downside is that a read error no longer shuts down
the Conn WriterTasks. This will not be an issue for that long as the
WriterTasks are going away next and then the Client can manage this.

The RelayDatagramRecvQueue is grown to 512 datagrams. We used to keep
this many frames in the per-relay stream. Though that's potentially an
awful lot. For datagrams we can assume that they will at max settle
close to 1500 bytes each, so this buffer will end up being 750KiB max.
That seems somewhat reasonable, though we could probably double it
still.

There will effectively no longer be a per-relay buffer - other than
inside the relay's TCP stream.

## Breaking Changes

<!-- Optional, if there are any breaking changes document them,
including how to migrate older code. -->

## Notes & open questions

This does some other renaming, e.g. `ConnReader` is now a
`ConnFrameStream` which is a bit more coherent. It also moves a bunch of
code around in the `conn.rs` file to give it some more stucture. This
makes the diff harder to read then it needs to be.
`process_incoming_frame` has not changed and the main other change is
that the reader task is no longer created and instead is moved into the
`Stream` impl of the new `ConnMessageStream`.

This is open for review. Though whether or not we want to merge it is
another matter. I'll see for how the next few PRs on top of this look to
see what the real impact of the WriterTasks not being shut down is.
Maybe we'll end up merging several PRs together before merging to main.
But logically this is a nice dividing point to review.

## Change checklist

- [X] Self-review.
- [X] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [X] Tests if relevant.
- [X] All breaking changes documented.

---------

Co-authored-by: Friedel Ziegelmayer <[email protected]>
@flub flub changed the title refactor(iroh): Add datagram send queue to RelayActor refactor(iroh, iroh-relay): Optimise the relay datagram path through the MagicSock Jan 3, 2025
@flub flub changed the title refactor(iroh, iroh-relay): Optimise the relay datagram path through the MagicSock refactor(iroh, iroh-relay)!: Optimise the relay datagram path through the MagicSock Jan 3, 2025
/// This waker is used by [`IoPoller`] and the [`RelayActor`] to signal when more
/// datagrams can be sent to the relays.
relay_send_waker: Arc<std::sync::Mutex<Option<Waker>>>,
relay_datagram_recv_queue: Arc<RelayDatagramRecvQueue>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems inconsistent, one neds an Arc and one doesn't?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, one is a Sender which is Clone. The other is a queue. I don't think this is a big deal, just an artifact of how the channels/queues are built internally.

/// Creates a new, empty queue with a fixed size bound of 128 items.
fn new() -> Self {
Self {
queue: ConcurrentQueue::bounded(128),
queue: ConcurrentQueue::bounded(512),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this number? also the comment is now outdated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is based on the previous per client queues which were this size IIRC.

At 1500 bytes/packet this is 750KiB of buffer. I think ideally the datagram queues would all have a capacity based on the number of datagram bytes queued rather than number of items. I'm using 1500 bytes/packet to estimate, but with GSO etc it could also be much larger as the max packet size is 64KiB, but being that big is probably not common.

But I think doing buffers capacity on buffered bytes should be a separate effort, and it isn't very high priority right now.

break;
};
match self
.run_connected(client)
Copy link
Contributor

@dignifiedquire dignifiedquire Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it makes this code more complex, but I am wondering if it would be better to model the state like this

enum ActiveRelayActor {
  Dialing {
    // ..
  },
  Connected {
    // .. 
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the state in the ActiveRelayActor itself is used in all 3 states the actor can be in. The ConnectedRelayState is only shared between sending and connected states. I don't feel like trying to put this state in the actor itself will help much, you'd still have run_sending implemented in the disconnected state, but this is already alleviated by it needing the ConnectedRelayState. What might work is create ConnectedRelay and DisconnectedRelay and impl the methods on there. They could take a ref to the data in the ActiveRelayActor probably.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@matheus23 matheus23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't take a super close look at relay_actor.rs yet, given there's another stacked PR open about it right now.

I disagree with the Conn::Derp -> Conn::Relay rename and related. I've commented on that.

Otherwise only small stuff, some comments, etc. I had a hard time trying to figure out if this is a true refactor, because of the changes, so I gave up on that and mostly looked at what the code looks like now, which tbh is so much better.

iroh-relay/src/client.rs Outdated Show resolved Hide resolved
iroh-relay/src/client.rs Outdated Show resolved Hide resolved
iroh-relay/src/client/conn.rs Show resolved Hide resolved
iroh-relay/src/client/conn.rs Show resolved Hide resolved
iroh-relay/src/client/conn.rs Outdated Show resolved Hide resolved
iroh-relay/src/server.rs Outdated Show resolved Hide resolved
iroh-relay/src/server/metrics.rs Show resolved Hide resolved
iroh/src/magicsock/relay_actor.rs Outdated Show resolved Hide resolved
iroh/src/magicsock/relay_actor.rs Outdated Show resolved Hide resolved
iroh/src/magicsock/relay_actor.rs Outdated Show resolved Hide resolved
@flub
Copy link
Contributor Author

flub commented Jan 3, 2025

Didn't take a super close look at relay_actor.rs yet, given there's another stacked PR open about it right now.

That's just a draft to explore some possible futures, I don't really intend to merge that in here if folks are happy enough with the current code structure (I am).

I disagree with the Conn::Derp -> Conn::Relay rename and related. I've commented on that.

I'm pretty agnostic to that. Not what I care about here. Can we leave these for now or do you fear it will introduce too much churn?

Otherwise only small stuff, some comments, etc. I had a hard time trying to figure out if this is a true refactor, because of the changes, so I gave up on that and mostly looked at what the code looks like now, which tbh is so much better.

What do you mean "true refactor"? For iroh this is all internal and not visible. For iroh-relay this involes public API changes. But still is a refactor and not a feature I'd say.

Copy link
Contributor

@matheus23 matheus23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We can leave the derp -> relay changes.

What do you mean "true refactor"?

Ah right. I categorize changes into "refactor"/"not refactor" somewhat strictly in the sense that everything that doesn't change program behavior observably, but changes code structure is a ("true") refactor.
Something that's supposed to be a refactor, but actually changes behavior I'd say is not a "true refactor".

@flub
Copy link
Contributor Author

flub commented Jan 3, 2025

LGTM. We can leave the derp -> relay changes.

What do you mean "true refactor"?

Ah right. I categorize changes into "refactor"/"not refactor" somewhat strictly in the sense that everything that doesn't change program behavior observably, but changes code structure is a ("true") refactor. Something that's supposed to be a refactor, but actually changes behavior I'd say is not a "true refactor".

Yeah ok. Probably not a true refactor in that case. Another question is whether this should really be classified as a fix? It should close #3008, #2971 and #2951? Oh well, I'll rename this fix :)

@flub flub changed the title refactor(iroh, iroh-relay)!: Optimise the relay datagram path through the MagicSock fix(iroh, iroh-relay)!: Optimise the relay datagram path through the MagicSock Jan 3, 2025
@flub flub enabled auto-merge January 3, 2025 16:27
@flub flub added this pull request to the merge queue Jan 3, 2025
Merged via the queue into main with commit 7ad531e Jan 3, 2025
25 of 26 checks passed
@flub flub deleted the flub/relay-actor-datagram-send-queue branch January 3, 2025 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
3 participants