AutoNAT: multiple opened connection #2515

hamamo · 2022-02-14T11:02:30Z

It seems that AutoNAT opens peer connections as part of its probing loop but does not close them again, so after some time connected peers have dozens of open connections between them. I only have a very rudimentary understanding of the implications but it doesn't look quite right :-)

elenaf9 · 2022-02-14T12:38:42Z

From what I understand, connections don't need to be explicitly closed.
Instead, a connection is generally only kept alive while the handler of at least one network behaviour reports KeepAlive::True (ref). The AutoNAT protocol internally wraps the request-response protocol: a probe is an outbound request, and the dial-response from the server is the response. After the server sent the response, the handler in the request-response protocol should not be keeping the connection alive anymore (only for 10s, as this is its default connection timeout), hence it should close. But in case that you are using any other network behaviour protocols in combination with AutoNAT, it could be that they keep the connection alive?
@hamamo Could you share a bit more information about the behaviours that you are using?
Or maybe I am missing something here, so please correct if that is the case.

hamamo · 2022-02-18T08:30:15Z

@elenaf9 I'm using ping to keep connections alive, that may be the cause of the observed behavior. Since my application relies on timely distribution of information via gossipsub, I'm trying to keep the nodes connected most of the time.
This may be something I change when I have a better sync mechanism for historical data, and nodes can catch up when they reconnect, but I'd rather have autonat and ping behave well together :-)

rkuhn · 2022-02-18T08:59:45Z

same here

elenaf9 · 2022-02-18T10:33:20Z

Since my application relies on timely distribution of information via gossipsub, I'm trying to keep the nodes connected most of the time.

I am not too familiar with the details of gossipsub, but doesn't the gossipsub protocol itself already takes care that you stay connected with the peer in a mesh? What's the benefit of also using the ping for keep-alive?

rust-libp2p/protocols/gossipsub/src/handler.rs

Lines 271 to 275 in eeb3504

    
           // If we have joined the mesh, keep the connection alive. 
        
           GossipsubHandlerIn::JoinedMesh => { 
        
               self.in_mesh = true; 
        
               self.keep_alive = KeepAlive::Yes; 
        
           }

But independent of that, I still agree that autonat and ping should behave well together. What I am not so sure about is whether this is the responsibility of ping, or of autonat.

The whole logic around keep-alive follows more the idea of "every behaviour protocol decides for themself if they need the substream or not". Afaik none of the protocols explicitly closes a connection? Instead, even after the protocol that initiated the connection doesn't need that connection anymore, it is kept open "in case that any other protocol needs it". And imho network behaviours should not rely on other logic to explicitly close connections again. It could also be the case that the user frequently calls Swarm::dial for whatever reason, in which case we would run into the same problem.

I think ping is really the only protocol that just blindly keeps every substream alive. And the description for its keep-alive actually states that it keeps only the connection alive, not every substream. So what I would propose instead is to change Ping so that it only keeps at least one connection to a peer alive, but not every single one.

Overall, I am myself a bit undecided on this. If you still think that AutoNAT should explicitly clean up the connection that are opened in a dial-attempt I am fine with that as well, as I definitely see the reason for it. But my above point still stands.
What do you think?

mxinden · 2022-02-18T18:38:15Z

I'm using ping to keep connections alive, that may be the cause of the observed behavior.

Do I understand correctly that you are setting PingConfig::with_keep_alive(true) @hamamo?

rust-libp2p/protocols/ping/src/handler.rs

Lines 52 to 54 in 2ff9ace

    
               /// Whether the connection should generally be kept alive unless 
        
               /// `max_failures` occur. 
        
               keep_alive: bool,

Since my application relies on timely distribution of information via gossipsub, I'm trying to keep the nodes connected most of the time.

Note that connections are kept alive, even without PingConfig::keep_alive(true), as long as they are being used. E.g., as @elenaf9 mentioned above, in the case of gossipsub that means that a connection is kept alive as long as the peer is part of the corresponding mesh.

@hamamo if you can provide a small reproducer, or point us to your code itself, that would be great.

I know we are not doing a great job documenting the connection lifecycle mechanism in libp2p. Let me know, I can extend the documentation by a paragraph or two in the next couple of days.

hamamo · 2022-02-26T10:11:42Z

Today I tried setting keep_alive in PingConfig to false, but that causes the connection to be closed, and GossipSub events to be missed. So if a mesh is established but not permanently used it does not seem to keep connections alive, and GossipSub also doesn't attempt to re-dial peers known to be part of the mesh.

@mxinden: The code is at https://github.com/hamamo/reputation-net/tree/tokio (the master branch where I used async_std isn't updated yet). I know this is a big ball of spaghetti, didn't have the time yet to come up with a clean small example showing the issue.

mxinden · 2022-03-01T10:08:58Z

So if a mesh is established but not permanently used it does not seem to keep connections alive

That is surprising to me. Once a peer joins a mesh, the in_mesh flag on the corresponding ConnectionHandler should be set to true and thus the connection should be kept alive.

https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/handler.rs#L132-L134

Could you check whether a peer successfully joins a mesh but is still disconnected? Also could you post the logs (RUST_LOG=debug) of a small reputation-net reproducing the issue @hamamo?

mxinden · 2022-03-01T10:14:58Z

Just a wild guess, seeing logic in reputation_net/mod.rs related to awaiting storage operations within the custom NetworkBehaviour. Do these operations block the main libp2p task for longer period of time? If so, this might cause connections to time out. Let's see what the logs say.

https://github.com/hamamo/reputation-net/blob/tokio/src/reputation_net/mod.rs

thomaseizinger · 2023-09-20T02:17:31Z

We are reworking idle connection management, see #4306.

I am closing this as stale.

thomaseizinger closed this as not planned Won't fix, can't repro, duplicate, stale Sep 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AutoNAT: multiple opened connection #2515

AutoNAT: multiple opened connection #2515

hamamo commented Feb 14, 2022

elenaf9 commented Feb 14, 2022 •

edited

Loading

hamamo commented Feb 18, 2022

rkuhn commented Feb 18, 2022

elenaf9 commented Feb 18, 2022 •

edited

Loading

mxinden commented Feb 18, 2022

hamamo commented Feb 26, 2022

mxinden commented Mar 1, 2022

mxinden commented Mar 1, 2022

thomaseizinger commented Sep 20, 2023

AutoNAT: multiple opened connection #2515

AutoNAT: multiple opened connection #2515

Comments

hamamo commented Feb 14, 2022

elenaf9 commented Feb 14, 2022 • edited Loading

hamamo commented Feb 18, 2022

rkuhn commented Feb 18, 2022

elenaf9 commented Feb 18, 2022 • edited Loading

mxinden commented Feb 18, 2022

hamamo commented Feb 26, 2022

mxinden commented Mar 1, 2022

mxinden commented Mar 1, 2022

thomaseizinger commented Sep 20, 2023

elenaf9 commented Feb 14, 2022 •

edited

Loading

elenaf9 commented Feb 18, 2022 •

edited

Loading