Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Lease and auto-reconnect not working properly in FreeRTOS platform #865

Closed
pdematteis opened this issue Jan 28, 2025 · 4 comments · Fixed by #867
Closed

[Bug] Lease and auto-reconnect not working properly in FreeRTOS platform #865

pdematteis opened this issue Jan 28, 2025 · 4 comments · Fixed by #867
Assignees
Labels
bug Something isn't working

Comments

@pdematteis
Copy link

Describe the bug

The current lease implementation, when an error occurs, calls _z_unicast_transport_clear(), that in turn calls _z_task_detach().
In FreeRTOS, it finally calls vTaskDelete(), which imply that the task is immediately cancelled and does not continue; cleanup code is never reached, as well as auto-reconnect code.

A cleaner solution, according to me, is just to set ztu->_common._lease_task_running = false, then deal with cleanup and reconnection from another task.

As a side note, auto-reconnect from within the lease task, also causes a second lease task to be started in parallel. It uses too much heap in FreeRTOS and can't use static tasks allocation, because the same buffer would be used by the two lease tasks concurrently.

To reproduce

.

System info

.

@pdematteis pdematteis added the bug Something isn't working label Jan 28, 2025
@sashacmc
Copy link
Member

sashacmc commented Jan 28, 2025

Dear @pdematteis,

This is a known limitation of the connection recovery mechanism.
The connection recovery mechanism is not only the ability to connect to the same router, but also the ability to connect to any other if the previous router is unavailable.
In this case, the connection to another router may be via a different protocol and therefore restarting the lease and read tasks is necessary.
But perhaps for the case with FreeRTOS a more correct completion of the task restart is needed.

@pdematteis
Copy link
Author

There is also another annoying side effect of using auto-reconnect, at least in my implementation.
If there is a publishing loop, say in a secondary thread, and the lease task closes the transport, then the publish will throw assert messages, since the buffers used have been disposed. To handle this, good synchronization would be needed to avoid race conditions.

I finally found a good solution by using this cleanup function:

static void _zp_unicast_failed(_z_transport_unicast_t *ztu) {
    _z_unicast_transport_close(ztu, _Z_CLOSE_EXPIRED);
#if Z_FEATURE_AUTO_RECONNECT == 1
    _z_unicast_transport_clear(ztu, true);
    _z_session_rc_ref_t *zs = ztu->_common._session;
    z_result_t ret = _z_reopen(zs);
    if (ret != _Z_RES_OK) {
        _Z_ERROR("Reopen failed: %i", ret);
    }
#else
    ztu->_common._lease_task_running = false;
#endif
}

with a support task performing a loop like:

for(;;)
{
	// Initialize
	Init();
	// Loop until lease task fails
	const z_loaned_session_t *zs = z_loan(session);
	while (_Z_RC_IN_VAL(zs)->_tp._transport._unicast._common._lease_task_running) {
		osDelay(100);
	}
	// Cleanup
	Close();
	// Allow background tasks to cleanup themselves
	osDelay(100);
}

where Init() opens session, publishers, subscribers and Close() shuts down everything.

Thanks.

@sashacmc
Copy link
Member

sashacmc commented Feb 3, 2025

Dear @pdematteis

If it's not too much trouble, please check this solution for your specific case.

The problem that it is impossible to send messages (as well as asserts) during a restart remains at the moment, this is the expected behavior. The main thing is that the system does not crash or hangs.

There are also plans to create an API for notification of connection loss, but I have no deadlines for its implementation.

@sashacmc sashacmc self-assigned this Feb 3, 2025
@pdematteis
Copy link
Author

Hello @sashacmc

The overall cleanup stuff looks good.
Anyway, I my implementation, because of memory constrain and synchronization issues, I can't use auto-reconnect and I'd prefer to use this cleanup function, then handle the cleanup + reconnection externally.

static void _zp_unicast_failed(_z_transport_unicast_t *ztu) {
    _z_unicast_transport_close(ztu, _Z_CLOSE_EXPIRED);
#if Z_FEATURE_AUTO_RECONNECT == 1
    _z_unicast_transport_clear(ztu, true);

#if Z_FEATURE_LIVELINESS == 1 && Z_FEATURE_SUBSCRIPTION == 1
    _z_liveliness_subscription_undeclare_all(_Z_RC_IN_VAL(ztu->_common._session));
#endif
    _z_session_rc_ref_t *zs = ztu->_common._session;
    z_result_t ret = _z_reopen(zs);
    if (ret != _Z_RES_OK) {
        _Z_ERROR("Reopen failed: %i", ret);
    }
#else
    ztu->_common._lease_task_running = false;
#endif
}

Dropping the transport in the lease task without reconnecting, makes things tricky while trying to cleanup from the main task.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants