No wifi reconnection after connection loss - Heishamon Large (ESP32) #621

Open
oxyde42 opened this issue Jan 8, 2025 · 40 comments

@oxyde42

oxyde42 commented Jan 8, 2025

Sometimes my Heishamon loses the wifi connection, but there is no reconnection.

Has this been forgotten for the ESP32?

void check_wifi() {
  int wifistatus = WiFi.status();
  if ((wifistatus != WL_CONNECTED) && (WiFi.localIP())) {
    // special case where it seems that we are not connect but we do have working IP (causing the -1% wifi signal), do a reset.
#ifdef ESP8266
    log_message(_F("Weird case, WiFi seems disconnected but is not. Resetting WiFi!"));
    setupWifi(&heishamonSettings);
#else // here for ESP32
    log_message(_F("WiFi just got disconnected, still have IP addres."));
#endif
...

Or can a reconnection be triggered with a rule?

@geduxas
Contributor

geduxas commented Jan 8, 2025

I can confirm this behavior. I noticed that after updating/rebooting my AP, the Heishamon goes into failsafe hotspot mode and does not reconnect. I just rebooted the Heishamon and it connected again.

@IgorYbema
Contributor

No, there is something else weird going on with the ESP32 wifi. I haven't figured it out yet. I will keep searching when I have some more spare time.

@oxyde42
Author

oxyde42 commented Jan 9, 2025

I have found the problem. It is a logical error in the function check_wifi(). The ESP32 also loses the IP address at the same time as the connection. This means that the result of line 173 is always false:
if ((wifistatus != WL_CONNECTED) && (WiFi.localIP()))

On the ESP32 you get the correct information WL_CONNECTION_LOST as wifistatus if the connection is lost.

I suggest the following change:

void check_wifi() {
  int wifistatus = WiFi.status();
#ifdef ESP8266
  if ((wifistatus != WL_CONNECTED) && (WiFi.localIP())) {
#else
  if (wifistatus == WL_CONNECTION_LOST) {
#endif
    // special case where it seems that we are not connect but we do have working IP (causing the -1% wifi signal), do a reset.
    log_message(_F("Weird case, WiFi seems disconnected but is not. Resetting WiFi!"));
    setupWifi(&heishamonSettings);

  } else if ((wifistatus != WL_CONNECTED) || (!WiFi.localIP())) {
...

@IgorYbema
Contributor

No, that is not it. That if statement is written specifically for the ESP8266, which can be in this state. I know that the ESP32 will never enter that branch.
The next if is the real check: else if ((wifistatus != WL_CONNECTED) || (!WiFi.localIP())).
From there it sometimes doesn't reach the part where it retries the connection.
I'll try to debug more later.
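
A minimal, self-contained sketch of the two branches discussed above, for readers who want to see which branch each platform ends up in. This is not the HeishaMon source: `Status`, `Action` and `classify` are hypothetical stand-ins, and the only inputs taken from the thread are the two conditions themselves and the observation that the ESP32 drops its IP together with the connection.

```cpp
#include <cassert>  // for assert() when trying the examples in the note below

// Hypothetical stand-ins for the WiFi status codes; not the real enum values.
enum class Status { Connected, ConnectionLost, Disconnected };

enum class Action { None, ResetWifi, Reconnect };

// Mirrors the shape of the two conditions quoted in this thread:
//   if      (status != WL_CONNECTED && hasIp)   -> "weird case" reset (ESP8266)
//   else if (status != WL_CONNECTED || !hasIp)  -> regular reconnect path
Action classify(Status status, bool hasIp) {
    const bool connected = (status == Status::Connected);
    if (!connected && hasIp) return Action::ResetWifi;
    if (!connected || !hasIp) return Action::Reconnect;
    return Action::None;
}
```

With these assumptions, `classify(Status::ConnectionLost, false)` (connection and IP both gone, as reported for the ESP32) lands in the reconnect branch, while `classify(Status::Disconnected, true)` (disconnected but still holding an IP) lands in the reset branch.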

@IgorYbema
Contributor

I hope to have it fixed in IgorYbema@2eee519 and will release in v3.9

@lucasimons

I hope to have it fixed in IgorYbema@2eee519 and will release in v3.9

I have the same issue as @oxyde42. Can we try the beta release of 3.9? In rainy conditions I need to reconnect the board every day.

@IgorYbema
Contributor

Yes, you can test it. If you log in to GitHub and go to this page, you will see an 'artifact', which is a ZIP file containing test firmware for v3.9. The ESP8266 file is for the small heishamon and the ESP32 file is for the large heishamon.

@MiG-41
Contributor

MiG-41 commented Jan 14, 2025

And "this page" is which page? Any link?

@IgorYbema
Contributor

Sorry, forgot the link: https://github.com/IgorYbema/HeishaMon/actions/runs/12772225974

@geduxas
Contributor

geduxas commented Jan 14, 2025

For me this somehow went terribly wrong. The web interface became unresponsive, and I also didn't get any values from the pump.

The previous working version for me was
https://github.com/IgorYbema/HeishaMon/actions/runs/12716712132/job/35451629829

@geduxas
Contributor

geduxas commented Jan 14, 2025

In case it helps, here is a partial log from the access point:

[image]

and the ping output:

[image]

@geduxas
Contributor

geduxas commented Jan 14, 2025

And when I block access to the AP, it no longer goes into hotspot mode.

@lucasimons

For me this went somehow terrible wrong. Web interface become unresponsive, also didn't get any values from pump

Previous worked version for me was
https://github.com/IgorYbema/HeishaMon/actions/runs/12716712132/job/35451629829

For me it works fine and it connects automatically to my AP.

@MiG-41
Contributor

MiG-41 commented Jan 14, 2025

On my old ESP8266-based board it also works fine.

@IgorYbema
Contributor

@geduxas this version tries to connect to any access point with the correct SSID on each channel and looks for the strongest one. Maybe it now connects to a faulty (but stronger) access point at your home? That is the only thing I can think of right now.
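
A minimal sketch of "pick the strongest access point with the correct SSID", assuming scan results are already available as a list. The `ScanEntry` record and `pickStrongest` function are hypothetical names for illustration only and do not reflect how the firmware stores or evaluates its scan results.

```cpp
#include <cassert>  // for assert() when trying the example below
#include <string>
#include <vector>

// Hypothetical scan result record; a real scan API would supply these fields.
struct ScanEntry {
    std::string ssid;
    int channel;
    int rssi;  // dBm, closer to 0 is stronger
};

// Returns the index of the strongest entry whose SSID matches, or -1 if none does.
int pickStrongest(const std::vector<ScanEntry>& scan, const std::string& ssid) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(scan.size()); ++i) {
        if (scan[i].ssid != ssid) continue;           // different network, skip
        if (best < 0 || scan[i].rssi > scan[best].rssi) best = i;
    }
    return best;
}
```

Note that a selection like this only compares signal strength, so a stronger but misbehaving access point with the same SSID would still win, which is the scenario raised above.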

@IgorYbema
Contributor

Or it is crashing. Check the stats mqtt log line for the uptime

@geduxas
Contributor

geduxas commented Jan 14, 2025

No, I have disabled all of them, and it seems it's crashing. Here is the output from the console:

Tue Jan 14 21:02:22 2025 (5711): received TOP89 Z2_Cool_Curve_Outside_Low_Temp: 20
Tue Jan 14 21:02:22 2025 (5712): received TOP90 Room_Heater_Operations_Hours: 6
Tue Jan 14 21:02:22 2025 (5715): received TOP91 DHW_Heater_Operations_Hours: 54


Stack smashing protect failure!


Backtrace: 0x40377eb6:0x3fceba30 0x4037e6f5:0x3fceba50 0x40377ec6:0x3fceba70 0x42010755:0x3fceba90 0x420108a9:0x3fcebae0 0x42010d8d:0x3fcebb20 0x420047ba:0x3fcebdf0 0x420049ef:0x3fcebf20 0x42005fcc:0x3fcebf60 0x4203072c:0x3fcebfc0 0x403811f2:0x3fcebfe0




ELF file SHA256: 5ea386a80b702f41

Rebooting...
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0xc (RTC_SW_CPU_RST),boot:0x2b (SPI_FAST_FLASH_BOOT)
Saved PC:0x4037c18a
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3818,len:0x508
load:0x403c9700,len:0x4
load:0x403c9704,len:0xad0
load:0x403cc700,len:0x29d8
entry 0x403c9880
Starting debugging, version: 3.9

@IgorYbema
Contributor

Damn, then this one isn't ready for release yet. Can you tell me your setup? What could be different in your config? What are your settings?

@geduxas
Contributor

geduxas commented Jan 14, 2025

No rules, no Dallas, no S0; large board using relays; MQTT with password; wifi with possibly multiple APs; full MQTT retransmit at the lowest possible interval (5 s?). Everything worked in this build: https://github.com/IgorYbema/HeishaMon/actions/runs/12716712132

@IgorYbema
Contributor

Ah OK, then it shouldn't be hard to find the cause. Can you try working upwards from the build where it did work? Only focus on the successful v3.9b builds: https://github.com/IgorYbema/HeishaMon/actions?query=branch%3A3.9b+is%3Asuccess

@IgorYbema
Contributor

IgorYbema commented Jan 14, 2025

It could be the websocket change (the latest one on the list) causing this. If you have retransmit every 5 s, this causes a lot of websocket traffic. If this is the cause, I'll need to redesign this.

-- edit: no this can't be it as it isn't causing more websocket traffic --

@IgorYbema
Contributor

Ah wait, I see where it goes wrong: at TOP92. That is the heatpump model, which has now been changed. I'll focus on finding out why this goes wrong for you.

@IgorYbema
Contributor

@geduxas please try the build from this run when it finishes. I believe I found the issue.
https://github.com/IgorYbema/HeishaMon/actions/runs/12775259241

@geduxas
Contributor

geduxas commented Jan 14, 2025

Yes, it started to crash in the build where the heatpump model was removed: https://github.com/IgorYbema/HeishaMon/actions/runs/12736286709

I'll try the latest.

@geduxas
Contributor

geduxas commented Jan 14, 2025

OK, the latest build works without crashing. I wonder why it affected only me?

@IgorYbema
Contributor

The function wrote out of bounds (buffer overflow), and that causes problems at random. I'm not sure exactly how this works, but it really depends on how the controller maps the memory. Good that we found it.
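
As a general illustration of this class of bug (not the actual fix in the firmware, which is not shown in this thread), here is a minimal sketch of copying a string into a fixed-size buffer without ever writing past its end. The helper name `copyBounded` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Copies `src` into `dst`, whose capacity `dstSize` includes the terminator.
// Returns true if the whole string fit, false if it had to be truncated.
bool copyBounded(char* dst, std::size_t dstSize, const char* src) {
    assert(dst != nullptr && src != nullptr && dstSize > 0);
    // snprintf writes at most dstSize bytes and always null-terminates;
    // its return value is the length the full string would have needed.
    const int needed = std::snprintf(dst, dstSize, "%s", src);
    return needed >= 0 && static_cast<std::size_t>(needed) < dstSize;
}
```

An unchecked copy into a stack buffer that is too small overwrites whatever happens to sit next to it, which is why the symptoms can differ from one device or configuration to another and why the stack protector may only trip on some of them.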

@geduxas
Contributor

geduxas commented Jan 14, 2025

I have run some tests on my WiFi, and it works! After disabling access to one of my APs, the heishamon switches to another; after disabling access on all APs, it switches to hotspot mode. And after bringing access to an AP back, it reconnects to the AP. So this issue can be closed, I think.

@geduxas
Contributor

geduxas commented Jan 14, 2025

The function ran out of boundary (buffer overflow) and that causes problems at random. Not sure how this exactly works but it really depends on how the controller maps the memory. Good we found it.

An interesting fact: with the crashing firmware, if I wait for a while, after about 10-20 minutes it boots and then works after all. But after I reset or reboot it, it goes back into a crash cycle.

@oxyde42
Author

oxyde42 commented Jan 22, 2025

Now I have waited until the release of the finished version 3.9. Unfortunately, the implemented change does not bring any improvement on the ESP32. Although the Heishamon now connects to the strongest SSID (my repeater), the connection loss still happens again - without reconnecting.

Out of desperation, I took the esphome_panasonic_heatpump project as a model and started my own ESPHome Heishamon. Connected to the proxy port of the big Heishamon, it maintains the wifi connection to my home assistant, despite the fact that I use a hideous “ESP32-C3 Super Mini” for it.

@geduxas
Contributor

geduxas commented Jan 22, 2025

Strange that your setup behaves differently. I had the same problem as you, but with 3.9 it now works as expected: if one of my APs goes down, the heishamon jumps to another; if all APs are down, it opens the hotspot; and after bringing an AP back, it recovers and connects to my wifi.
So what do you see? Does your heishamon stay stuck in hotspot mode?

@oxyde42
Author

oxyde42 commented Jan 22, 2025

So what you have? You're heishamon keep stuck in hotspot mode?

Yes, after pressing the boot button everything is O.K. again.

The Heishamon otherwise continued to run without errors. I could see this on my ESPHome adapter connected to the proxy port.

@geduxas
Contributor

geduxas commented Jan 22, 2025

So why does it disconnect from wifi? What wifi AP do you use? Is it possible to reproduce? Also, if you can trigger the problem manually, could you share the log from the USB interface?

@oxyde42
Author

oxyde42 commented Jan 22, 2025

No, unfortunately I cannot reproduce this. I have a FRITZ!Box 7580 with FRITZ!OS: 07.30 as an access point and also a FRITZ!Repeater 2400, which is within a decent range. I have a Wifi signal of 74% to the repeater (Heishamon display). My Fritz! repeater says: 2.4 GHz, 137 Mbit/s down, 70 Mbit/s up, Wi-Fi 4, 40 MHz, WPA3, 1 x 1.
With firmware 3.8, the Heishamon always connected to the Fritz!Box, with much worse connection values.

If the connection is lost, I can no longer access the log. It doesn't happen so often that I could record the log via USB-C with the laptop.

@geduxas
Contributor

geduxas commented Jan 22, 2025

Is it possible that your AP is somehow changing its channels, and that this leads to this behavior?

@oxyde42
Author

oxyde42 commented Jan 22, 2025

Is it possible that your AP is somehow changing its channels and this could be causing this behavior?

No, the channel is fixed. Now I have tried to provoke this loss of connection. I have kicked out the heishamon on the repeater and forbidden renewed access. It switched to the worse access point without any errors. Now it stays there. I am attaching the log.
Who knows what causes the Heishamon to switch to AP mode instead of reconnecting.
Log:
Wed Jan 22 16:52:21 2025 (21120707): Heishamon stats: Uptime: 0 days 5 hours 52 minutes 0 seconds ## Free memory: 69% ## Free PSRAM: 2073076 bytes ## Free heap: 241476 bytes ## Wifi: 68% (RSSI: -66) ## Ethernet: not installed ## Mqtt reconnects: 1 ## Correct data: 100.00% Rules active: 0
Wed Jan 22 16:52:21 2025 (21120710): Requesting new panasonic data
Wed Jan 22 16:52:21 2025 (21120711): sent bytes: 111 including checksum value: 18
Wed Jan 22 16:52:21 2025 (21121116): Received 203 bytes data
Wed Jan 22 16:52:21 2025 (21121117): Checksum and header received ok!
Wed Jan 22 16:52:21 2025 (21121118): received TOP16 Heat_Power_Consumption: 1400
Wed Jan 22 16:52:21 2025 (21121121): received TOP55 Ipm_Temp: 20
Wed Jan 22 16:52:30 2025 (21130462): PROXY Received 111 bytes
Wed Jan 22 16:52:30 2025 (21130463): PROXY Checksum and header received ok!
Wed Jan 22 16:52:30 2025 (21130463): PROXY requests basic data
Wed Jan 22 16:52:45 2025 (21145469): PROXY Received 111 bytes
Wed Jan 22 16:52:45 2025 (21145469): PROXY Checksum and header received ok!
Wed Jan 22 16:52:45 2025 (21145470): PROXY requests basic data
Wed Jan 22 16:52:51 2025 (21150707): Lost MQTT connection!
Wed Jan 22 16:52:51 2025 (21150707): Reconnecting to mqtt server ...
Wed Jan 22 16:52:51 2025 (21150736): Heishamon stats: Uptime: 0 days 5 hours 52 minutes 30 seconds ## Free memory: 69% ## Free PSRAM: 2069728 bytes ## Free heap: 240916 bytes ## Wifi: 44% (RSSI: -78) ## Ethernet: not installed ## Mqtt reconnects: 2 ## Correct data: 100.00% Rules active: 0
Wed Jan 22 16:52:51 2025 (21150739): Requesting new panasonic data
Wed Jan 22 16:52:51 2025 (21150739): sent bytes: 111 including checksum value: 18
Wed Jan 22 16:52:51 2025 (21151205): Received 203 bytes data
Wed Jan 22 16:52:51 2025 (21151206): Checksum and header received ok!
Wed Jan 22 16:52:51 2025 (21151207): received TOP8 Compressor_Freq: 45
Wed Jan 22 16:52:51 2025 (21151209): received TOP16 Heat_Power_Consumption: 1200
Wed Jan 22 16:52:51 2025 (21151212): received TOP64 High_Pressure: 23.2
Wed Jan 22 16:52:51 2025 (21151214): received TOP67 Compressor_Current: 6.0
Wed Jan 22 16:53:00 2025 (21160473): PROXY Received 111 bytes
Wed Jan 22 16:53:00 2025 (21160474): PROXY Checksum and header received ok!
Wed Jan 22 16:53:00 2025 (21160474): PROXY requests basic data
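
The two stats lines in this log (68% at RSSI -66 and 44% at RSSI -78) are consistent with a simple linear mapping from RSSI to a percentage. A minimal sketch of such a mapping follows; the function name and the clamping limits of -100 dBm and -50 dBm are assumptions, not taken from the firmware source.

```cpp
#include <cassert>  // for assert() when trying the examples in the note below

// Maps an RSSI value in dBm to a 0..100 quality figure.
// Assumed clamping: -100 dBm or weaker gives 0, -50 dBm or stronger gives 100.
int qualityFromRssi(int rssi) {
    if (rssi <= -100) return 0;
    if (rssi >= -50) return 100;
    return 2 * (rssi + 100);  // linear in between
}
```

With this mapping, `qualityFromRssi(-66)` gives 68 and `qualityFromRssi(-78)` gives 44, matching the values logged before and after the switch to the weaker access point.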

@IgorYbema
Contributor

If it can't reconnect within 30 seconds, it starts the access point, but it still keeps trying to reconnect. If it reconnects, it shuts down the access point again.
Only if a client is connected to the access point during that period does it stop trying to reconnect. This logic is there to let the client finish the setup.
Could it be that you have a computer/phone which remembers the heishamon-setup SSID and connects to it when available?
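
The fallback behavior described above can be sketched as a small state model. This is a hypothetical illustration, not the firmware code: the `WifiFallback` struct, its field names and the `tick` interface are invented here, and it assumes that reconnect attempts resume once the client leaves the hotspot again.

```cpp
#include <cassert>  // for assert() when trying the model out
#include <cstdint>

// "within 30 secs" from the description above.
constexpr std::uint32_t kApDelayMs = 30000;

// Hypothetical model of the reconnect / hotspot fallback behavior.
struct WifiFallback {
    bool apActive = false;          // setup hotspot is running
    bool retrying = true;           // still trying to rejoin the configured wifi
    bool lost = false;              // station link currently down
    std::uint32_t lostSinceMs = 0;  // when the link was first seen down

    // Call periodically with the current time, the station link state and
    // whether any client is attached to the setup hotspot.
    void tick(std::uint32_t nowMs, bool staConnected, bool apHasClient) {
        if (staConnected) {         // back online: close hotspot, reset state
            lost = false;
            apActive = false;
            retrying = true;
            return;
        }
        if (!lost) {                // first tick after losing the link
            lost = true;
            lostSinceMs = nowMs;
        }
        if (!apActive && nowMs - lostSinceMs >= kApDelayMs) apActive = true;
        // Assumption: retries pause only while a client is using the hotspot.
        retrying = !(apActive && apHasClient);
    }
};
```

In this model a phone that automatically joins the setup SSID would keep `retrying` false for as long as it stays attached, which is the situation asked about above.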

@oxyde42
Author

oxyde42 commented Jan 23, 2025

Could it be that your have a computer/phone which remembers and connects to the heishamon-setup ssid when available?

No, I can rule that out. It only seems to happen after a cold boot. When I restarted the Heishamon via reset, the wifi connection seems to be stable.

But now a question about roaming: If it now works wonderfully to select the best AP, can't it also be done cyclically to switch to the best one? I can select this in ESPHome, and the µC also consumes much less power there.

@oxyde42
Author

oxyde42 commented Jan 24, 2025

Now I have been able to narrow down the loss of connection. Unfortunately, a warm boot does not prevent it. It was apparently triggered by my repeater. I can find the following entry in the log:
24.01.25, 10:02:17, WLAN device is periodically re-registered (2.4 GHz), base station, IP 192.168.0.1, MAC 44:4E:6D:xxxxx.

The clients then seem to have lost the connection briefly, and possibly also the IP address. My ESPHome devices reported a loss of availability at 10:02:42 and availability again at 10:02:43 (or, for the other device, at 10:02:37 and 10:02:38). The reconnect worked without any problems there. The Heishamon reported via its MQTT LWT at 10:02:22 that the connection was offline. The Heishamon then went into AP mode.
Now I was able to connect the debug monitor to the Heishamon. The UART connection showed nothing unusual. After I connected to the AP and immediately disconnected from it, it went back into STA mode without errors. The detection of the connection loss therefore appears to be faulty here.
Here is the log of my short client connection.
Terminal on COM12.txt

@oxyde42
Author

oxyde42 commented Jan 25, 2025

And exactly the same thing happened again today. The repeater tries to reconnect to 2.4 GHz and the Heishamon is back in AP mode. This is actually the wrong behavior when it goes into AP mode, unless it has just been restarted by the user - regardless of whether it is a cold or warm boot. Without the user being at the device, it is better not to go into AP mode.
I have not yet fully understood the somewhat confusing logic of the check_wifi() function. Since the 8266 and the ESP32 apparently implement the wifi differently, it would certainly be more reliable to manage the loss of connection yourself. In my tests, the ESP32 also loses the IP after a connection loss. This means that it immediately goes into AP mode if “Enable WiFi hotspot when not connected” is activated. But surely this should only apply in an emergency and if it doesn't get a connection after starting.
I have now deactivated it. If the connection is now not possible, the only thing that helps is reflashing via USB.
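
The policy suggested above (hotspot only as an emergency measure when no connection is obtained after starting) could be sketched as follows. This is purely illustrative: `HotspotPolicy` and its members are hypothetical names, and nothing like this is confirmed to exist in the firmware.

```cpp
#include <cassert>  // for assert() when trying the policy out

// Hypothetical policy: the hotspot fallback is only allowed until the first
// successful wifi connection after boot.
struct HotspotPolicy {
    bool hotspotEnabled = true;       // the "Enable WiFi hotspot when not connected" setting
    bool connectedSinceBoot = false;  // set once the station link has come up

    void onConnected() { connectedSinceBoot = true; }

    // True only if the setting is on, the link is down, and the device
    // has never been connected since it started.
    bool mayStartHotspot(bool staConnected) const {
        return hotspotEnabled && !staConnected && !connectedSinceBoot;
    }
};
```

Under this policy a brief outage of a previously working connection would lead to reconnect attempts only, while a device that never gets online after a restart could still be reached through the hotspot.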

@IgorYbema
Contributor

IgorYbema commented Jan 25, 2025

Yes, correct, the reconnect isn't optimal yet. The ESP32 (large heishamon) behaves differently in some situations. There are about 8 wifi states, and sometimes it says disconnected and sometimes stopped (or something like that). I'll try to make it even better, but it is hard to reproduce all situations.

But if your wifi fails so often, maybe you need to look at that first. My wifi has an uptime of many weeks/months (unless I restart it myself).
