[Windows] Beam crash related to DETS or persistent_term #9222

Closed
lukaszsamson opened this issue Dec 19, 2024 · 12 comments · Fixed by #9349
Labels: bug (issue is reported as a bug), team:VM (assigned to OTP team VM)

Comments

@lukaszsamson
Contributor

Describe the bug
When the ElixirLS test suite is run on Windows on OTP 27, BEAM crashes. When run from a PowerShell terminal, it corrupts the terminal:

[process exited with code 2 (0x00000002)]
You can now close this terminal with Ctrl+D, or press Enter to restart.

When run from Git Bash, it crashes and prints:

Segmentation fault

No crash dump file is produced.

To Reproduce
Unfortunately, I was not able to isolate this crash to a simple erl script. The steps below require an Elixir 1.17 installation with Hex.

  1. Check out elixir-lsp/elixir-ls@5b43ee2
  2. Run mix deps.get
  3. Change directory to apps/language_server
  4. Run the tests: mix test test/server_test.exs or mix test test/providers/workspace_symbols_test.exs

BEAM crashes on almost every run.
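
The steps above can be sketched as a small shell script, for example for Git Bash. This is only an illustration of the reported steps, not a script from the ElixirLS repository: the clone URL, the elixir-ls directory name, and the DRY_RUN, ATTEMPTS and TEST_FILE variables are assumptions introduced here. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Assumes git, Elixir 1.17 and Hex are installed.
set -eu

DRY_RUN="${DRY_RUN:-1}"                              # hypothetical: 1 = print only
ATTEMPTS="${ATTEMPTS:-5}"                            # hypothetical: number of test runs
TEST_FILE="${TEST_FILE:-test/server_test.exs}"       # or test/providers/workspace_symbols_test.exs
REPO_URL="https://github.com/elixir-lsp/elixir-ls.git"  # assumed clone URL
COMMIT="5b43ee2"                                     # commit named in the report

# Print the command in dry-run mode, otherwise execute it in the current shell.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

repro() {
  run git clone "$REPO_URL" elixir-ls
  run cd elixir-ls
  run git checkout "$COMMIT"
  run mix deps.get
  run cd apps/language_server
  i=1
  # The crash is intermittent, so run the test file several times
  # and report every non-zero exit status instead of stopping.
  while [ "$i" -le "$ATTEMPTS" ]; do
    run mix test "$TEST_FILE" || echo "attempt $i exited with status $?"
    i=$((i + 1))
  done
}

repro
```

Repeating the test run is a design choice for an intermittent crash: a single passing run does not show that the crash is absent.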

Here's an example crash from ElixirLS CI:
https://github.com/elixir-lsp/elixir-ls/actions/runs/12369438492/job/34521367766

Expected behavior
No crash

Affected versions
OTP 27 on Windows.
The bug was not present in earlier versions.
Linux and macOS are not affected.

Additional context
The bug seems to be some weird combination of DETS and/or persistent_term usage. Removing DETS makes the crash much harder to reproduce. Removing both makes the crash go away.

lukaszsamson added the bug label Dec 19, 2024
@lukaszsamson
Contributor Author

I can provide Windows crash dumps, but I'd need guidance on how to set up Windows for OTP debugging.

IngelaAndin added the team:VM label Dec 19, 2024
@garazdawi
Contributor

Which specific 27 versions have you tried? I assume 26 works?

@lukaszsamson
Contributor Author

> Which specific 27 versions have you tried?

This crash has been failing CI since I added OTP 27 to the matrix 5 months ago (https://github.com/elixir-lsp/elixir-ls/actions/runs/9817589619). The newest version I tried locally was 27.2. I guess all 27 versions are affected.

> I assume 26 works?

Yes, I run mostly the same test suite on OTP 22-27. Only 27 on Windows is affected.

@garazdawi
Contributor

I managed to reproduce it locally. The crash happens when garbage collecting literals, which points to something related to persistent_term deletion, or just literal GC in general.

I will continue to dig today, but the Christmas holidays are coming, so it will probably be a couple of weeks before I have time to find out what is going on.

@garazdawi
Contributor

So I'm trying to figure out what this is and I have a question. When it fails, this is printed:

Content-Length: 109

{"jsonrpc":"2.0","method":"window/logMessage","params":{"message":"Loaded DETS databases in 196ms","type":3}}

just before Erlang segfaults, whereas when it does not fail, no such message is printed. Do you know what might cause it to enter the path where that is printed? And if so, any ideas on how to make that always happen?

@garazdawi
Contributor

And just after I typed that message, I of course managed to get that printout on a run that did not fail, so that seems to be a red herring. Digging on...

@lukaszsamson
Contributor Author

@garazdawi
Contributor

Adding some notes for myself:

The crash happens here:

I: <'Elixir.Path':do_join/3+27>
0x000001cf389b69a8: <'Elixir.Path':join/2+15>
0x000001cf389b69b0: <'Elixir.Path':join/2+22>
0x000001cf389b69b8: win32
0x000001cf389b69c0: <'Elixir.Path':join/1+18>
0x000001cf389b69c8: [<<"calls.dets">>]
0x000001cf389b69d0: <'Elixir.ElixirLS.LanguageServer.Tracer':init_table/2+18>
0x000001cf389b69d8: []
0x000001cf389b69e0: []
0x000001cf389b69e8: []
0x000001cf389b69f0: []
0x000001cf389b69f8: []
0x000001cf389b6a00: 'Elixir.ElixirLS.LanguageServer.Tracer:calls'

when erts_bs_start_match_3 is called with a bitstring whose underlying refc binary is a literal that has been GC'ed away. More digging to resume on Monday...

garazdawi linked a pull request Jan 27, 2025 that will close this issue
@garazdawi
Contributor

After much digging, I finally found the issue. The solution is in #9349. The combination of a GC bug and it being Windows-only made this so much harder to find than it needed to be...

Thanks for the report!

garazdawi added this to the OTP-27.3 milestone Jan 29, 2025
@garazdawi
Contributor

The fix will be part of Erlang/OTP 27.3.

@lukaszsamson
Contributor Author

I see the fix is in generic code that had not changed for ages. Was it a regression introduced in 27? Does the bug affect only Windows?

@garazdawi
Contributor

It was introduced in 27 by a major refactoring of how binaries are represented inside the VM. The refactoring missed updating this part of the code, which is only used on Windows.
