Build stalls on massive autogenerated C files #23557

Open
ImBoop opened this issue Jan 31, 2025 · 7 comments

Comments

ImBoop commented Jan 31, 2025

I've been experimenting with building various things with emscripten to relatively decent success, until recently.

The library in question is libpostal. After making a few changes (namely having it use the clang WebAssembly SSE intrinsics, increasing the max wasm memory, etc.), it hangs on building the enormous scanner.c. This file is autogenerated by re2c, but despite being enormous, non-emscripten builds take on the order of a few dozen seconds. Emscripten's build, however, steadily climbs in memory usage until it's killed by the OOM killer. I've thrown 200GB of memory at it and run it overnight, and it still never finished, eventually dying to an OOM again.

I get that the complexity of the file is quite high, but this still seems rather unusual, and I'm having trouble tracking down potential causes. I suspect it may be a bug in the toolchain.

Collaborator

sbc100 commented Jan 31, 2025

What part of the process is failing? Is it compiling or linking?

If it is linking (seems likely), what part of the link is failing? (You can add -v to your link command to get more information about the sub-processes, or build with EMCC_DEBUG=1 to get even more info.)

sbc100 transferred this issue from emscripten-core/emsdk Jan 31, 2025
Author

ImBoop commented Jan 31, 2025

It's the compiling process (emcc is what is stalling) - everything works up to that enormous .c file. It's relied on by several pieces of libpostal, so it doesn't even get a chance to run the linker.

Here's the full command that's being run (as part of the makefile), if it helps:

/bin/bash ../libtool  --tag=CC   --mode=compile /home/.../LP/emsdk/upstream/emscripten/emcc -DHAVE_CONFIG_H -I.. -I/usr/local/include    -Wall -Wextra -Wno-unused-function -Wformat -Werror=format-security -Winit-self -Wno-sign-compare -DLIBPOSTAL_DATA_DIR='"/usr/local/share/libpostal"' -g -msimd128 -sALLOW_MEMORY_GROWTH -sMAXIMUM_MEMORY=4gb -DUSE_SIMD -g -O2 -O0 -D LIBPOSTAL_EXPORTS   -MT libscanner_la-scanner.lo -MD -MP -MF .deps/libscanner_la-scanner.Tpo -c -o libscanner_la-scanner.lo `test -f 'scanner.c' || echo './'`scanner.c

Collaborator

sbc100 commented Jan 31, 2025

In that case this is likely a clang issue since emcc will simply exec clang when compiling, and nothing more.

Can you confirm that it is the clang process that is stalling and eating all your memory? Can you add -v to your cflags to see the exact clang command that is run?

Does compiling the same C file for a different --target (or with no target at all, i.e. for your host system) not have the same issue?

Collaborator

sbc100 commented Jan 31, 2025

Since the file in question has a huge amount of goto statements I imagine the LLVM pass that is taking all the resources is https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyFixIrreducibleControlFlow.cpp.
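
For readers unfamiliar with the term, here is a minimal sketch of what irreducible control flow looks like in C. It is a hypothetical two-state matcher, not code taken from scanner.c, and the function name is made up for illustration. The cycle between the two labels can be entered at either one, so it has no single loop header. WebAssembly only offers structured control flow, so the backend has to rewrite cycles like this before it can emit code, and a generated scanner may contain a very large number of them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not taken from scanner.c.
 * Walks the string while alternating between two states and counts the
 * characters that match the current state ('a' in state_a, 'b' in state_b).
 * The cycle state_a -> state_b -> state_a is entered by fallthrough at
 * state_a and by a goto at state_b, so neither label dominates the other:
 * that is irreducible control flow. */
int count_alternating(const char *s)
{
    assert(s != NULL);
    int n = 0;

    if (*s == 'b')
        goto state_b;   /* second entry point into the cycle */

state_a:
    if (*s == '\0') return n;
    if (*s == 'a') n++;
    s++;
    goto state_b;

state_b:
    if (*s == '\0') return n;
    if (*s == 'b') n++;
    s++;
    goto state_a;
}
```

Calling `count_alternating("abab")` walks both states in turn; starting the input with `b` takes the second entry into the cycle.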

Collaborator

sbc100 commented Jan 31, 2025

Could you perhaps attach the pre-processed scanner.c (so that we can build it standalone with all the headers, etc).

Author

ImBoop commented Jan 31, 2025

> Since the file in question has a huge amount of goto statements I imagine the LLVM pass that is taking all the resources is https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyFixIrreducibleControlFlow.cpp.

re2c does have some options to reduce the number of gotos, I believe; it may be worth experimenting with regenerating the file w/ some of those options to see how it behaves. The parser.c I have is the same as the one in libpostal above. I'll see about getting my changes put into a repo here in a bit (they're relatively minimal at the moment, isolated to the autoconf scripts).

I can also see about pulling it out to try to isolate it for testing.
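
As a rough sketch of why fewer gotos might help: a state machine written as a loop around a switch has a single loop header, so its control flow is reducible and needs no restructuring for WebAssembly. The example below is a hypothetical two-state matcher, not re2c output, and all names in it are made up. Whether a particular re2c option produces code of this shape for scanner.c is an assumption to check against the re2c documentation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not re2c output. */
enum state { STATE_A, STATE_B };

/* Alternates between two states and counts the characters that match the
 * current state ('a' in STATE_A, 'b' in STATE_B). The state is kept in a
 * variable and dispatched through a switch, so the only cycle is the for
 * loop itself, which has exactly one entry. */
int count_alternating_switch(const char *s)
{
    assert(s != NULL);
    enum state st = (*s == 'b') ? STATE_B : STATE_A;
    int n = 0;

    for (; *s != '\0'; s++) {
        switch (st) {
        case STATE_A:
            if (*s == 'a') n++;
            st = STATE_B;
            break;
        case STATE_B:
            if (*s == 'b') n++;
            st = STATE_A;
            break;
        }
    }
    return n;
}
```

The trade-off is an extra dispatch on the state variable for every character, in exchange for control flow the backend can emit directly.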

> Can you add -v to your cflags to see the exact clang command that is run?

make[2]: Entering directory '/home/.../LP/libpostal-wasm/src'
 cd .. && /bin/bash /home/.../LP/libpostal-wasm/missing automake-1.16 --foreign src/Makefile
 cd .. && /bin/bash ./config.status src/Makefile depfiles
config.status: creating src/Makefile
config.status: executing depfiles commands
/bin/bash ../libtool  --tag=CC   --mode=compile /home/.../LP/emsdk/upstream/emscripten/emcc -DHAVE_CONFIG_H -I.. -I/usr/local/include    -v -Wall -Wextra -Wno-unused-function -Wformat -Werror=format-security -Winit-self -Wno-sign-compare -DLIBPOSTAL_DATA_DIR='"/usr/local/share/libpostal"' -g -msimd128 -sALLOW_MEMORY_GROWTH -sMAXIMUM_MEMORY=4gb -DUSE_SIMD -g -O2 -O0 -D LIBPOSTAL_EXPORTS   -MT libscanner_la-scanner.lo -MD -MP -MF .deps/libscanner_la-scanner.Tpo -c -o libscanner_la-scanner.lo `test -f 'scanner.c' || echo './'`scanner.c
libtool: compile:  /home/.../LP/emsdk/upstream/emscripten/emcc -DHAVE_CONFIG_H -I.. -I/usr/local/include -v -Wall -Wextra -Wno-unused-function -Wformat -Werror=format-security -Winit-self -Wno-sign-compare -DLIBPOSTAL_DATA_DIR=\"/usr/local/share/libpostal\" -g -msimd128 -sALLOW_MEMORY_GROWTH -sMAXIMUM_MEMORY=4gb -DUSE_SIMD -g -O2 -O0 -D LIBPOSTAL_EXPORTS -MT libscanner_la-scanner.lo -MD -MP -MF .deps/libscanner_la-scanner.Tpo -c scanner.c  -fPIC -DPIC -o .libs/libscanner_la-scanner.o
emcc: warning: linker setting ignored during compilation: 'ALLOW_MEMORY_GROWTH' [-Wunused-command-line-argument]
emcc: warning: linker setting ignored during compilation: 'MAXIMUM_MEMORY' [-Wunused-command-line-argument]
 /home/.../LP/emsdk/upstream/bin/clang -target wasm32-unknown-emscripten -fignore-exceptions -fvisibility=default -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr --sysroot=/home/.../LP/emsdk/upstream/emscripten/cache/sysroot -DEMSCRIPTEN -Xclang -iwithsysroot/include/fakesdl -Xclang -iwithsysroot/include/compat -DHAVE_CONFIG_H -I.. -I/usr/local/include -v -Wall -Wextra -Wno-unused-function -Wformat -Werror=format-security -Winit-self -Wno-sign-compare -DLIBPOSTAL_DATA_DIR="/usr/local/share/libpostal" -g3 -msimd128 -DUSE_SIMD -g3 -O2 -O0 -D LIBPOSTAL_EXPORTS -MT libscanner_la-scanner.lo -MD -MP -MF .deps/libscanner_la-scanner.Tpo -c scanner.c -fPIC -DPIC -o.libs/libscanner_la-scanner.o
clang version 21.0.0git (https:/github.com/llvm/llvm-project 9534d27e3321a3b9e6e79fe6328445575bf26b7b)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /home/.../LP/emsdk/upstream/bin
 (in-process)
 "/home/.../LP/emsdk/upstream/bin/clang-21" -cc1 -triple wasm32-unknown-emscripten -emit-obj -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name scanner.c -mrelocation-model pic -pic-level 2 -mframe-pointer=none -ffp-contract=on -fno-rounding-math -mconstructor-aliases -target-feature +mutable-globals -target-cpu generic -target-feature +simd128 -debug-info-kind=constructor -dwarf-version=4 -debugger-tuning=gdb -fdebug-compilation-dir=/home/.../LP/libpostal-wasm/src -v -fcoverage-compilation-dir=/home/.../LP/libpostal-wasm/src -resource-dir /home/.../LP/emsdk/upstream/lib/clang/21 -dependency-file .deps/libscanner_la-scanner.Tpo -MT libscanner_la-scanner.lo -sys-header-deps -MP -D EMSCRIPTEN -D HAVE_CONFIG_H -I .. -I /usr/local/include -D "LIBPOSTAL_DATA_DIR=\"/usr/local/share/libpostal\"" -D USE_SIMD -D LIBPOSTAL_EXPORTS -D PIC -isysroot /home/.../LP/emsdk/upstream/emscripten/cache/sysroot -internal-isystem /home/.../LP/emsdk/upstream/lib/clang/21/include -internal-isystem /home/.../LP/emsdk/upstream/emscripten/cache/sysroot/include/wasm32-emscripten -internal-isystem /home/.../LP/emsdk/upstream/emscripten/cache/sysroot/include -O0 -Wall -Wextra -Wno-unused-function -Wformat -Werror=format-security -Winit-self -Wno-sign-compare -ferror-limit 19 -fvisibility=default -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fignore-exceptions -fcolor-diagnostics -iwithsysroot/include/fakesdl -iwithsysroot/include/compat -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr -o .libs/libscanner_la-scanner.o -x c scanner.c
clang -cc1 version 21.0.0git based upon LLVM 21.0.0git default target x86_64-unknown-linux-gnu
ignoring nonexistent directory "/home/.../LP/emsdk/upstream/emscripten/cache/sysroot/include/wasm32-emscripten"
#include "..." search starts here:
#include <...> search starts here:
 ..
 /usr/local/include
 /home/.../LP/emsdk/upstream/emscripten/cache/sysroot/include/fakesdl
 /home/.../LP/emsdk/upstream/emscripten/cache/sysroot/include/compat
 /home/.../LP/emsdk/upstream/lib/clang/21/include
 /home/.../LP/emsdk/upstream/emscripten/cache/sysroot/include

Author

ImBoop commented Jan 31, 2025

> Could you perhaps attach the pre-processed scanner.c (so that we can build it standalone with all the headers, etc).

To reproduce, pull my libpostal fork and run the following:

emconfigure ./bootstrap.sh
emconfigure ./configure --datadir=/tmp/libpostal-data
emmake make -j8

I'm still going to work on splitting out the parser, but in the meantime you can use this to get far enough into the build to reach the point where it gets unhappy.
