Revise message dispatch and primitives #177

Closed
wants to merge 35 commits into from

fniephaus
Member

This PR significantly revises the message dispatch in TruffleSqueak. Most importantly, primitives are now always tried before a direct or indirect call happens, which brings TS closer to the reference implementation. Previously, primitives were also tried before a call, but once a primitive had failed, it was replaced with a call in which the primitive was tried (again) right before the bytecode loop. This had two downsides:

  1. On first failure, a primitive was tried twice, which can be annoying, for example when running tests, where you might end up with two debuggers.
  2. Primitives that only fail sometimes always required a direct or indirect call, which is rather expensive. primitiveCopyBits, for example, typically fails a number of times at startup until forms are uncompressed. The call overhead is now avoided whenever a primitive succeeds.
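
The revised order (try the primitive first, and only pay for a call when it fails) can be sketched roughly as follows. This is a simplified illustration, not the actual TruffleSqueak code: the names PrimitiveFirstDispatch, PrimitiveFailed, CompiledMethod, send, ADD and SLOW_ADD are all hypothetical, and the real implementation uses Truffle nodes rather than plain interfaces.

```java
// Hypothetical sketch of "primitive first, call only on failure".
// None of these names are taken from TruffleSqueak.
final class PrimitiveFirstDispatch {
    /** Signals that a primitive could not handle its operands. */
    static final class PrimitiveFailed extends RuntimeException {
        PrimitiveFailed() {
            super(null, null, false, false); // control flow only, no stack trace
        }
    }

    interface Primitive {
        Object execute(Object receiver, Object argument);
    }

    interface CompiledMethod {
        Object call(Object receiver, Object argument);
    }

    /** Counts how often the (expensive) fallback call was needed. */
    static int fallbackCalls = 0;

    static Object send(Primitive primitive, CompiledMethod method, Object receiver, Object argument) {
        if (primitive != null) {
            try {
                // Tried on every send; on success no call is made at all.
                return primitive.execute(receiver, argument);
            } catch (PrimitiveFailed e) {
                // Fall through to the regular call below.
            }
        }
        fallbackCalls++;
        return method.call(receiver, argument);
    }

    /** Integer addition that fails for any other operand types. */
    static final Primitive ADD = (receiver, argument) -> {
        if (receiver instanceof Integer && argument instanceof Integer) {
            return (Integer) receiver + (Integer) argument;
        }
        throw new PrimitiveFailed();
    };

    /** Stand-in for the method body that runs when the primitive fails. */
    static final CompiledMethod SLOW_ADD = (receiver, argument) -> String.valueOf(receiver) + argument;

    public static void main(String[] args) {
        System.out.println(send(ADD, SLOW_ADD, 3, 4));   // primitive succeeds: 7
        System.out.println(send(ADD, SLOW_ADD, "3", 4)); // primitive fails, fallback: 34
        System.out.println(fallbackCalls);               // 1
    }
}
```

In this sketch a primitive that fails occasionally only costs a fallback call on the sends where it actually fails, which mirrors the primitiveCopyBits case described in item 2.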

Another major change is that there are now separate dispatch nodes for different numbers of arguments. This way, method calls with up to five arguments (the majority of calls) no longer allocate Object[] arrays just to pass arguments around. There are new interfaces such as Primitive0 and Primitive5WithFallback whose execute methods have different signatures, so primitive calls no longer need to pass their arguments via an Object[] either. Allocating fewer Object[] arrays improves interpreter performance significantly and reduces memory pressure in the interpreter, at the cost of slightly more code. This can be observed in something as simple as tinyBenchmarks:

Before:

72,000,000 bytecodes/sec; 3,400,000 sends/sec; 168 GCs for 10x tinyBenchmarks
$ mx build # build without compiler
$ $GRAALVM_HOME/bin/trufflesqueak --smalltalk.disable-startup --code "1 tinyBenchmarks" --smalltalk.resource-summary=true --experimental-options --vm.Dpolyglot.engine.WarnInterpreterOnly=false images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on Interpreted...
[trufflesqueak] Image loaded in 771ms.
[trufflesqueak] Skipping startup routine...
[trufflesqueak] Evaluating '1 tinyBenchmarks'...
[trufflesqueak] Result: 72,000,000 bytecodes/sec; 3,400,000 sends/sec
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 8.486s | CPU load: 1.59
[trufflesqueak] >   0.0460s ( 0.54% of total time) in   20 GCs of G1 Young Generation
[trufflesqueak] >   0.0010s ( 0.01% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.0900s ( 1.06% of total time) in    1 GCs of G1 Old Generation
[trufflesqueak] >   0.1370s ( 1.61% of total time) in   23 GCs in total

$ $GRAALVM_HOME/bin/trufflesqueak --smalltalk.disable-startup --code "10 timesRepeat: [ 1 tinyBenchmarks ]" --smalltalk.resource-summary=true --experimental-options --vm.Dpolyglot.engine.WarnInterpreterOnly=false images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on Interpreted...
[trufflesqueak] Image loaded in 768ms.
[trufflesqueak] Skipping startup routine...
[trufflesqueak] Evaluating '10 timesRepeat: [ 1 tinyBenchmarks ]'...
[trufflesqueak] Result: 10
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 57.585s | CPU load: 1.10
[trufflesqueak] >   0.2060s ( 0.36% of total time) in  165 GCs of G1 Young Generation
[trufflesqueak] >   0.0010s ( 0.00% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.0890s ( 0.15% of total time) in    1 GCs of G1 Old Generation
[trufflesqueak] >   0.2960s ( 0.51% of total time) in  168 GCs in total

After:

87,000,000 bytecodes/sec; 4,300,000 sends/sec; 89 GCs for 10x tinyBenchmarks
$ mx build # build without compiler
$ $GRAALVM_HOME/bin/trufflesqueak --smalltalk.disable-startup --code "1 tinyBenchmarks" --smalltalk.resource-summary=true --experimental-options --vm.Dpolyglot.engine.WarnInterpreterOnly=false images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on Interpreted...
[trufflesqueak] Image loaded in 775ms.
[trufflesqueak] Skipping startup routine...
[trufflesqueak] Evaluating '1 tinyBenchmarks'...
[trufflesqueak] Result: 87,000,000 bytecodes/sec; 4,300,000 sends/sec
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 7.346s | CPU load: 1.65
[trufflesqueak] >   0.0370s ( 0.50% of total time) in   13 GCs of G1 Young Generation
[trufflesqueak] >   0.0020s ( 0.03% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.0860s ( 1.17% of total time) in    1 GCs of G1 Old Generation
[trufflesqueak] >   0.1250s ( 1.70% of total time) in   16 GCs in total

$ $GRAALVM_HOME/bin/trufflesqueak --smalltalk.disable-startup --code "10 timesRepeat: [ 1 tinyBenchmarks ]" --smalltalk.resource-summary=true --experimental-options --vm.Dpolyglot.engine.WarnInterpreterOnly=false images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on Interpreted...
[trufflesqueak] Image loaded in 708ms.
[trufflesqueak] Skipping startup routine...
[trufflesqueak] Evaluating '10 timesRepeat: [ 1 tinyBenchmarks ]'...
[trufflesqueak] Result: 10
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 56.672s | CPU load: 1.09
[trufflesqueak] >   0.1130s ( 0.20% of total time) in   86 GCs of G1 Young Generation
[trufflesqueak] >   0.0020s ( 0.00% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.0970s ( 0.17% of total time) in    1 GCs of G1 Old Generation
[trufflesqueak] >   0.2120s ( 0.37% of total time) in   89 GCs in total

That's roughly +21% bytecodes/sec and +26% sends/sec, with only about 53% of the GCs (89 instead of 168 for 10x tinyBenchmarks).
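
The lower GC count fits the removal of per-call Object[] allocations. The sketch below contrasts a generic, array-based primitive interface with an arity-specific one. Only the name Primitive0 appears in this PR; Primitive2, PrimitiveN, the execute signatures, and the helper names are assumptions made for illustration, and the real interfaces may look different.

```java
// Hypothetical sketch of arity-specific primitive interfaces.
// The signatures are assumptions, not the actual TruffleSqueak API.
final class ArityDispatchSketch {
    interface Primitive0 {
        Object execute(Object receiver);
    }

    interface Primitive2 {
        Object execute(Object receiver, Object arg1, Object arg2);
    }

    /** Generic shape: every call needs an argument array. */
    interface PrimitiveN {
        Object execute(Object receiver, Object[] args);
    }

    /** Counts argument arrays created by the generic path. */
    static int arraysAllocated = 0;

    static Object dispatchGeneric(PrimitiveN primitive, Object receiver, Object arg1, Object arg2) {
        Object[] args = {arg1, arg2}; // allocated on every send
        arraysAllocated++;
        return primitive.execute(receiver, args);
    }

    static Object dispatch2(Primitive2 primitive, Object receiver, Object arg1, Object arg2) {
        return primitive.execute(receiver, arg1, arg2); // arguments passed directly
    }

    /** A between:and:-style check in both shapes. */
    static final Primitive2 BETWEEN = (receiver, min, max) ->
            (Integer) receiver >= (Integer) min && (Integer) receiver <= (Integer) max;

    static final PrimitiveN BETWEEN_N = (receiver, args) ->
            (Integer) receiver >= (Integer) args[0] && (Integer) receiver <= (Integer) args[1];

    public static void main(String[] args) {
        System.out.println(dispatch2(BETWEEN, 5, 1, 10));         // true
        System.out.println(dispatchGeneric(BETWEEN_N, 5, 1, 10)); // true
        System.out.println(arraysAllocated);                      // 1
    }
}
```

Both paths compute the same result, but only the generic one creates an array per send. The trade-off is one interface and one dispatch path per arity, which matches the slightly larger code size mentioned in the description.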

At the same time, peak performance has not really changed, at least for tinyBenchmarks:

Before:

~10,000,000,000 bytecodes/sec; ~340,000,000 sends/sec
$ mx --dy /compiler build
$ $GRAALVM_HOME/bin/trufflesqueak --engine.Mode=default --code "10 timesRepeat: [ FileStream stdout nextPutAll: (1 tinyBenchmarks) asString; cr ]" --smalltalk.resource-summary=true --experimental-options images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on GraalVM CE...
[trufflesqueak] Image loaded in 2383ms.
[trufflesqueak] Evaluating '10 timesRepeat: [ FileStream stdout nextPutAll: (1 tinyBenchmarks) asString; cr ]'...
9,400,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
9,900,000,000 bytecodes/sec; 350,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
9,800,000,000 bytecodes/sec; 340,000,000 sends/sec
9,700,000,000 bytecodes/sec; 340,000,000 sends/sec
9,900,000,000 bytecodes/sec; 340,000,000 sends/sec
[trufflesqueak] Result: 10
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 80.873s | CPU load: 1.56
[trufflesqueak] >   0.4950s ( 0.61% of total time) in  236 GCs of G1 Young Generation
[trufflesqueak] >   0.0010s ( 0.00% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.1400s ( 0.17% of total time) in    2 GCs of G1 Old Generation
[trufflesqueak] >   0.6360s ( 0.79% of total time) in  240 GCs in total

After:

~10,000,000,000 bytecodes/sec; ~340,000,000 sends/sec
$ mx --dy /compiler build
$ $GRAALVM_HOME/bin/trufflesqueak --engine.Mode=default --code "10 timesRepeat: [ FileStream stdout nextPutAll: (1 tinyBenchmarks) asString; cr ]" --smalltalk.resource-summary=true --experimental-options images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on GraalVM CE...
[trufflesqueak] Image loaded in 2588ms.
[trufflesqueak] Evaluating '10 timesRepeat: [ FileStream stdout nextPutAll: (1 tinyBenchmarks) asString; cr ]'...
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
9,900,000,000 bytecodes/sec; 330,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
9,900,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
[trufflesqueak] Result: 10
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 80.878s | CPU load: 1.60
[trufflesqueak] >   0.4920s ( 0.61% of total time) in  266 GCs of G1 Young Generation
[trufflesqueak] >   0.0010s ( 0.00% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.1530s ( 0.19% of total time) in    2 GCs of G1 Old Generation
[trufflesqueak] >   0.6460s ( 0.80% of total time) in  270 GCs in total

Nonetheless, we should run the full benchmark suite before this gets merged.

@fniephaus fniephaus self-assigned this Jan 12, 2025
@fniephaus fniephaus force-pushed the wip/revise-dispatch-prims branch 5 times, most recently from 7f74a51 to 44dcfec Compare January 14, 2025 15:24
@fniephaus fniephaus force-pushed the wip/revise-dispatch-prims branch 2 times, most recently from df27eab to 7ad3018 Compare January 14, 2025 20:46
@fniephaus fniephaus force-pushed the wip/revise-dispatch-prims branch from 7ad3018 to cc4067d Compare January 20, 2025 08:34
@fniephaus fniephaus force-pushed the wip/revise-dispatch-prims branch from cc4067d to 63b3193 Compare January 20, 2025 08:37
@fniephaus fniephaus closed this Jan 21, 2025