Revise message dispatch and primitives #177

fniephaus · 2025-01-12T13:22:00Z

This PR significantly revises the message dispatch in TruffleSqueak. Most importantly, primitives are now always tried before a direct or indirect call happens (which brings TS closer to the reference implementation). Previously, primitives were also tried before a call, but as soon as a primitive failed once, it was replaced with a call in which the primitive was tried (again) right before the bytecode loop. This had two downsides:

On first failure, a primitive was tried twice, which can be annoying, for example when running tests, where you might end up with two debuggers.
Primitives that sometimes fail always required a direct or indirect call, which is rather expensive. primitiveCopyBits, for example, typically fails a number of times at startup until forms are uncompressed. The call overhead is now avoided if a primitive succeeds.
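To illustrate the new order of operations, here is a minimal sketch and not TruffleSqueak's actual code. All names (`DispatchSketch`, `PrimitiveFailure`, `primitiveAdd`, `fallbackAdd`, `send`) are hypothetical. The primitive is tried on every send, and the comparatively expensive fallback call only happens when the primitive fails:

```java
import java.math.BigInteger;

final class DispatchSketch {
    /** Hypothetical failure marker (no stack trace, so throwing it is cheap). */
    static final class PrimitiveFailure extends RuntimeException {
        PrimitiveFailure() {
            super(null, null, false, false);
        }
    }

    /** Hypothetical primitive: integer addition that fails on overflow. */
    static long primitiveAdd(final long a, final long b) {
        try {
            return Math.addExact(a, b);
        } catch (final ArithmeticException e) {
            throw new PrimitiveFailure();
        }
    }

    /** Stands in for the real (direct or indirect) call of the method's fallback code. */
    static Object fallbackAdd(final long a, final long b) {
        return BigInteger.valueOf(a).add(BigInteger.valueOf(b));
    }

    /** Try the primitive on every send; only pay for the call when it fails. */
    static Object send(final long a, final long b) {
        try {
            return primitiveAdd(a, b);
        } catch (final PrimitiveFailure failure) {
            return fallbackAdd(a, b);
        }
    }

    public static void main(final String[] args) {
        System.out.println(send(1, 2));              // primitive succeeds, no call
        System.out.println(send(Long.MAX_VALUE, 1)); // primitive fails, fallback call
    }
}
```

In this sketch, a primitive that only fails occasionally (as described for primitiveCopyBits) keeps taking the cheap path whenever it succeeds, instead of being permanently replaced by a call after its first failure.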

Another major change is that there are now separate dispatch nodes for different numbers of arguments. This way, method calls with up to five arguments (the majority of calls) no longer allocate an Object[] just to pass arguments around. There are new interfaces such as Primitive0 and Primitive5WithFallback whose execute methods have different signatures, so primitive calls no longer need to pass their arguments via an Object[] either. Reducing the number of Object[] allocations improves interpreter performance significantly and reduces memory pressure in the interpreter, at the cost of a slightly larger code base. This can be observed in something as simple as tinyBenchmarks:

Before:

72,000,000 bytecodes/sec; 3,400,000 sends/sec; 168 GCs for 10x tinyBenchmarks

$ mx build # build without compiler
$ $GRAALVM_HOME/bin/trufflesqueak --smalltalk.disable-startup --code "1 tinyBenchmarks" --smalltalk.resource-summary=true --experimental-options --vm.Dpolyglot.engine.WarnInterpreterOnly=false images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on Interpreted...
[trufflesqueak] Image loaded in 771ms.
[trufflesqueak] Skipping startup routine...
[trufflesqueak] Evaluating '1 tinyBenchmarks'...
[trufflesqueak] Result: 72,000,000 bytecodes/sec; 3,400,000 sends/sec
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 8.486s | CPU load: 1.59
[trufflesqueak] >   0.0460s ( 0.54% of total time) in   20 GCs of G1 Young Generation
[trufflesqueak] >   0.0010s ( 0.01% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.0900s ( 1.06% of total time) in    1 GCs of G1 Old Generation
[trufflesqueak] >   0.1370s ( 1.61% of total time) in   23 GCs in total

$ $GRAALVM_HOME/bin/trufflesqueak --smalltalk.disable-startup --code "10 timesRepeat: [ 1 tinyBenchmarks ]" --smalltalk.resource-summary=true --experimental-options --vm.Dpolyglot.engine.WarnInterpreterOnly=false images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on Interpreted...
[trufflesqueak] Image loaded in 768ms.
[trufflesqueak] Skipping startup routine...
[trufflesqueak] Evaluating '10 timesRepeat: [ 1 tinyBenchmarks ]'...
[trufflesqueak] Result: 10
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 57.585s | CPU load: 1.10
[trufflesqueak] >   0.2060s ( 0.36% of total time) in  165 GCs of G1 Young Generation
[trufflesqueak] >   0.0010s ( 0.00% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.0890s ( 0.15% of total time) in    1 GCs of G1 Old Generation
[trufflesqueak] >   0.2960s ( 0.51% of total time) in  168 GCs in total

After:

87,000,000 bytecodes/sec; 4,300,000 sends/sec; 89 GCs for 10x tinyBenchmarks

$ mx build # build without compiler
$ $GRAALVM_HOME/bin/trufflesqueak --smalltalk.disable-startup --code "1 tinyBenchmarks" --smalltalk.resource-summary=true --experimental-options --vm.Dpolyglot.engine.WarnInterpreterOnly=false images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on Interpreted...
[trufflesqueak] Image loaded in 775ms.
[trufflesqueak] Skipping startup routine...
[trufflesqueak] Evaluating '1 tinyBenchmarks'...
[trufflesqueak] Result: 87,000,000 bytecodes/sec; 4,300,000 sends/sec
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 7.346s | CPU load: 1.65
[trufflesqueak] >   0.0370s ( 0.50% of total time) in   13 GCs of G1 Young Generation
[trufflesqueak] >   0.0020s ( 0.03% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.0860s ( 1.17% of total time) in    1 GCs of G1 Old Generation
[trufflesqueak] >   0.1250s ( 1.70% of total time) in   16 GCs in total

$ $GRAALVM_HOME/bin/trufflesqueak --smalltalk.disable-startup --code "10 timesRepeat: [ 1 tinyBenchmarks ]" --smalltalk.resource-summary=true --experimental-options --vm.Dpolyglot.engine.WarnInterpreterOnly=false images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on Interpreted...
[trufflesqueak] Image loaded in 708ms.
[trufflesqueak] Skipping startup routine...
[trufflesqueak] Evaluating '10 timesRepeat: [ 1 tinyBenchmarks ]'...
[trufflesqueak] Result: 10
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 56.672s | CPU load: 1.09
[trufflesqueak] >   0.1130s ( 0.20% of total time) in   86 GCs of G1 Young Generation
[trufflesqueak] >   0.0020s ( 0.00% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.0970s ( 0.17% of total time) in    1 GCs of G1 Old Generation
[trufflesqueak] >   0.2120s ( 0.37% of total time) in   89 GCs in total

That's roughly +21% bytecodes/sec and +26% sends/sec with only about 53% of the GCs (89 instead of 168 for 10x tinyBenchmarks).
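The drop in GCs is consistent with no longer allocating an Object[] per send. As a rough sketch of the idea, the following contrasts a generic execute signature with an arity-specific one. The interfaces `VarargsPrimitiveSketch` and `Primitive2Sketch` and the example primitive are hypothetical and only loosely modelled on the Primitive0 to Primitive5WithFallback interfaces mentioned above, whose exact signatures are not shown in this PR description:

```java
final class AritySketch {
    /** Generic shape: every call site has to allocate an Object[] for the arguments. */
    interface VarargsPrimitiveSketch {
        Object execute(Object receiver, Object[] arguments);
    }

    /** Arity-specific shape: the two arguments are passed directly as parameters. */
    interface Primitive2Sketch {
        Object execute(Object receiver, Object arg1, Object arg2);
    }

    /** Example primitive (receiver between min and max), written against both shapes. */
    static final VarargsPrimitiveSketch BETWEEN_VARARGS =
            (rcvr, args) -> (long) rcvr >= (long) args[0] && (long) rcvr <= (long) args[1];

    static final Primitive2Sketch BETWEEN_DIRECT =
            (rcvr, min, max) -> (long) rcvr >= (long) min && (long) rcvr <= (long) max;

    /** Old-style send: allocates a fresh Object[] on every call. */
    static Object sendVarargs(final Object rcvr, final Object a, final Object b) {
        return BETWEEN_VARARGS.execute(rcvr, new Object[]{a, b});
    }

    /** New-style send: no argument array is created. */
    static Object sendDirect(final Object rcvr, final Object a, final Object b) {
        return BETWEEN_DIRECT.execute(rcvr, a, b);
    }

    public static void main(final String[] args) {
        System.out.println(sendVarargs(5L, 1L, 10L));
        System.out.println(sendDirect(5L, 1L, 10L));
    }
}
```

Both variants compute the same result; the difference is only that the direct variant avoids one short-lived array per send. The trade-off is the one noted in the description: one interface and dispatch node per argument count means more code.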

At the same time, peak performance has not really changed, at least for tinyBenchmarks:

Before:

~10,000,000,000 bytecodes/sec; ~340,000,000 sends/sec

$ mx --dy /compiler build
$ $GRAALVM_HOME/bin/trufflesqueak --engine.Mode=default --code "10 timesRepeat: [ FileStream stdout nextPutAll: (1 tinyBenchmarks) asString; cr ]" --smalltalk.resource-summary=true --experimental-options images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on GraalVM CE...
[trufflesqueak] Image loaded in 2383ms.
[trufflesqueak] Evaluating '10 timesRepeat: [ FileStream stdout nextPutAll: (1 tinyBenchmarks) asString; cr ]'...
9,400,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
9,900,000,000 bytecodes/sec; 350,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
9,800,000,000 bytecodes/sec; 340,000,000 sends/sec
9,700,000,000 bytecodes/sec; 340,000,000 sends/sec
9,900,000,000 bytecodes/sec; 340,000,000 sends/sec
[trufflesqueak] Result: 10
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 80.873s | CPU load: 1.56
[trufflesqueak] >   0.4950s ( 0.61% of total time) in  236 GCs of G1 Young Generation
[trufflesqueak] >   0.0010s ( 0.00% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.1400s ( 0.17% of total time) in    2 GCs of G1 Old Generation
[trufflesqueak] >   0.6360s ( 0.79% of total time) in  240 GCs in total

After:

~10,000,000,000 bytecodes/sec; ~340,000,000 sends/sec

$ mx --dy /compiler build
$ $GRAALVM_HOME/bin/trufflesqueak --engine.Mode=default --code "10 timesRepeat: [ FileStream stdout nextPutAll: (1 tinyBenchmarks) asString; cr ]" --smalltalk.resource-summary=true --experimental-options images/test-64bit.image 
[trufflesqueak] Running test-64bit.image on GraalVM CE...
[trufflesqueak] Image loaded in 2588ms.
[trufflesqueak] Evaluating '10 timesRepeat: [ FileStream stdout nextPutAll: (1 tinyBenchmarks) asString; cr ]'...
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
9,900,000,000 bytecodes/sec; 330,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
9,900,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
10,000,000,000 bytecodes/sec; 340,000,000 sends/sec
[trufflesqueak] Result: 10
[trufflesqueak] # Resource Summary
[trufflesqueak] > Total process time: 80.878s | CPU load: 1.60
[trufflesqueak] >   0.4920s ( 0.61% of total time) in  266 GCs of G1 Young Generation
[trufflesqueak] >   0.0010s ( 0.00% of total time) in    2 GCs of G1 Concurrent GC
[trufflesqueak] >   0.1530s ( 0.19% of total time) in    2 GCs of G1 Old Generation
[trufflesqueak] >   0.6460s ( 0.80% of total time) in  270 GCs in total

Nonetheless, we should run the full benchmark suite before this gets merged.


fniephaus added 20 commits January 3, 2025 17:42

Revise message dispatch and primitives WIP

56011ba

Address warnings

4c87f28

Ignore compilation errors temporarily

86b417e

Reduce primitive logging

b1e75ba

Handle primFailCode correctly

1c5fb25

Fix copyright header

3eb81fa

Simplify and fix tests

badb24f

Fix indirect cases of prim 188

b1004cc

Cleanups

03c8f91

Profile class of primitive nodes in indirect sends

d6628f4

Ignore primitives 65-67 when logging

265a315

Clear receiver and push when necessary

937c68a

Introduce CacheLimits

04e7825

and increase PERFORM_SELECTOR_CACHE_LIMIT from 2 to 4

Remove useless profile

68b3b04

Allow superSends to resolve to DNU or OAM

ec902a0

Fix for testTerminateEnsureOnTopOfEnsure

0a466ea

Introduce nodes for indirect dispatch

8db278c

Use cached prim node on hot path

2010a5d

Revert "Ignore compilation errors temporarily"

740c92c

This reverts commit 86b417e.

Remove node inlining for indirect dispatches

6cb8277

fniephaus self-assigned this Jan 12, 2025

fniephaus added 4 commits January 12, 2025 19:30

Simplify SendBytecodes

d668323

Cleanups and share code

e927c14

Remove numCopied from AbstractPushClosureNode

1a20d3e

Minor cleanups

1cbebe2

fniephaus force-pushed the wip/revise-dispatch-prims branch 5 times, most recently from 7f74a51 to 44dcfec on January 14, 2025 15:24

fniephaus force-pushed the wip/revise-dispatch-prims branch 2 times, most recently from df27eab to 7ad3018 on January 14, 2025 20:46

fniephaus added 7 commits January 20, 2025 08:13

Revise special selector sends and cleanups

63dd39d

Style fixes

63ef938

Ignore unused imports on JDT

8cee0ce

Do not show statistics on reload

f10b305

Remove index from AbstractBytecodeNode

c1313eb

Use % instead of & to derive hash

db2bb85

Drop dependency on truffle-enterprise

c381838

fniephaus force-pushed the wip/revise-dispatch-prims branch from 7ad3018 to cc4067d on January 20, 2025 08:34

fniephaus added 4 commits January 20, 2025 09:37

Upgrade to 24.2 release branch and MX 7.38.1

7f92d98

Use Frame.copyTo()

bf20a35

Resolve @Bind("this") warning

a87c5e7

Resolve @Bind("$node") warning

63b3193

fniephaus force-pushed the wip/revise-dispatch-prims branch from cc4067d to 63b3193 on January 20, 2025 08:37

fniephaus closed this Jan 21, 2025
