Profile data consists of several collections of info:

* detailed: instruction-level counts, several invocation/backend counts with timestamps, data on branches, call receiver types, typechecks (checkcast, instanceof, aastore). But collecting it adds 35% overhead over just per-method counters (see the sketch below).
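
As a sketch of how such tiered collection is typically structured (hypothetical names, not any particular VM's API) - cheap per-method counters always on, detailed collection switched on only once a method is hot:

from collections import Counter

PROMOTE_THRESHOLD = 10_000  # invocations before detailed profiling kicks in

class MethodProfile:
    def __init__(self, name):
        self.name = name
        self.invocations = 0  # cheap per-method counter, always collected
        self.detailed = None  # expensive detail, populated only when hot

    def on_invoke(self):
        self.invocations += 1
        if self.invocations == PROMOTE_THRESHOLD and self.detailed is None:
            self.detailed = {
                "branches": Counter(),        # data on branches
                "receiver_types": Counter(),  # call receiver types
                "typechecks": Counter(),      # checkcast/instanceof/aastore
            }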

12B. Separately translated units may be assembled into operational systems. It shall be possible for a separately translated unit to reference exported definitions of other units. All language imposed restrictions shall be enforced across such interfaces. Separate translation shall not change the semantics of a correct program.

Whole-Program Compilation - all code must be available at compile time. This allows several optimizations:

* Enables monomorphization, which increases inlining opportunities and avoids the need to box primitives (see the sketch after this list).
* Enables aggressive dead code elimination and tree shaking which significantly reduces code size.
* Enables cross namespace/module optimizations.
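
A toy illustration of the monomorphization point, assuming a whole-program compiler with a trivial IR (all names here are invented):

def monomorphize(functions, call_sites):
    """functions: {name: generic body}; call_sites: [(name, arg_types)].
    With the whole program visible, every call site's concrete argument
    types are known, so each generic function is cloned per signature."""
    specialized = {}
    for name, arg_types in call_sites:
        specialized.setdefault((name, arg_types), (functions[name], arg_types))
    return specialized  # the generic originals are now dead code

# 'add' called at (int, int) and (float, float) yields two unboxed copies:
copies = monomorphize({"add": "<generic add>"},
                      [("add", ("int", "int")), ("add", ("float", "float"))])
print(sorted(copies))  # [('add', ('float', 'float')), ('add', ('int', 'int'))]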

In the past, requiring access to the entire source code of a program may have been impractical. Today, systems are sufficiently performant that JavaScript, Python, PHP, and Rust have ecosystems where there is no separate compilation, and arguably Java pioneered this model with JIT compilation that pays no attention to module boundaries. Similarly, Google and Facebook use monolithic repositories of source code, but have caching optimizations so that developers may use the cloud.

Future JIT compiler:

* specializing adaptive interpreter collecting runtime profiling data for user code
* generating the interpreter cases, the main `switch`, from a domain-specific language - error handling, reference counts, adding counters, stats
* generate multiple interpreters from a single source of truth
* generate lots of metadata by analyzing that single source of truth
* internal pipeline for detecting, optimizing, and executing hot code paths: find hot loops in your code, build a trace of that loop, break it up into a more efficient representation, do some light optimization - and execute it in a second interpreter (sketched below)
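
A minimal sketch of that last pipeline stage - the names and the threshold are invented, not CPython's:

HOT_THRESHOLD = 1000
jump_counters = {}  # offset of a backward jump -> times executed
traces = {}         # offset -> recorded micro-op trace

def on_jump_backward(offset, record_trace):
    """Called by the first interpreter each time a backward jump executes."""
    jump_counters[offset] = jump_counters.get(offset, 0) + 1
    if jump_counters[offset] == HOT_THRESHOLD and offset not in traces:
        # The loop is hot: record one iteration as a linear micro-op trace,
        # lightly optimize it, and hand it to the second interpreter.
        traces[offset] = record_trace(offset)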

Being able to break big, complicated bytecode instructions down into simpler atomic steps is only possible because of specialization and because bytecode instructions are defined in terms of smaller steps.


Example: Fibonacci function

def fibonacci(n):
    a, b = 0, 1
    for _ in range(n): # inner loop
        a, b = b, a + b # update a and b by adding them together
    return a
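
You can reproduce a listing like the one below with the standard `dis` module (the exact opcodes vary across CPython versions):

import dis
dis.dis(fibonacci)  # prints the function's bytecode, including the loop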

The bytecode for the loop is something like this:
FOR_ITER
STORE_FAST
LOAD_FAST_LOAD_FAST
LOAD_FAST
BINARY_OP
STORE_FAST_STORE_FAST
JUMP_BACKWARD

These are generic operations like FOR_ITER and BINARY_OP which have been around for years. But we can specialize them at runtime - FOR_ITER into FOR_ITER_RANGE, BINARY_OP into BINARY_OP_ADD_INT. Then we build micro-op traces - smaller, more atomic steps that each individual instruction is broken up into.
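
You can watch the specialization half of this in a recent CPython - `dis.dis` has accepted an `adaptive` flag since 3.11, and which specialized opcodes appear depends on the version:

import dis

for _ in range(1000):
    fibonacci(100)                 # warm the function up so it specializes
dis.dis(fibonacci, adaptive=True)  # e.g. FOR_ITER_RANGE, BINARY_OP_ADD_INT

The micro-op expansions for the specialized loop then look like this: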

FOR_ITER_RANGE - _SET_IP, _ITER_CHECK_RANGE, _IS_ITER_EXHAUSTED_RANGE, _POP_JUMP_IF_TRUE, _ITER_NEXT_RANGE
STORE_FAST - _SET_IP, STORE_FAST
LOAD_FAST_LOAD_FAST - _SET_IP, LOAD_FAST, LOAD_FAST
LOAD_FAST - _SET_IP, LOAD_FAST
BINARY_OP_ADD_INT - _SET_IP, _GUARD_BOTH_INT, _BINARY_OP_ADD_INT
STORE_FAST_STORE_FAST - _SET_IP, STORE_FAST, STORE_FAST
JUMP_BACKWARD - _SET_IP, _JUMP_TO_TOP

So a small instruction expands to two micro-ops, but the more complicated ones may have several different parts. Then we can optimize this - remove redundant frame-pointer updates (only needed for JUMP_BACKWARD), remove range checks, remove the int guard. And what's left is basically the bare minimum amount of work required to actually execute this hot inner loop. So now that it's translated and optimized, we have to do just-in-time code generation.
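
Before moving on to code generation, here is a toy version of that trace-cleanup pass, using the micro-op names above (a cartoon - the real optimizer reasons about what each guard actually protects):

def optimize_trace(trace):
    out, proven = [], set()
    for i, op in enumerate(trace):
        nxt = trace[i + 1] if i + 1 < len(trace) else None
        if op == "_SET_IP" and nxt != "_JUMP_TO_TOP":
            continue  # frame-pointer updates only matter before the jump back
        if op in ("_ITER_CHECK_RANGE", "_GUARD_BOTH_INT"):
            if op in proven:
                continue  # the trace has already established this fact
            proven.add(op)
        out.append(op)
    return out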


Code generation involves:

* burning in - encode constants, caches, arguments directly into the machine code, e.g. as immediate arguments
* moving data off of frames and into registers - eliminate intermediate reads and writes to memory
* eliminating jumps back into the interpreter

Options:

* copy-and-patch compilation
* WebAssembly baseline compiler (Liftoff)
* LLVM toolchain (LLVM -O0)
* LuaJIT
See the copy-and-patch paper for benchmarks; of course multiple tiers are better, but the tl;dr is that copy-and-patch is a nice middle tier. It is a template JIT compiler. In particular, it works by copying a static pre-compiled machine code "template" into executable memory, and then going through that machine code and patching up instructions that need to have runtime data encoded in them. This is sort of like the relocation phase of linking/loading an ELF file. And in fact we can use LLVM to build an ELF object file and generate our templates. For example:

extern int MAGICALLY_INSERT_THE_OPARG;
extern int MAGICALLY_CONTINUE_EXECUTION(_PyInterpreterFrame *frame, PyObject **stack_pointer);

int load_fast(_PyInterpreterFrame *frame, PyObject **stack_pointer)
{
    // The "address" of the extern is patched at JIT time to hold the oparg
    // itself; the cast is needed since the placeholder is declared as data.
    int oparg = (int)(uintptr_t)&MAGICALLY_INSERT_THE_OPARG;
    PyObject *value = frame->localsplus[oparg];
    Py_INCREF(value);
    *stack_pointer++ = value;
    // musttail guarantees this compiles down to a single jump to the next opcode.
    __attribute__((musttail)) return MAGICALLY_CONTINUE_EXECUTION(frame, stack_pointer);
}
So there are extern placeholders for inserting the oparg and continuing execution.
For the oparg, we use the address of the extern for our oparg. This generates more efficient code because the relocation inserts the constant directly, instead of needing to dereference the address.
And for continuing execution, we use LLVM's `musttail` so we get a single jump to the next opcode - and even better, if that jump happens to be of length zero, we can skip it entirely. So the object file that we get out of this looks like this:

.static
00: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
0a: 48 98 cltq
0c: 49 8b 44 c5 48 movq 0x48(%r13,%rax,8), %rax
11: 8b 08 movl (%rax), %ecx
13: ff c1 incl %ecx
15: 74 02 je 0x19 <load_fast+0x19>
17: 89 08 movl %ecx, (%rax)
19: 48 89 45 00 movq %rax, (%rbp)
1d: 48 83 c5 08 addq $0x8, %rbp
21: e9 00 00 00 00 jmp 0x26 <load_fast+0x26>
.reloc
02: R_X86_64_64 MAGICALLY_INSERT_THE_OPARG
22: R_X86_64_PLT32 MAGICALLY_CONTINUE_EXECUTION - 0x4

We have the machine code, the relocations, and we know the calling convention. So we can take this, parse it out, put it in static header files as data, and then implement copy-and-patch for real. There is Python code at https://github.com/brandtbucher/cpython/tree/justin/Tools/jit (c4904e44167de6d3f7a1f985697710fd8219b3b2) that handles actually extracting all the cases, compiling each one, parsing out the ELF (by dumping it with LLVM to JSON), and generating the header files. Then the final build has no LLVM dependency and is a self-contained JIT. And because clang/LLVM is portable, you can cross-compile for all platforms from Linux, or do whatever.
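
As a sketch (not the actual Tools/jit code), the runtime side of copy-and-patch for the load_fast template above boils down to one copy plus two patches at the .reloc offsets:

import struct

def copy_and_patch(template: bytes, code_base: int, oparg: int, continuation: int):
    """code_base: address the copy will run at; continuation: address of the
    next opcode's code. A real JIT writes the result into executable memory
    at code_base rather than returning it."""
    code = bytearray(template)
    # 02: R_X86_64_64 - an 8-byte absolute value. The template takes the
    # "address" of MAGICALLY_INSERT_THE_OPARG, so patching the oparg in here
    # burns the constant directly into the movabsq instruction.
    struct.pack_into("<Q", code, 0x02, oparg)
    # 22: R_X86_64_PLT32 - 0x4 - a 4-byte displacement for the final jmp,
    # measured from the end of the instruction (hence the extra 4).
    struct.pack_into("<i", code, 0x22, continuation - (code_base + 0x22 + 4))
    return code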

And you can play with the templates, like compiling super-instructions for common pairs or triples, or adding more oparg holes. It also mixes well with handwritten assembly or a more aggressive compilation strategy: you just special-case the opcode / basic block and say "use this assembly instead".

Debugging: can we re-use the ELF unwind tables or DWARF info in the JITted code? Also look at Java for how they dealt with debugging.

Optimization
============

11F. Programs may advise translators on the optimization criteria to be used in a scope. It shall be possible in programs to specify whether minimum translation costs or minimum execution costs are more important, and whether execution time or memory space is to be given preference. All such specifications shall be optional. Except for the amount of time and space required during execution, approximate values beyond the specified precision, the order in which exceptions are detected, and the occurrence of side effects within an expression, optimization shall not alter the semantics of correct programs, (e.g., the semantics of parameters will be unaffected by the choice between open and closed calls).

All software consumes resources: time, memory, compute, power, binary size, and so on. The software's performance may be defined as the rate of resource consumption. There are various measurements of different aspects of performance. E.g. at compile time, one may measure execution time, memory usage, power usage. Similarly at runtime, there are more metrics: execution time, power usage, memory usage, executable size, throughput (work/time), latency (time from request to response). Pretty much anything that can be estimated or measured is fair game. Overall, there is a set of resource constraints and the goal is to maximize performance within those constraints. This performance goal can be a flexible one like "as fast as possible" or a hard one like "cost to operate must be within our budget". Performance is influential and can be a competitive advantage. Many projects track performance metrics and wish to get the best performance possible subject to limitations on manpower.

It would be great to support optimizing the code for any objective function based on some combination of these criteria. But that's hard. So let's look at some use cases:
The simplest compiler writes out a file like:

Except in the proper object format.

Then the rest is optimizing/specializing this to run more efficiently. Particularly, there is `tree shaking <https://en.wikipedia.org/wiki/Tree_shaking>`__. Steelman 1D defines this as "unused and constant portions of programs will not add to execution costs. Execution time support packages of the language shall not be included in object code unless they are called."
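
A minimal sketch of tree shaking as reachability over the call graph (a hypothetical representation - real implementations also deal with modules, globals, and vtables):

def tree_shake(call_graph, roots):
    """call_graph: {function: iterable of callees}; return the live set."""
    live, stack = set(), list(roots)
    while stack:
        fn = stack.pop()
        if fn not in live:
            live.add(fn)
            stack.extend(call_graph.get(fn, ()))
    return live  # everything outside this set is dropped from the object code

# A support routine that is never called from main is never emitted:
print(tree_shake({"main": ["helper"], "helper": [], "unused": ["helper"]}, ["main"]))
# -> {'main', 'helper'}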

Formats
=======
Errors
######

13D. Translators shall be responsible for reporting errors that are detectable during translation and for optimizing object code. Translators shall be responsible for the integrity of object code in affected translation units when any separately translated unit is modified, and shall ensure that shared definitions have compatible representations in all translation units. Translators shall do full syntax and type checking, shall check that all language imposed restrictions are met, and should provide warnings where constructs will be dangerous or unusually expensive in execution and shall attempt to detect exceptions during translation. If the translator determines that a call on a routine will not terminate normally, the exception shall be reported as a translation error at the point of call.

The compiler gives clear and helpful error messages that make it easier for developers to identify and fix problems in their code.

Error levels
============
W2 is fine and will not cause any problems upgrading. But W1 and W3 will break the build.

We can formalize this process with two presets: a 'before' preset that warns rather than fatally erroring on any issues that the old compiler didn't bother about, and an 'after' preset that fatally errors on such issues. Of course this doesn't help with downgrading, or with cases where the fatal error cannot be turned into a warning.

Automatic fixing
----------------

Steelman 1B says "Translators shall produce explanatory diagnostic and warning messages, but shall not attempt to correct programming errors." I think the part about not correcting programming errors is stupid - Java IDEs have offered "quick fix" options for ages. Maybe there is room for making the process interactive, so that the compiler does not completely mess up the program without prompting, but that doesn't mean auto-fix shouldn't be a goal.

Error types
===========

As fast as C
============

How do you prove that Stroscot is "as fast as C"? Well, we must show that every C program has a natural Stroscot translation that performs at least as fast. Since Stroscot is an expressive language, there may be many natural translations, and all must perform as fast.

Stroscot should also have fast performance on programs that aren't natural translations of C programs, but this is secondary to the claim. Even the lambda calculus, after years of study, has no optimal implementation on modern machines. So long as the implementation is reasonable, the actual performance on non-C-like programs doesn't matter - worst-case, the feature becomes a footgun for people who are performance-sensitive.

Benchmarks
==========