Data-oriented pass on the encode hot path. Profiling showed that bounds
checks were already elided by -o:speed; the cost was the per-instruction
loop/scan machinery and immediates falling off the hint path.
- Gather the immediate slot in the single resolve pass and emit it straight-
line (no scan over enc.enc); likewise drive the legacy REX prefix from the
precomputed reg/mr/opr slots instead of a per-form scan.
- Fold the separate needs_66 (GPR16) and SPL/BPL/SIL/DIL operand loops into
the resolve pass, so user operands are visited exactly once. This was the
big one: mov r,r 27 -> 21 ns.
- Gate the whole legacy-prefix block on a single flags!=0 test (a legacy
prefix is almost always absent) instead of four branches per instruction.
- Make immediate forms hintable. A typed immediate builder names its width
(inst_add_r32_imm32) and the matcher already keys off the operand's
declared size, so baking the form is byte-identical and drops immediates
from the full match scan: mov r32,imm32 55.7 -> 17.8 ns (3.1x).
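The resolve-pass folding and the flags!=0 gate described above can be sketched roughly as follows. This is a minimal illustration in C, not the project's Odin code: the flag names, the Operand layout, and both function names are hypothetical. Only the 0x66 operand-size override and the bare 0x40 REX byte needed for SPL/BPL/SIL/DIL are standard x86 facts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; the real encoder's layout is not shown in this log. */
enum { F_NEEDS_66 = 1u << 0, F_NEEDS_REX = 1u << 1 };

typedef struct {
    uint8_t size_bits;    /* declared operand size: 8, 16, 32 or 64 */
    uint8_t is_new_low8;  /* 1 for SPL/BPL/SIL/DIL, which need a REX byte */
} Operand;

/* One pass over the user operands collects every prefix requirement,
 * replacing separate needs_66 and low-byte-register loops. */
static unsigned resolve_flags(const Operand *ops, size_t n) {
    unsigned flags = 0;
    for (size_t i = 0; i < n; i++) {
        if (ops[i].size_bits == 16) flags |= F_NEEDS_66;
        if (ops[i].is_new_low8)     flags |= F_NEEDS_REX;
    }
    return flags;
}

/* Writes any required prefix bytes to out and returns how many were written.
 * The outer test is the only branch taken when no prefix is needed. */
static size_t emit_prefixes(unsigned flags, uint8_t *out) {
    size_t n = 0;
    if (flags != 0) {
        if (flags & F_NEEDS_66)  out[n++] = 0x66; /* operand-size override */
        if (flags & F_NEEDS_REX) out[n++] = 0x40; /* bare REX, no W/R/X/B */
    }
    assert(n <= 2);
    return n;
}
```

In the common case (flags == 0) the instruction pays for one predictable test instead of one branch per possible prefix.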
Floor (no-op) 14.55 -> 10.3 ns; realistic immediate-heavy typed mix
30 -> 20.5 ns/inst (~49 M inst/s). gen/builders/check/test/idempotent green;
2282 cases (typed==generic byte-identical, incl. the new immediate cases).

Targeted branchless revert + the pre-matched form fast path, and a fix
for a pre-existing bug the latter surfaced.
(a) Revert the two speculative-write spots from the prior branchless pass
(legacy-prefix emission and the widened displacement store, along with
ENCODE_TAIL_SLACK)
back to predicted branches. In real streams a legacy prefix is almost
always absent and disp size is stable, so those branches are ~free and
the unconditional stores only added work. Every class got faster
(RET 19->17.5, MOV r,r 52->46.6, VADDPS 42.8->39.3 ns).
(b) Pre-matched form hint. Instruction.enc_hint (in the existing 11-byte
padding, idx+1 biased; 0 = matcher path) lets a typed builder that maps
to a single value-independent form bake the global form index, so
encode() skips the O(forms) match scan -- and, in a varied stream, its
unpredictable branches. Generated for non-immediate forms only (value-
dependent imm8/imm32 selection stays on the matcher). On a 100k mixed
typed-builder stream: 47.3 -> 30.2 ns/inst (-36%), byte-identical to the
matcher path -- ~2x the original baseline for codegen.
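The idx+1 bias on enc_hint can be illustrated with a small sketch. It is written in C rather than the project's Odin, and everything except the enc_hint name and the "0 = matcher path" convention is an assumption: the Form table, the field width, and the bake_hint/match_form/select_form helpers are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>

#define FORM_COUNT 4u   /* hypothetical table size */

typedef struct {
    uint8_t opcode;         /* stand-in for a full encoding form */
} Form;

static const Form FORMS[FORM_COUNT] = { {0x00}, {0x89}, {0x8B}, {0xC3} };

typedef struct {
    uint8_t  wanted_opcode; /* stand-in for mnemonic + operand kinds */
    uint16_t enc_hint;      /* form index + 1; 0 means "run the matcher" */
} Instruction;

/* Typed builders store idx + 1 so a zero-initialized hint stays "unset". */
static uint16_t bake_hint(uint32_t form_idx) {
    assert(form_idx < FORM_COUNT);
    return (uint16_t)(form_idx + 1u);
}

/* Slow path: linear O(forms) scan. Returns -1 when nothing matches. */
static int match_form(const Instruction *in) {
    for (uint32_t i = 0; i < FORM_COUNT; i++)
        if (FORMS[i].opcode == in->wanted_opcode) return (int)i;
    return -1;
}

/* Fast path first: a non-zero hint skips the scan entirely. */
static int select_form(const Instruction *in) {
    if (in->enc_hint != 0) return (int)in->enc_hint - 1;
    return match_form(in);
}
```

Because 0 is reserved for "no hint", instructions built through the generic constructors need no extra initialization and fall through to the matcher unchanged.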
(c) Repair the typed inst_/emit_ builders. They were non-functional: the
generator cast the hw-only typed enum straight to Register
(Register(GPR64.RAX) -> class 0), so every typed-builder operand was
rejected by the matcher (encode returned empty). Untested because the
suite builds via the generic constructors. Now they build through the
class-correct op_gpr64/op_xmm/... path (op_* already used by 3+ operand
builders), emit_ reuses inst_, and a new 30-case consistency suite
asserts typed == generic (llvm-verified) and hint == matcher.
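The cast bug is easy to reproduce in miniature. The sketch below is in C, not Odin, and assumes a hypothetical Register layout (class in the high byte, hardware number in the low byte). The real layout is not given here, and make_reg, reg_from_cast, and reg_gpr64 are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: class in the high byte, hw number in the low byte. */
typedef uint16_t Register;

enum { CLASS_NONE = 0, CLASS_GPR64 = 1, CLASS_XMM = 2 };

/* Typed enum that carries only the hardware register number. */
typedef enum { GPR64_RAX = 0, GPR64_RCX = 1 } GPR64;

static Register make_reg(unsigned cls, unsigned hw) {
    assert(cls <= 0xFF && hw <= 0xFF);
    return (Register)((cls << 8) | hw);
}

static unsigned reg_class(Register r) { return (unsigned)(r >> 8); }

/* Broken pattern: a bare cast keeps the hw number and leaves class 0,
 * which a class-checking matcher rejects. */
static Register reg_from_cast(GPR64 r) { return (Register)r; }

/* Repaired pattern: go through a class-aware constructor, in the spirit
 * of op_gpr64. */
static Register reg_gpr64(GPR64 r) { return make_reg(CLASS_GPR64, (unsigned)r); }
```

A typed == generic consistency check catches this class of bug directly, since the two construction paths yield different Register values for the same hardware register.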
gen/builders/check/test/idempotent all green; 2276 cases.

Move all ten ISA packages (x86, arm32, arm64, mips, riscv, ppc, ppc_vle,
rsp, mos6502, mos65816) from core/rexcode/<arch> to core/rexcode/isa/<arch>,
so the import pattern is now `import "core:rexcode/isa/x86"`. The shared
core stays at core:rexcode/isa.
Mechanical: relative `import "../isa"` / `import "../../isa"` -> absolute
`import "core:rexcode/isa"` (the absolute path is the only form that still
resolves after the move; the "../" and "../.." self/generated imports move
with their packages). build.lua now builds paths as <root>/isa/<name>;
stale `cd <arch>` hints in the verify tools and the doc.odin paths are
updated.
WASM stays at core/rexcode/wasm for now -- it is an IR, not an ISA, and
will move under the forthcoming core:rexcode/ir once that layer lands.
All 10 arches gen/builders/check/test green; import core:rexcode/isa/x86
verified working; wasm still compiles.