Odin/core/rexcode/docs/ir_design.md
Brendan Punsky daa5b7cb79 rexcode: add core:rexcode/ir — the IR API layer (no concrete IR yet)
A sibling to core:rexcode/isa for the intermediate representations (WASM,
SPIR-V, LLVM bitcode + the LLVM dialects AIR/DXIL). Holds the shared
vocabulary every IR package builds on, implements no specific IR.

Design stance (see docs/ir_design.md): keep the ISA layer's spirit, but
where IRs are structurally MORE uniform than ISAs (SSA + a type system
regularize the operand/module shape), the shared core is richer. ir/ owns:

  status.odin  Error/Error_Code (shape-identical to isa.Error)
  refs.odin    Id/Ref/Ref_Space/Symbol_Table (the label analog: structural
               id references, not PC-relative byte offsets)
  types.odin   Type/Type_Ref/Type_Kind (the type table -- no ISA analog)
  module.odin  Module/Function/Block/Operation/Operand/Result/Dataflow
               (the structured model; Operation = isa.Instruction + an
               optional typed Result, opcode a u16 like Mnemonic)
  print.odin   token kinds + options + num-fmt (parallels isa.print)

Three honest concessions vs the ISA API, made explicit not inert: a
structured Module replaces the flat []Instruction; a first-class type
system; id-based entity refs replace labels. The encode/decode verbs take
a Module and drop label_defs/resolve/base_address. Dataflow hosts both the
WASM value stack and SSA; the codec is pluggable (table for WASM/SPIR-V,
bitstream for the LLVM family -- AIR/DXIL are LLVM dialects, not peers).

Package compiles; a hand-built SSA module round-trips through the types.
2026-06-18 19:03:27 -04:00


rexcode — IR API Design

Why the rexcode IR family (wasm, and the planned spirv, llvm, with air / dxil as LLVM dialects) gets its own API layer (core:rexcode/ir) parallel to the ISA layer (core:rexcode/isa) — sharing the ISA layer's spirit and as much of its shape as honestly survives, while conceding exactly the three places an IR is not an ISA.

Read cross_arch_design.md first; this document is its sibling and assumes its vocabulary.


The guiding principle

The ISA layer's rule was “share the bookkeeping, specialize the bytes.” The IR layer keeps it and adds one clause:

Share the bookkeeping and the structure, specialize the dialect and the codec.

An ISA only ever shares bookkeeping because its content (registers, operands, the bit-twiddling) diverges maximally per arch. An IR shares more — the whole Module → Function → Block → Operation structure and the operand/type model are genuinely the same problem on every IR — because SSA and a type system regularize what ISAs leave ad hoc. So ir/ is a richer shared core than isa/: it owns the structural model, not just labels and errors. What stays per-IR is the opcode set, the codec (the wire format), and the dialect (intrinsic/metadata conventions).


0. First: how many IRs are there really?

Fewer than the list suggests. AIR and DXIL are not peers of LLVM; they are LLVM bitcode. AIR is LLVM bitcode + a Metal dialect; DXIL is LLVM ~3.7 bitcode + a DirectX dialect inside a DXContainer. So the field is three codec families, not five:

| family | members | wire format |
|---|---|---|
| WASM | wasm | byte stream + LEB128, one form per opcode |
| SPIR-V | spirv | 32-bit words, uniform wordCount<<16 \| opcode header |
| LLVM bitstream | llvm, air, dxil | self-describing block/record/abbreviation bitstream |

The implementation cost is therefore 3 codecs + N dialects, and air/dxil should reuse the llvm codec wholesale. That single fact shapes the package tree: ir/llvm/, ir/llvm/air/, ir/llvm/dxil/.
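To make the SPIR-V row concrete, here is a minimal sketch of packing and splitting that uniform header word. The procedure names `spirv_header` and `spirv_split` are illustrative assumptions and are not part of rexcode.

```odin
package main

import "core:fmt"

// Pack the first word of a SPIR-V instruction:
// word count in the high 16 bits, opcode in the low 16 bits.
spirv_header :: proc(word_count: u16, opcode: u16) -> u32 {
	return (u32(word_count) << 16) | u32(opcode)
}

// Split a header word back into its two halves.
spirv_split :: proc(word: u32) -> (word_count: u16, opcode: u16) {
	return u16(word >> 16), u16(word & 0xFFFF)
}

main :: proc() {
	// OpTypeInt is opcode 21 and occupies 4 words
	// (header, result id, width, signedness).
	header := spirv_header(4, 21)
	assert(header == 0x0004_0015)

	word_count, opcode := spirv_split(header)
	fmt.println("word count:", word_count, "opcode:", opcode)
}
```

Because every instruction carries its own length, a decoder can skip opcodes it does not know, which is part of why a static table suffices for this family.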


1. The universal IR shape

Strip away specifics and every IR needs these — the same checklist isa has, shifted up one level of structure:

| # | Concept | ir type |
|---|---|---|
| 1 | A type = (kind, width/elem/fields) | Type, Type_Ref |
| 2 | An operand = literal \| entity-ref \| type | Operand, Operand_Kind |
| 3 | An operation = opcode + operands + optional result | Operation |
| 4 | An opcode enum | per-IR Opcode (u16, INVALID=0) |
| 5 | References to entities by id (+ named symbols) | Id, Ref, Symbol_Table |
| 6 | Relocations for object-file symbol fixups | per-IR Relocation |
| 7 | encode(Module) -> bytes (+relocs +errors) | per-IR encode() |
| 8 | decode(bytes) -> Module (+errors) | per-IR decode() |
| 9 | print(Module) -> text (+tokens) | per-IR print()/tprint() |
| + | A structured module of functions→blocks→operations | Module/Function/Block |
| + | A dataflow discipline (stack or SSA) | Dataflow |

Items 1–9 are item-for-item the ISA's nine, re-aimed: type generalizes the ISA's implicit width; operand keeps the kind tag; operation is Instruction + a Result; opcode is Mnemonic; references replace labels. The two + rows are the genuinely new structure (§3).

2. Where IRs diverge from ISAs

Three real divergences, then a long tail of things that look different but are the same shape.

The three real concessions

  1. The unit of work is a structured Module, not a flat []Instruction. An ISA program is a byte-addressed instruction stream; an IR program is a typed graph: Module → []Function → []Block → []Operation, where an op may define an SSA value that later ops use. So decode is a structured parse, not a linear scan, and ir owns Module/Function/Block where isa owns no Instruction. Operation.operands is variable-arity ([]Operand) — the ISA Instruction's fixed [4]Operand is the one leaf shape that does not survive (calls, switch, phi).

  2. A first-class type system. Operations and results carry a Type_Ref into the module's type table. ISAs bake width into the mnemonic and never need this. Type_Kind is the common denominator of WASM, SPIR-V, and LLVM (INT/FLOAT/VECTOR/POINTER/STRUCT/FUNCTION/...).

  3. Entity references replace PC-relative labels. ISA branches resolve as instruction-index→byte-offset (isa.Label_Definition, rewritten by encode). IR operands reference entities by Id (SSA results, blocks, functions, globals, types), resolved structurally, with no PC-relative pass. (Object-file symbol fixups still produce Relocations for EXTERNAL refs.)
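The first and third concessions can be sketched together: a nested module whose operands point at earlier results by id. The struct layouts below are simplified assumptions made for illustration (the real definitions live in ir/module.odin and may differ), and `block_refs_resolve`, `MAX_IDS`, and the opcode numbers are hypothetical.

```odin
package main

import "core:fmt"

// Hypothetical, simplified shapes; not the actual ir/ definitions.
Id       :: distinct u32
Type_Ref :: distinct u32
ID_NONE  :: Id(0)

Operand_Kind :: enum u8 { INT, VALUE }
Operand      :: struct { kind: Operand_Kind, value: u64 }
Result       :: struct { id: Id, type_ref: Type_Ref }

Operation :: struct {
	opcode:   u16,
	operands: []Operand, // variable arity, unlike a fixed [4]Operand
	result:   Result,    // id == ID_NONE when the op defines no value
}
Block    :: struct { id: Id, ops: []Operation }
Function :: struct { id: Id, blocks: []Block }
Module   :: struct { functions: []Function }

MAX_IDS :: 16 // toy bound for this sketch only

// Reports whether every VALUE operand names a result defined earlier in
// the block. Resolution is purely by id; no byte offsets are involved.
block_refs_resolve :: proc(b: Block) -> bool {
	defined: [MAX_IDS]bool
	for op in b.ops {
		for operand in op.operands {
			if operand.kind != .VALUE { continue }
			if operand.value >= MAX_IDS || !defined[int(operand.value)] {
				return false
			}
		}
		if op.result.id != ID_NONE && u32(op.result.id) < MAX_IDS {
			defined[int(op.result.id)] = true
		}
	}
	return true
}

main :: proc() {
	OP_CONST :: 1 // made-up opcode numbers
	OP_ADD   :: 2

	// %1 = const 2; %2 = const 3; %3 = add %1, %2 (all of type 1)
	ops := []Operation{
		{opcode = OP_CONST, operands = []Operand{{.INT, 2}}, result = Result{1, 1}},
		{opcode = OP_CONST, operands = []Operand{{.INT, 3}}, result = Result{2, 1}},
		{opcode = OP_ADD, operands = []Operand{{.VALUE, 1}, {.VALUE, 2}}, result = Result{3, 1}},
	}
	m := Module{functions = []Function{{id = 1, blocks = []Block{{id = 1, ops = ops}}}}}

	fmt.println("refs resolve:", block_refs_resolve(m.functions[0].blocks[0]))
}
```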

Two axes that sort the IRs

Everything else sorts onto two orthogonal axes. Note the clustering is counterintuitive — the encoding mates and the model mates are different pairs:

| IR | encoding model | dataflow model |
|---|---|---|
| WASM | table (byte/LEB, one form per opcode) | stack (implicit) |
| SPIR-V | table (32-bit words, uniform header) | SSA (result ids, typed) |
| LLVM / AIR / DXIL | bitstream (data-defined abbreviations) | SSA (+ metadata graph) |

  • On encoding, WASM and SPIR-V are siblings: a static opcode → operand-layout table, exactly the ISA ENCODING_TABLE shape. LLVM is the outlier: its layout is defined by abbreviation records in the stream, so no static table can describe it.
  • On dataflow, SPIR-V/LLVM are siblings (SSA + types); WASM is the outlier — a stack bytecode with no SSA, no named results, minimal types.

So WASM is encoding-kin to SPIR-V but model-kin to nothing, and the one thing you most want to share (LLVM) breaks the table assumption the others share. The Dataflow trait and the pluggable codec (§5) exist precisely to absorb these two splits without forking the API.
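As an illustration of the dataflow split, the sketch below evaluates the same `2 + 3` under both disciplines with one operation shape. Everything here is an assumption for illustration: the flattened `Operation` fields, the toy `Opcode` set, and the `eval` procedure are not the rexcode API.

```odin
package main

import "core:fmt"

Id      :: distinct u32
ID_NONE :: Id(0)

Dataflow :: enum u8 { STACK, SSA }
Opcode   :: enum u16 { INVALID = 0, CONST, ADD } // toy opcode set

// Hypothetical flattened operation, just enough for this sketch.
Operation :: struct {
	opcode: Opcode,
	imm:    i64, // literal for CONST
	a, b:   Id,  // value refs, used only under SSA
	result: Id,  // stays ID_NONE under STACK dataflow
}

// Evaluate a straight-line block and return the last value produced.
eval :: proc(flow: Dataflow, ops: []Operation) -> i64 {
	stack: [dynamic]i64
	defer delete(stack)
	values: map[Id]i64
	defer delete(values)

	last: i64
	for op in ops {
		v: i64
		#partial switch op.opcode {
		case .CONST:
			v = op.imm
		case .ADD:
			if flow == .SSA {
				// Inputs are named explicitly by id.
				v = values[op.a] + values[op.b]
			} else {
				// Inputs are implicit: the top two stack slots.
				rhs := pop(&stack)
				lhs := pop(&stack)
				v = lhs + rhs
			}
		}
		if flow == .SSA {
			values[op.result] = v
		} else {
			append(&stack, v)
		}
		last = v
	}
	return last
}

main :: proc() {
	stack_ops := []Operation{
		{opcode = .CONST, imm = 2},
		{opcode = .CONST, imm = 3},
		{opcode = .ADD},
	}
	ssa_ops := []Operation{
		{opcode = .CONST, imm = 2, result = 1},
		{opcode = .CONST, imm = 3, result = 2},
		{opcode = .ADD, a = 1, b = 2, result = 3},
	}
	fmt.println("stack:", eval(.STACK, stack_ops), "ssa:", eval(.SSA, ssa_ops))
}
```

The stack form never sets `result`, which is the sense in which an optional result id lets one operation type serve both models.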

Divergence summary

| Component | Verdict | Shared (ir/) | Per-IR |
|---|---|---|---|
| References / Id | shared | the whole id + symbol model | which Ref_Spaces exist |
| Error / status | shared | struct shape (= isa.Error) | error-code subset |
| Type model | shared | Type/Type_Ref/Type_Kind | wire⇄Type lowering |
| Operand model | shared* | Operand + kinds (SSA homogenizes it) | dialect aux encodings |
| Structural model | shared | Module/Function/Block/Operation | |
| Printer framework | ◑ split | tokens, options, num-fmt | type/value/block syntax |
| Relocation | ◑ split | struct-shape convention | type enum (per-IR file) |
| Opcode | ✗ per-IR | convention (u16, INVALID=0) | the enum |
| Opcode table / codec | ✗ per-IR | codec strategy (§5) | schema + data (or bitstream) |
| encode/decode driver | ✗ per-IR | verb signature | the whole parse/emit |

*Operand is shared here where isa.Operand is per-arch. ISA operands diverge wildly (ModRM/SIB vs shifted-register vs split immediates); SSA collapses IR operands to "a literal, a reference, or a type," uniform enough to define once. Dialect-specific encodings (WASM memarg, SPIR-V enum masks) are an encoding detail carried in Operand.aux + the IR's opcode table — not a new shape.


3. The shared core (ir/) and why this much is shared

ir/ depends on nothing (it does not depend on isa/) and owns the parts that are the same problem on every IR:

  • status.odin: Error/Error_Code; the Error struct is byte-identical to isa.Error so one tool surfaces both.
  • refs.odin: Id, Ref, Ref_Space, Symbol_Table (the isa.labels analog, re-cast from byte-offsets to structural ids).
  • types.odin: Type, Type_Ref, Type_Kind (no ISA analog).
  • module.odin: Module/Function/Block/Operation/Operand/Result/Dataflow (the structural model; the heart of the layer).
  • print.odin: token kinds (with IR-only TYPE/VALUE_REF/RESULT/BLOCK_LABEL), print options, number-formatting helpers.
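For the types.odin entry, one plausible way a type table could hand out Type_Ref values is by interning structurally equal types. This is a sketch under stated assumptions: the `Type` fields, the reserved slot 0, and the `intern` procedure are hypothetical and not taken from the package.

```odin
package main

import "core:fmt"

Type_Ref  :: distinct u32
Type_Kind :: enum u8 { INVALID, INT, FLOAT, POINTER }

// Hypothetical minimal type record: kind, bit width, and element type.
Type :: struct {
	kind:  Type_Kind,
	width: u16,
	elem:  Type_Ref,
}

// Return the index of an equal existing type, or append a new entry.
intern :: proc(table: ^[dynamic]Type, t: Type) -> Type_Ref {
	for existing, i in table^ {
		if existing == t {
			return Type_Ref(i)
		}
	}
	append(table, t)
	return Type_Ref(len(table^) - 1)
}

main :: proc() {
	table: [dynamic]Type
	defer delete(table)

	// Assumption: slot 0 is reserved so a zero Type_Ref means "no type".
	append(&table, Type{})

	i32_a := intern(&table, Type{kind = .INT, width = 32})
	i32_b := intern(&table, Type{kind = .INT, width = 32})
	ptr   := intern(&table, Type{kind = .POINTER, elem = i32_a})

	assert(i32_a == i32_b) // equal types share one entry
	fmt.println("i32:", i32_a, "ptr to i32:", ptr, "entries:", len(table))
}
```

A linear scan keeps the sketch short; a real table would more likely hash the type record.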

Each concrete IR package re-exports these (e.g. wasm.Module, spirv.Operation) so a consumer sees one namespace, mirroring how arch packages re-export isa.

The validating precedent and the rejected alternatives

The Operation-with-blocks-and-regions spine is exactly MLIR's structural model, which is field-proof that one model cleanly subsumes a CFG (LLVM/SPIR-V), structured control (WASM, as block regions), and a flat ISA (the degenerate one-block, no-SSA case). We take MLIR's spine, not its open-ended generality (no region/trait/dialect-registry machinery) — the lean version.

Rejected, for the same reasons the ISA layer rejected its three:

  1. Fold ISAs into the IR API (ISA = "degenerate IR"). True in theory, but it taxes the fast, flat ISA hot path with type/SSA/module machinery it never needs. Keep them siblings; share only the leaf vocabulary in spirit.
  2. One concrete codec for all IRs. LLVM's bitstream is not a static table; forcing WASM/SPIR-V and LLVM through one table breaks LLVM. The codec is pluggable behind the verbs (§5).
  3. Bake in SSA (mandatory results + value-refs). Excludes WASM. Dataflow + optional Result.id == ID_NONE keeps the stack machine first-class.

4. The naming contract

Every IR package exposes these names with these signatures — the checklist each new IR is built against.

Re-exported shared types (from ir): Module Function Block Operation Operand Operand_Kind Result Type Type_Ref Type_Kind Id Ref Ref_Space Symbol_Table Dataflow Error Error_Code Token Token_Kind Print_Options DEFAULT_PRINT_OPTIONS

Per-IR concrete types (identical names): Opcode (u16, INVALID = 0) and Relocation / Relocation_Type.

Operand constructors (shared): op_int op_float op_type op_ref op_value op_block, plus the IR's own dialect helpers where an opcode needs a structured immediate (e.g. a WASM op_memarg).

Operation builders & emitters — by shape, mnemonic passed in (an IR has hundreds of opcodes over a handful of shapes, so per-opcode typed builders are optional, not the default): op_none(opcode) op_unary(opcode, a) op_binary(opcode, a, b) op_call(callee, args) op_branch(target) … and emit_*.
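A rough sketch of how shape-based builders might look follows. The constructor and builder names come from the contract above, but their parameter types, the `Operand` layout (including where a memarg's alignment and offset are stored), the `make_op` helper, and the allocation strategy are all assumptions made for illustration.

```odin
package main

import "core:fmt"

Id     :: distinct u32
Opcode :: enum u16 { INVALID = 0, I32_ADD, I32_LOAD } // toy subset

Operand_Kind :: enum u8 { INT, VALUE, MEMARG }
Operand :: struct {
	kind:  Operand_Kind,
	value: u64, // literal, entity id, or memarg offset
	aux:   u32, // dialect detail, here the memarg alignment (log2)
}
Operation :: struct { opcode: Opcode, operands: []Operand }

// Operand constructors.
op_int   :: proc(v: i64) -> Operand { return Operand{kind = .INT, value = u64(v)} }
op_value :: proc(id: Id) -> Operand { return Operand{kind = .VALUE, value = u64(id)} }
op_memarg :: proc(align_log2: u32, offset: u32) -> Operand {
	return Operand{kind = .MEMARG, value = u64(offset), aux = align_log2}
}

// Hypothetical helper: copy the operands into storage the Operation owns.
make_op :: proc(opcode: Opcode, operands: ..Operand) -> Operation {
	owned := make([]Operand, len(operands))
	copy(owned, operands)
	return Operation{opcode = opcode, operands = owned}
}

// Builders by shape: one per arity, with the opcode passed in.
op_none   :: proc(opcode: Opcode) -> Operation { return make_op(opcode) }
op_unary  :: proc(opcode: Opcode, a: Operand) -> Operation { return make_op(opcode, a) }
op_binary :: proc(opcode: Opcode, a, b: Operand) -> Operation { return make_op(opcode, a, b) }

main :: proc() {
	add := op_binary(.I32_ADD, op_value(1), op_value(2))
	defer delete(add.operands)

	load := op_unary(.I32_LOAD, op_memarg(2, 16))
	defer delete(load.operands)

	fmt.println("add operands:", len(add.operands), "load align:", load.operands[0].aux)
}
```

Three builders cover every opcode of those arities, which is why per-opcode typed builders can stay optional.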

Entry points (identical signatures across IRs):

encode(m: Module, code: []u8,
       relocs: ^[dynamic]Relocation, errors: ^[dynamic]Error) -> (byte_count: u32, ok: bool)

decode(data: []u8, m: ^Module, errors: ^[dynamic]Error,
       allocator := context.allocator) -> (byte_count: u32, ok: bool)

print/tprint(m: Module, options := ir.DEFAULT_PRINT_OPTIONS) -> (Print_Result | string)

Note the deliberate differences from the ISA verbs: they take a Module, not []Instruction, and they drop label_defs / resolve / base_address — an IR has no PC-relative resolution pass, so those parameters would be dead. This is the divergence made explicit rather than carried inert. (It is also why WASM, currently shaped like an ISA package, will move to ir/wasm: its real encode/decode already dropped those parameters.)

Anything an IR genuinely lacks (WASM has no VALUE refs; an untyped IR no TYPE refs) is simply absent, not stubbed — same rule as the ISA layer.


5. Codecs — the one place the strategy, not just the data, differs

For an ISA, every codec is the same kind of thing (a bit/byte packer driven by a static table). For IRs there are two kinds, and the API contract is the verbs (§4), not the table — so a package picks its strategy underneath:

  • Table-driven (WASM, SPIR-V). A static OPCODE → [operand layout] table, literally the ISA ENCODING_TABLE pattern: hand-written single source of truth, O(1) dispatch. WASM's existing ENCODING_TABLE and SPIR-V's grammar JSON both fit this.
  • Bitstream (LLVM, AIR, DXIL). A generic block/record/abbreviation engine; operand layout is defined by abbreviation records encountered in the stream, so there is no static opcode table. This is a real subsystem (shared by the three LLVM-family members) that the LLVM IR reader sits on top of.

Both satisfy the same encode/decode signatures; callers never see which.
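The table-driven strategy can be sketched with a toy two-opcode WASM subset. This is a simplified illustration, not the package's encoder: `encode` here takes a bare `[]Operation` and omits the relocs/errors parameters, and `Layout`, `Entry`, and `write_sleb128` are hypothetical names. The opcode bytes 0x41 (i32.const) and 0x6A (i32.add) and the signed LEB128 immediate follow the WASM binary format.

```odin
package main

import "core:fmt"

Opcode :: enum u16 { INVALID = 0, I32_CONST, I32_ADD }
Layout :: enum u8 { NONE, SLEB }

Entry     :: struct { wire: u8, layout: Layout }
Operation :: struct { opcode: Opcode, imm: i64 }

// Static opcode -> (wire byte, operand layout) table.
ENCODING_TABLE := [Opcode]Entry{
	.INVALID   = {0x00, .NONE},
	.I32_CONST = {0x41, .SLEB},
	.I32_ADD   = {0x6A, .NONE},
}

// Write value as signed LEB128; fails if dst is too small.
write_sleb128 :: proc(dst: []u8, value: i64) -> (written: u32, ok: bool) {
	v := value
	i: u32
	for i < u32(len(dst)) {
		b := u8(v & 0x7F)
		v >>= 7 // arithmetic shift keeps the sign
		done := (v == 0 && (b & 0x40) == 0) || (v == -1 && (b & 0x40) != 0)
		if !done {
			b |= 0x80 // continuation bit
		}
		dst[i] = b
		i += 1
		if done {
			return i, true
		}
	}
	return i, false
}

// Encode into a caller-supplied buffer, dispatching on the table entry.
encode :: proc(ops: []Operation, code: []u8) -> (byte_count: u32, ok: bool) {
	n: u32
	for op in ops {
		if op.opcode == .INVALID || int(n) >= len(code) {
			return n, false
		}
		entry := ENCODING_TABLE[op.opcode]
		code[n] = entry.wire
		n += 1
		switch entry.layout {
		case .NONE:
		case .SLEB:
			written, fits := write_sleb128(code[n:], op.imm)
			if !fits {
				return n, false
			}
			n += written
		}
	}
	return n, true
}

main :: proc() {
	ops := []Operation{{.I32_CONST, 2}, {.I32_CONST, -1}, {.I32_ADD, 0}}
	buf: [16]u8
	n, ok := encode(ops, buf[:])
	fmt.println(ok, buf[:n])
}
```

A bitstream codec would have no equivalent of `ENCODING_TABLE`, since its operand layouts arrive as abbreviation records in the input, but it could sit behind the same verb.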


6. One-paragraph summary

Make ir own what is the same on every IR — and for IRs that is more than for ISAs: not just errors/refs/printing but the whole typed Module → Function → Block → Operation structure, because SSA and a type system regularize it. Keep the leaf ISA-shaped (Operation = Instruction + an optional Result, opcode a u16), keep the three verbs, and make exactly three concessions where an IR is not an ISA: a structured module instead of a flat stream, a first-class type table, and id-based entity references instead of PC-relative labels. Let Dataflow host both the stack machine and SSA, and let the codec be pluggable so the LLVM bitstream and the WASM/SPIR-V tables live under one contract. The result is a sibling to the ISA API, not a generalization of it: each new IR gets the shared structure and vocabulary for free and writes only its opcode set, its codec, and its dialect.