From daa5b7cb79c1fc78079c0ad0a22f60e4096c6794 Mon Sep 17 00:00:00 2001 From: Brendan Punsky Date: Thu, 18 Jun 2026 18:59:50 -0400 Subject: [PATCH] =?UTF-8?q?rexcode:=20add=20core:rexcode/ir=20=E2=80=94=20?= =?UTF-8?q?the=20IR=20API=20layer=20(no=20concrete=20IR=20yet)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A sibling to core:rexcode/isa for the intermediate representations (WASM, SPIR-V, LLVM bitcode + the LLVM dialects AIR/DXIL). Holds the shared vocabulary every IR package builds on, implements no specific IR. Design stance (see docs/ir_design.md): keep the ISA layer's spirit, but where IRs are structurally MORE uniform than ISAs (SSA + a type system regularize the operand/module shape), the shared core is richer. ir/ owns: status.odin Error/Error_Code (shape-identical to isa.Error) refs.odin Id/Ref/Ref_Space/Symbol_Table (the label analog: structural id references, not PC-relative byte offsets) types.odin Type/Type_Ref/Type_Kind (the type table -- no ISA analog) module.odin Module/Function/Block/Operation/Operand/Result/Dataflow (the structured model; Operation = isa.Instruction + an optional typed Result, opcode a u16 like Mnemonic) print.odin token kinds + options + num-fmt (parallels isa.print) Three honest concessions vs the ISA API, made explicit not inert: a structured Module replaces the flat []Instruction; a first-class type system; id-based entity refs replace labels. The encode/decode verbs take a Module and drop label_defs/resolve/base_address. Dataflow hosts both the WASM value stack and SSA; the codec is pluggable (table for WASM/SPIR-V, bitstream for the LLVM family -- AIR/DXIL are LLVM dialects, not peers). Package compiles; a hand-built SSA module round-trips through the types. 
--- core/rexcode/doc.odin | 3 +- core/rexcode/docs/ir_design.md | 270 +++++++++++++++++++++++++++++++++ core/rexcode/ir/doc.odin | 89 +++++++++++ core/rexcode/ir/module.odin | 154 +++++++++++++++++++ core/rexcode/ir/print.odin | 130 ++++++++++++++++ core/rexcode/ir/refs.odin | 99 ++++++++++++ core/rexcode/ir/status.odin | 42 +++++ core/rexcode/ir/types.odin | 66 ++++++++ 8 files changed, 852 insertions(+), 1 deletion(-) create mode 100644 core/rexcode/docs/ir_design.md create mode 100644 core/rexcode/ir/doc.odin create mode 100644 core/rexcode/ir/module.odin create mode 100644 core/rexcode/ir/print.odin create mode 100644 core/rexcode/ir/refs.odin create mode 100644 core/rexcode/ir/status.odin create mode 100644 core/rexcode/ir/types.odin diff --git a/core/rexcode/doc.odin b/core/rexcode/doc.odin index d74977111..b58fd1e6a 100644 --- a/core/rexcode/doc.odin +++ b/core/rexcode/doc.odin @@ -214,8 +214,9 @@ rexcode/ ppc_vle/ # Freescale VLE (sibling of ppc) riscv/ # RISC-V rsp/ # N64 RSP + ir/ # shared IR core (parallels isa/; see docs/ir_design.md) wasm/ # WebAssembly (an IR; destined for ir/wasm once the IR layer settles) - docs/ # cross-arch design + per-arch design docs + docs/ # cross-arch design + IR design + per-arch design docs ``` Each ISA is imported as `core:rexcode/isa/` (e.g. `core:rexcode/isa/x86`); the diff --git a/core/rexcode/docs/ir_design.md b/core/rexcode/docs/ir_design.md new file mode 100644 index 000000000..f6b52c2ee --- /dev/null +++ b/core/rexcode/docs/ir_design.md @@ -0,0 +1,270 @@ + + +# rexcode — IR API Design + +> Why the rexcode IR family (`wasm`, and the planned `spirv`, `llvm`, with +> `air` / `dxil` as LLVM dialects) gets its own API layer (`core:rexcode/ir`) +> **parallel** to the ISA layer (`core:rexcode/isa`) — sharing the ISA layer's +> spirit and as much of its shape as honestly survives, while conceding exactly +> the three places an IR is not an ISA. 
+ +Read [cross_arch_design.md](cross_arch_design.md) first; this document is its +sibling and assumes its vocabulary. + +--- + +## The guiding principle + +The ISA layer's rule was *“share the bookkeeping, specialize the bytes.”* The IR +layer keeps it and adds one clause: + +> **Share the bookkeeping *and the structure*, specialize the dialect and the codec.** + +An ISA only ever shares bookkeeping because its *content* (registers, operands, +the bit-twiddling) diverges maximally per arch. An IR shares **more** — the whole +`Module → Function → Block → Operation` structure and the operand/type model are +genuinely the same problem on every IR — because SSA and a type system regularize +what ISAs leave ad hoc. So `ir/` is a richer shared core than `isa/`: it owns the +*structural model*, not just labels and errors. What stays per-IR is the *opcode +set*, the *codec* (the wire format), and the *dialect* (intrinsic/metadata +conventions). + +--- + +## 0. First: how many IRs are there really? + +Fewer than the list suggests. **AIR and DXIL are not peers of LLVM — they are +LLVM bitcode.** AIR is LLVM bitcode + a Metal dialect; DXIL is LLVM ~3.7 bitcode ++ a DirectX dialect inside a DXContainer. So the field is **three codec +families**, not five: + +| family | members | wire format | +|---|---|---| +| WASM | wasm | byte stream + LEB128, one form per opcode | +| SPIR-V | spirv | 32-bit words, uniform `wordCount<<16 \| opcode` header | +| LLVM bitstream | llvm, **air**, **dxil** | self-describing block/record/abbreviation bitstream | + +The implementation cost is therefore *3 codecs + N dialects*, and `air`/`dxil` +should reuse the `llvm` codec wholesale. That single fact shapes the package +tree: `ir/llvm/`, `ir/llvm/air/`, `ir/llvm/dxil/`. + +--- + +## 1. 
The universal IR shape + +Strip away specifics and every IR needs these — the same checklist `isa` has, +shifted up one level of structure: + +| # | Concept | `ir` type | +|---|---|---| +| 1 | A **type** = (kind, width/elem/fields) | `Type`, `Type_Ref` | +| 2 | An **operand** = literal \| entity-ref \| type | `Operand`, `Operand_Kind` | +| 3 | An **operation** = opcode + operands + *optional result* | `Operation` | +| 4 | An **opcode** enum | per-IR `Opcode` (u16, INVALID=0) | +| 5 | **References** to entities by id (+ named symbols) | `Id`, `Ref`, `Symbol_Table` | +| 6 | **Relocations** for object-file symbol fixups | per-IR `Relocation` | +| 7 | `encode(Module) -> bytes (+relocs +errors)` | per-IR `encode()` | +| 8 | `decode(bytes) -> Module (+errors)` | per-IR `decode()` | +| 9 | `print(Module) -> text (+tokens)` | per-IR `print()`/`tprint()` | +| + | A **structured module** of functions→blocks→operations | `Module`/`Function`/`Block` | +| + | A **dataflow discipline** (stack or SSA) | `Dataflow` | + +Items 1–9 are item-for-item the ISA's nine, re-aimed: *type* generalizes the +ISA's implicit-width; *operand* keeps the kind tag; *operation* is `Instruction` ++ a `Result`; *opcode* is `Mnemonic`; *references* replace *labels*. The two `+` +rows are the genuinely new structure (§3). + +--- + +## 2. Where IRs diverge from ISAs + +Three real divergences, then a long tail of things that *look* different but are +the same shape. + +### The three real concessions + +1. **The unit of work is a structured `Module`, not a flat `[]Instruction`.** + An ISA program is a byte-addressed instruction stream; an IR program is a + typed graph: `Module → []Function → []Block → []Operation`, where an op may + define an SSA value that later ops use. So `decode` is a *structured parse*, + not a linear scan, and `ir` owns `Module`/`Function`/`Block` where `isa` owns + no `Instruction`. 
`Operation.operands` is **variable-arity** (`[]Operand`) — + the ISA `Instruction`'s fixed `[4]Operand` is the one leaf shape that does not + survive (calls, `switch`, `phi`). + +2. **A first-class type system.** Operations and results carry a `Type_Ref` into + the module's type table. ISAs bake width into the mnemonic and never need + this. `Type_Kind` is the WASM∪SPIR-V∪LLVM denominator (`INT/FLOAT/VECTOR/ + POINTER/STRUCT/FUNCTION/...`). + +3. **Entity references replace PC-relative labels.** ISA branches resolve as + instruction-index→byte-offset (`isa.Label_Definition`, rewritten by `encode`). + IR operands reference entities by **`Id`** — SSA results, blocks, functions, + globals, types — resolved *structurally*, with no PC-relative pass. (Object- + file *symbol* fixups still produce `Relocation`s for `EXTERNAL` refs.) + +### Two axes that sort the IRs + +Everything else sorts onto two orthogonal axes. Note the clustering is +counterintuitive — the encoding mates and the model mates are *different* pairs: + +| IR | encoding model | dataflow model | +|---|---|---| +| WASM | **table** (byte/LEB, one form per opcode) | **stack** (implicit) | +| SPIR-V | **table** (32-bit words, uniform header) | **SSA** (result ids, typed) | +| LLVM / AIR / DXIL | **bitstream** (data-defined abbreviations) | **SSA** (+ metadata graph) | + +- On **encoding**, WASM and SPIR-V are siblings — a static `opcode → operand- + layout` table, *exactly* the ISA `ENCODING_TABLE` shape. LLVM is the outlier: + its layout is defined by abbreviation records *in the stream*, so **no static + table can describe it**. +- On **dataflow**, SPIR-V/LLVM are siblings (SSA + types); **WASM is the + outlier** — a stack bytecode with no SSA, no named results, minimal types. + +So WASM is encoding-kin to SPIR-V but model-kin to nothing, and the one thing +you most want to share (LLVM) breaks the table assumption the others share. 
The +`Dataflow` trait and the *pluggable codec* (§5) exist precisely to absorb these +two splits without forking the API. + +### Divergence summary + +| Component | Verdict | Shared (`ir/`) | Per-IR | +|---|---|---|---| +| References / `Id` | ✅ shared | the whole id + symbol model | which `Ref_Space`s exist | +| Error / status | ✅ shared | struct shape (= `isa.Error`) | error-code subset | +| Type model | ✅ shared | `Type`/`Type_Ref`/`Type_Kind` | wire⇄`Type` lowering | +| Operand model | ✅ shared* | `Operand` + kinds (SSA homogenizes it) | dialect `aux` encodings | +| Structural model | ✅ shared | `Module`/`Function`/`Block`/`Operation` | — | +| Printer framework | ◑ split | tokens, options, num-fmt | type/value/block syntax | +| Relocation | ◑ split | struct-shape convention | type enum (per-IR file) | +| `Opcode` | ✗ per-IR | convention (u16, INVALID=0) | the enum | +| Opcode table / codec | ✗ per-IR | codec *strategy* (§5) | schema + data (or bitstream) | +| `encode`/`decode` driver | ✗ per-IR | verb signature | the whole parse/emit | + +> *`Operand` is shared here where `isa.Operand` is per-arch. ISA operands diverge +> wildly (ModRM/SIB vs shifted-register vs split immediates); SSA collapses IR +> operands to "a literal, a reference, or a type," uniform enough to define once. +> Dialect-specific encodings (WASM memarg, SPIR-V enum masks) are an *encoding* +> detail carried in `Operand.aux` + the IR's opcode table — not a new shape. + +--- + +## 3. The shared core (`ir/`) and why this much is shared + +`ir/` depends on nothing (it does **not** depend on `isa/`) and owns the parts +that are the same problem on every IR: + +- `status.odin` — `Error`/`Error_Code`; the `Error` struct is byte-identical to + `isa.Error` so one tool surfaces both. +- `refs.odin` — `Id`, `Ref`, `Ref_Space`, `Symbol_Table` (the `isa.labels` + analog, re-cast from byte-offsets to structural ids). +- `types.odin` — `Type`, `Type_Ref`, `Type_Kind` (no ISA analog). 
+- `module.odin` — `Module`/`Function`/`Block`/`Operation`/`Operand`/`Result`/ + `Dataflow` (the structural model; the heart of the layer). +- `print.odin` — token kinds (with IR-only `TYPE`/`VALUE_REF`/`RESULT`/ + `BLOCK_LABEL`), print options, number-formatting helpers. + +Each concrete IR package **re-exports** these (e.g. `wasm.Module`, +`spirv.Operation`) so a consumer sees one namespace, mirroring how arch packages +re-export `isa`. + +### The validating precedent and the rejected alternatives + +The `Operation`-with-blocks-and-regions spine is exactly **MLIR's** structural +model, which is field-proof that one model cleanly subsumes a CFG (LLVM/SPIR-V), +structured control (WASM, as block regions), *and* a flat ISA (the degenerate +one-block, no-SSA case). We take MLIR's spine, not its open-ended generality +(no region/trait/dialect-registry machinery) — the lean version. + +Rejected, for the same reasons the ISA layer rejected its three: + +1. **Fold ISAs into the IR API** (ISA = "degenerate IR"). True in theory, but it + taxes the fast, flat ISA hot path with type/SSA/module machinery it never + needs. Keep them **siblings**; share only the leaf vocabulary in spirit. +2. **One concrete codec for all IRs.** LLVM's bitstream is not a static table; + forcing WASM/SPIR-V and LLVM through one table breaks LLVM. The codec is + *pluggable* behind the verbs (§5). +3. **Bake in SSA** (mandatory results + value-refs). Excludes WASM. `Dataflow` + + optional `Result.id == ID_NONE` keeps the stack machine first-class. + +--- + +## 4. The naming contract + +Every IR package exposes these names with these signatures — the checklist each +new IR is built against. 
+ +**Re-exported shared types (from `ir`):** +`Module Function Block Operation Operand Operand_Kind Result Type Type_Ref +Type_Kind Id Ref Ref_Space Symbol_Table Dataflow Error Error_Code Token +Token_Kind Print_Options DEFAULT_PRINT_OPTIONS` + +**Per-IR concrete types (identical names):** +`Opcode` (u16, `INVALID = 0`) and `Relocation` / `Relocation_Type`. + +**Operand constructors (shared):** `op_int op_float op_type op_ref op_value +op_block`, plus the IR's own dialect helpers where an opcode needs a structured +immediate (e.g. a WASM `op_memarg`). + +**Operation builders & emitters** — by *shape*, mnemonic passed in (an IR has +hundreds of opcodes over a handful of shapes, so per-opcode typed builders are +optional, not the default): `op_none(opcode) op_unary(opcode, a) +op_binary(opcode, a, b) op_call(callee, args) op_branch(target) …` and `emit_*`. + +**Entry points (identical signatures across IRs):** + +```odin +encode(m: Module, code: []u8, + relocs: ^[dynamic]Relocation, errors: ^[dynamic]Error) -> (byte_count: u32, ok: bool) + +decode(data: []u8, m: ^Module, errors: ^[dynamic]Error, + allocator := context.allocator) -> (byte_count: u32, ok: bool) + +print/tprint/…(m: Module, options := ir.DEFAULT_PRINT_OPTIONS) -> (Print_Result | string) +``` + +Note the *deliberate* differences from the ISA verbs: they take a **`Module`**, +not `[]Instruction`, and they **drop `label_defs` / `resolve` / `base_address`** +— an IR has no PC-relative resolution pass, so those parameters would be dead. +This is the divergence made explicit rather than carried inert. (It is also why +WASM, currently shaped like an ISA package, will move to `ir/wasm`: its real +`encode`/`decode` already dropped those parameters.) + +> Anything an IR genuinely lacks (WASM has no `VALUE` refs; an untyped IR no +> `TYPE` refs) is simply **absent**, not stubbed — same rule as the ISA layer. + +--- + +## 5. 
Codecs — the one place the strategy, not just the data, differs + +For an ISA, every codec is the same *kind* of thing (a bit/byte packer driven by +a static table). For IRs there are **two kinds**, and the API contract is the +verbs (§4), not the table — so a package picks its strategy underneath: + +- **Table-driven (WASM, SPIR-V).** A static `OPCODE → [operand layout]` table, + literally the ISA `ENCODING_TABLE` pattern: hand-written single source of + truth, O(1) dispatch. WASM's existing `ENCODING_TABLE` and SPIR-V's grammar + JSON both fit this. +- **Bitstream (LLVM, AIR, DXIL).** A generic block/record/abbreviation engine; + operand layout is defined by abbreviation records encountered in the stream, + so there is no static opcode table. This is a real subsystem (shared by the + three LLVM-family members) that the LLVM IR reader sits on top of. + +Both satisfy the same `encode`/`decode` signatures; callers never see which. + +--- + +## 6. One-paragraph summary + +Make `ir` own what is the same on every IR — and for IRs that is *more* than for +ISAs: not just errors/refs/printing but the whole typed `Module → Function → +Block → Operation` structure, because SSA and a type system regularize it. Keep +the leaf ISA-shaped (`Operation` = `Instruction` + an optional `Result`, opcode a +u16), keep the three verbs, and make exactly three concessions where an IR is not +an ISA: a structured module instead of a flat stream, a first-class type table, +and id-based entity references instead of PC-relative labels. Let `Dataflow` +host both the stack machine and SSA, and let the codec be pluggable so the LLVM +bitstream and the WASM/SPIR-V tables live under one contract. The result is a +sibling to the ISA API, not a generalization of it: each new IR gets the shared +structure and vocabulary for free and writes only its opcode set, its codec, and +its dialect. 
diff --git a/core/rexcode/ir/doc.odin b/core/rexcode/ir/doc.odin new file mode 100644 index 000000000..2f3108967 --- /dev/null +++ b/core/rexcode/ir/doc.odin @@ -0,0 +1,89 @@ +// rexcode · Brendan Punsky (dotbmp@github), original author + +/* +# rexcode/ir — the IR API layer + +`core:rexcode/ir` is to the intermediate representations (WASM, SPIR-V, LLVM +bitcode, and the LLVM dialects AIR / DXIL) what `core:rexcode/isa` is to the +machine ISAs: the **shared core** every concrete IR package builds on. It holds +the parts that are the same for every IR, and defines the contract each IR +package follows. It implements **no specific IR** — the concrete packages +(`core:rexcode/ir/wasm`, `.../spirv`, `.../llvm`, …) are added separately. + +See `docs/ir_design.md` for the full design rationale and the ISA↔IR comparison. + +## Why a sibling, not a generalization of `isa` + +The ISA API works because every arch follows one *shape contract* +(`Mnemonic` / `Instruction` / `Operand` / `encode` / `decode` / `print`) while +the shared `isa` package carries only the universal bookkeeping. The IR API +keeps that spirit, with three honest concessions where IRs truly diverge: + + 1. **A structured module replaces the flat instruction stream.** The unit of + work is a `Module` (`Module → []Function → []Block → []Operation`), not a + `[]Instruction`. So `ir` owns the *structural model* (module/function/block/ + operation), where `isa` owns no `Instruction`. + + 2. **A first-class type system.** Operations and results reference a + module-level type table by `Type_Ref`. ISAs bake width into the mnemonic. + + 3. **Entity references replace PC-relative labels.** Operands reference SSA + values / blocks / functions / globals / types by `Id`, resolved + structurally — there is no instruction-index→byte-offset rewrite. (Object- + file *symbol* fixups still produce Relocations, defined per-IR.) 
+ +Everything else is deliberately ISA-shaped: the leaf `Operation` is +`isa.Instruction` + an optional typed `Result`, `opcode` is a u16 just like +`isa.Mnemonic`, `Operand` is one discriminated value, and the verbs are the same +three. `Dataflow` lets one model host both an implicit value stack (WASM) and +explicit SSA (SPIR-V/LLVM) without baking in either. + +## What this package provides (shared) + + * `status.odin` — `Error` / `Error_Code` (shape-identical to `isa.Error`). + * `refs.odin` — `Id` / `Ref` / `Ref_Space` / `Symbol_Table` (the label analog). + * `types.odin` — `Type` / `Type_Ref` / `Type_Kind` (the type table). + * `module.odin` — `Module` / `Function` / `Block` / `Operation` / `Operand` / + `Result` / `Dataflow` (the structural model). + * `print.odin` — token kinds, print options, number-formatting helpers. + +## What a concrete IR package provides (the contract) + +Each `core:rexcode/ir/` package supplies, mirroring an arch package: + + * `Opcode` — the IR's operation enum (`u16`, `INVALID = 0`), stored in + `Operation.opcode`. (Analogous to a `Mnemonic`.) + * A **codec** — the wire format. Two strategies cover the field: + - *table-driven* (WASM byte/LEB, SPIR-V 32-bit words): a static + `OPCODE → operand-layout` table, exactly like an ISA `ENCODING_TABLE`. + - *bitstream* (LLVM bitcode, and thus AIR / DXIL): a block/record/ + abbreviation engine; the operand layout is data-defined, so there is no + static table. The codec is pluggable behind the verbs below. 
+ * The three verbs, on a `Module` (vs the ISA verbs' `[]Instruction`): + + encode :: proc(m: Module, + code: []u8, + relocs: ^[dynamic]Relocation, + errors: ^[dynamic]Error) -> (byte_count: u32, ok: bool) + + decode :: proc(data: []u8, + m: ^Module, + errors: ^[dynamic]Error, + allocator := context.allocator) -> (byte_count: u32, ok: bool) + + print :: proc(m: Module, options := ir.DEFAULT_PRINT_OPTIONS) -> ir.Print_Result + tprint :: proc(m: Module, options := ir.DEFAULT_PRINT_OPTIONS) -> string + + (`encode`/`decode` deliberately *drop* the ISA verbs' `label_defs` / + `resolve` / `base_address` — there is no PC-relative resolution pass — and + take a `Module` rather than an instruction array. That is the whole point of + the divergence, made explicit rather than left inert.) + + * `Relocation` / `Relocation_Type` — per-IR (the linker fixups for `EXTERNAL` + references), exactly as each arch owns its `reloc.odin`. + * Type lowering — how the IR's wire types map to/from `ir.Type`. + +A *dialect* (AIR over LLVM, DXIL over LLVM) reuses its base IR's codec wholesale +and adds only the intrinsic/metadata conventions and any container wrapper. +*/ +package rexcode_ir diff --git a/core/rexcode/ir/module.odin b/core/rexcode/ir/module.odin new file mode 100644 index 000000000..224b760f8 --- /dev/null +++ b/core/rexcode/ir/module.odin @@ -0,0 +1,154 @@ +// rexcode · Brendan Punsky (dotbmp@github), original author + +package rexcode_ir + +// ============================================================================= +// STRUCTURAL MODEL (the core of the IR API) +// ============================================================================= +// +// The central divergence from the ISA API. An ISA program is a flat +// `[]Instruction`; an IR program is a *typed, structured module* -- +// +// Module → []Function → []Block → []Operation +// +// where an operation may define an SSA result that later operations reference. 
+// +// Design stance vs the ISA API: +// +// * The leaf is kept ISA-shaped on purpose. `Operation` is `isa.Instruction` +// plus an optional typed `Result`; `opcode` is the concrete IR's Opcode +// enum stored as a u16, exactly as `isa` stores `Mnemonic` as a u16. So the +// opcode-table dispatch, the encode/decode/print verbs, and the relocation +// model all carry over. +// +// * `Operand` is *shared* here, where `isa.Operand` is per-arch. Justified: +// ISA operands diverge wildly (ModRM/SIB vs shifted-register vs ...), but +// SSA collapses IR operands to "a literal, a reference to an entity, or a +// type", which is uniform enough to define once. Dialect-specific operand +// encodings (WASM memarg, SPIR-V enum masks) ride in `aux` + the IR's own +// opcode table -- they are an encoding detail, not a new operand shape. +// +// * Both dataflow styles are first-class. `Dataflow` is a per-IR trait, NOT a +// baked-in assumption: a stack IR (WASM) leaves `Result.id == ID_NONE` and +// references nothing through VALUE; an SSA IR (SPIR-V/LLVM) names results +// and threads them as REF operands. The model excludes neither. +// +// What is deliberately NOT here: the wire codec (`encode`/`decode`) and the +// printer. Those are per-IR -- just as `isa` defines no `encode`, each concrete +// IR provides its own, against the contract in `doc.odin`. This package is the +// shared *vocabulary*, not an implementation. + +// Per-IR dataflow discipline. WASM = STACK; SPIR-V / LLVM / AIR / DXIL = SSA. 
+Dataflow :: enum u8 { STACK, SSA } + +// ----------------------------------------------------------------------------- +// Operand (generalizes isa.Operand) +// ----------------------------------------------------------------------------- + +Operand_Kind :: enum u8 { + NONE, + LIT_INT, // integer literal (value in `imm`) + LIT_FLOAT, // float literal (IEEE bits in `imm`, width in `aux`) + REF, // reference to an entity: `imm` is the Id, `space` the Ref_Space + TYPE, // a Type_Ref (in the low 32 bits of `imm`) + ATTRIBUTE, // a dialect enum / decoration / mask (value in `imm`, tag in `aux`) +} + +// 16 bytes. The payload is one i64 (covers an Id, a Type_Ref, an int/float-bits +// literal, or an attribute value); `space`/`aux` discriminate the entity space +// or dialect tag. Large/aggregate constants are *entities* (a CONSTANT ref), not +// inline operands -- the SSA way -- so no inline byte blob is needed here. +Operand :: struct #packed { + imm: i64, + kind: Operand_Kind, + space: Ref_Space, // REF: which id space + aux: u16, // LIT_FLOAT width / ATTRIBUTE tag / dialect bits + flags: u32, +} +#assert(size_of(Operand) == 16) + +@(require_results) op_int :: #force_inline proc "contextless" (v: i64) -> Operand { return Operand{kind = .LIT_INT, imm = v} } +@(require_results) op_float :: #force_inline proc "contextless" (bits: u64, w: u16) -> Operand { return Operand{kind = .LIT_FLOAT, imm = i64(bits), aux = w} } +@(require_results) op_type :: #force_inline proc "contextless" (t: Type_Ref) -> Operand { return Operand{kind = .TYPE, imm = i64(u32(t))} } + +@(require_results) +op_ref :: #force_inline proc "contextless" (space: Ref_Space, id: Id) -> Operand { + return Operand{kind = .REF, space = space, imm = i64(u32(id))} +} + +@(require_results) op_value :: #force_inline proc "contextless" (id: Id) -> Operand { return op_ref(.VALUE, id) } +@(require_results) op_block :: #force_inline proc "contextless" (id: Id) -> Operand { return op_ref(.BLOCK, id) } + +// Reconstruct 
the Id / Type_Ref carried by an operand. +@(require_results) operand_id :: #force_inline proc "contextless" (o: Operand) -> Id { return Id(u32(o.imm)) } +@(require_results) operand_type :: #force_inline proc "contextless" (o: Operand) -> Type_Ref { return Type_Ref(u32(o.imm)) } + +// ----------------------------------------------------------------------------- +// Operation (the leaf -- parallels isa.Instruction) +// ----------------------------------------------------------------------------- + +Operation_Flags :: bit_field u8 { + terminator: bool | 1, // ends a block (br / ret / switch / unreachable) + control: bool | 1, // structured-control op (block/loop/if/... for stack IRs) + memory: bool | 1, // touches linear memory / pointers + _: u8 | 5, +} + +// `opcode` is the concrete IR's Opcode enum, stored as u16 (like isa.Mnemonic). +// `operands` is variable-arity (calls, switch, phi) and caller-owned, like the +// rest of the decoded module -- the fixed `[4]Operand` of the ISA Instruction is +// the one shape that does not survive into IRs. +Operation :: struct { + operands: []Operand, + result: Result, // SSA def; `.id == ID_NONE` for stack/void ops + opcode: u16, + flags: Operation_Flags, + _: u8, +} + +// What an operation produces. +Result :: struct #packed { + id: Id, // ID_NONE if the op defines no value + type: Type_Ref, +} +#assert(size_of(Result) == 8) + +// ----------------------------------------------------------------------------- +// Containers (no ISA parallel -- the structured-module concession) +// ----------------------------------------------------------------------------- + +// A basic block (SSA) or a structured region (stack IRs). `params` are block +// arguments (SSA, phi-free form); empty for stack IRs. The terminator is the +// final operation (Operation_Flags.terminator). 
+Block :: struct { + ops: []Operation, + params: []Result, + id: Id, +} + +Function :: struct { + blocks: []Block, + name: string, + signature: Type_Ref, // a FUNCTION type in Module.types +} + +// A module-level mutable/immutable value. +Global :: struct { + name: string, + init: Id, // a CONSTANT ref, or ID_NONE + type: Type_Ref, + mutable: bool, +} + +// The module -- the unit the IR verbs operate on (where the ISA verbs take a +// flat `[]Instruction`). Metadata, decorations, and dialect custom sections are +// carried by the concrete IR alongside this core, the way each arch carries its +// own reloc.odin. +Module :: struct { + target: string, // triple / capability profile / version tag + types: []Type, // the type table; Type_Ref indexes here + globals: []Global, + functions: []Function, + symbols: Symbol_Table, // externally-visible names + dataflow: Dataflow, +} diff --git a/core/rexcode/ir/print.odin b/core/rexcode/ir/print.odin new file mode 100644 index 000000000..5be66204f --- /dev/null +++ b/core/rexcode/ir/print.odin @@ -0,0 +1,130 @@ +// rexcode · Brendan Punsky (dotbmp@github), original author + +package rexcode_ir + +// ============================================================================= +// PRINTER FRAMEWORK (shared scaffolding -- parallels isa.print) +// ============================================================================= +// +// Same role as isa.print: the universal pieces of textual output (token kinds +// for highlighting, print options, the result type, number-formatting helpers). +// A concrete IR's printer (WAT / SPIR-V disasm / LLVM `.ll`) owns the syntax of +// types, value names, blocks, and the output-sink procedures, and calls these +// helpers for hex/decimal. Kept independent of isa.print so the two siblings do +// not couple; the `Token_Kind` set adds the IR-only categories. 
+ +import "core:strings" +import "core:reflect" + +Token_Kind :: enum u8 { + WHITESPACE, + NEWLINE, + OFFSET, // byte/word offset prefix + KEYWORD, // `func` / `block` / `define` / `OpLabel` style keywords + OPCODE, // the operation mnemonic + TYPE, // a type reference / spelling (IR-only) + VALUE_REF, // a use of an SSA value / local (IR-only) + RESULT, // a value definition (`%3 =`) (IR-only) + BLOCK_LABEL, // a basic-block / branch-target label (IR-only) + GLOBAL_REF, // function / global / symbol reference (IR-only) + IMMEDIATE, // literal constant + ATTRIBUTE, // dialect attribute / decoration / flag (IR-only) + PUNCTUATION, // `(`, `)`, `,`, `=`, `:` + COMMENT, +} + +Token :: struct { + offset: u32, // byte offset in the output string + length: u16, + kind: Token_Kind, + operation_index: u16, // which operation (0xFFFF for module-level / whitespace) +} + +@(require_results) +token_kind_to_string :: proc(k: Token_Kind) -> string { + if name, ok := reflect.enum_name_from_value(k); ok { + return name + } + return "???" 
+} + +// ----------------------------------------------------------------------------- +// Print options & result (same shape as isa, IR-flavoured defaults) +// ----------------------------------------------------------------------------- + +Print_Options :: struct { + uppercase: bool, + hex_prefix: string, // default "0x" + hex_lowercase: bool, + value_prefix: string, // SSA value sigil, default "%" + block_prefix: string, // block-label sigil, default "^" + show_offsets: bool, + indent: string, // default " " + separator: string, // default "\n" +} + +DEFAULT_PRINT_OPTIONS :: Print_Options{ + uppercase = false, + hex_prefix = "0x", + hex_lowercase = true, + value_prefix = "%", + block_prefix = "^", + show_offsets = false, + indent = " ", + separator = "\n", +} + +Print_Result :: struct { + text: string, + tokens: []Token, // nil unless requested +} + +// ----------------------------------------------------------------------------- +// Number formatting helpers (used by every IR printer -- arch/IR-agnostic) +// ----------------------------------------------------------------------------- + +print_hex :: proc(sb: ^strings.Builder, value: u64, options: ^Print_Options) { + strings.write_string(sb, options.hex_prefix) + print_hex_digits(sb, value, options) +} + +print_hex_digits :: proc(sb: ^strings.Builder, value: u64, options: ^Print_Options) { + if value == 0 { + strings.write_byte(sb, '0') + return + } + buf: [16]u8 + i := 0 + v := value + for v > 0 { + digit := u8(v & 0xF) + buf[i] = digit < 10 ? 
'0' + digit : 'a' + digit - 10 + v >>= 4 + i += 1 + } + for j := i - 1; j >= 0; j -= 1 { + c := buf[j] + if !options.hex_lowercase && c >= 'a' && c <= 'f' { + c -= 32 + } + strings.write_byte(sb, c) + } +} + +print_decimal :: proc(sb: ^strings.Builder, value: u32) { + if value == 0 { + strings.write_byte(sb, '0') + return + } + buf: [10]u8 + i := 0 + v := value + for v > 0 { + buf[i] = '0' + u8(v % 10) + v /= 10 + i += 1 + } + for j := i - 1; j >= 0; j -= 1 { + strings.write_byte(sb, buf[j]) + } +} diff --git a/core/rexcode/ir/refs.odin b/core/rexcode/ir/refs.odin new file mode 100644 index 000000000..98d46c484 --- /dev/null +++ b/core/rexcode/ir/refs.odin @@ -0,0 +1,99 @@ +// rexcode · Brendan Punsky (dotbmp@github), original author + +package rexcode_ir + +// ============================================================================= +// REFERENCES (the IR analog of isa.labels) +// ============================================================================= +// +// This is the first place the IR API genuinely diverges from the ISA API. +// +// An ISA resolves control flow as *PC-relative labels*: `Label_Definition` +// maps a label id to an instruction index and `encode()` rewrites it to a byte +// offset (isa.labels.rewrite_label_defs_to_offsets). That model is wrong for an +// IR: IR operands reference *entities by id* -- SSA results, blocks, functions, +// globals, types -- which are stable indices into the module's entity tables, +// not byte offsets, and resolve *structurally* (no PC-relative pass). +// +// So the label machinery is replaced, not re-exported. What survives in spirit: +// * a small distinct-u32 id type with an "undefined" sentinel (forward refs), +// * a name<->id table for the externally-visible symbols (the Label_Map analog).
+// +// Object-file *symbol* fixups (a linker patching a function/global index) are +// still real and still produce Relocations -- but that is a codec concern, +// defined per-IR (parallel to each arch's reloc.odin), not here. + +// A stable id into one of the module's entity spaces (see Ref_Space). +Id :: distinct u32 + +ID_NONE :: Id(0xFFFFFFFF) + +// Which id space a reference addresses. Drives validation, printer annotation, +// and (for EXTERNAL) relocation-type selection. This is the union of the spaces +// the modelled IRs use; a concrete IR uses only the subset it needs -- a stack +// IR (WASM) never produces a VALUE ref, an untyped IR never produces a TYPE ref. +Ref_Space :: enum u8 { + NONE, + VALUE, // an SSA result (or a local/stack slot) + BLOCK, // a basic block / structured-control label (branch target) + FUNCTION, + GLOBAL, + TYPE, + CONSTANT, // a constant-pool entry + MEMORY, // a linear memory / address space + METADATA, // a metadata/debug node + EXTERNAL, // an imported/exported symbol -- relocatable across object files +} + +// A typed reference: which space, plus the id within it. Carried by REF operands +// and by branch targets. Kept to 8 bytes so it stays cheap to copy, in the way +// isa.Label_Definition is u32-cheap. 
+Ref :: struct #packed { + id: Id, + space: Ref_Space, + _: [3]u8, +} +#assert(size_of(Ref) == 8) + +@(require_results) +ref :: #force_inline proc "contextless" (space: Ref_Space, id: Id) -> Ref { + return Ref{id = id, space = space} +} + +// ----------------------------------------------------------------------------- +// Symbol table (the IR analog of isa.Label_Map: name <-> id for visible names) +// ----------------------------------------------------------------------------- + +Symbol_Table :: struct { + names: map[string]Id, + space: Ref_Space, // what these names address (usually FUNCTION/GLOBAL) +} + +symbol_table_init :: #force_inline proc(st: ^Symbol_Table, space := Ref_Space.EXTERNAL, allocator := context.allocator) { + st.names = make(map[string]Id, allocator = allocator) + st.space = space +} + +symbol_table_destroy :: #force_inline proc(st: ^Symbol_Table) { + delete(st.names) +} + +// Bind a name to an id (e.g. when a definition is emitted). +symbol_define :: #force_inline proc(st: ^Symbol_Table, name: string, id: Id) { + st.names[name] = id +} + +// Reserve a name for a forward reference; resolve later with symbol_define. 
+@(require_results) +symbol_reserve :: #force_inline proc(st: ^Symbol_Table, name: string) -> Id { + if existing, ok := st.names[name]; ok { + return existing + } + st.names[name] = ID_NONE + return ID_NONE +} + +@(require_results) +symbol_lookup :: #force_inline proc(st: ^Symbol_Table, name: string) -> (id: Id, ok: bool) { + id, ok = st.names[name] + return +} diff --git a/core/rexcode/ir/status.odin b/core/rexcode/ir/status.odin new file mode 100644 index 000000000..169ada9ac --- /dev/null +++ b/core/rexcode/ir/status.odin @@ -0,0 +1,42 @@ +// rexcode · Brendan Punsky (dotbmp@github), original author + +package rexcode_ir + +// ============================================================================= +// ERROR / RESULT TYPES (shared by every IR codec) +// ============================================================================= +// +// Parallels isa.status. The `Error` struct shape is intentionally identical to +// `isa.Error` (8 bytes: a u32 location + a 1-byte code) so a tool can surface +// ISA and IR diagnostics through one path. `Error_Code` keeps the encode/decode +// codes shared with the ISA side, then adds the codes only a *typed, structured* +// IR can produce. Per-IR codecs emit the subset that applies to them. + +Error_Code :: enum u8 { + NONE = 0, + + // Shared with the ISA side (encode/decode of the byte/word stream). + INVALID_OPCODE, + NO_MATCHING_ENCODING, + OPERAND_MISMATCH, + IMMEDIATE_OUT_OF_RANGE, + BUFFER_OVERFLOW, + BUFFER_TOO_SHORT, + + // IR-specific (no ISA analog -- these need a type system / SSA / a module). + INVALID_TYPE, // malformed or out-of-range Type_Ref + TYPE_MISMATCH, // an operand/result type disagrees with the op signature + UNDEFINED_REF, // a Ref to an id/symbol that is never defined + DUPLICATE_DEFINITION, // an id/symbol defined twice + MALFORMED_MODULE, // structural violation (block without terminator, ...) 
+ UNSUPPORTED_FEATURE, // a capability/extension the codec does not implement +} + +// `location` is the operation index on encode, or the byte offset on decode -- +// mirroring isa.Error.inst_idx. +Error :: struct #packed { + location: u32, + code: Error_Code, + _: [3]u8, +} +#assert(size_of(Error) == 8) diff --git a/core/rexcode/ir/types.odin b/core/rexcode/ir/types.odin new file mode 100644 index 000000000..332b77340 --- /dev/null +++ b/core/rexcode/ir/types.odin @@ -0,0 +1,66 @@ +// rexcode · Brendan Punsky (dotbmp@github), original author + +package rexcode_ir + +// ============================================================================= +// TYPE MODEL +// ============================================================================= +// +// The second genuine divergence from the ISA API: IRs have a *first-class type +// system*. An ISA bakes width into the mnemonic (`ADD` vs `ADDB`); operands are +// just bit patterns. An IR carries an explicit type table and operations / +// results reference types by `Type_Ref` (an index into `Module.types`). +// +// `Type_Kind` is the common denominator across the modelled IRs: +// * WASM: i32/i64/f32/f64/v128 + funcref/externref (a *degenerate* table -- +// a handful of primitives, no user structs). +// * SPIR-V: OpTypeInt / Float / Vector / Pointer / Struct / Function / ... +// * LLVM: iN / float / pointer / vector / array / struct / function / opaque. +// +// A concrete IR lowers its wire types onto this set on decode and back on +// encode. Anything a dialect needs beyond the common shape rides in `aux` (e.g. +// pointer address space) or in the concrete IR's own side tables. 
+ +Type_Ref :: distinct u32 + +TYPE_NONE :: Type_Ref(0xFFFFFFFF) + +Type_Kind :: enum u8 { + VOID, + INT, // `bits` = width (1/8/16/32/64/...); signedness is op-level in most IRs + FLOAT, // `bits` = width (16/32/64/128) + VECTOR, // `elem` x `count` (fixed-width SIMD) + ARRAY, // `elem` x `count` + POINTER, // `elem`, address space in `aux` + STRUCT, // members in `fields` + FUNCTION, // `fields` = params ++ [result]; `count` = param count + OPAQUE, // named / forward-declared / abstract handle (images, tokens, ...) + REF, // funcref / externref / typed GC reference (`elem` for typed refs) +} + +// One node in a module's type table. `fields` (struct members / function +// signature) is caller-owned, like the rest of the decoded module. +Type :: struct { + fields: []Type_Ref, // STRUCT members, or FUNCTION params ++ result + name: string, // OPAQUE / named struct + elem: Type_Ref, // VECTOR / ARRAY / POINTER / typed REF element + count: u32, // VECTOR / ARRAY length, or FUNCTION param count + bits: u16, // INT / FLOAT width + aux: u16, // POINTER address space, packed kind flags, ... + kind: Type_Kind, + _: [3]u8, +} + +@(require_results) type_void :: #force_inline proc "contextless" () -> Type { return Type{kind = .VOID} } +@(require_results) type_int :: #force_inline proc "contextless" (bits: u16) -> Type { return Type{kind = .INT, bits = bits} } +@(require_results) type_float :: #force_inline proc "contextless" (bits: u16) -> Type { return Type{kind = .FLOAT, bits = bits} } + +@(require_results) +type_vector :: #force_inline proc "contextless" (elem: Type_Ref, count: u32) -> Type { + return Type{kind = .VECTOR, elem = elem, count = count} +} + +@(require_results) +type_pointer :: #force_inline proc "contextless" (elem: Type_Ref, address_space: u16 = 0) -> Type { + return Type{kind = .POINTER, elem = elem, aux = address_space} +}
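A minimal usage sketch of the pieces added in refs.odin and types.odin, not part of the patch. It assumes the package is importable as `core:rexcode/ir` (as the commit message describes), and it uses a local `[dynamic]ir.Type` as a stand-in for `Module.types`, whose definition lives in module.odin and is not shown here. The package name `ir_usage_sketch` and the symbol name "main" are hypothetical.

```odin
package ir_usage_sketch

import ir "core:rexcode/ir"

main :: proc() {
	// Stand-in for Module.types (assumption: the real table is a slice or
	// dynamic array of ir.Type indexed by Type_Ref).
	types: [dynamic]ir.Type
	defer delete(types)

	append(&types, ir.type_int(32))                    // index 0: i32
	append(&types, ir.type_vector(ir.Type_Ref(0), 4))  // index 1: 4 x i32
	append(&types, ir.type_pointer(ir.Type_Ref(1), 1)) // index 2: pointer to #1, address space 1

	assert(types[2].kind == .POINTER)
	assert(types[2].elem == ir.Type_Ref(1))
	assert(types[2].aux == 1)

	// Forward-reference flow: reserve a name, define it later, then look it up.
	st: ir.Symbol_Table
	ir.symbol_table_init(&st, .FUNCTION)
	defer ir.symbol_table_destroy(&st)

	fwd := ir.symbol_reserve(&st, "main")
	assert(fwd == ir.ID_NONE) // reserved but not yet defined

	ir.symbol_define(&st, "main", ir.Id(0))

	id, ok := ir.symbol_lookup(&st, "main")
	assert(ok && id == ir.Id(0))

	// Build the typed reference an operand would carry.
	r := ir.ref(st.space, id)
	assert(r.space == .FUNCTION && r.id == ir.Id(0))
}
```

Note that `symbol_lookup` reports `ok = true` for a name that was reserved but never defined, with `id == ID_NONE`, so a caller that needs a resolved id would likely have to check for the sentinel as well.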