Mirror of https://github.com/ghostty-org/ghostty.git, synced 2026-04-14 03:25:50 +00:00
This PR builds on https://github.com/ghostty-org/ghostty/pull/9678 ~so the diff from there is included here (it's not possible to stack PRs unless it's a PR against my own fork)--review that one first!~

This PR updates the `graphemeBreak` calculation to use `uucode`'s `computeGraphemeBreakNoControl`, which has [tests in uucode](215ff09730/src/x/grapheme.zig (L753)) confirming that it passes `GraphemeBreakTest.txt` (minus some exceptions).

Note that the `grapheme_break` (and `grapheme_break_no_control`) property in `uucode` incorporates `emoji_modifier` and `emoji_modifier_base`, diverging from UAX #29 but matching UTS #51. See [this comment in uucode](215ff09730/src/grapheme.zig (L420-L434)) for details. The `grapheme_break_no_control` property and `computeGraphemeBreakNoControl` both assume `control`, `cr`, and `lf` have already been filtered out, matching the current grapheme break logic in Ghostty.

This PR keeps the `Precompute.data` logic mostly equivalent, since `uucode`'s `precomputedGraphemeBreak` has no benchmarks in the `uucode` repository (it was benchmarked in [the original PR adding `uucode` to Ghostty](https://github.com/ghostty-org/ghostty/pull/8757)). Note, however, that because `grapheme_break` is one bit larger than `grapheme_boundary_class`, and the new `BreakState` is also one bit larger, the precomputed table grows by a factor of 8 (u10 -> u13), to 8 KB.

## Benchmarks

~I benchmarked the old `main` version versus this PR for `+grapheme-break` and surprisingly this PR is 2% faster (?). Looking at the assembly though, I'm thinking something else might be causing that. Once I get to the bottom of that I'll remove the below TODO and include the benchmark results here.~

Seeing the speedup with `data.txt` (and maybe a tiny speedup on the English wiki dump), I was surprised given the 1 KB -> 8 KB tables.
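The factor-of-8 growth follows directly from the bit widths. A quick arithmetic sketch (assuming one byte per table entry, as the 1 KB / 8 KB figures imply):

```python
# Old packed key: two boundary classes plus state packed into a u10.
old_key_bits = 10

# New packed key: each grapheme_break class is one bit wider (+2 bits total)
# and the new BreakState is one bit wider (+1 bit), giving a u13.
new_key_bits = 13

old_entries = 2 ** old_key_bits  # 1024 -> ~1 KB at one byte per entry
new_entries = 2 ** new_key_bits  # 8192 -> ~8 KB at one byte per entry

print(old_entries, new_entries, new_entries // old_entries)  # 1024 8192 8
```

Three extra key bits means 2^3 = 8 times as many slots, hence 1 KB becoming 8 KB.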
Here's what AI said when I asked it to inspect the assembly: https://ampcode.com/threads/T-979b1743-19e7-47c9-8074-9778b4b2a61e, and here's what it said when I asked it to predict the faster version: https://ampcode.com/threads/T-3291dcd3-7a21-4d24-a192-7b3f6e18cd31

It looks like two loads got reordered, which put the load that depends on the stage1 -> stage2 -> stage3 chain second, "hiding memory latency". That makes the new version faster when looking up the `grapheme_break` property. These gains go away with the Japanese and Arabic benchmarks, which spend more time processing UTF-8 and may also contain more grapheme clusters.

### with data.txt (200 MB ghostty-gen random utf8)

<img width="1822" height="464" alt="CleanShot 2025-11-26 at 08 42 03@2x" src="https://github.com/user-attachments/assets/56d4ee98-21db-4eab-93ab-a0463a653883" />

### with English wiki dump

<img width="2012" height="506" alt="CleanShot 2025-11-26 at 08 43 15@2x" src="https://github.com/user-attachments/assets/230fbfb7-272d-4a2a-93e7-7268962a9814" />

### with Japanese wiki dump

<img width="2008" height="518" alt="CleanShot 2025-11-26 at 08 43 49@2x" src="https://github.com/user-attachments/assets/edb408c8-a604-4a8f-bd5b-80f19e3d65ee" />

### with Arabic wiki dump

<img width="2010" height="512" alt="CleanShot 2025-11-26 at 08 44 25@2x" src="https://github.com/user-attachments/assets/81a29ac8-0586-4e82-8276-d7fa90c31c90" />

TODO:

* [x] Take a closer look at the assembly and understand why this PR (8 KB vs 1 KB table) is faster on my machine.
* [x] _(**edit**: checking this off because it seems unnecessary)_ If this turns out to actually be unacceptably slower, one possibility is to switch to `uucode`'s `precomputedGraphemeBreak`, which uses a 1445-byte dense table (indexed using multiplication instead of `bitCast`, though, which did show up a small amount in the initial benchmarks from https://github.com/ghostty-org/ghostty/pull/8757).
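For context on the packed-vs-dense tradeoff in the last TODO item, here is a minimal Python sketch. The field widths and radices are illustrative, reverse-engineered from the numbers in this PR (a u13 packed key and a 1445-byte dense table, since 17 * 17 * 5 = 1445), not `uucode`'s actual layout:

```python
# Packed index: each field's width is rounded up to a whole number of bits
# (5 bits per class for 17 classes, 3 bits for 5 states), combined with
# shifts. Cheap to compute (bitCast-style), but 2**13 = 8192 slots, many
# of them unused holes.
def packed_index(class1: int, class2: int, state: int) -> int:
    return (class1 << 8) | (class2 << 3) | state

# Dense index: exact radices combined by multiplication. No holes, so only
# 17 * 17 * 5 = 1445 slots, but each lookup pays for the multiplies.
def dense_index(class1: int, class2: int, state: int,
                num_classes: int = 17, num_states: int = 5) -> int:
    return (class1 * num_classes + class2) * num_states + state

print(packed_index(3, 7, 2))  # 826, somewhere in an 8192-slot table
print(dense_index(3, 7, 2))   # 292, somewhere in a 1445-slot table
```

The dense form is what showed up as a small cost in the original benchmarks, which is why this PR sticks with the packed layout despite the larger table.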
AI was used in some of the uucode changes in https://github.com/ghostty-org/ghostty/pull/9678 (Amp, primarily for tests), but everything was carefully vetted and much of it was done by hand. This PR was made without AI, with the exception of consulting AI about whether the "Prepend + ASCII" scenario is common (hopefully it's right that it is uncommon).