A whole bunch of optimizations in hot paths in the IO processing areas of our code (well, one of them covers everything). I validated that each commit either improved one or more of our vtebench results, or improved the time it takes to process 2 years' worth (2.4GB) of data from asciinema.

## vtebench

<img width="1278" height="903" alt="image" src="https://github.com/user-attachments/assets/bad46777-4606-4870-b7d7-8df0c4bb3b39" />

(I decided to patch vtebench to report in nanoseconds instead of milliseconds, since it clearly was not designed for a machine as fast as mine. Nanoseconds give much more useful results when the numbers are this low.)

Do note the *slight* regression in the "unicode" test. This is probably because I added a branch hint in `Terminal.print` to optimize for printing narrow characters, since they make up the vast majority of characters typically printed in a terminal, whereas the vtebench "unicode" test is pretty much all wide characters. This shouldn't have a negative effect on users of CJK languages: it's a *very* slight reduction in speed, and they will still be printing many narrow characters, especially in TUIs; spaces, box drawing characters, symbols, punctuation, etc.

## asciinema processing

I wrote a program that uses libghostty to push 2 years' worth (2.4GB) of data from publicly uploaded asciinema recordings into the terminal as fast as possible. Since it's just libghostty, there's no renderer overhead: it's just the core terminal emulation, effectively everything the IO reader thread does if it never had to wait for the renderer.

On main, this took roughly 26.1–26.7 seconds to process; on this branch it takes just 18.4–18.6 seconds. That's a ~30% improvement in raw IO processing speed on real-world data!

## Summary of changes

In order of commits:

- Fixed a bug that I hit when trying to have Ghostty process all that asciinema data: in certain bad cases it was possible to accidentally insert the `0` hyperlink ID into a page, which would then cause a lockup in ReleaseFast mode when trying to clone that page, since the string alloc would try to iterate `1..0` to allocate 0 chunks.
- I noticed while profiling Ghostty that `std.debug.assert` was showing up in the profile, which it should not have been, since its doc comment promises that it will be optimized out in ReleaseFast. Evidently something is wrong with Zig, or that comment's promise is based on an expectation of LLVM that it fails to meet. Either way, replacing all uses of `assert` with a version that is explicitly marked `inline` (sketched in the appendix below) avoids that function call overhead in tight loops and hot paths. This change alone accounts for something like a third of the IO processing time improvement, though it had minimal impact on vtebench scores.
- Optimized the SGR parser somewhat by adding branch hints and removing the `.reset_underline` action, replacing it with `.{ .underline = .none }`.
- Gated a somewhat expensive assert in RefCountedSet behind a runtime safety check.
- Improved the performance of `Style.eql` and `Style.hash`, since these are hot functions, called extremely frequently because adding styles to the style set is a very common operation.
  Achieved this by making `eql` less generic (explicitly comparing each part of the style rather than looping over fields) and ordering the checks from most likely to differ to least likely, so that differences are found as early as possible; and by changing the hash from xxhash to simply folding the packed struct down to 64 bits and then using `std.hash.int` (sketched in the appendix below). Also manually inlined the code from `std.meta.activeTag` in `Packed.fromStyle`, since profiling showed it in the call stack and it's a single cast, so it really should not carry function call overhead.
- Explicitly marked some trivial functions as `inline`. The optimizer would (probably) already have been inlining them, but doing it explicitly gives it more time to spend on other things. Also added cold branch hints to "should be impossible" and error-returning paths that should be very rare, and unlikely branch hints to a lot of "invalid" paths, to optimize for receiving valid data.
- Removed a branch in the parser's CSI param action: just unconditionally multiply by 10 before adding the digit value, even for the first digit (sketched in the appendix below). This codepath is rarely hit since the stream code has a fast path for it, and that fast path already had this optimization, so I just copied it over.
- `CharsetState.charsets` used to be an `EnumArray`, but the layout/access logic for that was less than ideal and the access functions were not being inlined. These are very hot, since we access this for every single print, so I wrote a bespoke struct to hold that info instead (sketched in the appendix below) and gained a couple percent of IO perf with that.
- Added branch hints based on the data I derived from the asciinema dump (the branch-hint pattern is sketched in the appendix below), which gave a big boost to vtebench results, especially for the cursor movement and dense cells tests (which makes sense, since cursor movement and setting attributes both got `likely` hints :p). Data at https://github.com/qwerasd205/asciinema-stats
- This is probably the most invasive change in this PR: I removed the dirty bitset from `Page` and replaced it with a dirty flag on each row (sketched in the appendix below). For the majority of operations this is faster to write, since the row being dirtied is probably already loaded and will probably be written to by other changes as well. This gave a couple percent IO processing improvement. The only exception is scrolling-type operations, which are extremely efficient because they just move rows around with a single memmove, so looping through the rows to mark each one dirty slows them down; and indeed, after this change the scrolling benchmarks in vtebench regressed, *however*...
- Added a "full page dirty" flag on `Page`, which is set when an operation dirties most or all of the rows in the page, and which is used for scrolling-type operations. This *does* make the dirty tracking slightly less precise for those operations, but with the caching and such we do in the renderer, I don't think `rebuildCells` is a bottleneck, so rebuilding a few extra rows shouldn't hurt. After this change, all the scrolling benchmarks in vtebench improved drastically.
- Tiny micro-improvements to RefCountedSet: streamlined the control flow in `lookup`, and added an unlikely branch hint in `insert` for the branch that resurrects dead items, since dead items aren't that common.
- Improved SGR parser performance again by using `@call(.always_inline, ...)` to explicitly inline calls to `StaticBitSet.isSet` (for the separator list), since I noticed they weren't being inlined, causing function call overhead in a hot path (sketched in the appendix below).
- I noticed that `clearGrapheme` and `clearHyperlink` would check every cell in the row after they were done, in order to update the `grapheme`/`hyperlink` flag on the row if none were left. That isn't great, since `clearCells` called these functions for multiple cells in the same row back to back, which leads to a ton of excess work. I separated the flag-updating parts of these functions out and called them only if necessary, and only after all the cells were cleared (if the cells being cleared spanned the full row, the flag could unconditionally be set to false). This gave a nice improvement to IO processing, since `clearCells` is evidently a very hot function.
- Removed `inline` annotations on `Page.clearGrapheme` and `Page.clearHyperlink` in favor of inlining directly at the one callsite that benefited from it; this improved IO processing speed.
- Inlined the trivial function `Charset.table`.
- Inlined `size.getOffset` and `size.intFromBase`, as they are both trivial pointer math that often benefits from surrounding context.

---

If you'd like me to separate out the trivial improvements (branch hints, inline annotations, 1-line changes) from the functionality-changing ones (pretty much just the changes to dirty tracking), just let me know!
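## Appendix: sketches

A few minimal, self-contained sketches of the ideas referenced above, in roughly commit order. These are illustrative rather than the actual diffs; any name or layout called out as such is an assumption, not Ghostty's real code.

The inline assert is just `std.debug.assert` with an explicit `inline`, so the check is guaranteed to compile away (and cost no call) in ReleaseFast:

```zig
/// Same semantics as std.debug.assert, but explicitly `inline` so
/// no function call overhead survives in ReleaseFast hot paths.
pub inline fn assert(ok: bool) void {
    if (!ok) unreachable;
}
```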
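The branch-hint pattern used throughout these commits, via Zig's `@branchHint` builtin (which must be the first statement of the block it hints). The byte classification here is illustrative, not Ghostty's actual parser:

```zig
fn countByte(counts: *[2]u64, b: u8) void {
    if (b >= 0x20 and b != 0x7f) {
        // Printable bytes dominate real terminal traffic, so lay
        // this path out as the fall-through.
        @branchHint(.likely);
        counts[0] += 1;
    } else {
        counts[1] += 1;
    }
}
```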
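The folding hash for `Style`: fold the packed representation down to 64 bits, then mix with `std.hash.int`, instead of running xxhash over the struct's bytes. The field layout of `PackedStyle` below is made up; only the idea of a fixed-width packed struct comes from the real code:

```zig
const std = @import("std");

/// Illustrative stand-in for the packed style representation.
const PackedStyle = packed struct(u128) {
    fg: u64,
    bg: u48,
    flags: u16,
};

fn hashStyle(p: PackedStyle) u64 {
    // XOR-fold the 128 bits down to 64, then run an integer mixer.
    const bits: u128 = @bitCast(p);
    const folded = @as(u64, @truncate(bits)) ^ @as(u64, @truncate(bits >> 64));
    return std.hash.int(folded);
}
```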
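The branchless CSI parameter accumulation: multiplying an accumulator that is still zero by 10 is already a no-op, so the "is this the first digit?" branch was pure overhead. (Wrapping arithmetic is an assumption here; the real parser deals with overflow in its own way, and the caller is assumed to pass only `'0'...'9'`.)

```zig
fn accumulateDigit(param: u16, digit_byte: u8) u16 {
    // No first-digit special case: 0 *% 10 == 0.
    return param *% 10 +% (digit_byte - '0');
}
```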
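The shape of the bespoke charset table that replaced the `EnumArray`: a flat four-slot array with `inline` accessors, so the per-print lookup is a single indexed load. Enum members and names are illustrative:

```zig
const Charset = enum { utf8, ascii, british, dec_special };
const Slot = enum(u2) { G0, G1, G2, G3 };

const Charsets = struct {
    values: [4]Charset = .{ .utf8, .utf8, .utf8, .utf8 },

    pub inline fn get(self: Charsets, slot: Slot) Charset {
        return self.values[@intFromEnum(slot)];
    }

    pub inline fn set(self: *Charsets, slot: Slot, cs: Charset) void {
        self.values[@intFromEnum(slot)] = cs;
    }
};
```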
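The shape of the dirty-tracking rework: a flag on each row replaces the page-level bitset, and a page-level "everything is dirty" flag covers scroll-style operations that would otherwise need a per-row loop. Structure and names here are illustrative:

```zig
const std = @import("std");

const Row = struct {
    dirty: bool = false,
    // ... cells, grapheme/hyperlink flags, etc.
};

const Page = struct {
    rows: []Row,
    all_dirty: bool = false,

    inline fn markDirty(self: *Page, y: usize) void {
        // The row being mutated is almost always already in cache,
        // so this store is effectively free.
        self.rows[y].dirty = true;
    }

    fn scrollUp(self: *Page) void {
        // Scrolling just shuffles rows around; one page-level flag
        // beats touching every row's dirty bit.
        std.mem.rotate(Row, self.rows, 1);
        self.all_dirty = true;
    }

    fn isDirty(self: *const Page, y: usize) bool {
        return self.all_dirty or self.rows[y].dirty;
    }
};
```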
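Forcing inlining at a specific call site with `@call(.always_inline, ...)` when the optimizer declines to do it on its own. The bit set size is illustrative:

```zig
const std = @import("std");

const SepList = std.StaticBitSet(32);

fn isColonSep(seps: SepList, idx: usize) bool {
    // Guaranteed-inlined call to StaticBitSet.isSet; a plain
    // `seps.isSet(idx)` was showing up as an actual call in the
    // SGR hot path.
    return @call(.always_inline, SepList.isSet, .{ seps, idx });
}
```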