This makes `cursorStyle` utilize `RenderState` to determine the
appropriate cursor style. This moves the cursor style logic outside the
critical area, although it was cheap to begin with.
This always removes `viewport_is_bottom` which had no practical use.
A whole bunch of optimizations in hot paths in the IO processing areas
of our code (well, one of them covers everything). I validated that each
commit either improved one or more of our vtebench results, or improved
the time it takes to process 2 years worth (2.4GB) of data from
asciinema.
## vtebench
<img width="1278" height="903" alt="image"
src="https://github.com/user-attachments/assets/bad46777-4606-4870-b7d7-8df0c4bb3b39"
/>
(I decided to patch vtebench to report in nanoseconds instead of
milliseconds since clearly it was not designed for a machine as fast as
mine. Nanoseconds gives much more useful results when the numbers are
this low.)
Do note the *slight* regression in the "unicode" test, this is probably
because I added a branch hint in `Terminal.print` in order to optimize
for printing narrow characters, since they make up the vast majority of
characters typically printed in the terminal, but the vtebench "unicode"
test is pretty much all wide characters.
This shouldn't have a negative effect on users of CJK languages since
it's a *very* slight reduction in speed and they will still be printing
many narrow characters, especially in TUIs; spaces, box drawing
characters, symbols, punctuation, etc.
## asciinema processing
I wrote a program that uses libghostty to push 2 years worth (2.4GB) of
data from publicly uploaded asciinema recordings in to the terminal as
fast as possible- since it's just libghostty, there's no renderer
overhead happening, it's just the core terminal emulation, effectively
everything that io-reader thread does if it didn't have wait for the
renderer ever.
On main, this took roughly 26.1–26.7 seconds to process, on this branch
it takes just 18.4–18.6 seconds, that's a ~30% improvement in raw IO
processing speed when processing real world data!
## Summary of changes
In order of commits:
- Fixed a bug that I hit when trying to have Ghostty process all that
asciinema data, in certain bad cases it was possible to accidentally
insert the `0` hyperlink ID in to a page, which would then cause a
lockup in ReleaseFast mode when trying to clone that page since the
string alloc would try to iterate `1..0` to allocate 0 chunks.
- I noticed in profiling Ghostty that `std.debug.assert` was showing up
in the profile, which it should not have been since its doc comment
promises that it will be optimized out in ReleaseFast- but evidently
something is wrong with Zig, or that comment's promise is based on an
expectation from LLVM that it fails to meet - but either way, by
replacing all uses of `assert` with a version that is explicitly marked
`inline`, that function call overhead in tight loops and hotpaths is
avoided. This change alone accounts for like a third of the IO
processing time improvement, though it had minimal impact on vtebench
scores.
- I optimized the SGR parser somewhat by adding branch hints and
removing the `.reset_underline` action, replacing it with `.{ .underline
= .none }`.
- Gated a somewhat expensive assert in RefCountedSet behind a runtime
safety check.
- Improved the performance of `Style.eql` and `Style.hash` since these
are hot functions, called extremely frequently since adding styles to
the style set is a very common operation. Achieved this by making `eql`
less generic - explicitly comparing each part of the style rather than
looping over fields - and ordering checks from most likely to differ to
least likely to differ so that differences can be found as soon as
possible; and changed the hash from xxhash to simply folding the packed
struct down to 64 bits and then using `std.hash.int`. Also manually
inlined the code from `std.meta.activeTag` in `Packed.fromStyle`, since
profiling showed it in the callstack and it's a single cast so it really
should not have the function call overhead.
- Explicitly marked some trivial functions as inline, the optimizer
would already have been doing this (probably) but doing it explicitly
gives the optimizer more time to spend on other things. Added cold
branch hints to "should be impossible" and error-returning paths that
should be very rare, and unlikely branch hints to a lot of "invalid"
paths- to optimize for receiving valid data.
- Removed a branch in the parser csi param action, just unconditionally
multiply by 10 before adding digit value, even if it's the first digit.
This codepath is rarely hit since we have a fast path for this in the
stream code, but the stream code already has this optimization so I just
copied it over.
- `CharsetState.charsets` used to be an `EnumArray`, but the
layout/access logic for that was less-than-ideal, and the access
functions were not inlining-- and these are very hot since we access
this for every single print, so I wrote a bespoke struct to hold that
info instead, gained a couple percent of IO perf with that.
- Added branch hints based on the data I derived from the asciinema
dump, which gave big boost to vtebench results, especially for the
cursor movement and dense cells tests (which makes sense, since cursor
movement and setting attributes both got `likely` hints :p) -- data at
https://github.com/qwerasd205/asciinema-stats
- This is probably the most invasive change in this PR: I removed the
dirty bitset from `Page` and replaced it with a dirty flag on each row,
for the majority of operations this is faster to write, since the row
being dirtied is probably already loaded and probably will be written to
for other changes as well. This gave a couple percent IO processing
improvement. The only exception is scrolling-type operations, which are
extremely efficient by just moving rows around with a single memmov, so
looping through the rows to mark each dirty slows them down, and indeed
after this change the scrolling benchmarks in vtebench regressed,
*however*...
- Added a "full page dirty" flag on `Page`, which is set when an
operation is performed that dirties most or all the rows in the page,
which is used for scrolling-type operations. This *does* make the dirty
tracking slightly less precise for these operations, but with the
caching and stuff we do in the renderer, I don't think `rebuildCells` is
a bottleneck, so rebuilding a few extra rows shouldn't hurt. After this
change, all the scrolling benchmarks in vtebench improved drastically.
- Tiny micro-improvements to RefCountedSet; streamlined the control flow
in `lookup`, added an unlikely branch hint in `insert` for the branch
that resurrects dead items since dead items aren't that common.
- Improve SGR parser performance again by using `@call(.always_inline`
to explicitly inline calls to `StaticBitSet.isSet` (for the separator
list), since I noticed they weren't being inlined, causing function call
overhead in a hotpath.
- I noticed that `clearGrapheme` and `clearHyperlink` would check every
cell in the row after they were done in order to update the
`grapheme`/`hyperlink` flag on the row if there were none left, which
isn't great since `clearCells` called these functions for multiple cells
in the same row back-to-back, which leads to a ton of excess work. I
separated the flag updating parts of these functions out and called them
only if necessary (if the cells being cleared were the full row then the
flag could unconditionally be set to false) and only after all the cells
were cleared. This gave a nice improvement to IO processing since
clearCells is evidently a very hot function.
- Removed inline annotations on `Page.clearGrapheme` and
`Page.clearHyperlink` in favor of inlining directly at the one callsite
that benefited from inlining, this improved IO processing speed.
- Inlined trivial function `Charset.table`.
- Inlined `size.getOffset` and `size.intFromBase` as they are both
trivial pointer math that often benefits from surrounding context.
---
If you'd like me to separate out the trivial improvements (branch hints,
inline annotations, 1-line changes) from the functionality-changing ones
(pretty much just the changes to dirty tracking), just let me know!
This fixes the source of a deadlock that some tip users have hit. If our
surface mailbox is full and there is a dirty scrollbar state, then
drawFrame would block forever trying to queue to the surface mailbox.
We now fail instantly if the queue is full and keep the scrollbar state
dirty. We can try again on the next frame, it's not a critical thing to
get updated.
Fixes#2731 (again)
This regressed in 1.2 due to the renderer rework missing porting this. I
believe this issue is still valid even with the rework since the font
grid changes the atlas and if there are still cached cells that
reference the old atlas coordinates it will produce garbage.
Related to #111
This adds the necessary logic and data for the `PageList` data structure
to keep track of **total length** of the screen, **offset** into the
viewport, and **length** of the viewport. These three values are
necessary to _render_ a scrollbar. This PR updates the renderer to grab
this information but stops short of actually drawing a scrollbar (which
we'll do with native UI), in the interest of having a PR that doesn't
contain too many changes.
**This doesn't yet draw a scrollbar, these are just the internal changes
necessary to support it.**
## Background
The `PageList` structure is very core to how we represent terminal
state. It maintains a doubly linked list of "pages" (not literally
virtual memory pages, but close). Each page stores cell information,
styles, hyperlinks, etc fully self-contained in a contiguous sets of VM
pages using offset addresses rather than full pointers. **Pages are not
guaranteed to be equal sizes.** (This is where scrollbars get difficult)
Because it is a linked list structure of non-equal sized nodes, it isn't
amenable to typical scrollbar behavior. A scrollbar needs to know: full
size, offset, and length in order to draw the scrollbar properly.
Getting these values naively is `O(N)` within the data structure that is
on the hottest IO performance path in all of Ghostty.
## Implementation
### PageList
We now maintain two cached values for **total length** and **viewport
offset**.
The total length is relatively straightforward, we just have to be
careful to update it in every operation that could add or remove rows.
I've done this and ensured that every place we update it is covered with
unit test coverage.
The viewport offset is nasty, but I came up with what I believe is a
good solution. The viewport when arbitrarily scrolled is defined as a
direct pointer to the linked list node plus a row offset into that node.
The only way to calculate offset from the top is `O(N)`.
But we have a couple shortcuts:
1. If the viewport is at the bottom (most common) or top, calculating
the offset is `O(1)`: bottom is `total_rows - active_rows`, both readily
available. And top is `0` by definition.
2. Operations on the PageList typically add or remove rows. We don't do
arbitrary linked list surgery. If we instrument those areas with delta
updates to our cache, we can avoid the `O(N)` cost for most operations,
including scrolling a scrollbar. The only expensive operation is a full,
arbitrary jump (new node pointer).
Point 1 was quick to implement, so I focused all the complexity on point
2. Whenever we have an operation that adds or removes rows (for example
pruning the scroll back, adding more, erase rows within the active area,
etc.) then I do the math to calculate the delta change required for the
offset if we've already calculated it, and apply that directly.
### Renderer
The other issue was how to notify the apprts of scrollbar state. Sending
messages on any terminal change within the IO thread is a non-option
because (1) sending messages is slow (2) the terminal changes a lot and
(3) any slowness in the IO thread slows down overall terminal
throughput.
The solution was to **trigger scrollbar notifications with the renderer
vsync**. We read the scrollbar information when we render a frame,
compare it to renderer previous state, and if the scrollbar changed,
send a message to the apprt _after the frame is GPU-renderer_.
The renderer spends _most_ of its time sleeping compared to the IO
thread, and has more opportunities for optimizing its awake time.
Additionally, there's no reason to update the scrollbar information if
the renderer hasn't rendered the new frames because the user can't even
see the stuff the scrollbar wants to scroll to. We're talking about
millisecond scale stuff here at worst but it adds up.
## Performance
No noticeable performance impact for the additional metrics:
<img width="1012" height="738" alt="image"
src="https://github.com/user-attachments/assets/4ed0a3e8-6d76-40c1-b249-e34041c2f6fd"
/>
## AI Usage
I used Amp to help audit the codebase and write tests. I wrote all the
main implementation code manually. I came up with the main design
myself. Relevant threads:
-
https://ampcode.com/threads/T-95fff686-75bb-4553-a2fb-e41fe4cd4b77#message-0-block-0
-
https://ampcode.com/threads/T-48e9a288-b280-4eec-83b7-ca73d029b4ef#message-91-block-0
## Future
This is just the internal changes necessary to _draw_ a scrollbar. There
will be other changes we'll need to add to handle grabbing and actually
jumping the scrollbar. I have a good idea of how to implement those
performantly as well.
Changes it so that the renderer retains its own MemoryPool for PageList
pages so that new pages rarely need to be allocated when cloning the
screen. Also switches to using an arena allocator in `updateFrame` to
avoid having to deinit the cloned screen since instead we can just throw
out the memory.
The GLSL to MSL conversion process uses a passed-in sampler state for
the `iChannel0` parameter and we weren't providing it. This magically
worked on Apple Silicon for unknown reasons but failed on Intel GPUs.
In normal, hand-written MSL, we'd explicitly create the sampler state as
a normal variable (we do this in `shaders.metal` already!), but the
Shadertoy conversion stuff doesn't do this, probably because the exact
sampler parameters can't be safely known.
This fixes a Metal validation error when using custom shaders:
```
-[MTLDebugRenderCommandEncoder validateCommonDrawErrors:]:5970: failed
assertion `Draw Errors Validation Fragment Function(main0): missing Sampler
binding at index 0 for iChannel0Smplr[0].
```
When processing kitty images in a loop in a few places we were returning
under certain conditions where we should instead have just continued the
loop. This caused serious problems for kitty images, especially for apps
that used multiple images on screen at once.
... I have no clue how I originally wrote this code and didn't see such
a trivial mistake, I think I was sleep deprived or something.
This pull request adds the `--faint-opacity` option, as discussed in
#7637.
The default value of the option is also changed from `0.68` to `0.5` for
greater consistency with other popular terminal emulators.
This math was incorrect from the start, the previous fix helped OpenGL
but broke positioning under Metal; this commit fixes the math to be
correct under both backends and adds comments explaining exactly what's
going on.
Fix for discussion #8113
The cursor Y position value exposed to the shader uniforms was
incorrectly calculated. As per the doc in cell_text.v.glsl: In order to
get the top left of the glyph, we compute an offset based on the
bearings. The Y bearing is the distance from the bottom of the cell to
the top of the glyph, so we subtract it from the cell height to get the
y offset.
This calculation was mistakenly left out of the original code.
This will ensure that the custom shaders using
iCurrentCursor/iPreviousCursor get the correct Y coordinate representing
the top-left corner of the cursor rectangle, matching the documented
uniform behavior
This was a memory leak under Metal, leaked 1 swapchain worth of targets
every time a surface was closed.
Under OpenGL I think it was all cleaned up when the GL context was
destroyed.
Fixes#7893
Previously, custom shader cursor uniforms were only updated when the
cursor glyph was in the front (block) cursor list. This caused non-block
cursors (such as bar, underline, hollow block, and lock) to be missing
from custom shader effects.
This commit adds a helper to the cell contents struct to retrieve the
current cursor glyph from either the front or back cursor lists, and
updates the renderer to use this helper when setting custom shader
uniforms. As a result, custom shaders now receive correct cursor
information for all supported cursor styles.
This is a big'un.
- **Glyph constraint logic is now done fully on the CPU** at the
rasterization stage, so it only needs to be done once per glyph instead
of every frame. This also lets us eliminate padding between glyphs on
the atlas because we're doing nearest-neighbor sampling instead of
interpolating-- which ever so slightly increases our packing efficiency.
- **Special constraints for nerd font glyphs** are applied based roughly
on the constraints they use in their patcher. It's a simplification of
what they do, the largest difference being that they scale groups of
glyphs based on a shared bounding box so that they maintain relative
size to one another, but that would require loading all glyphs on the
group and I'd want to do that on font load TBH and at that point I'd
basically be re-implementing the nerd fonts patcher in Zig to patch
fonts at load time which is way beyond the scope I want to have. (Fixes
#7069)
- These constraints allow for **perfectly sized and centered emojis**,
this is very nice.
- **Changed the default embedded fonts** from 4 copies (regular, italic,
bold, bold italic) of a patched (and outdated) JetBrains Mono to a
single JetBrains Mono variable font and a single Nerd Fonts Symbols Only
font. This cuts the weight of those down from 9MB to 3MB!
- **FreeType's `renderGlyph` is significantly reworked**, and the new
code is, IMO, much cleaner- although there are probably some edge case
behavior differences I've introduced.
> [!NOTE]
> One breaking change I definitely introduced is changing the
`monochrome` freetype load flag config from its previous completely
backwards meaning to instead the correct one (I also changed the
default, so this won't affect any user who hasn't touched it, but users
who set the `monochrome` flag will find their fonts quite crispy after
this change because they will have no anti-aliasing anymore)
### Future work
Following this change I want to get to work on automatic font size
matching (a la CSS
[`font-size-adjust`](https://developer.mozilla.org/en-US/docs/Web/CSS/font-size-adjust)).
I set the stage for that quite some time ago so it shouldn't be too much
work and it will be a big benefit for users who regularly use multiple
writing systems and so have multiple fonts for them that aren't
necessarily size-compatible.
This PR does two things.
1. Build system improvements to make developing on macOS more enjoyable
2. Delete the GLFW apprt
## Build System Improvements (macOS)
On macOS, there are a few major improvements:
* `zig build` now produces a full macOS app bundle and copies it into
`zig-out`
* `zig build run` now builds the macOS app and runs it, streaming logs
directly into the terminal
* `-Demit-macos-app` can control whether app bundle is created
* `-Dxcframework-target` can be set to one of `native` or `universal` to
control whether the xcframework uses only your target machines arch or
creates a universal one with macOS and iOS. This defaults to `native`
for the `run` command and `universal` for all others.
* The general flow of the `build.zig` file was improved to be more
declarative
## Nuke GLFW
> This deletes the GLFW apprt from the Ghostty codebase.
>
> The GLFW apprt was the original apprt used by Ghostty (well, before
> Ghostty even had the concept of an "apprt" -- it was all just a single
> application then). It let me iterate on the core terminal features,
> rendering, etc. without bothering about the UI. It was a good way to
get
> started. But it has long since outlived its usefulness.
>
> We've had a stable GTK apprt for Linux (and Windows via WSL) and a
> native macOS app via libghostty for awhile now. The GLFW apprt only
> remained within the tree for a few reasons:
>
> 1. Primarily, it provided a faster feedback loop on macOS because
> building the macOS app historically required us to hop out of the
> zig build system and into Xcode, which is slow and cumbersome.
>
> 2. It was a convenient way to narrow whether a bug was in the
> core Ghostty codebase or in the apprt itself. If a bug was in both
> the glfw and macOS app then it was likely in the core.
>
> 3. It provided us a way on macOS to test OpenGL.
>
> All of these reasons are no longer valid. Respectively:
>
> 1. Our Zig build scripts now execute the `xcodebuild` CLI directly and
> can open the resulting app, stream logs, etc. This is the same
> experience we have on Linux. (Xcode has always been a dependency of
> building on macOS in general, so this is not cumbersome.)
>
> 2. We have a healthy group of maintainers, many of which have access
> to both macOS and Linux, so we can quickly narrow down bugs
> regardless of the apprt.
>
> 3. Our OpenGL renderer hasn't been compatible with macOS for some time
> now, so this is no longer a useful feature.
>
> At this point, the GLFW apprt is just a burden. It adds complexity
> across the board, and some people try to run Ghostty with it in the
real
> world and get confused when it doesn't work (it's always been lacking
in
> features and buggy compared to the other apprts).
>
> So, it's time to say goodbye. Its bittersweet because it is a big part
> of Ghostty's history, but we've grown up now and it's time to move on.
> Thank you, goodbye.
>
> (NOTE: If you are a user of the GLFW apprt, then please fork the
project
> prior to this commit or start a new project based on it. We've warned
> against using it for a very, very long time now.)
Now that it's done at the rasterization stage, we don't need to handle
it on the GPU. This also means that we can switch to nearest neighbor
interpolation in the Metal shader since we're guaranteed to be pixel
perfect. Accidentally, we were already nearest neighbor in the OpenGL
shaders because I used the Rectangle texture mode in the big renderer
rework, which doesn't support interpolation- anyway, that's no longer
problematic since we won't be scaling glyphs on the GPU anymore.
This deletes the GLFW apprt from the Ghostty codebase.
The GLFW apprt was the original apprt used by Ghostty (well, before
Ghostty even had the concept of an "apprt" -- it was all just a single
application then). It let me iterate on the core terminal features,
rendering, etc. without bothering about the UI. It was a good way to get
started. But it has long since outlived its usefulness.
We've had a stable GTK apprt for Linux (and Windows via WSL) and a
native macOS app via libghostty for awhile now. The GLFW apprt only
remained within the tree for a few reasons:
1. Primarily, it provided a faster feedback loop on macOS because
building the macOS app historically required us to hop out of the
zig build system and into Xcode, which is slow and cumbersome.
2. It was a convenient way to narrow whether a bug was in the
core Ghostty codebase or in the apprt itself. If a bug was in both
the glfw and macOS app then it was likely in the core.
3. It provided us a way on macOS to test OpenGL.
All of these reasons are no longer valid. Respectively:
1. Our Zig build scripts now execute the `xcodebuild` CLI directly and
can open the resulting app, stream logs, etc. This is the same
experience we have on Linux. (Xcode has always been a dependency of
building on macOS in general, so this is not cumbersome.)
2. We have a healthy group of maintainers, many of which have access
to both macOS and Linux, so we can quickly narrow down bugs
regardless of the apprt.
3. Our OpenGL renderer hasn't been compatible with macOS for some time
now, so this is no longer a useful feature.
At this point, the GLFW apprt is just a burden. It adds complexity
across the board, and some people try to run Ghostty with it in the real
world and get confused when it doesn't work (it's always been lacking in
features and buggy compared to the other apprts).
So, it's time to say goodbye. Its bittersweet because it is a big part
of Ghostty's history, but we've grown up now and it's time to move on.
Thank you, goodbye.
(NOTE: If you are a user of the GLFW apprt, then please fork the project
prior to this commit or start a new project based on it. We've warned
against using it for a very, very long time now.)