This makes `cursorStyle` utilize `RenderState` to determine the
appropriate cursor style. This moves the cursor style logic outside the
critical area, although it was cheap to begin with.
This always removes `viewport_is_bottom` which had no practical use.
This makes `cursorStyle` utilize `RenderState` to determine the
appropriate cursor style. This moves the cursor style logic outside the
critical area, although it was cheap to begin with.
This always removes `viewport_is_bottom` which had no practical use.
This adds a new API called `RenderState` that produces the proper state
required for a renderer to draw a viewport and updates our renderer to
use it instead of the prior screen clone method.
The newsworthy change is here is that we've shortened the critical area
where the renderer holds the terminal lock and blocks IO by anywhere
from **2x to 5x faster**, and in about half the frames we're now in the
critical area for **zero microseconds** because `RenderState` has much
better dirty/damage tracking.
**For libghostty Zig users**, this API is available to the Zig module
and can be used to create your own renderers (to any artifact, it
doesn't have to be graphical! It is even useful for text like a tmux
clone or something).
## Differences vs Old Method
The renderer previously called `Screen.clone`. This produces a copy of
all the screen data (even stuff the renderer may not need) and attempts
to create a standalone, fully functional `Screen` allocation. I didn't
microbenchmark this to understand exactly where the slowdown was, but I
think it was based primarily in two places:
- Screens have a minimum allocation of two Ghostty "pages" which are
quite large. I think this allocation was the primary expensive part.
Without refactoring our entire Screen/PageList work, this was
unavoidable.
- The clone was unconditional and would produce a fully new Screen on
every frame.
The new structure `RenderState` is stateful and each frame calls
`update` on the prior value and it modifies itself in place. This
addresses the above two points in a few ways:
- Allocation is minimized to only what we need, no full pages.
- Since it is stateful (updating in-place), we only change dirty data
and don't need to copy everything. **Note the benchmarks below only test
_full updates_, so you aren't even seeing the benefits of this!**
- Also since it is stateful, we cache expensive calculations (such as
for selections) so future screen updates can reuse those cached values
rather than recompute them.
- We retain memory from prior updates even when the screen is dirty, so
even in dirty states, it's unlikely we allocate.
More details in the "Design" section.
## Benchmarks
Here are some microbenchmarks on the previous render critical area
versus the new one.
For each of the benchmarks below, ignore the time units (milliseconds)
and instead **focus on the relative speedup.** The benchmarks are all
doing a full render frame setup around 1000 times, because the actual
cost of a frame update is in dozens or hundreds of microseconds.
> [!NOTE]
>
> I'm still working on some more benchmarks before merging. I'll update
this space.
### Full screen, single style
<img width="1542" height="480" alt="CleanShot 2025-11-21 at 07 35 13@2x"
src="https://github.com/user-attachments/assets/c28c9f33-d1aa-4723-8a8e-3c6d70fe3667"
/>
### Full screen, plaintext
<img width="1642" height="612" alt="CleanShot 2025-11-21 at 07 36 06@2x"
src="https://github.com/user-attachments/assets/b51f57cf-7c48-46c8-a347-8ecc0bdd3d47"
/>
### Full screen, different style per cell (pathological case)
<img width="1704" height="456" alt="CleanShot 2025-11-21 at 07 37 55@2x"
src="https://github.com/user-attachments/assets/71a98250-d8d1-47ab-ae69-5e6b3b60bf2d"
/>
### Critical Area: Typing at Shell Prompt
| This PR | Main Branch |
---------|--------
| <img width="1064" height="764" alt="CleanShot 2025-11-21 at 14 44
31@2x"
src="https://github.com/user-attachments/assets/8a0ab3a1-3d68-41f0-9469-bc08a4569286"
/> | <img width="1040" height="684" alt="CleanShot 2025-11-21 at 14 47
10@2x"
src="https://github.com/user-attachments/assets/04ffa607-8841-436b-b6e9-eeeb6ee9482d"
/> |
### Critical Area: Neovim Scrolling
| This PR | Main Branch |
---------|--------
| <img width="1054" height="748" alt="CleanShot 2025-11-21 at 14 45
06@2x"
src="https://github.com/user-attachments/assets/ccafaee8-720f-41be-820d-fd705835607a"
/> | <img width="1068" height="796" alt="CleanShot 2025-11-21 at 14 47
48@2x"
src="https://github.com/user-attachments/assets/68087496-d371-4c7c-8b4c-b967dbaeaa7c"
/> |
### Critical Area: `btop -u 100`
This is closer to a pathological case, about as close as you get with a
real tool in the wild. `btop` uses hundreds of unique styles and updates
many cells across many rows very frequently (every 100ms in this case).
You can see that some of our frame times in this case are similar but
there are _so many more near-idle frames_ thanks to our dirty tracking.
| This PR | Main Branch |
---------|--------
| <img width="1088" height="900" alt="CleanShot 2025-11-21 at 14 45
44@2x"
src="https://github.com/user-attachments/assets/ea63f0eb-f06e-4d00-95a3-c55a3755cc67"
/> | <img width="1078" height="906" alt="CleanShot 2025-11-21 at 14 48
18@2x"
src="https://github.com/user-attachments/assets/cef360de-2b12-440f-8c4c-6a69b5ce4058"
/> |
### "DOOM Fire"
Fullscreen on my macOS when from 770 FPS to ~808 FPS consistently, a
solid 5% increase repeatedly.
<img width="3520" height="2392" alt="CleanShot 2025-11-21 at 07 45
29@2x"
src="https://github.com/user-attachments/assets/033effca-0abb-4ff8-a21b-83214d118d12"
/>
### IO
While this was rendering focused, the smaller critical area does help IO
performance a bit.
We've already tracked down the remaining issues to the IO thread going
to sleep and overhead with context switching. We're investigating
switching to a spin lock for the IO thread only in another track of
work.
> [!NOTE]
>
> **This is comparing with `main`, which already has a 20-30%
performance improvement over v1.2.3.**
<img width="982" height="698" alt="image"
src="https://github.com/user-attachments/assets/52a86f6c-6f09-45fe-9ac7-ca62c7ac6ee4"
/>
## Design
The design of the API is a _stateful_ `RenderState` struct that you call
`update` on each frame with the target terminal, and it only updates
what is needed. RenderState keeps track of rows, cells, hyperlinks,
selections, etc.
```zig
// Start empty
var state: terminal.RenderState = .empty;
defer state.deinit(alloc);
// Each frame update it with a terminal.
// THIS IS THE ONLY PART THAT ACCESS `t`
try state.update(alloc, &t);
// Access render data (can be outside any locking for `t`)
...
```
The ergonomics of the `RenderState` structure a wee bit clunky because
we make use of struct-of-arrays (SoA, Zig's MultiArrayList) to better
optimize cache locality for renderer data vs. update data (what we need
to update the render state is different from what we need to draw).
Once you get used to the API though, it's pretty beautiful. I mean, look
at this:
```zig
for (
0..,
row_data.items(.raw),
row_data.items(.cells),
) |y, row, cells| {
const cells_slice = cells.slice();
for (
0..,
cells_slice.items(.raw),
cells_slice.items(.grapheme),
) |x, cell, graphemes| {
```
## Improvements
This PR makes various improvements across the board:
- It bears repeating in case it was missed previously that the critical
area time of a render has gone down 2x to 5x when there is work and is
now free when there is no work (the previous implementation always did
work).
- Font shaping is much more efficient now and only requires access to a
render state.
- Selection handling is now cached and works with dirty tracking.
Previously, if you had an active selection, we'd search the entire
screen multiple times (like... once per row). Yikes.
- Hyperlink handling is _much_ more efficient. Instead of iterating
through the entire screen contents _per configured link_ we now cache
the screen contents as a string and search one whole string multiple
times. Obvious, but we didn't do this before.
- The `contrainedWidth` and `rowNeverExtendBg` helper methods are now
both much more efficient and live within the renderer package rather
than being awkwardly in the terminal package.
## Future Notes
- Our `terminal.Selection` API is very bad. It conceptually makes sense
and I understand why I designed it this way (easy) but it makes it hard
to render or manipulate performantly.
**AI Disclosure:** AI was used only to assist with writing some tests
and converting some tests. The primary logic is all organic, meatbag
produced.