Problem:
`get_node_text()` returned inconsistent results between buffer and
string sources when a node's range ends at `end_col == 0` (i.e. the node
ends with a newline). The buffer path dropped the trailing newline; the
string path included it correctly.
Solution:
Append `'\n'` in `buf_range_get_text()` when `end_col == 0` and
`start_row ~= end_row`. The `start_row ~= end_row` guard excludes
zero-width nodes at column 0, which should return `""`.
Remove the workaround in the `#trim!` directive that manually
compensated for the missing newline.
Strip whitespace in `resolve_lang()` so injection language nodes ending
at `end_col == 0` (e.g. `">lua\n"`) still resolve correctly.
After an edit, LanguageTree:_edit() updates the current trees. When the
LanguageTree manages explicit regions, _edit() also refreshes _regions
from tree:included_ranges(true), so those regions have the edited byte
offsets.
A later injection pass may call set_included_regions() with a different
number of child regions. That path discards the old trees and emits
changedtree callbacks for them. invalidate(true) does the same when a
buffer is reloaded. Before this change, both discard paths called
tree:included_ranges(true) for every old tree, even if _edit() had just
collected those exact ranges.
That duplicate range extraction is expensive with many injection trees.
Realistic shapes include generated C files with many macro bodies parsed
by the C preproc_arg injection, Markdown documents with many fenced blocks
of the same language, and template files with many embedded-language
islands. The stock highlighter registers recursive changedtree callbacks,
so this is on the normal highlighting edit path.
Track whether _regions currently came from tree:included_ranges(true)
with _regions_from_tree_ranges. _do_changedtree_callbacks() reuses
_regions only in that state; otherwise it falls back to calling
tree:included_ranges(true). Clear the marker when regions are replaced by
injection ranges, when a tree is reparsed, or when trees are discarded.
This avoids keeping a second copy of the ranges while preserving callback
precision: changedtree still receives tree:included_ranges(true) for the
old tree, not the broader managed region.
Benchmark on 100k C macro injections, one-line edit, recursive
changedtree callback:
- HEAD median: edited parse 84.2 ms, child region replacement 58.7 ms
- This change: edited parse 34.6 ms, child region replacement 8.5 ms
That is about 2.4x faster for the edit parse and 6.9x faster for child
region replacement in this workload.
Add a regression test that replacing injection regions still fires
changedtree and still reports the old tree's exact included ranges.
AI-assisted: Codex
Hyphenated language names are silently dropped when used as injections
(see #38132).
This combines the normalization of language aliases into `resolve_lang`,
and also adds the normalization of hyphens to underscores, which allows
for handling of injected language tags with hyphens in their names.
Fixes#38132.
**Problem:** Whenever `LanguageTree:parse()` is called, injection trees
from previously parsed ranges are dropped.
**Solution:** Allow the function to accept a list of ranges, so it can
return injection trees for all the given ranges.
Co-authored-by: Jaehwang Jung <tomtomjhj@gmail.com>
This commit changes `languagetree.lua` so that it creates a scratch
buffer under the hood when dealing with string parsers. This will make
it much easier to just use extmarks whenever we need to track injection
trees in `languagetree.lua`. This also allows us to remove the
`treesitter.c` code for parsing a string directly.
Note that the string parser's scratch buffer has `set noeol nofixeol` so
that the parsed source exactly matches the passed in string.
Problem:
The previous fix in #34314 relies on copying the tree in `tree_root` to
ensure the `TSNode`'s tree cannot be mutated. But that causes the
problem where two calls to `tree_root` return nodes from different
copies of a tree, which do not compare as equal. This has broken at
least one plugin.
Solution:
Make all `TSTree`s on the Lua side always immutable, avoiding the need
to copy the tree in `tree_root`, and make the only mutation point,
`tree_edit`, copy the tree instead.
Before, only the last capture's range would be counted for injection.
Now all captured ranges will be counted in the ranges array. This is
more intuitive, and also provides a nice solution/alternative to the
"scoped injections" issue.
**Problem:** `LanguageTree:contains()` considers any range within the
start of the first tree and end of the last tree as "within" the
language tree. In the case of combined injections, this is problematic
because we only want to consider ranges within any of the combined trees
as "contained" (as opposed to any range within the entire range spanned
by all combined trees).
**Solution:** Use a more discriminative check in
`LanguageTree:contains()`.
Problem:
treesitter injected language ranges sometimes cross over the capture
boundaries when `@combined`.
Solution:
Clip child regions to not spill out of parent regions within
languagetree.lua, and only apply highlights within those regions in
highlighter.lua.
Co-authored-by: Cormac Relf <web@cormacrelf.net>
Remove the `set_timeout` functions for `TSParser` and instead add a timeout
parameter to the regular parse function. Remove these deprecated tree-sitter
API functions and replace them with the preferred `TSParseOptions` style.
Simplify the logic for retrieving the injection ranges for the language
tree. The trees are now also sorted by starting position, regardless of
whether they are part of a combined injection or not. This would be
helpful if ranges are ever to be stored in an interval tree or other
kind of sorted tree structure.
Lua coroutines can yield across non-coroutine function boundaries,
meaning that we don't need to wrap each helper function in a coroutine
and resume it within `_parse()`. If we just have them yield when
appropriate, this will be caught by the top level `_parse()` coroutine,
and resuming the `_parse()` will resume from the position in the helper
function where we yielded last.
**Problem:** Currently, parsing is asynchronous, but it involves a
(sometimes lengthy) step which finds all injection ranges for a tree by
iterating over that language's injection queries. This causes edits in
large files to be extremely slow, and also causes a long stutter during
the initial parse of a large file.
**Solution:** Break up the injection query iteration over multiple event
loop iterations.
We need to add a separate variable to keep track of this information,
since we cannot read the length of the valid regions table itself, since
it has holes.
When given, only that range will be checked for validity rather than the
entire tree. This is used in the highlighter to save CPU cycles since we
only need to parse a certain region at a time anyway.
This means that all work previously done by a `_parse()` iteration will
be kept in future iterations. This prevents it from running indefinitely
in some cases where the file is very large and there are 2+ injections.
Problem:
When running an initial parse, parse() returns an empty table rather
than an actual range. In `languagetree.lua`, we manually check if
a parse was incremental to determine the changed parse region.
Solution:
- Always return a range (in the C side) from parse().
- Simplify the language tree code a bit.
- Logger no longer shows empty ranges on the initial parse.
This simplifies some logic in `languagetree.lua`, removing the need for
`_has_regions`, and removing side effects in `:included_regions()`.
Before:
- Edit is made which sets `_regions = nil`
- Upon the next call to `included_regions()` (usually right after we
marked `_regions` as `nil` due to an `_iter_regions()` call), if
`_regions` is nil, we repopulate the table (as long as the tree
actually has regions)
After:
- Edit is made which resets `_regions` if it exists
- `included_regions()` no longer needs to perform this logic itself, and
also no longer needs to read a `_has_regions` variable
**Problem:** Currently, if users want to efficiently disable injections,
they have to delete the injection query files at their runtime path.
This is because we only check for existence of the files before running
the query over the entire buffer.
**Solution:** Check for existence of query files, *and* that those files
actually have captures. This will allow users to just comment out
existing queries (or better yet, just add their own injection query to
`~/.config/nvim` which contains only comments) to disable running the
query over the entire buffer (a potentially slow operation)
**Problem:** Parsing can be slow for large files, and it is a blocking
operation which can be disruptive and annoying.
**Solution:** Provide a function for asynchronous parsing, which accepts
a callback to be run after parsing completes.
Co-authored-by: Lewis Russell <lewis6991@gmail.com>
Co-authored-by: Luuk van Baal <luukvbaal@gmail.com>
Co-authored-by: VanaIgr <vanaigranov@gmail.com>
Problem: Treesitter highlighter implements an on_bytes callback that
just re-marks a buffer range for redraw. The edit that
prompted the callback will already have done that.
Solution: Remove redundant on_bytes callback from the treesitter
highlighter module.
Problem: No clear way to check whether parsers are available for a given
language.
Solution: Make `language.add()` return `true` if a parser was
successfully added and `nil` otherwise. Use explicit `assert` instead of
relying on thrown errors.
For context, see https://github.com/neovim/neovim/pull/24738. Before
that PR, Nvim did not correctly handle captures with quantifiers. That
PR made the correct behavior opt-in to minimize breaking changes, with
the intention that the correct behavior would eventually become the
default. Users can still opt-in to the old (incorrect) behavior for now,
but this option will eventually be removed completely.
BREAKING CHANGE: Any plugin which uses `Query:iter_matches()` must
update their call sites to expect an array of nodes in the `match`
table, rather than a single node.
This is identical to `named_node_for_range` except that it includes
anonymous nodes. This maintains consistency in the API because we
already have `descendant_for_range` and `named_descendant_for_range`.
* fix(treesitter): enforce lowercase language names
Problem: On case-insensitive file systems (e.g., macOS), `has_parser`
will return `true` for uppercase aliases, which will then try to inject
the uppercase language unsuccessfully.
Solution: Enforce and assume parser names to be lowercase when
resolving language names.
Problem: Injecting languages for file redirects (e.g., in bash) is not
possible.
Solution: Add `@injection.filename` capture that is piped through
`vim.filetype.match({ filename = node_text })`; the resulting filetype
(if not `nil`) is then resolved as a language (either directly or
through the list maintained via `vim.treesitter.language.register()`).
Note: `@injection.filename` is a non-standard capture introduced by
Helix; having two editors implement it makes it likely to be upstreamed.