These are not needed after #35129 but making uncrustify still play nice
with them was a bit tricky.
Unfortunately `uncrustify --update-config-with-doc` breaks strings
with backslashes. This issue has been reported upstream,
and in the meanwhile auto-update on every single run has been disabled.
Using the preprocessor before generating prototypes provides some
"niceties" but the places that rely on these are pretty few.
Vastly simplifying the BUILD SYSTEM is a better trade-off.
Unbalancing { } blocks due to the preprocessor is cringe anyway (think
of the tree-sitter trees!), write these in a different way.
Add some workarounds for plattform specific features.
INCLUDE_GENERATED_DECLARATIONS flag is now technically redundant,
but will be cleaned up in a follow-up PR as it is a big mess.
Problem:
vim.uv.os_setenv gets "stuck" per-key. #32550
Caused by the internal `envmap` cache. #7920
:echo $FOO <-- prints nothing
:lua vim.uv.os_setenv("FOO", "bar")
:echo $FOO <-- prints bar (as expected)
:lua vim.uv.os_setenv("FOO", "fizz")
:echo $FOO <-- prints bar, still (not expected. Should be "fizz")
:lua vim.uv.os_unsetenv("FOO")
:echo $FOO <-- prints bar, still (not expected. Should be nothing)
:lua vim.uv.os_setenv("FOO", "buzz")
:echo $FOO <-- prints bar, still (not expected. Should be "buzz")
Solution:
- Remove the `envmap` cache.
- Callers to `os_getenv` must free the result.
- Update all call-sites.
- Introduce `os_getenv_noalloc`.
- Extend `os_env_exists()` the `nonempty` parameter.
Problem: inline word diff treats multibyte chars as word char
(after 9.1.1243)
Solution: treat all non-alphanumeric characters as non-word characters
(Yee Cheng Chin)
Previously inline word diff simply used Vim's definition of keyword to
determine what is a word, which leads to multi-byte character classes
such as emojis and CJK (Chinese/Japanese/Korean) characters all
classifying as word characters, leading to entire sentences being
grouped as a single word which does not provide meaningful information
in a diff highlight.
Fix this by treating all non-alphanumeric characters (with class number
above 2) as non-word characters, as there is usually no benefit in using
word diff on them. These include CJK characters, emojis, and also
subscript/superscript numbers. Meanwhile, multi-byte characters like
Cyrillic and Greek letters will still continue to considered as words.
Note that this is slightly inconsistent with how words are defined
elsewhere, as Vim usually considers any character with class >=2 to be
a "word".
related: vim/vim#16881 (diff inline highlight)
closes: vim/vim#170509aa120f7ad
Co-authored-by: Yee Cheng Chin <ychin.git@gmail.com>
Problem: Option metadata like list of valid values for an option and
option flags are not listed in the `options.lua` file and are instead
manually defined in C, which means option metadata is split between
several places.
Solution: Put metadata such as list of valid values for an option and
option flags in `options.lua`, and autogenerate the corresponding C
variables and enums.
Supersedes #28659
Co-authored-by: glepnir <glephunter@gmail.com>
while the implementation is not tied to screen chars, it is a reasonable
expectation to support the same size. If nvim is able to display a
multibyte character, it will accept the same character as input,
including in normal mode commands like r{char}
Problem: Resetting cell widths can make 'listchars' or 'fillchars'
invalid.
Solution: Check for conflicts when resetting cell widths (zeertzjq).
closes: vim/vim#1562966f65a46c5
Some sequences beginning with ASCII might be rendered as emoji, as for
instance emoji 1️⃣ which is encoded as ascii 0x31 + U+FE0F + U+20E3.
While it is tricky to make the width of such sequences configurable,
we can make TUI be careful with such sequences and reset the cursor,
just like for Extended_Pictogram based sequences.
Use the grapheme break algorithm from utf8proc to support grapheme
clusters from recent unicode versions.
Handle variant selector VS16 turning some codepoints into double-width
emoji. This means we need to use ptr2cells rather than char2cells when
possible.
According to `CaseFolding-15.1.0.txt`, full casefolding should be
preferred over simple casefolding as it's considered to be more correct.
Since utf8proc already provides full casefolding it makes sense to
switch to it. This will also remove a lot of unnecessary build code.
Temporary exceptions are made for two sets characters:
- `ß` will still be considered `ß` (instead of `ss`) as using a full
casefolding requires interfering with upstream spell files in some
form.
- `İ` will still be considered `İ` (instead of `i̇`) as using full
casefolding requires making a value judgement on the "correct"
behavior. There are two, equally valid case-insensetive comparison for
this character according to unicode. It is essentially up to the
implementor to decide which conversion is correct. For this reason it
might make sense to allow users to decide which conversion should be
done as an added option to `casemap` in a future PR.
Problem: regex: wrong match when searching multi-byte char
case-insensitive (diffsetter)
Solution: Apply proper case-folding for characters and search-string
This patch does the following 4 things:
1) When the regexp engine compares two utf-8 codepoints case
insensitive it may match an adjacent character, because it assumes
it can step over as many bytes as the pattern contains.
This however is not necessarily true because of case-folding, a
multi-byte UTF-8 character can be considered equal to some
single-byte value.
Let's consider the pattern 'ſ' and the string 's'. When comparing and
ignoring case, the single character 's' matches, and since it matches
Vim will try to step over the match (by the amount of bytes of the
pattern), assuming that since it matches, the length of both strings is
the same.
However in that case, it should only step over the single byte value
's' by 1 byte and try to start matching after it again. So for the
backtracking engine we need to ensure:
* we try to match the correct length for the pattern and the text
* in case of a match, we step over it correctly
There is one tricky thing for the backtracing engine. We also need to
calculate correctly the number of bytes to compare the 2 different
utf-8 strings s1 and s2. So we will count the number of characters in
s1 that the byte len specified. Then we count the number of bytes to
step over the same number of characters in string s2 and then we can
correctly compare the 2 utf-8 strings.
2) A similar thing can happen for the NFA engine, when skipping to the
next character to test for a match. We are skipping over the regstart
pointer, however we do not consider the case that because of
case-folding we may need to adjust the number of bytes to skip over.
So this needs to be adjusted in find_match_text() as well.
3) A related issue turned out, when prog->match_text is actually empty.
In that case we should try to find the next match and skip this
condition.
4) When comparing characters using collections, we must also apply case
folding to each character in the collection and not just to the
current character from the search string. This doesn't apply to the
NFA engine, because internally it converts collections to branches
[abc] -> a\|b\|c
fixes: vim/vim#14294closes: vim/vim#1475622e8e12d9f
N/A patches:
vim-patch:9.0.1771: regex: combining chars in collections not handled
vim-patch:9.0.1777: patch 9.0.1771 causes problems
Co-authored-by: Christian Brabandt <cb@256bit.org>
utf8proc already defines LATIN CAPITAL LETTER SHARP S (ẞ) to be the
uppercase variant of LATIN SMALL LETTER SHARP S (ß), so this special
workaround when using `gU` is no longer needed on the neovim side.
Problem: Wrong cursor position after using setcellwidths().
Solution: Invalidate cursor position in addition to redrawing.
(zeertzjq)
closes: vim/vim#1454505aacec6ab
Reorder functions in test_utf8.vim to match upstream.
Problem: Patch 9.1.0296 causes too many issues
(Tony Mechelynck, chdiza, CI)
Solution: Back out the change for now
Revert "patch 9.1.0296: regexp: engines do not handle case-folding well"
This reverts commit 7a27c108e0509f3255ebdcb6558e896c223e4d23 it causes
issues with syntax highlighting and breaks the FreeBSD and MacOS CI. It
needs more work.
fixes: vim/vim#14487c97f4d61cd
Co-authored-by: Christian Brabandt <cb@256bit.org>
Problem: Regex engines do not handle case-folding well
Solution: Correctly calculate byte length of characters to skip
When the regexp engine compares two utf-8 codepoints case insensitively
it may match an adjacent character, because it assumes it can step over
as many bytes as the pattern contains.
This however is not necessarily true because of case-folding, a
multi-byte UTF-8 character can be considered equal to some single-byte
value.
Let's consider the pattern 'ſ' and the string 's'. When comparing and
ignoring case, the single character 's' matches, and since it matches
Vim will try to step over the match (by the amount of bytes of the
pattern), assuming that since it matches, the length of both strings is
the same.
However in that case, it should only step over the single byte
value 's' so by 1 byte and try to start matching after it again. So for the
backtracking engine we need to ensure:
- we try to match the correct length for the pattern and the text
- in case of a match, we step over it correctly
The same thing can happen for the NFA engine, when skipping to the next
character to test for a match. We are skipping over the regstart
pointer, however we do not consider the case that because of
case-folding we may need to adjust the number of bytes to skip over. So
this needs to be adjusted in find_match_text() as well.
A related issue turned out, when prog->match_text is actually empty. In
that case we should try to find the next match and skip this condition.
fixes: vim/vim#14294closes: vim/vim#144337a27c108e0
Co-authored-by: Christian Brabandt <cb@256bit.org>
Problems:
- Illegal bytes after valid UTF-8 char cause utf_cp_*_off() to fail.
- When stream isn't NUL-terminated, utf_cp_*_off() may go over the end.
Solution: Don't go over end of the char of end of the string.
Problem: upper-case of ß should be U+1E9E (CAPITAL LETTER SHARP S)
(fenuks)
Solution: Make gU, ~ and g~ convert the U+00DF LATIN SMALL LETTER SHARP S (ß)
to U+1E9E LATIN CAPITAL LETTER SHARP S (ẞ), update tests
(glepnir)
This is part of Unicode 5.1.0 from April 2008, so should be fairly safe
to use now and since 2017 is part of the German standard orthography,
according to Wikipedia:
https://en.wikipedia.org/wiki/Capital_%E1%BA%9E#cite_note-auto-12
There is however one exception: UnicodeData.txt for U+00DF
LATIN SMALL LETTER SHARP S does NOT define U+1E9E LATIN CAPITAL LETTER
SHARP S as its upper case version. Therefore, toupper() won't be able
to convert from lower sharp s to upper case sharp s (the other way
around however works, since U+00DF is considered the lower case
character of U+1E9E and therefore tolower() works correctly for the
upper case version).
fixes: vim/vim#5573closes: vim/vim#14018bd1232a1fa
Co-authored-by: glepnir <glephunter@gmail.com>
Problem: qsort() comparison functions should be transitive
Solution: Do not subtract values, but rather use explicit comparisons
Improve qsort() comparison functions
There has been a recent report on qsort() causing out-of-bounds read &
write in glibc for non transitive comparison functions
https://www.qualys.com/2024/01/30/qsort.txt
Even so the bug is in glibc's implementation of the qsort() algorithm,
it's bad style to just use substraction for the comparison functions,
which may cause overflow issues and as hinted at in OpenBSD's manual
page for qsort(): "It is almost always an error to use subtraction to
compute the return value of the comparison function."
So check the qsort() comparison functions and change them to be safe.
closes: vim/vim#13980e06e437665
Co-authored-by: Christian Brabandt <cb@256bit.org>
`utf_char2cells()` calls `utf_printable()` twice (sometimes indirectly,
through `vim_isprintc()`) for characters >= 128. The function can be
refactored to call to it only once.
`utf_printable()` uses binary search on ranges of unprintable characters
to determine if a given character is printable. Since there are only 9
ranges, and the first range contains only one character, binary search
can be replaced with SSE2 SIMD comparisons that check 8 ranges at a
time, and the first range is checked separately. SSE2 is enabled by
default in GCC, Clang and MSVC for x86-64.
Add 3-byte utf-8 to screenpos_spec benchmarks.
The optimized virtual column calculation loop in getvcol()
was decoding the current character twice: once in ptr2cells()
and the second time in utfc_ptr2len(). For combining charcters, they were
decoded up to 2 times in utfc_ptr2len(). Additionally, the function used to
decode the character could be further optimised.
Remove `export` pramgas from defs headers as it causes IWYU to believe
that the definitions from the defs headers comes from main header, which
is not what we really want.