mirror of
https://github.com/neovim/neovim.git
synced 2025-10-17 07:16:09 +00:00
vim-patch:9.1.1258: regexp: max \U and \%U value is limited by INT_MAX (#33156)
Problem: regexp: max \U and \%U value is limited by INT_MAX but gives a
confusing error message (related: v8.1.0985).
Solution: give a better error message when the value reaches INT_MAX
When searching, Vim allows up to 8 hex characters to be given using the /\U
and /\%U regex atoms. However, with "/\UFFFFFFFF" the code point is
already above what an integer variable can hold, which is 2,147,483,647.
Since patch v8.1.0985, Vim has limited the max codepoint to INT_MAX
(otherwise it caused a crash in the nfa regex engine), but instead of
erroring out it silently fell back to parsing the number as a backslash
value and not as a codepoint value. As a result, "/[\UFFFFFFFF]" will
happily find a "\" or a literal "F", and "/[\d127-\UFFFFFFFF]"
will error out with "reverse range in character class".
Interestingly, the max Unicode codepoint value is U+10FFFF, which still
fits into an ordinary integer value. This means that we don't even
need to parse 8 hex characters; 6 would have been enough.
However, let's not limit Vim to search for only max 6 hex characters
(which would be a backward incompatible change), but instead allow all 8
characters and only if the codepoint reaches INT_MAX, give a more
precise error message (about what the max unicode codepoint value is).
This allows searching for "/[\U7FFFFFFE]" (which will likely return "E486
Pattern not found"), while "/[\U7FFFFFFF]" now errors with "E1517: Value too
large, max Unicode codepoint is U+10FFFF".
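The behaviour described above can be sketched in isolation. This is only an illustration under stated assumptions, not the code from the patch: `parse_hex_digits`, `clamp_to_int` and `is_too_large` are hypothetical names, the parser works on a plain string instead of the regex parser state, and `int` is assumed to be 32 bits wide so that INT_MAX equals 0x7FFFFFFF.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

// Hypothetical helper: read up to max_digits hex digits from s into a
// 64-bit value. Returns -1 when s does not start with a hex digit.
static int64_t parse_hex_digits(const char *s, int max_digits)
{
  assert(max_digits >= 1 && max_digits <= 8);
  int64_t nr = 0;
  int i;
  for (i = 0; i < max_digits; i++) {
    char c = s[i];
    int digit;
    if (c >= '0' && c <= '9') {
      digit = c - '0';
    } else if (c >= 'a' && c <= 'f') {
      digit = c - 'a' + 10;
    } else if (c >= 'A' && c <= 'F') {
      digit = c - 'A' + 10;
    } else {
      break;
    }
    nr = nr * 16 + digit;
  }
  return i == 0 ? -1 : nr;
}

// Hypothetical stand-in for the clamping step: a failed parse falls back
// to a backslash, anything above INT_MAX is clamped to INT_MAX.
static int clamp_to_int(int64_t nr)
{
  if (nr < 0) {
    return '\\';
  }
  if (nr > INT_MAX) {
    return INT_MAX;
  }
  return (int)nr;
}

// The caller treats INT_MAX as the "value too large" sentinel.
static int is_too_large(int c)
{
  return c == INT_MAX;
}
```

With these helpers, "FFFFFFFF" and "7FFFFFFF" both end up flagged as too large, while "7FFFFFFE" is passed through as an ordinary (if unmatched) value.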
While this change is straightforward on architectures where long is 8
bytes, it is not so simple on Windows or 32-bit architectures where long
is 4 bytes (and therefore the test fails there). To account for that,
let's make use of the vimlong_T number type, make a few corresponding
changes in the regex engine code, and cast the value to the expected data
type. This, however, may not work correctly on systems that don't have
the long long datatype (e.g. OpenVMS), and the test will probably fail
there.
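The width problem can be checked with a few lines of C. This is a rough sketch only; `max_value_for_hex_digits` and `long_can_hold` are hypothetical helpers that are not part of Vim or Neovim, and `int64_t` stands in for a 64-bit type such as the one vimlong_T is meant to provide.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

// Hypothetical helper: largest value that `digits` hex digits can encode.
static int64_t max_value_for_hex_digits(int digits)
{
  assert(digits >= 1 && digits <= 15);
  return (INT64_C(1) << (4 * digits)) - 1;
}

// Returns 1 when a plain `long` can hold every value with `digits` hex
// digits, 0 otherwise (the case on platforms with a 4-byte long).
static int long_can_hold(int digits)
{
  return (int64_t)LONG_MAX >= max_value_for_hex_digits(digits);
}
```

Eight hex digits reach 0xFFFFFFFF, which needs more than 31 value bits, so `long_can_hold(8)` is only true where long is 8 bytes. Six hex digits reach 0xFFFFFF, which already covers U+10FFFF.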
fixes: vim/vim#16949
closes: vim/vim#16994
f2b16986a1
Co-authored-by: Christian Brabandt <cb@256bit.org>
@@ -367,6 +367,8 @@ static const char e_nfa_regexp_missing_value_in_chr[]
 static const char e_atom_engine_must_be_at_start_of_pattern[]
   = N_("E1281: Atom '\\%%#=%c' must be at the start of the pattern");
 static const char e_substitute_nesting_too_deep[] = N_("E1290: substitute nesting too deep");
+static const char e_unicode_val_too_large[]
+  = N_("E1541: Value too large, max Unicode codepoint is U+10FFFF");
 
 #define NOT_MULTI 0
 #define MULTI_ONE 1
@@ -4796,6 +4798,11 @@ collection:
           || *regparse == 'u'
           || *regparse == 'U') {
         startc = coll_get_char();
+        // max UTF-8 Codepoint is U+10FFFF,
+        // but allow values until INT_MAX
+        if (startc == INT_MAX) {
+          EMSG_RET_NULL(_(e_unicode_val_too_large));
+        }
         if (startc == 0) {
           regc(0x0a);
         } else {
@@ -5548,12 +5555,15 @@ static int coll_get_char(void)
   case 'U':
     nr = gethexchrs(8); break;
   }
-  if (nr < 0 || nr > INT_MAX) {
+  if (nr < 0) {
     // If getting the number fails be backwards compatible: the character
     // is a backslash.
     regparse--;
     nr = '\\';
   }
+  if (nr > INT_MAX) {
+    nr = INT_MAX;
+  }
   return (int)nr;
 }
 
@@ -10565,6 +10575,11 @@ collection:
             || *regparse == 'U') {
           // TODO(RE): This needs more testing
           startc = coll_get_char();
+          // max UTF-8 Codepoint is U+10FFFF,
+          // but allow values until INT_MAX
+          if (startc == INT_MAX) {
+            EMSG_RET_FAIL(_(e_unicode_val_too_large));
+          }
           got_coll_char = true;
           MB_PTR_BACK(old_regparse, regparse);
         } else {