Mirror of https://github.com/nim-lang/Nim.git (synced 2026-05-25 14:28:15 +00:00)
pegs: accept UTF-8 bytes in bare identifier terminals (#25829)
## Summary
- Fixes `std/pegs` lexing for bare UTF-8 terminals such as `\i café`.
- The lexer previously stopped at the first non-ASCII byte, so
`pkTerminalIgnoreCase` never saw the full term despite its rune-aware
`fastRuneAt`/`toLower` matching.
- This now keeps non-ASCII bytes in identifier-style terminals while
ASCII non-ident characters still terminate the symbol.
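
The termination rule described above can be sketched in isolation. This is a minimal illustration, not the `std/pegs` source: `scanBareTerminal` and its `keepNonAscii` flag are hypothetical names introduced here so the old and new rules can be compared side by side.

```nim
import std/strutils

# Hypothetical helper, not part of std/pegs. It collects the bytes of one
# bare terminal starting at `start`. With keepNonAscii = false it stops at
# any byte outside IdentChars (the old rule); with keepNonAscii = true only
# ASCII bytes outside IdentChars end the symbol (the new rule).
func scanBareTerminal(buf: string, start: int, keepNonAscii: bool): string =
  var pos = start
  while pos < buf.len:
    result.add buf[pos]
    inc pos
    if pos < buf.len:
      let ch = buf[pos]
      let isAscii = ord(ch) < 0x80
      if ch notin IdentChars and (isAscii or not keepNonAscii): break

when isMainModule:
  # Old rule: the first byte of the UTF-8 sequence for 'é' ends the symbol.
  doAssert scanBareTerminal("café", 0, keepNonAscii = false) == "caf"
  # New rule: the multi-byte sequence is kept whole.
  doAssert scanBareTerminal("café", 0, keepNonAscii = true) == "café"
  # ASCII punctuation still terminates the symbol under the new rule.
  doAssert scanBareTerminal("café+x", 0, keepNonAscii = true) == "café"
```

Every byte of a multi-byte UTF-8 sequence has its high bit set, so the `ord(ch) < 0x80` check is enough to tell such bytes apart from ASCII punctuation without decoding runes in the lexer.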
## Behavior
Before: `match("CAFÉ", peg"\i café")` failed because the terminal was
lexed as `caf`.
After: `match("CAFÉ", peg"\i café")`, `match("Café", peg"\i café")`, and
`findAll` over mixed-case occurrences pass.
`std/pegs` documents `useUnicode = true` as proper UTF-8 support, and
quoted terminals already preserved the same bytes; this makes bare
terminals consistent with that path.
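
To illustrate why the full term has to reach the matcher, here is a simplified rune-wise, case-insensitive comparison built on `fastRuneAt` and `toLower` from `std/unicode`. The helper `equalsIgnoreCaseRunes` is a hypothetical name used for this sketch only; it is not how `std/pegs` implements `pkTerminalIgnoreCase`, and it compares whole strings rather than matching a prefix.

```nim
import std/unicode

# Hypothetical helper for illustration. It decodes both strings one rune at
# a time and compares the lowercased runes.
proc equalsIgnoreCaseRunes(input, term: string): bool =
  var i = 0
  var j = 0
  while i < input.len and j < term.len:
    var a, b: Rune
    fastRuneAt(input, i, a)  # decodes one rune and advances i
    fastRuneAt(term, j, b)   # decodes one rune and advances j
    if toLower(a) != toLower(b): return false
  # Both strings must be consumed completely.
  result = i == input.len and j == term.len

when isMainModule:
  doAssert equalsIgnoreCaseRunes("CAFÉ", "café")
  doAssert equalsIgnoreCaseRunes("Café", "café")
  # A terminal truncated to "caf" cannot cover the whole input.
  doAssert not equalsIgnoreCaseRunes("CAFÉ", "caf")
```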
I did not find an existing relevant issue or PR in searches for
pegs/unicode/utf8/getSymbol/pkTerminalIgnoreCase.
@@ -81,6 +81,9 @@ parameter and result types, not just their source-level shape. Use
 - `std/re` and `std/nre` are deprecated as PCRE library is obsolete.
   Use https://github.com/nitely/nim-regex or `std/nre2`.
   See: https://github.com/nim-lang/Nim/issues/23668.
+- `std/pegs` now correctly lexes UTF-8 bytes inside bare identifier-style
+  terminals, so case-insensitive matching of non-ASCII terms (e.g. ``\i café``)
+  works without single-quoting.

 ## Language changes

@@ -1668,7 +1668,10 @@ func getSymbol(c: var PegLexer, tok: var Token) =
   while pos < c.buf.len:
     add(tok.literal, c.buf[pos])
     inc(pos)
-    if pos < c.buf.len and c.buf[pos] notin strutils.IdentChars: break
+    if pos < c.buf.len:
+      let ch = c.buf[pos]
+      # Keep non-ASCII bytes so UTF-8 terminals reach the rune-aware matchers.
+      if ch notin strutils.IdentChars and ord(ch) < 0x80: break
   c.bufpos = pos
   tok.kind = tkIdentifier

@@ -259,6 +259,11 @@ block:
     doAssert match("EINE ÜBERSICHT UND AUSSERDEM", peg"(\upper \white*)+")
     doAssert(not match("456678", peg"(\letter)+"))

+    block:
+      doAssert match("CAFÉ", peg"\i café")
+      doAssert match("Café", peg"\i café")
+      doAssert "two cafés: Café and CAFÉ".findAll(peg"\i café").len == 3
+
     doAssert("var1 = key; var2 = key2".replacef(
       peg"\skip(\s*) {\ident}'='{\ident}", "$1<-$2$2") ==
       "var1<-keykey;var2<-key2key2")