pegs: accept UTF-8 bytes in bare identifier terminals (#25829)

## Summary

- Fixes `std/pegs` lexing for bare UTF-8 terminals such as `\i café`.
- The lexer previously stopped at the first non-ASCII byte, so `pkTerminalIgnoreCase` never saw the full term despite its rune-aware `fastRuneAt`/`toLower` matching.
- This now keeps non-ASCII bytes in identifier-style terminals while ASCII non-ident characters still terminate the symbol.

## Behavior

Before: `match("CAFÉ", peg"\i café")` failed because the terminal was lexed as `caf`.

After: `match("CAFÉ", peg"\i café")`, `match("Café", peg"\i café")`, and `findAll` over mixed-case occurrences pass.

`std/pegs` documents `useUnicode = true` as proper UTF-8 support, and quoted terminals already preserved the same bytes; this makes bare terminals consistent with that path.

I did not find an existing relevant issue or PR in searches for pegs/unicode/utf8/getSymbol/pkTerminalIgnoreCase.
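
For reviewers who want to see the termination rule in isolation, here is a minimal standalone sketch of the scanning loop. `scanSymbol` and its `keepNonAscii` flag are hypothetical names used only for illustration; they are not part of `std/pegs`, and the real `getSymbol` works on a `PegLexer` and a `Token` rather than on a plain string. The only library symbol assumed is `strutils.IdentChars`, which covers ASCII letters, digits, and `_`.

```nim
import std/strutils

func scanSymbol(buf: string; start = 0; keepNonAscii = true): string =
  ## Hypothetical helper, not part of std/pegs.
  ## Always consumes the byte at `start`, then keeps consuming while the
  ## next byte is an ASCII identifier character. With `keepNonAscii = true`
  ## (the patched rule) bytes >= 0x80 are consumed as well; with `false`
  ## (the old rule) any byte outside IdentChars ends the symbol.
  var pos = start
  while pos < buf.len:
    result.add buf[pos]
    inc pos
    if pos < buf.len:
      let ch = buf[pos]
      let isAsciiByte = ord(ch) < 0x80
      if ch notin IdentChars and (isAsciiByte or not keepNonAscii):
        break

when isMainModule:
  # Old rule: the symbol is cut at the first byte of the UTF-8 sequence for 'é'.
  doAssert scanSymbol("café rest", keepNonAscii = false) == "caf"
  # Patched rule: both bytes of 'é' stay in the symbol; the space ends it.
  doAssert scanSymbol("café rest") == "café"
  # ASCII non-ident characters still terminate the symbol.
  doAssert scanSymbol("caf-x") == "caf"
```

The sketch treats every byte at or above `0x80` as part of the symbol without validating that the bytes form well-formed UTF-8, which matches the byte-level check in the patch.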
2026-07-11 11:49:33 +00:00 · 2026-05-19 18:27:48 -03:00
parent f9647276d8
commit 4f6b727d9e
3 changed files with 12 additions and 1 deletions
--- a/changelog.md
+++ b/changelog.md
@@ -81,6 +81,9 @@ parameter and result types, not just their source-level shape. Use
 - `std/re` and `std/nre` are deprecated as PCRE library is obsolete.
  Use https://github.com/nitely/nim-regex or `std/nre2`.
  See: https://github.com/nim-lang/Nim/issues/23668.
+- `std/pegs` now correctly lexes UTF-8 bytes inside bare identifier-style
+  terminals, so case-insensitive matching of non-ASCII terms (e.g. ``\i café``)
+  works without single-quoting.

 ## Language changes

--- a/lib/pure/pegs.nim
+++ b/lib/pure/pegs.nim
@@ -1668,7 +1668,10 @@ func getSymbol(c: var PegLexer, tok: var Token) =
  while pos < c.buf.len:
    add(tok.literal, c.buf[pos])
    inc(pos)
-    if pos < c.buf.len and c.buf[pos] notin strutils.IdentChars: break
+    if pos < c.buf.len:
+      let ch = c.buf[pos]
+      # Keep non-ASCII bytes so UTF-8 terminals reach the rune-aware matchers.
+      if ch notin strutils.IdentChars and ord(ch) < 0x80: break
  c.bufpos = pos
  tok.kind = tkIdentifier

--- a/tests/stdlib/tpegs.nim
+++ b/tests/stdlib/tpegs.nim
@@ -259,6 +259,11 @@ block:
    doAssert match("EINE ÜBERSICHT UND AUSSERDEM", peg"(\upper \white*)+")
    doAssert(not match("456678", peg"(\letter)+"))

+    block:
+      doAssert match("CAFÉ", peg"\i café")
+      doAssert match("Café", peg"\i café")
+      doAssert "two cafés: Café and CAFÉ".findAll(peg"\i café").len == 3
+
    doAssert("var1 = key; var2 = key2".replacef(
      peg"\skip(\s*) {\ident}'='{\ident}", "$1<-$2$2") ==
           "var1<-keykey;var2<-key2key2")