mirror of
https://github.com/odin-lang/Odin.git
synced 2025-12-30 18:02:02 +00:00
98 lines
3.8 KiB
Odin
98 lines
3.8 KiB
Odin
/*
|
|
package regex implements a complete suite for using Regular Expressions to
|
|
match and capture text.
|
|
|
|
Regular expressions are used to describe how a piece of text can match to
|
|
another, using a pattern language.
|
|
|
|
Odin's regex library implements the following features:
|
|
|
|
Alternation: `apple|cherry`
|
|
Classes: `[0-9_]`
|
|
Classes, negated: `[^0-9_]`
|
|
Shorthands: `\d\s\w`
|
|
Shorthands, negated: `\D\S\W`
|
|
Wildcards: `.`
|
|
Repeat, optional: `a*`
|
|
Repeat, at least once: `a+`
|
|
Repetition: `a{1,2}`
|
|
Optional: `a?`
|
|
Group, capture: `([0-9])`
|
|
Group, non-capture: `(?:[0-9])`
|
|
Start & End Anchors: `^hello$`
|
|
Word Boundaries: `\bhello\b`
|
|
Non-Word Boundaries: `hello\B`
|
|
|
|
These specifiers can be composed together, such as an optional group:
|
|
`(?:hello)?`
|
|
|
|
This package also supports the non-greedy variants of the repeating and
|
|
optional specifiers by appending a `?` to them.
|
|
|
|
Of the shorthand classes that are supported, they are all ASCII-based, even
|
|
when compiling in Unicode mode. This is for the sake of general performance and
|
|
simplicity, as there are thousands of Unicode codepoints which would qualify as
|
|
either a digit, space, or word character which could be irrelevant depending on
|
|
what is being matched.
|
|
|
|
Here are the shorthand class equivalencies:
|
|
\d: [0-9]
|
|
\s: [\t\n\f\r ]
|
|
\w: [0-9A-Z_a-z]
|
|
|
|
If you need your own shorthands, you can compose strings together like so:
|
|
MY_HEX :: "[0-9A-Fa-f]"
|
|
PATTERN :: MY_HEX + "-" + MY_HEX
|
|
|
|
The compiler will handle turning multiple identical classes into references to
|
|
the same set of matching runes, so there's no penalty for doing it like this.
|
|
|
|
|
|
|
|
``Some people, when confronted with a problem, think
|
|
"I know, I'll use regular expressions." Now they have two problems.''
|
|
|
|
- Jamie Zawinski
|
|
|
|
|
|
Regular expressions have gathered a reputation over the decades for often being
|
|
chosen as the wrong tool for the job. Here, we will clarify a few cases in
|
|
which RegEx might be good or bad.
|
|
|
|
|
|
**When is it a good time to use RegEx?**
|
|
|
|
- You don't know at compile-time what patterns of text the program will need to
|
|
match when it's running.
|
|
- As an example, you are making a client which can be configured by the user to
|
|
trigger on certain text patterns received from a server.
|
|
- For another example, you need a way for users of a text editor to compose
|
|
matching strings that are more intricate than a simple substring lookup.
|
|
- The text you're matching against is small (< 64 KiB) and your patterns aren't
|
|
overly complicated with branches (alternations, repeats, and optionals).
|
|
- If none of the above general impressions apply but your project doesn't
|
|
warrant long-term maintenance.
|
|
|
|
**When is it a bad time to use RegEx?**
|
|
|
|
- You know at compile-time the grammar you're parsing; a hand-made parser has
|
|
the potential to be more maintainable and readable.
|
|
- The grammar you're parsing has certain validation steps that lend itself to
|
|
forming complicated expressions, such as e-mail addresses, URIs, dates,
|
|
postal codes, credit cards, et cetera. Using RegEx to validate these
|
|
structures is almost always a bad sign.
|
|
- The text you're matching against is big (> 1 MiB); you would be better served
|
|
by first dividing the text into manageable chunks and using some heuristic to
|
|
locate the most likely location of a match before applying RegEx against it.
|
|
- You value high performance and low memory usage; RegEx will always have a
|
|
certain overhead which increases with the complexity of the pattern.
|
|
|
|
|
|
The implementation of this package has been optimized, but it will never be as
|
|
thoroughly performant as a hand-made parser. In comparison, there are just too
|
|
many intermediate steps, assumptions, and generalizations in what it takes to
|
|
handle a regular expression.
|
|
|
|
*/
|
|
package regex
|