From 0dc86145ea4ac2a0fad2b4b3625361ade4ed4970 Mon Sep 17 00:00:00 2001
From: Flaviu Tamas
Date: Sat, 11 Apr 2015 08:51:33 -0400
Subject: [PATCH] Convert readme to RST

---
 README.asciidoc | 194 ----------------------------------
 README.rst      | 269 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 269 insertions(+), 194 deletions(-)
 delete mode 100644 README.asciidoc
 create mode 100644 README.rst

diff --git a/README.asciidoc b/README.asciidoc
deleted file mode 100644
index c307cc153a..0000000000
--- a/README.asciidoc
+++ /dev/null
@@ -1,194 +0,0 @@
-= NRE
-:toc:
-:toclevels: 4
-:toc-placement!:
-
-toc::[]
-
-== What is NRE?
-
-A regular expression library for Nim using PCRE to do the hard work.
-
-== Why?
-
-The http://nim-lang.org/re.html[re.nim] module that http://nim-lang.org/[Nim]
-provides in its standard library is inadequate:
- - It provides only a limited number of captures, while the underling library
-   (PCRE) allows an unlimited number.
- - Instead of having one proc that returns both the bounds and substring, it
-   has one for the bounds and another for the substring.
- - If the splitting regex is empty (`""`), then it returns the input string
-   instead of following https://ideone.com/dDMjmz[Perl],
-   http://jsfiddle.net/xtcbxurg/[Javascript], and
-   https://ideone.com/hYJuJ5[Java]'s precedent of returning a list of each
-   character (`"123".split(re"") == @["1", "2", "3"]`).
-
-== Documentation
-
-=== Operations
-
-[[proc-find]]
-==== find(string, Regex, start = 0, endpos = int.high): RegexMatch
-
-Finds the given pattern in the string between the end and start positions.
-
-`start` :: The start point at which to start matching. `|abc` is `0`; `a|bc`
-  is `1`
-`endpos` :: The maximum index for a match; `int.high` means the end of the
-  string, otherwise it's an inclusive upper bound.
-
-[[proc-match]]
-==== match(string, Regex, start = 0, endpos = int.high): RegexMatch
-
-Like link:#proc-find[`find(...)`], but anchored to the start of the string.
-This means that `"foo".match(re"f") == true`, but `"foo".match(re"o") ==
-false`.
-
-[[iter-find]]
-==== iterator findIter(string, Regex, start = 0, endpos = int.high): RegexMatch
-
-Works the same as link:#proc-find[`find(...)`], but finds every non-overlapping
-match. `"2222".find(re"22")` is `"22", "22"`, not `"22", "22", "22"`.
-
-Arguments are the same as link:#proc-find[`find(...)`]
-
-Variants:
-
- - `proc findAll(...)` returns a `seq[string]`
-
-[[proc-split]]
-==== split(string, Regex, maxsplit = -1, start = 0): seq[string]
-
-Splits the string with the given regex. This works according to the rules that
-Perl and Javascript use:
-
- - If the match is zero-width, then the string is still split:
-   `"123".split(r"") == @["1", "2", "3"]`.
- - If the pattern has a capture in it, it is added after the string split:
-   `"12".split(re"(\d)") == @["", "1", "", "2", ""]`.
- - If `maxsplit != -1`, then the string will only be split `maxsplit - 1`
-   times. This means that there will be `maxsplit` strings in the output seq.
-   `"1.2.3".split(re"\.", maxsplit = 2) == @["1", "2.3"]`
-
-`start` behaves the same as in link:#proc-find[`find(...)`].
-
-[[proc-replace]]
-==== replace(string, Regex, sub): string
-
-Replaces each match of Regex in the string with `sub`, which should never be
-or return `nil`.
-
-If `sub` is a `proc (RegexMatch): string`, then it is executed with each match
-and the return value is the replacement value.
-
-If `sub` is a `proc (string): string`, then it is executed with the full text
-of the match and and the return value is the replacement value.
-
-If `sub` is a string, the syntax is as follows:
-
-- `$$` - literal `$`
-- `$123` - capture number `123`
-- `$foo` - named capture `foo`
-- `${foo}` - same as above
-- `$1$#` - first and second captures
-- `$#` - first capture
-- `$0` - full match
-
-If a given capture is missing, a `ValueError` exception is thrown.
-
-[[proc-escapere]]
-==== escapeRe(string): string
-
-Escapes the string so it doesn't match any special characters. Incompatible
-with the Extra flag (`X`).
-
-=== Option[RegexMatch]
-
-Represents the result of an execution. On failure, it is `None[RegexMatch]`,
-but if you want automated derefrence, import `optional_t.nonstrict`. The
-available fields are as follows:
-
-`pattern: Regex` :: the pattern that is being matched
-`str: string` :: the string that was matched against
-`captures[]: string` :: the string value of whatever was captured
-at that id. If the value is invalid, then behavior is undefined. If the id is
-`-1`, then the whole match is returned. If the given capture was not matched,
-`nil` is returned.
- - `"abc".match(re"(\w)").captures[0] == "a"`
- - `"abc".match(re"(?<letter>\w)").captures["letter"] == "a"`
- - `"abc".match(re"(\w)\w").captures[-1] == "ab"`
-`captureBounds[]: Option[Slice[int]]` :: gets the bounds of the
-given capture according to the same rules as the above. If the capture is not
-filled, then `None` is returned. The bounds are both inclusive.
- - `"abc".match(re"(\w)").captureBounds[0] == 0 .. 0`
- - `"abc".match(re"").captureBounds[-1] == 0 .. -1`
- - `"abc".match(re"abc").captureBounds[-1] == 0 .. 2`
-`match: string` :: the full text of the match.
-`matchBounds: Slice[int]` :: the bounds of the match, as in `captureBounds[]`
-`(captureBounds|captures).toTable` :: returns a table with each named capture
-as a key.
-`(captureBounds|captures).toSeq` :: returns all the captures by their number.
-`$: string` :: same as `match`
-
-=== Pattern
-
-Represents the pattern that things are matched against, constructed with
-`re(string, string)`. Examples: `re"foo"`, `re(r"foo # comment",
-"x")`, `re"(?x)(*ANYCRLF)foo # comment"`.
-For more details on the leading option groups, see the
-link:http://man7.org/linux/man-pages/man3/pcresyntax.3.html#OPTION_SETTING[Option Setting]
-and the
-link:http://man7.org/linux/man-pages/man3/pcresyntax.3.html#NEWLINE_CONVENTION[Newline Convention]
-sections of the
-link:http://man7.org/linux/man-pages/man3/pcresyntax.3.html[PCRE syntax manual].
-
-`pattern: string` :: the string that was used to create the pattern.
-`captureCount: int` :: the number of captures that the pattern has.
-`captureNameId: Table[string, int]` :: a table from the capture names to
-  their numeric id.
-
-==== Flags
- - `8` - treat both the pattern and subject as UTF8
- - `9` - prevents the pattern from being interpreted as UTF, no matter what
- - `A` - as if the pattern had a `^` at the beginning
- - `E` - DOLLAR_ENDONLY
- - `f` - fails if there is not a match on the first line
- - `i` - case insensitive
- - `m` - multi-line, `^` and `$` match the beginning and end of lines, not of the
-   subject string
- - `N` - turn off auto-capture, `(?foo)` is necessary to capture.
- - `s` - `.` matches newline
- - `U` - expressions are not greedy by default. `?` can be added to a qualifier
-   to make it greedy.
- - `u` - same as `8`
- - `W` - Unicode character properties; `\w` matches `к`.
- - `X` - "Extra", character escapes without special meaning (`\w` vs. `\a`) are
-   errors
- - `x` - extended, comments (`#`) and newlines are ignored (extended)
- - `Y` - pcre.NO_START_OPTIMIZE,
- - `` - newlines are separated by `\r`
- - `` - newlines are separated by `\r\n` (Windows default)
- - `` - newlines are separated by `\n` (UNIX default)
- - `` - newlines are separated by any of the above
- - `` - newlines are separated by any of the above and Unicode newlines:
-[quote, , man pcre]
-____
-single characters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL
-(next line, U+0085), LS (line separator, U+2028), and PS (paragraph
-separator, U+2029). For the 8-bit library, the last two are recognized
-only in UTF-8 mode.
-____
- - `` - `\R` matches CR, LF, or CRLF
- - `` - `\R` matches any unicode newline
- - `` - Javascript compatibility
- - `` - turn off studying; study is enabled by deafault
-
-== Other Notes
-
-By default, NRE compiles it's own PCRE. If this is undesirable, pass
-`-d:pcreDynlib` to use whatever dynamic library is available on the system.
-This may have unexpected consequences if the dynamic library doesn't have
-certain features enabled.
-
-image::web/logo.png["NRE Logo", width=auto, link="https://github.com/flaviut/nre"]
diff --git a/README.rst b/README.rst
new file mode 100644
index 0000000000..920bbdf27c
--- /dev/null
+++ b/README.rst
@@ -0,0 +1,269 @@
+What is NRE?
+============
+
+A regular expression library for Nim using PCRE to do the hard work.
+
+Why?
+====
+
+The `re.nim <http://nim-lang.org/re.html>`__ module that
+`Nim <http://nim-lang.org/>`__ provides in its standard library is
+inadequate:
+
+-  It provides only a limited number of captures, while the underling
+   library (PCRE) allows an unlimited number.
+
+-  Instead of having one proc that returns both the bounds and
+   substring, it has one for the bounds and another for the substring.
+
+-  If the splitting regex is empty (``""``), then it returns the input
+   string instead of following `Perl <https://ideone.com/dDMjmz>`__,
+   `Javascript <http://jsfiddle.net/xtcbxurg/>`__, and
+   `Java <https://ideone.com/hYJuJ5>`__'s precedent of returning a list
+   of each character (``"123".split(re"") == @["1", "2", "3"]``).
+
+Documentation
+=============
+
+Operations
+----------
+
+find(string, Regex, start = 0, endpos = int.high): RegexMatch
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Finds the given pattern in the string between the end and start
+positions.
+
+``start``
+    The start point at which to start matching. ``|abc`` is ``0``;
+    ``a|bc`` is ``1``
+
+``endpos``
+    The maximum index for a match; ``int.high`` means the end of the
+    string, otherwise it’s an inclusive upper bound.
+
+match(string, Regex, start = 0, endpos = int.high): RegexMatch
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Like ```find(...)`` <#proc-find>`__, but anchored to the start of the
+string. This means that ``"foo".match(re"f") == true``, but
+``"foo".match(re"o") ==
+false``.
+
+iterator findIter(string, Regex, start = 0, endpos = int.high): RegexMatch
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Works the same as ```find(...)`` <#proc-find>`__, but finds every
+non-overlapping match. ``"2222".find(re"22")`` is ``"22", "22"``, not
+``"22", "22", "22"``.
+
+Arguments are the same as ```find(...)`` <#proc-find>`__
+
+Variants:
+
+-  ``proc findAll(...)`` returns a ``seq[string]``
+
+split(string, Regex, maxsplit = -1, start = 0): seq[string]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Splits the string with the given regex. This works according to the
+rules that Perl and Javascript use:
+
+-  If the match is zero-width, then the string is still split:
+   ``"123".split(r"") == @["1", "2", "3"]``.
+
+-  If the pattern has a capture in it, it is added after the string
+   split: ``"12".split(re"(\d)") == @["", "1", "", "2", ""]``.
+
+-  If ``maxsplit != -1``, then the string will only be split
+   ``maxsplit - 1`` times. This means that there will be ``maxsplit``
+   strings in the output seq.
+   ``"1.2.3".split(re"\.", maxsplit = 2) == @["1", "2.3"]``
+
+``start`` behaves the same as in ```find(...)`` <#proc-find>`__.
+
+replace(string, Regex, sub): string
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Replaces each match of Regex in the string with ``sub``, which should
+never be or return ``nil``.
+
+If ``sub`` is a ``proc (RegexMatch): string``, then it is executed with
+each match and the return value is the replacement value.
+
+If ``sub`` is a ``proc (string): string``, then it is executed with the
+full text of the match and and the return value is the replacement
+value.
+
+If ``sub`` is a string, the syntax is as follows:
+
+-  ``$$`` - literal ``$``
+
+-  ``$123`` - capture number ``123``
+
+-  ``$foo`` - named capture ``foo``
+
+-  ``${foo}`` - same as above
+
+-  ``$1$#`` - first and second captures
+
+-  ``$#`` - first capture
+
+-  ``$0`` - full match
+
+If a given capture is missing, a ``ValueError`` exception is thrown.
+
+escapeRe(string): string
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Escapes the string so it doesn’t match any special characters.
+Incompatible with the Extra flag (``X``).
+
+Option[RegexMatch]
+------------------
+
+Represents the result of an execution. On failure, it is
+``None[RegexMatch]``, but if you want automated derefrence, import
+``optional_t.nonstrict``. The available fields are as follows:
+
+``pattern: Regex``
+    the pattern that is being matched
+
+``str: string``
+    the string that was matched against
+
+``captures[]: string``
+    the string value of whatever was captured at that id. If the value
+    is invalid, then behavior is undefined. If the id is ``-1``, then
+    the whole match is returned. If the given capture was not matched,
+    ``nil`` is returned.
+
+    -  ``"abc".match(re"(\w)").captures[0] == "a"``
+
+    -  ``"abc".match(re"(?<letter>\w)").captures["letter"] == "a"``
+
+    -  ``"abc".match(re"(\w)\w").captures[-1] == "ab"``
+
+``captureBounds[]: Option[Slice[int]]``
+    gets the bounds of the given capture according to the same rules as
+    the above. If the capture is not filled, then ``None`` is returned.
+    The bounds are both inclusive.
+
+    -  ``"abc".match(re"(\w)").captureBounds[0] == 0 .. 0``
+
+    -  ``"abc".match(re"").captureBounds[-1] == 0 .. -1``
+
+    -  ``"abc".match(re"abc").captureBounds[-1] == 0 .. 2``
+
+``match: string``
+    the full text of the match.
+
+``matchBounds: Slice[int]``
+    the bounds of the match, as in ``captureBounds[]``
+
+``(captureBounds|captures).toTable``
+    returns a table with each named capture as a key.
+
+``(captureBounds|captures).toSeq``
+    returns all the captures by their number.
+
+``$: string``
+    same as ``match``
+
+Pattern
+-------
+
+Represents the pattern that things are matched against, constructed with
+``re(string, string)``. Examples: ``re"foo"``, ``re(r"foo # comment",
+"x")``, ``re"(?x)(*ANYCRLF)foo # comment"``. For more details
+on the leading option groups, see the `Option
+Setting <http://man7.org/linux/man-pages/man3/pcresyntax.3.html#OPTION_SETTING>`__
+and the `Newline
+Convention <http://man7.org/linux/man-pages/man3/pcresyntax.3.html#NEWLINE_CONVENTION>`__
+sections of the `PCRE syntax
+manual <http://man7.org/linux/man-pages/man3/pcresyntax.3.html>`__.
+
+``pattern: string``
+    the string that was used to create the pattern.
+
+``captureCount: int``
+    the number of captures that the pattern has.
+
+``captureNameId: Table[string, int]``
+    a table from the capture names to their numeric id.
+
+Flags
+~~~~~
+
+-  ``8`` - treat both the pattern and subject as UTF8
+
+-  ``9`` - prevents the pattern from being interpreted as UTF, no matter
+   what
+
+-  ``A`` - as if the pattern had a ``^`` at the beginning
+
+-  ``E`` - DOLLAR\_ENDONLY
+
+-  ``f`` - fails if there is not a match on the first line
+
+-  ``i`` - case insensitive
+
+-  ``m`` - multi-line, ``^`` and ``$`` match the beginning and end of
+   lines, not of the subject string
+
+-  ``N`` - turn off auto-capture, ``(?foo)`` is necessary to capture.
+
+-  ``s`` - ``.`` matches newline
+
+-  ``U`` - expressions are not greedy by default. ``?`` can be added to
+   a qualifier to make it greedy.
+
+-  ``u`` - same as ``8``
+
+-  ``W`` - Unicode character properties; ``\w`` matches ``к``.
+
+-  ``X`` - "Extra", character escapes without special meaning (``\w``
+   vs. ``\a``) are errors
+
+-  ``x`` - extended, comments (``#``) and newlines are ignored
+   (extended)
+
+-  ``Y`` - pcre.NO\_START\_OPTIMIZE,
+
+-  ```` - newlines are separated by ``\r``
+
+-  ```` - newlines are separated by ``\r\n`` (Windows default)
+
+-  ```` - newlines are separated by ``\n`` (UNIX default)
+
+-  ```` - newlines are separated by any of the above
+
+-  ```` - newlines are separated by any of the above and Unicode
+   newlines:
+
+    single characters VT (vertical tab, U+000B), FF (form feed, U+000C),
+    NEL (next line, U+0085), LS (line separator, U+2028), and PS
+    (paragraph separator, U+2029). For the 8-bit library, the last two
+    are recognized only in UTF-8 mode.
+
+    — man pcre
+
+-  ```` - ``\R`` matches CR, LF, or CRLF
+
+-  ```` - ``\R`` matches any unicode newline
+
+-  ```` - Javascript compatibility
+
+-  ```` - turn off studying; study is enabled by deafault
+
+Other Notes
+===========
+
+By default, NRE compiles it’s own PCRE. If this is undesirable, pass
+``-d:pcreDynlib`` to use whatever dynamic library is available on the
+system. This may have unexpected consequences if the dynamic library
+doesn’t have certain features enabled.
+
+|"NRE Logo"|
+
+.. |"NRE Logo"| image:: web/logo.png