Lexical Analysis
================

Encoding
--------

All Nim source files are in the UTF-8 encoding (or its ASCII subset). Other
encodings are not supported. Any of the standard platform line termination
sequences can be used - the Unix form using ASCII LF (linefeed), the Windows
form using the ASCII sequence CR LF (return followed by linefeed), or the old
Macintosh form using the ASCII CR (return) character. All of these forms can be
used equally, regardless of platform.

Indentation
-----------

Nim's standard grammar describes an `indentation sensitive`:idx: language.
This means that all the control structures are recognized by indentation.
Indentation consists only of spaces; tabulators are not allowed.

The indentation handling is implemented as follows: The lexer annotates the
following token with the preceding number of spaces; indentation is not
a separate token. This trick allows parsing of Nim with only 1 token of
lookahead.

The parser uses a stack of indentation levels: the stack consists of integers
counting the spaces. The indentation information is queried at strategic
places in the parser but ignored otherwise: The pseudo terminal ``IND{>}``
denotes an indentation that consists of more spaces than the entry at the top
of the stack; ``IND{=}`` an indentation that has the same number of spaces.
``DED`` is another pseudo terminal that describes the *action* of popping a
value from the stack; ``IND{>}`` then implies a push onto the stack.

With this notation we can now easily define the core of the grammar: A block of
statements (simplified example)::

  ifStmt = 'if' expr ':' stmt
           (IND{=} 'elif' expr ':' stmt)*
           (IND{=} 'else' ':' stmt)?

  simpleStmt = ifStmt / ...

  stmt = IND{>} stmt ^+ IND{=} DED  # list of statements
       / simpleStmt                 # or a simple statement

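The stack discipline described above can be sketched in a few lines of Nim.
This is only an illustration and not the compiler's implementation: the names
``IndKind`` and ``indentEvents`` are hypothetical, the input is assumed to
use spaces only, and every dedent is assumed to return to a level that is
already on the stack.

.. code-block:: nim

  import std/strutils

  type
    IndKind = enum
      indGreater  # IND{>}: more spaces than the top of the stack (push)
      indEqual    # IND{=}: same number of spaces as the top of the stack
      dedent      # DED: one value popped from the stack

  proc indentEvents(source: string): seq[IndKind] =
    ## Hypothetical helper: lists the indentation events of all non-blank lines.
    var stack = @[0]
    for line in source.splitLines:
      if line.strip.len == 0: continue  # blank lines carry no indentation
      var spaces = 0
      while spaces < line.len and line[spaces] == ' ': inc spaces
      if spaces > stack[^1]:
        stack.add spaces                # IND{>} implies a push
        result.add indGreater
      else:
        while spaces < stack[^1]:       # one DED per popped level
          discard stack.pop()
          result.add dedent
        result.add indEqual

  let events = indentEvents("if a:\n  b\n  c\nd")
  doAssert events == @[indEqual, indGreater, indEqual, dedent, indEqual]

A real parser would query this information only at the places where the
grammar mentions ``IND{>}``, ``IND{=}`` or ``DED``; the sketch reports it for
every line to keep the example short.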
Comments
--------

Comments start anywhere outside a string or character literal with the
hash character ``#``.
Comments consist of a concatenation of `comment pieces`:idx:. A comment piece
starts with ``#`` and runs until the end of the line. The end of line characters
belong to the piece. If the next line only consists of a comment piece with
no other tokens between it and the preceding one, it does not start a new
comment:


.. code-block:: nim
  i = 0     # This is a single comment over multiple lines.
    # The scanner merges these two pieces.
    # The comment continues here.


`Documentation comments`:idx: are comments that start with ``##``.
Documentation comments are tokens; they are only allowed at certain places in
the input file as they belong to the syntax tree!

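As a small illustration of the difference between the two kinds of comment,
the following sketch places a documentation comment at the start of a routine
body, which is one of the places where such a token is allowed. The proc
``square`` is a hypothetical example and not part of any library.

.. code-block:: nim

  proc square(x: int): int =
    ## Returns ``x`` squared. This documentation comment is a token
    ## and belongs to the syntax tree of ``square``.
    # An ordinary comment.
    result = x * x  # a comment piece after other tokens

  doAssert square(4) == 16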
Identifiers & Keywords
----------------------

Identifiers in Nim can be any string of letters, digits
and underscores, beginning with a letter. Two immediately following
underscores ``__`` are not allowed::

  letter ::= 'A'..'Z' | 'a'..'z' | '\x80'..'\xff'
  digit ::= '0'..'9'
  IDENTIFIER ::= letter ( ['_'] (letter | digit) )*

Currently any Unicode character with an ordinal value > 127 (non ASCII) is
classified as a ``letter`` and may thus be part of an identifier but later
versions of the language may assign some Unicode characters to belong to the
operator characters instead.

The following keywords are reserved and cannot be used as identifiers:

.. code-block:: nim
  :file: keywords.txt

Some keywords are unused; they are reserved for future developments of the
language.

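The ``IDENTIFIER`` production can be checked by hand with a short routine.
The following is a sketch under the assumption that the input is a byte
string; ``isIdentifier`` is a hypothetical helper, and it does not check
whether the string is a reserved keyword.

.. code-block:: nim

  proc isIdentifier(s: string): bool =
    ## Hypothetical helper following the IDENTIFIER production.
    const
      letters = {'A'..'Z', 'a'..'z', '\x80'..'\xff'}
      digits = {'0'..'9'}
    if s.len == 0 or s[0] notin letters: return false
    var i = 1
    while i < s.len:
      if s[i] == '_': inc i             # an optional single underscore ...
      if i >= s.len or s[i] notin letters + digits:
        return false                    # ... must precede a letter or digit
      inc i
    true

  doAssert isIdentifier("foo_bar2")
  doAssert not isIdentifier("foo__bar")  # two underscores in a row
  doAssert not isIdentifier("_foo")      # must begin with a letter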
Identifier equality
-------------------

Two identifiers are considered equal if the following algorithm returns true:

.. code-block:: nim
  proc sameIdentifier(a, b: string): bool =
    a[0] == b[0] and
      a.replace(re"_|–", "").toLower == b.replace(re"_|–", "").toLower

That means only the first letters are compared in a case sensitive manner. Other
letters are compared case insensitively and underscores and en-dashes (Unicode
code point U+2013) are ignored.

This rather unorthodox way to do identifier comparisons is called
`partial case insensitivity`:idx: and has some advantages over the conventional
case sensitivity:

It allows programmers to mostly use their own preferred
spelling style, be it humpStyle, snake_style or dash–style, and libraries written
by different programmers cannot use incompatible conventions.
A Nim-aware editor or IDE can show the identifiers as preferred.
Another advantage is that it frees the programmer from remembering
the exact spelling of an identifier. The exception with respect to the first
letter allows common code like ``var foo: Foo`` to be parsed unambiguously.

Historically, Nim was a fully `style-insensitive`:idx: language. This meant that
it was not case-sensitive, underscores were ignored, and there was not even a
distinction between ``foo`` and ``Foo``.

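The comparison rule can be tried out with the standard ``strutils`` module
alone. The following is a simplified sketch, not the compiler's code:
``sameIdentifierSketch`` is a hypothetical name, only ASCII letters are
lowercased, the en-dash rule is left out, and both arguments are assumed to
be non-empty.

.. code-block:: nim

  import std/strutils

  proc sameIdentifierSketch(a, b: string): bool =
    ## Hypothetical, ASCII-only variant of the algorithm in this section.
    a[0] == b[0] and
      a.replace("_", "").toLowerAscii == b.replace("_", "").toLowerAscii

  doAssert sameIdentifierSketch("fooBar", "foo_bar")
  doAssert sameIdentifierSketch("fooBar", "fOOBAR")
  doAssert not sameIdentifierSketch("foo", "Foo")  # first letter differs in case

  # The same rule applied by the compiler itself:
  let fooBar = 1
  doAssert foo_bar == 1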
String literals
---------------

Terminal symbol in the grammar: ``STR_LIT``.

String literals can be delimited by matching double quotes, and can
contain the following `escape sequences`:idx:\ :

==================  ===================================================
Escape sequence     Meaning
==================  ===================================================
``\n``              `newline`:idx:
``\r``, ``\c``      `carriage return`:idx:
``\l``              `line feed`:idx:
``\f``              `form feed`:idx:
``\t``              `tabulator`:idx:
``\v``              `vertical tabulator`:idx:
``\\``              `backslash`:idx:
``\"``              `quotation mark`:idx:
``\'``              `apostrophe`:idx:
``\`` '0'..'9'+     `character with decimal value d`:idx:;
                    all decimal digits directly
                    following are used for the character
``\a``              `alert`:idx:
``\b``              `backspace`:idx:
``\e``              `escape`:idx: `[ESC]`:idx:
``\x`` HH           `character with hex value HH`:idx:;
                    exactly two hex digits are allowed
==================  ===================================================


Strings in Nim may contain any 8-bit value, even embedded zeros. However,
some operations may interpret the first binary zero as a terminator.

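A few of the escape sequences from the table can be checked directly. The
assertions below are only a sketch; they assume a compiler that implements
the table as given and deliberately avoid ``\n``.

.. code-block:: nim

  doAssert "\x41\66" == "AB"     # hex value 41 and decimal value 66
  doAssert "tab\there".len == 8  # \t is a single character
  doAssert "\r\l" == "\c\l"      # \r and \c both denote carriage return
  doAssert "\e"[0] == '\x1B'     # [ESC]
  doAssert "\"" == $'"'          # an escaped quotation mark
  doAssert "a\0b".len == 3       # an embedded zero is kept in the string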
Triple quoted string literals
-----------------------------

Terminal symbol in the grammar: ``TRIPLESTR_LIT``.

String literals can also be delimited by three double quotes
``"""`` ... ``"""``.
Literals in this form may run for several lines, may contain ``"`` and do not
interpret any escape sequences.
For convenience, when the opening ``"""`` is followed by a newline (there may
be whitespace between the opening ``"""`` and the newline),
the newline (and the preceding whitespace) is not included in the string. The
ending of the string literal is defined by the pattern ``"""[^"]``, so this:

.. code-block:: nim
  """"long string within quotes""""

Produces::

  "long string within quotes"

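The rules for the leading newline and for uninterpreted escape sequences can
be seen in the following sketch. It assumes nothing beyond the rules in this
section; the variable name ``s`` is arbitrary.

.. code-block:: nim

  # The newline directly after the opening quotes is not part of the string,
  # and \n inside the literal stays a backslash followed by the letter n.
  let s = """
  first "quoted" line
  second \n line"""
  doAssert s == "first \"quoted\" line\nsecond \\n line"

  doAssert """"long string within quotes"""" == "\"long string within quotes\""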
Raw string literals
-------------------

Terminal symbol in the grammar: ``RSTR_LIT``.

There are also raw string literals that are preceded with the
letter ``r`` (or ``R``) and are delimited by matching double quotes (just
like ordinary string literals) and do not interpret the escape sequences.
This is especially convenient for regular expressions or Windows paths:

.. code-block:: nim

  var f = openFile(r"C:\texts\text.txt") # a raw string, so ``\t`` is no tab

To produce a single ``"`` within a raw string literal, it has to be doubled:

.. code-block:: nim

  r"a""b"

Produces::

  a"b

``r""""`` is not possible with this notation, because the three leading
quotes introduce a triple quoted string literal. ``r"""`` is the same
as ``"""`` since triple quoted string literals do not interpret escape
sequences either.

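The following assertions are a minimal check of the rules above; they compare
raw string literals with ordinary string literals that spell out the same
characters.

.. code-block:: nim

  doAssert r"\n" == "\\n"                   # two characters, not a newline
  doAssert len(r"C:\texts\text.txt") == 17  # every backslash is kept
  doAssert r"a""b" == "a\"b"                # a doubled quote yields one quote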
Generalized raw string literals
-------------------------------

Terminal symbols in the grammar: ``GENERALIZED_STR_LIT``,
``GENERALIZED_TRIPLESTR_LIT``.

The construct ``identifier"string literal"`` (without whitespace between the
identifier and the opening quotation mark) is a
generalized raw string literal. It is a shortcut for the construct
``identifier(r"string literal")``, so it denotes a procedure call with a
raw string literal as its only argument. Generalized raw string literals
are especially convenient for embedding mini languages directly into Nim
(for example regular expressions).

The construct ``identifier"""string literal"""`` exists too. It is a shortcut
for ``identifier("""string literal""")``.

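Any procedure that takes a single string can be called this way. In the
sketch below, ``path`` is a hypothetical procedure that exists only to show
the call syntax; it is not part of the standard library.

.. code-block:: nim

  proc path(s: string): string =
    ## Hypothetical procedure used only to show the call syntax.
    "<" & s & ">"

  doAssert path"C:\temp" == path(r"C:\temp")  # the two forms are equivalent
  doAssert path"C:\temp" == "<C:\\temp>"      # \t is not interpreted
  doAssert path"""a "quoted" word""" == "<a \"quoted\" word>"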
Character literals
------------------

Character literals are enclosed in single quotes ``''`` and can contain the
same escape sequences as strings - with one exception: `newline`:idx: (``\n``)
is not allowed as it may be wider than one character (often it is the pair
CR/LF for example). Here are the valid `escape sequences`:idx: for character
literals:

==================  ===================================================
Escape sequence     Meaning
==================  ===================================================
``\r``, ``\c``      `carriage return`:idx:
``\l``              `line feed`:idx:
``\f``              `form feed`:idx:
``\t``              `tabulator`:idx:
``\v``              `vertical tabulator`:idx:
``\\``              `backslash`:idx:
``\"``              `quotation mark`:idx:
``\'``              `apostrophe`:idx:
``\`` '0'..'9'+     `character with decimal value d`:idx:;
                    all decimal digits directly
                    following are used for the character
``\a``              `alert`:idx:
``\b``              `backspace`:idx:
``\e``              `escape`:idx: `[ESC]`:idx:
``\x`` HH           `character with hex value HH`:idx:;
                    exactly two hex digits are allowed
==================  ===================================================

A character is not a Unicode character but a single byte. The reason for this
is efficiency: for the overwhelming majority of use-cases, the resulting
programs will still handle UTF-8 properly as UTF-8 was specially designed for
this. Another reason is that Nim can thus support ``array[char, int]`` or
``set[char]`` efficiently as many algorithms rely on this feature. The ``Rune``
type is used for Unicode characters; it can represent any Unicode character.
``Rune`` is declared in the `unicode module <unicode.html>`_.

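The difference between bytes and Unicode characters shows up as soon as a
string holds a non-ASCII character. The sketch below assumes the ``unicode``
module's ``runeLen`` and ``runeAt`` procs; the string is written with hex
escapes so that the example does not depend on the encoding of this file.

.. code-block:: nim

  import std/unicode

  doAssert '\65' == 'A' and '\x41' == 'A'
  doAssert sizeof(char) == 1

  let s = "\xC3\xA9"                # UTF-8 encoding of U+00E9
  doAssert s.len == 2               # len counts bytes
  doAssert s.runeLen == 1           # runeLen counts Unicode characters
  doAssert s.runeAt(0) == Rune(0xE9)

  var counts: array[char, int]      # one slot per possible byte value
  for c in s: inc counts[c]
  doAssert counts['\xC3'] == 1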
Numerical constants
-------------------

Numerical constants are of a single type and have the form::

  hexdigit = digit | 'A'..'F' | 'a'..'f'
  octdigit = '0'..'7'
  bindigit = '0'..'1'
  HEX_LIT = '0' ('x' | 'X' ) hexdigit ( ['_'] hexdigit )*
  DEC_LIT = digit ( ['_'] digit )*
  OCT_LIT = '0' ('o' | 'c' | 'C') octdigit ( ['_'] octdigit )*
  BIN_LIT = '0' ('b' | 'B' ) bindigit ( ['_'] bindigit )*

  INT_LIT = HEX_LIT
          | DEC_LIT
          | OCT_LIT
          | BIN_LIT

  INT8_LIT = INT_LIT ['\''] ('i' | 'I') '8'
  INT16_LIT = INT_LIT ['\''] ('i' | 'I') '16'
  INT32_LIT = INT_LIT ['\''] ('i' | 'I') '32'
  INT64_LIT = INT_LIT ['\''] ('i' | 'I') '64'

  UINT_LIT = INT_LIT ['\''] ('u' | 'U')
  UINT8_LIT = INT_LIT ['\''] ('u' | 'U') '8'
  UINT16_LIT = INT_LIT ['\''] ('u' | 'U') '16'
  UINT32_LIT = INT_LIT ['\''] ('u' | 'U') '32'
  UINT64_LIT = INT_LIT ['\''] ('u' | 'U') '64'

  exponent = ('e' | 'E' ) ['+' | '-'] digit ( ['_'] digit )*
  FLOAT_LIT = digit (['_'] digit)* (('.' (['_'] digit)* [exponent]) |exponent)
  FLOAT32_SUFFIX = ('f' | 'F') ['32']
  FLOAT32_LIT = HEX_LIT '\'' FLOAT32_SUFFIX
              | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT32_SUFFIX
  FLOAT64_SUFFIX = ( ('f' | 'F') '64' ) | 'd' | 'D'
  FLOAT64_LIT = HEX_LIT '\'' FLOAT64_SUFFIX
              | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT64_SUFFIX


As can be seen in the productions, numerical constants can contain underscores
for readability. Integer and floating point literals may be given in decimal (no
prefix), binary (prefix ``0b``), octal (prefix ``0o`` or ``0c``) and hexadecimal
(prefix ``0x``) notation.

There exists a literal for each numerical type that is
defined. The suffix starting with an apostrophe (``'``) is called a
`type suffix`:idx:. Literals without a type suffix are of the type ``int``,
unless the literal contains a dot or ``E|e`` in which case it is of
type ``float``. For notational convenience the apostrophe of a type suffix
is optional if it is not ambiguous (only hexadecimal floating point literals
with a type suffix can be ambiguous).


The type suffixes are:

=================  =========================
Type Suffix        Resulting type of literal
=================  =========================
``'i8``            int8
``'i16``           int16
``'i32``           int32
``'i64``           int64
``'u``             uint
``'u8``            uint8
``'u16``           uint16
``'u32``           uint32
``'u64``           uint64
``'f``             float32
``'d``             float64
``'f32``           float32
``'f64``           float64
``'f128``          float128
=================  =========================

Floating point literals may also be in binary, octal or hexadecimal
notation:
``0B0_10001110100_0000101001000111101011101111111011000101001101001001'f64``
is approximately 1.72826e35 according to the IEEE floating point standard.

Literals are bounds checked so that they fit the datatype. Non base-10
literals are used mainly for flags and bit pattern representations, therefore
bounds checking is done on bit width, not value range. If the literal fits in
the bit width of the datatype, it is accepted.
Hence ``0b10000000'u8 == 0x80'u8 == 128``, but ``0b10000000'i8 == 0x80'i8 == -128``
(the bit pattern is reinterpreted as a signed value) instead of causing an
overflow error.

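The literal forms and type suffixes can be checked with a few assertions.
This is a sketch that assumes a compiler implementing the rules of this
section; it avoids the ``0c`` prefix and the ``'f128`` suffix, and it
compares against ``low(int8)`` rather than writing a negative ``int8``
literal.

.. code-block:: nim

  doAssert 1_000_000 == 1000000   # underscores are only for readability
  doAssert 0xFF == 255
  doAssert 0b1010 == 10
  doAssert 0o17 == 15
  doAssert 1e3 == 1000.0

  doAssert 5 is int               # no suffix, no dot, no exponent
  doAssert 2.0 is float
  doAssert 5'u16 is uint16
  doAssert 2'f32 is float32
  doAssert 2'f64 is float64

  # Non base-10 literals are checked on bit width, not on value range.
  doAssert 0x80'u8 == 128'u8
  doAssert 0x80'i8 == low(int8)
  doAssert int(0b10000000'i8) == -128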
Operators
---------

Nim allows user defined operators. An operator is any combination of the
following characters::

       =     +     -     *     /     <     >
       @     $     ~     &     %     |
       !     ?     ^     .     :     \

These keywords are also operators:
``and or not xor shl shr div mod in notin is isnot of``.

`=`:tok:, `:`:tok:, `::`:tok: are not available as general operators; they
are used for other notational purposes.

As a special case, ``*:`` is treated as the two tokens `*`:tok: and `:`:tok:
(to support ``var v*: T``).

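A user defined operator is declared like any other routine, with its name
enclosed in backticks. The operator ``+-`` below is a hypothetical example
built from two of the characters listed above.

.. code-block:: nim

  proc `+-`(a, b: int): (int, int) =
    ## Hypothetical operator: returns the sum and the difference.
    (a + b, a - b)

  doAssert 5 +- 3 == (8, 2)     # infix use
  doAssert `+-`(5, 3) == (8, 2) # ordinary call syntax
  doAssert 7 div 2 == 3         # ``div`` is a keyword operator
  doAssert 7 mod 2 == 1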
Other tokens
------------

The following strings denote other tokens::

    `   (     )     {     }     [     ]     ,  ;   [.    .]  {.   .}  (.  .)


The `slice`:idx: operator `..`:tok: takes precedence over other tokens that
contain a dot: `{..}`:tok: are the three tokens `{`:tok:, `..`:tok:, `}`:tok:
and not the two tokens `{.`:tok:, `.}`:tok:.