mirror of
https://github.com/nim-lang/Nim.git
synced 2025-12-31 18:32:11 +00:00
* Corrected manual (Identifier equality) - Clarified that identifiers are only case insensitive for ASCII characters - Removed mention of dash-style, since it has been removed
417 lines
15 KiB
Plaintext
417 lines
15 KiB
Plaintext
Lexical Analysis
|
|
================
|
|
|
|
Encoding
|
|
--------
|
|
|
|
All Nim source files are in the UTF-8 encoding (or its ASCII subset). Other
|
|
encodings are not supported. Any of the standard platform line termination
|
|
sequences can be used - the Unix form using ASCII LF (linefeed), the Windows
|
|
form using the ASCII sequence CR LF (return followed by linefeed), or the old
|
|
Macintosh form using the ASCII CR (return) character. All of these forms can be
|
|
used equally, regardless of platform.
|
|
|
|
|
|
Indentation
|
|
-----------
|
|
|
|
Nim's standard grammar describes an `indentation sensitive`:idx: language.
|
|
This means that all the control structures are recognized by indentation.
|
|
Indentation consists only of spaces; tabulators are not allowed.
|
|
|
|
The indentation handling is implemented as follows: The lexer annotates the
|
|
following token with the preceding number of spaces; indentation is not
|
|
a separate token. This trick allows parsing of Nim with only 1 token of
|
|
lookahead.
|
|
|
|
The parser uses a stack of indentation levels: the stack consists of integers
|
|
counting the spaces. The indentation information is queried at strategic
|
|
places in the parser but ignored otherwise: The pseudo terminal ``IND{>}``
|
|
denotes an indentation that consists of more spaces than the entry at the top
|
|
of the stack; ``IND{=}`` an indentation that has the same number of spaces. ``DED``
|
|
is another pseudo terminal that describes the *action* of popping a value
|
|
from the stack, ``IND{>}`` then implies to push onto the stack.
|
|
|
|
With this notation we can now easily define the core of the grammar: A block of
|
|
statements (simplified example)::
|
|
|
|
ifStmt = 'if' expr ':' stmt
|
|
(IND{=} 'elif' expr ':' stmt)*
|
|
(IND{=} 'else' ':' stmt)?
|
|
|
|
simpleStmt = ifStmt / ...
|
|
|
|
stmt = IND{>} stmt ^+ IND{=} DED # list of statements
|
|
/ simpleStmt # or a simple statement
|
|
|
|
|
|
|
|
Comments
|
|
--------
|
|
|
|
Comments start anywhere outside a string or character literal with the
|
|
hash character ``#``.
|
|
Comments consist of a concatenation of `comment pieces`:idx:. A comment piece
|
|
starts with ``#`` and runs until the end of the line. The end of line characters
|
|
belong to the piece. If the next line only consists of a comment piece with
|
|
no other tokens between it and the preceding one, it does not start a new
|
|
comment:
|
|
|
|
|
|
.. code-block:: nim
|
|
i = 0 # This is a single comment over multiple lines.
|
|
# The scanner merges these two pieces.
|
|
# The comment continues here.
|
|
|
|
|
|
`Documentation comments`:idx: are comments that start with two ``##``.
|
|
Documentation comments are tokens; they are only allowed at certain places in
|
|
the input file as they belong to the syntax tree!
|
|
|
|
|
|
Multiline comments
|
|
------------------
|
|
|
|
Starting with version 0.13.0 of the language Nim supports multiline comments.
|
|
They look like:
|
|
|
|
.. code-block:: nim
|
|
#[Comment here.
|
|
Multiple lines
|
|
are not a problem.]#
|
|
|
|
Multiline comments support nesting:
|
|
|
|
.. code-block:: nim
|
|
#[ #[ Multiline comment in already
|
|
commented out code. ]#
|
|
proc p[T](x: T) = discard
|
|
]#
|
|
|
|
Multiline documentation comments also exist and support nesting too:
|
|
|
|
.. code-block:: nim
|
|
proc foo =
|
|
##[Long documentation comment
|
|
here.
|
|
]##
|
|
|
|
|
|
Identifiers & Keywords
|
|
----------------------
|
|
|
|
Identifiers in Nim can be any string of letters, digits
|
|
and underscores, beginning with a letter. Two immediate following
|
|
underscores ``__`` are not allowed::
|
|
|
|
letter ::= 'A'..'Z' | 'a'..'z' | '\x80'..'\xff'
|
|
digit ::= '0'..'9'
|
|
IDENTIFIER ::= letter ( ['_'] (letter | digit) )*
|
|
|
|
Currently any Unicode character with an ordinal value > 127 (non ASCII) is
|
|
classified as a ``letter`` and may thus be part of an identifier but later
|
|
versions of the language may assign some Unicode characters to belong to the
|
|
operator characters instead.
|
|
|
|
The following keywords are reserved and cannot be used as identifiers:
|
|
|
|
.. code-block:: nim
|
|
:file: ../keywords.txt
|
|
|
|
Some keywords are unused; they are reserved for future developments of the
|
|
language.
|
|
|
|
|
|
Identifier equality
|
|
-------------------
|
|
|
|
Two identifiers are considered equal if the following algorithm returns true:
|
|
|
|
.. code-block:: nim
|
|
proc sameIdentifier(a, b: string): bool =
|
|
a[0] == b[0] and
|
|
a.replace("_", "").toLowerAscii == b.replace("_", "").toLowerAscii
|
|
|
|
That means only the first letters are compared in a case sensitive manner. Other
|
|
letters are compared case insensitively within the ASCII range and underscores are ignored.
|
|
|
|
This rather unorthodox way to do identifier comparisons is called
|
|
`partial case insensitivity`:idx: and has some advantages over the conventional
|
|
case sensitivity:
|
|
|
|
It allows programmers to mostly use their own preferred
|
|
spelling style, be it humpStyle or snake_style, and libraries written
|
|
by different programmers cannot use incompatible conventions.
|
|
A Nim-aware editor or IDE can show the identifiers as preferred.
|
|
Another advantage is that it frees the programmer from remembering
|
|
the exact spelling of an identifier. The exception with respect to the first
|
|
letter allows common code like ``var foo: Foo`` to be parsed unambiguously.
|
|
|
|
Historically, Nim was a fully `style-insensitive`:idx: language. This meant that
|
|
it was not case-sensitive and underscores were ignored and there was no even a
|
|
distinction between ``foo`` and ``Foo``.
|
|
|
|
|
|
String literals
|
|
---------------
|
|
|
|
Terminal symbol in the grammar: ``STR_LIT``.
|
|
|
|
String literals can be delimited by matching double quotes, and can
|
|
contain the following `escape sequences`:idx:\ :
|
|
|
|
================== ===================================================
|
|
Escape sequence Meaning
|
|
================== ===================================================
|
|
``\n`` `newline`:idx:
|
|
``\r``, ``\c`` `carriage return`:idx:
|
|
``\l`` `line feed`:idx:
|
|
``\f`` `form feed`:idx:
|
|
``\t`` `tabulator`:idx:
|
|
``\v`` `vertical tabulator`:idx:
|
|
``\\`` `backslash`:idx:
|
|
``\"`` `quotation mark`:idx:
|
|
``\'`` `apostrophe`:idx:
|
|
``\`` '0'..'9'+ `character with decimal value d`:idx:;
|
|
all decimal digits directly
|
|
following are used for the character
|
|
``\a`` `alert`:idx:
|
|
``\b`` `backspace`:idx:
|
|
``\e`` `escape`:idx: `[ESC]`:idx:
|
|
``\x`` HH `character with hex value HH`:idx:;
|
|
exactly two hex digits are allowed
|
|
================== ===================================================
|
|
|
|
|
|
Strings in Nim may contain any 8-bit value, even embedded zeros. However
|
|
some operations may interpret the first binary zero as a terminator.
|
|
|
|
|
|
Triple quoted string literals
|
|
-----------------------------
|
|
|
|
Terminal symbol in the grammar: ``TRIPLESTR_LIT``.
|
|
|
|
String literals can also be delimited by three double quotes
|
|
``"""`` ... ``"""``.
|
|
Literals in this form may run for several lines, may contain ``"`` and do not
|
|
interpret any escape sequences.
|
|
For convenience, when the opening ``"""`` is followed by a newline (there may
|
|
be whitespace between the opening ``"""`` and the newline),
|
|
the newline (and the preceding whitespace) is not included in the string. The
|
|
ending of the string literal is defined by the pattern ``"""[^"]``, so this:
|
|
|
|
.. code-block:: nim
|
|
""""long string within quotes""""
|
|
|
|
Produces::
|
|
|
|
"long string within quotes"
|
|
|
|
|
|
Raw string literals
|
|
-------------------
|
|
|
|
Terminal symbol in the grammar: ``RSTR_LIT``.
|
|
|
|
There are also raw string literals that are preceded with the
|
|
letter ``r`` (or ``R``) and are delimited by matching double quotes (just
|
|
like ordinary string literals) and do not interpret the escape sequences.
|
|
This is especially convenient for regular expressions or Windows paths:
|
|
|
|
.. code-block:: nim
|
|
|
|
var f = openFile(r"C:\texts\text.txt") # a raw string, so ``\t`` is no tab
|
|
|
|
To produce a single ``"`` within a raw string literal, it has to be doubled:
|
|
|
|
.. code-block:: nim
|
|
|
|
r"a""b"
|
|
|
|
Produces::
|
|
|
|
a"b
|
|
|
|
``r""""`` is not possible with this notation, because the three leading
|
|
quotes introduce a triple quoted string literal. ``r"""`` is the same
|
|
as ``"""`` since triple quoted string literals do not interpret escape
|
|
sequences either.
|
|
|
|
|
|
Generalized raw string literals
|
|
-------------------------------
|
|
|
|
Terminal symbols in the grammar: ``GENERALIZED_STR_LIT``,
|
|
``GENERALIZED_TRIPLESTR_LIT``.
|
|
|
|
The construct ``identifier"string literal"`` (without whitespace between the
|
|
identifier and the opening quotation mark) is a
|
|
generalized raw string literal. It is a shortcut for the construct
|
|
``identifier(r"string literal")``, so it denotes a procedure call with a
|
|
raw string literal as its only argument. Generalized raw string literals
|
|
are especially convenient for embedding mini languages directly into Nim
|
|
(for example regular expressions).
|
|
|
|
The construct ``identifier"""string literal"""`` exists too. It is a shortcut
|
|
for ``identifier("""string literal""")``.
|
|
|
|
|
|
Character literals
|
|
------------------
|
|
|
|
Character literals are enclosed in single quotes ``''`` and can contain the
|
|
same escape sequences as strings - with one exception: `newline`:idx: (``\n``)
|
|
is not allowed as it may be wider than one character (often it is the pair
|
|
CR/LF for example). Here are the valid `escape sequences`:idx: for character
|
|
literals:
|
|
|
|
================== ===================================================
|
|
Escape sequence Meaning
|
|
================== ===================================================
|
|
``\r``, ``\c`` `carriage return`:idx:
|
|
``\l`` `line feed`:idx:
|
|
``\f`` `form feed`:idx:
|
|
``\t`` `tabulator`:idx:
|
|
``\v`` `vertical tabulator`:idx:
|
|
``\\`` `backslash`:idx:
|
|
``\"`` `quotation mark`:idx:
|
|
``\'`` `apostrophe`:idx:
|
|
``\`` '0'..'9'+ `character with decimal value d`:idx:;
|
|
all decimal digits directly
|
|
following are used for the character
|
|
``\a`` `alert`:idx:
|
|
``\b`` `backspace`:idx:
|
|
``\e`` `escape`:idx: `[ESC]`:idx:
|
|
``\x`` HH `character with hex value HH`:idx:;
|
|
exactly two hex digits are allowed
|
|
================== ===================================================
|
|
|
|
A character is not an Unicode character but a single byte. The reason for this
|
|
is efficiency: for the overwhelming majority of use-cases, the resulting
|
|
programs will still handle UTF-8 properly as UTF-8 was specially designed for
|
|
this. Another reason is that Nim can thus support ``array[char, int]`` or
|
|
``set[char]`` efficiently as many algorithms rely on this feature. The `Rune`
|
|
type is used for Unicode characters, it can represent any Unicode character.
|
|
``Rune`` is declared in the `unicode module <unicode.html>`_.
|
|
|
|
|
|
Numerical constants
|
|
-------------------
|
|
|
|
Numerical constants are of a single type and have the form::
|
|
|
|
hexdigit = digit | 'A'..'F' | 'a'..'f'
|
|
octdigit = '0'..'7'
|
|
bindigit = '0'..'1'
|
|
HEX_LIT = '0' ('x' | 'X' ) hexdigit ( ['_'] hexdigit )*
|
|
DEC_LIT = digit ( ['_'] digit )*
|
|
OCT_LIT = '0' ('o' | 'c' | 'C') octdigit ( ['_'] octdigit )*
|
|
BIN_LIT = '0' ('b' | 'B' ) bindigit ( ['_'] bindigit )*
|
|
|
|
INT_LIT = HEX_LIT
|
|
| DEC_LIT
|
|
| OCT_LIT
|
|
| BIN_LIT
|
|
|
|
INT8_LIT = INT_LIT ['\''] ('i' | 'I') '8'
|
|
INT16_LIT = INT_LIT ['\''] ('i' | 'I') '16'
|
|
INT32_LIT = INT_LIT ['\''] ('i' | 'I') '32'
|
|
INT64_LIT = INT_LIT ['\''] ('i' | 'I') '64'
|
|
|
|
UINT_LIT = INT_LIT ['\''] ('u' | 'U')
|
|
UINT8_LIT = INT_LIT ['\''] ('u' | 'U') '8'
|
|
UINT16_LIT = INT_LIT ['\''] ('u' | 'U') '16'
|
|
UINT32_LIT = INT_LIT ['\''] ('u' | 'U') '32'
|
|
UINT64_LIT = INT_LIT ['\''] ('u' | 'U') '64'
|
|
|
|
exponent = ('e' | 'E' ) ['+' | '-'] digit ( ['_'] digit )*
|
|
FLOAT_LIT = digit (['_'] digit)* (('.' (['_'] digit)* [exponent]) |exponent)
|
|
FLOAT32_SUFFIX = ('f' | 'F') ['32']
|
|
FLOAT32_LIT = HEX_LIT '\'' FLOAT32_SUFFIX
|
|
| (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT32_SUFFIX
|
|
FLOAT64_SUFFIX = ( ('f' | 'F') '64' ) | 'd' | 'D'
|
|
FLOAT64_LIT = HEX_LIT '\'' FLOAT64_SUFFIX
|
|
| (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT64_SUFFIX
|
|
|
|
|
|
As can be seen in the productions, numerical constants can contain underscores
|
|
for readability. Integer and floating point literals may be given in decimal (no
|
|
prefix), binary (prefix ``0b``), octal (prefix ``0o`` or ``0c``) and hexadecimal
|
|
(prefix ``0x``) notation.
|
|
|
|
There exists a literal for each numerical type that is
|
|
defined. The suffix starting with an apostrophe ('\'') is called a
|
|
`type suffix`:idx:. Literals without a type suffix are of the type ``int``,
|
|
unless the literal contains a dot or ``E|e`` in which case it is of
|
|
type ``float``. For notational convenience the apostrophe of a type suffix
|
|
is optional if it is not ambiguous (only hexadecimal floating point literals
|
|
with a type suffix can be ambiguous).
|
|
|
|
|
|
The type suffixes are:
|
|
|
|
================= =========================
|
|
Type Suffix Resulting type of literal
|
|
================= =========================
|
|
``'i8`` int8
|
|
``'i16`` int16
|
|
``'i32`` int32
|
|
``'i64`` int64
|
|
``'u`` uint
|
|
``'u8`` uint8
|
|
``'u16`` uint16
|
|
``'u32`` uint32
|
|
``'u64`` uint64
|
|
``'f`` float32
|
|
``'d`` float64
|
|
``'f32`` float32
|
|
``'f64`` float64
|
|
``'f128`` float128
|
|
================= =========================
|
|
|
|
Floating point literals may also be in binary, octal or hexadecimal
|
|
notation:
|
|
``0B0_10001110100_0000101001000111101011101111111011000101001101001001'f64``
|
|
is approximately 1.72826e35 according to the IEEE floating point standard.
|
|
|
|
Literals are bounds checked so that they fit the datatype. Non base-10
|
|
literals are used mainly for flags and bit pattern representations, therefore
|
|
bounds checking is done on bit width, not value range. If the literal fits in
|
|
the bit width of the datatype, it is accepted.
|
|
Hence: 0b10000000'u8 == 0x80'u8 == 128, but, 0b10000000'i8 == 0x80'i8 == -1
|
|
instead of causing an overflow error.
|
|
|
|
Operators
|
|
---------
|
|
|
|
Nim allows user defined operators. An operator is any combination of the
|
|
following characters::
|
|
|
|
= + - * / < >
|
|
@ $ ~ & % |
|
|
! ? ^ . : \
|
|
|
|
These keywords are also operators:
|
|
``and or not xor shl shr div mod in notin is isnot of``.
|
|
|
|
`=`:tok:, `:`:tok:, `::`:tok: are not available as general operators; they
|
|
are used for other notational purposes.
|
|
|
|
``*:`` is as a special case treated as the two tokens `*`:tok: and `:`:tok:
|
|
(to support ``var v*: T``).
|
|
|
|
|
|
Other tokens
|
|
------------
|
|
|
|
The following strings denote other tokens::
|
|
|
|
` ( ) { } [ ] , ; [. .] {. .} (. .)
|
|
|
|
|
|
The `slice`:idx: operator `..`:tok: takes precedence over other tokens that
|
|
contain a dot: `{..}`:tok: are the three tokens `{`:tok:, `..`:tok:, `}`:tok:
|
|
and not the two tokens `{.`:tok:, `.}`:tok:.
|
|
|