Nim/doc/manual/lexing.txt

Lexical Analysis
================

Encoding
--------

All Nim source files are in the UTF-8 encoding (or its ASCII subset). Other
encodings are not supported. Any of the standard platform line termination
sequences can be used - the Unix form using ASCII LF (linefeed), the Windows
form using the ASCII sequence CR LF (return followed by linefeed), or the old
Macintosh form using the ASCII CR (return) character. All of these forms can be
used equally, regardless of platform.


Indentation
-----------

Nim's standard grammar describes an `indentation sensitive`:idx: language.
This means that all the control structures are recognized by indentation.
Indentation consists only of spaces; tabulators are not allowed.

The indentation handling is implemented as follows: The lexer annotates the
following token with the preceding number of spaces; indentation is not
a separate token. This trick allows parsing of Nim with only 1 token of
lookahead.

The parser uses a stack of indentation levels: the stack consists of integers
counting the spaces. The indentation information is queried at strategic
places in the parser but ignored otherwise: The pseudo terminal ``IND{>}``
denotes an indentation that consists of more spaces than the entry at the top
of the stack; IND{=} an indentation that has the same number of spaces. ``DED``
is another pseudo terminal that describes the *action* of popping a value
from the stack, ``IND{>}`` then implies to push onto the stack.

With this notation we can now easily define the core of the grammar: A block of
statements (simplified example)::

  ifStmt = 'if' expr ':' stmt
           (IND{=} 'elif' expr ':' stmt)*
           (IND{=} 'else' ':' stmt)?

  simpleStmt = ifStmt / ...

  stmt = IND{>} stmt ^+ IND{=} DED  # list of statements
       / simpleStmt                 # or a simple statement


Comments
--------

Comments start anywhere outside a string or character literal with the
hash character ``#``.
Comments consist of a concatenation of `comment pieces`:idx:. A comment piece
starts with ``#`` and runs until the end of the line. The end of line characters
belong to the piece. If the next line only consists of a comment piece with
no other tokens between it and the preceding one, it does not start a new
comment:


.. code-block:: nim
  i = 0     # This is a single comment over multiple lines.
    # The scanner merges these two pieces.
    # The comment continues here.


`Documentation comments`:idx: are comments that start with two ``##``.
Documentation comments are tokens; they are only allowed at certain places in
the input file as they belong to the syntax tree!


Identifiers & Keywords
----------------------

Identifiers in Nim can be any string of letters, digits
and underscores, beginning with a letter. Two immediate following
underscores ``__`` are not allowed::

  letter ::= 'A'..'Z' | 'a'..'z' | '\x80'..'\xff'
  digit ::= '0'..'9'
  IDENTIFIER ::= letter ( ['_'] (letter | digit) )*

Currently any Unicode character with an ordinal value > 127 (non ASCII) is
classified as a ``letter`` and may thus be part of an identifier but later
versions of the language may assign some Unicode characters to belong to the
operator characters instead.

The following keywords are reserved and cannot be used as identifiers:

.. code-block:: nim
   :file: keywords.txt

Some keywords are unused; they are reserved for future developments of the
language.


Identifier equality
-------------------

Two identifiers are considered equal if the following algorithm returns true:

.. code-block:: nim
  proc sameIdentifier(a, b: string): bool =
    a[0] == b[0] and
      a.replace(re"_|–", "").toLower == b.replace(re"_|–", "").toLower

That means only the first letters are compared in a case sensitive manner. Other
letters are compared case insensitively and underscores and en-dash (Unicode
point U+2013) are ignored.

This rather unorthodox way to do identifier comparisons is called
`partial case insensitivity`:idx: and has some advantages over the conventional
case sensitivity:

It allows programmers to mostly use their own preferred
spelling style, be it humpStyle, snake_style or dash–style and libraries written
by different programmers cannot use incompatible conventions.
A Nim-aware editor or IDE can show the identifiers as preferred.
Another advantage is that it frees the programmer from remembering
the exact spelling of an identifier. The exception with respect to the first
letter allows common code like ``var foo: Foo`` to be parsed unambiguously.

Historically, Nim was a fully `style-insensitive`:idx: language. This meant that
it was not case-sensitive and underscores were ignored and there was no even a
distinction between ``foo`` and ``Foo``.


String literals
---------------

Terminal symbol in the grammar: ``STR_LIT``.

String literals can be delimited by matching double quotes, and can
contain the following `escape sequences`:idx:\ :

==================         ===================================================
  Escape sequence          Meaning
==================         ===================================================
  ``\n``                   `newline`:idx:
  ``\r``, ``\c``           `carriage return`:idx:
  ``\l``                   `line feed`:idx:
  ``\f``                   `form feed`:idx:
  ``\t``                   `tabulator`:idx:
  ``\v``                   `vertical tabulator`:idx:
  ``\\``                   `backslash`:idx:
  ``\"``                   `quotation mark`:idx:
  ``\'``                   `apostrophe`:idx:
  ``\`` '0'..'9'+          `character with decimal value d`:idx:;
                           all decimal digits directly
                           following are used for the character
  ``\a``                   `alert`:idx:
  ``\b``                   `backspace`:idx:
  ``\e``                   `escape`:idx: `[ESC]`:idx:
  ``\x`` HH                `character with hex value HH`:idx:;
                           exactly two hex digits are allowed
==================         ===================================================


Strings in Nim may contain any 8-bit value, even embedded zeros. However
some operations may interpret the first binary zero as a terminator.


Triple quoted string literals
-----------------------------

Terminal symbol in the grammar: ``TRIPLESTR_LIT``.

String literals can also be delimited by three double quotes
``"""`` ... ``"""``.
Literals in this form may run for several lines, may contain ``"`` and do not
interpret any escape sequences.
For convenience, when the opening ``"""`` is followed by a newline (there may
be whitespace between the opening ``"""`` and the newline),
the newline (and the preceding whitespace) is not included in the string. The
ending of the string literal is defined by the pattern ``"""[^"]``, so this:

.. code-block:: nim
  """"long string within quotes""""

Produces::

  "long string within quotes"


Raw string literals
-------------------

Terminal symbol in the grammar: ``RSTR_LIT``.

There are also raw string literals that are preceded with the
letter ``r`` (or ``R``) and are delimited by matching double quotes (just
like ordinary string literals) and do not interpret the escape sequences.
This is especially convenient for regular expressions or Windows paths:

.. code-block:: nim

  var f = openFile(r"C:\texts\text.txt") # a raw string, so ``\t`` is no tab

To produce a single ``"`` within a raw string literal, it has to be doubled:

.. code-block:: nim

  r"a""b"

Produces::

  a"b

``r""""`` is not possible with this notation, because the three leading
quotes introduce a triple quoted string literal. ``r"""`` is the same
as ``"""`` since triple quoted string literals do not interpret escape
sequences either.


Generalized raw string literals
-------------------------------

Terminal symbols in the grammar: ``GENERALIZED_STR_LIT``,
``GENERALIZED_TRIPLESTR_LIT``.

The construct ``identifier"string literal"`` (without whitespace between the
identifier and the opening quotation mark) is a
generalized raw string literal. It is a shortcut for the construct
``identifier(r"string literal")``, so it denotes a procedure call with a
raw string literal as its only argument. Generalized raw string literals
are especially convenient for embedding mini languages directly into Nim
(for example regular expressions).

The construct ``identifier"""string literal"""`` exists too. It is a shortcut
for ``identifier("""string literal""")``.


Character literals
------------------

Character literals are enclosed in single quotes ``''`` and can contain the
same escape sequences as strings - with one exception: `newline`:idx: (``\n``)
is not allowed as it may be wider than one character (often it is the pair
CR/LF for example).  Here are the valid `escape sequences`:idx: for character
literals:

==================         ===================================================
  Escape sequence          Meaning
==================         ===================================================
  ``\r``, ``\c``           `carriage return`:idx:
  ``\l``                   `line feed`:idx:
  ``\f``                   `form feed`:idx:
  ``\t``                   `tabulator`:idx:
  ``\v``                   `vertical tabulator`:idx:
  ``\\``                   `backslash`:idx:
  ``\"``                   `quotation mark`:idx:
  ``\'``                   `apostrophe`:idx:
  ``\`` '0'..'9'+          `character with decimal value d`:idx:;
                           all decimal digits directly
                           following are used for the character
  ``\a``                   `alert`:idx:
  ``\b``                   `backspace`:idx:
  ``\e``                   `escape`:idx: `[ESC]`:idx:
  ``\x`` HH                `character with hex value HH`:idx:;
                           exactly two hex digits are allowed
==================         ===================================================

A character is not an Unicode character but a single byte. The reason for this
is efficiency: for the overwhelming majority of use-cases, the resulting
programs will still handle UTF-8 properly as UTF-8 was specially designed for
this. Another reason is that Nim can thus support ``array[char, int]`` or
``set[char]`` efficiently as many algorithms rely on this feature.  The `Rune`
type is used for Unicode characters, it can represent any Unicode character.
``Rune`` is declared in the `unicode module <unicode.html>`_.


Numerical constants
-------------------

Numerical constants are of a single type and have the form::

  hexdigit = digit | 'A'..'F' | 'a'..'f'
  octdigit = '0'..'7'
  bindigit = '0'..'1'
  HEX_LIT = '0' ('x' | 'X' ) hexdigit ( ['_'] hexdigit )*
  DEC_LIT = digit ( ['_'] digit )*
  OCT_LIT = '0' ('o' | 'c' | 'C') octdigit ( ['_'] octdigit )*
  BIN_LIT = '0' ('b' | 'B' ) bindigit ( ['_'] bindigit )*

  INT_LIT = HEX_LIT
          | DEC_LIT
          | OCT_LIT
          | BIN_LIT

  INT8_LIT = INT_LIT ['\''] ('i' | 'I') '8'
  INT16_LIT = INT_LIT ['\''] ('i' | 'I') '16'
  INT32_LIT = INT_LIT ['\''] ('i' | 'I') '32'
  INT64_LIT = INT_LIT ['\''] ('i' | 'I') '64'

  UINT_LIT = INT_LIT ['\''] ('u' | 'U')
  UINT8_LIT = INT_LIT ['\''] ('u' | 'U') '8'
  UINT16_LIT = INT_LIT ['\''] ('u' | 'U') '16'
  UINT32_LIT = INT_LIT ['\''] ('u' | 'U') '32'
  UINT64_LIT = INT_LIT ['\''] ('u' | 'U') '64'

  exponent = ('e' | 'E' ) ['+' | '-'] digit ( ['_'] digit )*
  FLOAT_LIT = digit (['_'] digit)* (('.' (['_'] digit)* [exponent]) |exponent)
  FLOAT32_SUFFIX = ('f' | 'F') ['32']
  FLOAT32_LIT = HEX_LIT '\'' FLOAT32_SUFFIX
              | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT32_SUFFIX
  FLOAT64_SUFFIX = ( ('f' | 'F') '64' ) | 'd' | 'D'
  FLOAT64_LIT = HEX_LIT '\'' FLOAT64_SUFFIX
              | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT64_SUFFIX


As can be seen in the productions, numerical constants can contain underscores
for readability. Integer and floating point literals may be given in decimal (no
prefix), binary (prefix ``0b``), octal (prefix ``0o`` or ``0c``) and hexadecimal
(prefix ``0x``) notation.

There exists a literal for each numerical type that is
defined. The suffix starting with an apostrophe ('\'') is called a
`type suffix`:idx:. Literals without a type suffix are of the type ``int``,
unless the literal contains a dot or ``E|e`` in which case it is of
type ``float``. For notational convenience the apostrophe of a type suffix
is optional if it is not ambiguous (only hexadecimal floating point literals
with a type suffix can be ambiguous).


The type suffixes are:

=================    =========================
  Type Suffix        Resulting type of literal
=================    =========================
  ``'i8``            int8
  ``'i16``           int16
  ``'i32``           int32
  ``'i64``           int64
  ``'u``             uint
  ``'u8``            uint8
  ``'u16``           uint16
  ``'u32``           uint32
  ``'u64``           uint64
  ``'f``             float32
  ``'d``             float64
  ``'f32``           float32
  ``'f64``           float64
  ``'f128``          float128
=================    =========================

Floating point literals may also be in binary, octal or hexadecimal
notation:
``0B0_10001110100_0000101001000111101011101111111011000101001101001001'f64``
is approximately 1.72826e35 according to the IEEE floating point standard.

Literals are bounds checked so that they fit the datatype. Non base-10
literals are used mainly for flags and bit pattern representations, therefore
bounds checking is done on bit width, not value range. If the literal fits in
the bit width of the datatype, it is accepted.
Hence: 0b10000000'u8 == 0x80'u8 == 128, but, 0b10000000'i8 == 0x80'i8 == -1
instead of causing an overflow error.

Operators
---------

Nim allows user defined operators. An operator is any combination of the
following characters::

       =     +     -     *     /     <     >
       @     $     ~     &     %     |
       !     ?     ^     .     :     \

These keywords are also operators:
``and or not xor shl shr div mod in notin is isnot of``.

`=`:tok:, `:`:tok:, `::`:tok: are not available as general operators; they
are used for other notational purposes.

``*:`` is as a special case the two tokens `*`:tok: and `:`:tok:
(to support ``var v*: T``).


Other tokens
------------

The following strings denote other tokens::

    `   (     )     {     }     [     ]     ,  ;   [.    .]  {.   .}  (.  .)


The `slice`:idx: operator `..`:tok: takes precedence over other tokens that
contain a dot: `{..}`:tok: are the three tokens `{`:tok:, `..`:tok:, `}`:tok:
and not the two tokens `{.`:tok:, `.}`:tok:.