document how the incremental compilation scheme could work

2026-02-17 16:38:33 +00:00 · 2018-06-01 22:11:32 +02:00
parent 61fb83ecbb
commit cae1973856
4 changed files with 110 additions and 51 deletions
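
The "AST replay" approach that this commit documents in ``doc/intern.txt`` can be illustrated with a minimal, self-contained sketch. It is not taken from the compiler: the names ``ShallowStmt``, ``ModuleCache``, ``Session`` and ``replay`` are hypothetical, and a ``Table`` stands in for the SQLite database. It assumes that only the shallow top level statements are stored per module, and shows that replaying them restores side effects such as the ``{.compile.}`` pragma and follows ``import`` statements.

```nim
# Hypothetical sketch; none of these names exist in the Nim compiler.
import std/tables

type
  StmtKind = enum
    skImport, skCompilePragma, skProcDecl
  ShallowStmt = object
    kind: StmtKind
    arg: string            # module name, C file name, or proc name
  ModuleCache = Table[string, seq[ShallowStmt]]  # stand-in for the SQLite DB
  Session = object
    toCompile: seq[string] # C files that must be compiled and linked
    loaded: seq[string]    # modules in the order they were replayed

proc replay(s: var Session; cache: ModuleCache; name: string) =
  ## Re-applies the cached top level statements of module `name`.
  if name in s.loaded: return
  s.loaded.add name
  for st in cache.getOrDefault(name):
    case st.kind
    of skImport: replay(s, cache, st.arg)       # follow the dependency
    of skCompilePragma: s.toCompile.add st.arg  # side effect is restored
    of skProcDecl: discard                      # body would be loaded lazily

when isMainModule:
  var cache: ModuleCache
  cache["m"] = @[
    ShallowStmt(kind: skImport, arg: "strutils"),
    ShallowStmt(kind: skCompilePragma, arg: "foo.c"),
    ShallowStmt(kind: skProcDecl, arg: "p")]
  var s = Session()
  replay(s, cache, "m")
  doAssert s.toCompile == @["foo.c"]
  doAssert s.loaded == @["m", "strutils"]
```

The point of the design is that no pragma needs special-case handling when a module comes from the cache: whatever a top level statement did during the original compilation, it does again during the replay.
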
--- a/compiler/lineinfos.nim
+++ b/compiler/lineinfos.nim
@@ -246,10 +246,10 @@ const trackPosInvalidFileIdx* = FileIndex(-2) # special marker so that no sugges
                                   # are produced within comments and string literals

 type
-  MsgConfig* = object
+  MsgConfig* = object ## does not need to be stored in the incremental cache
    trackPos*: TLineInfo
-    trackPosAttached*: bool ## whether the tracking position was attached to some
-                            ## close token.
+    trackPosAttached*: bool ## whether the tracking position was attached to
+                            ## some close token.

    errorOutputs*: TErrorOutputs
    msgContext*: seq[TLineInfo]
--- a/compiler/modulegraphs.nim
+++ b/compiler/modulegraphs.nim
@@ -47,7 +47,7 @@ type
    doStopCompile*: proc(): bool {.closure.}
    usageSym*: PSym # for nimsuggest
    owners*: seq[PSym]
-    methods*: seq[tuple[methods: TSymSeq, dispatcher: PSym]]
+    methods*: seq[tuple[methods: TSymSeq, dispatcher: PSym]] # needs serialization!
    systemModule*: PSym
    sysTypes*: array[TTypeKind, PType]
    compilerprocs*: TStrTable
--- a/compiler/options.nim
+++ b/compiler/options.nim
@@ -156,24 +156,27 @@ type
    version*: int
  Suggestions* = seq[Suggest]

-  ConfigRef* = ref object ## eventually all global configuration should be moved here
-    target*: Target
+  ConfigRef* = ref object ## holds all global configuration;
+                          ## fields marked with '(*)' are subject to
+                          ## the incremental compilation mechanisms,
+                          ## (+) means "part of the dependency"
+    target*: Target       # (+)
    linesCompiled*: int  # all lines that have been compiled
-    options*: TOptions
-    globalOptions*: TGlobalOptions
+    options*: TOptions    # (+)
+    globalOptions*: TGlobalOptions # (+)
    m*: MsgConfig
    evalTemplateCounter*: int
    evalMacroCounter*: int
    exitcode*: int8
    cmd*: TCommands  # the command
-    selectedGC*: TGCMode       # the selected GC
+    selectedGC*: TGCMode       # the selected GC (+)
    verbosity*: int            # how verbose the compiler is
    numberOfProcessors*: int   # number of processors
    evalExpr*: string          # expression for idetools --eval
    lastCmdTime*: float        # when caas is enabled, we measure each command
    symbolFiles*: SymbolFilesOption

-    cppDefines*: HashSet[string]
+    cppDefines*: HashSet[string] # (*)
    headerFile*: string
    features*: set[Feature]
    arguments*: string ## the arguments to be passed to the program that
@@ -220,13 +223,13 @@ type
    cLinkedLibs*: seq[string]  # libraries to link

    externalToLink*: seq[string]  # files to link in addition to the file
-                                  # we compiled
+                                  # we compiled (*)
    linkOptionsCmd*: string
    compileOptionsCmd*: seq[string]
-    linkOptions*: string
-    compileOptions*: string
+    linkOptions*: string          # (*)
+    compileOptions*: string       # (*)
    ccompilerpath*: string
-    toCompile*: CfileList
+    toCompile*: CfileList         # (*)
    suggestionResultHook*: proc (result: Suggest) {.closure.}
    suggestVersion*: int
    suggestMaxResults*: int
--- a/doc/intern.txt
+++ b/doc/intern.txt
@@ -38,10 +38,6 @@ Path           Purpose
 Bootstrapping the compiler
 ==========================

-As of version 0.8.5 the compiler is maintained in Nim. (The first versions
-have been implemented in Object Pascal.) The Python-based build system has
-been rewritten in Nim too.
-
 Compiling the compiler is a simple matter of running::

  nim c koch.nim
@@ -202,16 +198,86 @@ Compilation cache
 =================

 The implementation of the compilation cache is tricky: There are lots
-of issues to be solved for the front- and backend. In the following
-sections *global* means *shared between modules* or *property of the whole
-program*.
+of issues to be solved for the front- and backend.
+
+
+General approach: AST replay
+----------------------------
+
+We store the AST of a module that passed semantic checking in a SQLite
+database. Plenty of features require a subsequence of the module's
+statements to be re-applied, for example:
+
+.. code-block:: nim
+  {.compile: "foo.c".} # even if the module is loaded from the DB,
+                       # "foo.c" needs to be compiled/linked.
+
+The solution is to **re-play** the module's top level statements.
+This solves the problem without having to special case the logic
+that fills the internal seqs which are affected by the pragmas.
+
+In fact, this describes how the AST should be stored in the database,
+as a "shallow" tree. Let's assume we compile module ``m`` with the
+following contents:
+
+.. code-block:: nim
+  import strutils
+
+  var x*: int = 90
+  {.compile: "foo.c".}
+  proc p = echo "p"
+  proc q = echo "q"
+  static:
+    echo "static"
+
+Conceptually this is the AST we store for the module:
+
+.. code-block:: nim
+  import strutils
+
+  var x*
+  {.compile: "foo.c".}
+  proc p
+  proc q
+  static:
+    echo "static"
+
+The symbol's ``ast`` field is loaded lazily, on demand. This is where most
+savings come from: only the shallow outer AST is reconstructed immediately.
+
+It is also important that the replay involves the ``import`` statement so
+that the dependencies are resolved properly.
+
+
+Shared global compiletime state
+-------------------------------
+
+Nim allows ``.global, compiletime`` variables that can be filled by macro
+invocations across different modules. This feature breaks modularity in a
+severe way. Plenty of different solutions have been proposed:
+
+- Restrict the types of global compiletime variables to ``Set[T]`` or
+  similar unordered, only-growable collections so that we can track
+  the module's write effects to these variables and reapply the changes
+  in a different order.
+- In every module compilation, reset the variable to its default value.
+- Provide a restrictive API that can load/save the compiletime state to
+  a file.
+
+(These solutions are not mutually exclusive.)
+
+Since we adopt the "replay the top level statements" idea, the natural
+solution to this problem is to emit pseudo top level statements that
+reflect the mutations done to the global variable.


-Frontend issues
----------------

 Methods and type converters
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+---------------------------
+
+In the following
+sections *global* means *shared between modules* or *property of the whole
+program*.

 Nim contains language features that are *global*. The best example for that
 are multi methods: Introducing a new method with the same name and some
@@ -238,20 +304,17 @@ If in the above example module ``B`` is re-compiled, but ``A`` is not then
 ``B`` needs to be aware of ``toBool`` even though  ``toBool`` is not referenced
 in ``B`` *explicitly*.

-Both the multi method and the type converter problems are solved by storing
-them in special sections in the ROD file that are loaded *unconditionally*
-when the ROD file is read.
+Both the multi method and the type converter problems are solved by the
+AST replay implementation.
+

 Generics
 ~~~~~~~~

-If we generate an instance of a generic, we'd like to re-use that
-instance if possible across module boundaries. However, this is not
-possible if the compilation cache is enabled. So we give up then and use
-the caching of generics only per module, not per project. This means that
-``--symbolFiles:on`` hurts a bit for efficiency. A better solution would
-be to persist the instantiations in a global cache per project. This might be
-implemented in later versions.
+We cache generic instantiations and need to ensure this caching works
+well with the incremental compilation feature. Since the cache is
+attached to the ``PSym`` data structure, it should work without any
+special logic.


 Backend issues
@@ -259,13 +322,10 @@ Backend issues

 - Init procs must not be "forgotten" to be called.
 - Files must not be "forgotten" to be linked.
-- Anything that is contained in ``nim__dat.c`` is shared between modules
-  implicitly.
 - Method dispatchers are global.
 - DLL loading via ``dlsym`` is global.
 - Emulated thread vars are global.

-
 However the biggest problem is that dead code elimination breaks modularity!
 To see why, consider this scenario: The module ``G`` (for example the huge
 Gtk2 module...) is compiled with dead code elimination turned on. So none
@@ -274,25 +334,21 @@ of ``G``'s procs is generated at all.
 Then module ``B`` is compiled that requires ``G.P1``. Ok, no problem,
 ``G.P1`` is loaded from the symbol file and ``G.c`` now contains ``G.P1``.

-Then module ``A`` (that depends onto ``B`` and ``G``) is compiled and ``B``
+Then module ``A`` (that depends on ``B`` and ``G``) is compiled and ``B``
 and ``G`` are left unchanged. ``A`` requires ``G.P2``.

 So now ``G.c`` MUST contain both ``P1`` and ``P2``, but we haven't even
 loaded ``P1`` from the symbol file, nor do we want to because we then quickly
-would restore large parts of the whole program. But we also don't want to
-store ``P1`` in ``B.c`` because that would mean to store every symbol where
-it is referred from which ultimately means the main module and putting
-everything in a single C file.
+would restore large parts of the whole program.

-There is however another solution: The old file ``G.c`` containing ``P1`` is
-**merged** with the new file ``G.c`` containing ``P2``. This is the solution
-that is implemented in the C code generator (have a look at the ``ccgmerge``
-module). The merging may lead to *cruft* (aka dead code) in generated C code
-which can only be removed by recompiling a project with the compilation cache
-turned off. Nevertheless the merge solution is way superior to the
-cheap solution "turn off dead code elimination if the compilation cache is
-turned on".
+Solution
+~~~~~~~~

+The backend must have some logic so that if the currently processed module
+is from the compilation cache, the ``ast`` field is not accessed. Instead
+the generated C(++) code for the symbol's body needs to be cached too and
+inserted back into the produced C file. This approach seems to deal with
+all the problems outlined above.


 Debugging Nim's memory management
@@ -317,7 +373,7 @@ Introduction

 I use the term *cell* here to refer to everything that is traced
 (sequences, refs, strings).
-This section describes how the new GC works.
+This section describes how the GC works.

 The basic algorithm is *Deferred Reference Counting* with cycle detection.
 References on the stack are not counted for better performance and easier C