document how the incremental compilation scheme could work

Andreas Rumpf
2018-06-01 22:11:32 +02:00
parent 61fb83ecbb
commit cae1973856
4 changed files with 110 additions and 51 deletions


@@ -246,10 +246,10 @@ const trackPosInvalidFileIdx* = FileIndex(-2) # special marker so that no sugges
# are produced within comments and string literals
type
MsgConfig* = object
MsgConfig* = object ## does not need to be stored in the incremental cache
trackPos*: TLineInfo
trackPosAttached*: bool ## whether the tracking position was attached to some
## close token.
trackPosAttached*: bool ## whether the tracking position was attached to
## some close token.
errorOutputs*: TErrorOutputs
msgContext*: seq[TLineInfo]


@@ -47,7 +47,7 @@ type
doStopCompile*: proc(): bool {.closure.}
usageSym*: PSym # for nimsuggest
owners*: seq[PSym]
methods*: seq[tuple[methods: TSymSeq, dispatcher: PSym]]
methods*: seq[tuple[methods: TSymSeq, dispatcher: PSym]] # needs serialization!
systemModule*: PSym
sysTypes*: array[TTypeKind, PType]
compilerprocs*: TStrTable


@@ -156,24 +156,27 @@ type
version*: int
Suggestions* = seq[Suggest]
ConfigRef* = ref object ## eventually all global configuration should be moved here
target*: Target
ConfigRef* = ref object ## every global configuration
## fields marked with '*' are subject to
## the incremental compilation mechanisms
## (+) means "part of the dependency"
target*: Target # (+)
linesCompiled*: int # all lines that have been compiled
options*: TOptions
globalOptions*: TGlobalOptions
options*: TOptions # (+)
globalOptions*: TGlobalOptions # (+)
m*: MsgConfig
evalTemplateCounter*: int
evalMacroCounter*: int
exitcode*: int8
cmd*: TCommands # the command
selectedGC*: TGCMode # the selected GC
selectedGC*: TGCMode # the selected GC (+)
verbosity*: int # how verbose the compiler is
numberOfProcessors*: int # number of processors
evalExpr*: string # expression for idetools --eval
lastCmdTime*: float # when caas is enabled, we measure each command
symbolFiles*: SymbolFilesOption
cppDefines*: HashSet[string]
cppDefines*: HashSet[string] # (*)
headerFile*: string
features*: set[Feature]
arguments*: string ## the arguments to be passed to the program that
@@ -220,13 +223,13 @@ type
cLinkedLibs*: seq[string] # libraries to link
externalToLink*: seq[string] # files to link in addition to the file
# we compiled
# we compiled (*)
linkOptionsCmd*: string
compileOptionsCmd*: seq[string]
linkOptions*: string
compileOptions*: string
linkOptions*: string # (*)
compileOptions*: string # (*)
ccompilerpath*: string
toCompile*: CfileList
toCompile*: CfileList # (*)
suggestionResultHook*: proc (result: Suggest) {.closure.}
suggestVersion*: int
suggestMaxResults*: int


@@ -38,10 +38,6 @@ Path Purpose
Bootstrapping the compiler
==========================
As of version 0.8.5 the compiler is maintained in Nim. (The first versions
have been implemented in Object Pascal.) The Python-based build system has
been rewritten in Nim too.
Compiling the compiler is a simple matter of running::
nim c koch.nim
@@ -202,16 +198,86 @@ Compilation cache
=================
The implementation of the compilation cache is tricky: There are lots
of issues to be solved for the front- and backend. In the following
sections *global* means *shared between modules* or *property of the whole
program*.
of issues to be solved for the front- and backend.
General approach: AST replay
----------------------------
We store a module's AST of a successful semantic check in a SQLite
database. There are plenty of features that require a subsequence of the
module's statements to be re-applied, for example:
.. code-block:: nim
{.compile: "foo.c".} # even if the module is loaded from the DB,
# "foo.c" needs to be compiled/linked.
The solution is to **re-play** the module's top level statements.
This solves the problem without having to special case the logic
that fills the internal seqs which are affected by the pragmas.
In fact, this describes how the AST should be stored in the database,
as a "shallow" tree. Let's assume we compile module ``m`` with the
following contents:
.. code-block:: nim
import strutils
var x*: int = 90
{.compile: "foo.c".}
proc p = echo "p"
proc q = echo "q"
static:
echo "static"
Conceptually this is the AST we store for the module:
.. code-block:: nim
import strutils
var x*
{.compile: "foo.c".}
proc p
proc q
static:
echo "static"
The symbol's ``ast`` field is loaded lazily, on demand. This is where most
savings come from: only the shallow outer AST is reconstructed immediately.
It is also important that the replay involves the ``import`` statement so
that the dependencies are resolved properly.
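
The replay idea can be illustrated with a small, self-contained sketch. It is
not the compiler's implementation: ``Stmt``, ``ModuleCache``, ``ReplayState``,
``replay`` and ``loadBody`` are hypothetical names, and an in-memory table
stands in for the SQLite database. It only shows that replaying the shallow
top level restores the side effects (imports, files to compile) while proc
bodies stay untouched until something asks for them.

```nim
import std/tables

type
  StmtKind = enum
    skImport, skCompile, skProcDecl
  Stmt = object                    # one shallow top level statement
    kind: StmtKind
    arg: string                    # module name, C file or proc name
  ModuleCache = object
    topLevel: seq[Stmt]            # the shallow AST, in source order
    bodies: Table[string, string]  # assumption: stands in for the database
    bodyLoads: int                 # counts the lazy loads performed
  ReplayState = object
    imports: seq[string]
    toCompile: seq[string]
    procs: seq[string]

proc replay(c: ModuleCache): ReplayState =
  ## Re-applies every top level statement without touching any proc body.
  for s in c.topLevel:
    case s.kind
    of skImport: result.imports.add s.arg
    of skCompile: result.toCompile.add s.arg
    of skProcDecl: result.procs.add s.arg

proc loadBody(c: var ModuleCache; name: string): string =
  ## Fetches a body on demand; raises KeyError for an unknown proc.
  inc c.bodyLoads
  result = c.bodies[name]

var cache = ModuleCache(
  topLevel: @[
    Stmt(kind: skImport, arg: "strutils"),
    Stmt(kind: skCompile, arg: "foo.c"),
    Stmt(kind: skProcDecl, arg: "p"),
    Stmt(kind: skProcDecl, arg: "q")],
  bodies: {"p": "echo \"p\"", "q": "echo \"q\""}.toTable)

let state = replay(cache)
doAssert state.toCompile == @["foo.c"]  # the pragma's effect is restored
doAssert cache.bodyLoads == 0           # replay alone loads no bodies
discard cache.loadBody("p")
doAssert cache.bodyLoads == 1           # only the body of ``p`` was fetched
```

Keeping the statements in source order matters in this sketch for the same
reason it matters in the text above: the ``import`` is replayed before the
declarations that may depend on it.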
Shared global compiletime state
-------------------------------
Nim allows ``.global, compiletime`` variables that can be filled by macro
invocations across different modules. This feature breaks modularity in a
severe way. Plenty of different solutions have been proposed:
- Restrict the types of global compiletime variables to ``Set[T]`` or
similar unordered, only-growable collections so that we can track
the module's write effects to these variables and reapply the changes
in a different order.
- In every module compilation, reset the variable to its default value.
- Provide a restrictive API that can load/save the compiletime state to
a file.
(These solutions are not mutually exclusive.)
Since we adopt the "replay the top level statements" idea, the natural
solution to this problem is to emit pseudo top level statements that
reflect the mutations done to the global variable.
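
A minimal sketch of this idea, under the assumption that the global is a
grow-only set of strings: every mutation is logged next to the module, and a
later compilation that loads the module from the cache replays the log instead
of re-running the macros. ``CachedModule``, ``recordIncl`` and ``replay`` are
hypothetical names chosen for illustration; the compiler's real data
structures differ.

```nim
import std/sets

type
  CachedModule = object
    name: string
    mutations: seq[string]   # pseudo top level statements: one per ``incl``

proc recordIncl(m: var CachedModule; global: var HashSet[string];
                value: string) =
  ## Performs the mutation and logs it so that it can be replayed later.
  global.incl value
  m.mutations.add value

proc replay(m: CachedModule; global: var HashSet[string]) =
  ## Re-applies the logged mutations for a module loaded from the cache.
  for value in m.mutations:
    global.incl value

# First compilation: macros invoked in module ``a`` register two names.
var firstRun = initHashSet[string]()
var a = CachedModule(name: "a")
a.recordIncl(firstRun, "foo")
a.recordIncl(firstRun, "bar")

# Later compilation: ``a`` is unchanged, so only its log is replayed.
var secondRun = initHashSet[string]()
a.replay(secondRun)
doAssert secondRun == firstRun
```

Because the collection is unordered and only grows, the replay order of
different modules does not change the final state, which is what makes the
restriction on the variable's type attractive.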
Frontend issues
---------------
Methods and type converters
~~~~~~~~~~~~~~~~~~~~~~~~~~~
---------------------------
In the following
sections *global* means *shared between modules* or *property of the whole
program*.
Nim contains language features that are *global*. The best example of this
are multi methods: Introducing a new method with the same name and some
@@ -238,20 +304,17 @@ If in the above example module ``B`` is re-compiled, but ``A`` is not then
``B`` needs to be aware of ``toBool`` even though ``toBool`` is not referenced
in ``B`` *explicitly*.
Both the multi method and the type converter problems are solved by storing
them in special sections in the ROD file that are loaded *unconditionally*
when the ROD file is read.
Both the multi method and the type converter problems are solved by the
AST replay implementation.
Generics
~~~~~~~~
If we generate an instance of a generic, we'd like to re-use that
instance if possible across module boundaries. However, this is not
possible if the compilation cache is enabled. So we give up then and use
the caching of generics only per module, not per project. This means that
``--symbolFiles:on`` hurts a bit for efficiency. A better solution would
be to persist the instantiations in a global cache per project. This might be
implemented in later versions.
We cache generic instantiations and need to ensure this caching works
well with the incremental compilation feature. Since the cache is
attached to the ``PSym`` data structure, it should work without any
special logic.
Backend issues
@@ -259,13 +322,10 @@ Backend issues
- Init procs must not be "forgotten" to be called.
- Files must not be "forgotten" to be linked.
- Anything that is contained in ``nim__dat.c`` is shared between modules
implicitly.
- Method dispatchers are global.
- DLL loading via ``dlsym`` is global.
- Emulated thread vars are global.
However the biggest problem is that dead code elimination breaks modularity!
To see why, consider this scenario: The module ``G`` (for example the huge
Gtk2 module...) is compiled with dead code elimination turned on. So none
@@ -274,25 +334,21 @@ of ``G``'s procs is generated at all.
Then module ``B`` is compiled that requires ``G.P1``. Ok, no problem,
``G.P1`` is loaded from the symbol file and ``G.c`` now contains ``G.P1``.
Then module ``A`` (that depends onto ``B`` and ``G``) is compiled and ``B``
Then module ``A`` (that depends on ``B`` and ``G``) is compiled and ``B``
and ``G`` are left unchanged. ``A`` requires ``G.P2``.
So now ``G.c`` MUST contain both ``P1`` and ``P2``, but we haven't even
loaded ``P1`` from the symbol file, nor do we want to because we then quickly
would restore large parts of the whole program. But we also don't want to
store ``P1`` in ``B.c`` because that would mean to store every symbol where
it is referred from which ultimately means the main module and putting
everything in a single C file.
would restore large parts of the whole program.
There is however another solution: The old file ``G.c`` containing ``P1`` is
**merged** with the new file ``G.c`` containing ``P2``. This is the solution
that is implemented in the C code generator (have a look at the ``ccgmerge``
module). The merging may lead to *cruft* (aka dead code) in generated C code
which can only be removed by recompiling a project with the compilation cache
turned off. Nevertheless the merge solution is way superior to the
cheap solution "turn off dead code elimination if the compilation cache is
turned on".
Solution
~~~~~~~~ 
The backend must have some logic so that if the currently processed module
is from the compilation cache, the ``ast`` field is not accessed. Instead
the generated C(++) for the symbol's body needs to be cached too and
inserted back into the produced C file. This approach seems to deal with
all the outlined problems above.
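
The rule can be sketched as follows. This is an illustration only and not the
C code generator's API: ``Sym``, ``CodeCache`` and ``genProc`` are hypothetical
names, a string stands in for the body AST, and a table stands in for the
persistent cache of generated C.

```nim
import std/tables

type
  Sym = object
    name: string
    fromCache: bool   # the owning module was loaded from the compilation cache
    ast: string       # stands in for the body AST; empty when not loaded
  CodeCache = Table[string, string]   # symbol name -> generated C code

proc genProc(s: Sym; cache: var CodeCache): string =
  ## Emits C for ``s``. For cached modules the stored C is re-inserted and
  ## ``s.ast`` is never read; a missing entry raises KeyError.
  if s.fromCache:
    result = cache[s.name]
  else:
    result = "void " & s.name & "(void) { /* " & s.ast & " */ }"
    cache[s.name] = result

var codeCache = initTable[string, string]()

# First compilation: the body is available and code is generated from it.
let fresh = genProc(Sym(name: "p", ast: "echo \"p\""), codeCache)

# Later compilation: the module comes from the cache, the body is absent.
let reused = genProc(Sym(name: "p", fromCache: true), codeCache)
doAssert reused == fresh
```

In this sketch the C file of a cached module can be rebuilt from the stored
fragments alone, so the backend never forces the lazy ``ast`` load.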
Debugging Nim's memory management
@@ -317,7 +373,7 @@ Introduction
I use the term *cell* here to refer to everything that is traced
(sequences, refs, strings).
This section describes how the new GC works.
This section describes how the GC works.
The basic algorithm is *Deferred Reference Counting* with cycle detection.
References on the stack are not counted for better performance and easier C