Objects implementation notes

I would like to share some notes about how some of the object features were implemented in the Red compiler. As these are probably the most complex parts in the Red toolchain right now, I thought that it would be worth documenting them a bit, as it can be useful to current and future code contributors.

Shadow objects and functions

Reminder: the Red toolchain is currently written entirely in Rebol.

During the work on object support in the Red compiler, I realized that I could leverage the proximity of Red with Rebol much deeper than before, in order to more easily map some Red constructs directly to Rebol ones. That’s how I came up with the “shadow” objects concept (later extended to functions too).

It is pretty simple in fact, each time a Red object is processed from the source code, an equivalent, minimized object is created by the compiler in memory and connected to a tree of existing objects in order to match the definitional scoping used in the Red code.

Here is an example Red source code with two nested objects:

Red []

a: object [
    n: 123
    b: object [
        inc: func [value][value + n]

Once processed by the compiler, the following shadow objects are created in memory:

objects: [
    a object [
        n: integer!
        b: object [
            inc: function!

But it does not just stop there, the body of the Red object is bound (using Rebol’s bind native) to the internal Rebol object, in such way that the definitional scoping order is preserved. So the Red code is directly linked to the Rebol shadow objects in memory. The same procedure (including the Red code binding to Rebol objects part) is applied to all compiled functions, which context is represented as a nested Rebol object in the compiler’s memory.

If you get where I am heading, yes, that means that resolving the context of any of the words contained in a Red object/function body becomes as simple as calling Rebol’s bind' native on the word value. (Remember that Red source code is converted to a tree of blocks before being compiled). The bind' native will return one of the Rebol’s objects, that can then be used as a key in an hashtable to retrieve all the associated metadata.

I wish I had come up with that simple method when I was implementing namespaces support for Red/System. I think that I will rework that part in Red/System in the future, reusing the same approach in order to reduce compilation times (namespaces compilation overhead is significant in Red/System, roughly taking 20% of the compilation time).

Choosing Rebol as the bootstrapping language for Red, shows here its unique advantages.

Dynamic invocation

Processing path values is really difficult in Red (as it would be in Rebol if it had a compiler). The main issue can be visualized easily in this simple example:

foo: func [o][o/a 123]

Now if you put yourself in the shoes of the compiler, what code would you generate for o/a ‘… Could be a block access, could be a function call with /a as refinement, could be an object path accessing a field, could be an object path calling the function a defined in the object. All these cases would require a different code output, and the compiler has no way to accurately guess which one it is in the general case. Moreover, foo can be called with different datatypes as argument, and the compiled code still need to account for that…

One method could be to generate different code paths for each case listed above. As you can guess, this would become quickly very expensive to manage for expressions with multiple paths, as the possible combinations would make the number of cases explode quickly.

Another, very simple solution, would be to defer that code evaluation to the interpreter, but as you cannot know where the expression ends, the whole function (or at least the root level of the function) would need to be passed to the interpreter. Not a satisfying solution performance-wise.

The solution currently implemented in Red compiler for such cases, is a form of “dynamic invocation”. If you go through all cases, actually they can be sorted in two categories only:

a) access to a passive value
b) function invocation

Only at runtime you can know which category the o/a path belongs to (even worse, category can change at each foo function call!). The issue is that the compiler generates code that evaluates Red expressions as stack manipulations (not the native stack, but a high-level Red stack), so the compiler needs to know which category it is, so it can:

  • create the right corresponding stack frames.
  • consume the right number of arguments in case of a function invocation.

Basically, the generated Red/System code for the foo function would be (omitting prolog/epilog parts for clarity):

For a) case:

stack/mark-native ~eval-path 
stack/push ~o
word/push ~a 
actions/eval-path* false 
integer/push 123

For b) case (with /a being a refinement):

stack/mark-func ~o
integer/push 123
logic/push true               ;-- refinement /a set to TRUE

As you can see, the moment where the integer value 123 is pushed on stack for processing is very different in both cases. In case a), it is outside of the o/a stack frame, in case b), it is part of it. So what should the compiler do then…looks unsolvable’

Actually some stack tricks can help solve it. This is how the compiler handles it now:

  • The stack can either overwrite new expressions (default) or accumulate them.
  • At each level of a path evaluation, a check for a function result is applied. When a function is detected, it is pushed on stack and a new stack frame is opened to gather the required arguments. Such function is named a “dynamic call” in this context.
  • Some stack primitives (like stack/reset) are modified to not only support the overwritting/accumulative modes, but also check if the arity for the pending dynamic call has been fulfilled, and when appropriate, run the deferred function call, clean-up the stack and revert to the default overwritting mode.

This is the code currently produced by the Red compiler for o/a 123:

stack/push ~o
either stack/func' [stack/push-call path388 0 0 null] [
    stack/mark-native ~eval-path
    stack/push stack/arguments - 1      ;-- pushes ~o
    word/push ~a
    actions/eval-path* false
    either stack/func' [
        stack/push-call path388 1 0 null
integer/push 123

This generated code, with the help of the dual-mode stack, can support evaluation of o/a whatever value the path refers to (passive or function). stack/func’ here checks if the stack top entry is a function or not. There is a little performance impact, but it is not significant, especially in respect to the high flexibility it brings.

So far so good. What happens now if the path is used as argument of a function call:

foo: func [o][probe o/a 123]

The outer stack frame that probe will create then becomes problematic, because it will close just after o/a, preventing it to fetch eventual arguments when o/a is a function call…so back to the drawing board’ Fortunately not, we can apply the same transformation for the wrapping call and defer it until its arguments have been fully evaluated. This is the resulting code:

f_~path389: func [/local pos] [
    pos: stack/arguments 
    stack/mark-func ~probe 
    stack/push pos 

stack/defer-call ~probe as-integer :f_~path389 1 null

stack/push ~o
either stack/func' [stack/push-call path388 0 0 null] [
    stack/mark-native ~eval-path
    stack/push stack/arguments - 1
    word/push ~a
    actions/eval-path* false
    either stack/func' [
        stack/push-call path388 1 0 null
------------| "probe o/a"
integer/push 123

As you can see, it gets more hairy, but still manageable. The outer stack frame is externalized (into another Red/System function), so it can be called later, once the nested expressions are evaluated.

That said, dynamic calls still need a bit more work in order to support routine! calls and refinements for wrapping calls. Those features will be added in the next releases. Also, the gain in flexibility makes the compiler more short-sighted when a particular structure is expected, like for control flow keywords requiring blocks. I don’t see yet how this dynamic call approach could support such use-cases in a more user-friendly way.

But another feature can come to the rescue, the upcoming #alias directive proposed in the previous blog post. As long as the user will be willing to use this new directive, it would simply avoid these dynamic constructions, by providing enough information to the compiler to statically determine what kind of value, paths are referring to, resulting in much shorter and faster code, without the short-sightness issue.

This is the kind of problem I had to solve during object implementation and why it took much longer than planned initially.

Hope this deeper look inside the compiler’s guts is not too scary. ;-) Now, back to coding for next release!

And, by the way, Merry Christmas to all Red followers. :-)

Red/System compiler overview

Source Navigation

As requested by several users, I am giving a little more insights on the Red/System compiler inner workings and a map for navigating in the source code.

Current Red/System source tree:

    %compiler.r        ; Main compiler code, loads everything else
    %emitter.r         ; Target code emitter abstract layer
    %linker.r          ; Format files loader
    %rsc.r             ; Compiler's front-end for standalone usage

formats/               ; Contains all supported executable formats
    %PE.r              ; Windows PE/COFF file format emitter
    %ELF.r             ; UNIX ELF file format emitter

library/               ; Third-party libraries

runtime/               ; Contains all runtime definitions
    %common.reds       ; Cross-platform definitions
    %win32.r           ; Windows-specific bindings
    %linux.r           ; Linux-specific bindings

targets/               ; Contains target execution unit code emitters
    %target-class.r    ; Base utility class for emitters
    %IA32.r            ; Intel IA32 code emitter

tests/                 ; Unit tests

Once the compiler code is loaded in memory, the objects hierarchy looks like:

system/words/          ; global REBOL context

system-dialect/        ; main object
    loader/            ; preprocessor object
        process        ; preprocessor entry point function
    compiler/          ; compiler object
        compile        ; compiler entry point function

emitter/               ; code emitter object
     compiler/         ; short-cut reference to compiler object
     target/           ; reference the target code emitter object
         compiler/     ; short-cut reference to compiler object

linker/                ; executable file emitter
    PE/                ; Windows PE/COFF format emitter object
    ELF/               ; UNIX ELF format emitter object

Note: the linker file formats are currently statically loaded, this will be probably changed to a dynamic loading model.

Compilation Phases

The compilation is a process that goes through several phases to transform a textual source code to an executable file. Here is an overview of the process:

1) Source loading

This is a preparatory phase that would convert the text source code to its memory representation (close to an AST). This is achieved in 3 steps:

  1. source is preprocessed in its text form to make the syntax REBOL-compatible
  2. source is converted to a tree of nested blocks using the REBOL’s LOAD function
  3. source is postprocessed to interpret some compiler directives (like #include and #define)

2) Compilation

The compiler walks through the source tree using a recursive descent parser. It attempts to match keywords and values with its internal rules and emits:

  • the corresponding machine code in the code buffer
  • the global variables and complex data (c-string! and struct! literals) in the data buffer

The internal entry point function for the compilation is compiler/comp-dialect. All the compiler/comp-* functions are used to recursively analyze the source code and each one matches a specific semantic rule (or set of rules) from the Red/System language specification.

The production of native code is direct, there is no intermediary representation, machine code is generated as soon as a language statement or expression is matched. This is the simplest way to do it, but code cannot be efficiently optimised without a proper Intermediate Representation (IR). When Red/System will be rewritten in Red, a simple IR will be introduced to enable the full range of possible code optimisations.

As you know, a Red/System program entry point is at the beginning of the source code. During the compilation, the source code in the global context is compiled first and all functions are collected and compiled after all global code. So the generated code is always organised the same way:

  • global code (including proper program finalization)
  • function 1
  • function 2

The results of the compilation process are:

  • a global symbol table
  • a machine code buffer
  • a global data buffer
  • a list of functions from external libraries (usually these are OS API mappings)

The compiler is able to process one or several source files this way before entering the linking phase.

3) Linking

The linking process goal is to create an executable file that could be loaded by the target platform. So the executable file needs to conform to the target ABI for that, like PE for Windows or ELF for Linux.

During the linking, the global symbol table is used to “patch” the code and data buffer (see linker/resolve-symbol-refs), inserting the final memory address for the pointed resources (variable, function, global data). The different parts to assemble are grouped into so-called “sections”, that can be themselves be grouped into “segments” (as, e.g., in ELF).

Finally, some headers describing the file and its sections/segments are inserted to complete the file. The file is then written down on disk, marking the end of the whole process.

Static linking of external libraries (*.lib, *.a,…) will be added at some point in the future (when the need for such feature will appear).

I hope this short description gives you a better picture on how Red/System compiler works, even if it is probably obvious for the most experienced readers. Feel free to ask for more in the comments, or better, on the Google Groups mailing-list.