Go Compiler - Part 1

Tutorial about building compilers

In this tutorial we will use the OCaml programing language to build a simple compiler. This will be a walk-through tutorial.

Here we will be building a mini Go compiler. The Syntax is the minimum to compile this:

// Defining a function
// with 2 params and 1 return
func add(a int, b int) int {
  return a + b
}

func main() {
  // one way to define variables
  var a int = 10 
  // another way to define variables
  b := 25

  // calling functions
  add(a, b)
}

We will be using the dune build system for OCaml. Dune help us to create, build and execute large OCaml projects with ease.

To create a project using dune we can run the following:

dune init project gocompiler

The dune init is a command that must follow one of these: executable; library; project; or test. In this case, what we want is to create a project, and then we give it the project's name.

After running the command inside your terminal, you can enter the folder and see that your project structure looks like this:

.
├── bin
│   ├── dune
│   └── main.ml
├── dune-project
├── gocompiler.opam
├── lib
│   └── dune
└── test
    ├── dune
    └── gocompiler.ml

3 directories, 7 files

The bin folder is where we store our executable.
Personally, I enjoy renaming the bin folder to src.

The code inside the main.ml should look like this:

let () = print_endline "Hello, World!"

we can execute our project by running:

dune exec gocompiler

Now that we have our project created, let's begin by defining an AST (Abstract Syntax Tree).

In OCaml we can use contructors with data to define our AST. Let's create the ast.ml file and write the following.

type typ =
  | TInt
  | TVoid

type expr =
  (* Can be integers *)
  | Const of int
  (* Arithmetic operators: + - * / *)
  | Op of expr * string * expr
  (* Variables *)
  | Var of string
  (* Apply to the function this expr list *)
  | Apply of string * expr list
and stmt =
  | Return of expr
  (* Define variables with name, optional typ and a value *)
  | Let of string * typ option * expr
  (* Name | Params with name type | Return type | Statements for the body function *)
  | Func of string * (string * typ) list * typ * stmt list

After our AST is defined, we can create a simple example to demonstrate how to use it. The following OCaml code will represent the previous Golang code.

let code = 
  [
    Func
      ( "add",
        [ ("a", TInt); ("b", TInt) ],
        TInt,
        [ Return (Op (Var "a", "+", Var "b")) ] );
    Func
      ( "main",
        [],
        TVoid,
        [
          Let ("a", Some TInt, Const 10);
          Let ("b", None, Const 25);
          Let ("_", None, Apply ("add", [ Var "a"; Var "b" ]));
        ] );
  ];;

That is a list of statements, the first element is the add function with all the params with their types, the return type, and the body is just a return with the + operator. Lastly, the element is the main function with empty params, the return type, and the statements to execute.

Do we have to write this all the time? No, we will write a parser and lexer. Both working together, they will transform all the text into tokens and then parse them into those OCaml expressions.

At this stage, we need to create the tokens we want to parse and lex.

We need to add the menhir module. Edit your dune-project file and add the follow.

(lang dune 3.5)

(name gocompiler)

(using menhir 2.1) ; new line

...

Note: Just one line added

Inside the dune file, let's add this:

...

(ocamllex
 (modules lexer))

(menhir
 (modules parser))

That code tells that the module Parser and Lexer are generated by ocamllex and menhir.

Create a file named parser.mly. Note: the file name is the same as the one specified dune file.

%token LET FUNC
%token RETURN
%token <string> NAME
%token <int> VALUE
%token ADD EQUAL
%token LBRACE RBRACE
%token LPARENT RPARENT
%token COMMA
%token EOF

These tokens are a representation of the individual chars from the source code. The NAME and VALUE tokens can store a value of string and int respectively.

And now create a new file named lexer.mll.

{
  open Parser
}

rule token = parse
  | ['\n' ' ' '\t'] { token lexbuf }
  | '+' { ADD }
  | '=' { EQUAL }
  | '{' { LBRACE }
  | '}' { RBRACE }
  | '(' { LPARENT }
  | ')' { RPARENT }
  | ',' { COMMA }
  | "return" { RETURN }
  | "var" { LET }
  | "func" { FUNC }
  | ['0'-'9']+ as s { VALUE (int_of_string s) }
  | ['A'-'Z' 'a'-'z']+ as name
    { NAME name }
  | eof { EOF }
  | _ as c { failwith (Format.sprintf "Char %c is invalid" c) }

The lexer.mll file is trying to convert each char from the source code to the respective token (if possible). First, we open the Parser module (created previously) to get access to all these tokens. The rule is ignoring all white spaces, end of lines, and tabs. It's parsing all chars to its tokens and checks for the EOF. If some char does not match any of these conditions then it's an invalid one.

The brackets are a way to say "can match with any one of the lists". Example: ['A'-'Z' 'a'-'z']+ is just matching for any sequence of letters from A to Z or a to z.

Back to parser.mly.

%{
  open Ast
%}

...

Open the AST module to get access to out AST.

...

%start code

%type <stmt list> code

%%

param:
  | n = NAME t = NAME { n, (typ_of_str t) }

params:
  | LPARENT RPARENT { [] }
  | LPARENT p = separated_nonempty_list(COMMA, param) RPARENT 
    { p }

op:
  | v1 = expr ADD v2 = expr { Op (v1, "+", v2) }

apply:
  | n = NAME LPARENT e = separated_nonempty_list(COMMA, expr) RPARENT
    { 
      Apply (n, e)
    } 

expr:
  | LPARENT e = expr RPARENT { e }
  | v = VALUE { Const v }
  | n = NAME { Var n }
  | o = op { o }
  | a = apply { a }

stmt:
  | RETURN e = expr { Return e }
  | LET n = NAME EQUAL e = expr { Let (n, None, e) }
  | LET n = NAME t = NAME EQUAL e = expr 
    {
      let t = typ_of_str t in
      Let (n, Some t, e)
    }
  | FUNC n = NAME p = params t = NAME? LBRACE s = list(stmt) RBRACE 
    { 
      let t = match t with | Some t -> typ_of_str t | None -> TVoid in
      Func (n, p, t, s) 
    }
  | e = expr {Let ("_", None, e)}

code: s = list(stmt) EOF { s } 

WOW, this is a lot of code! The first two lines set the entry point and the type of its return.

Try and read it from bottom to top, and do it like this: Starting in code, we can read as "list of stmt followed by EOF we just return the list of stmt".

Example of how to read stmt:

| LET n = NAME t = NAME EQUAL e = expr 
  {
    let t = typ_of_str t in
    Let (n, Some t, e)
  }
  • If we have the LET token followed by a NAME with any that is being named "n" value followed by type that is just another NAME, followed by the equal sign and then any expression of type t.

The stmt match for all the stmt in our AST, inside stmt, the FUNC has some parameters, and the way it checks for those is by looking for a list separated by a comma. Note: stmt and expr are recursive.

stmt:
  ...
  | a = apply {Let ("_", None, a)}

Is just a way to accept any expr as stmt.

The expr is doing the same as stmt but for expr.

Let's start using the modules we just created. Inside the main.ml

open Ast 

let code = 
  [
    Func
      ( "add",
        [ ("a", TInt); ("b", TInt) ],
        TInt,
        [ Return (Op (Var "a", "+", Var "b")) ] );
    Func
      ( "main",
        [],
        TVoid,
        [
          Let ("a", Some TInt, Const 10);
          Let ("b", None, Const 25);
          Let ("_", None, Apply ("add", [ Var "a"; Var "b" ]));
        ] );
  ];;

let _ =
  let source_code = 
"func add(a int, b int) int {
  return a + b
}

func main() {
  var a int = 10 
  var b = 25

  add(a, b)
}"
  in
  let lex = Lexing.from_string source_code in
  assert (code = (Parser.code Lexer.token lex));;

Accordingly, the code we just wrote parses and lexes the Golang code, as a result, is returning the list of tokens we expected!

However, we want to parse a file and not hardcoded text. Let's write code to parse the main.go file.

let () = 
  let lex = Lexing.from_channel (open_in "main.go") in
  let code_from_source = Parser.code Lexer.token lex in
  assert (code = code_from_source)

The open_in is a OCaml function to open a file with the given name (string). Then we use the Parser.code to parser the text comming from the in_channel.

Suggestion: Build a CLI that receives the file name or just defaults it. Just like the go cli works.