Go 编译器介绍

Feb 12, 2022 17:00 · 2415 words · 5 minute read

来自 Go 官方文档 https://github.com/golang/go/blob/go1.16.14/src/cmd/compile/README.md

译文

cmd/compile 路径包含了构成 Go 编译器主要的代码包。编译器在逻辑上被划分成四个阶段。

当谈到编译器你可能有时会听到“前端”和“后端”这两个术语。粗略地讲,这些转化为我们将列出的前两个和后两个阶段。而第三个术语——“中端”(middle-end)通常是指发生在第二个阶段的许多工作。

注意 go/* 系列的代码包,比如 go/parser 还有 go/types,和编译器没啥关系。由于编译器最初是用 C 语言编写的,开发 go/* 软件包是为了开发处理 Go 代码的工具,如 gofmt 和 vet。

需要澄清的是,gc 这个名称代表 Go 编译器(Go Compiler),与大写的 GC 无关,后者代表垃圾收集(Garbage Collection)。

1. 解析

  • cmd/compile/internal/syntax(词法分析器,解析器,语法树)

在编译的第一阶段,源码被序列化(词法分析),解析(语法分析),为每个源代码文件构建语法树。

每个语法树都是对各自源文件的精确表示,其节点对应源码的各种元素,例如表达式、声明和语句。语法树还包括位置信息,用于错误报告和创建调试信息。

2. 类型检查和抽象语法树(AST)转换

  • cmd/compile/internal/gc(创建编译器 AST,类型检查,AST 转换)

gc 代码包包括了一个从 C 语言编写时延续下来的 AST 定义。所有的代码都是用它写的,所以 gc 包必须做的第一件事就是把语法包的语法树转换为编译器的 AST 表示。这个额外的步骤将来可能会被重构掉。

AST 接下来做类型检查。第一步是名称解析和类型推理,确定那个对象属于哪个标识符,以及每个表达式具有什么类型。类型检查包括一些额外的检查,比如“声明但未使用”还有确定一个函数是否终止。

某些转换也是在 AST 上进行的。一些节点根据类型信息被细化,例如字符串加法被从算术加法节点类型中分离出来。AST 还会消除死代码,函数调用内联以及逃逸分析。

3. 通用静态单一赋值(SSA)

  • cmd/compile/internal/gc(转换至 SSA)
  • cmd/compile/internal/ssa(SSA 传递和规则)

在这个阶段,AST 被转换至静态单一赋值(SSA)的形式,这是一种具有特定属性的较为低级的中间表示法,使其更容易实现优化并最终生成机器码。

在这个转换过程中,应用了内部函数。这些都是特殊的函数,编译器被调教成根据具体情况用重度优化过的代码来替换。

在 AST 转换至 SSA 的过程中,某些节点也被降级为更简单的组件,这样编译器的其他部分就可以和他们一起工作。例如,内建的 copy 被内存移动代替,range 循环被改写成 for 循环。由于历史原因,有些在转换到 SSA 之前就发生了,但是长期计划是将它们全都放到这里。

然后一系列与机器无关的操作和规则被应用。它们不涉及任何单一的计算机架构,因此可以在所有的 GOARCH 变量上运行。

这些通用的操作包括删除死代码,去掉不需要的 nil 检查和去掉不使用的分支。通用重写规则主要涉及表达式,比如用常量值替换一些表达式,还有优化乘法和浮点运算。

4. 生成机器码

  • cmd/compile/internal/ssa(SSA 降级和架构特定的传递)
  • cmd/internal/obj(机器码生成)

从降级(lower)操作开始,编译器依赖于机器,它将通用值改写为特定机器的变量。例如,在 amd64 支持内存操作数,所以很多加载-存储操作可能被合并。

要注意的是降级操作运行所有机器特定的重写规则,因为在这也应用了很多优化。

一旦 SSA 被降级并更接近目标架构,将运行最后的代码优化。包括另一个消除死代码的操作,时值距离它们被使用的地方更近,去掉不会被读取的局部变量,以及寄存器分配。其他重要的工作包含布局堆栈框架,将栈偏移量分配给局部变量,以及指针活性分析,它会计算在每个 GC 安全点上哪些栈中的指针是活的。

在 SSA 生成阶段结束时,Go 函数已被转换为一系列的 obj.Prog 指令,它们会被传递给汇编器(cmd/internal/obj),汇编器将其转换为机器码并输出最终的对象文件。该文件还包含了反射数据、导出数据和调试信息。


原文

cmd/compile contains the main packages that form the Go compiler. The compiler may be logically split in four phases, which we will briefly describe alongside the list of packages that contain their code.

You may sometimes hear the terms “front-end” and “back-end” when referring to the compiler. Roughly speaking, these translate to the first two and last two phases we are going to list here. A third term, “middle-end”, often refers to much of the work that happens in the second phase.

Note that the go/* family of packages, such as go/parser and go/types, have no relation to the compiler. Since the compiler was initially written in C, the go/* packages were developed to enable writing tools working with Go code, such as gofmt and vet.

It should be clarified that the name “gc” stands for “Go compiler”, and has little to do with uppercase “GC”, which stands for garbage collection.

1. Parsing

  • cmd/compile/internal/syntax (lexer, parser, syntax tree)

In the first phase of compilation, source code is tokenized (lexical analysis), parsed (syntax analysis), and a syntax tree is constructed for each source file.

Each syntax tree is an exact representation of the respective source file, with nodes corresponding to the various elements of the source such as expressions, declarations, and statements. The syntax tree also includes position information which is used for error reporting and the creation of debugging information.

2. Type-checking and AST transformations

  • cmd/compile/internal/gc (create compiler AST, type checking, AST transformations)

The gc package includes an AST definition carried over from when it was written in C. All of its code is written in terms of it, so the first thing that the gc package must do is convert the syntax package’s syntax tree to the compiler’s AST representation. This extra step may be refactored away in the future.

The AST is then type-checked. The first steps are name resolution and type inference, which determine which object belongs to which identifier, and what type each expression has. Type-checking includes certain extra checks, such as “declared and not used” as well as determining whether or not a function terminates.

Certain transformations are also done on the AST. Some nodes are refined based on type information, such as string additions being split from the arithmetic addition node type. Some other examples are dead code elimination, function call inlining, and escape analysis.

3. Generic SSA

  • cmd/compile/internal/gc (converting to SSA)
  • cmd/compile/internal/ssa (SSA passes and rules)

In this phase, the AST is converted into Static Single Assignment (SSA) form, a lower-level intermediate representation with specific properties that make it easier to implement optimizations and to eventually generate machine code from it.

During this conversion, function intrinsics are applied. These are special functions that the compiler has been taught to replace with heavily optimized code on a case-by-case basis.

Certain nodes are also lowered into simpler components during the AST to SSA conversion, so that the rest of the compiler can work with them. For instance, the copy builtin is replaced by memory moves, and range loops are rewritten into for loops. Some of these currently happen before the conversion to SSA due to historical reasons, but the long-term plan is to move all of them here.

Then, a series of machine-independent passes and rules are applied. These do not concern any single computer architecture, and thus run on all GOARCH variants.

Some examples of these generic passes include dead code elimination, removal of unneeded nil checks, and removal of unused branches. The generic rewrite rules mainly concern expressions, such as replacing some expressions with constant values, and optimizing multiplications and float operations.

4. Generating machine code

  • cmd/compile/internal/ssa (SSA lowering and arch-specific passes)
  • cmd/internal/obj (machine code generation)

The machine-dependent phase of the compiler begins with the “lower” pass, which rewrites generic values into their machine-specific variants. For example, on amd64 memory operands are possible, so many load-store operations may be combined.

Note that the lower pass runs all machine-specific rewrite rules, and thus it currently applies lots of optimizations too.

Once the SSA has been “lowered” and is more specific to the target architecture, the final code optimization passes are run. This includes yet another dead code elimination pass, moving values closer to their uses, the removal of local variables that are never read from, and register allocation.

Other important pieces of work done as part of this step include stack frame layout, which assigns stack offsets to local variables, and pointer liveness analysis, which computes which on-stack pointers are live at each GC safe point.

At the end of the SSA generation phase, Go functions have been transformed into a series of obj.Prog instructions. These are passed to the assembler (cmd/internal/obj), which turns them into machine code and writes out the final object file. The object file will also contain reflect data, export data, and debugging information.