Go Inlining

Aug 14, 2022 14:30 · Golang

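Inlining replaces a function call with the body of the called function. This optimization increases the size of the executable binary, but it can improve the program's performance. Go does not inline every function it sees, of course; it follows a set of rules.

Let's start with an example to understand what inlining is.

Everything below is based on Go 1.17.13. The inlining rules change as Go evolves!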

package main

import "fmt"

func main() {
    n := []float32{120.4, -46.7, 32.50, 34.65, -67.45}
    fmt.Printf("The total is %.02f\n", sum(n))
}

func sum(s []float32) float32 {
    var t float32
    for _, v := range s {
        if t < 0 {
            t = add(t, v)
        } else {
            t = sub(t, v)
        }
    }

    return t
}

func add(a, b float32) float32 {
    return a + b
}

func sub(a, b float32) float32 {
    return a - b
}
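To make the idea of "replacing the call with the function body" concrete, here is a minimal sketch. The name sumInlinedByHand is hypothetical and only illustrates what the loop looks like conceptually once add and sub have been substituted; it is not what the compiler literally emits.

```go
package main

import "fmt"

func add(a, b float32) float32 { return a + b }

func sub(a, b float32) float32 { return a - b }

// sumWithCalls mirrors sum from the example above: every iteration
// goes through a call to add or sub in the source code.
func sumWithCalls(s []float32) float32 {
	var t float32
	for _, v := range s {
		if t < 0 {
			t = add(t, v)
		} else {
			t = sub(t, v)
		}
	}
	return t
}

// sumInlinedByHand is a hypothetical hand-written equivalent in which
// the bodies of add and sub have been pasted in place of the calls.
func sumInlinedByHand(s []float32) float32 {
	var t float32
	for _, v := range s {
		if t < 0 {
			t = t + v // body of add
		} else {
			t = t - v // body of sub
		}
	}
	return t
}

func main() {
	n := []float32{120.4, -46.7, 32.50, 34.65, -67.45}
	// Both versions perform the same float32 operations in the same order.
	fmt.Println(sumWithCalls(n) == sumInlinedByHand(n)) // prints true
}
```

Both versions compute the same result; inlining only removes the call overhead.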

Build with the -gcflags="-m" option to see the compiler's optimization decisions:

$ go build -gcflags="-m" main.go
# command-line-arguments
./main.go:23:6: can inline add
./main.go:27:6: can inline sub
./main.go:14:11: inlining call to add
./main.go:16:11: inlining call to sub
./main.go:7:12: inlining call to fmt.Printf
./main.go:10:10: s does not escape
./main.go:6:16: []float32{...} does not escape
./main.go:7:40: sum(n) escapes to heap
./main.go:7:12: []interface {}{...} does not escape
<autogenerated>:1: leaking param content: .this

Both add and sub are inlined. But what about sum? Pass -m twice to get more detailed output:

$ go build -gcflags="-m -m" main.go
# command-line-arguments
./main.go:23:6: can inline add with cost 4 as: func(float32, float32) float32 { return a + b }
./main.go:27:6: can inline sub with cost 4 as: func(float32, float32) float32 { return a - b }
./main.go:10:6: cannot inline sum: unhandled op RANGE # here
./main.go:14:11: inlining call to add func(float32, float32) float32 { return a + b }
./main.go:16:11: inlining call to sub func(float32, float32) float32 { return a - b }
./main.go:5:6: cannot inline main: function too complex: cost 148 exceeds budget 80
./main.go:7:12: inlining call to fmt.Printf func(string, ...interface {}) (int, error) { var fmt..autotmp_4 int; fmt..autotmp_4 = <nil>; var fmt..autotmp_5 error; fmt..autotmp_5 = <nil>; fmt..autotmp_4, fmt..autotmp_5 = fmt.Fprintf(io.Writer(os.Stdout), fmt.format, fmt.a...); return fmt..autotmp_4, fmt..autotmp_5 }
./main.go:10:10: s does not escape
./main.go:7:40: sum(n) escapes to heap:
./main.go:7:40:   flow: ~arg1 = &{storage for sum(n)}:
./main.go:7:40:     from sum(n) (spill) at ./main.go:7:40
./main.go:7:40:     from fmt.format, ~arg1 := "The total is %.02f\n", sum(n) (assign-pair) at ./main.go:7:12
./main.go:7:40:   flow: {storage for []interface {}{...}} = ~arg1:
./main.go:7:40:     from []interface {}{...} (slice-literal-element) at ./main.go:7:12
./main.go:7:40:   flow: fmt.a = &{storage for []interface {}{...}}:
./main.go:7:40:     from []interface {}{...} (spill) at ./main.go:7:12
./main.go:7:40:     from fmt.a = []interface {}{...} (assign) at ./main.go:7:12
./main.go:7:40:   flow: {heap} = *fmt.a:
./main.go:7:40:     from fmt.Fprintf(io.Writer(os.Stdout), fmt.format, fmt.a...) (call parameter) at ./main.go:7:12
./main.go:6:16: []float32{...} does not escape
./main.go:7:40: sum(n) escapes to heap
./main.go:7:12: []interface {}{...} does not escape
<autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
<autogenerated>:1:   flow: {heap} = *.this:
<autogenerated>:1:     from .this.file (dot of pointer) at <autogenerated>:1
<autogenerated>:1:     from .this.file.close() (call parameter) at <autogenerated>:1
<autogenerated>:1: leaking param content: .this

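Go 1.17 does not inline functions that use range. According to the compiler source, select, defer, go (goroutine creation), recover, and labeled for or switch statements also block inlining, and functions containing closures are rejected when base.Debug.InlFuncsWithClosures is 0.

The compiler's inlining code lives in src/cmd/compile/internal/inline/inl.go: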

func (v *hairyVisitor) doNode(n ir.Node) bool {
    // ... (excerpt)
    switch n.Op() {
    case ir.ORECOVER:
        v.reason = "call to recover"
        return true
    case ir.OCLOSURE:
        if base.Debug.InlFuncsWithClosures == 0 {
            v.reason = "not inlining functions with closures"
            return true
        }
    case ir.ORANGE,
        ir.OSELECT,
        ir.OGO,
        ir.ODEFER,
        ir.ODCLTYPE, // can't print yet
        ir.OTAILCALL:
        v.reason = "unhandled op " + n.Op().String()
        return true
    case ir.OFOR, ir.OFORUNTIL:
        n := n.(*ir.ForStmt)
        if n.Label != nil {
            v.reason = "labeled control"
            return true
        }
    case ir.OSWITCH:
        n := n.(*ir.SwitchStmt)
        if n.Label != nil {
            v.reason = "labeled control"
            return true
        }
    }
}
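The cases in the excerpt above can be explored with a few small functions. This is only a sketch: all function names are hypothetical, and the comments describe what the Go 1.17 rules shown above would be expected to decide, not verified compiler output. Building it with -gcflags="-m -m" shows the actual decisions for a given Go version.

```go
package main

import "fmt"

// double is a tiny leaf function, so it is a likely inlining candidate.
func double(x int) int { return x * 2 }

// total uses range. The excerpt lists ir.ORANGE as an unhandled op,
// so Go 1.17 is expected to report "cannot inline total".
func total(xs []int) int {
	t := 0
	for _, x := range xs {
		t += x
	}
	return t
}

// totalIndexed uses a plain, unlabeled for loop. The excerpt only rejects
// labeled loops, so this form may be inlinable if it fits the budget.
func totalIndexed(xs []int) int {
	t := 0
	for i := 0; i < len(xs); i++ {
		t += xs[i]
	}
	return t
}

// withDefer uses defer (ir.ODEFER in the excerpt), which is expected
// to block inlining. The deferred closure increments the named result.
func withDefer(x int) (r int) {
	defer func() { r++ }()
	return x
}

func main() {
	xs := []int{1, 2, 3}
	fmt.Println(double(4), total(xs), totalIndexed(xs), withDefer(1))
}
```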

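While walking a function's AST, Go gives each inlining candidate a budget of 80, roughly one unit per node. For example, the expression a = a + 1 has 5 nodes: AS, NAME, ADD, NAME, LITERAL. A function whose cost exceeds the budget is not inlined either: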

const (
    inlineMaxBudget       = 80
)

func (v *hairyVisitor) tooHairy(fn *ir.Func) bool {
    v.do = v.doNode // cache closure
    if ir.DoChildren(fn, v.do) {
        return true
    }
    if v.budget < 0 {
        v.reason = fmt.Sprintf("function too complex: cost %d exceeds budget %d", inlineMaxBudget-v.budget, inlineMaxBudget)
        return true
    }
    return false
}
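The budget idea can be approximated outside the compiler. The sketch below counts go/ast nodes per function body and compares the count with a budget of 80. This is an assumption-laden illustration: go/ast is not the compiler's internal ir representation, the real cost model charges extra for some operations such as calls, and the names countNodes and hypotheticalBudget are invented for this example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// src is a small sample program to analyze.
const src = `package p

func add(a, b float32) float32 {
	return a + b
}

func sum(s []float32) float32 {
	var t float32
	for _, v := range s {
		t = add(t, v)
	}
	return t
}
`

// hypotheticalBudget reuses the value of inlineMaxBudget for illustration only.
const hypotheticalBudget = 80

// countNodes returns the number of go/ast nodes in the body of every
// top-level function declared in source.
func countNodes(source string) (map[string]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", source, 0)
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		n := 0
		ast.Inspect(fn.Body, func(node ast.Node) bool {
			if node != nil { // Inspect also calls back with nil after each subtree
				n++
			}
			return true
		})
		counts[fn.Name.Name] = n
	}
	return counts, nil
}

func main() {
	counts, err := countNodes(src)
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"add", "sum"} {
		n := counts[name]
		fmt.Printf("%s: %d go/ast nodes, within budget %d: %v\n",
			name, n, hypotheticalBudget, n <= hypotheticalBudget)
	}
}
```

The absolute numbers will not match the costs printed by -gcflags="-m -m"; the point is only that a larger body consumes more of a fixed budget.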

Since add is inlined, its cost must be below 80. Let's look at its AST in the SSA dump:

$ GOSSAFUNC=add go build
# runtime
dumped SSA to ./ssa.html

.   RETURN tc(1) # main.go:24
.   RETURN-Results
.   .   AS tc(1) # main.go:24
.   .   .   NAME-main.~r2 esc(no) tc(1) Class:PPARAMOUT Offset:0 OnStack float32 # main.go:23
.   .   .   ADD tc(1) float32 # main.go:24 float32
.   .   .   .   NAME-main.a esc(no) tc(1) Class:PPARAM Offset:0 OnStack Used float32 # main.go:23
.   .   .   .   NAME-main.b esc(no) tc(1) Class:PPARAM Offset:0 OnStack Used float32 # main.go:23
buildssa-exit

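The dump contains only a handful of nodes, and the -m -m output above reports a cost of 4 for add, well within the inlineMaxBudget of 80.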

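Inlining removes some function calls, which means the program has been modified. When a panic occurs, however, developers still need an accurate call stack with the file name and line number where the panic happened.

Let's modify the program above to add a panic: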

func add(a, b float32) float32 {
    if b < 0 {
        panic(`Do not add negative number`)
    }
    return a + b
}

Then run the program:

$ go run main.go
panic: Do not add negative number

goroutine 1 [running]:
main.add(...)
        /home/workspace/src/github.com/crazytaxii/go-test/main.go:25
main.sum({0xc00009af4c, 0xc000094000, 0x0})
        /home/workspace/src/github.com/crazytaxii/go-test/main.go:14 +0x65
main.main()
        /home/workspace/src/github.com/crazytaxii/go-test/main.go:7 +0x85
exit status 2

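Even though the function was inlined, the trace reports the correct line number. How is that possible?

Go internally maintains a mapping for inlined functions and generates an inlining tree, which can be displayed with the option -gcflags="-d pctab=pctoinline":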

$ go build -gcflags="-d pctab=pctoinline" main.go
funcpctab "".sum [valfunc=pctoinline]
     0     -1 00000 (main.go:10)      TEXT    "".sum(SB), ABIInternal, $24-24
     0        00000 (main.go:10)      TEXT    "".sum(SB), ABIInternal, $24-24
     0     -1 00000 (main.go:10)      CMPQ    SP, 16(R14)
     4        00004 (main.go:10)      PCDATA  $0, $-2
     4        00004 (main.go:10)      JLS     102
     6        00006 (main.go:10)      PCDATA  $0, $-1
     6        00006 (main.go:10)      SUBQ    $24, SP
     a        00010 (main.go:10)      MOVQ    BP, 16(SP)
     f        00015 (main.go:10)      LEAQ    16(SP), BP
    14        00020 (main.go:10)      MOVQ    AX, "".s+32(FP)
    19        00025 (main.go:10)      FUNCDATA        $0, gclocals·1a65e721a2ccc325b382662e7ffee780(SB)
    19        00025 (main.go:10)      FUNCDATA        $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
    19        00025 (main.go:10)      FUNCDATA        $5, "".sum.arginfo1(SB)
    19        00025 (main.go:12)      XORL    CX, CX
    1b        00027 (main.go:12)      XORPS   X0, X0
    1e        00030 (main.go:12)      NOP
    20        00032 (main.go:12)      JMP     37
    22        00034 (main.go:12)      INCQ    CX
    25        00037 (main.go:12)      CMPQ    BX, CX
    28        00040 (main.go:12)      JLE     72
    2a        00042 (main.go:12)      MOVSS   (AX)(CX*4), X1
    2f        00047 (main.go:13)      XORPS   X2, X2
    32        00050 (main.go:13)      UCOMISS X0, X2
    35        00053 (main.go:13)      JLS     66
    37        00055 (<unknown line number>)     NOP
    37      0 00055 (main.go:14)      UCOMISS X1, X2
    3a        00058 (main.go:14)      JHI     82
    3c        00060 (main.go:14)      ADDSS   X1, X0
    40     -1 00064 (main.go:14)      JMP     34
    42        00066 (<unknown line number>)     NOP
    42      1 00066 (main.go:16)      SUBSS   X1, X0
    46     -1 00070 (main.go:16)      JMP     34
    48        00072 (main.go:20)      MOVQ    16(SP), BP
    4d        00077 (main.go:20)      ADDQ    $24, SP
    51        00081 (main.go:20)      RET
    52      0 00082 (main.go:14)      LEAQ    type.string(SB), AX
    59        00089 (main.go:14)      LEAQ    ""..stmp_0(SB), BX
    60        00096 (main.go:14)      PCDATA  $1, $1
    60        00096 (main.go:14)      CALL    runtime.gopanic(SB)
    65        00101 (main.go:14)      XCHGL   AX, AX
    66        00102 (main.go:14)      NOP
    66        00102 (main.go:10)      PCDATA  $1, $-1
    66        00102 (main.go:10)      PCDATA  $0, $-2
    66     -1 00102 (main.go:10)      MOVQ    AX, 8(SP)
    6b        00107 (main.go:10)      MOVQ    BX, 16(SP)
    70        00112 (main.go:10)      MOVQ    CX, 24(SP)
    75        00117 (main.go:10)      CALL    runtime.morestack_noctxt(SB)
    7a        00122 (main.go:10)      MOVQ    8(SP), AX
    7f        00127 (main.go:10)      MOVQ    16(SP), BX
    84        00132 (main.go:10)      MOVQ    24(SP), CX
    89        00137 (main.go:10)      PCDATA  $0, $-1
    89        00137 (main.go:10)      JMP     0
    8e done
wrote 15 bytes to 0xc0000db740
 00 37 02 09 01 02 04 04 03 0c 02 14 01 28 00
-- inlining tree for "".sum:
0 | -1 | "".add (main.go:14:11) pc=64
1 | -1 | "".sub (main.go:16:11) pc=70
--

The line numbers can be visualized as well, with the option -gcflags="-d pctab=pctoline":

$ go build -gcflags="-d pctab=pctoline" main.go
funcpctab "".sum [valfunc=pctoline]
     0     -1 00000 (main.go:10)      TEXT    "".sum(SB), ABIInternal, $24-24
     0        00000 (main.go:10)      TEXT    "".sum(SB), ABIInternal, $24-24
     0     10 00000 (main.go:10)      CMPQ    SP, 16(R14)
     4        00004 (main.go:10)      PCDATA  $0, $-2
     4        00004 (main.go:10)      JLS     102
     6        00006 (main.go:10)      PCDATA  $0, $-1
     6        00006 (main.go:10)      SUBQ    $24, SP
     a        00010 (main.go:10)      MOVQ    BP, 16(SP)
     f        00015 (main.go:10)      LEAQ    16(SP), BP
    14        00020 (main.go:10)      MOVQ    AX, "".s+32(FP)
    19        00025 (main.go:10)      FUNCDATA        $0, gclocals·1a65e721a2ccc325b382662e7ffee780(SB)
    19        00025 (main.go:10)      FUNCDATA        $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
    19        00025 (main.go:10)      FUNCDATA        $5, "".sum.arginfo1(SB)
    19     12 00025 (main.go:12)      XORL    CX, CX
    1b        00027 (main.go:12)      XORPS   X0, X0
    1e        00030 (main.go:12)      NOP
    20        00032 (main.go:12)      JMP     37
    22        00034 (main.go:12)      INCQ    CX
    25        00037 (main.go:12)      CMPQ    BX, CX
    28        00040 (main.go:12)      JLE     72
    2a        00042 (main.go:12)      MOVSS   (AX)(CX*4), X1
    2f     13 00047 (main.go:13)      XORPS   X2, X2
    32        00050 (main.go:13)      UCOMISS X0, X2
    35        00053 (main.go:13)      JLS     66
    37        00055 (<unknown line number>)     NOP
    37     24 00055 (main.go:14)      UCOMISS X1, X2
    3a        00058 (main.go:14)      JHI     82
    3c     28 00060 (main.go:14)      ADDSS   X1, X0
    40     14 00064 (main.go:14)      JMP     34
    42        00066 (<unknown line number>)     NOP
    42     32 00066 (main.go:16)      SUBSS   X1, X0
    46     16 00070 (main.go:16)      JMP     34
    48     20 00072 (main.go:20)      MOVQ    16(SP), BP
    4d        00077 (main.go:20)      ADDQ    $24, SP
    51        00081 (main.go:20)      RET
    52     25 00082 (main.go:14)      LEAQ    type.string(SB), AX
    59        00089 (main.go:14)      LEAQ    ""..stmp_0(SB), BX
    60        00096 (main.go:14)      PCDATA  $1, $1
    60        00096 (main.go:14)      CALL    runtime.gopanic(SB)
    65        00101 (main.go:14)      XCHGL   AX, AX
    66        00102 (main.go:14)      NOP
    66        00102 (main.go:10)      PCDATA  $1, $-1
    66        00102 (main.go:10)      PCDATA  $0, $-2
    66     10 00102 (main.go:10)      MOVQ    AX, 8(SP)
    6b        00107 (main.go:10)      MOVQ    BX, 16(SP)
    70        00112 (main.go:10)      MOVQ    CX, 24(SP)
    75        00117 (main.go:10)      CALL    runtime.morestack_noctxt(SB)
    7a        00122 (main.go:10)      MOVQ    8(SP), AX
    7f        00127 (main.go:10)      MOVQ    16(SP), BX
    84        00132 (main.go:10)      MOVQ    24(SP), CX
    89        00137 (main.go:10)      PCDATA  $0, $-1
    89        00137 (main.go:10)      JMP     0
    8e done

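This gives a correct mapping for the generated instructions:

PC   Instruction     Inline index   Function   Line
3c   ADDSS X1, X0    0              add        28
40   JMP 34          -1             sum        14
42   SUBSS X1, X0    1              sub        32
46   JMP 34          -1             sum        16

This table is embedded in the binary and read at runtime to produce accurate stack traces.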

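The point of inlining is to improve performance, because a function call has overhead: creating a new stack frame, and saving and restoring registers. Everything has two sides, though: copying code instead of calling a function inevitably increases the binary size. The go1 benchmark suite shows the gain that inlining brings; the old column is built with -gcflags=-l, which disables inlining: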

$ go test -gcflags=-l -bench=. -run=^# -count=5 | tee old.txt
$ go test -bench=. -run=^# -count=5 | tee new.txt
$ benchstat old.txt new.txt
name                     old time/op    new time/op    delta
BinaryTree17-6              1.73s ± 4%     1.70s ± 4%     ~     (p=0.421 n=5+5)
Fannkuch11-6                2.08s ± 5%     2.09s ± 6%     ~     (p=1.000 n=5+5)
FmtFprintfEmpty-6          24.7ns ± 5%    22.7ns ± 3%   -8.30%  (p=0.008 n=5+5)
FmtFprintfString-6         49.2ns ± 3%    41.0ns ± 2%  -16.73%  (p=0.008 n=5+5)
FmtFprintfInt-6            55.3ns ± 6%    49.7ns ± 6%  -10.08%  (p=0.016 n=5+5)
FmtFprintfIntInt-6         81.8ns ± 5%    74.0ns ± 4%   -9.61%  (p=0.008 n=5+5)
FmtFprintfPrefixedInt-6    85.2ns ± 5%    78.9ns ± 6%   -7.40%  (p=0.032 n=5+5)
FmtFprintfFloat-6           135ns ± 5%     132ns ± 6%     ~     (p=0.548 n=5+5)
FmtManyArgs-6               342ns ± 5%     323ns ± 1%   -5.63%  (p=0.008 n=5+5)
GobDecode-6                3.42ms ± 7%    3.31ms ± 6%     ~     (p=0.421 n=5+5)
GobEncode-6                2.53ms ± 4%    2.32ms ± 7%   -8.16%  (p=0.016 n=5+5)
Gzip-6                      165ms ± 4%     156ms ± 2%   -5.56%  (p=0.008 n=5+5)
Gunzip-6                   22.6ms ± 5%    21.8ms ± 5%     ~     (p=0.095 n=5+5)
HTTPClientServer-6          105µs ±12%      95µs ± 8%     ~     (p=0.095 n=5+5)
JSONEncode-6               6.57ms ± 1%    5.97ms ± 5%   -9.14%  (p=0.008 n=5+5)
JSONDecode-6               27.9ms ± 6%    26.6ms ± 1%   -4.79%  (p=0.008 n=5+5)
Mandelbrot200-6            3.22ms ± 5%    3.20ms ± 7%     ~     (p=0.310 n=5+5)
GoParse-6                  2.23ms ± 3%    2.19ms ± 1%     ~     (p=0.310 n=5+5)
RegexpMatchEasy0_32-6      42.5ns ± 4%    42.5ns ± 1%     ~     (p=0.651 n=5+5)
RegexpMatchEasy0_1K-6       130ns ± 7%     118ns ± 0%   -9.00%  (p=0.008 n=5+5)
RegexpMatchEasy1_32-6      39.3ns ± 4%    35.1ns ± 2%  -10.76%  (p=0.008 n=5+5)
RegexpMatchEasy1_1K-6       185ns ± 3%     179ns ± 0%   -3.13%  (p=0.008 n=5+5)
RegexpMatchMedium_32-6      650ns ± 5%     668ns ± 1%     ~     (p=0.548 n=5+5)
RegexpMatchMedium_1K-6     21.0µs ± 6%    19.5µs ±10%     ~     (p=0.095 n=5+5)
RegexpMatchHard_32-6       1.04µs ± 5%    0.90µs ± 2%  -13.09%  (p=0.008 n=5+5)
RegexpMatchHard_1K-6       29.4µs ± 3%    27.4µs ± 2%   -7.00%  (p=0.008 n=5+5)
Revcomp-6                   270ms ± 6%     268ms ± 2%     ~     (p=0.690 n=5+5)
Template-6                 34.9ms ± 3%    35.0ms ± 3%     ~     (p=0.841 n=5+5)
TimeParse-6                 162ns ± 2%     162ns ± 5%     ~     (p=0.730 n=5+5)
TimeFormat-6                198ns ± 5%     191ns ± 1%     ~     (p=0.310 n=5+5)
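A similar comparison can be sketched on a much smaller scale with the //go:noinline directive and testing.Benchmark. The function names here are hypothetical, and the timings depend entirely on the machine and Go version, so no expected numbers are shown.

```go
package main

import (
	"fmt"
	"testing"
)

// addInline is small enough to be an inlining candidate.
func addInline(a, b int) int { return a + b }

// addNoInline has the same body, but the directive below tells the
// compiler never to inline it.
//
//go:noinline
func addNoInline(a, b int) int { return a + b }

// sink is a package-level variable that keeps the compiler from
// discarding the benchmark loops as dead code.
var sink int

func benchInline(b *testing.B) {
	t := 0
	for i := 0; i < b.N; i++ {
		t = addInline(t, i)
	}
	sink = t
}

func benchNoInline(b *testing.B) {
	t := 0
	for i := 0; i < b.N; i++ {
		t = addNoInline(t, i)
	}
	sink = t
}

func main() {
	// testing.Benchmark runs a benchmark function outside of "go test".
	fmt.Println("inlinable:", testing.Benchmark(benchInline))
	fmt.Println("noinline: ", testing.Benchmark(benchNoInline))
}
```

A micro-benchmark like this isolates call overhead and tends to exaggerate the effect; the go1 results above are more representative of real programs.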