Go Inlining
Aug 14, 2022 14:30
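Inlining replaces a function call with the body of the called function. The optimization increases the size of the binary, but it can improve the program's performance. The compiler does not inline every function it encounters, of course; Go follows a set of rules.
Let's start with an example to see what inlining means.
Everything below is based on Go 1.17.13. The inlining rules have changed as Go has evolved!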
package main

import "fmt"

func main() {
	n := []float32{120.4, -46.7, 32.50, 34.65, -67.45}
	fmt.Printf("The total is %.02f\n", sum(n))
}

func sum(s []float32) float32 {
	var t float32
	for _, v := range s {
		if t < 0 {
			t = add(t, v)
		} else {
			t = sub(t, v)
		}
	}

	return t
}

func add(a, b float32) float32 {
	return a + b
}

func sub(a, b float32) float32 {
	return a - b
}
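To make "replacing the call with the function body" concrete, here is a small sketch of what the loop in sum would look like if the calls to add and sub were expanded by hand. The name sumInlinedByHand is hypothetical and only serves the illustration; the compiler performs this rewrite on its internal representation, not on source text.

```go
package main

import "fmt"

func add(a, b float32) float32 { return a + b }

func sub(a, b float32) float32 { return a - b }

// sum mirrors the article's example: every iteration calls add or sub.
func sum(s []float32) float32 {
	var t float32
	for _, v := range s {
		if t < 0 {
			t = add(t, v)
		} else {
			t = sub(t, v)
		}
	}
	return t
}

// sumInlinedByHand is a hypothetical hand-written equivalent in which the
// bodies of add and sub have been substituted for the calls.
func sumInlinedByHand(s []float32) float32 {
	var t float32
	for _, v := range s {
		if t < 0 {
			t = t + v // body of add
		} else {
			t = t - v // body of sub
		}
	}
	return t
}

func main() {
	n := []float32{120.4, -46.7, 32.50, 34.65, -67.45}
	// Both versions perform the same float32 operations in the same order.
	fmt.Println(sum(n) == sumInlinedByHand(n))
}
```

Both functions compute the same result; the second one simply has no call overhead left to remove.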
Pass -gcflags="-m" to see the compiler's optimization decisions:
$ go build -gcflags="-m" main.go
# command-line-arguments
./main.go:23:6: can inline add
./main.go:27:6: can inline sub
./main.go:14:11: inlining call to add
./main.go:16:11: inlining call to sub
./main.go:7:12: inlining call to fmt.Printf
./main.go:10:10: s does not escape
./main.go:6:16: []float32{...} does not escape
./main.go:7:40: sum(n) escapes to heap
./main.go:7:12: []interface {}{...} does not escape
<autogenerated>:1: leaking param content: .this
Both add and sub are inlined. But what about sum? Passing -m twice prints more detailed optimization information:
$ go build -gcflags="-m -m" main.go
# command-line-arguments
./main.go:23:6: can inline add with cost 4 as: func(float32, float32) float32 { return a + b }
./main.go:27:6: can inline sub with cost 4 as: func(float32, float32) float32 { return a - b }
./main.go:10:6: cannot inline sum: unhandled op RANGE # here
./main.go:14:11: inlining call to add func(float32, float32) float32 { return a + b }
./main.go:16:11: inlining call to sub func(float32, float32) float32 { return a - b }
./main.go:5:6: cannot inline main: function too complex: cost 148 exceeds budget 80
./main.go:7:12: inlining call to fmt.Printf func(string, ...interface {}) (int, error) { var fmt..autotmp_4 int; fmt..autotmp_4 = <nil>; var fmt..autotmp_5 error; fmt..autotmp_5 = <nil>; fmt..autotmp_4, fmt..autotmp_5 = fmt.Fprintf(io.Writer(os.Stdout), fmt.format, fmt.a...); return fmt..autotmp_4, fmt..autotmp_5 }
./main.go:10:10: s does not escape
./main.go:7:40: sum(n) escapes to heap:
./main.go:7:40: flow: ~arg1 = &{storage for sum(n)}:
./main.go:7:40: from sum(n) (spill) at ./main.go:7:40
./main.go:7:40: from fmt.format, ~arg1 := "The total is %.02f\n", sum(n) (assign-pair) at ./main.go:7:12
./main.go:7:40: flow: {storage for []interface {}{...}} = ~arg1:
./main.go:7:40: from []interface {}{...} (slice-literal-element) at ./main.go:7:12
./main.go:7:40: flow: fmt.a = &{storage for []interface {}{...}}:
./main.go:7:40: from []interface {}{...} (spill) at ./main.go:7:12
./main.go:7:40: from fmt.a = []interface {}{...} (assign) at ./main.go:7:12
./main.go:7:40: flow: {heap} = *fmt.a:
./main.go:7:40: from fmt.Fprintf(io.Writer(os.Stdout), fmt.format, fmt.a...) (call parameter) at ./main.go:7:12
./main.go:6:16: []float32{...} does not escape
./main.go:7:40: sum(n) escapes to heap
./main.go:7:12: []interface {}{...} does not escape
<autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
<autogenerated>:1: flow: {heap} = *.this:
<autogenerated>:1: from .this.file (dot of pointer) at <autogenerated>:1
<autogenerated>:1: from .this.file.close() (call parameter) at <autogenerated>:1
<autogenerated>:1: leaking param content: .this
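Go 1.17 does not inline functions that use range. As the compiler source below shows, select, defer, go statements, calls to recover, and labeled for or switch statements also block inlining, and closures block it when the InlFuncsWithClosures debug setting is 0.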
The compiler's inlining code lives in src/cmd/compile/internal/inline/inl.go:
func (v *hairyVisitor) doNode(n ir.Node) bool {
	// ...
	switch n.Op() {
	case ir.ORECOVER:
		v.reason = "call to recover"
		return true
	case ir.OCLOSURE:
		if base.Debug.InlFuncsWithClosures == 0 {
			v.reason = "not inlining functions with closures"
			return true
		}
	case ir.ORANGE,
		ir.OSELECT,
		ir.OGO,
		ir.ODEFER,
		ir.ODCLTYPE, // can't print yet
		ir.OTAILCALL:
		v.reason = "unhandled op " + n.Op().String()
		return true
	case ir.OFOR, ir.OFORUNTIL:
		n := n.(*ir.ForStmt)
		if n.Label != nil {
			v.reason = "labeled control"
			return true
		}
	case ir.OSWITCH:
		n := n.(*ir.SwitchStmt)
		if n.Label != nil {
			v.reason = "labeled control"
			return true
		}
	}
}
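The excerpt suggests a simple experiment: the same summation written once with range and once with an unlabeled for loop. The sketch below uses two hypothetical functions, sumRange and sumIndex. Going by the Go 1.17 rules quoted above, one would expect go build -gcflags="-m -m" to reject sumRange with "unhandled op RANGE" and to consider sumIndex as long as it stays within budget; other Go releases may decide differently.

```go
package main

import "fmt"

// sumRange uses a range loop, which hits the ir.ORANGE case in the excerpt.
func sumRange(s []float32) float32 {
	var t float32
	for _, v := range s {
		t += v
	}
	return t
}

// sumIndex uses an unlabeled for loop, which the excerpt only rejects
// when the loop carries a label.
func sumIndex(s []float32) float32 {
	var t float32
	for i := 0; i < len(s); i++ {
		t += s[i]
	}
	return t
}

func main() {
	n := []float32{1, 2, 3.5}
	fmt.Println(sumRange(n), sumIndex(n))
}
```

The two functions behave identically at run time; only the compiler's inlining decision may differ.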
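While walking the AST, Go gives each inlining candidate a budget of only 80 nodes. For example, the expression a = a + 1 has 5 nodes: AS, NAME, ADD, NAME, LITERAL. If a function's AST has more nodes than the budget allows, that also blocks inlining: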
const (
	inlineMaxBudget = 80
)

func (v *hairyVisitor) tooHairy(fn *ir.Func) bool {
	v.do = v.doNode // cache closure
	if ir.DoChildren(fn, v.do) {
		return true
	}
	if v.budget < 0 {
		v.reason = fmt.Sprintf("function too complex: cost %d exceeds budget %d", inlineMaxBudget-v.budget, inlineMaxBudget)
		return true
	}
	return false
}
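The node count for a = a + 1 can be reproduced outside the compiler. The sketch below uses the standard go/ast package, which is not the compiler's internal IR and does not apply the compiler's cost model, so it is only an approximation of the idea. The helpers countNodes and firstAssign are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// countNodes returns the number of go/ast nodes in the subtree rooted at root.
func countNodes(root ast.Node) int {
	count := 0
	ast.Inspect(root, func(n ast.Node) bool {
		if n != nil { // Inspect also calls the function with nil after the children
			count++
		}
		return true
	})
	return count
}

// firstAssign parses src and returns the first assignment statement in it.
func firstAssign(src string) (*ast.AssignStmt, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	var found *ast.AssignStmt
	ast.Inspect(file, func(n ast.Node) bool {
		if as, ok := n.(*ast.AssignStmt); ok && found == nil {
			found = as
		}
		return found == nil // stop descending once a statement is found
	})
	if found == nil {
		return nil, fmt.Errorf("no assignment statement in source")
	}
	return found, nil
}

func main() {
	src := "package p\nfunc f(a int) int {\n\ta = a + 1\n\treturn a\n}\n"
	as, err := firstAssign(src)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// go/ast sees AssignStmt, Ident, BinaryExpr, Ident, BasicLit.
	fmt.Println(countNodes(as)) // prints 5
}
```

The five go/ast nodes line up with AS, NAME, ADD, NAME, LITERAL from the text, although the compiler's own count for a whole function can differ from what go/ast reports.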
Since add was inlined, its AST cannot have exceeded 80 nodes. Let's look at the SSA dump:
$ GOSSAFUNC=add go build
# runtime
dumped SSA to ./ssa.html
. RETURN tc(1) # main.go:24
. RETURN-Results
. . AS tc(1) # main.go:24
. . . NAME-main.~r2 esc(no) tc(1) Class:PPARAMOUT Offset:0 OnStack float32 # main.go:23
. . . ADD tc(1) float32 # main.go:24 float32
. . . . NAME-main.a esc(no) tc(1) Class:PPARAM Offset:0 OnStack Used float32 # main.go:23
. . . . NAME-main.b esc(no) tc(1) Class:PPARAM Offset:0 OnStack Used float32 # main.go:23
buildssa-exit
The dump lists 7 entries in total, well within the inlineMaxBudget of 80 (the -m -m output above reported a cost of 4 for add).
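Inlining removes some function calls, which means the program has been modified. When a panic occurs, however, developers still need an accurate call stack to find the file name and line number where it happened.
Let's modify the program above to add a panic: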
func add(a, b float32) float32 {
	if b < 0 {
		panic(`Do not add negative number`)
	}
	return a + b
}
Then run the program:
$ go run main.go
panic: Do not add negative number
goroutine 1 [running]:
main.add(...)
/home/workspace/src/github.com/crazytaxii/go-test/main.go:25
main.sum({0xc00009af4c, 0xc000094000, 0x0})
/home/workspace/src/github.com/crazytaxii/go-test/main.go:14 +0x65
main.main()
/home/workspace/src/github.com/crazytaxii/go-test/main.go:7 +0x85
exit status 2
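Even though the function was inlined, the trace reports the correct line numbers. How does that work?
Go keeps an internal mapping of inlined functions and builds an inlining tree from it, which the option -gcflags="-d pctab=pctoinline" makes visible: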
$ go build -gcflags="-d pctab=pctoinline" main.go
funcpctab "".sum [valfunc=pctoinline]
0 -1 00000 (main.go:10) TEXT "".sum(SB), ABIInternal, $24-24
0 00000 (main.go:10) TEXT "".sum(SB), ABIInternal, $24-24
0 -1 00000 (main.go:10) CMPQ SP, 16(R14)
4 00004 (main.go:10) PCDATA $0, $-2
4 00004 (main.go:10) JLS 102
6 00006 (main.go:10) PCDATA $0, $-1
6 00006 (main.go:10) SUBQ $24, SP
a 00010 (main.go:10) MOVQ BP, 16(SP)
f 00015 (main.go:10) LEAQ 16(SP), BP
14 00020 (main.go:10) MOVQ AX, "".s+32(FP)
19 00025 (main.go:10) FUNCDATA $0, gclocals·1a65e721a2ccc325b382662e7ffee780(SB)
19 00025 (main.go:10) FUNCDATA $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
19 00025 (main.go:10) FUNCDATA $5, "".sum.arginfo1(SB)
19 00025 (main.go:12) XORL CX, CX
1b 00027 (main.go:12) XORPS X0, X0
1e 00030 (main.go:12) NOP
20 00032 (main.go:12) JMP 37
22 00034 (main.go:12) INCQ CX
25 00037 (main.go:12) CMPQ BX, CX
28 00040 (main.go:12) JLE 72
2a 00042 (main.go:12) MOVSS (AX)(CX*4), X1
2f 00047 (main.go:13) XORPS X2, X2
32 00050 (main.go:13) UCOMISS X0, X2
35 00053 (main.go:13) JLS 66
37 00055 (<unknown line number>) NOP
37 0 00055 (main.go:14) UCOMISS X1, X2
3a 00058 (main.go:14) JHI 82
3c 00060 (main.go:14) ADDSS X1, X0
40 -1 00064 (main.go:14) JMP 34
42 00066 (<unknown line number>) NOP
42 1 00066 (main.go:16) SUBSS X1, X0
46 -1 00070 (main.go:16) JMP 34
48 00072 (main.go:20) MOVQ 16(SP), BP
4d 00077 (main.go:20) ADDQ $24, SP
51 00081 (main.go:20) RET
52 0 00082 (main.go:14) LEAQ type.string(SB), AX
59 00089 (main.go:14) LEAQ ""..stmp_0(SB), BX
60 00096 (main.go:14) PCDATA $1, $1
60 00096 (main.go:14) CALL runtime.gopanic(SB)
65 00101 (main.go:14) XCHGL AX, AX
66 00102 (main.go:14) NOP
66 00102 (main.go:10) PCDATA $1, $-1
66 00102 (main.go:10) PCDATA $0, $-2
66 -1 00102 (main.go:10) MOVQ AX, 8(SP)
6b 00107 (main.go:10) MOVQ BX, 16(SP)
70 00112 (main.go:10) MOVQ CX, 24(SP)
75 00117 (main.go:10) CALL runtime.morestack_noctxt(SB)
7a 00122 (main.go:10) MOVQ 8(SP), AX
7f 00127 (main.go:10) MOVQ 16(SP), BX
84 00132 (main.go:10) MOVQ 24(SP), CX
89 00137 (main.go:10) PCDATA $0, $-1
89 00137 (main.go:10) JMP 0
8e done
wrote 15 bytes to 0xc0000db740
00 37 02 09 01 02 04 04 03 0c 02 14 01 28 00
-- inlining tree for "".sum:
0 | -1 | "".add (main.go:14:11) pc=64
1 | -1 | "".sub (main.go:16:11) pc=70
--
The option -gcflags="-d pctab=pctoline" shows the line numbers in the same way:
$ go build -gcflags="-d pctab=pctoline" main.go
funcpctab "".sum [valfunc=pctoline]
0 -1 00000 (main.go:10) TEXT "".sum(SB), ABIInternal, $24-24
0 00000 (main.go:10) TEXT "".sum(SB), ABIInternal, $24-24
0 10 00000 (main.go:10) CMPQ SP, 16(R14)
4 00004 (main.go:10) PCDATA $0, $-2
4 00004 (main.go:10) JLS 102
6 00006 (main.go:10) PCDATA $0, $-1
6 00006 (main.go:10) SUBQ $24, SP
a 00010 (main.go:10) MOVQ BP, 16(SP)
f 00015 (main.go:10) LEAQ 16(SP), BP
14 00020 (main.go:10) MOVQ AX, "".s+32(FP)
19 00025 (main.go:10) FUNCDATA $0, gclocals·1a65e721a2ccc325b382662e7ffee780(SB)
19 00025 (main.go:10) FUNCDATA $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
19 00025 (main.go:10) FUNCDATA $5, "".sum.arginfo1(SB)
19 12 00025 (main.go:12) XORL CX, CX
1b 00027 (main.go:12) XORPS X0, X0
1e 00030 (main.go:12) NOP
20 00032 (main.go:12) JMP 37
22 00034 (main.go:12) INCQ CX
25 00037 (main.go:12) CMPQ BX, CX
28 00040 (main.go:12) JLE 72
2a 00042 (main.go:12) MOVSS (AX)(CX*4), X1
2f 13 00047 (main.go:13) XORPS X2, X2
32 00050 (main.go:13) UCOMISS X0, X2
35 00053 (main.go:13) JLS 66
37 00055 (<unknown line number>) NOP
37 24 00055 (main.go:14) UCOMISS X1, X2
3a 00058 (main.go:14) JHI 82
3c 28 00060 (main.go:14) ADDSS X1, X0
40 14 00064 (main.go:14) JMP 34
42 00066 (<unknown line number>) NOP
42 32 00066 (main.go:16) SUBSS X1, X0
46 16 00070 (main.go:16) JMP 34
48 20 00072 (main.go:20) MOVQ 16(SP), BP
4d 00077 (main.go:20) ADDQ $24, SP
51 00081 (main.go:20) RET
52 25 00082 (main.go:14) LEAQ type.string(SB), AX
59 00089 (main.go:14) LEAQ ""..stmp_0(SB), BX
60 00096 (main.go:14) PCDATA $1, $1
60 00096 (main.go:14) CALL runtime.gopanic(SB)
65 00101 (main.go:14) XCHGL AX, AX
66 00102 (main.go:14) NOP
66 00102 (main.go:10) PCDATA $1, $-1
66 00102 (main.go:10) PCDATA $0, $-2
66 10 00102 (main.go:10) MOVQ AX, 8(SP)
6b 00107 (main.go:10) MOVQ BX, 16(SP)
70 00112 (main.go:10) MOVQ CX, 24(SP)
75 00117 (main.go:10) CALL runtime.morestack_noctxt(SB)
7a 00122 (main.go:10) MOVQ 8(SP), AX
7f 00127 (main.go:10) MOVQ 16(SP), BX
84 00132 (main.go:10) MOVQ 24(SP), CX
89 00137 (main.go:10) PCDATA $0, $-1
89 00137 (main.go:10) JMP 0
8e done
Together these give a correct mapping for the generated instructions:
| PC | Instruction | func | line |
|---|---|---|---|
| 3c | ADDSS X1, X0 | 0 add | 28 |
| 40 | JMP 34 | -1 sum | 14 |
| 42 | SUBSS X1, X0 | 1 sub | 32 |
| 46 | JMP 34 | -1 sum | 16 |
This table is embedded in the binary and read at run time to produce accurate stack traces.
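The point of inlining is to improve performance, because a function call has a cost: a new stack frame is created, and registers are saved and restored. Everything has two sides, though, and copying code instead of calling a function inevitably makes the binary larger. The go1 benchmark suite shows how much performance inlining brings: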
$ go test -gcflags=-l -bench=. -run=^# -count=5 | tee old.txt
$ go test -bench=. -run=^# -count=5 | tee new.txt
$ benchstat old.txt new.txt
name old time/op new time/op delta
BinaryTree17-6 1.73s ± 4% 1.70s ± 4% ~ (p=0.421 n=5+5)
Fannkuch11-6 2.08s ± 5% 2.09s ± 6% ~ (p=1.000 n=5+5)
FmtFprintfEmpty-6 24.7ns ± 5% 22.7ns ± 3% -8.30% (p=0.008 n=5+5)
FmtFprintfString-6 49.2ns ± 3% 41.0ns ± 2% -16.73% (p=0.008 n=5+5)
FmtFprintfInt-6 55.3ns ± 6% 49.7ns ± 6% -10.08% (p=0.016 n=5+5)
FmtFprintfIntInt-6 81.8ns ± 5% 74.0ns ± 4% -9.61% (p=0.008 n=5+5)
FmtFprintfPrefixedInt-6 85.2ns ± 5% 78.9ns ± 6% -7.40% (p=0.032 n=5+5)
FmtFprintfFloat-6 135ns ± 5% 132ns ± 6% ~ (p=0.548 n=5+5)
FmtManyArgs-6 342ns ± 5% 323ns ± 1% -5.63% (p=0.008 n=5+5)
GobDecode-6 3.42ms ± 7% 3.31ms ± 6% ~ (p=0.421 n=5+5)
GobEncode-6 2.53ms ± 4% 2.32ms ± 7% -8.16% (p=0.016 n=5+5)
Gzip-6 165ms ± 4% 156ms ± 2% -5.56% (p=0.008 n=5+5)
Gunzip-6 22.6ms ± 5% 21.8ms ± 5% ~ (p=0.095 n=5+5)
HTTPClientServer-6 105µs ±12% 95µs ± 8% ~ (p=0.095 n=5+5)
JSONEncode-6 6.57ms ± 1% 5.97ms ± 5% -9.14% (p=0.008 n=5+5)
JSONDecode-6 27.9ms ± 6% 26.6ms ± 1% -4.79% (p=0.008 n=5+5)
Mandelbrot200-6 3.22ms ± 5% 3.20ms ± 7% ~ (p=0.310 n=5+5)
GoParse-6 2.23ms ± 3% 2.19ms ± 1% ~ (p=0.310 n=5+5)
RegexpMatchEasy0_32-6 42.5ns ± 4% 42.5ns ± 1% ~ (p=0.651 n=5+5)
RegexpMatchEasy0_1K-6 130ns ± 7% 118ns ± 0% -9.00% (p=0.008 n=5+5)
RegexpMatchEasy1_32-6 39.3ns ± 4% 35.1ns ± 2% -10.76% (p=0.008 n=5+5)
RegexpMatchEasy1_1K-6 185ns ± 3% 179ns ± 0% -3.13% (p=0.008 n=5+5)
RegexpMatchMedium_32-6 650ns ± 5% 668ns ± 1% ~ (p=0.548 n=5+5)
RegexpMatchMedium_1K-6 21.0µs ± 6% 19.5µs ±10% ~ (p=0.095 n=5+5)
RegexpMatchHard_32-6 1.04µs ± 5% 0.90µs ± 2% -13.09% (p=0.008 n=5+5)
RegexpMatchHard_1K-6 29.4µs ± 3% 27.4µs ± 2% -7.00% (p=0.008 n=5+5)
Revcomp-6 270ms ± 6% 268ms ± 2% ~ (p=0.690 n=5+5)
Template-6 34.9ms ± 3% 35.0ms ± 3% ~ (p=0.841 n=5+5)
TimeParse-6 162ns ± 2% 162ns ± 5% ~ (p=0.730 n=5+5)
TimeFormat-6 198ns ± 5% 191ns ± 1% ~ (p=0.310 n=5+5)
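The effect can also be observed on a single function rather than the whole go1 suite. The following sketch is an alternative to the -gcflags=-l approach used above: it compares an ordinary function with a copy marked //go:noinline, using testing.Benchmark so that it runs as a plain program. The names addInline, addNoInline and sink are hypothetical, and the measured numbers depend entirely on the machine and Go version, so none are shown here.

```go
package main

import (
	"fmt"
	"testing"
)

// addInline is small enough to be an inlining candidate.
func addInline(a, b float32) float32 { return a + b }

// addNoInline has the same body, but the directive forbids inlining it.
//
//go:noinline
func addNoInline(a, b float32) float32 { return a + b }

// sink keeps the results alive so the loops are not optimized away.
var sink float32

func benchInline(b *testing.B) {
	var t float32
	for i := 0; i < b.N; i++ {
		t = addInline(t, 1)
	}
	sink = t
}

func benchNoInline(b *testing.B) {
	var t float32
	for i := 0; i < b.N; i++ {
		t = addNoInline(t, 1)
	}
	sink = t
}

func main() {
	fmt.Println("inline:  ", testing.Benchmark(benchInline))
	fmt.Println("noinline:", testing.Benchmark(benchNoInline))
}
```

For results that can be fed to benchstat as in the article, the same two loops would go into Benchmark functions in a _test.go file and be run with go test -bench.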