Golang Memory Profiling

Aug 8, 2020 09:00 · 6611 words · 14 minute read Golang

Introduction

In the previous post, I taught the basics of escape analysis by using an example that shared a value up a goroutine's stack. What I did not show you are the other scenarios that can cause values to escape. To help you with this, I am going to debug a program that allocates in surprising ways.

The Program

I wanted to learn more about the io package, so I gave myself a quick project: given a stream of bytes, write a function that finds the string elvis and replaces it with the capitalized version, Elvis. We are talking about the King here, so his name should always be capitalized.

Here is a link to the solution: https://play.golang.org/p/n_SzF4Cer4

Here is a link to the benchmarks: https://play.golang.org/p/TnXrxJVfLV

The code listing has two different functions that solve this problem. This post focuses on the algOne function, since it uses the io package. Use the algTwo function to experiment with memory and CPU profiles on your own.

Input:
abcelvisaElvisabcelviseelvisaelvisaabeeeelvise l v i saa bb e l v i saa elvi
selvielviselvielvielviselvi1elvielviselvis

Output:
abcElvisaElvisabcElviseElvisaElvisaabeeeElvise l v i saa bb e l v i saa elvi
selviElviselvielviElviselvi1elviElvisElvis

Here is a listing of the algOne function in its entirety:

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := io.ReadFull(input, buf[end:]); err != nil {
102
103         // Flush the rest of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := io.ReadFull(input, buf[:end]); err != nil {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }

What I want to know is how well this function performs and what kind of pressure it places on the heap. To learn this, we are going to run a benchmark.

Benchmarking

Here is the benchmark function I wrote; it calls into the algOne function to perform the data stream processing.

15 func BenchmarkAlgorithmOne(b *testing.B) {
16     var output bytes.Buffer
17     in := assembleInputStream()
18     find := []byte("elvis")
19     repl := []byte("Elvis")
20
21     b.ResetTimer()
22
23     for i := 0; i < b.N; i++ {
24         output.Reset()
25         algOne(in, find, repl, &output)
26     }
27 }
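The benchmark depends on assembleInputStream, whose real implementation lives in the playground links above. As a hypothetical stand-in (the repeat count and construction here are my assumptions, not the author's code), it could look something like this:

```go
package main

import (
	"bytes"
	"fmt"
)

// assembleInputStream is a hypothetical stand-in for the helper defined in
// the playground listing. It repeats the sample input to build a stream
// large enough to be worth processing.
func assembleInputStream() []byte {
	in := []byte("abcelvisaElvisabcelviseelvisaelvisaabeeeelvise l v i saa bb e l v i saa elviselvielviselvielvielviselvi1elvielviselvis")
	return bytes.Repeat(in, 100)
}

func main() {
	in := assembleInputStream()
	fmt.Println(len(in) > 0, len(in)%100 == 0)
}
```

Any deterministic byte stream works here; the only requirement is that every benchmark iteration processes identical input so the ns/op numbers are comparable.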

With this benchmark function in place, we can run it through go test using the -bench, -benchtime and -benchmem switches.

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8    	2000000 	     2522 ns/op       117 B/op  	      2 allocs/op

After running the benchmark, we can see that the algOne function allocates 2 values worth a total of 117 bytes per operation. This is great, but we need to know which lines of code in the function are causing these allocations. To learn this, we need to generate profiling data for the benchmark.

Profiling

To generate the profile data, we run the benchmark again, this time asking for a memory profile with the -memprofile switch.

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op

Once the benchmark is finished, the test tool has produced two new files.

~/code/go/src/.../memcpu
$ ls -l
total 9248
-rw-r--r--  1 bill  staff      209 May 22 18:11 mem.out       (NEW)
-rwxr-xr-x  1 bill  staff  2847600 May 22 18:10 memcpu.test   (NEW)
-rw-r--r--  1 bill  staff     4761 May 22 18:01 stream.go
-rw-r--r--  1 bill  staff      880 May 22 14:49 stream_test.go

The source code is in a folder called memcpu, with the algOne function in stream.go and the benchmark function in stream_test.go. The two new files are mem.out and memcpu.test. The mem.out file contains the profile data, and the memcpu.test file, named after the folder, contains a test binary we need for access to symbols when looking at the profile data.

With the profile data and test binary in place, we can run the pprof tool to study the profile data.

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) _

When profiling memory and looking for low-hanging fruit, use the -alloc_space option instead of the default -inuse_space option. This shows where every allocation happened, regardless of whether the memory is still in use at the time you take the profile.

From the (pprof) prompt, we can inspect the algOne function using the list command. This command takes a regular expression as an argument to find the function(s) you want to look at.

(pprof) list algOne
Total: 335.03MB
ROUTINE ======================== .../memcpu.algOne in code/go/src/.../memcpu/stream.go
 335.03MB   335.03MB (flat, cum)   100% of Total
        .          .     78:
        .          .     79:// algOne is one way to solve the problem.
        .          .     80:func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
        .          .     81:
        .          .     82: // Use a bytes Buffer to provide a stream to process.
 318.53MB   318.53MB     83: input := bytes.NewBuffer(data)
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
  16.50MB    16.50MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := io.ReadFull(input, buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])
(pprof) _

Based on this profile, we now know that input and the backing array of the buf slice are allocated on the heap. Even though input is a pointer variable, the profile is really saying that the bytes.Buffer value the input pointer points to is allocating. Let's focus on the input allocation first and understand why it is happening.

We could assume it is allocating because the call to bytes.NewBuffer shares the bytes.Buffer value it creates up the call stack. However, the presence of a value in the flat column (the first column in the pprof output) tells me the value is allocating because the algOne function itself is sharing it in a way that causes it to escape.

I know the flat column represents in-function allocations because of what the list command shows for the Benchmark function, which calls into algOne:
(pprof) list Benchmark
Total: 335.03MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
        0   335.03MB (flat, cum)   100% of Total
        .          .     18: find := []byte("elvis")
        .          .     19: repl := []byte("Elvis")
        .          .     20:
        .          .     21: b.ResetTimer()
        .          .     22:
        .   335.03MB     23: for i := 0; i < b.N; i++ {
        .          .     24:       output.Reset()
        .          .     25:       algOne(in, find, repl, &output)
        .          .     26: }
        .          .     27:}
        .          .     28:
(pprof) _

Since there is only a value in the cum column (the second column), the Benchmark function allocates nothing directly. All allocations happen from the function calls made inside the loop. You can see that the allocation numbers from these two list calls match.

We still don't know why the bytes.Buffer value is allocating. This is where the -gcflags "-m -m" switch on the go build command comes in handy. The profiler can only tell you which values escape; the build command can tell you why.

Compiler Reporting

Let's ask the compiler what decisions it makes with respect to escape analysis on this code:

$ go build -gcflags "-m -m"

This command produces a lot of output. We just need to search it for anything tagged stream.go:83, since stream.go is the file that contains this code and line 83 contains the construction of the bytes.Buffer value. The search turns up six lines:

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93

The first line for stream.go:83 is interesting.

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

It confirms that the bytes.Buffer value did not escape because it was passed up the call stack; it couldn't have been, since bytes.NewBuffer was never actually called. The code inside the function was inlined instead.

This is the piece of code I wrote:

83     input := bytes.NewBuffer(data)

Because the compiler chose to inline the bytes.NewBuffer call, the code I wrote is converted to this:

input := &bytes.Buffer{buf: data}

That means the algOne function constructs the bytes.Buffer value directly. So now the question is, what causes the value to escape from the algOne stack frame? The answer is in the other five lines of the report.

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93

These lines tell us that the code at line 93 is causing the escape: the input variable is being assigned to an interface value.

Interfaces

I don't remember assigning anything to an interface value in this code, but looking at line 93 makes it clear what is happening.

93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
94         output.Write(buf[:n])
95         return
96     }

The call to io.ReadFull causes the interface assignment. If you look at the definition of the io.ReadFull function, you can see how it accepts the input value through an interface type.

type Reader interface {
      Read(p []byte) (n int, err error)
}

func ReadFull(r Reader, buf []byte) (n int, err error) {
      return ReadAtLeast(r, buf, len(buf))
}

Passing the bytes.Buffer address down the call stack and storing it inside the Reader interface value is what causes the escape. Now we know there is a cost to using an interface: allocation and indirection. So if it's not clear how an interface makes the code better, you probably don't want to use one. Here are the guidelines I follow to validate the use of interfaces in my code.

Use an interface when:

  • users of the API need to provide an implementation detail.
  • APIs have multiple implementations they need to maintain internally.
  • parts of the API that can change have been identified and require decoupling.

Don't use an interface:

  • for the sake of using an interface.
  • to generalize an algorithm.
  • when users can declare their own interfaces.
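To make the cost concrete, here is a minimal sketch contrasting the two call styles (the function names are mine, not from the post). Both return the same bytes; building with go build -gcflags "-m -m" should show the interface version reporting an escape on input, though the exact report wording depends on your compiler version:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// readViaInterface passes the buffer to io.ReadFull, which takes an
// io.Reader. Storing the *bytes.Buffer in that interface argument is the
// "interface-converted" step the compiler reported above.
func readViaInterface(data []byte, n int) string {
	input := bytes.NewBuffer(data)
	buf := make([]byte, n)
	if _, err := io.ReadFull(input, buf); err != nil {
		return ""
	}
	return string(buf)
}

// readViaMethod calls the Read method directly on the concrete type, so no
// interface value is involved.
func readViaMethod(data []byte, n int) string {
	input := bytes.NewBuffer(data)
	buf := make([]byte, n)
	if m, err := input.Read(buf); err != nil || m < n {
		return ""
	}
	return string(buf)
}

func main() {
	data := []byte("elvis lives")
	fmt.Println(readViaInterface(data, 5)) // elvis
	fmt.Println(readViaMethod(data, 5))    // elvis
}
```

The observable behavior is identical; the difference only shows up in the escape-analysis report and in allocs/op under a benchmark, which is exactly the trade the guidelines above are weighing.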

Now we can ask ourselves: does this algorithm really need the io.ReadFull function? The answer is no, because the bytes.Buffer type has a method set we can use. Using methods against a value that a function already owns can prevent allocations.

Let's change the code to remove the io package and use the Read method directly against the input variable.

This change removes the need to import the io package, so to keep all the line numbers the same, I use the blank identifier on the io import. This allows the import to stay in the list.

 12 import (
 13     "bytes"
 14     "fmt"
 15     _ "io"
 16 )

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := input.Read(buf[:end]); err != nil || n < end {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := input.Read(buf[end:]); err != nil {
102
103             // Flush the rest of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := input.Read(buf[:end]); err != nil || n < end {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }

When we benchmark this change, we can see that the allocation of the bytes.Buffer value is gone.

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op

We also see a performance improvement of about 29%: the code went from 2570 ns/op to 1814 ns/op. With this solved, we can focus on the allocation of the backing array for the buf slice. If we use the profiler once more against the new profile data we just produced, we should be able to identify what causes the remaining allocation.

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) list algOne
Total: 7.50MB
ROUTINE ======================== .../memcpu.algOne in code/go/src/.../memcpu/stream.go
     11MB       11MB (flat, cum)   100% of Total
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
     11MB       11MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := input.Read(buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])

The only allocation left is on line 89, for the backing array of the slice.

Stack Frames

Why is the backing array for buf allocating? Let's run go build again with the -gcflags "-m -m" option and search for stream.go:89.

$ go build -gcflags "-m -m"
./stream.go:89: make([]byte, size) escapes to heap
./stream.go:89:   from make([]byte, size) (too large for stack) at ./stream.go:89

The report says the backing array is "too large for stack". This message is misleading: it's not that the array is too large, but that the compiler doesn't know its size at compile time.

Values can only be placed on the stack if the compiler knows their size at compile time, because the size of every function's stack frame is calculated at compile time. If the compiler doesn't know the size of a value, it goes on the heap.
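A self-contained way to observe this, assuming a standard gc toolchain, is testing.AllocsPerRun; the //go:noinline directives keep the compiler from turning the variable size into a constant through inlining. On recent compilers the constant-size version should report zero allocations, but exact counts can vary by compiler version:

```go
package main

import (
	"fmt"
	"testing"
)

var sink byte

// dynamicSize builds a slice whose length is not a compile-time constant,
// so the backing array cannot be placed in the stack frame.
//
//go:noinline
func dynamicSize(n int) {
	buf := make([]byte, n)
	sink = buf[0]
}

// constSize builds a slice whose length is a compile-time constant, so the
// backing array can live on the stack.
//
//go:noinline
func constSize() {
	buf := make([]byte, 5)
	sink = buf[0]
}

func main() {
	d := testing.AllocsPerRun(1000, func() { dynamicSize(5) })
	c := testing.AllocsPerRun(1000, constSize)
	fmt.Printf("dynamic: %.0f allocs/op, constant: %.0f allocs/op\n", d, c)
}
```

The package-level sink keeps the compiler from eliminating the slice as dead code, so the only variable between the two functions is whether the length is known at compile time.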

To show this, let's temporarily hard-code the size of the slice to 5 and run the benchmark again.

89     buf := make([]byte, 5)

This time when we run the benchmark, the allocations are gone.

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8    	3000000 	     1720 ns/op         0 B/op  	      0 allocs/op

If you look at the compiler report once more, you will see that nothing escapes.

$ go build -gcflags "-m -m"
./stream.go:83: algOne &bytes.Buffer literal does not escape
./stream.go:89: algOne make([]byte, 5) does not escape

Obviously we can't hard-code the size of the slice, so we will have to live with one allocation for this algorithm.

Allocations and Performance

Compare the performance gains we achieved with each refactoring:

Before any optimization
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op

Removing the bytes.Buffer allocation
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op

Removing the backing array allocation
BenchmarkAlgorithmOne-8    	3000000 	     1720 ns/op         0 B/op  	      0 allocs/op

We got about a 29% increase in performance by removing the bytes.Buffer allocation, and about a 33% speedup once all the allocations were removed. Allocations are one place where application performance can suffer.

Conclusion

Go has some amazing tooling that lets you understand the decisions the compiler makes with respect to escape analysis. Based on this information, you can refactor code to keep values on the stack that don't need to be on the heap. You are not going to write zero-allocation software, but you want to minimize allocations when possible.

That being said, never write code with performance as your first priority, because you don't want to be guessing about performance. Write code that optimizes for correctness first; that means focusing on integrity, readability and simplicity. Once you have a working program, determine whether it is fast enough. If it isn't, use the tooling the language provides to find and fix your performance issues.
