Go: Sequential? Concurrent? Parallel?
Jan 2, 2021 12:00 · 7609 words · 16 minute read
When I’m solving a problem, especially if it’s a new problem, I don’t initially think about whether concurrency is a good fit or not. I look for a sequential solution first and make sure that it’s working. Then after readability and technical reviews, I will begin to ask the question if concurrency is reasonable and practical. Sometimes it’s obvious that concurrency is a good fit and other times it’s not so clear.
In the first part of this series, I explained the mechanics and semantics of the OS scheduler that I believe are important if you plan on writing multithreaded code. In the second part, I explained the semantics of the Go scheduler that I believe are important for understanding how to write concurrent code in Go. In this post, I will begin to bring the mechanics and semantics of the OS and Go schedulers together to provide a deeper understanding on what concurrency is and isn’t.
What is Concurrency
Concurrency means “out of order” execution. Taking a set of instructions that would otherwise be executed in sequence and finding a way to execute them out of order and still produce the same result. For the problem in front of you, it has to be obvious that out of order execution would add value. When I say value, I mean add enough of a performance gain for the complexity cost. Depending on your problem, out of order execution may not be possible or even make sense.
It’s also important to understand that concurrency is not the same as parallelism. Parallelism means executing two or more instructions at the same time. This is a different concept from concurrency. Parallelism is only possible when you have at least 2 operating system (OS) and hardware threads available to you and you have at least 2 Goroutines, each executing instructions independently on each OS/hardware thread.
In figure 1, you see a diagram of two logical processors (P) each with their independent OS thread (M) attached to an independent hardware thread (Core) on the machine. You can see two Goroutines (G1 and G2) are executing in parallel, executing their instructions on their respective OS/hardware thread at the same time. Within each logical processor, three Goroutines are taking turns sharing their respective OS thread. All these Goroutines are running concurrently, executing their instructions in no particular order and sharing time on the OS thread.
Here’s the rub, sometimes leveraging concurrency without parallelism can actually slow down your throughput. What’s also interesting is, sometimes leveraging concurrency with parallelism doesn’t give you a bigger performance gain than you might otherwise think you can achieve.
Workloads
How do you know when out of order execution may be possible or make sense? Understanding the type of workload your problem is handling is a great place to start. There are two types of workloads that are important to understand when thinking about concurrency.
- CPU-Bound: This is a workload that never creates a situation where Goroutines naturally move in and out of waiting states. This is work that is constantly making calculations. A Thread calculating Pi to the Nth digit would be CPU-Bound.
- IO-Bound: This is a workload that causes Goroutines to naturally enter into waiting states. This is work that consists of requesting access to a resource over the network, making system calls into the operating system, or waiting for an event to occur. A Goroutine that needs to read a file would be IO-Bound. I would include synchronization events (mutexes, atomic operations) that cause the Goroutine to wait as part of this category.
With CPU-Bound workloads you need parallelism to leverage concurrency. A single OS/hardware thread handling multiple Goroutines is not efficient since the Goroutines are not moving in and out of waiting states as part of their workload. Having more Goroutines than there are OS/hardware threads can slow down workload execution because of the latency cost (the time it takes) of moving Goroutines on and off the OS thread. The context switch is creating a “Stop The World” event for your workload since none of your workload is being executed during the switch when it otherwise could be.
With IO-Bound workloads you don’t need parallelism to use concurrency. A single OS/hardware thread can handle multiple Goroutines with efficiency since the Goroutines are naturally moving in and out of waiting states as part of their workload. Having more Goroutines than there are OS/hardware threads can speed up workload execution because the latency cost of moving Goroutines on and off the OS thread is not creating a “Stop The World” event. Your workload is naturally stopped and this allows a different Goroutine to leverage the same OS/hardware thread efficiently instead of letting the OS/hardware thread sit idle.
How do you know how many Goroutines per hardware thread provides the best throughput? Too few Goroutines and you have more idle time. Too many Goroutines and you have more context switch latency time. This is something for you to think about but beyond the scope of this particular post.
Adding Numbers
We don’t need complex code to visualize and understand these semantics. Look at the following function named add that sums a collection of integers.
https://play.golang.org/p/r9LdqUsEzEz
36 func add(numbers []int) int {
37 var v int
38 for _, n := range numbers {
39 v += n
40 }
41 return v
42 }
In listing 1 on line 36, a function named add is declared that takes a collection of integers and returns the sum of the collection. It starts on line 37 with the declaration of the v variable to contain the sum. Then on line 38, the function traverses the collection linearly and each number is added to the current sum on line 39. Finally on line 41, the function returns the final sum back to the caller.
Question: is the add function a workload that is suitable for out of order execution? I believe the answer is yes. The collection of integers could be broken up into smaller lists and those lists could be processed concurrently. Once all the smaller lists are summed, the set of sums could be added together to produce the same answer as the sequential version.
However, there is another question that comes to mind. How many smaller lists should be created and processed independently to get the best throughput? To answer this question you must know what kind of workload add is performing. The add function is performing a CPU-Bound workload because the algorithm is performing pure math and nothing it does would cause the Goroutine to enter into a natural waiting state. This means using one Goroutine per OS/hardware thread is all that is needed for good throughput.
Listing 2 below is my concurrent version of add.
Note: There are several ways and options you can take when writing a concurrent version of add. Don’t get hung up on my particular implementation at this time. If you have a more readable version that performs the same or better I would love for you to share it.
https://play.golang.org/p/r9LdqUsEzEz
44 func addConcurrent(goroutines int, numbers []int) int {
45 var v int64
46 totalNumbers := len(numbers)
47 lastGoroutine := goroutines - 1
48 stride := totalNumbers / goroutines
49
50 var wg sync.WaitGroup
51 wg.Add(goroutines)
52
53 for g := 0; g < goroutines; g++ {
54 go func(g int) {
55 start := g * stride
56 end := start + stride
57 if g == lastGoroutine {
58 end = totalNumbers
59 }
60
61 var lv int
62 for _, n := range numbers[start:end] {
63 lv += n
64 }
65
66 atomic.AddInt64(&v, int64(lv))
67 wg.Done()
68 }(g)
69 }
70
71 wg.Wait()
72
73 return int(v)
74 }
In Listing 2, the addConcurrent function is presented which is the concurrent version of the add function. The concurrent version uses 26 lines of code as opposed to the 5 lines of code for the non-concurrent version. There is a lot of code so I will only highlight the important lines to understand.
Line 48: Each Goroutine will get their own unique but smaller list of numbers to add. The size of the list is calculated by taking the size of the collection and dividing it by the number of Goroutines.
Line 53: The pool of Goroutines are created to perform the adding work.
Line 57-59: The last Goroutine will add the remaining list of numbers which may be greater than the other Goroutines.
Line 66: The sum of the smaller lists are summed together into a final sum.
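Stripping the line numbers, the two listings can be combined into a small self-contained program that checks that both versions agree on the answer (summing 1 through 1,000 should give 500500):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

func add(numbers []int) int {
	var v int
	for _, n := range numbers {
		v += n
	}
	return v
}

func addConcurrent(goroutines int, numbers []int) int {
	var v int64
	totalNumbers := len(numbers)
	lastGoroutine := goroutines - 1
	stride := totalNumbers / goroutines

	var wg sync.WaitGroup
	wg.Add(goroutines)

	for g := 0; g < goroutines; g++ {
		go func(g int) {
			start := g * stride
			end := start + stride
			if g == lastGoroutine {
				end = totalNumbers
			}
			var lv int
			for _, n := range numbers[start:end] {
				lv += n
			}
			atomic.AddInt64(&v, int64(lv))
			wg.Done()
		}(g)
	}

	wg.Wait()
	return int(v)
}

func main() {
	numbers := make([]int, 1000)
	for i := range numbers {
		numbers[i] = i + 1 // 1..1000
	}
	fmt.Println(add(numbers))                             // 500500
	fmt.Println(addConcurrent(runtime.NumCPU(), numbers)) // 500500
}
```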
The concurrent version is definitely more complex than the sequential version but is the complexity worth it? The best way to answer that question is to create a benchmark. For these benchmarks I have used a collection of 10 million numbers with the garbage collector turned off. There is a sequential version that uses the add function and a concurrent version that uses the addConcurrent function.
func BenchmarkSequential(b *testing.B) {
for i := 0; i < b.N; i++ {
add(numbers)
}
}
func BenchmarkConcurrent(b *testing.B) {
for i := 0; i < b.N; i++ {
addConcurrent(runtime.NumCPU(), numbers)
}
}
Listing 3 shows the benchmark functions. Here are the results when only a single OS/hardware thread is available for all Goroutines. The sequential version is using 1 Goroutine and the concurrent version is using runtime.NumCPU or 8 Goroutines on my machine. In this case, the concurrent version is leveraging concurrency without parallelism.
10 Million Numbers using 8 goroutines with 1 core
2.9 GHz Intel 4 Core i7
Concurrency WITHOUT Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -cpu 1 -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/cpu-bound
BenchmarkSequential 1000 5720764 ns/op : ~10% Faster
BenchmarkConcurrent 1000 6387344 ns/op
BenchmarkSequentialAgain 1000 5614666 ns/op : ~13% Faster
Note: Running a benchmark on your local machine is complicated. There are so many factors that can cause your benchmarks to be inaccurate. Make sure your machine is as idle as possible and run benchmarks a few times. You want to make sure you see consistency in the results. Having the benchmark run twice by the testing tool is giving this benchmark the most consistent results.
The benchmark in listing 4 shows that the Sequential version is approximately 10 to 13 percent faster than the Concurrent when only a single OS/hardware thread is available for all Goroutines. This is what I would have expected since the concurrent version has the overhead of context switches on that single OS thread and the management of the Goroutines.
Here are the results when an individual OS/hardware thread is available for each Goroutine. The sequential version is using 1 Goroutine and the concurrent version is using runtime.NumCPU or 8 Goroutines on my machine. In this case, the concurrent version is leveraging concurrency with parallelism.
10 Million Numbers using 8 goroutines with 8 cores
2.9 GHz Intel 4 Core i7
Concurrency WITH Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -cpu 8 -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/cpu-bound
BenchmarkSequential-8 1000 5910799 ns/op
BenchmarkConcurrent-8 2000 3362643 ns/op : ~43% Faster
BenchmarkSequentialAgain-8 1000 5933444 ns/op
BenchmarkConcurrentAgain-8 2000 3477253 ns/op : ~41% Faster
The benchmark in listing 5 shows that the concurrent version is approximately 41 to 43 percent faster than the sequential version when an individual OS/hardware thread is available for each Goroutine. This is what I would have expected since all the Goroutines are now running in parallel, eight Goroutines executing their concurrent work at the same time.
Sorting
It’s important to understand that not all CPU-bound workloads are suitable for concurrency. This is primarily true when it’s very expensive to either break work up, and/or combine all the results. An example of this can be seen with the sorting algorithm called Bubble sort. Look at the following code that implements Bubble sort in Go.
https://play.golang.org/p/S0Us1wYBqG6
01 package main
02
03 import "fmt"
04
05 func bubbleSort(numbers []int) {
06 n := len(numbers)
07 for i := 0; i < n; i++ {
08 if !sweep(numbers, i) {
09 return
10 }
11 }
12 }
13
14 func sweep(numbers []int, currentPass int) bool {
15 var idx int
16 idxNext := idx + 1
17 n := len(numbers)
18 var swap bool
19
20 for idxNext < (n - currentPass) {
21 a := numbers[idx]
22 b := numbers[idxNext]
23 if a > b {
24 numbers[idx] = b
25 numbers[idxNext] = a
26 swap = true
27 }
28 idx++
29 idxNext = idx + 1
30 }
31 return swap
32 }
33
34 func main() {
35 org := []int{1, 3, 2, 4, 8, 6, 7, 2, 3, 0}
36 fmt.Println(org)
37
38 bubbleSort(org)
39 fmt.Println(org)
40 }
In listing 6, there is an example of Bubble sort written in Go. This sorting algorithm sweeps through a collection of integers swapping values on every pass. Depending on the ordering of the list, it may require multiple passes through the collection before everything is sorted.
Question: is the bubbleSort function a workload that is suitable for out of order execution? I believe the answer is no. The collection of integers could be broken up into smaller lists and those lists could be sorted concurrently. However, after all the concurrent work is done there is no efficient way to sort the smaller lists together. Here is an example of a concurrent version of Bubble sort.
01 func bubbleSortConcurrent(goroutines int, numbers []int) {
02 totalNumbers := len(numbers)
03 lastGoroutine := goroutines - 1
04 stride := totalNumbers / goroutines
05
06 var wg sync.WaitGroup
07 wg.Add(goroutines)
08
09 for g := 0; g < goroutines; g++ {
10 go func(g int) {
11 start := g * stride
12 end := start + stride
13 if g == lastGoroutine {
14 end = totalNumbers
15 }
16
17 bubbleSort(numbers[start:end])
18 wg.Done()
19 }(g)
20 }
21
22 wg.Wait()
23
24 // Ugh, we have to sort the entire list again.
25 bubbleSort(numbers)
26 }
In Listing 8, the bubbleSortConcurrent function is presented which is a concurrent version of the bubbleSort function. It uses multiple Goroutines to sort portions of the list concurrently. However, what you are left with is a list of sorted values in chunks. Given a list of 36 numbers, split in groups of 12, this would be the resulting list if the entire list is not sorted once more on line 25.
Before:
25 51 15 57 87 10 10 85 90 32 98 53
91 82 84 97 67 37 71 94 26 2 81 79
66 70 93 86 19 81 52 75 85 10 87 49
After:
10 10 15 25 32 51 53 57 85 87 90 98
2 26 37 67 71 79 81 82 84 91 94 97
10 19 49 52 66 70 75 81 85 86 87 93
Since the nature of Bubble sort is to sweep through the list, the call to bubbleSort on line 25 will negate any potential gains from using concurrency. With Bubble sort, there is no performance gain by using concurrency.
Reading Files
Two CPU-Bound workloads have been presented, but what about an IO-Bound workload? Are the semantics different when Goroutines are naturally moving in and out of waiting states? Look at an IO-Bound workload that reads files and performs a text search.
This first version is a sequential version of a function called find.
https://play.golang.org/p/8gFe5F8zweN
42 func find(topic string, docs []string) int {
43 var found int
44 for _, doc := range docs {
45 items, err := read(doc)
46 if err != nil {
47 continue
48 }
49 for _, item := range items {
50 if strings.Contains(item.Description, topic) {
51 found++
52 }
53 }
54 }
55 return found
56 }
In listing 10, you see the sequential version of the find function. On line 43, a variable named found is declared to maintain a count for the number of times the specified topic is found inside a given document. Then on line 44, the documents are iterated over and each document is read on line 45 using the read function. Finally on lines 49-53, the Contains function from the strings package is used to check if the topic can be found inside the collection of items read from the document. If the topic is found, the found variable is incremented by one.
Here is the implementation of the read function that is being called by find.
https://play.golang.org/p/8gFe5F8zweN
33 func read(doc string) ([]item, error) {
34 time.Sleep(time.Millisecond) // Simulate blocking disk read.
35 var d document
36 if err := xml.Unmarshal([]byte(file), &d); err != nil {
37 return nil, err
38 }
39 return d.Channel.Items, nil
40 }
The read function in listing 11 starts with a time.Sleep call for one millisecond. This call is being used to mock the latency that could be produced if we performed an actual system call to read the document from disk. The consistency of this latency is important for accurately measuring the performance of the sequential version of find against the concurrent version. Then on lines 35-38, the mock xml document stored in the global variable file is unmarshaled into a struct value for processing. Finally, a collection of items is returned back to the caller on line 39.
With the sequential version in place, here is the concurrent version.
Note: There are several ways and options you can take when writing a concurrent version of find. Don’t get hung up on my particular implementation at this time. If you have a more readable version that performs the same or better I would love for you to share it.
https://play.golang.org/p/8gFe5F8zweN
58 func findConcurrent(goroutines int, topic string, docs []string) int {
59 var found int64
60
61 ch := make(chan string, len(docs))
62 for _, doc := range docs {
63 ch <- doc
64 }
65 close(ch)
66
67 var wg sync.WaitGroup
68 wg.Add(goroutines)
69
70 for g := 0; g < goroutines; g++ {
71 go func() {
72 var lFound int64
73 for doc := range ch {
74 items, err := read(doc)
75 if err != nil {
76 continue
77 }
78 for _, item := range items {
79 if strings.Contains(item.Description, topic) {
80 lFound++
81 }
82 }
83 }
84 atomic.AddInt64(&found, lFound)
85 wg.Done()
86 }()
87 }
88
89 wg.Wait()
90
91 return int(found)
92 }
In Listing 12, the findConcurrent function is presented which is the concurrent version of the find function. The concurrent version uses 30 lines of code as opposed to the 13 lines of code for the non-concurrent version. My goal in implementing the concurrent version was to control the number of Goroutines that are used to process the unknown number of documents. A pooling pattern where a channel is used to feed the pool of Goroutines was my choice.
There is a lot of code so I will only highlight the important lines to understand.
Lines 61-64: A channel is created and populated with all the documents to process.
Line 65: The channel is closed so the pool of Goroutines naturally terminate when all the documents are processed.
Line 70: The pool of Goroutines is created.
Line 73-83: Each Goroutine in the pool receives a document from the channel, reads the document into memory and checks the contents for the topic. When there is a match, the local found variable is incremented.
Line 84: The sum of the individual Goroutine counts are summed together into a final count.
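The pooling pattern itself can be reduced to a few lines. This sketch (with an integer sum standing in for the read-and-search work) shows why closing the channel on line 65 is enough for the pool to drain the work and terminate:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// pooled loads a buffered channel with all the work, closes it, and lets
// a fixed pool of goroutines drain it. Ranging over a closed channel ends
// once the buffer is empty, so each goroutine terminates naturally.
func pooled(goroutines int, jobs []int) int {
	ch := make(chan int, len(jobs))
	for _, j := range jobs {
		ch <- j
	}
	close(ch)

	var total int64
	var wg sync.WaitGroup
	wg.Add(goroutines)
	for g := 0; g < goroutines; g++ {
		go func() {
			defer wg.Done()
			var local int64
			for j := range ch {
				local += int64(j) // stand-in for the read/search work
			}
			atomic.AddInt64(&total, local)
		}()
	}
	wg.Wait()
	return int(total)
}

func main() {
	fmt.Println(pooled(4, []int{1, 2, 3, 4, 5})) // 15
}
```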
The concurrent version is definitely more complex than the sequential version but is the complexity worth it? The best way to answer this question again is to create a benchmark. For these benchmarks I have used a collection of 1 thousand documents with the garbage collector turned off. There is a sequential version that uses the find function and a concurrent version that uses the findConcurrent function.
func BenchmarkSequential(b *testing.B) {
for i := 0; i < b.N; i++ {
find("test", docs)
}
}
func BenchmarkConcurrent(b *testing.B) {
for i := 0; i < b.N; i++ {
findConcurrent(runtime.NumCPU(), "test", docs)
}
}
Listing 13 shows the benchmark functions. Here are the results when only a single OS/hardware thread is available for all Goroutines. The sequential version is using 1 Goroutine and the concurrent version is using runtime.NumCPU or 8 Goroutines on my machine. In this case, the concurrent version is leveraging concurrency without parallelism.
10 Thousand Documents using 8 goroutines with 1 core
2.9 GHz Intel 4 Core i7
Concurrency WITHOUT Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -cpu 1 -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/io-bound
BenchmarkSequential 3 1483458120 ns/op
BenchmarkConcurrent 20 188941855 ns/op : ~87% Faster
BenchmarkSequentialAgain 2 1502682536 ns/op
BenchmarkConcurrentAgain 20 184037843 ns/op : ~88% Faster
The benchmark in listing 14 shows that the concurrent version is approximately 87 to 88 percent faster than the sequential version when only a single OS/hardware thread is available for all Goroutines. This is what I would have expected since all the Goroutines are efficiently sharing the single OS/hardware thread. The natural context switch happening for each Goroutine on the read call is allowing more work to get done over time on the single OS/hardware thread.
Here is the benchmark when using concurrency with parallelism.
10 Thousand Documents using 8 goroutines with 8 cores
2.9 GHz Intel 4 Core i7
Concurrency WITH Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/io-bound
BenchmarkSequential-8 3 1490947198 ns/op
BenchmarkConcurrent-8 20 187382200 ns/op : ~88% Faster
BenchmarkSequentialAgain-8 3 1416126029 ns/op
BenchmarkConcurrentAgain-8 20 185965460 ns/op : ~87% Faster
The benchmark in listing 15 shows that bringing in the extra OS/hardware threads doesn't provide any better performance. Each Goroutine spends most of its time blocked in the simulated disk read, so there is little CPU work for the extra parallelism to speed up.
Conclusion
The goal of this post was to provide guidance on the semantics you must consider to determine if a workload is suitable for using concurrency. I tried to provide examples of different types of algorithms and workloads so you could see the differences in semantics and the different engineering decisions that needed to be considered.
You can clearly see that with IO-Bound workloads parallelism was not needed to get a big bump in performance. Which is the opposite of what you saw with the CPU-Bound work. When it comes to an algorithm like Bubble sort, the use of concurrency would add complexity without any real benefit of performance. It’s important to determine if your workload is suitable for concurrency and then identify the type of workload you have to use the right semantics.