LK4D4 Blog



Posts in “Go”

How Do They Do It: Timers in Go

This article covers the internal implementation of timers in Go. Note that there are a lot of links to the Go repo in this article; I recommend following them to understand the material better.

This article was part of the Go Advent 2016 series: https://blog.gopheracademy.com/advent-2016/go-timers/

Timers

Timers in Go simply do something after a period of time. They are located in the standard package time. In particular, the timers are time.Timer, time.Ticker, and the less obvious timer behind time.Sleep. It’s not clear from the documentation how timers work exactly. Some people think that each timer spawns its own goroutine which exists until the timer’s deadline is reached, because that’s how we’d implement timers in a “naive” way in Go. We can check that assumption with a small program:

package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"time"
)

func main() {
	debug.SetTraceback("system")
	if len(os.Args) == 1 {
		panic("before timers")
	}
	for i := 0; i < 10000; i++ {
		time.AfterFunc(time.Duration(5*time.Second), func() {
			fmt.Println("Hello!")
		})
	}
	panic("after timers")
}

It prints all goroutine traces before the timers are spawned if run without arguments, and after they are spawned if any argument is passed. We need those shady panics because otherwise there is no easy way to see runtime goroutines - they’re excluded from runtime.NumGoroutine and runtime.Stack, so the only way to see them is a crash (refer to golang/go#9791 for the reasons). Let’s see how many goroutines Go spawns before spawning any timers:

go run afterfunc.go 2>&1 | grep "^goroutine" | wc -l
4

and after spawning 10k timers:

go run afterfunc.go after 2>&1 | grep "^goroutine" | wc -l
5

Whoa! It’s only one goroutine, in my case its trace looks like:

goroutine 5 [syscall]:
runtime.notetsleepg(0x5014b8, 0x12a043838, 0x0)
        /home/moroz/go/src/runtime/lock_futex.go:205 +0x42 fp=0xc42002bf40 sp=0xc42002bf10
runtime.timerproc()
        /home/moroz/go/src/runtime/time.go:209 +0x2ec fp=0xc42002bfc0 sp=0xc42002bf40
runtime.goexit()
        /home/moroz/go/src/runtime/asm_amd64.s:2160 +0x1 fp=0xc42002bfc8 sp=0xc42002bfc0
created by runtime.addtimerLocked
        /home/moroz/go/src/runtime/time.go:116 +0xed

Let’s take a look at why there’s only one additional goroutine.

runtime.timer

All timers are based on the same data structure - runtime.timer. To add a new timer, you instantiate runtime.timer and pass it to the function runtime.startTimer. Here is an example from the time package:

func NewTimer(d Duration) *Timer {
    c := make(chan Time, 1)
    t := &Timer{
        C: c,
        r: runtimeTimer{
            when: when(d),
            f:    sendTime,
            arg:  c,
        },
    }
    startTimer(&t.r)
    return t
}

So, here we’re converting duration to an exact timestamp when timer should call function f with argument c. There are three types of function f used in package time:

Note that each new timer takes at least 40 bytes of memory. A large number of timers can significantly increase the memory footprint of your program.
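Those fields can be sketched in user code. Below is a simplified mirror of the runtime’s struct (not the real definition, which lives in src/runtime/time.go), just to show where those bytes go; the field names match the ones used in the timerproc code later:

```go
package main

import (
	"fmt"
	"unsafe"
)

// timer approximately mirrors the fields of runtime.timer as of Go 1.7;
// the real definition lives in src/runtime/time.go.
type timer struct {
	i      int                        // index in the timers heap; -1 when removed
	when   int64                      // absolute deadline in runtime nanotime
	period int64                      // non-zero for tickers: re-fire interval
	f      func(interface{}, uintptr) // called by timerproc when `when` passes
	arg    interface{}                // argument for f (e.g. the Timer's channel)
	seq    uintptr
}

// timerSize reports the size of the sketch struct.
func timerSize() uintptr {
	return unsafe.Sizeof(timer{})
}

func main() {
	// On 64-bit platforms this lands comfortably above 40 bytes.
	fmt.Println(timerSize())
}
```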

So, now we understand what timers look like in the runtime and what they are supposed to do. Now let’s see how the runtime stores timers and calls functions when it’s time to call them.

runtime.timers

runtime.timers is just a heap data structure. A heap is very useful when you want to repeatedly find the extremum (minimum or maximum) among some elements. In our case the extremum is the timer whose when is closest to the current time. Very convenient, isn’t it? So, let’s see the worst-case algorithmic complexity of the operations on timers:

  • add a new timer - O(log(n))
  • delete a timer - O(log(n))
  • spawn a timer’s function - O(log(n))

So, if you have 1 million timers, the number of heap operations will usually be less than 1000 (log(1e6) ≈ 20, but spawning can require multiple minimum deletions, because multiple timers can reach their deadlines at about the same time). It’s very fast, and all the work happens in a separate goroutine, so it doesn’t block. The siftupTimer and siftdownTimer functions are used for maintaining the heap properties. But data structures don’t work on their own; something has to use them. In our case it’s just one goroutine running the function timerproc. It’s spawned on first timer start.
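As a toy model of those operations, here is a tiny min-heap of deadlines built on container/heap. The runtime hand-rolls its own siftupTimer/siftdownTimer instead of using this package, but the ordering and the O(log n) costs are the same:

```go
package main

import (
	"container/heap"
	"fmt"
)

// timerHeap is a toy min-heap of deadlines: like runtime.timers, the entry
// with the smallest `when` sits at index 0.
type timerHeap []int64

func (h timerHeap) Len() int            { return len(h) }
func (h timerHeap) Less(i, j int) bool  { return h[i] < h[j] }
func (h timerHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *timerHeap) Push(x interface{}) { *h = append(*h, x.(int64)) }
func (h *timerHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// addTimer inserts a deadline in O(log n), like addtimerLocked + siftupTimer.
func addTimer(h *timerHeap, when int64) { heap.Push(h, when) }

// nextDeadline removes and returns the earliest deadline in O(log n),
// like timerproc removing timers.t[0] and calling siftdownTimer.
func nextDeadline(h *timerHeap) int64 { return heap.Pop(h).(int64) }

func main() {
	h := &timerHeap{}
	for _, when := range []int64{500, 100, 300} {
		addTimer(h, when)
	}
	fmt.Println(nextDeadline(h)) // prints 100: the earliest deadline comes out first
}
```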

runtime.timerproc

It’s kinda hard to describe what’s going on without source code, so this section will be in the form of commented Go code. Code is a direct copy from the src/runtime/time.go file with added comments.

// Add a timer to the heap and start or kick the timerproc if the new timer is
// earlier than any of the others.
func addtimerLocked(t *timer) {
	// when must never be negative; otherwise timerproc will overflow
	// during its delta calculation and never expire other runtime·timers.
	if t.when < 0 {
		t.when = 1<<63 - 1
	}
	t.i = len(timers.t)
	timers.t = append(timers.t, t)
	// maintain heap invariant
	siftupTimer(t.i)
	// if the new timer ended up on top of the heap, it is the new
	// earliest deadline
	if t.i == 0 {
		if timers.sleeping {
			// wake up sleeping goroutine, put to sleep with notetsleepg in timerproc()
			timers.sleeping = false
			notewakeup(&timers.waitnote)
		}
		if timers.rescheduling {
			// run parked goroutine, put to sleep with goparkunlock in timerproc()
			timers.rescheduling = false
			goready(timers.gp, 0)
		}
	}
	if !timers.created {
		// run timerproc() goroutine only once
		timers.created = true
		go timerproc()
	}
}

// Timerproc runs the time-driven events.
// It sleeps until the next event in the timers heap.
// If addtimer inserts a new earlier event, addtimerLocked wakes timerproc early.
func timerproc() {
	// set timer goroutine
	timers.gp = getg()
	// forever loop
	for {
		lock(&timers.lock)
		// mark goroutine not sleeping
		timers.sleeping = false
		now := nanotime()
		delta := int64(-1)
		// iterate over timers in heap starting from [0]
		for {
			// there are no more timers, exit iterating loop
			if len(timers.t) == 0 {
				delta = -1
				break
			}
			t := timers.t[0]
			delta = t.when - now
			if delta > 0 {
				break
			}
			// t.period > 0 means it's a ticker, so adjust when and move it
			// down in the heap to execute it again after t.period.
			if t.period > 0 {
				// leave in heap but adjust next time to fire
				t.when += t.period * (1 + -delta/t.period)
				siftdownTimer(0)
			} else {
				// remove from heap
				// this is just removing from heap operation:
				// - swap first(extremum) with last
				// - set last to nil
				// - maintain heap: move first to its true place with siftdownTimer.
				last := len(timers.t) - 1
				if last > 0 {
					timers.t[0] = timers.t[last]
					timers.t[0].i = 0
				}
				timers.t[last] = nil
				timers.t = timers.t[:last]
				if last > 0 {
					siftdownTimer(0)
				}
				// set i to -1, so concurrent deltimer won't do anything to
				// heap.
				t.i = -1 // mark as removed
			}
			f := t.f
			arg := t.arg
			seq := t.seq
			unlock(&timers.lock)
			if raceenabled {
				raceacquire(unsafe.Pointer(t))
			}
			// call timer function without lock
			f(arg, seq)
			lock(&timers.lock)
		}
		// if delta < 0, the heap is empty: set "rescheduling" and park the
		// timer goroutine. It will sleep here until the "goready" call in
		// addtimerLocked.
		if delta < 0 || faketime > 0 {
			// No timers left - put goroutine to sleep.
			timers.rescheduling = true
			goparkunlock(&timers.lock, "timer goroutine (idle)", traceEvGoBlock, 1)
			continue
		}
		// At least one timer pending. Sleep until then.
		// If we have some timers in heap, we're sleeping until it's time to
		// spawn soonest of them. notetsleepg will sleep for `delta` period or
		// until notewakeup in addtimerLocked.
		// notetsleepg fills timers.waitnote structure and put goroutine to sleep for some time.
		// timers.waitnote can be used to wakeup this goroutine with notewakeup.
		timers.sleeping = true
		noteclear(&timers.waitnote)
		unlock(&timers.lock)
		notetsleepg(&timers.waitnote, delta)
	}
}

There are two variables which I think deserve explanation: rescheduling and sleeping. They both indicate that the goroutine was put to sleep, but different synchronization mechanisms are used. Let’s discuss them.

sleeping is set when all “current” timers are processed, but there are more to fire in the future. It uses OS-based synchronization: syscalls put the goroutine to sleep and wake it up, which can tie up an OS thread while sleeping. It uses the note structure and the following functions for synchronization:

  • noteclear - resets the note state.
  • notetsleepg - puts the goroutine to sleep until notewakeup is called or some period of time passes (for timers, the time until the next timer fires). This function fills timers.waitnote with a “pointer to the timer goroutine”.
  • notewakeup - wakes up the goroutine which called notetsleepg.

notewakeup might be called in addtimerLocked if the new timer is “earlier” than the previous “earliest” timer.

rescheduling is set when there are no timers in our heap, so there is nothing to do. It uses the Go scheduler to put the goroutine to sleep with the function goparkunlock. Unlike notetsleepg, it does not consume any OS resources, but it also does not support a “wakeup timeout”, so it can’t replace notetsleepg in the sleeping branch. The goready function is used to wake the goroutine up when a new timer is added with addtimerLocked.

Conclusion

We learned how Go timers work “under the hood” - the runtime neither uses one goroutine per timer, nor are timers “free” to use. It’s important to understand how things work to avoid premature optimization. Also, we learned that it’s quite easy to read runtime code, and you shouldn’t be afraid to do so. I hope you enjoyed this read and will share it with your fellow Gophers.

Dec 19, 2016

Go Benchmarks

Benchmarks

Benchmarks are tests for performance. It’s pretty useful to have them in a project and compare results from commit to commit. Go has very good tooling for writing and executing benchmarks. In this article I’ll show how to use the package testing for writing benchmarks.

How to write benchmark

It’s pretty easy in Go. Here is a simple benchmark:

func BenchmarkSample(b *testing.B) {
    for i := 0; i < b.N; i++ {
        if x := fmt.Sprintf("%d", 42); x != "42" {
            b.Fatalf("Unexpected string: %s", x)
        }
    }
}

Save this code to bench_test.go and run go test -bench=. bench_test.go. You’ll see something like this:

testing: warning: no tests to run
PASS
BenchmarkSample 10000000               206 ns/op
ok      command-line-arguments  2.274s

We see here that one iteration takes 206 nanoseconds. That was easy, indeed. There are a couple more things to know about benchmarks in Go, though.

What can you benchmark?

By default go test -bench=. measures only the speed of your code; however, you can add the flag -benchmem, which will also measure memory consumption and allocation counts. It’ll look like:

PASS
BenchmarkSample 10000000               208 ns/op              32 B/op          2 allocs/op

Here we have bytes per operation and allocations per operation - pretty useful information, if you ask me. You can also enable those reports per-benchmark with the b.ReportAllocs() method. But that’s not all: you can also specify the throughput of one operation with the b.SetBytes(n int64) method. For example:

func BenchmarkSample(b *testing.B) {
    b.SetBytes(2)
    for i := 0; i < b.N; i++ {
        if x := fmt.Sprintf("%d", 42); x != "42" {
            b.Fatalf("Unexpected string: %s", x)
        }
    }
}

Now output will be:

testing: warning: no tests to run
PASS
BenchmarkSample  5000000               324 ns/op           6.17 MB/s          32 B/op          2 allocs/op
ok      command-line-arguments  1.999s

You can now see the throughput column, which is 6.17 MB/s in my case.

Benchmark setup

What if you need to prepare your operation for each iteration? You definitely don’t want to include the setup time in the benchmark result. I wrote a very simple Set data structure for benchmarking:

type Set struct {
    set map[interface{}]struct{}
    mu  sync.Mutex
}

func (s *Set) Add(x interface{}) {
    s.mu.Lock()
    s.set[x] = struct{}{}
    s.mu.Unlock()
}

func (s *Set) Delete(x interface{}) {
    s.mu.Lock()
    delete(s.set, x)
    s.mu.Unlock()
}
and a benchmark for its Delete method:
func BenchmarkSetDelete(b *testing.B) {
    var testSet []string
    for i := 0; i < 1024; i++ {
        testSet = append(testSet, strconv.Itoa(i))
    }
    for i := 0; i < b.N; i++ {
        set := Set{set: make(map[interface{}]struct{})}
        for _, elem := range testSet {
            set.Add(elem)
        }
        for _, elem := range testSet {
            set.Delete(elem)
        }
    }
}

Here we have a couple of problems:

  • the time and allocations of the testSet creation are included in the first iteration (which isn’t a big problem here, because there will be many iterations).
  • the time and allocations of Add-ing elements to the set are included in each iteration

For such cases we have b.ResetTimer(), b.StopTimer() and b.StartTimer(). Here are those methods used in the same benchmark:

func BenchmarkSetDelete(b *testing.B) {
    var testSet []string
    for i := 0; i < 1024; i++ {
        testSet = append(testSet, strconv.Itoa(i))
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        b.StopTimer()
        set := Set{set: make(map[interface{}]struct{})}
        for _, elem := range testSet {
            set.Add(elem)
        }
        b.StartTimer()
        for _, elem := range testSet {
            set.Delete(elem)
        }
    }
}

Now those initializations won’t be counted in the benchmark results, and we’ll see only the results of the Delete calls.

Benchmarks comparison

Of course, benchmarks are of little use if you can’t compare them across different versions of the code.

Here is example code for marshaling a struct to JSON, and a benchmark for it:

type testStruct struct {
    X int
    Y string
}

func (t *testStruct) ToJSON() ([]byte, error) {
    return json.Marshal(t)
}

func BenchmarkToJSON(b *testing.B) {
    tmp := &testStruct{X: 1, Y: "string"}
    js, err := tmp.ToJSON()
    if err != nil {
        b.Fatal(err)
    }
    b.SetBytes(int64(len(js)))
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := tmp.ToJSON(); err != nil {
            b.Fatal(err)
        }
    }
}

It’s commited in git already, now I want to try cool trick and measure its performance. I slightly modify ToJSON method:

func (t *testStruct) ToJSON() ([]byte, error) {
    return []byte(`{"X": ` + strconv.Itoa(t.X) + `, "Y": "` + t.Y + `"}`), nil
}

Now it’s time to run our bechmarks, let’s save their results in files this time:

go test -bench=. -benchmem bench_test.go > new.txt
git stash
go test -bench=. -benchmem bench_test.go > old.txt

Now we can compare those results with the benchcmp utility. You can install it with go get golang.org/x/tools/cmd/benchcmp. Here is the result of the comparison:

# benchcmp old.txt new.txt
benchmark           old ns/op     new ns/op     delta
BenchmarkToJSON     1579          495           -68.65%

benchmark           old MB/s     new MB/s     speedup
BenchmarkToJSON     12.66        46.41        3.67x

benchmark           old allocs     new allocs     delta
BenchmarkToJSON     2              2              +0.00%

benchmark           old bytes     new bytes     delta
BenchmarkToJSON     184           48            -73.91%

It’s very good to see such tables, they also can add weight to your opensource contributions.

Writing profiles

You can also write CPU and memory profiles from benchmarks:

go test -bench=. -benchmem -cpuprofile=cpu.out -memprofile=mem.out bench_test.go

You can read how to analyze profiles in an awesome post on blog.golang.org.

Conclusion

Benchmarks are an awesome instrument for a programmer, and in Go writing and analyzing benchmarks is extremely easy. New benchmarks allow you to find performance bottlenecks, weird code (efficient code is often simpler and more readable) or usage of the wrong instruments. Old benchmarks allow you to be more confident in your changes and can be another +1 in the review process. So, writing benchmarks has enormous benefits for a programmer and for the code, and I encourage you to write more. It’s fun!

Sep 24, 2015

Mystery of finalizers in Go

Finalizers

A finalizer is basically a function which will be called when your object loses all references and is found by the GC. In Go you can add finalizers to your objects with the runtime.SetFinalizer function. Let’s see how it works.

package main

import (
    "fmt"
    "runtime"
    "time"
)

type Test struct {
    A int
}

func test() {
    // create pointer
    a := &Test{}
    // add finalizer which just prints
    runtime.SetFinalizer(a, func(a *Test) { fmt.Println("I AM DEAD") })
}

func main() {
    test()
    // run garbage collection
    runtime.GC()
    // sleep to switch to finalizer goroutine
    time.Sleep(1 * time.Millisecond)
}
Output obviously will be:

I AM DEAD

So, we created the object a, which is a pointer, and set a simple finalizer on it. When the code leaves the test function, all references to it disappear, and therefore the garbage collector is able to collect a and call the finalizer in its own goroutine. You can try to modify test() to return *Test and print it in main(); then you’ll see that the finalizer won’t be called. The finalizer also won’t be called if you remove the A field from the Test type, because then Test becomes an empty struct, and an empty struct allocates no memory and can’t be collected by the GC.

Finalizers examples

Let’s try to find finalizers usage in standard library. There it is used only for for closing file descriptors like this in net package:

runtime.SetFinalizer(fd, (*netFD).Close)
So, you’ll never leak fd even if you forget to Close net.Conn.

So finalizers are probably not such a good idea if even the standard library makes such limited use of them. Let’s see what the problems can be.

Why you should avoid finalizers

Finalizers are a pretty tempting idea if you come from a language without GC, or one where you’re not expecting users to write proper code. In Go we have both a GC and pro users :) So, in my opinion, an explicit call to Close is always better than a magic finalizer. For example, there is a finalizer for the fd in os:

// NewFile returns a new File with the given file descriptor and name.
func NewFile(fd uintptr, name string) *File {
    fdi := int(fd)
    if fdi < 0 {
        return nil
    }
    f := &File{&file{fd: fdi, name: name}}
    runtime.SetFinalizer(f.file, (*file).close)
    return f
}
and NewFile is called by OpenFile, which is called by Open, so if you’re opening a file you’ll hit that code. The problem with finalizers is that you have no control over them, and more than that, you’re not expecting them. Look at this code:
func getFd(path string) (uintptr, error) {
    f, err := os.Open(path)
    if err != nil {
        return 0, err
    }
    return f.Fd(), nil
}
It’s pretty common operation to get file descriptor from path when you’re writing some stuff for linux. But that code is unreliable, because when you’re return from getFd f loses its last reference and so your file is doomed to be closed sooner or later (when next GC cycle will come). Here is problem not that file will be closed, but that it’s not documented and not expected at all.

Conclusion

I think it’s better to suppose that users are smart enough to cleanup object themselves. At least all methods which call SetFinalizer should document this, but I personally don’t see any value in this method for me.

Aug 26, 2015

Unprivileged containers in Go, Part4: Network namespace

Network namespace

From man namespaces:

Network  namespaces  provide  isolation of the system resources associated with
networking: network devices, IPv4 and IPv6 protocol stacks, IP routing tables,
firewalls, the /proc/net directory, the /sys/class/net directory, port numbers
(sockets), and so on.  A physical network device can live in exactly one
network namespace.
A  virtual  network  device ("veth") pair provides a pipe-like abstraction that
can be used to create tunnels between network namespaces, and can be used to
create a bridge to a physical network device in another namespace.

A network namespace allows you to isolate the network stack for your container. Note that it does not include the hostname - that’s the task of the UTS namespace.

We can create a network namespace with the flag syscall.CLONE_NEWNET in SysProcAttr.Cloneflags. Right after namespace creation it contains only autogenerated network interfaces (in most cases just the loopback interface), so we need to inject a network interface into the namespace which allows the container to talk to other containers. We will use veth pairs for this, as mentioned in the man page. It’s not the only way, and probably not the best, but it’s the best known and is what Docker uses by default.

Unet

To create the interfaces we will need a new binary with the suid bit set, because these are pretty privileged operations. We could create them with the awesome iproute2 set of utilities, but I decided to write everything in Go, because it’s fun and I want to promote the awesome netlink library - with it you can do any operation on networking stuff.

I called the new binary unet; you can find it in the unc repo: https://github.com/LK4D4/unc/tree/master/unet

Bridge

First of all we need to create a bridge. Here is sample code from unet which sets up the bridge:

const (
    bridgeName = "unc0"
    ipAddr     = "10.100.42.1/24"
)

func createBridge() error {
    // try to get bridge by name, if it already exists then just exit
    _, err := net.InterfaceByName(bridgeName)
    if err == nil {
        return nil
    }
    if !strings.Contains(err.Error(), "no such network interface") {
        return err
    }
    // create *netlink.Bridge object
    la := netlink.NewLinkAttrs()
    la.Name = bridgeName
    br := &netlink.Bridge{LinkAttrs: la}
    if err := netlink.LinkAdd(br); err != nil {
        return fmt.Errorf("bridge creation: %v", err)
    }
    // set up the IP address for the bridge
    addr, err := netlink.ParseAddr(ipAddr)
    if err != nil {
        return fmt.Errorf("parse address %s: %v", ipAddr, err)
    }
    if err := netlink.AddrAdd(br, addr); err != nil {
        return fmt.Errorf("add address %v to bridge: %v", addr, err)
    }
    // sets up bridge ( ip link set dev unc0 up )
    if err := netlink.LinkSetUp(br); err != nil {
        return err
    }
    return nil
}

I hardcoded the bridge name and IP address for simplicity. Then we need to create a veth pair, attach one side of it to the bridge, and put the other side into our namespace. We will identify the namespace by PID:

const vethPrefix = "uv"

func createVethPair(pid int) error {
    // get bridge to set as master for one side of veth-pair
    br, err := netlink.LinkByName(bridgeName)
    if err != nil {
        return err
    }
    // generate names for interfaces
    x1, x2 := rand.Intn(10000), rand.Intn(10000)
    parentName := fmt.Sprintf("%s%d", vethPrefix, x1)
    peerName := fmt.Sprintf("%s%d", vethPrefix, x2)
    // create *netlink.Veth
    la := netlink.NewLinkAttrs()
    la.Name = parentName
    la.MasterIndex = br.Attrs().Index
    vp := &netlink.Veth{LinkAttrs: la, PeerName: peerName}
    if err := netlink.LinkAdd(vp); err != nil {
        return fmt.Errorf("veth pair creation %s <-> %s: %v", parentName, peerName, err)
    }
    // get peer by name to put it to namespace
    peer, err := netlink.LinkByName(peerName)
    if err != nil {
        return fmt.Errorf("get peer interface: %v", err)
    }
    // put peer side to network namespace of specified PID
    if err := netlink.LinkSetNsPid(peer, pid); err != nil {
        return fmt.Errorf("move peer to ns of %d: %v", pid, err)
    }
    if err := netlink.LinkSetUp(vp); err != nil {
        return err
    }
    return nil
}
After all this we will have a “pipe” from the container to the bridge unc0. But it’s not so easy: don’t forget that we’re talking about unprivileged containers, so we need to run all the code as an unprivileged user, while this particular part must be executed with root rights. We can set the suid bit for this; it allows an unprivileged user to run the binary as privileged. I did the following:

$ go get github.com/LK4D4/unc/unet
$ su
$ chown root:root $GOPATH/bin/unet
$ chmod u+s $GOPATH/bin/unet
$ ln -s $GOPATH/bin/unet /usr/bin/unet

That’s all you need to run this binary. Actually you don’t need to run it, unc will do this :)

Waiting for interface

Now we can create interfaces in the namespace of a specified PID. But the process expects the network to be ready when it starts, so we need to somehow wait until the interface is created by unet in the fork part of the program, before calling syscall.Exec. I decided to use a pretty simple idea for this: just poll the interface list until the first veth device appears. Let’s modify container.Start to put the interface into the namespace after we start the fork-process:

-       return cmd.Run()
+       if err := cmd.Start(); err != nil {
+               return err
+       }
+       logrus.Debugf("container PID: %d", cmd.Process.Pid)
+       if err := putIface(cmd.Process.Pid); err != nil {
+               return err
+       }
+       return cmd.Wait()

The function putIface just calls unet with the PID as an argument:

const suidNet = "unet"

func putIface(pid int) error {
    cmd := exec.Command(suidNet, strconv.Itoa(pid))
    out, err := cmd.CombinedOutput()
    if err != nil {
        return fmt.Errorf("unet: out: %s, err: %v", out, err)
    }
    return nil
}
Now let’s see code for waiting interface inside fork-process:
func waitForIface() (netlink.Link, error) {
    logrus.Debugf("Starting to wait for network interface")
    start := time.Now()
    for {
        fmt.Printf(".")
        if time.Since(start) > 5*time.Second {
            fmt.Printf("\n")
            return nil, fmt.Errorf("failed to find veth interface in 5 seconds")
        }
        // get list of all interfaces
        lst, err := netlink.LinkList()
        if err != nil {
            fmt.Printf("\n")
            return nil, err
        }
        for _, l := range lst {
            // if we found "veth" interface - it's time to continue setup
            if l.Type() == "veth" {
                fmt.Printf("\n")
                return l, nil
            }
        }
        time.Sleep(100 * time.Millisecond)
    }
}
We need to call this function before execProc in the fork. So now we have a veth interface and can continue with its setup and process execution.

Network setup

Now the easiest part: we just need to assign an IP to our new interface and bring it up. I added an IP field to the Cfg struct:

type Cfg struct {
        Hostname string
        Mounts   []Mount
        Rootfs   string
+       IP       string
 }

and filled it with a pseudorandom IP from the bridge subnet (10.100.42.0/24):

const ipTmpl = "10.100.42.%d/24"
defaultCfg.IP = fmt.Sprintf(ipTmpl, rand.Intn(253)+2)

The code for the network setup:

func setupIface(link netlink.Link, cfg Cfg) error {
    // up loopback
    lo, err := netlink.LinkByName("lo")
    if err != nil {
        return fmt.Errorf("lo interface: %v", err)
    }
    if err := netlink.LinkSetUp(lo); err != nil {
        return fmt.Errorf("up lo: %v", err)
    }
    addr, err := netlink.ParseAddr(cfg.IP)
    if err != nil {
        return fmt.Errorf("parse IP: %v", err)
    }
    return netlink.AddrAdd(link, addr)
}
That’s all, now we can exec our process.

Talking containers

Let’s try to connect our containers. I presume here that we’re in directory with busybox rootfs:

$ unc sh
$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0
475: uv5185@if476: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether e2:2b:71:19:73:73 brd ff:ff:ff:ff:ff:ff
    inet 10.100.42.24/24 scope global uv5185
       valid_lft forever preferred_lft forever
    inet6 fe80::e02b:71ff:fe19:7373/64 scope link
       valid_lft forever preferred_lft forever
$ unc ping -c 1 10.100.42.24
PING 10.100.42.24 (10.100.42.24): 56 data bytes
64 bytes from 10.100.42.24: seq=0 ttl=64 time=0.071 ms

--- 10.100.42.24 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.071/0.071/0.071 ms

They can talk! It’s like magic, right? You can find all the code under the tag netns.

The end

This is the last post about unprivileged containers (at least about namespaces). We created an isolated environment for a process which you can run as an unprivileged user. Containers, though, are a little more than just isolation: you also want to specify what a process can do inside a container (Linux capabilities), how many resources it can use (cgroups), and you can imagine many other things. I invite you to look at what we have in runc/libcontainer; it’s not very easy code, but I hope that after my posts you will be able to understand it. If you have any questions feel free to write me, I’m always happy to share my humble knowledge about containers.

Previous parts:

Unprivileged containers in Go, Part3: Mount namespace

Mount namespace

From man namespaces:

Mount namespaces isolate the set of filesystem mount points, meaning that
processes in different mount namespaces can have different views of the
filesystem hierarchy. The set of mounts in a mount namespace is modified using
mount(2) and umount(2).

So, the mount namespace allows you to give your process a different set of mounts. You can have a separate /proc, /dev, etc. It’s as easy as passing one more flag to SysProcAttr.Cloneflags: syscall.CLONE_NEWNS. It has such a weird name because it was the first namespace introduced, and nobody could imagine there would be more. So, if you see CLONE_NEWNS, know that this is the mount namespace. Let’s try to enter our container with a new mount namespace. We’ll see all the same mounts as on the host. That’s because the new mount namespace receives a copy of the parent host namespace as its initial mount table. In our case we’re pretty restricted in what we can do with these mounts; for example, we can’t unmount anything:

$ umount /proc
umount: /proc: not mounted

That’s because we use “unprivileged” namespace. But we can mount new /proc over old:

mount -t proc none /proc

Now you can see that ps shows only your processes. So, to get rid of the host mounts and have a nice clean mount table, we can use the pivot_root syscall to change the root from the host root to some other one. But first we need to write some code to actually mount something into the new rootfs.

Mounting inside root file system

So, for the next steps we will need a root filesystem for our tests. I will use the busybox one, because it’s very small but useful. You can take the busybox rootfs from the official Docker image. Just unpack it into a busybox directory somewhere:

$ mkdir busybox
$ cd busybox
$ wget https://github.com/jpetazzo/docker-busybox/raw/master/rootfs.tar
$ tar xvf rootfs.tar

Now that we have a rootfs, we need to mount some stuff inside it. Let’s create a data structure for describing mounts:

type Mount struct {
    Source string
    Target string
    Fs     string
    Flags  int
    Data   string
}
It is just the arguments to syscall.Mount in the form of a structure. Now we can add some mounts and the path to the rootfs (it will just be the current directory for unc), in addition to the hostname, to our Cfg structure:
type Cfg struct {
    Path     string
    Args     []string
    Hostname string
    Mounts   []Mount
    Rootfs   string
}
For a start I added /proc (to see the process tree from the new PID namespace; by the way, you can’t mount /proc without a PID namespace) and /dev:
    Mounts: []Mount{
        {
            Source: "proc",
            Target: "/proc",
            Fs:     "proc",
            Flags:  defaultMountFlags,
        },
        {
            Source: "tmpfs",
            Target: "/dev",
            Fs:     "tmpfs",
            Flags:  syscall.MS_NOSUID | syscall.MS_STRICTATIME,
            Data:   "mode=755",
        },
    },

The mounting function looks very easy: we just iterate over the mounts and call syscall.Mount:

func mount(cfg Cfg) error {
    for _, m := range cfg.Mounts {
        target := filepath.Join(cfg.Rootfs, m.Target)
        if err := syscall.Mount(m.Source, target, m.Fs, uintptr(m.Flags), m.Data); err != nil {
            return fmt.Errorf("failed to mount %s to %s: %v", m.Source, target, err)
        }
    }
    return nil
}

Now we have something mounted inside our new rootfs. Time to pivot_root into it.

Pivot root

From man 2 pivot_root:

int pivot_root(const char *new_root, const char *put_old);
...
pivot_root() moves the root filesystem of the calling process to the directory
put_old and makes new_root the new root filesystem of the calling process.

...

       The following restrictions apply to new_root and put_old:

       -  They must be directories.

       -  new_root and put_old must not be on the same filesystem as the current root.

       -  put_old must be underneath new_root, that is, adding a nonzero number
          of /.. to the string pointed to by put_old must yield the same directory as new_root.

       -  No other filesystem may be mounted on put_old.

So, it takes the current root, moves it with all its mounts to put_old, and makes new_root the new root. pivot_root is more secure than chroot; it's pretty hard to escape from it. Sometimes pivot_root doesn't work (for example on Android systems, because of the special kernel loading process); then you need to mount to "/" with the MS_MOVE flag and chroot there, but we won't discuss that case here.

Here is the function which we will use for changing root:

func pivotRoot(root string) error {
    // we need this to satisfy restriction:
    // "new_root and put_old must not be on the same filesystem as the current root"
    if err := syscall.Mount(root, root, "bind", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
        return fmt.Errorf("Mount rootfs to itself error: %v", err)
    }
    // create rootfs/.pivot_root as path for old_root
    pivotDir := filepath.Join(root, ".pivot_root")
    if err := os.Mkdir(pivotDir, 0777); err != nil {
        return err
    }
    // pivot_root to rootfs, now old_root is mounted in rootfs/.pivot_root
    // mounts from it still can be seen in `mount`
    if err := syscall.PivotRoot(root, pivotDir); err != nil {
        return fmt.Errorf("pivot_root %v", err)
    }
    // change working directory to /
    // it is recommendation from man-page
    if err := syscall.Chdir("/"); err != nil {
        return fmt.Errorf("chdir / %v", err)
    }
    // path to pivot root now changed, update
    pivotDir = filepath.Join("/", ".pivot_root")
    // umount rootfs/.pivot_root(which is now /.pivot_root) with all submounts
    // now we have only mounts that we mounted ourselves in `mount`
    if err := syscall.Unmount(pivotDir, syscall.MNT_DETACH); err != nil {
        return fmt.Errorf("unmount pivot_root dir %v", err)
    }
    // remove temporary directory
    return os.Remove(pivotDir)
}
I hope that everything is clear from the comments; let me know if not. That is all the code you need to have your own unprivileged container with its own rootfs. You can try to find other rootfs archives among docker images sources; for example, alpine linux is a pretty exciting distribution. You can also try to mount something more inside the container.

That's all for today. The tag for this article on github is mnt_ns. Remember that you should run unc as an unprivileged user and from the directory which contains the rootfs. Here are examples of some commands inside the container (excluding logging):

$ unc cat /etc/os-release
NAME=Buildroot
VERSION=2015.02
ID=buildroot
VERSION_ID=2015.02
PRETTY_NAME="Buildroot 2015.02"

$ unc mount
/dev/sda3 on / type ext4 (rw,noatime,nodiratime,nobarrier,errors=remount-ro,commit=600)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,nodev,mode=755,uid=1000,gid=1000)

$ unc ps awxu
PID   USER     COMMAND
    1 root     ps awxu

Looks pretty “container-ish” I think :)

Previous parts:

Unprivileged containers in Go, Part2: UTS namespace (setup namespaces)

Setup namespaces

In the previous part we created some namespaces and executed a process in them. That was cool, but in the real world we need to set up namespaces before the process starts: for example set up mounts, chroot, set the hostname, create network interfaces, etc. We need this because we can't expect the user process to do it; it wants everything ready before it executes.

So, in our case we want to insert some code after namespace creation, but before process execution. In C this is pretty easy to do, because there is the clone call. It's not as easy in Go (but still easy, really). In Go we need to spawn a new process running our code in the new namespaces. We can do that by executing our own binary again with different arguments.

Look at code:

    cmd := &exec.Cmd{
        Path: os.Args[0],
        Args: append([]string{"unc-fork"}, os.Args[1:]...),
    }

Here we create an *exec.Cmd which will call the same binary with the same arguments as the caller, but will replace os.Args[0] with the string unc-fork (yes, you can specify any os.Args[0], not only the program name). It will be our keyword, indicating that we want to set up namespaces and execute the process.

So, let's insert the following lines at the top of the main() function:

    if os.Args[0] == "unc-fork" {
        if err := fork(); err != nil {
            log.Fatalf("Fork error: %v", err)
        }
        os.Exit(0)
    }

It means: in the special case of that os.Args[0], execute the function fork() and exit.

Let’s write fork() now:

func fork() error {
    fmt.Println("Start fork")
    path, err := exec.LookPath(os.Args[1])
    if err != nil {
        return fmt.Errorf("LookPath: %v", err)
    }
    fmt.Printf("Execute %s\n", append([]string{path}, os.Args[2:]...))
    return syscall.Exec(path, os.Args[1:], os.Environ())
}

It's the simplest possible fork() function; it just prints some messages before starting the process. Let's look at the os.Args array here. For example, if we wanted to spawn sh -c "echo hello" in the namespaces, the array now looks like ["unc-fork", "sh", "-c", "echo hello"]. We resolve "sh" to "/bin/sh" and call

syscall.Exec("/bin/sh", []string{"sh", "-c", "echo hello"}, os.Environ())

syscall.Exec calls the execve syscall; you can read more about it in man execve. It receives the path to the binary, the arguments, and an array of environment variables. Here we just pass all variables down to the process, but we could change them in fork() too.

UTS namespace

Let's do some real work in our new shiny function. Let's try to set the hostname for our "container" (by default it inherits the hostname of the host). Add the following lines to fork():

    if err := syscall.Sethostname([]byte("unc")); err != nil {
        return fmt.Errorf("Sethostname: %v", err)
    }

If we try to execute this code we’ll get:

Fork error: Sethostname: operation not permitted

because we’re trying to change hostname in host’s UTS namespace.

From man namespaces:

UTS  namespaces  provide  isolation  of two system identifiers: the hostname and the NIS domain name.

So let's isolate our hostname from the host's hostname. We can create our own UTS namespace by adding syscall.CLONE_NEWUTS to Cloneflags. Now we'll see a successfully changed hostname:

$ unc hostname
unc

Code

The tag on github for this article is uts_setup; it can be found here. I added some functions to separate the steps and created a Cfg structure in the container.go file, so later we can change the container configuration in one place. I also added logging with the awesome library logrus.

Thanks for reading! I hope to see you next week in part about mount namespaces, it’ll be very interesting.

Previous parts:

Unprivileged containers in Go, Part1: User and PID namespaces

Unprivileged namespaces

Unprivileged (or user) namespaces are Linux namespaces which can be created by an unprivileged (non-root) user. This is possible only with the use of user namespaces. Exhaustive info about user namespaces can be found in the manpage man user_namespaces. Basically, to create your namespaces you need to create a user namespace first. The kernel can take on the job of creating namespaces in the right order for you, so you can just pass a bunch of flags to clone, and the user namespace is always created first and is the parent of the other namespaces.

User namespace

In a user namespace you can map users and groups from the host to this namespace, so, for example, your user with uid 1000 can be 0 (root) in the namespace.

Mrunal Patel contributed support for user and group mappings to Go, and Go 1.4.0 included it. Unfortunately, there was a security fix in Linux kernel 3.18 which prevents group mappings by an unprivileged user without disabling the setgroups syscall. It was fixed by me and will be released in 1.5.0 (UPD: already released!).

To execute a process in a new user namespace, you need to create an *exec.Cmd like this:

cmd := exec.Command(binary, args...)
cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUSER,
    UidMappings: []syscall.SysProcIDMap{
        {
            ContainerID: 0,
            HostID:      Uid,
            Size:        1,
        },
    },
    GidMappings: []syscall.SysProcIDMap{
        {
            ContainerID: 0,
            HostID:      Gid,
            Size:        1,
        },
    },
}

Here you can see the syscall.CLONE_NEWUSER flag in SysProcAttr.Cloneflags, which just means "please create a new user namespace for this process"; other namespaces can be specified there too. The mapping fields speak for themselves. Size is the size of a range of mapped IDs, so you can remap many IDs without specifying each one.

PID namespaces

From man pid_namespaces:

PID namespaces isolate the process ID number space

That is it: your process in this namespace has PID 1, which is sorta cool: you are like systemd, but better. In our first part ps awux will still show all host processes, because for that we also need a mount namespace and to remount /proc, but you can already see PID 1 with echo $$.

First unprivileged container

I am pretty bad at writing big texts, so I decided to split container creation into several parts. Today we will see only user and PID namespace creation, which is still pretty impressive. So, to add a PID namespace we need to modify Cloneflags:

    Cloneflags: syscall.CLONE_NEWUSER | syscall.CLONE_NEWPID

For these articles, I created a project on Github: https://github.com/LK4D4/unc. unc means "unprivileged container" and has nothing in common with runc (well, maybe a little). I will tag the code for each article in the repo. The tag for this article is user_pid. Just compile it with go1.5 and try to run different commands as an unprivileged user in the namespaces:

$ unc sh
$ unc whoami
$ unc sh -c "echo \$\$"

It does nothing fancy; it just connects your standard streams to the executed process and executes it in new namespaces, remapping the current user and group to the root user and group inside the user namespace. Please read all the code; there is not much of it for now.

Next steps

The most interesting part of containers is the mount namespace. It allows you to have mounts separate from the host (/proc for example). Another interesting namespace is the network one; it is a little tough for an unprivileged user, because you need to create network interfaces on the host first, so you need some superpowers from root. In the next article I hope to cover the mount namespace, making it a real container with its own root filesystem.

Thanks for reading! I am learning all this stuff myself right now by writing these articles, so if you have something to say, please feel free to comment!

30 days of hacking Docker

Prelude

Yesterday I finished my first 30-day streak on GitHub. Most of my contributions were to Docker – the biggest open source project in Go. I learned a lot this month, and it was really cool. I think that this is mostly because of the Go language. I had been programming in Python for five years, and I was never so excited about open source, because Python is not even half as fun as Go.

1. Tools

There are a lot of tools for Go; some of them are just "must have".

Goimports - like go fmt but with cool import handling. I really think that go fmt should be replaced with Goimports in future Go versions.

Vet - analyzes code for suspicious constructs. With it you can find bad format strings, unreachable code, passing a mutex by value, etc. See the PR about vet errors in Docker.

Golint - checks code against the Google style guide.

2. Editor

I love my awesome vim with awesome vim-go plugin, which is integrated with tools mentioned above. It formats code for me, adds needed imports, removes unused imports, shows documentation, supports tagbar and more. And my favourite - go to definition. I really suffered without it :) With vim-go my development rate became faster than I could imagine. You can see my config in my dotfiles repo.

3. Race detector

This is one of the most important and most underestimated things. Very useful and very easy to use. You can find a description and examples here. I've found many race conditions with this tool (#1, #2, #3, #4, #5).

4. Docker specific

Docker has a very smart and friendly community. You can always ask for help about hacking in #docker-dev on Freenode. But I'll describe some simple tasks that appear when you try to hack on docker for the first time.

Tests

There are three kinds of tests in docker repo:

  • unit - unit tests (ah, we all know what unit tests are, right?). These tests are spread all over the repository and can be run with make test-unit. You can run tests for one directory by specifying it in the TESTDIRS variable. For example

    TESTDIRS="daemon" make test-unit
    

    will run tests only for the daemon directory.

  • integration-cli - integration tests that use external docker commands (for example docker build, docker run, etc.). It is very easy to write this kind of test, and you should do it if you think your changes can change Docker's behavior from the client's point of view. These tests are located in the integration-cli directory and can be run with make test-integration-cli. You can run one or more specific tests by setting the TESTFLAGS variable. For example

    TESTFLAGS="-run TestBuild" make test-integration-cli
    

    will run all tests whose names start with TestBuild.

  • integration - integration tests that use internal docker data structures. They are deprecated now, so if you want to write tests you should prefer integration-cli or unit. These tests are located in the integration directory and can be run with make test-integration.

All tests can be run with make test.

Build and run tests on host

All make commands execute in a docker container; it can be pretty annoying to build a container just to run unit tests, for example.

So, to run unit tests on the host machine you need a canonical Go workspace. When it's ready, you can just symlink the docker repo to src/github.com/dotcloud/docker. But we still need the right $GOPATH; here is the trick:

export GOPATH=<workspace>/src/github.com/dotcloud/docker/vendor:<workspace>

And then, for example, you can run:

go test github.com/dotcloud/docker/daemon/networkdriver/ipallocator

Some tests require external libraries, for example libdevmapper. You can disable them with the DOCKER_BUILDTAGS environment variable. For example:

export DOCKER_BUILDTAGS='exclude_graphdriver_devicemapper exclude_graphdriver_aufs'

For fast building of a dynamic binary you can use this snippet in the docker repo:

export AUTO_GOPATH=1
export DOCKER_BUILDTAGS='exclude_graphdriver_devicemapper exclude_graphdriver_aufs'
hack/make.sh dynbinary

I use those DOCKER_BUILDTAGS for my btrfs system, so if you use aufs or devicemapper you should change them for your driver.

Race detection

To enable race detection in docker I use this patch:

diff --git a/hack/make/binary b/hack/make/binary
index b97069a..74b202d 100755
--- a/hack/make/binary
+++ b/hack/make/binary
@@ -6,6 +6,7 @@ DEST=$1
 go build \
        -o "$DEST/docker-$VERSION" \
        "${BUILDFLAGS[@]}" \
+       -race \
        -ldflags "
                $LDFLAGS
                $LDFLAGS_STATIC_DOCKER

After that, all binaries will be built with race detection. Note that this will slow docker down a lot.

Docker-stress

There is an amazing docker-stress from Spotify for Docker load testing. Usage is pretty straightforward:

./docker-stress -c 50 -t 5

Here 50 clients are trying to run containers which will stay alive for five seconds. docker-stress uses only docker run jobs for testing, so I prefer to also run, in parallel, something like:

docker events
while true; do docker inspect $(docker ps -lq); done
while true; do docker build -t test test; done

and so on.

You definitely need to read Contributing to Docker and Setting Up a Dev Environment. I really don't think anything else is needed to start hacking on Docker.

Conclusion

This is all that I wanted to tell you about my first big open source experience. Also, just today the Docker folks launched some new projects, and I am very excited about them. So, I want to invite you all to the magical world of Go, open source and, of course, Docker.

Defer overhead in go

Prelude

This post is based on real events in the docker repository. When I discovered that my 20-percent-cooler refactoring made the Pop function 4-5 times slower, I did some research and concluded that the problem was in using the defer statement for unlocking everywhere.

In this post I'll write a simple program and benchmarks, from which we will see that sometimes the defer statement can slow down your program a lot.

Let's create a simple queue with methods Put and Get. The next snippets show such a queue and benchmarks for it. I also wrote duplicate methods with defer and without it.

Code

package defertest

import (
    "sync"
)

type Queue struct {
    sync.Mutex
    arr []int
}

func New() *Queue {
    return &Queue{}
}

func (q *Queue) Put(elem int) {
    q.Lock()
    q.arr = append(q.arr, elem)
    q.Unlock()
}

func (q *Queue) PutDefer(elem int) {
    q.Lock()
    defer q.Unlock()
    q.arr = append(q.arr, elem)
}

func (q *Queue) Get() int {
    q.Lock()
    if len(q.arr) == 0 {
        q.Unlock()
        return 0
    }
    res := q.arr[0]
    q.arr = q.arr[1:]
    q.Unlock()
    return res
}

func (q *Queue) GetDefer() int {
    q.Lock()
    defer q.Unlock()
    if len(q.arr) == 0 {
        return 0
    }
    res := q.arr[0]
    q.arr = q.arr[1:]
    return res
}

Benchmarks

package defertest

import (
    "testing"
)

func BenchmarkPut(b *testing.B) {
    q := New()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for j := 0; j < 1000; j++ {
            q.Put(j)
        }
    }
}

func BenchmarkPutDefer(b *testing.B) {
    q := New()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for j := 0; j < 1000; j++ {
            q.PutDefer(j)
        }
    }
}

func BenchmarkGet(b *testing.B) {
    q := New()
    for i := 0; i < 1000; i++ {
        q.Put(i)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for j := 0; j < 2000; j++ {
            q.Get()
        }
    }
}

func BenchmarkGetDefer(b *testing.B) {
    q := New()
    for i := 0; i < 1000; i++ {
        q.Put(i)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for j := 0; j < 2000; j++ {
            q.GetDefer()
        }
    }
}

Results

BenchmarkPut       50000             63002 ns/op
BenchmarkPutDefer  10000            143391 ns/op
BenchmarkGet       50000             72045 ns/op
BenchmarkGetDefer  10000            249029 ns/op

Conclusion

You don't need defers in small functions with one or two exit points.

Update

Retested with go from tip, as Cezar Sá Espinola suggested. Here are the results:

BenchmarkPut       50000             54633 ns/op
BenchmarkPutDefer  10000            102971 ns/op
BenchmarkGet       50000             65148 ns/op
BenchmarkGetDefer  10000            180839 ns/op
May 14, 2014

Coverage for multiple packages in go

Prelude

There is awesome coverage tooling in Go. You can read about it here. But it also has some limitations. For example, let's assume that we have the following code structure:

src
├── pkg1
│   ├── pkg11
│   └── pkg12
└── pkg2
    ├── pkg21
    └── pkg22

pkg2, pkg21, and pkg22 use pkg1, pkg11, and pkg12 in different cases. So the question is – how can we compute overall coverage for our code base?

Generating cover profiles

Let's consider some possible go test commands with -coverprofile:

go test -coverprofile=cover.out pkg2

tests run only for pkg2 and the cover profile is generated only for pkg2

go test -coverprofile=cover.out -coverpkg=./... pkg2

tests run only for pkg2, but the cover profile is generated for all packages

go test -coverprofile=cover.out -coverpkg=./... ./...

boo hoo: cannot use test profile flag with multiple packages

So, what can we do to run tests on all packages and get a cover profile for all packages?

Merging cover profiles

Now we are able to get an overall profile for each package individually. It seems that we can merge these files. A profile file has the following structure, according to the cover code:

// First line is "mode: foo", where foo is "set", "count", or "atomic".
// Rest of file is in the format
//      encoding/base64/base64.go:34.44,37.40 3 1
// where the fields are: name.go:line.column,line.column numberOfStatements count

So, using the magic of awk, I found the following solution to this problem:

go test -coverprofile=pkg1.cover.out -coverpkg=./... pkg1
go test -coverprofile=pkg11.cover.out -coverpkg=./... pkg1/pkg11
go test -coverprofile=pkg12.cover.out -coverpkg=./... pkg1/pkg12
go test -coverprofile=pkg2.cover.out -coverpkg=./... pkg2
go test -coverprofile=pkg21.cover.out -coverpkg=./... pkg2/pkg21
go test -coverprofile=pkg22.cover.out -coverpkg=./... pkg2/pkg22
echo "mode: set" > coverage.out && cat *.cover.out | grep -v mode: | sort -r | \
awk '{if($1 != last) {print $0;last=$1}}' >> coverage.out

The true meaning of the last line I leave as an exercise for the reader :) Now we have a profile for all code that was executed by all tests. We can use this merged profile coverage.out with the go cover tool:

go tool cover -html=coverage.out

or for generating a cobertura report:

gocover-cobertura < coverage.out > coverage.xml

Conclusion

Of course this solution is only a workaround, and it works only for mode: set. Similar logic should be embedded in the cover tool itself. I really hope that one day we will be able to run

go test -coverprofile=cover.out -coverpkg=./... ./...

and lean back in our chairs, enjoying perfect cover profiles.

May 6, 2014