LK4D4 Blog


Unprivileged containers in Go, Part 4: Network namespace

Jul 30, 2015

Network namespace

From man namespaces:

Network namespaces provide isolation of the system resources associated with
networking: network devices, IPv4 and IPv6 protocol stacks, IP routing tables,
firewalls, the /proc/net directory, the /sys/class/net directory, port numbers
(sockets), and so on. A physical network device can live in exactly one
network namespace.
A virtual network device ("veth") pair provides a pipe-like abstraction that
can be used to create tunnels between network namespaces, and can be used to
create a bridge to a physical network device in another namespace.

A network namespace lets you isolate the network stack for your container. Note that it does not include the hostname; that is the job of the UTS namespace.

We can create a network namespace with the syscall.CLONE_NEWNET flag in SysProcAttr.Cloneflags. After the namespace is created, it contains only autogenerated network interfaces (in most cases only the loopback interface). So we need to inject a network interface into the namespace that allows the container to talk to other containers. We will use veth pairs for this, as mentioned in the man page. It’s not the only way and probably not the best, but it is the best known and is what Docker uses by default.

Unet

To create the interfaces we need a new binary with the suid bit set, because these are pretty privileged operations. We could create them with the awesome iproute2 set of utilities, but I decided to write everything in Go, because it’s fun and I want to promote the awesome netlink library. With this library you can do any operation on networking stuff.

I called the new binary unet; you can find it in the unc repo: https://github.com/LK4D4/unc/tree/master/unet

Bridge

First of all we need to create a bridge. Here is sample code from unet that sets up the bridge:

const (
    bridgeName = "unc0"
    ipAddr     = "10.100.42.1/24"
)

func createBridge() error {
    // try to get bridge by name, if it already exists then just exit
    _, err := net.InterfaceByName(bridgeName)
    if err == nil {
        return nil
    }
    if !strings.Contains(err.Error(), "no such network interface") {
        return err
    }
    // create *netlink.Bridge object
    la := netlink.NewLinkAttrs()
    la.Name = bridgeName
    br := &netlink.Bridge{LinkAttrs: la}
    if err := netlink.LinkAdd(br); err != nil {
        return fmt.Errorf("bridge creation: %v", err)
    }
    // set up ip address for bridge
    addr, err := netlink.ParseAddr(ipAddr)
    if err != nil {
        return fmt.Errorf("parse address %s: %v", ipAddr, err)
    }
    if err := netlink.AddrAdd(br, addr); err != nil {
        return fmt.Errorf("add address %v to bridge: %v", addr, err)
    }
    // bring the bridge up (ip link set dev unc0 up)
    if err := netlink.LinkSetUp(br); err != nil {
        return err
    }
    return nil
}
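
The existence check above matches on the text of the error message, which works but depends on the exact wording. A possible alternative, sketched here with only the standard library, is to list the interfaces and compare names. The helper name ifaceExists is made up for this example and is not part of unet.

```go
package main

import (
	"fmt"
	"net"
)

// ifaceExists (hypothetical helper) reports whether an interface with the
// given name is visible in the current network namespace.
func ifaceExists(name string) (bool, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return false, err
	}
	for _, iface := range ifaces {
		if iface.Name == name {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// "unc0" is the bridge name used in the post.
	ok, err := ifaceExists("unc0")
	if err != nil {
		fmt.Println("list interfaces:", err)
		return
	}
	fmt.Println("unc0 exists:", ok)
}
```

The trade-off is one extra listing call in exchange for not depending on an error string.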

I hardcoded the bridge name and IP address for simplicity. Then we need to create a veth pair, attach one side of it to the bridge and put the other side into our namespace. We will identify the namespace by PID:

const vethPrefix = "uv"

func createVethPair(pid int) error {
    // get bridge to set as master for one side of veth-pair
    br, err := netlink.LinkByName(bridgeName)
    if err != nil {
        return err
    }
    // generate names for interfaces
    x1, x2 := rand.Intn(10000), rand.Intn(10000)
    parentName := fmt.Sprintf("%s%d", vethPrefix, x1)
    peerName := fmt.Sprintf("%s%d", vethPrefix, x2)
    // create *netlink.Veth
    la := netlink.NewLinkAttrs()
    la.Name = parentName
    la.MasterIndex = br.Attrs().Index
    vp := &netlink.Veth{LinkAttrs: la, PeerName: peerName}
    if err := netlink.LinkAdd(vp); err != nil {
        return fmt.Errorf("veth pair creation %s <-> %s: %v", parentName, peerName, err)
    }
    // get peer by name to put it to namespace
    peer, err := netlink.LinkByName(peerName)
    if err != nil {
        return fmt.Errorf("get peer interface: %v", err)
    }
    // put peer side to network namespace of specified PID
    if err := netlink.LinkSetNsPid(peer, pid); err != nil {
        return fmt.Errorf("move peer to ns of %d: %v", pid, err)
    }
    if err := netlink.LinkSetUp(vp); err != nil {
        return err
    }
    return nil
}

After all this we will have a “pipe” from the container to the bridge unc0. But it’s not that easy: don’t forget that we are talking about unprivileged containers, so we need to run all the code as an unprivileged user, while this particular part must be executed with root rights. We can set the suid bit for this, which allows an unprivileged user to run the binary as a privileged one. Here is what I did:

$ go get github.com/LK4D4/unc/unet
$ su
$ chown root:root $GOPATH/bin/unet
$ chmod u+s $GOPATH/bin/unet
$ ln -s $GOPATH/bin/unet /usr/bin/unet

That’s all you need to run this binary. Actually, you don’t need to run it yourself; unc will do it :)
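
Since a suid binary can be started by any user, it is worth validating its input before doing anything privileged. The following is a hedged sketch of how such a helper might check its effective UID and its single PID argument. The function parsePID and the messages are invented for this example and are not taken from unet.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// parsePID (hypothetical helper) accepts exactly one argument,
// which must be a positive decimal PID.
func parsePID(args []string) (int, error) {
	if len(args) != 1 {
		return 0, errors.New("usage: unet <pid>")
	}
	pid, err := strconv.Atoi(args[0])
	if err != nil || pid <= 0 {
		return 0, fmt.Errorf("invalid pid %q", args[0])
	}
	return pid, nil
}

func main() {
	// With the suid bit set and root as owner, the effective UID is 0
	// even though the real UID is the calling user.
	if os.Geteuid() != 0 {
		fmt.Println("effective UID is not 0; is the suid bit set and the owner root?")
		return
	}
	pid, err := parsePID(os.Args[1:])
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("would set up networking for PID", pid)
}
```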

Waiting for interface

Now we can create interfaces in the namespace of a specified PID. But the process expects the network to be ready when it starts, so in the fork part of the program we need to somehow wait until unet has created the interface before calling syscall.Exec. I decided to use a pretty simple idea for this: just poll the interface list until the first veth device appears. Let’s modify our container.Start to put the interface into the namespace after we start the fork process:

-       return cmd.Run()
+       if err := cmd.Start(); err != nil {
+               return err
+       }
+       logrus.Debugf("container PID: %d", cmd.Process.Pid)
+       if err := putIface(cmd.Process.Pid); err != nil {
+               return err
+       }
+       return cmd.Wait()

The putIface function just calls unet with the PID as an argument:

const suidNet = "unet"

func putIface(pid int) error {
    cmd := exec.Command(suidNet, strconv.Itoa(pid))
    out, err := cmd.CombinedOutput()
    if err != nil {
        return fmt.Errorf("unet: out: %s, err: %v", out, err)
    }
    return nil
}

Now let’s look at the code that waits for the interface inside the fork process:

func waitForIface() (netlink.Link, error) {
    logrus.Debugf("Starting to wait for network interface")
    start := time.Now()
    for {
        fmt.Printf(".")
        if time.Since(start) > 5*time.Second {
            fmt.Printf("\n")
            return nil, fmt.Errorf("failed to find veth interface in 5 seconds")
        }
        // get list of all interfaces
        lst, err := netlink.LinkList()
        if err != nil {
            fmt.Printf("\n")
            return nil, err
        }
        for _, l := range lst {
            // if we found "veth" interface - it's time to continue setup
            if l.Type() == "veth" {
                fmt.Printf("\n")
                return l, nil
            }
        }
        time.Sleep(100 * time.Millisecond)
    }
}

We need to call this function before execProc in fork. Now we have the veth interface, and we can continue with its setup and process execution.

Network setup

Now the easiest part: we just need to assign an IP to our new interface and bring it up. I added an IP field to the Cfg struct:

type Cfg struct {
        Hostname string
        Mounts   []Mount
        Rootfs   string
+       IP       string
 }

and filled it with a pseudorandom IP from the bridge subnet (10.100.42.0/24):

const ipTmpl = "10.100.42.%d/24"
defaultCfg.IP = fmt.Sprintf(ipTmpl, rand.Intn(253)+2)

Code for network setup:

func setupIface(link netlink.Link, cfg Cfg) error {
    // up loopback
    lo, err := netlink.LinkByName("lo")
    if err != nil {
        return fmt.Errorf("lo interface: %v", err)
    }
    if err := netlink.LinkSetUp(lo); err != nil {
        return fmt.Errorf("up lo: %v", err)
    }
    addr, err := netlink.ParseAddr(cfg.IP)
    if err != nil {
        return fmt.Errorf("parse IP: %v", err)
    }
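    if err := netlink.AddrAdd(link, addr); err != nil {
        return fmt.Errorf("add address %v: %v", addr, err)
    }
    // bring the veth interface itself up
    return netlink.LinkSetUp(link)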
}

That’s all; now we can exec our process.

Talking containers

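Let’s try to connect our containers. I presume here that we are in a directory with a busybox rootfs. The first unc command starts a shell in one container; the unc ping further down is run from another terminal and starts a second container: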

$ unc sh
$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0
475: uv5185@if476: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether e2:2b:71:19:73:73 brd ff:ff:ff:ff:ff:ff
    inet 10.100.42.24/24 scope global uv5185
       valid_lft forever preferred_lft forever
    inet6 fe80::e02b:71ff:fe19:7373/64 scope link
       valid_lft forever preferred_lft forever
$ unc ping -c 1 10.100.42.24
PING 10.100.42.24 (10.100.42.24): 56 data bytes
64 bytes from 10.100.42.24: seq=0 ttl=64 time=0.071 ms

--- 10.100.42.24 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.071/0.071/0.071 ms

They can talk! It’s like magic, right? You can find all the code under the netns tag.

The end

This is the last post about unprivileged containers (at least about namespaces). We created an isolated environment for a process, which you can run as an unprivileged user. Containers, though, are a little more than just isolation: you also want to specify what a process can do inside the container (Linux capabilities) and how many resources it can use (cgroups), and you can imagine many other things. I invite you to look at what we have in runc/libcontainer. It’s not very easy code, but I hope that after my posts you will be able to understand it. If you have any questions, feel free to write to me; I’m always happy to share my humble knowledge about containers.


