LK4D4 Blog


Full Category Index

Posts in “Containers”

Unprivileged containers in Go, Part4: Network namespace

Network namespace

From man namespaces:

Network  namespaces  provide  isolation of the system resources associated with
networking: network devices, IPv4 and IPv6 protocol stacks, IP routing tables,
firewalls, the /proc/net directory, the /sys/class/net directory, port numbers
(sockets), and so on.  A physical network device can live in exactly one
network namespace.
A  virtual  network  device ("veth") pair provides a pipe-like abstraction that
can be used to create tunnels between network namespaces, and can be used to
create a bridge to a physical network device in another namespace.

Network namespace allows you to isolate a network stack for your container. Note that it’s not include hostname - it’s tasks of UTS namespace.

We can create network namespace with flag syscall.CLONE_NEWNET in SysProcAttr.Cloneflags. After namespace creation there are only autogenerated network namespaces(in most cases only loopback interface). So we need to inject some network interface into namespace, which allow container to talk to other containers. We will use veth-pairs for this as it was mentioned in man-page. It’s not only way and probably not best, but it is most known and used in Docker by default.

Unet

For interfaces creation we will need new binary with suid bit set, because it’s pretty privileged operations. We can create them with awesome iproute2 set of utilities, but I decided to write all in Go, because it’s fun and I want to promote awesome netlink library - with this library you can do any operations on networking stuff.

I called new binary unet, you can find it in unc repo: https://github.com/LK4D4/unc/tree/master/unet

Bridge

First of all we need to create a bridge. Here is sample code from unet, which sets up bridge:

const (
    bridgeName = "unc0"
    ipAddr     = "10.100.42.1/24"
)

func createBridge() error {
    // try to get bridge by name, if it already exists then just exit
    _, err := net.InterfaceByName(bridgeName)
    if err == nil {
        return nil
    }
    if !strings.Contains(err.Error(), "no such network interface") {
        return err
    }
    // create *netlink.Bridge object
    la := netlink.NewLinkAttrs()
    la.Name = bridgeName
    br := &netlink.Bridge{la}
    if err := netlink.LinkAdd(br); err != nil {
        return fmt.Errorf("bridge creation: %v", err)
    }
    // set up ip addres for bridge
    addr, err := netlink.ParseAddr(ipAddr)
    if err != nil {
        return fmt.Errorf("parse address %s: %v", ipAddr, err)
    }
    if err := netlink.AddrAdd(br, addr); err != nil {
        return fmt.Errorf("add address %v to bridge: %v", addr, err)
    }
    // sets up bridge ( ip link set dev unc0 up )
    if err := netlink.LinkSetUp(br); err != nil {
        return err
    }
    return nil
}

I hardcoded bridge name and IP address for simplicity. Then we need to create veth-pair and attach one side of it to bridge and put another side to our namespace. Namespace we will identify by PID:

const vethPrefix = "uv"

func createVethPair(pid int) error {
    // get bridge to set as master for one side of veth-pair
    br, err := netlink.LinkByName(bridgeName)
    if err != nil {
        return err
    }
    // generate names for interfaces
    x1, x2 := rand.Intn(10000), rand.Intn(10000)
    parentName := fmt.Sprintf("%s%d", vethPrefix, x1)
    peerName := fmt.Sprintf("%s%d", vethPrefix, x2)
    // create *netlink.Veth
    la := netlink.NewLinkAttrs()
    la.Name = parentName
    la.MasterIndex = br.Attrs().Index
    vp := &netlink.Veth{LinkAttrs: la, PeerName: peerName}
    if err := netlink.LinkAdd(vp); err != nil {
        return fmt.Errorf("veth pair creation %s <-> %s: %v", parentName, peerName, err)
    }
    // get peer by name to put it to namespace
    peer, err := netlink.LinkByName(peerName)
    if err != nil {
        return fmt.Errorf("get peer interface: %v", err)
    }
    // put peer side to network namespace of specified PID
    if err := netlink.LinkSetNsPid(peer, pid); err != nil {
        return fmt.Errorf("move peer to ns of %d: %v", pid, err)
    }
    if err := netlink.LinkSetUp(vp); err != nil {
        return err
    }
    return nil
}
After all this we will have “pipe” from container to bridge unc0. But all not so easy, don’t forget that we talking about unprivileged containers, so we need to run all code from unprivileged user, but that particular part must be executed with root rights. We can set suid bit for this, this will allow unprivileged user to run that binary as privileged. I did next:

$ go get github.com/LK4D4/unc/unet
$ sudo chown root:root $(go env GOPATH)/bin/unet
$ sudo chmod u+s $(go env GOPATH)/bin/unet
$ sudo ln -s $(go env GOPATH)/bin/unet /usr/bin/unet

That’s all you need to run this binary. Actually you don’t need to run it, unc will do this :)

Waiting for interface

Now we can create interfaces in namespaces of specified PID. But process expects that network already ready when it starts, so we need somehow to wait until interface will be created by unet in fork part of program, before calling syscall.Exec. I decided to use pretty simple idea for this: just poll an interface list until first veth device is appear. Let’s modify our container.Start to put interface in namespace after we start fork-process:

-       return cmd.Run()
+       if err := cmd.Start(); err != nil {
+               return err
+       }
+       logrus.Debugf("container PID: %d", cmd.Process.Pid)
+       if err := putIface(cmd.Process.Pid); err != nil {
+               return err
+       }
+       return cmd.Wait()

Function putIface just calls unet with PID as argument:

const suidNet = "unet"

func putIface(pid int) error {
    cmd := exec.Command(suidNet, strconv.Itoa(pid))
    out, err := cmd.CombinedOutput()
    if err != nil {
        return fmt.Errorf("unet: out: %s, err: %v", out, err)
    }
    return nil
}
Now let’s see code for waiting interface inside fork-process:
func waitForIface() (netlink.Link, error) {
    logrus.Debugf("Starting to wait for network interface")
    start := time.Now()
    for {
        fmt.Printf(".")
        if time.Since(start) > 5*time.Second {
            fmt.Printf("\n")
            return nil, fmt.Errorf("failed to find veth interface in 5 seconds")
        }
        // get list of all interfaces
        lst, err := netlink.LinkList()
        if err != nil {
            fmt.Printf("\n")
            return nil, err
        }
        for _, l := range lst {
            // if we found "veth" interface - it's time to continue setup
            if l.Type() == "veth" {
                fmt.Printf("\n")
                return l, nil
            }
        }
        time.Sleep(100 * time.Millisecond)
    }
}
We need to put this function before execProc in fork. So, now we have veth interface and we can continue with its setup and process execution.

Network setup

Now easiest part: we just need to set IP to our new interface and set it up. I added IP field to Cfg struct:

type Cfg struct {
        Hostname string
        Mounts   []Mount
        Rootfs   string
+       IP       string
 }

and filled it with pseudorandom IP from bridge subnet(10.100.42.0/24):

const ipTmpl = "10.100.42.%d/24"
defaultCfg.IP = fmt.Sprintf(ipTmpl, rand.Intn(253)+2)

Code for network setup:

func setupIface(link netlink.Link, cfg Cfg) error {
    // up loopback
    lo, err := netlink.LinkByName("lo")
    if err != nil {
        return fmt.Errorf("lo interface: %v", err)
    }
    if err := netlink.LinkSetUp(lo); err != nil {
        return fmt.Errorf("up veth: %v", err)
    }
    addr, err := netlink.ParseAddr(cfg.IP)
    if err != nil {
        return fmt.Errorf("parse IP: %v", err)
    }
    return netlink.AddrAdd(link, addr)
}
That’s all, now we can exec our process.

Talking containers

Let’s try to connect our containers. I presume here that we’re in directory with busybox rootfs:

$ unc sh
$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0
475: uv5185@if476: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether e2:2b:71:19:73:73 brd ff:ff:ff:ff:ff:ff
    inet 10.100.42.24/24 scope global uv5185
       valid_lft forever preferred_lft forever
    inet6 fe80::e02b:71ff:fe19:7373/64 scope link
       valid_lft forever preferred_lft forever
$ unc ping -c 1 10.100.42.24
PING 10.100.42.24 (10.100.42.24): 56 data bytes
64 bytes from 10.100.42.24: seq=0 ttl=64 time=0.071 ms

--- 10.100.42.24 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.071/0.071/0.071 ms

They can talk! It’s like magic, right? You can find all code under tag netns.

You can install latest versions of unc and unet with

go get github.com/LK4D4/unc/...

Binaries will be created in $(go env GOPATH)/bin/.

The end

This is last post about unprivileged containers(at least about namespaces). We created an isolated environment for process, which you can run under unprivileged user. Containers though is little more than just isolation - also you want to specify what process can do inside container (Linux capabilities), how much resources process can use (Cgroups) and you can imagine many other things. I invite you to look what we have in runc/libcontainer, it’s not very easy code, but I hope that after my posts you will be able to understand it. If you have any questions feel free to write me, I’m always happy to share my humble knowledge about containers.

Previous parts:

Unprivileged containers in Go, Part3: Mount namespace

Mount namespace

From man namespaces:

Mount namespaces isolate the set of filesystem mount points, meaning that
processes in different mount namespaces can have different views of the
filesystem hierarchy. The set of mounts in a mount namespace is modified using
mount(2) and umount(2).

So, mount namespace allows you to give your process different set of mounts. You can have separate /proc, /dev etc. It’s easy just like pass one more flag to SysProcAttr.Cloneflags: syscall.CLONE_NEWNS. It has such weird name because it was first introduced namespace and nobody could think that there will be more. So, if you see CLONE_NEWNS, know - this is mount namespace. Let’s try to enter our container with new mount namespace. We’ll see all the same mounts as in host. That’s because new mount namespace receives copy of parent host namespace as initial mount table. In our case we’re pretty restricted in what we can do with this mounts, for example we can’t unmount anything:

$ umount /proc
umount: /proc: not mounted

That’s because we use “unprivileged” namespace. But we can mount new /proc over old:

mount -t proc none /proc

Now you can see, that ps shows you only your process. So, to get rid of host mounts and have nice clean mount table we can use pivot_root syscall to change root from host root to some another. But first we need to write some code to really mount something into new rootfs.

Mounting inside root file system

So, for next steps we will need some root filesystem for tests. I will use busybox one, because it’s very small, but useful. Busybox rootfs from Docker official image you can take here. Just unpack it to directory busybox somewhere:

$ mkdir busybox
$ cd busybox
$ wget https://github.com/jpetazzo/docker-busybox/raw/master/rootfs.tar
$ tar xvf rootfs.tar

Now when we have rootfs, we need to mount some stuff inside it, let’s create datastructure for describing mounts:

type Mount struct {
    Source string
    Target string
    Fs     string
    Flags  int
    Data   string
}
It is just arguments to syscall.Mount in form of structure. Now we can add some mounts and path to rootfs(it will be just current directory for unc) in addition to hostname to our Cfg structure:
type Cfg struct {
    Path     string
    Args     []string
    Hostname string
    Mounts   []Mount
    Rootfs   string
}
For start I added /proc(to see process tree from new PID namespaces, btw you can’t mount /proc without PID namespace) and /dev:
    Mounts: []Mount{
        {
            Source: "proc",
            Target: "/proc",
            Fs:     "proc",
            Flags:  defaultMountFlags,
        },
        {
            Source: "tmpfs",
            Target: "/dev",
            Fs:     "tmpfs",
            Flags:  syscall.MS_NOSUID | syscall.MS_STRICTATIME,
            Data:   "mode=755",
        },
    },

Mounting function looks very easy, we just iterate over mounts and call syscall.Mount:

func mount(cfg Cfg) error {
    for _, m := range cfg.Mounts {
        target := filepath.Join(cfg.Rootfs, m.Target)
        if err := syscall.Mount(m.Source, target, m.Fs, uintptr(m.Flags), m.Data); err != nil {
            return fmt.Errorf("failed to mount %s to %s: %v", m.Source, target, err)
        }
    }
    return nil
}

Now we have something mounted inside our new rootfs. Time to pivot_root to it.

Pivot root

From man 2 pivot_root:

int pivot_root(const char *new_root, const char *put_old);
...
pivot_root() moves the root filesystem of the calling process to the directory
put_old and makes new_root the new root filesystem of the calling process.

...

       The following restrictions apply to new_root and put_old:

       -  They must be directories.

       -  new_root and put_old must not be on the same filesystem as the current root.

       -  put_old must be underneath new_root, that is, adding a nonzero number
          of /.. to the string pointed to by put_old must yield the same directory as new_root.

       -  No other filesystem may be mounted on put_old.

So, it’s taking current root, moves it to old_root with all mounts and makes new_root as new root. pivot_root is more secure than chroot, it’s pretty hard to escape from it. Sometimes pivot_root isn’t working(for example on Android systems, because of special kernel loading process), then you need to use mount to “/” with MS_MOVE flag and chroot there, here we won’t discuss this case.

Here is the function which we will use for changing root:

func pivotRoot(root string) error {
    // we need this to satisfy restriction:
    // "new_root and put_old must not be on the same filesystem as the current root"
    if err := syscall.Mount(root, root, "bind", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
        return fmt.Errorf("Mount rootfs to itself error: %v", err)
    }
    // create rootfs/.pivot_root as path for old_root
    pivotDir := filepath.Join(root, ".pivot_root")
    if err := os.Mkdir(pivotDir, 0777); err != nil {
        return err
    }
    // pivot_root to rootfs, now old_root is mounted in rootfs/.pivot_root
    // mounts from it still can be seen in `mount`
    if err := syscall.PivotRoot(root, pivotDir); err != nil {
        return fmt.Errorf("pivot_root %v", err)
    }
    // change working directory to /
    // it is recommendation from man-page
    if err := syscall.Chdir("/"); err != nil {
        return fmt.Errorf("chdir / %v", err)
    }
    // path to pivot root now changed, update
    pivotDir = filepath.Join("/", ".pivot_root")
    // umount rootfs/.pivot_root(which is now /.pivot_root) with all submounts
    // now we have only mounts that we mounted ourselves in `mount`
    if err := syscall.Unmount(pivotDir, syscall.MNT_DETACH); err != nil {
        return fmt.Errorf("unmount pivot_root dir %v", err)
    }
    // remove temporary directory
    return os.Remove(pivotDir)
}
I hope that all is clear from comments, let me know if not. It is all code that you need to have your own unprivileged container with its own rootfs. You can try to find other rootfs among docker images sources, for example alpine linux is pretty exciting distribution. Also you can try to mount something more inside container.

That’s all for today. Tag for this article on github is mnt_ns. Remember that you should run unc from unprivileged user and from directory, which contains rootfs. Here is examples of some commands inside container(excluding logging):

$ unc cat /etc/os-release
NAME=Buildroot
VERSION=2015.02
ID=buildroot
VERSION_ID=2015.02
PRETTY_NAME="Buildroot 2015.02"

$ unc mount
/dev/sda3 on / type ext4 (rw,noatime,nodiratime,nobarrier,errors=remount-ro,commit=600)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,nodev,mode=755,uid=1000,gid=1000)

$ unc ps awxu
PID   USER     COMMAND
    1 root     ps awxu

Looks pretty “container-ish” I think :)

Previous parts:

Unprivileged containers in Go, Part2: UTS namespace (setup namespaces)

Setup namespaces

In previous part we created some namespaces and executed process in them. It was cool, but in real world we need to setup namespaces before process starts. For example setup mounts, make chroot, set hostname, create network interfaces etc. We need this because we can’t expect from user process that it will do it, it want all ready to execute.

So, in our case we want to insert some code after namespaces creation, but before process execution. In C it’s pretty easy to do, because there is clone call there. Not so easy in Go(but easy, really). In Go we need to spawn new process with our code in new namespaces. We can do it with executing our own binary again with different arguments.

Look at code:

    cmd := &exec.Cmd{
        Path: os.Args[0],
        Args: append([]string{"unc-fork"}, os.Args[1:]...),
    }

Here we create *exec.Cmd which will call same binary with same arguments as caller, but will replace os.Args[0] with string unc-fork (yes, you can specify any os.Args[0], not only program name). It will be our keyword, which indicates that we want to setup namespaces and execute process.

So, let’s insert at the top of main() function next lines:

    if os.Args[0] == "unc-fork" {
        if err := fork(); err != nil {
            log.Fatalf("Fork error: %v", err)
        }
        os.Exit(0)
    }

It means next: execute function fork() and exit in special case of os.Args[0].

Let’s write fork() now:

func fork() error {
    fmt.Println("Start fork")
    path, err := exec.LookPath(os.Args[1])
    if err != nil {
        return fmt.Errorf("LookPath: %v", err)
    }
    fmt.Printf("Execute %s\n", append([]string{path}, os.Args[2:]...))
    return syscall.Exec(path, os.Args[1:], os.Environ())
}

It’s simplest fork() function, it’s just prints some messages before starting process. Let’s look at os.Args array here. For example if we wanted to spawn sh -c "echo hello" in namespaces, then now array looks like ["fork", "sh", "-c", "echo hello"]. We resolving "sh" as "/bin/sh" and call

syscall.Exec("/bin/sh", []string{"sh", "-c", "echo hello"}, os.Environ())

syscall.Exec calls execve syscall, you can read about it more in man execve. It receives path to binary, arguments and array of environmental variables. Here we just passing all variables down to process, but we can change them in fork() too.

UTS namespace

Let’s do some real work in our new shiny function. Let’s try to setup hostname for our “container” (by default it inherits hostname of host). Let’s add next lines to fork():

    if err := syscall.Sethostname([]byte("unc")); err != nil {
        return fmt.Errorf("Sethostname: %v", err)
    }

If we try to execute this code we’ll get:

Fork error: Sethostname: operation not permitted

because we’re trying to change hostname in host’s UTS namespace.

From man namespaces:

UTS  namespaces  provide  isolation  of two system identifiers: the hostname and the NIS domain name.

So let’s isolate our hostname from host’s hostname. We can create our own UTS namespace by adding syscall.CLONE_NEWUTS to Cloneflags. Now we’ll see successfully changed hostname:

$ unc hostname
unc

Code

Tag on github for this article is uts_setup, it can be found here. I added some functions to separate steps, created Cfg structure in container.go file, so later we can change container configuration in one place. Also I added logging with awesome library logrus.

Thanks for reading! I hope to see you next week in part about mount namespaces, it’ll be very interesting.

Previous parts:

Unprivileged containers in Go, Part1: User and PID namespaces

Unprivileged namespaces

Unprivileged(or user) namespaces are Linux namespaces, which can be created by an unprivileged(non-root) user. It is possible only with a usage of user namespaces. Exhaustive info about user namespaces you can find in manpage man user_namespaces. Basically for creating your namespaces you need to create user namespace first. The kernel can take a job of creating namespaces in the right order for you, so you can just pass a bunch of flags to clone and user namespace always created first and is a parent for other namespaces.

User namespace

In user namespace you can map user and groups from host to this namespace, so for example, your user with uid 1000 can be 0(root) in a namespace.

Mrunal Patel introduced to Go support for user and groups and go 1.4.0 including it. Unfortunately, there was security fix to linux kernel 3.18, which prevents group mappings from the unprivileged user without disabling setgroups syscall. It was fixed by me and will be released in 1.5.0 (UPD: Already released!).

For executing process in new user namespace, you need to create *exec.Cmd like this:

cmd := exec.Command(binary, args...)
cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUSER
        UidMappings: []syscall.SysProcIDMap{
            {
                ContainerID: 0,
                HostID:      Uid,
                Size:        1,
            },
        },
        GidMappings: []syscall.SysProcIDMap{
            {
                ContainerID: 0,
                HostID:      Gid,
                Size:        1,
            },
        },
    }

Here you can see syscall.CLONE_NEWUSER flag in SysProcAttr.Cloneflags, which means just “please, create new user namespace for this process”, another namespace can be specified there too. Mappings fields talk for themselves. Size means a size of a range of mapped IDs, so you can remap many IDs without specifying each.

PID namespaces

From man pid_namespaces:

PID namespaces isolate the process ID number space

That is it, your process in this namespace has PID 1, which is sorta cool: You are like systemd, but better. In our first part ps awux won’t show only our process, because we need mount namespace and remount /proc, but still you can see PID 1 with echo $$.

First unprivileged container

I am pretty bad at writing big texts, so I decided to split container creation to several parts. Today we will see only user and PID namespace creation, which still pretty impressive. So, for adding PID namespace we need to modify Cloneflags:

    Cloneflags: syscall.CLONE_NEWUSER | syscall.CLONE_NEWPID

For this articles, I created a project on Github: https://github.com/LK4D4/unc. unc means “unprivileged container” and has nothing in common with runc(maybe only a little). I will tag code for each article in a repo. Tag for this article is user_pid. Just compile it with go1.5 or above and try to run different commands from an unprivileged user in namespaces:

$ unc sh
$ unc whoami
$ unc sh -c "echo \$\$"

It is doing nothing fancy, but just connects your standard streams to executed process and execute it in new namespaces with a remapping current user and group to root user and the group inside user namespace. Please read all code, there is not much for now.

Next steps

Most interesting part of containers is mount namespace. It allows you to have mounts separate from host(/proc for example). Another interesting namespace is a network, it is little tough for an unprivileged user, because you need to create network interfaces on host first, so for this you need some superpowers from the root. In next article, I hope to cover mount namespace - so it a real container with own root filesystem.

Thanks for reading! I am learning all this stuff myself right now by writing this articles, so if you have something to say, please feel free to comment!