CREATING YOUR OWN LINUX CONTAINERS

2016-07-10

I'm not trying to say that this solution is better than LXC or docker. I'm just doing this because it is very simple to get a basic container created with chroot and cgroups. Of course, docker provides much more features than this, but this really is the basis. It's easy to make containers in linux, depending on the amount of features you need.

cgroups

cgroups are a way to run a process while limiting its resources such as cpu time and memory. The it works is by creating a group (a cgroup), define the limits for various resources (you can control more than just memory and CPU time) and then run a process under that cgroup. It is important to know that every child process of that process will also run under the same same cgroup. So if you start a bash shell under a cgroup, then any program that you manually launch from the shell will be constrained to the same cgroup than the one of the shell.

If you create a cgroup with a memory limit of 10mb, this 10mb limit will be applicable to all processes running in the cgroup. The 10mb is a constraint on the sum of memory usage by all process under the same cgroup.

Configuring

On slackware 14.2 RC2, I didn't have to install or setup anything and cgroups were available to use with the tools already installed. I had to manually enable cgroups in the kernel though since I compiled my own kernel. Without going into the details (because this is covered in a million other websites) you need to make sure that:

cgroup kernel modules are built-in are loaded as modules
cgroup tools are installed
cgroup filesystem is mounted (normally accessible through /sys/fs/cgroup/)

Here's how to run a process in a cgroup


cgcreate -g memory:testgroup
# now the "/sys/fs/cgroup/memory/testgroup" exists and contains files that control the limits of the group

# assign a limit of 4mb to that cgroup
echo "4194304" > /sys/fs/cgroup/memory/testgroup/memory.limit_in_bytes

# run bash in that cgroup
cgexec -g memory:testgroup /bin/bash

Note that instead of using cgexec, you could also just write the current shell's PID into /sys/fs/cgroup///task. Then your shell, and whatever process you start from it, would execute in the cgroup.

Making your own containers

A container is nothing more than a chroot'd environment with processes confined in a cgroup. It's not difficult to write your own software to automate the environment setup. There is a "chroot" system call that already exist. For cgroups, I was wondering if there was any system calls available to create them. Using strace while running cgcreate, I found out that cgcreate only manipulates files in the cgroup file system. Then I got the rest of the information I needed from the documentation file located in the Documentation folder of the linux kernel source: Documentation/cgroups/cgroups.txt.

Creating a cgroup

To create a new cgroup, it is simply a matter of creating a new directory under the submodule folder that the cgroups needs to control. For example, to create a cgroup that controls memory and cpu usage, you just need to create a directory "AwesomeControlGroup" under /sys/fs/cgroup/memory and /sys/fs/cgroup/cpu. These directories will automatically be populated with the files needed to control the cgroup (cgroup fs is a vitual filesystem, so files do not exist on a physical medium).

Configuring a cgroup

To configure a cgroup, it is just a matter of writing parameters in the relevant file. For example: /sys/fs/cgroup/memory/testgroup/memory.limit_in_bytes

Running a process in a cgroup

To run a process in a cgroup, you need to launch the process, take its PID and write it under /sys/fs/cgroup///task My "container creator" application (let's call it the launcher) does it like this:

The launcher creates a cgroup and sets relevant parameters.
The launcher clones (instead of fork. now we have a parent and a child)
The parent waits for the child to die, after which it will destroy the cgroup.
The child writes its PID in the /sys/fs/cgroup///task file for all submodules (memory, cpu, etc)
At this point, the child runs as per the cgroup's constraints.
The child invokes execv with the application that the user wanted to have invoked in the container.

The reason I use clone() instead of fork, is that clone() can use the CLONE_NEWPID flag. This will create a new process namespace that will isolate the new process for the others that exist on the system. Indeed, when the cloned process queries its PID it will find that it is assigned PID 1. Doing a "ps" would not list other processes that run on the system since this new process is isolated.

Destroying a cgroup

To destroy a cgroup, just delete the /sys/fs/cgroup// directory

So interfacing with cgroups from userland is just a matter of manipulating files in the cgroup file system. It is really easy to do programmatically and also from the shell without any special tools or library.

My container application

My "container launcher" is a c++ application that chroots in a directory and run a process under a cgroup that it creates. To use it, I only need to type "./a.out container_path". The container_path is the path to a container directory that contains a "settings" files and a "chroot" directory. The "chroot" directory contains the environment of the container (a linux distribution maybe?) and the "settings" file contains settings about the cgroup configuration and the name of the process to be launched.

You can download my code: cgroups.tar

Example

I've extracted the slackware initrd image found in the "isolinux" folder of the dvd.


cd /tmp/slackware/chroot
gunzip < /usr/src/slackware64-current/isolinux/initrd.img  | cpio -i --make-directories

Extracting this in /tmp/slackware/chroot gives me a small linux environment that directory and I've created a settings file in /tmp/slackware. Call this folder a "container", it contains a whole linux environment under the chroot folder and a settings file to indicate under what user the container should run, what process it should start, how much ram max it can get etc. For this example, my settings file is like this:


user: 99
group: 98
memlimit: 4194304
cpupercent: 5
process:/bin/bash
arg1:-i

And running the container gives me:


[12:49:34 root@pat]# ./a.out /tmp/slackware
Mem limit: 4194304 bytes
CPU shares: 51 (5%)
Added PID 9337 in cgroup
Dropping privileges to 99:98
Starting /bin/bash -i

[17:26:10 nobody@pat:/]$ ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nobody       1  0.0  0.0  12004  3180 ?        S    17:26   0:00 /bin/bash -i
nobody       2  0.0  0.0  11340  1840 ?        R+   17:26   0:00 ps aux
[17:26:14 nobody@pat:/]$ ls
a  a.out  bin  boot  cdrom  dev  etc  floppy  init  lib  lib64  lost+found  mnt  nfs  proc  root  run  sbin  scripts  sys  tag  tmp  usr  var
[17:26:15 nobody@pat:/]$ exit
exit
Exiting container

Networking

When cloning with CLONE_NEWNET, the child process gets a separate netwrok namespace. It doesn't see the host's network interfaces anymore. So in order to get networking enabled in the container, we need to create a veth pair. I am doing all network interface manipulations with libnl (which was already installed on a stock slackware installation). The veth pair will act as a kind of tunnel between the host and the container. The host side interface will be added in a bridge so that it can be part of another lan. The container side interface will be assigned an IP and then the container will be able to communicate wil all peers that are on the same bridge. The bridge could be used to connect the container on the LAN or within a local network that consists of only other containers from a select group.

The launcher creates a "eth0" that appears in the container. The other end of the veth pair is added in a OVS bridge. An ip address is set on eth0 and the default route is set. I then have full networking functionality in my container.

Settings for networking


bridge: br0    
ip: 192.168.1.228/24
gw: 192.168.1.1

Result


Mem limit: 4194304 bytes
CPU shares: 51 (5%)
Added PID 14230 in cgroup
Dropping privileges to 0:0
Starting /bin/bash -i

[21:59:33 root@container:/]# ping google.com
PING google.com (172.217.2.142): 56 data bytes
64 bytes from 172.217.2.142: seq=0 ttl=52 time=22.238 ms
^C
--- google.com ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 22.238/22.238/22.238 ms
[21:59:36 root@container:/]# exit
exit
Exiting container