In the previous chapter, we learned how to build a Docker image and the very basic steps required for running the resulting image within a container. In this chapter, we’ll first take a look at the history of container technology and then dive deeper into running containers and exploring the Docker commands that control the overall configuration, resources, and privileges that your container receives.
You might be familiar with virtualization systems like VMware or KVM that allow you to run a complete Linux kernel and operating system on top of a virtualized layer, commonly known as a hypervisor. This approach provides very strong isolation between workloads because each virtual machine hosts its own operating system kernel that sits in a separate memory space on top of a hardware virtualization layer.
Containers are fundamentally different, since they all share a single kernel, and isolation between workloads is implemented entirely within that one kernel. This is called operating system virtualization. The libcontainer README provides a good, short definition of a container: “A container is a self-contained execution environment that shares the kernel of the host system and which is (optionally) isolated from other containers in the system.” One of the major advantages of containers is resource efficiency, because you don’t need a whole operating system instance for each isolated workload. Since you are sharing a kernel, there is one less layer of indirection between the isolated task and the real hardware underneath. When a process is running inside a container, there is only a little bit of code that sits inside the kernel managing the container. Contrast this with a virtual machine where there would be a second layer running. In a VM, calls by the process to the hardware or hypervisor would require bouncing in and out of privileged mode on the processor twice, thereby noticeably slowing down many calls.
But the container approach does mean that you can only run processes that are compatible with the underlying kernel. For example, unlike hardware virtualization provided by technologies like VMware or KVM, Windows applications cannot run inside a Linux container on a Linux host. Windows applications can, however, run inside Windows containers on a Windows host. So containers are best thought of as an OS-specific technology where, at least for now, you can run any of your favorite applications or daemons that are compatible with the container server. When thinking of containers, you should try very hard to throw out what you might already know about virtual machines and instead conceptualize a container as a wrapper around a process that actually runs on the server.
In addition to being able to run containers inside virtual machines, it is completely feasible to run a virtual machine inside a container. If you do this, then it is indeed possible to run a Windows application inside a Windows VM that is running inside a Linux container.
It is often the case that a revolutionary technology is an older technology that has finally arrived in the spotlight. Technology goes in waves, and some of the ideas from the 1960s are back in vogue. Similarly, Docker is a newer technology and it has an ease of use that has made it an instant hit, but it doesn’t exist in a vacuum. Much of what underpins Docker comes from work done over the last 30 years in a few different arenas. We can easily trace the conceptual evolution of containers from a simple system call that was added to the Unix kernel in the late 1970s all the way to the modern container tooling that powers many huge internet firms, like Google, Twitter, and Facebook. It’s worth taking some time for a quick tour through how the technology evolved and led to the creation of Docker, because understanding that helps you place it within the context of other things you might be familiar with.
Containers are not a new idea. They are a way to isolate and encapsulate a part of the running system. The oldest technology in this area includes the very first batch processing systems. When using these early computers, the system would literally run one program at a time, switching to run another program once the previous program had finished or a set time period had been reached. With this design there was enforced isolation: you could make sure your program didn’t step on anyone else’s program, because it was only possible to run one thing at a time. Although modern computers still switch tasks constantly, it is incredibly fast and completely unnoticeable to most users.
We would argue that the seeds for today’s containers were planted in 1979 with the addition of the chroot system call to Version 7 Unix. chroot restricts a process’s view of the underlying filesystem to a single subtree. The chroot system call is commonly used to protect the operating system from untrusted server processes like FTP, BIND, and Sendmail, which are publicly exposed and susceptible to compromise.
In the 1980s and 1990s, various Unix variants were created with mandatory access controls for security reasons. This meant you had tightly controlled domains running on the same Unix kernel. Processes in each domain had an extremely limited view of the system that precluded them from interacting across domains. A popular commercial version of Unix that implemented this idea was the Sidewinder firewall built on top of BSDI Unix, but this was not possible with most mainstream Unix implementations.
That changed in 2000 when FreeBSD 4.0 was released with a new command, called jail, which was designed to allow shared-environment hosting providers to easily and securely create a separation between their processes and those of their individual customers. FreeBSD jail expanded chroot’s capabilities and also restricted everything a process could do with the underlying system and other jailed processes.
In 2004, Sun released an early build of Solaris 10, which included Solaris Containers, a technology that later evolved into Solaris Zones. This was the first major commercial implementation of container technology, and it is still used today to support many commercial container implementations. In 2005, OpenVZ for Linux was released by the company Virtuozzo, followed in 2007 by HP’s Secure Resource Partitions for HP-UX, which was later renamed HP-UX Containers. Finally, in 2008, Linux Containers (LXC) were released in version 2.6.24 of the Linux kernel. Linux Containers did not really see phenomenal growth across the community until 2013, with the inclusion of user namespaces in version 3.8 of the Linux kernel and the release of Docker one month later.
Companies, like Google, that had to deal with scaling applications for broad internet consumption started pushing container technology in the early 2000s in order to facilitate distributing their applications across global data centers full of computers. A few companies maintained their own patched Linux kernels with container support for internal use, but as the need for these features became more evident within the Linux community, Google contributed some of its own work supporting containers into the mainline Linux kernel.
So far we’ve started containers using the handy docker run command. But docker run is really a convenience command that wraps two separate steps into one. The first thing it does is create a container from the underlying image. We can accomplish this separately using the docker create command. The second thing docker run does is execute the container, which we can also do separately with the docker start command.
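To make this concrete, here is a minimal sketch of the two-step equivalent of a simple docker run invocation (the container name here is just an illustrative example):

$ docker create --name create-start-demo ubuntu:latest sleep 120
$ docker start create-start-demo

The docker create command prints the ID of the new container, and docker start then launches the process the container was configured to run.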
The docker create and docker start commands both contain all the options that pertain to how a container is initially set up. In Chapter 4, we demonstrated that with the docker run command you could map network ports in the underlying container to the host using the -p argument, and that -e could be used to pass environment variables into the container.
This only just begins to touch on the array of things that you can configure when you first create a container. So let’s take a look at some of the options that docker supports.
Let’s start by exploring some of the ways we can tell Docker to configure our container when we create it.
When you create a container, it is built from the underlying image, but various command-line arguments can affect the final settings. Settings specified in the Dockerfile are always used as defaults, but you can override many of them at creation time.
By default, Docker randomly names your container by combining an adjective with the name of a famous person. This results in names like ecstatic-babbage and serene-albattani. If you want to give your container a specific name, you can use the --name argument.
$ docker create --name="awesome-service" ubuntu:latest sleep 120
After creating this container, you could then start it by running docker start awesome-service. It will automatically exit after 120 seconds, but you could stop it before then by running docker stop awesome-service. We will dive a bit more into each of these commands a little later in the chapter.
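As a quick sketch of what that exchange looks like (Docker echoes back the name of the container it acted on):

$ docker start awesome-service
awesome-service
$ docker stop awesome-service
awesome-service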
As mentioned in Chapter 4, labels are key/value pairs that can be applied to Docker images and containers as metadata. When new Docker containers are created, they automatically inherit all the labels from their parent image.
It is also possible to add new labels to the containers so that you can apply metadata that might be specific to that single container.
$ docker run -d --name has-some-labels -l deployer=Ahmed -l tester=Asako \
  ubuntu:latest sleep 1000
You can then search for and filter containers based on this metadata, using commands like docker ps.
$ docker ps -a -f label=deployer=Ahmed
CONTAINER ID  IMAGE          COMMAND       ...  NAMES
845731631ba4  ubuntu:latest  "sleep 1000"  ...  has-some-labels
You can use the docker inspect command on the container to see all the labels that a container has.
$ docker inspect 845731631ba4
...
    "Labels": {
        "deployer": "Ahmed",
        "tester": "Asako"
    },
...
Note that this container runs the command sleep 1000, so after 1,000 seconds it will stop running.
By default, when you start a container, Docker copies certain system files on the host, including /etc/hostname, into the container’s configuration directory on the host, and then uses a bind mount to link that copy of the file into the container. We can launch a default container with no special configuration like this:
$ docker run --rm -ti ubuntu:latest /bin/bash
This uses the docker run command, which runs docker create and docker start in the background. Since we want to be able to interact with the container that we are going to create for demonstration purposes, we pass in a few useful arguments. The --rm argument tells Docker to delete the container when it exits, the -t argument tells Docker to allocate a pseudo-TTY, and the -i argument tells Docker that this is going to be an interactive session where we want to keep STDIN open. The final argument in the command is the executable that we want to run within the container, which in this case is the ever-useful /bin/bash.
If we now run the mount command from within the resulting container, we’ll see something similar to this:
root@ebc8cf2d8523:/# mount
overlay on / type overlay (rw,relatime,lowerdir=...,upperdir=...,workdir=...)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,mode=755)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,...,ptmxmode=666)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
/dev/sda9 on /etc/resolv.conf type ext4 (rw,relatime,data=ordered)
/dev/sda9 on /etc/hostname type ext4 (rw,relatime,data=ordered)
/dev/sda9 on /etc/hosts type ext4 (rw,relatime,data=ordered)
devpts on /dev/console type devpts (rw,nosuid,noexec,relatime,...,ptmxmode=000)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,mode=755)
root@ebc8cf2d8523:/#
When you see any examples with a prompt that looks something like root@hashID, it means that you are running a command within the container instead of on the Docker host. Note that there are occasions when a container will have been configured with a different hostname instead (e.g., using --hostname on the CLI), but in the default case, it’s a hash.
There are quite a few bind mounts in a container, but in this case we are interested in this one:
/dev/sda9 on /etc/hostname type ext4 (rw,relatime,data=ordered)
While the device number will be different for each container, the part we care about is that the mount point is /etc/hostname. This links the container’s /etc/hostname to the hostname file that Docker has prepared for the container, which by default contains the container’s ID and is not fully qualified with a domain name.
We can check this in the container by running the following:
root@ebc8cf2d8523:/# hostname -f
ebc8cf2d8523
root@ebc8cf2d8523:/# exit
Don’t forget to exit the container shell to return to the Docker host when finished.
To set the hostname specifically, we can use the --hostname argument to pass in a more specific value.
$ docker run --rm -ti --hostname="mycontainer.example.com" \
ubuntu:latest /bin/bash
Then, from within the container, we’ll see that the fully qualified hostname is defined as requested.
root@mycontainer:/# hostname -f
mycontainer.example.com
root@mycontainer:/# exit
Just like /etc/hostname, the resolv.conf file that configures Domain Name Service (DNS) resolution is managed via a bind mount between the host and container.
/dev/sda9 on /etc/resolv.conf type ext4 (rw,relatime,data=ordered)
By default, this is an exact copy of the Docker host’s resolv.conf file. If you didn’t want this, you could use a combination of the --dns and --dns-search arguments to override this behavior in the container:
$ docker run --rm -ti --dns=8.8.8.8 --dns=8.8.4.4 --dns-search=example1.com \
--dns-search=example2.com ubuntu:latest /bin/bash
If you want to leave the search domain completely unset, then use --dns-search=.
Within the container, you would still see a bind mount, but the file contents would no longer reflect the host’s resolv.conf; instead, it would now look like this:
root@0f887071000a:/# more /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4
search example1.com example2.com
root@0f887071000a:/# exit
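If you had instead left the search domain unset, a minimal sketch would look like this, and the resulting resolv.conf would contain only the nameserver lines:

$ docker run --rm -ti --dns=8.8.8.8 --dns=8.8.4.4 --dns-search=. \
  ubuntu:latest /bin/bash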
Another important piece of information that you can configure is the media access control (MAC) address for the container.
Without any configuration, a container will receive a calculated MAC address that starts with the 02:42:ac:11 prefix.
If you need to specifically set this to a value, you can do so by running something similar to this:
$ docker run --rm -ti --mac-address="a2:11:aa:22:bb:33" ubuntu:latest /bin/bash
Normally you will not need to do that. But sometimes you want to reserve a particular set of MAC addresses for your containers in order to avoid other virtualization layers that use the same private block as Docker.
Be very careful when customizing the MAC address settings. It is possible to cause ARP contention on your network if two systems advertise the same MAC address. If you have a strong need to do this, try to keep your locally administered address ranges within some of the official ranges, like x2-xx-xx-xx-xx-xx, x6-xx-xx-xx-xx-xx, xA-xx-xx-xx-xx-xx, and xE-xx-xx-xx-xx-xx (with x being any valid hexadecimal character).
There are times when the default disk space allocated to a container, or the container’s ephemeral nature, is not appropriate for the job at hand, so you’ll need storage that can persist between container deployments.
Mounting storage from the Docker host is not generally advisable because it ties your container to a particular Docker host for its persistent state. But for cases like temporary cache files or other semi-ephemeral states, it can make sense.
For times like this, you can leverage the -v argument to mount directories and individual files from the host server into the container. The following example mounts /mnt/session_data to /data within the container:
$ docker run --rm -ti -v /mnt/session_data:/data ubuntu:latest /bin/bash
root@0f887071000a:/# mount | grep data
/dev/sda9 on /data type ext4 (rw,relatime,data=ordered)
root@0f887071000a:/# exit
By default, volumes are mounted read-write, but you can easily modify this command to make it mount the directory read-only:
$ docker run --rm -ti -v /mnt/session_data:/data:ro \
  ubuntu:latest /bin/bash
Neither the host mount point nor the mount point in the container needs to preexist for this command to work properly. If the host mount point does not exist already, then it will be created as a directory. This could cause you some issues if you were trying to point to a file instead of a directory.
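If you do want to mount an individual file, a sketch like the following (the paths here are purely illustrative) avoids that problem by ensuring the host file exists before the container launches:

$ touch /mnt/session_data/app.conf
$ docker run --rm -ti -v /mnt/session_data/app.conf:/data/app.conf \
  ubuntu:latest /bin/bash

Had /mnt/session_data/app.conf not already existed, Docker would have created a directory with that name instead, which is rarely what you want.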
In the mount options shown in the earlier example output, you can see that the filesystem was mounted read-write on /data as expected.
If the container application is designed to write into /data, then this data will be visible on the host filesystem in /mnt/session_data and will remain available when this container stops and a new container starts with the same volume mounted.
It is possible to tell Docker that the root volume of your container should be mounted read-only so that processes within the container cannot write anything to the root filesystem. This prevents things like logfiles, which a developer may be unaware of, from filling up the container’s allocated disk in production. When it’s used in conjunction with a mounted volume, you can ensure that data is written only into expected locations.
In the previous example, we could accomplish this simply by adding --read-only=true to the command.
$ docker run --rm -ti --read-only=true -v /mnt/session_data:/data \
ubuntu:latest /bin/bash
root@df542767bc17:/# mount | grep " / "
overlay on / type overlay (ro,relatime,lowerdir=...,upperdir=...,workdir=...)
root@df542767bc17:/# mount | grep data
/dev/sda9 on /data type ext4 (rw,relatime,data=ordered)
root@df542767bc17:/# exit
If you look closely at the mount options for the root directory, you’ll notice that it is mounted with the ro option, which makes it read-only. However, the /data mount is still mounted with the rw option so that our application can successfully write to the one volume to which it’s designed to write.
Sometimes it is necessary to make a directory like /tmp writeable, even when the rest of the container is read-only. For this use case, you can use the --tmpfs argument with docker run, so that you can mount a tmpfs filesystem into the container. Any data in these tmpfs directories will be lost when the container is stopped. The following example shows a container being launched with a tmpfs filesystem mounted at /tmp with the rw, noexec, nodev, nosuid, and size=256M mount options set:
$ docker run --rm -ti --read-only=true --tmpfs \
/tmp:rw,noexec,nodev,nosuid,size=256M ubuntu:latest /bin/bash
root@25b4f3632bbc:/# df -h /tmp
Filesystem Size Used Avail Use% Mounted on
tmpfs 256M 0 256M 0% /tmp
root@25b4f3632bbc:/# grep /tmp /etc/mtab
tmpfs /tmp tmpfs rw,seclabel,nosuid,nodev,noexec,relatime,size=262144k 0 0
root@25b4f3632bbc:/# exit
Containers should be designed to be stateless whenever possible. Managing storage creates undesirable dependencies and can easily make deployment scenarios much more complicated.
When people discuss the types of problems they must often cope with when working in the cloud, the “noisy neighbor” is often near the top of the list. The basic problem this term refers to is that other applications running on the same physical system as yours can have a noticeable impact on your performance and resource availability.
Virtual machines have the advantage that you can easily and very tightly control how much memory and CPU, among other resources, are allocated to the virtual machine. When using Docker, you must instead leverage the cgroup functionality in the Linux kernel to control the resources that are available to a Docker container. The docker create and docker run commands directly support configuring CPU, memory, swap, and storage I/O restrictions when you create a container.
Constraints are normally applied at the time of container creation. If you need to change them, you can use the docker container update command or deploy a new container with the adjustments.
There is an important caveat here. While Docker supports various resource limits, you must have these capabilities enabled in your kernel in order for Docker to take advantage of them. You might need to add these as command-line parameters to your kernel on startup. To figure out if your kernel supports these limits, run docker info. If you are missing any support, you will get warning messages at the bottom, like:
WARNING: No swap limit support
The details regarding getting cgroup support configured for your kernel are distribution-specific, so you should consult the Docker documentation if you need help configuring things.
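On Debian-based systems, for example, enabling memory and swap limit support typically means adding flags to the kernel command line and regenerating the GRUB configuration; a hedged sketch of the usual process (consult your distribution’s documentation before changing boot parameters):

# In /etc/default/grub, add the cgroup flags to the kernel command line:
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"

# Then rebuild the GRUB configuration and reboot the host:
$ sudo update-grub
$ sudo reboot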
Docker has several ways to limit CPU usage by applications in containers. The original method, and one that is still commonly used, is the concept of CPU shares. We’ll present the other options below as well.
The computing power of all the CPU cores in a system is considered to be the full pool of shares. Docker assigns the number 1024 to represent the full pool. By configuring a container’s CPU shares, you can dictate how much CPU time the container gets relative to the other containers on the system. If you want the container to be able to use at most half of the computing power of the system, then you would allocate it 512 shares. Note that these are not exclusive shares, meaning that assigning all 1024 shares to a container does not prevent all other containers from running. Rather, it’s a hint to the scheduler about how long each container should be able to run each time it’s scheduled. If we have one container that is allocated 1024 shares (the default) and two that are allocated 512, they will all get scheduled the same number of times. But if the normal amount of CPU time for each process is 100 microseconds, the containers with 512 shares will run for 50 microseconds each time, whereas the container with 1024 shares will run for 100 microseconds.
Let’s explore a little bit how this works in practice. For the following examples, we’ll use a new Docker image that contains the stress command for pushing a system to its limits.
When we run stress without any cgroup constraints, it will use as many resources as we tell it to. The following command creates a load average of around 5 by creating two CPU-bound processes, one I/O-bound process, and two memory allocation processes. Note that in the following code, we are running on a system with two CPUs.
$ docker run --rm -ti progrium/stress \
  --cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
This should be a reasonable command to run on any modern computer system, but be aware that it is going to stress the host system. So, don’t do this in a location that can’t take the additional load, or even a possible failure, due to resource starvation.
If you run the top or htop command on the Docker host, near the end of the two-minute run, you can see how the system is affected by the load created by the stress program.
$ top -bn1 | head -n 15
top - 20:56:36 up 3 min, 2 users, load average: 5.03, 2.02, 0.75
Tasks: 88 total, 5 running, 83 sleeping, 0 stopped, 0 zombie
%Cpu(s): 29.8 us, 35.2 sy, 0.0 ni, 32.0 id, 0.8 wa, 1.6 hi, 0.6 si, 0.0 st
KiB Mem: 1021856 total, 270148 used, 751708 free, 42716 buffers
KiB Swap: 0 total, 0 used, 0 free. 83764 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
810 root 20 0 7316 96 0 R 44.3 0.0 0:49.63 stress
813 root 20 0 7316 96 0 R 44.3 0.0 0:49.18 stress
812 root 20 0 138392 46936 996 R 31.7 4.6 0:46.42 stress
814 root 20 0 138392 22360 996 R 31.7 2.2 0:46.89 stress
811 root 20 0 7316 96 0 D 25.3 0.0 0:21.34 stress
1 root 20 0 110024 4916 3632 S 0.0 0.5 0:07.32 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.11 ksoftirqd/0
Docker Community Edition users on non-Linux systems may discover that Docker has made the VM filesystem read-only and it does not contain many useful tools for monitoring the VM. For these demos where you want to be able to monitor the resource usage of various processes, you can work around this by doing something like this:
$ docker run -it --privileged --pid=host alpine sh
/ # apk update
/ # apk add htop
/ # htop -p $(pgrep stress | tr '\n' ',')
/ # exit
Be aware that the preceding htop command will give you an error unless stress is actively running when you launch htop, since no processes will be returned by the pgrep command.
If you want to run the exact same stress command again, with only half the amount of available CPU time, you can do so like this:
$ docker run --rm -ti --cpu-shares 512 progrium/stress \
--cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
The --cpu-shares 512 is the flag that does the magic, allocating 512 CPU shares to this container. Note that the effect might not be noticeable on a system that is not very busy. That’s because the container will continue to be scheduled for the same time-slice length whenever it has work to do, unless the system is constrained for resources. So in our case, the results of a top command on the host system will likely look exactly the same, unless you run a few more containers to give the CPU something else to do.
Unlike virtual machines, Docker’s cgroup-based constraints on CPU shares can have unexpected consequences. They are not hard limits; they are a relative limit, similar to the nice command. An example is a container that is constrained to half the CPU shares but is on a system that is not very busy. Since the CPU is not busy, the limit on the CPU shares would have only a limited effect because there is no competition in the scheduler pool. When a second container that uses a lot of CPU is deployed to the same system, suddenly the effect of the constraint on the first container will be noticeable. Consider this carefully when constraining containers and allocating resources.
It is also possible to pin a container to one or more CPU cores. This means that work for this container will be scheduled only on the cores that have been assigned to this container. That is useful if you want to hard-shard CPUs between applications or if you have applications that need to be pinned to a particular CPU for things like cache efficiency.
In the following example, we are running a stress container pinned to the first of two CPUs, with 512 CPU shares. Note that everything following the container image here are parameters to the stress command, not the docker command.
$ docker run --rm -ti --cpu-shares 512 --cpuset=0 progrium/stress \
--cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
The --cpuset argument is zero-indexed, so your first CPU core is 0. If you tell Docker to use a CPU core that does not exist on the host system, you will get a Cannot start container error. On a two-CPU example host, you could test this by using --cpuset=0,1,2.
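If you want to see that failure for yourself, a sketch like this should trigger it on a two-CPU host (the exact error text may vary between Docker versions):

$ docker run --rm -ti --cpu-shares 512 --cpuset=0,1,2 progrium/stress \
  --cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s

Returning to the working pinned example from before: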
If you run top again, you should notice that the percentage of CPU time spent in user space (us) is lower than it previously was, since we have restricted two CPU-bound processes to a single CPU.
%Cpu(s): 18.5 us, 22.0 sy, 0.0 ni, 57.6 id, 0.5 wa, 1.0 hi, 0.3 si, 0.0 st
When you use CPU pinning, additional CPU sharing restrictions on the container only take into account other containers running on the same set of cores.
Using the CPU CFS (Completely Fair Scheduler) within the Linux kernel, you can alter the CPU quota for a given container by setting the --cpu-quota flag to a valid value when launching the container with docker run.
While CPU shares were the original mechanism in Docker for managing CPU limits, Docker has evolved a great deal since and one of the ways that it now makes users’ lives easier is by greatly simplifying how CPU quotas can be set. Instead of trying to set CPU shares and quotas correctly, you can now simply tell Docker how much CPU you would like to be available to your container, and it will do the math required to set the underlying cgroups correctly.
The --cpus argument can be set to a floating-point number between 0.01 and the number of CPU cores on the Docker server.
$ docker run -d --cpus=".25" progrium/stress \
--cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 60s
If you try to set the value too high, you’ll get an error message from Docker (not the stress application) that will give you the correct range of CPU cores that you have to work with.
$ docker run -d --cpus="40.25" progrium/stress \
--cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 60s
docker: Error response from daemon: Range of CPUs is from
0.01 to 4.00, as there are only 4 CPUs available.
See 'docker run --help'.
The docker update command can be used to dynamically adjust the resource limits of one or more containers. You could adjust the CPU allocation on two containers simultaneously, for example, like so:
$ docker update --cpus="1.5" 6b785f78b75e 92b797f12af1
We can control how much memory a container can access in a manner similar to constraining the CPU. There is, however, one fundamental difference: while constraining the CPU only impacts the application’s priority for CPU time, the memory limit is a hard limit. Even on an unconstrained system with 96 GB of free memory, if we tell a container that it may have access only to 24 GB, then it will only ever get to use 24 GB regardless of the free memory on the system. Because of the way the virtual memory system works on Linux, it’s possible to allocate more memory to a container than the system has actual RAM. In this case, the container will resort to using swap, just like a normal Linux process.
Let’s start a container with a memory constraint by passing the --memory option to the docker run command:
$ docker run --rm -ti --memory 512m progrium/stress \
--cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
When you use the --memory option alone, you are setting both the amount of RAM and the amount of swap that the container will have access to. So by using --memory 512m here, we’ve constrained the container to 512 MB of RAM and 512 MB of additional swap space. Docker supports b, k, m, or g, representing bytes, kilobytes, megabytes, or gigabytes, respectively. If your system somehow runs Linux and Docker and has multiple terabytes of memory, then unfortunately you’re going to have to specify it in gigabytes.
If you would like to set the swap separately or disable it altogether, you need to also use the --memory-swap option. This defines the total amount of memory and swap available to the container. If we rerun our previous command, like so:
$ docker run --rm -ti --memory 512m --memory-swap=768m progrium/stress \
--cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
Then we’re telling the kernel that this container can have access to 512 MB of memory and 256 MB of additional swap space. Setting the --memory-swap option to -1 will disable the swap completely within the container. We can also specifically limit the amount of kernel memory available to a container by using the --kernel-memory argument to docker run or docker create.
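For example, here is a sketch of the earlier command with swap disabled entirely:

$ docker run --rm -ti --memory 512m --memory-swap=-1 progrium/stress \
  --cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s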
Again, unlike CPU shares, memory is a hard limit! This is good, because the constraint doesn’t suddenly have a noticeable effect on the container when another container is deployed to the system. But it does mean that you need to be careful that the limit closely matches your container’s needs because there is no wiggle room. An out-of-memory container causes the kernel to behave just like it would if the system were out of memory. It will try to find a process to kill in order to free up space. This is a common failure case where containers have their memory limits set too low. The telltale sign of this issue is a container exit code of 137 and kernel out-of-memory (OOM) messages in the dmesg output.
So, what happens if a container reaches its memory limit? Well, let’s give it a try by modifying one of our previous commands and lowering the memory significantly:
$ docker run --rm -ti --memory 100m progrium/stress --cpu 2 --io 1 --vm 2 \
--vm-bytes 128M --timeout 120s
All of our other runs of the stress container ended with this line:
stress: info: [1] successful run completed in 120s
This run, however, quickly fails with a line like this:
stress: FAIL: [1] (452) failed run completed in 1s
This is because the container tries to allocate more memory than it is allowed, and the Linux out-of-memory (OOM) killer is invoked and starts killing processes within the cgroup to reclaim memory. Since our container has only one running process, this kills the container.
Docker has features that allow you to tune and disable the Linux OOM killer by using the --oom-kill-disable and the --oom-score-adj arguments to docker run.
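As a hedged sketch, you could make a container’s processes a more attractive target for the OOM killer by raising their score adjustment (valid values range from -1000 to 1000):

$ docker run --rm -ti --memory 100m --oom-score-adj=500 progrium/stress \
  --cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s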
If you access your Docker server, you can see the kernel message related to this event by running dmesg. The output will look something like this:
[ 4210.403984] stress invoked oom-killer: gfp_mask=0x24000c0 ...
[ 4210.404899] stress cpuset=5bfa65084931efabda59d9a70fa8e88 ...
[ 4210.405951] CPU: 3 PID: 3429 Comm: stress Not tainted 4.9 ...
[ 4210.406624] Hardware name: BHYVE, BIOS 1.00 03/14/2014 ...
[ 4210.408978] Call Trace:
[ 4210.409182]  [<ffffffff94438115>] ? dump_stack+0x5a/0x6f
....
[ 4210.414139]  [<ffffffff947f9cf8>] ? page_fault+0x28/0x30
[ 4210.414619] Task in /docker-ce/docker/5...3 killed as a result of limit of /docker-ce/docker/5...3
[ 4210.416640] memory: usage 102380kB, limit 102400kB, failc ...
[ 4210.417236] memory+swap: usage 204800kB, limit 204800kB, ...
[ 4210.417855] kmem: usage 1180kB, limit 9007199254740988kB, ...
[ 4210.418485] Memory cgroup stats for /docker-ce/docker/5...3: cache:0KB rss:101200KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:11472KB swap:102420KB inactive_anon:50728KB active_anon:50472KB inactive_file:0KB active_file:0KB unevictable:0KB
...
[ 4210.426783] Memory cgroup out of memory: Kill process 3429...
[ 4210.427544] Killed process 3429 (stress) total-vm:138388kB, anon-rss:44028kB, file-rss:900kB, shmem-rss:0kB
[ 4210.442492] oom_reaper: reaped process 3429 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
This out-of-memory event will also be recorded by Docker and viewable via docker events.
$ docker events
2018-01-28T15:56:19.972142371-08:00 container oom \
d0d803ce32c4e86d0aa6453512a9084a156e96860e916ffc2856fc63ad9cf88b \
(image=progrium/stress, name=loving_franklin)
Many containers are just stateless applications and won’t have a need for I/O restrictions. But Docker also supports limiting block I/O in a few different ways via the cgroups mechanism.
The first way is applying some prioritization to a container’s use of block device I/O. You enable this by manipulating the default setting of the blkio.weight cgroup attribute. This attribute can have a value of 0 (disabled) or a number between 10 and 1000, the default being 500. This limit acts a bit like CPU shares: the system treats the weights as portions of a total of 1000, dividing the available I/O among the processes in a cgroup slice in proportion to each one’s assigned weight.
To set this weight on a container, you need to pass the --blkio-weight argument to your docker run command with a valid value. You can also target a specific device using the --blkio-weight-device option.
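For example, here is a hedged sketch that gives a container a below-default overall weight while boosting one specific device (device paths will vary from host to host):

$ docker run --rm -ti --blkio-weight 300 \
  --blkio-weight-device "/dev/sda:600" ubuntu:latest /bin/bash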
As with CPU shares, tuning the weights is hard to get right in practice, but we can make it vastly simpler by limiting the maximum number of bytes or operations per second that are available to a container via its cgroup. The following settings let us control that:
--device-read-bps     Limit read rate (bytes per second) from a device
--device-read-iops    Limit read rate (IO per second) from a device
--device-write-bps    Limit write rate (bytes per second) to a device
--device-write-iops   Limit write rate (IO per second) to a device
You can test how these impact the performance of a container by running some of the following commands, which use the Linux I/O tester bonnie++.
$ time docker run -ti --rm spkane/train-os:latest bonnie++ -u 500:500 \
-d /tmp -r 1024 -s 2048 -x 1
...
real 0m27.715s
user 0m0.027s
sys 0m0.030s
$ time docker run -ti --rm --device-write-iops /dev/sda:256 \
spkane/train-os:latest bonnie++ -u 500:500 -d /tmp -r 1024 -s 2048 -x 1
...
real 0m58.765s
user 0m0.028s
sys 0m0.029s
$ time docker run -ti --rm --device-write-bps /dev/sda:5mb \
spkane/train-os:latest bonnie++ -u 500:500 -d /tmp -r 1024 -s 2048 -x 1
...
Windows users should be able to use the PowerShell Measure-Command function to replace the Unix time command used in these examples.
In our experience, --device-read-iops and --device-write-iops are the most effective limits to set, and the ones we recommend. Of course, there could be reasons why one of the other methods is better for your use case, so you should know about them.
Before Linux cgroups, there was another way to place a limit on the resources available to a process: the application of user limits via the ulimit command. That mechanism is still available and still useful for all of the use cases where it was traditionally used.
The following code is a list of the types of system resources that you can usually constrain by setting soft and hard limits via the ulimit command:
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 5835
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
It is possible to configure the Docker daemon with the default user limits that you want to apply to every container. The following command tells the Docker daemon to start all containers with a soft limit of 50 open files and a hard limit of 150 open files:
$ sudo dockerd --default-ulimit nofile=50:150
You can then override these ulimits on a specific container by passing in values using the --ulimit argument.
$ docker run -d --ulimit nofile=150:300 nginx
There are some additional advanced commands that you can use when creating containers, but this covers many of the more common use cases. The Docker client documentation lists all the available options and is updated with each Docker release.
Before we got into the details of containers and constraints, we created our container using the docker create command. That container is just sitting there without doing anything. There is a configuration, but no running process. When we’re ready to start the container, we can do so using the docker start command.
Let’s say that we needed to run a copy of Redis, a common key/value store. We won’t really do anything with this Redis container, but it’s a lightweight, long-lived process and serves as an example of something we might do in a real environment. We could first create the container:
$ docker create -p 6379:6379 redis:2.8
Unable to find image 'redis:2.8' locally
30d39e59ffe2: Pull complete
...
868be653dea3: Pull complete
511136ea3c5a: Already exists
redis:2.8: The image you are pulling has been verified. Important: ...
Status: Downloaded newer image for redis:2.8
6b785f78b75ec2652f81d92721c416ae854bae085eba378e46e8ab54d7ff81d1
The result of the command is some output, the last line of which is the full hash that was generated for the container. We could use that long hash to start it, but if we failed to note it down, we could also list all the containers on the system, whether they are running or not, using:
$ docker ps -a
CONTAINER ID  IMAGE                   COMMAND                ...
6b785f78b75e  redis:2.8               "/entrypoint.sh redi   ...
92b797f12af1  progrium/stress:latest  "/usr/bin/stress --v   ...
We can identify our container by the image we used and the creation time (truncated here for formatting). We can then start the container with the following command:
$ docker start 6b785f78b75e
Most Docker commands will work with the container name, the full hash, the short hash, or even just enough of the hash to make it unique. In the previous example, the full hash for the container is 6b785f78b75ec2652f81d92…bae085eba378e46e8ab54d7ff81d1, but the short hash that is shown in most command output is 6b785f78b75e. This short hash consists of the first 12 characters of the full hash. In the previous example, running docker start 6b7 would have worked just fine.
That should have started the container, but with it running in the background we won’t necessarily know if something went wrong. To verify that it’s running, we can run:
$ docker ps
CONTAINER ID  IMAGE      COMMAND                ...  STATUS        ...
6b785f78b75e  redis:2.8  "/entrypoint.sh redi   ...  Up 2 minutes  ...
And, there it is: running as expected. We can tell because the status says “Up” and shows how long the container has been running.
In many cases, we want our containers to restart if they exit. Some containers are very short-lived and come and go quickly. But for production applications, for instance, you expect them to be up and running at all times after you’ve told them to run. If you are running a more complex system, a scheduler may do this for you.
In the simple case, we can tell Docker to manage restarts on our behalf by passing the --restart argument to the docker run command. It takes four values: no, always, on-failure, or unless-stopped. If restart is set to no, the container will never restart if it exits. If it is set to always, the container will restart whenever it exits, with no regard to the exit code. If restart is set to on-failure:3, then whenever the container exits with a nonzero exit code, Docker will try to restart the container three times before giving up. unless-stopped is the most common choice and will restart the container unless it is intentionally stopped with something like docker stop.
We can see this in action by rerunning our last memory-constrained stress container without the --rm argument, but with the --restart argument.
$ docker run -ti --restart=on-failure:3 --memory 100m progrium/stress \
--cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
In this example, we’ll see the output from the first run appear on the console before it dies. If we run a docker ps immediately after the container dies, we’ll see that Docker is attempting to restart the container.
$ docker ps
...  IMAGE                   ...  STATUS                                 ...
...  progrium/stress:latest  ...  Restarting (1) Less than a second ago  ...
It will continue to fail because we haven’t given it enough memory to function properly. After three attempts, Docker will give up and we’ll see the container disappear from the output of docker ps.
Containers can be stopped and started at will. You might think that starting and stopping a container is analogous to pausing and resuming a normal process, but it’s not quite the same in reality. When stopped, the process is not paused; it actually exits. And when a container is stopped, it no longer shows up in the normal docker ps output. On reboot, Docker will attempt to start all of the containers that were running at shutdown. If you need to prevent a container from doing any additional work, without actually stopping the process, then you can pause the Docker container with docker pause and unpause, which will be discussed in more detail later. For now, go ahead and stop our container:
$ docker stop 6b785f78b75e
$ docker ps
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
Now that we have stopped the container, nothing is in the running ps list! We can start it back up with the container ID, but it would be really inconvenient to have to remember that. So docker ps has an additional option (-a) to show all containers, not just the running ones.
$ docker ps -a
CONTAINER ID  IMAGE      STATUS                    ...
6b785f78b75e  redis:2.8  Exited (0) 2 minutes ago  ...
That STATUS field now shows that our container exited with a status code of 0 (no errors). We can start it back up with all of the same configuration it had before:
$ docker start 6b785f78b75e
6b785f78b75e
$ docker ps -a
CONTAINER ID  IMAGE      ...  STATUS         ...
6b785f78b75e  redis:2.8  ...  Up 15 seconds  ...
Voilà, our container is back up and running, and configured just as it was before.
Remember that containers exist as a blob of configuration in the Docker system even when they are not started. That means that as long as the container has not been deleted, you can restart it without needing to recreate it. Although memory contents will have been lost, all of the container’s filesystem contents and metadata, including environment variables and port bindings, are saved and will still be in place when you restart the container.
By now we’ve probably hammered enough on the idea that containers are just a tree of processes that interact with the system in essentially the same way as any other process on the server. But it’s important to point it out here again, because it means that we can send Unix signals to the processes in our containers, which they can then respond to. In the previous docker stop example, we’re sending the container a SIGTERM signal and waiting for the container to exit gracefully. Containers follow the same process group signal propagation that any other process group would receive on Linux.
A normal docker stop sends a SIGTERM to the process. If you want to force a container to be killed if it hasn’t stopped after a certain amount of time, you can use the -t argument, like this:
$ docker stop -t 25 6b785f78b75e
This tells Docker to initially send a SIGTERM signal as before, but then if the container has not stopped within 25 seconds, to send a SIGKILL signal to forcefully kill it.
Although stop is the best way to shut down your containers, there are times when it doesn’t work and you’ll need to forcefully kill a container, just as you might have to do with any process outside of a container.
When a process is misbehaving, docker stop might not cut it. You might just want the container to exit immediately.
In these circumstances, you can use docker kill. As you’d expect, it looks a lot like docker stop:
$ docker kill 6b785f78b75e
6b785f78b75e
A docker ps command now shows that the container is no longer running, as expected:
$ docker ps
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
Just because it was killed rather than stopped does not mean you can’t start it again, though. You can just issue a docker start like you would for a nicely stopped container. Sometimes you might want to send another signal to a container, one that is not stop or kill. Like the Linux kill command, docker kill supports sending any Unix signal. Let’s say we wanted to send a USR1 signal to our container to tell it to do something like reconnect a remote logging session. We could do the following:
$ docker kill --signal=USR1 6b785f78b75e
6b785f78b75e
If our container actually did something with the USR1 signal, it would now do it. Since we’re just running a bash shell, though, it just continues on as if nothing happened. Try sending a HUP signal, though, and see what happens. Remember that HUP is the signal that is sent when the terminal closes on a foreground process.
There are a few reasons why we might not want to completely stop our container. We might want to pause it, leave its resources allocated, and leave its entries in the process table. That could be because we’re taking a snapshot of its filesystem to create a new image, or just because we need some CPU on the host for a while. If you are used to normal Unix process handling, you might wonder how this actually works since containerized processes are just processes.
Pausing leverages the cgroups freezer, which essentially just prevents your process from being scheduled until you unfreeze it. This will prevent the container from doing anything while maintaining its overall state, including memory contents. Unlike stopping a container, where the processes are made aware that they are stopping via the SIGTERM signal, pausing a container doesn’t send any information to the container about its state change. That’s an important distinction. Several Docker commands use pausing and unpausing internally as well. Here is how we pause a container:
$ docker pause 6b785f78b75e
To pause and unpause containers in Windows, you must be using Hyper-V as the underlying virtualization technology.
If we look at the list of running containers, we will now see that the Redis container status is listed as (Paused).
$ docker ps
CONTAINER ID  IMAGE      ...  STATUS                  ...
6b785f78b75e  redis:2.8  ...  Up 36 minutes (Paused)  ...
Attempting to use the container in this paused state would fail. It’s present, but nothing is running. We can now resume the container by using the docker unpause command.
$ docker unpause 6b785f78b75e
6b785f78b75e
$ docker ps
CONTAINER ID  IMAGE      ...  STATUS         ...
6b785f78b75e  redis:2.8  ...  Up 38 minutes  ...
It’s back to running, and docker ps correctly reflects the new state. Note that it shows “Up 38 minutes” now, because Docker still considers the container to be running even when it is paused.
After running all these commands to build images, create containers, and run them, we have accumulated a lot of image layers and container folders on our system.
We can list all the containers on our system using the docker ps -a command and then delete any of the containers in the list. Before an image itself can be removed, every container using that image must be stopped and deleted. We can delete a container as follows, using the docker rm command:
$ docker ps -a
CONTAINER ID  IMAGE                   ...
92b797f12af1  progrium/stress:latest  ...
...
$ docker rm 92b797f12af1
We can then list all the images on our system using:
$ docker images
REPOSITORY       TAG     IMAGE ID      CREATED       VIRTUAL SIZE
ubuntu           latest  5ba9dab47459  3 weeks ago   188.3 MB
redis            2.8     868be653dea3  3 weeks ago   110.7 MB
progrium/stress  latest  873c28292d23  7 months ago  281.8 MB
We can then delete an image and all associated filesystem layers by running:
$ docker rmi 873c28292d23
If you try to delete an image that is in use by a container, you will get a Conflict, cannot delete error. You should stop and delete the container(s) first.
There are times, especially during development cycles, when it makes sense to completely purge all the images or containers from your system. The easiest way to do this is by running the docker system prune command.
$ docker system prune
WARNING! This will remove:
- all stopped containers
- all networks not used by at least one container
- all dangling images
- all build cache
Are you sure you want to continue? [y/N] y
Deleted Containers:
cbbc42acfe6cc7c2d5e6c3361003e077478c58bb062dd57a230d31bcd01f6190
...
Deleted Images:
deleted: sha256:bec6ec29e16a409af1c556bf9e6b2ec584c7fb5ffbfd7c46ec00b30bf ...
untagged: spkane/squid@sha256:64fbc44666405fd1a02f0ec731e35881465fac395e7 ...
...
Total reclaimed space: 1.385GB
To remove all unused images, instead of only dangling images, try docker system prune -a
It is also possible to craft more specific commands to accomplish similar goals.
To delete all of the containers on your Docker host, use the following command:
$ docker rm $(docker ps -a -q)
And to delete all the images on your Docker host, this command will get the job done:
$ docker rmi $(docker images -q)
The docker ps and docker images commands both support a filter argument that can make it easy to fine-tune your delete commands for certain circumstances.
To remove all containers that exited with a nonzero state, you can use this filter:
$ docker rm $(docker ps -a -q --filter 'exited!=0')
And to remove all untagged images, you can type:
$ docker rmi $(docker images -q -f "dangling=true")
You can read the official Docker documentation to explore the filtering options. At the moment there are very few filters to choose from, but more will likely be added over time. And if you’re really interested, Docker is an open source project, so it is always open to public code contributions.
You can also make your own very creative filters by stringing together commands using pipes (|) and other similar techniques.
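For instance, this hypothetical sketch strings docker ps together with grep and awk to remove every container in the Exited state (verify the column layout of docker ps on your version before scripting against it):

$ docker rm $(docker ps -a | grep 'Exited' | awk '{print $1}')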
In production systems that see a lot of deployments, you can sometimes end up with old containers or unused images lying around and filling up disk space. It can be useful to script some of these docker rm and docker rmi commands to run on a schedule (e.g., running under cron or via a systemd timer). You can use what you’ve learned to do that for yourself, or you could look at something like Spotify’s docker-gc tool to keep your servers nice and neat. docker-gc has some nice options that let you keep a certain number of images or containers around, exempt certain images from garbage collection, or exclude recent containers. The tool includes a lot of what you would probably implement on your own, but it has been heavily battle-hardened and works well.
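As a minimal sketch of scheduled cleanup, assuming you are comfortable with prune’s non-interactive -f flag, a nightly cron entry could be as simple as this:

0 3 * * * docker system prune -f > /var/log/docker-prune.log 2>&1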
Up to now we have focused entirely on Docker commands for Linux containers, since this is the most common use case and works on all Docker platforms. However, since 2016, the Microsoft Windows platform has supported running Windows Docker containers that include native Windows applications and can be managed with the usual set of Docker commands.
Windows containers are not really the focus of this book, since they make up only a very tiny portion of production containers at this point and they aren’t compatible with the rest of the Docker ecosystem because they require Windows-specific container images. However, they’re a growing and important part of the Docker world, so we’ll take a brief look at how they work. In fact, except for the actual contents of the containers, almost everything else works the same as on Linux Docker containers. In this section we’ll run through a quick example of how you can run a Windows container on Windows 10 with Hyper-V and Docker.
For this to work, you must be using Docker Community Edition (or Enterprise Edition) on a 64-bit edition of Windows 10 (Professional or better).
The first thing you’ll need to do is to switch Docker from Linux Containers to Windows Containers. To do this, right-click on the Docker whale icon in your taskbar and select Switch to Windows Containers…. You should get a notification that Docker is switching, and this process might take some time, although usually it happens almost immediately. Unfortunately, there is no notification that the switch has completed, so you’ll just need to try using the Windows version. If you right-click on the Docker icon again, you should now see Switch to Linux Containers… in place of the original option.
If the first time you right-click on the Docker icon, it reads Switch to Linux Containers…, then you are already configured for Windows containers.
We can test a simple Windows container by opening up PowerShell and trying to run the following command:
PS C:\> docker run -it microsoft/nanoserver powershell 'Write-Host "Hello World"'
Hello World
This will download and launch a base container for Windows Server Nano Server and then use PowerShell scripting to print Hello World to the screen.
If you want to build an image that accomplishes the same task, you can create the following Dockerfile:
FROM microsoft/nanoserver
RUN powershell.exe Add-Content C:\\helloworld.ps1 'Write-Host "Hello World"'
CMD ["powershell", "C:\\helloworld.ps1"]
When we build this Dockerfile, it will base the container on microsoft/nanoserver, create a very small PowerShell script, and then set the image to run the script by default when it is used to launch a container.
You may have noticed that we had to escape the backslash (\) with an additional backslash in the preceding Dockerfile. This is because Docker has its roots in Unix and the backslash has a special meaning in Unix shells. So, we escape all backslashes to ensure that Docker does not interpret them to mean that this command is continued on the next line.
If you build this Dockerfile now, you’ll see something similar to this:
PS C:\> docker build -t windows-helloworld:latest .
Sending build context to Docker daemon 2.048kB
Step 1/3 : FROM microsoft/nanoserver
---> 8a62949f0058
Step 2/3 : RUN powershell.exe Add-Content C:\\helloworld.ps1 'Write-Host \
"Hello World"'
---> Using cache
---> 930ed3b401fb
Step 3/3 : CMD ["powershell", "C:\\helloworld.ps1"]
---> Using cache
---> 445764524411
Successfully built 445764524411
Successfully tagged windows-helloworld:latest
And now if you run the resulting image, you’ll see this:
PS C:\> docker run --rm -ti windows-helloworld:latest
Hello World
Microsoft maintains good documentation about Windows containers that also includes an example of building a container that launches a .NET application.
On the Windows platform, it is also useful to know that you can get improved isolation for your container by launching it inside a dedicated and very lightweight Hyper-V virtual machine. You can do this very easily, by simply adding the --isolation=hyperv option to your docker create and docker run commands. There is a small performance and resource penalty for this, but it does significantly improve the isolation of your container. You can read more about this in the documentation.
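For example, here is a sketch of the earlier Hello World container launched with Hyper-V isolation (the trailing backtick is PowerShell’s line-continuation character):

PS C:\> docker run --rm -it --isolation=hyperv microsoft/nanoserver `
    powershell 'Write-Host "Hello World"'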
Even if you plan to mostly work with Windows Containers, for the rest of the book you should switch back to Linux Containers, so that all the examples work as expected. When you are done reading and are ready to dive into building your own containers, you can always switch back.
Remember that you can reenable Linux Containers by right-clicking on the Docker icon, and selecting Switch to Linux Containers….
In the next chapter, we’ll continue our exploration of what Docker brings to the table. For now it’s probably worth doing a little experimentation on your own. We suggest exercising some of the container control commands we covered here so that you’re familiar with the command-line options and the overall syntax. Now would even be a great time to try to design and build a small image and then launch it as a new container. When you are ready to continue, head on to Chapter 6!