Containers Fundamentals

A container is not a VM. It’s a regular Linux process that the kernel has tricked into believing it lives alone on the machine. The tricks are namespaces (what the process can see) and cgroups (what the process can use). Everything else — Docker, Kubernetes, runc, OCI — is tooling built on those two kernel features.

The one-sentence definition

A container is a Linux process with restricted visibility (namespaces) and resource limits (cgroups), usually running from a filesystem bundle (image) produced from a Dockerfile.

Container vs VM — the mental model

┌──────────────────────── Virtual Machines ──────────────────────────┐
│  app          app          app                                     │
│  libs         libs         libs                                    │
│  guest OS     guest OS     guest OS        ← each VM = full kernel │
│        ────────────────────────────                                │
│                  hypervisor                 ← Type 1 / Type 2      │
│        ────────────────────────────                                │
│                   hardware                                         │
└────────────────────────────────────────────────────────────────────┘

┌────────────────────── Containers ──────────────────────────────────┐
│  app          app          app                                     │
│  libs         libs         libs             ← separate userspace   │
│        ────────────────────────────                                │
│                   host kernel               ← SHARED               │
│        ────────────────────────────                                │
│                    hardware                                        │
└────────────────────────────────────────────────────────────────────┘
                      Container                         VM
  Boundary            Linux namespaces + cgroups        Hypervisor + separate kernel
  Boot time           Milliseconds                      Seconds to minutes
  Memory overhead     ~MB                               Hundreds of MB
  Kernel              Shared with host                  Its own
  OS in image         Userspace only (busybox,          Full OS
                      alpine, debian slim)
  Isolation           Process-level                     Hardware-level
  Analogue            VLAN (logical segmentation)       Separate physical switch

The kernel-shared property is the big deal: containers start in milliseconds, weigh tens of MB, and scale to thousands on one box. The cost is weaker isolation — a kernel bug is a tenant-boundary bug.

The kernel namespaces you care about

Namespaces partition what a process can see. Each process belongs to one namespace of each type:

  Namespace   Scopes                        What it hides
  PID         Process IDs                   Container sees PID 1 = itself; host PIDs are invisible
  NET         Network stack                 Own interfaces, routing table, iptables rules, sockets
  MNT         Mount points                  Its own / and filesystem tree
  UTS         Hostname, domain              hostname shows the container's, not the host's
  IPC         SysV IPC, POSIX msg queues    Separate shared-memory area
  USER        UIDs, GIDs                    Root inside ≠ root outside (user namespaces remap IDs)
  CGROUP      cgroup hierarchy view         Container sees "/" as its cgroup root
  TIME        Boot/monotonic clocks         (Linux 5.6+) Separate clock per container; rarely used

Created via clone() / unshare() syscalls with flags like CLONE_NEWPID, CLONE_NEWNET. Docker/runc set them all at once; you can play with one at a time:

# A shell that sees only its own PID tree
unshare --fork --pid --mount-proc bash
# inside: ps ax → just this shell and ps, PID 1 is bash
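Namespace membership is directly observable, no root required: every process exposes its namespaces as symlinks under /proc/<pid>/ns, and two processes share a namespace exactly when the symlink targets (inode numbers) match.

```shell
# List this shell's namespaces; each target has the form "<type>:[<inode>]".
ls -l /proc/self/ns
# Read one namespace directly; equal targets across two PIDs = same namespace.
readlink /proc/self/ns/pid
```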

cgroups — what a process can use

Control groups limit and account for resource use. A process’s cgroup membership caps:

  Resource   Limit
  CPU        Shares (weight), quota/period (hard cap), cpuset (pinning)
  Memory     memory.max, memory.high, memory.swap.max
  I/O        IOPS and bandwidth per block device
  PIDs       Max processes (fork-bomb protection)
  Network    Classid for tc shaping (rarely used directly)
  Devices    Which /dev nodes are allowed (cgroup v1)

cgroup v2 is the modern unified hierarchy (one tree, not one per controller). Current distributions default to v2, and Kubernetes assumes it on modern installs.

From the command line:

# Docker — set limits on run
docker run --cpus=1.5 --memory=512m --pids-limit=200 nginx
 
# systemd — same knobs (systemd uses cgroups for all services)
systemctl set-property nginx.service CPUQuota=150% MemoryMax=512M

Under the hood both edit cgroup files like /sys/fs/cgroup/....
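You can see where any process sits in that hierarchy without root:

```shell
# Print this shell's cgroup membership. On cgroup v2 the single line has the
# form "0::<path>", where <path> is relative to /sys/fs/cgroup — the limit
# files (memory.max, cpu.max, pids.max) live in that directory.
cat /proc/self/cgroup
```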

Images — a stack of layers

An image is a filesystem snapshot plus metadata (entrypoint, env, labels). It’s not a single tarball — it’s a stack of layers.

┌───── image: myapp:1.2 ─────┐
│  layer 5: your code        │  ← thin, changes often
│  layer 4: pip install      │
│  layer 3: apt update       │
│  layer 2: python:3.12      │
│  layer 1: debian:bookworm  │
└────────────────────────────┘
  • Each Dockerfile instruction that changes the filesystem = one layer.
  • Layers are content-addressable (sha256 of contents). Pull / push / cache by hash.
  • The OverlayFS driver stacks layers into a single view at container runtime.
  • Copy-on-write: the running container gets an ephemeral writeable layer on top; base layers stay immutable and shared.
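Content addressing is easy to see in miniature: a layer's identity is just the digest of its bytes, so identical content always produces the identical ID, which is what makes pull, push, and cache-by-hash work.

```shell
# Identical bytes → identical sha256 → identical layer ID; sha256 is the same
# primitive builders and registries use for layer digests.
printf 'FROM python:3.12-slim' | sha256sum
printf 'FROM python:3.12-slim' | sha256sum   # same digest both times
```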

Implications:

  • Two images sharing base layers only download / store the shared layers once.
  • Layer order matters for caching: put slow-changing things early (FROM, apt install) and fast-changing things late (COPY . .). Otherwise one code change invalidates the whole chain.

The Dockerfile (minimal sketch)

# syntax=docker/dockerfile:1.7
FROM python:3.12-slim AS base
WORKDIR /app
 
# Deps layer — changes rarely
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
 
# App layer — changes often
COPY . .
 
EXPOSE 8080
USER 1000:1000
ENTRYPOINT ["python", "-m", "myapp"]

Conventions worth following:

  • Pin the base image (python:3.12-slim not latest).
  • Multi-stage builds — build stage has compilers/SDKs; final stage is a slim image with just the artifact. Reduces image size 10-100x.
  • Non-root USER — never run as root unless you have to. Most K8s policies reject root containers.
  • One process per container is the spirit, not the letter: multiple processes are fine; the discipline is one concern per container (put auxiliary concerns in sidecars).
  • .dockerignore — don’t copy your node_modules, .git, or secrets into the build context.
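A sketch of the multi-stage pattern applied to the Python app above (stage names and paths are illustrative): the build stage carries pip and build tooling; the final image copies in only the finished virtualenv.

```dockerfile
# Stage 1: build the virtualenv (has pip, build deps)
FROM python:3.12-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN python -m venv /venv && \
    /venv/bin/pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime image — just the venv and the code, no build tooling
FROM python:3.12-slim
WORKDIR /app
COPY --from=build /venv /venv
COPY . .
USER 1000:1000
ENTRYPOINT ["/venv/bin/python", "-m", "myapp"]
```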

OCI — the standard

Early Docker was proprietary. The Open Container Initiative (OCI) standardised:

  • OCI Image Format — how images are structured on disk / in registries.
  • OCI Runtime Spec — how a runtime starts a container from a filesystem bundle.
  • OCI Distribution Spec — the registry API.

Thanks to OCI, any OCI runtime (runc, crun, gVisor, kata) runs any OCI image, from any OCI registry. Docker isn’t required for any of them.
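The image-format spec is concrete: a registry stores a small JSON manifest that points at the config blob and layer blobs by digest. A trimmed, illustrative example (digests and sizes are placeholders; the media types are the real OCI ones):

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:aaaa...",
    "size": 1469
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:bbbb...",
      "size": 28231234
    }
  ]
}
```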

The runtime stack

There’s more than one “thing that runs a container”:

  Developer CLI    →   docker / podman / nerdctl
                           │
                           ▼
  High-level runtime  →   containerd / CRI-O       (manages images, networking, lifecycle)
                           │
                           ▼
  Low-level runtime   →   runc / crun / gVisor / kata   (calls clone, sets up namespaces/cgroups)
                           │
                           ▼
                      Linux kernel
  • Docker (engine) uses containerd under the hood, which shells out to runc to start each container.
  • Kubernetes used to use Docker; now it uses containerd or CRI-O directly via the CRI interface (Container Runtime Interface). The dockershim was removed in K8s 1.24.
  • Podman is rootless/daemonless; same image spec, no long-running daemon.

You don’t need to know the internals to use containers, but when something goes wrong at the kernel level (weird cgroup behaviour, AppArmor denial), knowing which layer you’re at is essential.

Container networking — the quick tour

Each container has its own net namespace. Docker (default) sets up:

  • A host bridge docker0 (L2 switch).
  • A veth pair per container: one end inside the container (as eth0), one on the host plugged into the bridge.
  • NAT via iptables for outbound internet.
   container-A                container-B
    ┌──────┐                   ┌──────┐
    │ eth0 │                   │ eth0 │
    └──┬───┘                   └──┬───┘
       │ veth                     │ veth
       └──────┬──────  docker0 ───┘
              │
          host NIC (masquerade)
              │
              ▼
           internet

Drivers for different needs:

  • bridge (default): isolated L2 on the host; NAT for external.
  • host: no namespace — container shares the host’s stack. Fast, no isolation.
  • macvlan / ipvlan: container gets a real MAC/IP on the physical LAN. For legacy apps that need L2 visibility.
  • overlay: spans hosts via VXLAN (Docker Swarm, Kubernetes with some CNIs).
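The "own network stack" claim is cheap to verify without Docker, assuming unprivileged user namespaces are enabled (the common default): a process dropped into a fresh net namespace sees only an unconfigured loopback device.

```shell
# /proc/net reflects the *reading process's* network namespace, so inside a
# fresh user+net namespace the interface list collapses to loopback only —
# no host NICs, routes, or iptables rules.
unshare --user --map-root-user --net cat /proc/net/dev
```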

Kubernetes uses a CNI plugin (Calico, Cilium, Flannel, …) to implement pod networking. Pods get flat, routable IPs; the CNI chooses how.

Container storage

Three categories:

  1. Image layers (read-only) — everything baked into the image. Immutable.
  2. Container writeable layer — the ephemeral top layer. Wiped when container dies. Never put state here.
  3. Volumes / bind mounts — persistent storage outside the container filesystem.
  Mount type     Example                                           When
  Named volume   docker volume create data; -v data:/var/lib/app   Persistent data you want Docker to manage
  Bind mount     -v /host/path:/container/path                     Dev: mount your source code into the container
  tmpfs          --tmpfs /tmp                                      Write-heavy ephemeral scratch space
  K8s PV/PVC     PersistentVolumeClaim → cloud disk, NFS, Ceph     The K8s abstraction over all of the above
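For the last row, the Kubernetes shape is a small manifest. A minimal, illustrative PVC (name and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]   # one node mounts it read-write
  resources:
    requests:
      storage: 10Gi                # the cluster's storage class provisions the disk
```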

Rule: container filesystem is ephemeral. Anything you care about goes in a volume or an external datastore.

Security model — where containers are weaker than VMs

Because containers share the host kernel, the attack surface between a container and the host is every syscall. Defenses:

  • Drop capabilities — default Docker drops most of root’s caps. Don’t --privileged unless you must.
  • Read-only root filesystem: --read-only, plus tmpfs mounts for the paths that must be writable.
  • User namespaces — map container root to unprivileged host UID.
  • seccomp — allow-list of syscalls (Docker ships a default seccomp profile).
  • AppArmor / SELinux — MAC policy on top.
  • Rootless containers (Podman, rootless Docker) — the whole runtime runs as a non-root user.
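Several of these knobs can be combined declaratively. A docker-compose sketch of a hardened service (service and image names are illustrative; the keys are standard Compose options):

```yaml
services:
  app:
    image: myapp:1.2
    read_only: true                      # read-only root filesystem
    tmpfs: ["/tmp"]                      # writable scratch, wiped on exit
    cap_drop: [ALL]                      # drop every root capability
    security_opt: ["no-new-privileges:true"]
    user: "1000:1000"                    # non-root UID:GID
```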

For workloads with truly untrusted tenants, add a stronger sandbox: gVisor (intercepts syscalls in userspace) or Kata Containers (runs each container in a lightweight VM).

Where containers shine, and where they don’t

Great at:

  • Packaging an app with its deps (Python + its libs + its Python version).
  • Fast, predictable deployment (same image, dev → prod).
  • Horizontal scale (start 100 in a minute).
  • Ephemeral workloads (CI jobs, batch).

Awkward at:

  • Stateful systems with direct hardware (HPC, DPDK, SR-IOV) without extra work.
  • Desktop applications with complex GUIs.
  • Anything that needs a specific kernel version different from the host.
  • Untrusted multi-tenant workloads on a single kernel (see “weaker than VMs”).

See also