Containers Fundamentals
A container is not a VM. It’s a regular Linux process that the kernel has tricked into believing it lives alone on the machine. The tricks are namespaces (what the process can see) and cgroups (what the process can use). Everything else — Docker, Kubernetes, runc, OCI — is tooling built on those two kernel features.
The one-sentence definition
A container is a Linux process with restricted visibility (namespaces) and resource limits (cgroups), usually running from a filesystem bundle (image) produced from a Dockerfile.
Container vs VM — the mental model
┌──────────────────────── Virtual Machines ──────────────────────────┐
│ app app app │
│ libs libs libs │
│ guest OS guest OS guest OS ← each VM = full kernel │
│ ──────────────────────────── │
│ hypervisor ← Type 1 / Type 2 │
│ ──────────────────────────── │
│ hardware │
└────────────────────────────────────────────────────────────────────┘
┌────────────────────── Containers ──────────────────────────────────┐
│ app app app │
│ libs libs libs ← separate userspace │
│ ──────────────────────────── │
│ host kernel ← SHARED │
│ ──────────────────────────── │
│ hardware │
└────────────────────────────────────────────────────────────────────┘
| | Container | VM |
|---|---|---|
| Boundary | Linux namespaces + cgroups | Hypervisor + separate kernel |
| Boot time | ms | Seconds to minutes |
| Memory overhead | ~MB | Hundreds of MB |
| Kernel | Shared with host | Its own |
| OS in image | Just the userspace (busybox, alpine, debian slim) | Full OS |
| Isolation | Process-level | Hardware-level |
| Analogue | VLAN (logical segmentation) | Separate physical switch |
The kernel-shared property is the big deal: containers start in milliseconds, weigh tens of MB, and scale to thousands on one box. The cost is weaker isolation — a kernel bug is a tenant-boundary bug.
The kernel namespaces you care about
Namespaces partition what a process can see. Each process belongs to one namespace of each type:
| Namespace | Scopes | What it hides |
|---|---|---|
| PID | Process IDs | PID 1 inside is the container’s first process; host PIDs are invisible |
| NET | Network stack | Own interfaces, routing table, iptables rules, sockets |
| MNT | Mount points | Its own / and filesystem tree |
| UTS | Hostname, domain | hostname shows container’s, not host’s |
| IPC | SysV IPC, POSIX msg queues | Separate shared-memory area |
| USER | UIDs, GIDs | Root inside ≠ root outside (user namespaces remap IDs) |
| CGROUP | cgroup hierarchy view | Container sees "/" as its cgroup root |
| TIME | Boot/monotonic clocks | (Linux 5.6+) Separate clock per container — rarely used |
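These memberships are visible from userspace: every namespace a process belongs to appears as a symlink under /proc/&lt;pid&gt;/ns/, and two processes share a namespace exactly when the link targets match. A minimal sketch (Linux only; the paths simply don’t exist elsewhere):

```python
import os

# Each entry under /proc/self/ns/ is a symlink whose target names the
# namespace type and its inode, e.g. "pid:[4026531836]". Two processes
# are in the same namespace iff the targets are identical.
for ns in ("pid", "net", "mnt", "uts", "ipc", "user"):
    path = f"/proc/self/ns/{ns}"
    if os.path.exists(path):  # Linux only
        print(ns, "->", os.readlink(path))
```

Comparing these links between a shell on the host and a shell inside a container is the quickest way to see exactly which namespaces the runtime created.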
Created via clone() / unshare() syscalls with flags like CLONE_NEWPID, CLONE_NEWNET. Docker/runc set them all at once; you can play with one at a time:
# A shell that sees only its own PID tree
unshare --fork --pid --mount-proc bash
# inside: ps ax → just this shell and ps; PID 1 is bash
cgroups — what a process can use
Control groups limit and account for resource use. A process’s cgroup membership caps:
| Resource | Limit |
|---|---|
| CPU | Shares (weight), quota/period (hard cap), cpuset (pinning) |
| Memory | memory.max, memory.high, memory.swap.max |
| I/O | IOPS and bandwidth per block device |
| PIDs | Max processes (fork-bomb protection) |
| Network | Classid for tc shaping (rarely used directly) |
| Devices | Which /dev nodes are allowed (cgroup v1) |
cgroup v2 is the modern unified hierarchy (one tree instead of one tree per controller). Most current distributions default to v2, and Kubernetes assumes v2 on modern installs.
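You can check which version a host is on from /proc: under v2, /proc/self/cgroup is a single line of the form 0::&lt;path&gt;, while v1 lists one line per controller. A small sketch (Linux only; returns None elsewhere):

```python
# Detect the cgroup hierarchy version of the current host (Linux only).
# On v2, /proc/self/cgroup holds one line like "0::/some/path"; on v1
# there is one "<id>:<controller>:<path>" line per controller.
def cgroup_version(path="/proc/self/cgroup"):
    try:
        with open(path) as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        return None  # not Linux
    if len(lines) == 1 and lines[0].startswith("0::"):
        return 2
    return 1

print("cgroup version:", cgroup_version())
```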
From the command line:
# Docker — set limits on run
docker run --cpus=1.5 --memory=512m --pids-limit=200 nginx
# systemd — same knobs (systemd uses cgroups for all services)
systemctl set-property nginx.service CPUQuota=150% MemoryMax=512M
Under the hood both edit cgroup files like /sys/fs/cgroup/....
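The translation is mechanical: --cpus=1.5 means 1.5 periods’ worth of CPU time per scheduling period. With the default 100 ms period, that becomes the cpu.max pair 150000 100000 (both values in microseconds). A sketch of the arithmetic:

```python
# How a fractional --cpus value maps onto the cgroup v2 cpu.max file.
# cpu.max holds "<quota> <period>" in microseconds: the group may run
# <quota> us of CPU time per <period> us of wall-clock time.
def cpu_max(cpus: float, period_us: int = 100_000) -> str:
    quota_us = int(cpus * period_us)
    return f"{quota_us} {period_us}"

print(cpu_max(1.5))   # "150000 100000" — the effect of --cpus=1.5
print(cpu_max(0.25))  # "25000 100000"  — a quarter of one core
```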
Images — a stack of layers
An image is a filesystem snapshot plus metadata (entrypoint, env, labels). It’s not a single tarball — it’s a stack of layers.
┌─── image: myapp:1.2 ───┐
│ layer 5: your code │ ← thin, changes often
│ layer 4: pip install │
│ layer 3: apt update │
│ layer 2: python:3.12 │
│ layer 1: debian:bookworm │
└────────────────────────┘
- Each Dockerfile instruction that changes the filesystem = one layer.
- Layers are content-addressable (sha256 of contents). Pull / push / cache by hash.
- The OverlayFS driver stacks layers into a single view at container runtime.
- Copy-on-write: the running container gets an ephemeral writeable layer on top; base layers stay immutable and shared.
Implications:
- Two images sharing base layers only download / store the shared layers once.
- Layer order matters for caching: put slow-changing things early (FROM, apt install) and fast-changing things late (COPY . .). Otherwise one code change invalidates the whole chain.
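Content addressing is just hashing: a layer’s identity is the sha256 of its bytes (in practice, of the compressed layer tar), so identical layers collapse to a single stored copy no matter how many images reference them. A toy illustration:

```python
import hashlib

def layer_digest(content: bytes) -> str:
    # Registries and the local store key layers by digest, not by name.
    return "sha256:" + hashlib.sha256(content).hexdigest()

base = layer_digest(b"...debian:bookworm layer bytes...")
same = layer_digest(b"...debian:bookworm layer bytes...")
code = layer_digest(b"...your app code...")

assert base == same  # identical content -> one stored copy, one download
assert base != code  # any change -> a new layer with a new digest
print(base)
```

This is also why a cache hit is exact: if the bytes differ at all, the digest differs, and everything stacked above it must be rebuilt.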
The Dockerfile (minimal sketch)
# syntax=docker/dockerfile:1.7
FROM python:3.12-slim AS base
WORKDIR /app
# Deps layer — changes rarely
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# App layer — changes often
COPY . .
EXPOSE 8080
USER 1000:1000
ENTRYPOINT ["python", "-m", "myapp"]
Conventions worth following:
- Pin the base image (python:3.12-slim, not latest).
- Multi-stage builds — the build stage has compilers/SDKs; the final stage is a slim image with just the artifact. Reduces image size 10-100x.
- Non-root USER — never run as root unless you have to. Most K8s policies reject root containers.
- One process per container is the spirit, not the letter — multiple processes are fine; the sidecar model is the discipline.
- .dockerignore — don’t copy your node_modules, .git, or secrets into the build context.
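The multi-stage idea, continuing the sketch above (same hypothetical myapp and requirements.txt; details will vary per project): dependencies are installed in a throwaway build stage, and only the installed artifacts are copied into the final image.

```dockerfile
# Build stage — has pip and whatever build tooling wheels need
FROM python:3.12-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Final stage — just the runtime plus the installed artifacts
FROM python:3.12-slim
COPY --from=build /install /usr/local
WORKDIR /app
COPY . .
USER 1000:1000
ENTRYPOINT ["python", "-m", "myapp"]
```

The build stage never ships: compilers, headers, and pip caches stay behind, which is where the 10-100x size reduction comes from.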
OCI — the standard
Early Docker was proprietary. The Open Container Initiative (OCI) standardised:
- OCI Image Format — how images are structured on disk / in registries.
- OCI Runtime Spec — how a runtime starts a container from a filesystem bundle.
- OCI Distribution Spec — the registry API.
Thanks to OCI, any OCI runtime (runc, crun, gVisor, kata) runs any OCI image, from any OCI registry. Docker isn’t required for any of them.
The runtime stack
There’s more than one “thing that runs a container”:
Developer CLI → docker / podman / nerdctl
│
▼
High-level runtime → containerd / CRI-O (manages images, networking, lifecycle)
│
▼
Low-level runtime → runc / crun / gVisor / kata (calls clone, sets up namespaces/cgroups)
│
▼
Linux kernel
- Docker (engine) uses containerd under the hood, which shells out to runc to start each container.
- Kubernetes used to use Docker; now it talks to containerd or CRI-O directly via the CRI (Container Runtime Interface). The dockershim was removed in K8s 1.24.
- Podman is rootless/daemonless; same image spec, no long-running daemon.
You don’t need to know the internals to use containers, but when something goes wrong at the kernel level (weird cgroup behaviour, AppArmor denial), knowing which layer you’re at is essential.
Container networking — the quick tour
Each container has its own net namespace. Docker (default) sets up:
- A host bridge docker0 (an L2 switch).
- A veth pair per container: one end inside the container (as eth0), one on the host plugged into the bridge.
- NAT via iptables for outbound internet.
container-A container-B
┌──────┐ ┌──────┐
│ eth0 │ │ eth0 │
└──┬───┘ └──┬───┘
│ veth │ veth
└──────┬────── docker0 ───┘
│
host NIC (masquerade)
│
▼
internet
Drivers for different needs:
- bridge (default): isolated L2 on the host; NAT for external.
- host: no namespace — container shares the host’s stack. Fast, no isolation.
- macvlan / ipvlan: container gets a real MAC/IP on the physical LAN. For legacy apps that need L2 visibility.
- overlay: spans hosts via VXLAN (Docker Swarm, Kubernetes with some CNIs).
Kubernetes uses a CNI plugin (Calico, Cilium, Flannel, …) to implement pod networking. Pods get flat, routable IPs; the CNI chooses how.
Container storage
Three categories:
- Image layers (read-only) — everything baked into the image. Immutable.
- Container writeable layer — the ephemeral top layer. Wiped when container dies. Never put state here.
- Volumes / bind mounts — persistent storage outside the container filesystem.
| Mount type | Example | When |
|---|---|---|
| Named volume | docker volume create data; -v data:/var/lib/app | Persistent data you want Docker to manage |
| Bind mount | -v /host/path:/container/path | Dev: mount your source code into the container |
| tmpfs | --tmpfs /tmp | Write-heavy ephemeral scratch space |
| K8s PV/PVC | PersistentVolumeClaim → cloud disk, NFS, Ceph | The K8s abstraction over all of the above |
Rule: container filesystem is ephemeral. Anything you care about goes in a volume or an external datastore.
Security model — where containers are weaker than VMs
Because containers share the host kernel, the attack surface between a container and the host is every syscall. Defenses:
- Drop capabilities — Docker drops most of root’s caps by default. Don’t use --privileged unless you must.
- Read-only root filesystem — --read-only plus tmpfs for writes.
- User namespaces — map container root to an unprivileged host UID.
- seccomp — an allow-list of syscalls (Docker ships a default seccomp profile).
- AppArmor / SELinux — MAC policy on top.
- Rootless containers (Podman, rootless Docker) — the whole runtime runs as a non-root user.
For workloads with truly untrusted tenants, add a stronger sandbox: gVisor (intercepts syscalls in userspace) or Kata Containers (runs each container in a lightweight VM).
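You can watch the capability squeeze from inside: the kernel exposes each process’s effective capability set as a hex bitmap in /proc/self/status. Full root on the host has nearly all bits set, a default Docker container keeps only a handful, and --cap-drop=ALL leaves zero. A Linux-only sketch:

```python
# Read the effective capability bitmap of the current process (Linux).
# Each set bit is one capability, e.g. bit 12 = CAP_NET_ADMIN.
def effective_caps(path="/proc/self/status"):
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("CapEff:"):
                    return int(line.split()[1], 16)
    except FileNotFoundError:
        pass  # not Linux
    return None

caps = effective_caps()
if caps is not None:
    print(f"CapEff = {caps:#x} ({bin(caps).count('1')} capabilities)")
```

Run the same snippet on the host and inside a container to compare; the container’s count dropping toward zero is the capability model working.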
Where containers shine, and where they don’t
Great at:
- Packaging an app with its deps (Python + its libs + its Python version).
- Fast, predictable deployment (same image, dev → prod).
- Horizontal scale (start 100 in a minute).
- Ephemeral workloads (CI jobs, batch).
Awkward at:
- Workloads that need direct hardware access (HPC, DPDK, SR-IOV) without extra work.
- Desktop applications with complex GUIs.
- Anything that needs a specific kernel version different from the host.
- Untrusted multi-tenant workloads on a single kernel (see “weaker than VMs”).