DevOps Fundamentals
DevOps is not a job title, a tool, or a team. It’s a set of practices for shortening the feedback loop between “we want a change” and “the change is running in production, working.” Tools and teams exist to support the practice — they are not the practice itself.
The problem DevOps exists to solve
Before DevOps, two teams with competing incentives:
- Dev rewarded for shipping features fast → writes code and throws it over the wall.
- Ops rewarded for uptime → resists every change because every change is risk.
Result: long release cycles (months), ticket ping-pong, blame games after every outage, production state no one can reproduce. Everybody was doing their job and the overall system was broken.
DevOps answers: tear down the wall. Shared goal (value delivered to users), shared tooling (code in git, infra as code, CI/CD), shared responsibility (you build it, you run it).
The three ways (from “The Phoenix Project”)
A useful mental model:
- Flow — optimise left-to-right, from idea to production. Remove handoffs, reduce batch size, make the pipeline visible.
- Feedback — shorten feedback loops. Fast tests. Monitoring that alerts early. Users get fixes in hours, not weeks.
- Continuous learning — blameless postmortems, chaos engineering, game days, experimentation as a first-class activity.
Every DevOps practice is fundamentally one of these three.
What DevOps is (in practice)
A working DevOps setup usually includes all of these:
1. Version control for everything
Not just application code. Everything diffable in git:
- Application code
- Infrastructure as code (Automation-IaC, Terraform, Ansible)
- CI/CD pipeline definitions (`.github/workflows`, `.gitlab-ci.yml`, `Jenkinsfile`)
- Kubernetes manifests
- Documentation, runbooks, this wiki
- Config files (under `/etc/`, kept in a repo and applied by Ansible)
If it’s not in git, it doesn’t exist — because you can’t diff, review, rollback, or audit it.
2. CI — continuous integration
Every commit runs an automated pipeline: build, lint, test, security scan. Fast (< 10 min), reliable (no flaky tests), and mandatory (can’t merge if red). See CI-CD Fundamentals.
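Such a pipeline can be sketched as a GitHub Actions workflow. This is a minimal illustration, not a drop-in file: the `make` targets are placeholders for whatever lint/test/build commands your stack uses.

```yaml
# .github/workflows/ci.yml — minimal sketch; adapt steps to your stack
name: ci
on:
  pull_request:
  push:
    branches: [main]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    timeout-minutes: 10        # enforce the "< 10 min" budget
    steps:
      - uses: actions/checkout@v4
      - run: make lint         # placeholder: your linter
      - run: make test         # placeholder: your test suite
      - run: make build        # placeholder: produce the artifact
```

Making it mandatory is a repo setting, not a pipeline setting: require the `build-and-test` check to pass before merge via branch protection.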
3. CD — continuous delivery / deployment
Delivery: every commit produces a deployable artifact; humans decide when to release. Deployment: every commit that passes CI goes to production automatically.
Both require reliable tests, automated rollback, and feature flags (to decouple “deploy” from “release”).
4. Infrastructure as Code
Servers, networks, databases, DNS records — defined in text files, applied by tools (Ansible Fundamentals, Terraform, Pulumi). Cattle, not pets: any machine is disposable because recreating it is one `terraform apply` away.
See IaC Fundamentals and Automation-IaC.
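For flavour, a minimal Terraform sketch — the provider, region, AMI, and names are all illustrative, not a recommendation:

```hcl
# Minimal Terraform sketch — provider, region, and names are illustrative
terraform {
  required_providers {
    aws = { source = "hashicorp/aws" }
  }
}

provider "aws" {
  region = "eu-west-1"
}

# Cattle, not pets: destroy this instance and `terraform apply` recreates it
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0"   # placeholder AMI
  instance_type = "t3.micro"
  tags = { Name = "web" }
}
```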
5. Observability
You don’t know your system unless you can see it. Metrics (Prometheus), logs (Loki, ELK, Cloud Logging), traces (Jaeger, Tempo, Datadog). See Observability.
The rule: if it alerts, there’s a runbook. If there’s no runbook, the alert is noise.
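The runbook rule can be enforced mechanically by putting the runbook link in the alert itself. A sketch as a Prometheus alerting rule — metric name, threshold, and URL are illustrative; `runbook_url` is a common annotation convention, not a built-in field:

```yaml
# Prometheus alerting rule — metric, threshold, and URL are illustrative
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "API 5xx rate above 5% for 10 minutes"
          runbook_url: https://wiki.example.com/runbooks/high-error-rate
```

If you can't fill in `runbook_url`, that's the signal the alert is noise.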
6. “You build it, you run it”
Werner Vogels’ famous line. Developers carry pagers for their services. It aligns incentives — sloppy code wakes you up at 3 AM, so you write better code.
Not every org can do this; at minimum, dev and ops share a Slack channel and one on-call rota.
Tooling landscape (the short version)
| Slice | Examples |
|---|---|
| Version control | Git (+ GitHub / GitLab / Bitbucket) |
| CI/CD | GitHub Actions, GitLab CI, Jenkins, CircleCI, Argo Workflows |
| IaC | Terraform / OpenTofu, Ansible, Pulumi, CloudFormation, Bicep |
| Containers | Docker, Podman, containerd |
| Orchestration | Kubernetes (+ k3s, OpenShift, EKS/GKE/AKS) |
| GitOps | ArgoCD, Flux |
| Templating / packaging | Helm, Kustomize |
| Artifact registries | Docker Hub, ECR / GCR / ACR, Artifactory, Nexus |
| Secrets | Vault, cloud KMS, SOPS, sealed-secrets — see Secrets Management |
| Monitoring | Prometheus + Grafana, Datadog, New Relic, CloudWatch |
| Logging | Loki, Elastic, Splunk, Cloud Logging |
| Tracing | Jaeger, Tempo, Honeycomb, OTEL |
| Alerting / paging | PagerDuty, Opsgenie, Alertmanager |
| Feature flags | LaunchDarkly, Unleash, GrowthBook |
| Chaos | Chaos Monkey, Litmus, Gremlin |
Don’t learn them all. Learn one per slice well; the concepts transfer.
DORA metrics — how to tell if you’re “doing DevOps”
The DORA research (now annual “State of DevOps” report) distilled performance into four metrics:
| Metric | What it measures | Elite target |
|---|---|---|
| Deployment frequency | How often you ship | On-demand, multiple per day |
| Lead time for changes | Commit → production | Under 1 hour |
| Change failure rate | % deploys causing problems | 0–15 % |
| Time to restore service | Outage → resolution | Under 1 hour |
A fifth was added later: reliability (SLOs met).
These four together beat any single metric, because they trade off: you can ship fast with a high failure rate (cheating) or have rock-solid changes by never shipping (cheating in the other direction). Watch all four; optimising one at the expense of another is gaming the metric.
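The four metrics fall straight out of a deploy log. A toy sketch with made-up records (not a DORA tool — field names and data are invented for illustration):

```python
from datetime import datetime, timedelta

# Toy deploy log (made-up data): each record is
# (commit_time, deploy_time, caused_incident, restored_time)
deploys = [
    (datetime(2024, 5, 1, 9, 0),  datetime(2024, 5, 1, 9, 40),  False, None),
    (datetime(2024, 5, 1, 11, 0), datetime(2024, 5, 1, 11, 50), True,
     datetime(2024, 5, 1, 12, 20)),
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 2, 10, 30), False, None),
]
days_observed = 2

# Deployment frequency: deploys per day
deploy_frequency = len(deploys) / days_observed                    # 1.5/day

# Lead time for changes: commit -> production (median)
lead_times = sorted(d[1] - d[0] for d in deploys)
median_lead_time = lead_times[len(lead_times) // 2]                # 40 min

# Change failure rate: fraction of deploys causing incidents
change_failure_rate = sum(d[2] for d in deploys) / len(deploys)    # 1/3

# Time to restore service: deploy -> restored, for failed deploys only
restore_times = [d[3] - d[1] for d in deploys if d[2]]             # [30 min]
```

The hard part in practice is not the arithmetic but agreeing on what counts as "a deploy" and "an incident".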
GitOps — the purest form
GitOps is DevOps with a specific discipline: a git repo is the declared state of the system, and a controller (ArgoCD, Flux) continuously reconciles actual state to match.
Properties:
- No `kubectl apply` from laptops. Ever.
- Changes flow through PR → merge → controller.
- Drift is visible (diff of declared vs actual) and auto-corrected.
- Rollback is `git revert`.
See GitOps Fundamentals. This is where Idempotence and Declarative vs Imperative Automation become critical — GitOps only works because the controllers converge on declared state.
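Concretely, the "declared state + controller" pairing looks like an ArgoCD `Application` — repo URL, paths, and names below are placeholders:

```yaml
# ArgoCD Application — repo URL, paths, and names are placeholders
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploy.git
    targetRevision: main
    path: apps/my-service
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true      # delete resources removed from git
      selfHeal: true   # auto-correct drift back to declared state
```

`selfHeal` is the "drift is auto-corrected" property; `prune` makes deletion flow through git too.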
Deployment strategies
How do you actually put new code in production without downtime?
| Strategy | How it works | Rollback |
|---|---|---|
| Recreate | Stop v1, start v2 | Slow, with downtime |
| Rolling | Replace instances one-by-one | Revert & roll back the same way |
| Blue/Green | Two full environments; swap the load balancer | Swap back instantly |
| Canary | Send 1% → 5% → 25% → 100% of traffic to v2 | Stop the canary |
| Shadow / dark launch | v2 receives traffic but responses are thrown away | Turn off shadowing |
| Feature flags | Deploy v2 with features off; toggle per user / cohort | Flip flag off |
Modern prod is usually: rolling for infra (managed by k8s), canary + feature flags for features.
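The mechanism behind both canaries and per-cohort flags is usually deterministic percentage bucketing. A sketch — `in_rollout` is a hypothetical helper, not a real flag-library API:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Hypothetical helper: deterministic percentage bucketing.

    Hash (feature, user) into one of 100 stable buckets; the user is in
    the rollout if their bucket is below `percent`. Because buckets are
    stable, ramping 1% -> 5% -> 25% only ever adds users, never flips
    anyone back out mid-rollout.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# Ramping is monotonic: the 5% cohort is a subset of the 25% cohort
users = [f"user-{i}" for i in range(100)]
cohort_5 = {u for u in users if in_rollout(u, "new-checkout", 5)}
cohort_25 = {u for u in users if in_rollout(u, "new-checkout", 25)}
assert cohort_5 <= cohort_25
```

Hashing on `(feature, user)` rather than `user` alone keeps cohorts independent across features, so the same users aren't guinea pigs for everything.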
Culture, not tools
A common failure mode: a company buys Jenkins + Docker + Terraform, changes no incentives or structures, and declares itself "doing DevOps." It isn't.
The things that actually move the needle:
- Small batches. Merge small PRs often. Huge PRs = huge risk = slow review = slow feedback.
- Blameless postmortems. People aren’t the cause — systems are. Write postmortems that change systems.
- Stop-the-line culture. Anyone can halt a deploy or escalate a risk without fear.
- On-call sanity. Pages should be rare, actionable, and owned. A 3 AM page with no runbook is a system failure, not a human one.
- Automation as discipline. If you did it twice manually, the third time is a script.
Where DevOps ends and SRE / Platform begins
Three adjacent words that confuse people:
- DevOps — the cultural practice and set of engineering techniques.
- SRE (Site Reliability Engineering) — Google’s implementation of DevOps. Strong emphasis on SLOs, error budgets, toil reduction, running production as a software problem.
- Platform Engineering — building internal paved-road tooling so product teams don’t each reinvent CI/CD / infra. “Platform as a product.”
They overlap heavily. Describe your day-to-day work and I usually can't tell which title is on the door.
The honest reality
DevOps can absolutely regress. Signs:
- Shipping fast but the change failure rate climbs → tests are theatre, not covering real risk.
- Every service runs differently because “we had a reason” each time → you’ve lost standardisation; platform engineering is needed.
- On-call burnout → alert volume is too high, runbooks don’t exist, or failure modes aren’t being fixed.
- “DevOps team” sitting between Dev and Ops → congratulations, you invented a new wall.
DevOps is a direction, not a destination. You’re always one organisational change or one technology shift from regressing.