Production Operations Runbook

The complete reference for deploying, monitoring, troubleshooting, and maintaining the Python production blueprint

Purpose

This is the operational reference for running python-production-blueprint in production. Keep it updated as your system evolves.

Quick Reference

Item        Value
Repository  github.com/jinnabaalu/python-production-blueprint
Port        8000
Health      GET /health
Readiness   GET /ready
Metrics     GET /metrics
API Docs    GET /docs (staging only)
Log format  JSON (structlog)
Tracing     OpenTelemetry → Jaeger

Deployment Checklist

Before deploying to production:

  • All CI checks pass (tests, SAST, secret scan, dependency audit)
  • Docker image built and scanned by Trivy
  • No CRITICAL or HIGH vulnerabilities
  • Vault secrets configured (AppRole credentials)
  • Environment variables set correctly
  • Health check endpoints responding
  • Monitoring dashboards updated
  • Rollback plan documented
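
The "No CRITICAL or HIGH vulnerabilities" item can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the scan was saved as a JSON report (for example with trivy image --format json) and that the report has a top-level "Results" list whose entries may carry a "Vulnerabilities" list. The helper name blocking_findings is hypothetical and not part of the blueprint repository.

```python
import json

# Severities that should block a production deploy, per the checklist.
BLOCKING = {"CRITICAL", "HIGH"}


def blocking_findings(report):
    """Return (target, vulnerability id, severity) for each blocking finding.

    Assumes the report layout described in the lead-in; adjust the keys
    if your Trivy version emits a different structure.
    """
    findings = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" may be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING:
                findings.append(
                    (
                        result.get("Target", "?"),
                        vuln.get("VulnerabilityID", "?"),
                        vuln["Severity"],
                    )
                )
    return findings


def load_report(path):
    """Read a JSON report from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

In a CI step, something like blocking_findings(load_report("trivy-report.json")) could be printed and the job failed when the list is non-empty; the file name here is only an example.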

Environment Variables

# Required
APP_NAME=python-production-blueprint
APP_ENV=production
APP_VERSION=0.2.0

# Logging
LOG_FORMAT=json
LOG_LEVEL=INFO
LOG_FILE_ENABLED=true          # true for Marathon, optional for K8s
LOG_FILE_PATH=/var/log/app/app.log

# OpenTelemetry
OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=python-production-blueprint

# Vault
VAULT_ENABLED=true
VAULT_URL=http://vault:8200
VAULT_AUTH_METHOD=approle
VAULT_MOUNT_POINT=secret
VAULT_SECRET_PATH=python-production-blueprint
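
A quick pre-deploy check can catch missing variables before the container starts. This is a sketch only: the variable names come from the list above, but treating the OpenTelemetry, Vault, and file-logging variables as required only when their *_ENABLED flag is "true" is an assumption, and missing_variables is a hypothetical helper, not something the blueprint ships.

```python
import os

# Marked "Required" in the list above.
REQUIRED = ("APP_NAME", "APP_ENV", "APP_VERSION")

# Assumption: these only matter when the matching feature flag is "true".
CONDITIONAL = {
    "OTEL_ENABLED": ("OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_SERVICE_NAME"),
    "VAULT_ENABLED": (
        "VAULT_URL",
        "VAULT_AUTH_METHOD",
        "VAULT_MOUNT_POINT",
        "VAULT_SECRET_PATH",
    ),
    "LOG_FILE_ENABLED": ("LOG_FILE_PATH",),
}


def missing_variables(env):
    """Return the names of expected variables that are unset or blank."""
    needed = list(REQUIRED)
    for flag, dependents in CONDITIONAL.items():
        if env.get(flag, "").strip().lower() == "true":
            needed.extend(dependents)
    return [name for name in needed if not env.get(name, "").strip()]


def check_current_environment():
    """Convenience wrapper around the real process environment."""
    return missing_variables(os.environ)
```

Passing a plain dict instead of os.environ makes the same check usable against a parsed env_file or ConfigMap.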

Deploy Commands

Docker Compose

docker compose up -d --build
docker compose logs -f app

Marathon

curl -X PUT http://marathon.example.com/v2/apps/python-app \
  -H "Content-Type: application/json" \
  -d @deploy/marathon/marathon-app.json

Docker Swarm

docker stack deploy -c deploy/docker-swarm/docker-stack.yml python-app
docker service ps python-app_app

Kubernetes

kubectl apply -f deploy/kubernetes/
kubectl rollout status deployment/python-production-blueprint -n python-app
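
Whichever platform is used, a deploy is only finished once the readiness endpoint from the Quick Reference answers. The sketch below polls GET /ready until it returns 200 or a retry budget runs out. The function names, the retry defaults, and the example URL are assumptions for illustration.

```python
import time
import urllib.request


def is_ready(url, timeout=2.0):
    """Return True if GET url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError (non-2xx), and timeouts all derive from OSError.
        return False


def wait_until_ready(probe, attempts=30, delay=2.0, sleep=time.sleep):
    """Call probe() until it returns True.

    Returns the attempt number that succeeded, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

A typical call would be wait_until_ready(lambda: is_ready("http://localhost:8000/ready")); a None result is a reasonable trigger for the rollback procedures that follow.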

Rollback Procedures

Docker Swarm

docker service update --rollback python-app_app

Kubernetes

kubectl rollout undo deployment/python-production-blueprint -n python-app
kubectl rollout history deployment/python-production-blueprint -n python-app

Marathon

# Deploy previous version
curl -X PUT http://marathon.example.com/v2/apps/python-app \
  -H "Content-Type: application/json" \
  -d '{"container":{"docker":{"image":"your-registry/python-production-blueprint:0.1.0"}}}'

Troubleshooting

App not starting

# Check container logs
docker logs python-app-dev
kubectl logs -l app=python-production-blueprint -n python-app

# Common causes:
# - Vault unreachable → set VAULT_ENABLED=false to bypass temporarily
# - Port conflict → check APP_PORT
# - Missing env vars → check ConfigMap/env_file

Health check failing

# Test manually
curl -v http://localhost:8000/health

# If connection refused (status 000 with curl -w '%{http_code}') → app not started, check logs
# If 500 → app error, check structured logs for stack trace
# If timeout → check CPU/memory limits, app may be overloaded
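
The three cases above can also be told apart programmatically, which is handy inside a smoke-test script. This is a rough sketch using only the standard library; classify_failure and diagnose_health are hypothetical names, and the hint strings simply restate the guidance above.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc):
    """Map an exception raised by urlopen to one of the hints above."""
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code >= 500:
            return f"HTTP {exc.code}: app error, check structured logs for a stack trace"
        return f"HTTP {exc.code}: unexpected status"
    # Connection-level problems are usually wrapped in URLError.reason.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout: check CPU/memory limits, app may be overloaded"
    if isinstance(reason, ConnectionRefusedError):
        return "connection refused: app not started, check logs"
    return f"network error: {reason}"


def diagnose_health(url="http://localhost:8000/health", timeout=5.0):
    """Probe the health endpoint and return a one-line diagnosis."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}: health endpoint responded"
    except OSError as exc:
        # HTTPError, URLError, and timeouts all derive from OSError.
        return classify_failure(exc)
```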

Missing logs

# File logging requires LOG_FILE_ENABLED=true; trigger a log entry to test it
curl http://localhost:8000/health

# Check the log file directly
docker exec python-app-dev cat /var/log/app/app.log

# Check Vector is running
docker ps | grep vector

# Check Vector's health endpoint (Vector API on port 8686)
curl http://vector:8686/health

High memory usage

# Check container metrics
docker stats python-app-dev

# Kubernetes
kubectl top pods -n python-app

# Common causes:
# - Too many workers → reduce APP_WORKERS
# - Memory leak → check for unclosed connections
# - Log buffer overflow → check LOG_FILE_MAX_BYTES

Vault connection errors

# Test Vault connectivity
curl http://vault:8200/v1/sys/health

# Check AppRole credentials
curl -X POST http://vault:8200/v1/auth/approle/login \
  -d '{"role_id":"your-role-id","secret_id":"your-secret-id"}'

# Bypass Vault temporarily (set in the app's environment, then restart the app)
VAULT_ENABLED=false
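
The /v1/sys/health call reports Vault's state through its HTTP status code, so a bare "is it up" check can hide a sealed or standby node. The sketch below maps the default status codes to their meaning. The function names are hypothetical, and the base URL is the VAULT_URL value from the environment section.

```python
import urllib.error
import urllib.request

# Default status codes returned by Vault's sys/health endpoint.
VAULT_HEALTH = {
    200: "initialized, unsealed, active",
    429: "unsealed, standby",
    472: "disaster recovery secondary",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}


def describe_vault_status(code):
    """Translate a sys/health status code into a short description."""
    return VAULT_HEALTH.get(code, f"unexpected status {code}")


def vault_status(base_url="http://vault:8200", timeout=3.0):
    """Query sys/health and describe the result."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/sys/health", timeout=timeout) as resp:
            code = resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx health states (standby, sealed, ...) arrive as HTTPError.
        code = exc.code
    except OSError as exc:
        return f"unreachable: {exc}"
    return describe_vault_status(code)
```

A "sealed" or "not initialized" result points at Vault itself rather than at the AppRole credentials.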

Monitoring

Key Metrics to Watch

Metric               Healthy Range  Alert Threshold
http_requests_total  Varies         Sudden drop = outage
Request latency p99  < 500ms        > 1s
Error rate (5xx)     < 0.1%         > 1%
CPU usage            < 70%          > 85%
Memory usage         < 80%          > 90%
Pod restarts         0              > 2 in 5 min
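
The alert thresholds in this table translate directly into a small check that a script or test can run against sampled values. This is a sketch under assumptions: the dictionary keys (latency_p99_ms and so on) are invented for illustration and are not metric names exported by the app, and the "sudden drop" rule for http_requests_total is left out because it needs a time series rather than a single sample.

```python
# Alert thresholds from the table above; a value strictly above the limit alerts.
# Key names are hypothetical, chosen only for this example.
THRESHOLDS = {
    "latency_p99_ms": 1000.0,   # > 1s
    "error_rate_5xx_pct": 1.0,  # > 1%
    "cpu_pct": 85.0,            # > 85%
    "memory_pct": 90.0,         # > 90%
    "pod_restarts_5m": 2,       # > 2 in 5 min
}


def error_rate_pct(total_requests, errors_5xx):
    """5xx responses as a percentage of all requests (0.0 when idle)."""
    if total_requests == 0:
        return 0.0
    return 100.0 * errors_5xx / total_requests


def breached(sample):
    """Return the sorted names of metrics in `sample` that exceed their threshold."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if name in sample and sample[name] > limit
    )
```

For example, a sample with a 1200 ms p99 and 95% memory usage would return both names, while values sitting exactly on a threshold do not alert.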

Dashboards

  • Prometheus: http://prometheus:9090 — raw metrics
  • Jaeger: http://jaeger:16686 — distributed traces
  • Kibana: http://kibana:5601 — log search and analysis
  • Grafana: Import dashboards for FastAPI + Vector

Log Search (Kibana)

# Find errors
level:ERROR AND service:python-production-blueprint

# Trace a specific request
request_id:"abc-123-def"

# Find slow requests
duration_ms:>1000

# Find recent deploys
event:application_starting
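
When Kibana is unreachable, the same questions can be asked of the JSON log file directly. The sketch below assumes one JSON object per line and the field names used in the queries above (level, service, request_id, duration_ms). The helper names are hypothetical, and the level comparison is case-insensitive because the exact casing in the log file is not confirmed here.

```python
import json


def parse_lines(lines):
    """Yield one dict per valid JSON log line, skipping blanks and non-JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def find_errors(records, service):
    """Equivalent of: level:ERROR AND service:<service>."""
    return [
        r for r in records
        if str(r.get("level", "")).upper() == "ERROR" and r.get("service") == service
    ]


def find_request(records, request_id):
    """Equivalent of: request_id:"<id>"."""
    return [r for r in records if r.get("request_id") == request_id]


def find_slow(records, threshold_ms=1000):
    """Equivalent of: duration_ms:>1000."""
    return [
        r for r in records
        if isinstance(r.get("duration_ms"), (int, float)) and r["duration_ms"] > threshold_ms
    ]
```

With the file logging path from the environment section, list(parse_lines(open("/var/log/app/app.log"))) gives the records to filter.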

Scaling

Horizontal Scaling

# Kubernetes
kubectl scale deployment/python-production-blueprint --replicas=4 -n python-app

# Docker Swarm
docker service scale python-app_app=4

# Marathon
curl -X PUT http://marathon.example.com/v2/apps/python-app \
  -H "Content-Type: application/json" \
  -d '{"instances": 4}'

Vertical Scaling

Adjust resource limits in deployment manifests:

# Kubernetes
resources:
  requests:
    cpu: 200m      # Up from 100m
    memory: 256Mi  # Up from 128Mi
  limits:
    cpu: 1000m     # Up from 500m
    memory: 512Mi  # Up from 256Mi

Maintenance Windows

Scheduled Maintenance Checklist

  • Notify stakeholders
  • Scale up before patching (extra capacity)
  • Apply updates to one node/pod at a time
  • Verify health checks pass after each update
  • Monitor error rates for 15 minutes
  • Scale back to normal capacity
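
The "one node/pod at a time, verify after each" steps amount to a loop that stops at the first unhealthy target. The following is a schematic sketch only: rolling_update is a hypothetical helper, and the update and healthy callables stand in for whatever your platform provides (a patch command, a probe of GET /health, and so on).

```python
def rolling_update(targets, update, healthy):
    """Update targets one at a time and stop at the first failed health check.

    Returns (updated_ok, failed): the targets that were updated and passed
    their check, and the target that failed (None if all succeeded).
    """
    updated_ok = []
    for target in targets:
        update(target)
        if not healthy(target):
            # Leave the remaining targets untouched so capacity is preserved.
            return updated_ok, target
        updated_ok.append(target)
    return updated_ok, None
```

Stopping early keeps the untouched nodes serving traffic while the failed one is investigated or rolled back.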

Architecture Summary

Client → Ingress/HAProxy → FastAPI App (N replicas)
                                  ↓
                          Structured Logs → Vector Agent
                          Traces → Jaeger (OTLP)       ↓
                          Metrics → Prometheus      Kafka
                                                       ↓
                                              Vector Aggregator
                                                       ↓
                                              Elasticsearch → Kibana

Course Complete

You’ve built a production-grade Python application from scratch — covering API design, configuration, secrets management, structured logging, distributed tracing, log pipelines, testing, security scanning, multi-platform deployment, and operational maintenance.

The python-production-blueprint repository is your reference implementation. Fork it, adapt it to your stack, and ship with confidence.