Datadog charges $15 per host per month for infrastructure monitoring, $12 per host per month for APM, and $0.10 per million log events. For a small team with 5 servers, that is $135/month for basic monitoring and alerting, or $1,620/year. The open-source alternative (Prometheus for metrics, Grafana for visualization, Loki for logs, and Alertmanager for notifications) provides equivalent functionality with no licensing cost: it runs on a single $5-$10/month VPS that can also monitor 10-20 target servers.
This is not a theoretical comparison. We run this exact stack at ZeonEdge to monitor our production infrastructure. This guide covers the complete deployment — from Docker Compose configuration to Grafana dashboards, PromQL alerting rules, and log aggregation — with exact configuration files that you can copy and deploy in under an hour.
Architecture Overview
The monitoring stack consists of four components that work together:
- Prometheus: Time-series database that scrapes metrics from targets every 15 seconds. Stores 15-30 days of data. Evaluates alerting rules. Memory usage: ~200 MB for 10 targets.
- Grafana: Dashboard and visualization platform. Queries Prometheus and Loki. Provides pre-built dashboards for Node Exporter, Docker, Nginx, and PostgreSQL. Memory usage: ~100 MB.
- Loki: Log aggregation system that is far lighter than an Elasticsearch-based stack. It indexes labels rather than log content, which keeps resource usage low. Memory usage: ~100 MB.
- Alertmanager: Receives alerts from Prometheus, deduplicates them, groups them, and routes them to notification channels (Slack, PagerDuty, email, Telegram). Memory usage: ~30 MB.
Total memory footprint: ~430 MB. This fits comfortably on a 1 GB VPS alongside the exporters, or on the same server as your application if you have 2+ GB RAM.
Docker Compose Deployment
# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "127.0.0.1:9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=5GB'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: your-strong-grafana-password
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: "https://monitor.yourdomain.com"
      GF_SMTP_ENABLED: "true"
      GF_SMTP_HOST: "smtp.yourdomain.com:587"
      GF_SMTP_USER: "alerts@yourdomain.com"
      GF_SMTP_PASSWORD: "smtp-password"
      GF_SMTP_FROM_ADDRESS: "alerts@yourdomain.com"
    networks:
      - monitoring

  loki:
    image: grafana/loki:3.0.0
    container_name: loki
    restart: unless-stopped
    ports:
      - "127.0.0.1:3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml
      - loki-data:/loki
    command: -config.file=/etc/loki/loki-config.yml
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "127.0.0.1:9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "127.0.0.1:9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:3.0.0
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail/promtail-config.yml:/etc/promtail/promtail-config.yml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/promtail-config.yml
    networks:
      - monitoring

volumes:
  prometheus-data:
  grafana-data:
  loki-data:

networks:
  monitoring:
    driver: bridge
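The compose file expects the configuration files shown in the following sections under ./prometheus, ./grafana/provisioning, ./loki, ./alertmanager, and ./promtail. Once they are in place, start everything with docker compose -f docker-compose.monitoring.yml up -d. One optional hardening step: instead of hard-coding the Grafana admin and SMTP passwords in the compose file, load them from an env file that stays out of version control. A minimal sketch, assuming a ./grafana/grafana.env file you create yourself:

# docker-compose.monitoring.yml (excerpt) - load secrets from an env file instead of inline values
  grafana:
    image: grafana/grafana:11.0.0
    env_file:
      - ./grafana/grafana.env   # e.g. GF_SECURITY_ADMIN_PASSWORD=..., GF_SMTP_PASSWORD=...
    environment:
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: "https://monitor.yourdomain.com"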
Prometheus Configuration
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alert-rules.yml'

scrape_configs:
  # Monitor Prometheus itself:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Server metrics (CPU, memory, disk, network):
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'  # Monitoring server
          - '10.0.0.2:9100'       # Production server 1
          - '10.0.0.3:9100'       # Production server 2
        labels:
          environment: 'production'

  # Docker container metrics:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Nginx metrics (requires nginx-prometheus-exporter):
  - job_name: 'nginx'
    static_configs:
      - targets: ['10.0.0.2:9113']

  # PostgreSQL metrics (requires postgres_exporter):
  - job_name: 'postgresql'
    static_configs:
      - targets: ['10.0.0.2:9187']
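The node, nginx, and postgresql jobs assume exporters are already listening on ports 9100, 9113, and 9187 on each target server; the compose file above only runs node-exporter on the monitoring host itself. The sketch below shows one way to run those exporters on a production server. The image tags, the stub_status URL, and the PostgreSQL connection string are assumptions to adjust for your environment (verify the flag syntax for your exporter versions), and the ports should be firewalled so that only the monitoring server can reach them.

# docker-compose.exporters.yml on each target server (sketch)
services:
  node-exporter:
    image: prom/node-exporter:v1.8.0
    restart: unless-stopped
    network_mode: host          # exposes :9100 so the monitoring server can scrape it
    pid: host
    volumes:
      - /:/host:ro,rslave
    command:
      - '--path.rootfs=/host'

  nginx-exporter:               # only if you scrape the 'nginx' job; requires stub_status enabled in Nginx
    image: nginx/nginx-prometheus-exporter:1.1.0
    restart: unless-stopped
    network_mode: host          # listens on :9113 by default
    command:
      - '--nginx.scrape-uri=http://127.0.0.1/stub_status'

  postgres-exporter:            # only if you scrape the 'postgresql' job
    image: prometheuscommunity/postgres-exporter:v0.15.0
    restart: unless-stopped
    network_mode: host          # listens on :9187 by default
    environment:
      DATA_SOURCE_NAME: "postgresql://monitor_user:monitor_pass@127.0.0.1:5432/postgres?sslmode=disable"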
Alerting Rules
# prometheus/alert-rules.yml
groups:
  - name: server-alerts
    rules:
      # High CPU usage for more than 5 minutes:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: 'CPU usage is {{ $value | printf "%.1f" }}% (threshold: 85%)'

      # Memory usage over 90%:
      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: 'Memory usage is {{ $value | printf "%.1f" }}%'

      # Disk space below 15%:
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: 'Available disk space is {{ $value | printf "%.1f" }}%'

      # Server unreachable:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target has been unreachable for 2 minutes"

      # SSL certificate expiring within 14 days (requires blackbox_exporter probes, see below):
      - alert: SSLCertExpiring
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL cert expiring for {{ $labels.instance }}"
          description: 'Certificate expires in {{ $value | printf "%.0f" }} days'

  - name: docker-alerts
    rules:
      # Container restart loop:
      - alert: ContainerRestarting
        expr: rate(container_restart_count[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is restarting"

      # Container using over 90% of its memory limit:
      - alert: ContainerHighMemory
        expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Container {{ $labels.name }} memory usage: {{ $value | printf "%.1f" }}%'
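The SSLCertExpiring rule relies on probe_ssl_earliest_cert_expiry, which only exists if something is probing your HTTPS endpoints. A minimal sketch using blackbox_exporter; the image tag and target URL are placeholders to adapt, and it assumes the exporter's default http_2xx module:

# 1) Add to docker-compose.monitoring.yml under services:
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    networks:
      - monitoring

# 2) Add to prometheus/prometheus.yml under scrape_configs:
  - job_name: 'blackbox-https'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://yourdomain.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # the exporter listens on 9115

With this in place, the probed URL appears as the instance label, which is what the alert's summary template expects.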
Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.yourdomain.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'smtp-password'
  smtp_require_tls: true
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  receiver: 'slack-default'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts: Slack + email
    - match:
        severity: critical
      receiver: 'critical-alerts'
      repeat_interval: 1h
    # Warning alerts: Slack only
    - match:
        severity: warning
      receiver: 'slack-default'
      repeat_interval: 4h

receivers:
  - name: 'slack-default'
    slack_configs:
      - channel: '#monitoring'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}
          {{ end }}'
        send_resolved: true
  - name: 'critical-alerts'
    slack_configs:
      - channel: '#alerts-critical'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}
          {{ end }}'
        send_resolved: true
    email_configs:
      - to: 'oncall@yourdomain.com'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
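Alertmanager supports the other channels mentioned earlier (PagerDuty, Telegram, plain email) the same way: add a receiver and point a route at it. A hedged sketch of a Telegram receiver, assuming you have already created a bot and know the target chat ID; both values below are placeholders:

# Optional extra receiver for alertmanager.yml (append under receivers:)
  - name: 'telegram-oncall'
    telegram_configs:
      - bot_token: '123456:ABC-your-bot-token'   # placeholder
        chat_id: -1001234567890                  # placeholder group/channel ID
        send_resolved: true

Route alerts to it by adding another child route under routes: with receiver: 'telegram-oncall'.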
Loki Configuration for Log Aggregation
# loki/loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

limits_config:
  retention_period: 720h  # 30 days
  max_query_length: 721h
  max_query_series: 5000
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  compaction_interval: 10m
  retention_delete_delay: 2h
  delete_request_store: filesystem  # Loki 3.x requires this when retention_enabled is true
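Loki can also evaluate LogQL alerting rules itself and forward the results to the same Alertmanager, which is handy for "alert on log patterns" cases (error spikes, OOM kills). This is optional; a minimal ruler sketch, assuming rule files are dropped under /loki/rules (the rules_directory configured above):

# Optional addition to loki-config.yml: evaluate LogQL alert rules and send them to Alertmanager
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /tmp/loki-rules        # scratch space used during evaluation
  alertmanager_url: http://alertmanager:9093
  enable_api: true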
# promtail/promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # System logs:
  - job_name: syslog
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          host: production-1
          __path__: /var/log/syslog

  # Nginx access logs:
  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          type: access
          __path__: /var/log/nginx/access.log

  # Docker container logs:
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}
      - json:
          expressions:
            stream: stream
            tag: attrs.tag
      - labels:
          stream:
          tag:
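This Promtail instance only ships logs from the monitoring server itself. To collect logs from the other production servers, run a Promtail on each of them pointing at this Loki. Because the compose file binds Loki to 127.0.0.1, expose it through your reverse proxy with authentication (or bind it to a private network interface) first. A minimal client sketch for a remote server; the URL and credentials are placeholders for whatever your proxy enforces:

# promtail-config.yml on a remote production server (sketch)
clients:
  - url: https://monitor.yourdomain.com/loki/api/v1/push
    basic_auth:
      username: promtail            # placeholder - whatever your reverse proxy expects
      password: strong-password
    external_labels:
      host: production-2            # distinguishes servers in LogQL queries

scrape_configs:
  - job_name: syslog
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          __path__: /var/log/syslog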
Grafana Data Source Provisioning
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: false
  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093
    editable: false
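Dashboards can be provisioned the same way instead of importing them by hand: add a provider file and drop exported dashboard JSON (for example the Node Exporter Full dashboard JSON from grafana.com) into the directory it points at. A minimal sketch; the provider name and path are arbitrary choices:

# grafana/provisioning/dashboards/dashboards.yml (sketch)
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards   # Grafana loads every *.json in this directory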
This monitoring stack provides complete observability (metrics, logs, and alerting) for the cost of a single $5-$10/month VPS. Deploy the Docker Compose stack, import the Node Exporter Full dashboard (ID 1860) and the Docker Monitoring dashboard (ID 893) in Grafana, and point Alertmanager at your Slack webhook; you will have production-grade monitoring running in under an hour. ZeonEdge deploys and maintains monitoring stacks for clients who need visibility into their infrastructure without enterprise pricing. Learn about our monitoring and observability services.
Alex Thompson
CEO & Cloud Architecture Expert at ZeonEdge with 15+ years building enterprise infrastructure.