Cloud & Infrastructure

Production Monitoring Stack in 2026: Prometheus, Grafana, Loki, and Alertmanager on a Single VPS

You do not need Datadog at $15/host/month to monitor your infrastructure. Prometheus, Grafana, Loki, and Alertmanager run comfortably on a single 2 GB VPS and monitor everything — server metrics, application performance, Docker containers, log aggregation, and intelligent alerting with PagerDuty/Slack/email integration.


Alex Thompson

CEO & Cloud Architecture Expert at ZeonEdge with 15+ years building enterprise infrastructure.

February 9, 2026
24 min read

Datadog charges $15 per host per month for infrastructure monitoring, $12 per host per month for APM, and $0.10 per million log events on top. For a small team with 5 servers, that works out to $135/month, or $1,620/year, before log ingestion is even counted. The open-source alternative (Prometheus for metrics, Grafana for visualization, Loki for logs, and Alertmanager for notifications) provides equivalent functionality at zero license cost, running on a single $5-$10/month VPS that can comfortably monitor 10-20 target servers.

This is not a theoretical comparison. We run this exact stack at ZeonEdge to monitor our production infrastructure. This guide covers the complete deployment — from Docker Compose configuration to Grafana dashboards, PromQL alerting rules, and log aggregation — with exact configuration files that you can copy and deploy in under an hour.

Architecture Overview

The monitoring stack consists of four components that work together:

  • Prometheus: Time-series database that scrapes metrics from targets every 15 seconds. Stores 15-30 days of data. Evaluates alerting rules. Memory usage: ~200 MB for 10 targets.
  • Grafana: Dashboard and visualization platform. Queries Prometheus and Loki. Provides pre-built dashboards for Node Exporter, Docker, Nginx, and PostgreSQL. Memory usage: ~100 MB.
  • Loki: Log aggregation system that fills the same role as an Elasticsearch-based stack at a fraction of the resource cost. It indexes only labels, not log content, which is what keeps it so light. Memory usage: ~100 MB.
  • Alertmanager: Receives alerts from Prometheus, deduplicates them, groups them, and routes them to notification channels (Slack, PagerDuty, email, Telegram). Memory usage: ~30 MB.

Total memory footprint: ~430 MB. This fits comfortably on a 1 GB VPS alongside the exporters, or on the same server as your application if you have 2+ GB RAM.

Docker Compose Deployment

# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "127.0.0.1:9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=5GB'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: your-strong-grafana-password
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: "https://monitor.yourdomain.com"
      GF_SMTP_ENABLED: "true"
      GF_SMTP_HOST: "smtp.yourdomain.com:587"
      GF_SMTP_USER: "alerts@yourdomain.com"
      GF_SMTP_PASSWORD: "smtp-password"
      GF_SMTP_FROM_ADDRESS: "alerts@yourdomain.com"
    networks:
      - monitoring

  loki:
    image: grafana/loki:3.0.0
    container_name: loki
    restart: unless-stopped
    ports:
      - "127.0.0.1:3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml
      - loki-data:/loki
    command: -config.file=/etc/loki/loki-config.yml
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "127.0.0.1:9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "127.0.0.1:9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:3.0.0
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail/promtail-config.yml:/etc/promtail/promtail-config.yml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/promtail-config.yml
    networks:
      - monitoring

volumes:
  prometheus-data:
  grafana-data:
  loki-data:

networks:
  monitoring:
    driver: bridge
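
One optional hardening step: if the stack shares a server with your application, cap each service's memory so a heavy query cannot starve the host. A sketch of per-service limits you could merge into the same file (the numbers are illustrative, derived from the rough footprints listed above; mem_limit is the compose-spec shorthand, and deploy.resources.limits is the equivalent long form):

# docker-compose.monitoring.yml (optional additions to the existing services)
services:
  prometheus:
    mem_limit: 512m   # headroom above the ~200 MB baseline for queries and rule evaluation
  grafana:
    mem_limit: 256m
  loki:
    mem_limit: 256m
  alertmanager:
    mem_limit: 64m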

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alert-rules.yml'

scrape_configs:
  # Monitor Prometheus itself:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Server metrics (CPU, memory, disk, network):
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'           # Monitoring server
          - '10.0.0.2:9100'                # Production server 1
          - '10.0.0.3:9100'                # Production server 2
        labels:
          environment: 'production'

  # Docker container metrics:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Nginx metrics (requires nginx-prometheus-exporter):
  - job_name: 'nginx'
    static_configs:
      - targets: ['10.0.0.2:9113']

  # PostgreSQL metrics (requires postgres_exporter):
  - job_name: 'postgresql'
    static_configs:
      - targets: ['10.0.0.2:9187']
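
The node, nginx, and postgresql jobs above expect exporters to already be running on the target servers. A minimal per-server sketch, assuming Docker on each target, an Nginx stub_status endpoint at http://127.0.0.1:8081/stub_status, and a dedicated read-only PostgreSQL user for the exporter (the addresses, port, and credentials are placeholders to adapt). Because these exporters use host networking, restrict ports 9100, 9113, and 9187 to the monitoring server's IP with your firewall:

# docker-compose.exporters.yml (runs on each monitored server: 10.0.0.2, 10.0.0.3, ...)
services:
  node-exporter:
    image: prom/node-exporter:v1.8.0
    restart: unless-stopped
    network_mode: host
    pid: host
    volumes:
      - /:/host:ro,rslave
    command:
      - '--path.rootfs=/host'

  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:1.1.0
    restart: unless-stopped
    network_mode: host   # needs host networking to reach nginx on 127.0.0.1
    command:
      - '--nginx.scrape-uri=http://127.0.0.1:8081/stub_status'

  postgres-exporter:
    image: quay.io/prometheuscommunity/postgres-exporter:v0.15.0
    restart: unless-stopped
    network_mode: host
    environment:
      DATA_SOURCE_NAME: "postgresql://exporter_user:exporter_password@127.0.0.1:5432/postgres?sslmode=disable"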

Alerting Rules

# prometheus/alert-rules.yml
groups:
  - name: server-alerts
    rules:
      # High CPU usage for more than 5 minutes:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf "%.1f" }}% (threshold: 85%)"

      # Memory usage over 90%:
      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf "%.1f" }}%"

      # Disk space below 15%:
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Available disk space is {{ $value | printf "%.1f" }}%"

      # Server unreachable:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target has been unreachable for 2 minutes"

      # SSL certificate expiring within 14 days
      # (requires blackbox_exporter probes; see the sketch after this rules file):
      - alert: SSLCertExpiring
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL cert expiring for {{ $labels.instance }}"
          description: 'Certificate expires in {{ $value | printf "%.0f" }} days'

  - name: docker-alerts
    rules:
      # Container restart loop:
      - alert: ContainerRestarting
        expr: rate(container_restart_count[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is restarting"

      # Container using over 90% of its memory limit
      # (the != 0 filter skips containers with no limit set, whose limit reports as 0):
      - alert: ContainerHighMemory
        expr: (container_memory_usage_bytes / (container_spec_memory_limit_bytes != 0)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Container {{ $labels.name }} memory usage: {{ $value | printf "%.1f" }}%'

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.yourdomain.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'smtp-password'
  smtp_require_tls: true
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  receiver: 'slack-default'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts: Slack + email
    - match:
        severity: critical
      receiver: 'critical-alerts'
      repeat_interval: 1h

    # Warning alerts: Slack only
    - match:
        severity: warning
      receiver: 'slack-default'
      repeat_interval: 4h

receivers:
  - name: 'slack-default'
    slack_configs:
      - channel: '#monitoring'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
        send_resolved: true

  - name: 'critical-alerts'
    slack_configs:
      - channel: '#alerts-critical'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
        send_resolved: true
    email_configs:
      - to: 'oncall@yourdomain.com'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
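
The architecture section lists PagerDuty as a notification channel, but the receivers above only cover Slack and email. Alertmanager supports PagerDuty natively; a sketch, assuming an Events API v2 integration key from your PagerDuty service (the key string below is a placeholder), appended to the critical-alerts receiver:

# Appended to the critical-alerts receiver in alertmanager/alertmanager.yml:
    pagerduty_configs:
      - routing_key: 'your-pagerduty-events-api-v2-key'
        severity: 'critical'
        send_resolved: true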

Loki Configuration for Log Aggregation

# loki/loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

limits_config:
  retention_period: 720h  # 30 days
  max_query_length: 721h
  max_query_series: 5000
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  compaction_interval: 10m
  retention_delete_delay: 2h
  delete_request_store: filesystem  # required for the retention/delete machinery in Loki 3.x

Promtail is the shipping side of log aggregation: it runs on every server you want logs from, tails the system, Nginx, and Docker log files, and pushes them to Loki.

# promtail/promtail-config.yml
server:
  http_listen_port: 9080

positions:
  # Promtail's read offsets; persist this path with a volume if you want to
  # avoid re-reading logs from the start after a container restart.
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # System logs:
  - job_name: syslog
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          host: production-1
          __path__: /var/log/syslog

  # Nginx access logs:
  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          type: access
          __path__: /var/log/nginx/access.log

  # Docker container logs:
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}
      - json:
          expressions:
            stream: stream
            tag: attrs.tag
      - labels:
          stream:
          tag:
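
The nginx job above ships raw access-log lines, which already supports full-text search in Grafana. If you also want to filter by status code or method in LogQL, Promtail can promote those fields to labels with a regex pipeline stage. A sketch assuming the default combined log format (adjust the expression for a custom log_format, and avoid promoting high-cardinality fields such as the request path):

# Drop-in replacement for the nginx job in promtail/promtail-config.yml:
  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          type: access
          __path__: /var/log/nginx/access.log
    pipeline_stages:
      - regex:
          # combined format: IP - user [time] "METHOD /path HTTP/x.x" STATUS BYTES ...
          expression: '^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) \S+ \S+" (?P<status>\d{3})'
      - labels:
          status:
          method: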

Grafana Data Source Provisioning

# grafana/provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: false

  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093
    editable: false
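
Data sources are not the only thing Grafana can provision. Dashboards exported as JSON (the Node Exporter Full dashboard, ID 1860, for example) can be loaded from disk at startup, so a rebuilt container comes back with its dashboards intact. A sketch, assuming the JSON files are placed in ./grafana/provisioning/dashboards/json on the host, which the compose file already mounts into the container:

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'infrastructure'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards/json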

This monitoring stack provides complete observability (metrics, logs, and alerting) for the cost of a single $5-$10/month VPS. Deploy the Docker Compose stack, import the Node Exporter Full dashboard (ID 1860) and the Docker Monitoring dashboard (ID 893) in Grafana, point Alertmanager at your Slack channel, and you have production-grade monitoring running in under an hour. ZeonEdge deploys and maintains monitoring stacks for clients who need visibility into their infrastructure without enterprise pricing. Learn about our monitoring and observability services.

