Prometheus Integration
Comprehensive monitoring and observability setup for Kthena deployments.
Overview
Effective monitoring is crucial for maintaining reliable AI model inference services. This guide covers setting up comprehensive monitoring, alerting, and observability for Kthena deployments using industry-standard tools and practices.
Monitoring Stack
Core Components
- Prometheus - Metrics collection and storage
- Grafana - Visualization and dashboards
- Jaeger - Distributed tracing
- AlertManager - Alert routing and management
- Loki - Log aggregation (optional)
Architecture
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Kthena │───▶│ Prometheus │───▶│ Grafana │
│ Components │ │ │ │ Dashboards │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Jaeger │ │ AlertManager │ │ Loki │
│ Tracing │ │ Alerts │ │ Logs │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Prometheus Setup
Installation
Using Helm:
# Add Prometheus community helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus stack
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
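After the install completes, it is worth confirming the stack is healthy before wiring up Kthena metrics. A quick check, assuming the release name prometheus used above:

# Verify that Prometheus, Grafana, and Alertmanager pods are running
kubectl get pods -n monitoring

# Confirm the Prometheus and Grafana services were created
kubectl get svc -n monitoring | grep -E 'prometheus|grafana'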
Configuration
Prometheus Configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "/etc/prometheus/rules/*.yml"
scrape_configs:
- job_name: 'kthena-models'
kubernetes_sd_configs:
- role: pod
namespaces:
names: ['default', 'production', 'staging']
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- job_name: 'kthena-controllers'
static_configs:
- targets: ['kthena-controller:8080']
metrics_path: /metrics
scrape_interval: 30s
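The kthena-models job above keeps only pods that carry the standard prometheus.io annotations. As a sketch, a Kthena model pod (or its pod template) would need annotations along these lines; the port and path shown here are assumptions and must match whatever the model server actually exposes:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"   # optional; defaults to /metrics
    prometheus.io/port: "8080"       # assumed metrics port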
ServiceMonitor for Kthena
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: kthena-metrics
namespace: monitoring
labels:
app: kthena
spec:
selector:
matchLabels:
app: kthena
endpoints:
- port: metrics
interval: 15s
path: /metrics
honorLabels: true
namespaceSelector:
matchNames:
- default
- production
- staging
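The ServiceMonitor selects Services labeled app: kthena and scrapes the port named metrics, so the backing Service must expose a port with that name. A minimal sketch, with the port number assumed:

apiVersion: v1
kind: Service
metadata:
  name: kthena-model-server
  namespace: production
  labels:
    app: kthena            # must match spec.selector.matchLabels above
spec:
  selector:
    app: kthena
  ports:
    - name: metrics        # must match the endpoint port name in the ServiceMonitor
      port: 8080           # assumed metrics port
      targetPort: 8080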
Key Metrics
Model Performance Metrics
Inference Metrics:
# Request rate
kthena_inference_requests_total
# Request duration
kthena_inference_duration_seconds
# Request size
kthena_inference_request_size_bytes
# Response size
kthena_inference_response_size_bytes
# Error rate
kthena_inference_errors_total
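Assuming kthena_inference_requests_total is a counter and kthena_inference_duration_seconds is a histogram, typical queries over these metrics look like:

# Request rate per model over the last 5 minutes
sum(rate(kthena_inference_requests_total[5m])) by (model_id)

# P99 latency per model
histogram_quantile(0.99, sum(rate(kthena_inference_duration_seconds_bucket[5m])) by (le, model_id))

# Error ratio per model
sum(rate(kthena_inference_errors_total[5m])) by (model_id)
  / sum(rate(kthena_inference_requests_total[5m])) by (model_id)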
Model Loading Metrics:
# Model load time
kthena_model_load_duration_seconds
# Model load failures
kthena_model_load_failures_total
# Model memory usage
kthena_model_memory_usage_bytes
# Model status (1 = ready)
kthena_model_status
Resource Utilization:
# CPU usage
container_cpu_usage_seconds_total
# Memory usage
container_memory_usage_bytes
# GPU utilization
nvidia_gpu_utilization_percent
# GPU memory usage
nvidia_gpu_memory_usage_bytes
Custom Metrics Configuration
Model-specific Metrics:
apiVersion: v1
kind: ConfigMap
metadata:
name: kthena-metrics-config
data:
metrics.yaml: |
custom_metrics:
- name: model_accuracy_score
type: gauge
help: "Current model accuracy score"
labels: ["model_id", "version", "dataset"]
- name: inference_queue_depth
type: gauge
help: "Number of requests waiting in inference queue"
labels: ["model_id", "priority"]
- name: model_cache_hit_ratio
type: gauge
help: "Cache hit ratio for model predictions"
labels: ["model_id", "cache_type"]
- name: batch_processing_efficiency
type: histogram
help: "Efficiency of batch processing"
labels: ["model_id", "batch_size"]
buckets: [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 1.0]
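Once exported, these custom metrics can be queried like any other gauge or histogram. A few illustrative queries (the label values are placeholders):

# Current queue depth per model, highest first
sort_desc(sum(inference_queue_depth) by (model_id))

# Models whose prediction cache hit ratio dropped below 80%
model_cache_hit_ratio{cache_type="prediction"} < 0.8

# Average batch processing efficiency over the last hour
sum(rate(batch_processing_efficiency_sum[1h])) by (model_id)
  / sum(rate(batch_processing_efficiency_count[1h])) by (model_id)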
Grafana Dashboards
Installation and Configuration
Grafana Configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-config
data:
grafana.ini: |
[server]
root_url = https://grafana.company.com
[security]
admin_user = admin
admin_password = $__env{GRAFANA_ADMIN_PASSWORD}
[auth.generic_oauth]
enabled = true
name = OAuth
allow_sign_up = true
client_id = $__env{OAUTH_CLIENT_ID}
client_secret = $__env{OAUTH_CLIENT_SECRET}
scopes = openid profile email
auth_url = https://auth.company.com/oauth/authorize
token_url = https://auth.company.com/oauth/token
api_url = https://auth.company.com/oauth/userinfo
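Grafana also needs a Prometheus data source. If you use Grafana's standard provisioning mechanism, a sketch of the data source file looks like the following; the URL assumes the in-cluster service created by the kube-prometheus-stack install earlier and may differ in your cluster:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated.monitoring.svc:9090   # assumed in-cluster service
    isDefault: true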
Kthena Dashboard
Main Dashboard JSON:
{
"dashboard": {
"id": null,
"title": "Kthena Overview",
"tags": ["kthena", "ml", "inference"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Inference Requests/sec",
"type": "stat",
"targets": [
{
"expr": "sum(rate(kthena_inference_requests_total[5m]))",
"legendFormat": "Requests/sec"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 100},
{"color": "red", "value": 1000}
]
}
}
}
},
{
"id": 2,
"title": "Inference Latency (P95)",
"type": "stat",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(kthena_inference_duration_seconds_bucket[5m])) by (le))",
"legendFormat": "P95 Latency"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
}
}
}
},
{
"id": 3,
"title": "Error Rate",
"type": "stat",
"targets": [
{
"expr": "sum(rate(kthena_inference_errors_total[5m])) / sum(rate(kthena_inference_requests_total[5m])) * 100",
"legendFormat": "Error Rate %"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
}
}
}
},
{
"id": 4,
"title": "Active Models",
"type": "stat",
"targets": [
{
"expr": "count(kthena_model_status == 1)",
"legendFormat": "Active Models"
}
]
},
{
"id": 5,
"title": "Request Rate by Model",
"type": "graph",
"targets": [
{
"expr": "sum(rate(kthena_inference_requests_total[5m])) by (model_id)",
"legendFormat": "{{model_id}}"
}
],
"xAxis": {
"show": true
},
"yAxes": [
{
"label": "Requests/sec",
"show": true
}
]
},
{
"id": 6,
"title": "Memory Usage by Model",
"type": "graph",
"targets": [
{
"expr": "kthena_model_memory_usage_bytes / 1024 / 1024 / 1024",
"legendFormat": "{{model_id}}"
}
],
"yAxes": [
{
"label": "Memory (GB)",
"show": true
}
]
},
{
"id": 7,
"title": "GPU Utilization",
"type": "graph",
"targets": [
{
"expr": "nvidia_gpu_utilization_percent",
"legendFormat": "GPU {{gpu_id}} - {{model_id}}"
}
],
"yAxes": [
{
"label": "Utilization %",
"max": 100,
"show": true
}
]
},
{
"id": 8,
"title": "Autoscaling Activity",
"type": "graph",
"targets": [
{
"expr": "kube_deployment_status_replicas{deployment=~\".*-infer\"}",
"legendFormat": "{{deployment}} - Replicas"
}
],
"yAxes": [
{
"label": "Replica Count",
"show": true
}
]
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "30s"
}
}
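With kube-prometheus-stack, the Grafana dashboard sidecar (enabled by default) picks up dashboards from ConfigMaps labeled grafana_dashboard: "1". Assuming the JSON above is saved as kthena-overview.json, it can be loaded like this:

# Package the dashboard JSON as a ConfigMap the Grafana sidecar will discover
kubectl create configmap kthena-overview-dashboard \
  --from-file=kthena-overview.json \
  -n monitoring

kubectl label configmap kthena-overview-dashboard \
  grafana_dashboard="1" -n monitoring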
Model-specific Dashboard
Per-Model Dashboard:
{
"dashboard": {
"title": "Kthena Model: $model_id",
"templating": {
"list": [
{
"name": "model_id",
"type": "query",
"query": "label_values(kthena_inference_requests_total, model_id)",
"refresh": 1
}
]
},
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(kthena_inference_requests_total{model_id=\"$model_id\"}[5m]))",
"legendFormat": "Requests/sec"
}
]
},
{
"title": "Latency Distribution",
"type": "heatmap",
"targets": [
{
"expr": "sum(rate(kthena_inference_duration_seconds_bucket{model_id=\"$model_id\"}[5m])) by (le)",
"format": "heatmap",
"legendFormat": "{{le}}"
}
]
},
{
"title": "Model Accuracy Over Time",
"type": "graph",
"targets": [
{
"expr": "model_accuracy_score{model_id=\"$model_id\"}",
"legendFormat": "Accuracy"
}
]
}
]
}
}
Alerting
AlertManager Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
data:
alertmanager.yml: |
global:
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alerts@company.com'
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
- match:
service: kthena
receiver: 'kthena-team'
receivers:
- name: 'default'
email_configs:
- to: 'ops-team@company.com'
subject: 'Alert: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
{{ end }}
- name: 'critical-alerts'
slack_configs:
- channel: '#critical-alerts'
title: 'Critical Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
{{ end }}
email_configs:
- to: 'oncall@company.com'
subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
- name: 'kthena-team'
slack_configs:
- channel: '#ml-ops'
title: 'Kthena Alert: {{ .GroupLabels.alertname }}'
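To cut duplicate notifications, an inhibit rule can mute warning-level alerts while a critical alert for the same alert name and service is already firing. A sketch that would slot into the alertmanager.yml above:

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'service']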
Prometheus Alert Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: kthena-alerts
namespace: monitoring
spec:
groups:
- name: kthena.rules
rules:
# High-level service alerts
- alert: KthenaServiceDown
expr: up{job="kthena-models"} == 0
for: 1m
labels:
severity: critical
service: kthena
annotations:
summary: "Kthena service is down"
description: "Kthena service {{ $labels.instance }} has been down for more than 1 minute"
- alert: HighInferenceLatency
expr: histogram_quantile(0.95, sum(rate(kthena_inference_duration_seconds_bucket[5m])) by (le, model_id)) > 2
for: 5m
labels:
severity: warning
service: kthena
annotations:
summary: "High inference latency detected"
description: "Model {{ $labels.model_id }} has P95 latency of {{ $value }}s for more than 5 minutes"
- alert: HighErrorRate
expr: sum(rate(kthena_inference_errors_total[5m])) by (model_id) / sum(rate(kthena_inference_requests_total[5m])) by (model_id) > 0.05
for: 2m
labels:
severity: critical
service: kthena
annotations:
summary: "High error rate detected"
description: "Model {{ $labels.model_id }} has error rate of {{ $value | humanizePercentage }} for more than 2 minutes"
# Resource-based alerts
- alert: ModelHighMemoryUsage
expr: kthena_model_memory_usage_bytes / 1024 / 1024 / 1024 > 8
for: 10m
labels:
severity: warning
service: kthena
annotations:
summary: "Model using high memory"
description: "Model {{ $labels.model_id }} is using {{ $value }}GB of memory"
- alert: GPUHighUtilization
expr: nvidia_gpu_utilization_percent > 90
for: 15m
labels:
severity: warning
service: kthena
annotations:
summary: "GPU high utilization"
description: "GPU {{ $labels.gpu_id }} utilization is {{ $value }}% for model {{ $labels.model_id }}"
# Model-specific alerts
- alert: ModelLoadFailure
expr: increase(kthena_model_load_failures_total[5m]) > 0
for: 1m
labels:
severity: critical
service: kthena
annotations:
summary: "Model load failure"
description: "Model {{ $labels.model_id }} failed to load {{ $value }} times in the last 5 minutes"
- alert: ModelAccuracyDrop
expr: model_accuracy_score < 0.85
for: 10m
labels:
severity: warning
service: kthena
annotations:
summary: "Model accuracy drop detected"
description: "Model {{ $labels.model_id }} accuracy dropped to {{ $value | humanizePercentage }}"
# Autoscaling alerts
- alert: AutoscalingNotWorking
expr: kube_deployment_status_replicas{deployment=~".*-infer"} == kube_deployment_spec_replicas{deployment=~".*-infer"} and on(deployment) rate(kthena_inference_requests_total[5m]) > 10
for: 10m
labels:
severity: warning
service: kthena
annotations:
summary: "Autoscaling may not be working"
description: "Deployment {{ $labels.deployment }} has high load but is not scaling"
Distributed Tracing
Jaeger Setup
Jaeger Installation:
# Install Jaeger operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.42.0/jaeger-operator.yaml -n observability
# Create Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: kthena-jaeger
namespace: observability
spec:
strategy: production
storage:
type: elasticsearch
elasticsearch:
nodeCount: 3
storage:
size: 100Gi
resources:
requests:
memory: "2Gi"
cpu: "1"
limits:
memory: "4Gi"
cpu: "2"
EOF
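Once the operator reconciles the Jaeger resource, the query UI is exposed through a Service named after the instance. A quick way to reach it locally (the service name and port follow the operator's defaults):

# Check that the Jaeger components and Elasticsearch nodes are up
kubectl get pods -n observability

# Forward the query UI and open http://localhost:16686
kubectl port-forward svc/kthena-jaeger-query 16686:16686 -n observability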
Tracing Configuration
OpenTelemetry Collector:
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
jaeger:
protocols:
grpc:
endpoint: 0.0.0.0:14250
thrift_http:
endpoint: 0.0.0.0:14268
processors:
batch:
timeout: 1s
send_batch_size: 1024
resource:
attributes:
- key: service.name
value: kthena
action: upsert
- key: service.version
from_attribute: version
action: upsert
exporters:
jaeger:
endpoint: kthena-jaeger-collector:14250
tls:
insecure: true
prometheus:
endpoint: "0.0.0.0:8889"
namespace: kthena
const_labels:
service: kthena
service:
pipelines:
traces:
receivers: [otlp, jaeger]
processors: [batch, resource]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [batch, resource]
exporters: [prometheus]
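Workloads then export traces to the collector over OTLP. Assuming the collector is reachable through a Service named otel-collector (adjust to your deployment), the standard OpenTelemetry environment variables on a Kthena pod would look like:

env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector:4317"   # assumed collector Service name
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_SERVICE_NAME
    value: "kthena-model-server"          # illustrative service name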
Log Management
Loki Setup (Optional)
Loki Configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: loki-config
data:
loki.yaml: |
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
address: 127.0.0.1
ring:
kvstore:
store: inmemory
replication_factor: 1
final_sleep: 0s
chunk_idle_period: 5m
chunk_retain_period: 30s
schema_config:
configs:
- from: 2020-10-24
store: boltdb
object_store: filesystem
schema: v11
index:
prefix: index_
period: 168h
storage_config:
boltdb:
directory: /loki/index
filesystem:
directory: /loki/chunks
limits_config:
enforce_metric_name: false
reject_old_samples: true
reject_old_samples_max_age: 168h
Promtail Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: promtail-config
data:
promtail.yaml: |
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: kthena-logs
kubernetes_sd_configs:
- role: pod
namespaces:
names: ['default', 'production', 'staging']
pipeline_stages:
- docker: {}
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: kthena.*
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_label_model_id]
target_label: model_id
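With Promtail attaching pod, namespace, and model_id labels, Kthena logs can be queried from Grafana's Explore view using LogQL. Two illustrative queries (label values are placeholders):

# All logs from Kthena pods in production
{namespace="production", pod=~"kthena.*"}

# Rate of error lines for a specific model over 5 minutes
sum(rate({namespace="production", model_id="my-model"} |= "error" [5m]))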
Health Checks and SLIs/SLOs
Service Level Indicators (SLIs)
Key SLIs for Kthena:
slis:
availability:
description: "Percentage of successful inference requests"
query: "sum(rate(kthena_inference_requests_total{status!~'5..'}[5m])) / sum(rate(kthena_inference_requests_total[5m]))"
target: 99.9%
latency:
description: "95th percentile of inference request latency"
query: "histogram_quantile(0.95, sum(rate(kthena_inference_duration_seconds_bucket[5m])) by (le))"
target: "< 1s"
throughput:
description: "Number of inference requests per second"
query: "sum(rate(kthena_inference_requests_total[5m]))"
target: "> 100 req/s"
error_rate:
description: "Percentage of failed inference requests"
query: "sum(rate(kthena_inference_errors_total[5m])) / sum(rate(kthena_inference_requests_total[5m]))"
target: "< 0.1%"
Service Level Objectives (SLOs)
SLO Configuration:
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
name: kthena-slo
namespace: monitoring
spec:
service: "kthena"
labels:
team: "ml-ops"
slos:
- name: "inference-availability"
objective: 99.9
description: "99.9% of inference requests should be successful"
sli:
events:
error_query: sum(rate(kthena_inference_requests_total{status=~"5.."}[5m]))
total_query: sum(rate(kthena_inference_requests_total[5m]))
alerting:
name: KthenaHighErrorRate
labels:
severity: critical
annotations:
summary: "High error rate on Kthena inference requests"
- name: "inference-latency"
objective: 95.0
description: "95% of inference requests should complete within 1 second"
sli:
events:
error_query: sum(rate(kthena_inference_duration_seconds_count[5m])) - sum(rate(kthena_inference_duration_seconds_bucket{le="1.0"}[5m]))
total_query: sum(rate(kthena_inference_duration_seconds_count[5m]))
alerting:
name: KthenaHighLatency
labels:
severity: warning
annotations:
summary: "High latency on Kthena inference requests"
Monitoring Best Practices
Metric Naming Conventions
Follow Prometheus naming conventions:
- Use the _total suffix for counters
- Use _seconds for time measurements
- Use _bytes for size measurements
- Include units in metric names
- Use consistent label names across metrics
Dashboard Organization
Dashboard Structure:
- Overview Dashboard - High-level service health
- Model-specific Dashboards - Per-model metrics
- Infrastructure Dashboards - Resource utilization
- Troubleshooting Dashboards - Detailed debugging views
Alert Fatigue Prevention
Alert Guidelines:
- Set appropriate thresholds based on SLOs
- Use different severity levels (critical, warning, info)
- Implement alert suppression (silences) during maintenance windows, as shown below
- Group related alerts to reduce noise
- Include runbook links in alert annotations
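For planned maintenance, silences can be created ahead of time with amtool; the matchers, duration, and Alertmanager URL here are only examples:

# Silence all Kthena warning alerts for a 2-hour maintenance window
amtool silence add service=kthena severity=warning \
  --alertmanager.url=http://alertmanager-operated.monitoring.svc:9093 \
  --duration=2h \
  --comment="Scheduled model rollout"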
Troubleshooting Monitoring
Common Issues
Metrics Not Appearing:
# Check if metrics endpoint is accessible
kubectl port-forward svc/my-model-server 8080:8080
curl http://localhost:8080/metrics
# Verify ServiceMonitor configuration
kubectl get servicemonitor -n monitoring
kubectl describe servicemonitor kthena-metrics -n monitoring
# Check Prometheus targets
kubectl port-forward svc/prometheus-operated 9090:9090
# Visit http://localhost:9090/targets
High Cardinality Issues:
# Check metric cardinality
curl -s http://prometheus:9090/api/v1/label/__name__/values | jq '.data[]' | grep kthena | wc -l
# Identify high cardinality metrics
curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query={__name__=~"kthena.*"}' | jq '.data.result | length'
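To see which Kthena metrics contribute the most series, a topk over per-metric series counts is usually more actionable than the raw totals above:

# Top 10 kthena metrics by series count
curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~"kthena.*"}))' | jq '.data.result'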