Version: Next

Router Observability

Overview & Purpose

The kthena-router serves as the central data-plane gateway for all inference traffic in the Kthena LLM inference platform.
It is responsible for request routing, load balancing, scheduling, fairness queuing, rate limiting, token accounting, and (when applicable) disaggregated prefill/decode forwarding.

Without strong observability, answering questions such as:

  • Why is this model slow?
  • Which users are being unfairly delayed?
  • Where are the 5xx errors coming from?
  • Is the scheduler making good pod selections?
  • Are we hitting rate limits or resource exhaustion?

becomes extremely difficult and time-consuming.

This observability framework provides production-grade visibility through three main channels:

  1. Prometheus metrics — quantitative signals for dashboards, alerting, and trending
  2. Structured access logs — rich per-request forensic details
  3. Debug endpoints — instant insight into routing configuration and system state

Together they enable fast root-cause analysis, performance tuning, capacity planning, cost monitoring, and abuse detection.

Metrics

Endpoint

  • Metrics Port: 8080 (default)
  • Metrics Path: /metrics

Note: The Prometheus metrics are exposed on port 8080 by default.
The debug endpoints (/debug/config_dump/*) are served on port 15000.

Core Request & Latency Metrics

| Metric Name | Type | Description | Labels | Buckets (seconds) |
| --- | --- | --- | --- | --- |
| kthena_router_requests_total | Counter | Total requests processed | model, path, status_code, error_type | - |
| kthena_router_request_duration_seconds | Histogram | End-to-end latency (client → response) | model, path, status_code | 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60 |
| kthena_router_request_prefill_duration_seconds | Histogram | Prefill (prompt processing) phase duration | model, path, status_code | same as above |
| kthena_router_request_decode_duration_seconds | Histogram | Decode (token generation) phase duration | model, path, status_code | same as above |
| kthena_router_active_downstream_requests | Gauge | Currently active client requests | model | - |
| kthena_router_active_upstream_requests | Gauge | Currently active requests to inference pods | model_route, model_server | - |

Token & Usage Metrics

| Metric Name | Type | Description | Labels |
| --- | --- | --- | --- |
| kthena_router_tokens_total | Counter | Total tokens processed (input + output) | model, path, token_type (input/output) |

Scheduler & Fairness Metrics

| Metric Name | Type | Description | Labels | Buckets (seconds) |
| --- | --- | --- | --- | --- |
| kthena_router_scheduler_plugin_duration_seconds | Histogram | Execution time per scheduler plugin | model, plugin, type | 0.001, 0.005, 0.01, 0.05, 0.1, 0.5 |
| kthena_router_fairness_queue_size | Gauge | Current queued requests per model/user | model, user_id | - |
| kthena_router_fairness_queue_duration_seconds | Histogram | Time spent waiting in fairness/priority queue | model, user_id | 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5 |

Rate Limiting & Protection

| Metric Name | Type | Description | Labels |
| --- | --- | --- | --- |
| kthena_router_rate_limit_exceeded_total | Counter | Requests rejected due to rate limiting | model, limit_type, path |

Access Logs

{
  "timestamp": "2026-01-09T14:35:22.147Z",
  "method": "POST",
  "path": "/v1/chat/completions",
  "protocol": "HTTP/1.1",
  "status_code": 200,
  "model_name": "llama3-70b-instruct",
  "model_route": "prod/llama3-70b-route",
  "model_server": "prod/llama3-70b-server",
  "selected_pod": "llama3-70b-deployment-7b9f4c2d-kjx9p",
  "request_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "input_tokens": 412,
  "output_tokens": 189,
  "duration_total_ms": 3840,
  "duration_queue_ms": 180,
  "duration_request_processing_ms": 65,
  "duration_upstream_ms": 3480,
  "duration_response_processing_ms": 115,
  "error": null
}
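
In the sample entry above, the four phase durations add up to duration_total_ms (180 + 65 + 3480 + 115 = 3840), which suggests they partition the request lifetime. Assuming that holds in general, and assuming the router emits each entry as a single JSON line, a rough sketch like the following reports what share of each request was spent queued versus waiting on the upstream pod. The function name phase_breakdown is hypothetical, and jq must be installed.

```shell
# Hypothetical helper: per-request share of time spent in queue and upstream.
# Reads router log lines on stdin; lines that are not JSON objects are skipped.
phase_breakdown() {
  jq -R -r '
    fromjson?                                   # ignore non-JSON lines
    | select(type == "object")                  # ignore bare JSON scalars
    | select((.duration_total_ms // 0) > 0)     # avoid division by zero
    | "\(.request_id) queue=\((.duration_queue_ms // 0) * 100 / .duration_total_ms | round)% upstream=\((.duration_upstream_ms // 0) * 100 / .duration_total_ms | round)%"
  '
}
```

Piping `kubectl logs -n kthena-system deployment/kthena-router --since=20m` into phase_breakdown gives one line per request; a high queue share points at fairness queuing or rate limiting, a high upstream share at the inference pods.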

Configuration

Observability features are configured via the Kthena Router's deployment or ConfigMap.
Most settings are controlled through environment variables or the router's configuration file (depending on your deployment method).

Access Log Configuration

# Option 1: router configuration file
accessLogger:
  enabled: true
  format: "json"    # "json" (strongly recommended) or "text"
  output: "stdout"  # "stdout", "stderr", or file path

# Option 2: environment variables on the router container
env:
  - name: ACCESS_LOG_ENABLED
    value: "true"
  - name: ACCESS_LOG_FORMAT
    value: "json"
  - name: ACCESS_LOG_OUTPUT
    value: "stdout"

Metrics Configuration

observability:
  metrics:
    enabled: true
    port: 8080      # Default metrics port
    path: /metrics

Debug Endpoints

All debug endpoints are served on port 15000, separate from the metrics port.

| Endpoint | Description |
| --- | --- |
| /debug/config_dump/modelroutes | All ModelRoute resources |
| /debug/config_dump/modelservers | All ModelServer resources |
| /debug/config_dump/pods | Current view of healthy/ready inference pods |
| /debug/config_dump/namespaces/{ns}/modelroutes/{name} | Detailed single ModelRoute |
| /debug/config_dump/namespaces/{ns}/modelservers/{name} | Detailed single ModelServer |

Quick Start – Observability in Action

# Forward metrics port (8080)
kubectl port-forward -n kthena-system svc/kthena-router 8080:8080 &

# Forward debug port (15000) when needed
kubectl port-forward -n kthena-system svc/kthena-router 15000:15000 &

# Watch request counters by model (cumulative values; compare refreshes to judge the rate)
watch -n 2 'curl -s http://localhost:8080/metrics | grep "^kthena_router_requests_total" | sort'

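# Tail logs and filter access log entries (JSON lines only)
# --line-buffered keeps grep from holding back output while following with -f
kubectl logs -n kthena-system deployment/kthena-router -f \
| grep --line-buffered -E '^\{.*\}$' | jq . # Only lines that look like JSON objects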

# Alternative: look for model-related entries in all logs
kubectl logs -n kthena-system deployment/kthena-router -f \
| grep -E "model_name|request_id|duration_total"

Important note about logs

  • Router logs usually contain both regular application logs (plain text) and structured access logs (JSON).
  • Piping everything directly to jq will cause errors on non-JSON lines.
  • Use grep to filter JSON lines first, or use a log processor (like fluentd, vector, or loki) in production.

Troubleshooting Guide

Preparation

# Forward metrics port (8080) for metrics queries
kubectl port-forward -n kthena-system svc/kthena-router 8080:8080 &

# Forward debug port (15000) for debug endpoints
kubectl port-forward -n kthena-system svc/kthena-router 15000:15000 &

# Live structured logs (recommended, with filtering)
kubectl logs -n kthena-system deployment/kthena-router -f \
| grep --line-buffered -E '^\{.*\}$' | jq .

1. High Error Rate (5xx, timeouts, internal server errors)

Count 5xx responses by status code and model (highest counts first):

curl -s http://localhost:8080/metrics \
| grep '^kthena_router_requests_total{.*status_code="5' | awk '{print $NF, $0}' | sort -nr

Top affected models (counts failing label combinations per model, not requests):

curl -s http://localhost:8080/metrics \
| grep '^kthena_router_requests_total{.*status_code="5' \
| grep -o 'model="[^"]*"' | sort | uniq -c | sort -nr
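
Counting lines with uniq -c shows how many label combinations failed, not how many requests did. A minimal sketch that sums the counter values per model instead, assuming the standard Prometheus text format and the labels listed for kthena_router_requests_total (the function name sum_5xx_by_model is made up for illustration):

```shell
# Hypothetical helper: total 5xx responses per model, highest first.
# Reads Prometheus text-format metrics on stdin.
sum_5xx_by_model() {
  awk '
    /^kthena_router_requests_total[{]/ && /status_code="5[0-9][0-9]"/ {
      # Pull out the value of the model label; skip series without one.
      if (match($0, /model="[^"]*"/)) {
        model = substr($0, RSTART + 7, RLENGTH - 8)
        total[model] += $NF            # last field is the sample value
      }
    }
    END { for (m in total) printf "%s %d\n", m, total[m] }
  ' | sort -k2,2nr
}
```

Usage: `curl -s http://localhost:8080/metrics | sum_5xx_by_model`. The values are cumulative since the router started, so a large number may reflect an old incident rather than a current one.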

Inspect recent failed requests:

kubectl logs -n kthena-system deployment/kthena-router --since=30m \
| grep -E '^\{.*\}$' \
| jq 'select(.status_code >= 500) | {ts: .timestamp, model: .model_name, err: .error, pod: .selected_pod, dur: .duration_total_ms}'

Check upstream health:

curl -s http://localhost:15000/debug/config_dump/modelservers | jq .
curl -s http://localhost:15000/debug/config_dump/pods | jq .

2. High Latency / Slow TTFT or Generation Speed

Raw latency histogram series (p50/p95/p99 are derived from these buckets, for example with PromQL histogram_quantile):

curl -s http://localhost:8080/metrics \
| grep -E 'kthena_router_request_duration_seconds_(bucket|sum|count)'
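
The bucket lines are cumulative counts, so a percentile has to be interpolated from them. The sketch below does this from a single scrape, using linear interpolation inside the matching bucket in the same spirit as PromQL's histogram_quantile. It is an approximation under stated assumptions: counts are summed across all models, paths, and status codes, they cover the whole time since the router started rather than a recent window, and the helper name histogram_quantile_est is hypothetical.

```shell
# Hypothetical helper: estimate a quantile (e.g. 0.95) of end-to-end latency.
# Reads Prometheus text-format metrics on stdin; prints seconds.
histogram_quantile_est() {
  grep '^kthena_router_request_duration_seconds_bucket[{]' |
  awk '
    match($0, /[{,]le="[^"]*"/) {
      le = substr($0, RSTART + 5, RLENGTH - 6)
      if (le == "+Inf") le = -1      # sentinel so the total sorts first
      count[le] += $NF               # sum cumulative counts across label sets
    }
    END { for (le in count) print le, count[le] }
  ' | LC_ALL=C sort -k1,1n |
  awk -v q="$1" '
    $1 == -1 { total = $2; rank = q * total; next }
    total > 0 && !done && $2 >= rank {
      # Interpolate linearly between the previous and current bucket bound.
      if ($2 == prev_count) est = $1
      else est = prev_le + ($1 - prev_le) * (rank - prev_count) / ($2 - prev_count)
      printf "%.3f\n", est
      done = 1
    }
    { prev_le = $1; prev_count = $2 }
    END { if (!done) print "n/a" }   # no data, or quantile above last finite bucket
  '
}
```

Usage: `curl -s http://localhost:8080/metrics | histogram_quantile_est 0.95`. For alerting and dashboards, a windowed query in Prometheus is the better tool; this is only meant for a quick look from a terminal.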

Find slowest requests:

kubectl logs -n kthena-system deployment/kthena-router --since=20m \
| grep -E '^\{.*\}$' \
| jq 'select(.duration_total_ms > 4000) | {model: .model_name, total: .duration_total_ms, upstream: .duration_upstream_ms, pod: .selected_pod}'

Check queue pressure:

watch -n 2 'curl -s http://localhost:8080/metrics | grep -E "(active_downstream|fairness_queue_size)"'

3. Queue Buildup / Fairness / Throttling

Live queue monitoring:

watch -n 3 'curl -s http://localhost:8080/metrics | grep fairness_queue_size'

Queue wait time distribution:

curl -s http://localhost:8080/metrics | grep fairness_queue_duration_seconds
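
The histogram's _sum and _count series give a quick average without reading individual buckets. A minimal sketch, assuming the standard Prometheus text format; it averages over all models and users since the router started, and the function name fairness_queue_mean_wait is hypothetical:

```shell
# Hypothetical helper: mean time (seconds) spent in the fairness queue.
# Reads Prometheus text-format metrics on stdin.
fairness_queue_mean_wait() {
  awk '
    /^kthena_router_fairness_queue_duration_seconds_sum[{ ]/   { sum += $NF }
    /^kthena_router_fairness_queue_duration_seconds_count[{ ]/ { n   += $NF }
    END {
      if (n > 0) printf "%.3f\n", sum / n
      else       print "no samples"
    }
  '
}
```

Usage: `curl -s http://localhost:8080/metrics | fairness_queue_mean_wait`. A mean hides tail behaviour, so check the upper buckets as well when a few users report long waits.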

Find throttled/rejected requests:

kubectl logs -n kthena-system deployment/kthena-router --since=1h \
| grep -E '^\{.*\}$' \
| jq 'select(.error? | .type? == ("rate_limit","throttled","queue_full"))'

4. Wrong Routing / 404 / Pod Selection Issues

Validate full routing table:

curl -s http://localhost:15000/debug/config_dump/modelroutes | jq .

Check pod readiness:

curl -s http://localhost:15000/debug/config_dump/pods | jq .

Trace a specific request:

# Use request_id from client or error message
kubectl logs -n kthena-system deployment/kthena-router \
| grep -E '^\{.*\}$' \
| jq 'select(.request_id == "a1b2c3d4-...")'

5. Token Usage / Cost / Abuse Monitoring

Cumulative token counters (sample twice and compare to derive a consumption rate):

curl -s http://localhost:8080/metrics | grep kthena_router_tokens_total
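
Because kthena_router_tokens_total is a counter, a rate needs two scrapes and the time between them. The sketch below compares two saved scrapes and prints tokens per second for each token_type, summed over all models and paths. The function name token_rate and the file names in the usage comment are hypothetical, and a router restart between the scrapes would reset the counters and produce negative values.

```shell
# Hypothetical helper: average tokens/second per token_type between two scrapes.
# Usage (against a live router):
#   curl -s http://localhost:8080/metrics > before.prom; sleep 60
#   curl -s http://localhost:8080/metrics > after.prom
#   token_rate before.prom after.prom 60
token_rate() {
  awk -v dt="$3" '
    /^kthena_router_tokens_total[{]/ && match($0, /token_type="[^"]*"/) {
      tt = substr($0, RSTART + 12, RLENGTH - 13)
      if (FILENAME == ARGV[1]) before[tt] += $NF   # first scrape
      else                     after[tt]  += $NF   # second scrape
    }
    END {
      for (tt in after)
        printf "%s %.1f tokens/s\n", tt, (after[tt] - before[tt]) / dt
    }
  ' "$1" "$2" | sort
}
```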

High-token requests:

kubectl logs -n kthena-system deployment/kthena-router --since=2h \
| grep -E '^\{.*\}$' \
| jq 'select(.input_tokens > 3000 or .output_tokens > 1500)'