Architecture Overview
Kthena follows a two-plane architecture that cleanly separates what you declare from how requests flow. The Control Plane reconciles your CRDs into a running system; the Data Plane routes every inference request through a model-aware pipeline to the right pod at the right time.
Control Plane — Declarative Model Lifecycle
The Control Plane is Kthena's model-aware brain. You describe intent through CRDs; controllers continuously reconcile that intent into live resources with safety guardrails, rollout logic, and status visibility.
Custom Resource Definitions (CRDs)
Kthena extends Kubernetes with a layered CRD model. A single high-level resource cascades into fine-grained primitives — so you can operate at whichever abstraction level fits your workflow.
ModelBooster (one-stop deployment API)
├── ModelRoute – routing rules, canary weights, rate limits
├── ModelServer – service exposure, traffic policy, endpoint discovery
├── ModelServing – replica topology: ServingGroups × Roles (Prefill / Decode)
├── AutoScalingPolicy – metric triggers, scaling behaviors, panic mode
└── AutoScalingPolicyBinding – attach policies to workloads, cost-aware optimization
| CRD | Purpose |
|---|---|
| ModelBooster | High-level, opinionated API. Captures model specs and cascades create/update/delete of all downstream resources — one resource to deploy, one resource to manage. |
| ModelRoute | Declares routing intent: match by modelName, loraAdapters, path, or headers. Supports weighted traffic splits, canary/A-B strategies, and token-based rateLimits. |
| ModelServer | Defines service exposure and traffic policy for inference pods. Discovers backends via workloadSelector (including role-aware Prefill/Decode groupings) and applies retries, timeouts, and connection settings. |
| ModelServing | Manages inference workloads as ServingGroups with role-based replicas. Supports entry/worker pod templates, recoveryPolicy, topology- and gang-aware scheduling, rolling upgrades, and a dedicated scheduler. |
| AutoScalingPolicy | Specifies metric endpoints and triggers (CPU, memory, custom), target thresholds, and scaling behaviors — including panic windows for traffic spikes and stabilization windows to prevent flapping. |
| AutoScalingPolicyBinding | Binds policies to ModelServing targets in two modes: scalingConfiguration for metric-driven replica management, and optimizerConfiguration for cost-aware distribution across heterogeneous instance types (e.g., H100 + A100). |
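As a quick illustration of the weighted traffic split that ModelRoute declares, the sketch below picks a backend in proportion to its canary weight. This is a generic Python sketch, not Kthena code; the deployment names and the 90/10 split are made up.

```python
# Minimal sketch of a weighted traffic split, the mechanism behind
# ModelRoute's canary weights. Names and weights are illustrative only.
import random

def pick_target(targets: list[tuple[str, int]]) -> str:
    """Pick a backend model/version according to its canary weight."""
    names = [name for name, _ in targets]
    weights = [weight for _, weight in targets]
    return random.choices(names, weights=weights, k=1)[0]

# Example: send ~90% of traffic to the stable deployment, ~10% to the canary.
split = [("llama3-stable", 90), ("llama3-canary", 10)]
print(pick_target(split))
```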
Controllers
Three controllers reconcile the CRDs above into runtime resources:
| Controller | Responsibility |
|---|---|
| Model Booster Controller | Reconciles ModelBooster → downstream primitives. Propagates updates and orchestrates cascaded lifecycle operations to keep every derived resource consistent. |
| Model Serving Controller | Manages ServingGroups and role-based replicas. Handles topology- and gang-aware placement, fault recovery, rolling upgrades, and entry/worker pod template reconciliation per role. |
| Autoscaler Controller | Evaluates runtime metrics against AutoScalingPolicy targets, computes desired replica counts, and — via AutoScalingPolicyBinding — adjusts instance mix across heterogeneous hardware to meet SLOs and cost goals. Supports both stable and panic scaling modes. |
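The table leaves the replica math implicit. A common pattern, and the one the Kubernetes HPA uses, is a proportional rule, with a panic path that reacts to short-window spikes. The sketch below illustrates that pattern only and is not Kthena's actual algorithm; stabilization windows are omitted for brevity.

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     panic_metric: float | None = None,
                     panic_threshold: float = 2.0) -> int:
    """Proportional scaling rule with an optional panic mode (illustrative).

    Mirrors the HPA-style formula desired = ceil(current * metric / target);
    Kthena's policy fields (panic and stabilization windows) govern when
    and how such a rule is applied.
    """
    desired = math.ceil(current * metric / target)

    # Panic mode: if the short-window metric spikes well past the target,
    # scale on the spike instead of the averaged value, and never scale down.
    if panic_metric is not None and panic_metric / target >= panic_threshold:
        desired = max(desired, math.ceil(current * panic_metric / target), current)

    return max(desired, 1)

# Example: 4 replicas at 150% of the target metric -> 6 replicas.
print(desired_replicas(current=4, metric=1.5, target=1.0))
```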
Data Plane — Request-Level Intelligent Routing
The Data Plane is Kthena's runtime path. Every inference request flows through the Kthena Router, which applies security, fairness, and model-aware scheduling before dispatching to the optimal inference pod.
Request Pipeline
Each request traverses six stages in order:
| Stage | What Happens |
|---|---|
| 1. Authentication & Authorization | Validates identity and permissions before any work is done. |
| 2. Rate Limiting | Enforces per-model and per-tenant throughput limits — token-based, not just request-based — to prevent monopolization. |
| 3. Fairness Scheduling | Per-model fair queuing ensures one model's traffic spike cannot starve others sharing the same fleet. |
| 4. Scheduling | The core intelligence layer. A pluggable scheduler runs a filter → score → select pipeline to pick a set of candidate pods (details below). |
| 5. Load Balancing | Performs final routing across the candidate backend instances, applying the configured retry policy on failure. |
| 6. Proxy | Dispatches the request to the selected inference pod and streams the response back. |
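Stage 2's token-based limits differ from plain request counting because a single request can consume thousands of tokens. A minimal token-bucket sketch keyed by (model, tenant) shows the idea; the class and parameters are invented for illustration and are not Kthena's API (in practice the limits come from ModelRoute's rateLimits).

```python
import time
from collections import defaultdict

class TokenBudget:
    """Token-bucket limiter keyed by (model, tenant), counting LLM tokens.

    Illustrative sketch only; not part of Kthena's configuration surface.
    """
    def __init__(self, tokens_per_second: float, burst: float):
        self.rate = tokens_per_second
        self.burst = burst
        self.level = defaultdict(lambda: burst)  # remaining token budget per key
        self.stamp = {}                          # last refill timestamp per key

    def allow(self, model: str, tenant: str, tokens: int) -> bool:
        key = (model, tenant)
        now = time.monotonic()
        last = self.stamp.get(key, now)
        # Refill the bucket in proportion to elapsed time, capped at burst.
        self.level[key] = min(self.burst, self.level[key] + (now - last) * self.rate)
        self.stamp[key] = now
        if self.level[key] >= tokens:
            self.level[key] -= tokens
            return True
        return False

limiter = TokenBudget(tokens_per_second=1000, burst=4000)
print(limiter.allow("llama3", "team-a", tokens=512))  # True while the budget lasts
```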
Pluggable Scheduler
The scheduler at Stage 4 is the heart of Kthena's intelligent routing. It mirrors the Kubernetes scheduler's filter/score pattern but operates at the request level, running in microseconds rather than milliseconds.
Filter Plugins (eliminate ineligible pods):
| Plugin | Logic |
|---|---|
| Least Requests | Drops pods that exceed a configurable active-request threshold. |
| LoRA Affinity | Excludes pods that don't have the required LoRA adapter loaded. |
Score Plugins (rank remaining pods, weighted):
| Plugin | Optimizes For |
|---|---|
| Least Requests | Prefers pods with the fewest active in-flight requests to minimize queuing delay. |
| Least Latency | Minimizes TTFT (Time-to-First-Token) and TPOT (Time-Per-Output-Token) based on recent observations. |
| KV Cache Aware | Favors pods with the highest expected KV-cache hit rate, maximizing cache utilization and reducing recomputation overhead. |
| Prefix Cache | Matches the request's prompt prefix against cached prefixes across pods to maximize cache hits. |
| GPU Cache | Considers GPU memory utilization to avoid request preemption. |
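To make the filter → score → select flow concrete, here is a minimal Python sketch using two of the plugins above (Least Requests and Prefix Cache). The Pod fields, thresholds, and weights are invented for illustration; Kthena's actual plugin interfaces and configuration are internal to the Router.

```python
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    active_requests: int
    loras: set = field(default_factory=set)
    prefix_hit: float = 0.0  # fraction of the prompt prefix already cached

# --- Filter plugins: eliminate ineligible pods -------------------------------
def filter_least_requests(pods, max_active=8):
    return [p for p in pods if p.active_requests <= max_active]

def filter_lora_affinity(pods, adapter):
    return [p for p in pods if adapter is None or adapter in p.loras]

# --- Score plugins: rank the survivors (higher is better), weighted ----------
def score_least_requests(p):
    return 1.0 / (1.0 + p.active_requests)

def score_prefix_cache(p):
    return p.prefix_hit

def select(pods, adapter=None, weights=(0.5, 0.5)):
    candidates = filter_lora_affinity(filter_least_requests(pods), adapter)
    if not candidates:
        return None
    return max(candidates, key=lambda p: weights[0] * score_least_requests(p)
                                         + weights[1] * score_prefix_cache(p))

pods = [Pod("a", 2, {"sql-lora"}, 0.8), Pod("b", 0, set(), 0.1)]
print(select(pods, adapter="sql-lora").name)  # -> "a": pod b lacks the adapter and is filtered out
```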
PD-Aware Scheduling: When the request targets a disaggregated PD group, the scheduler first scores decode pods, then pairs each with a compatible prefill pod in the same group — ensuring KV-cache locality and minimal data transfer.
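In sketch form, PD-aware scheduling picks the latency-critical decode pod first and then restricts the prefill choice to the same ServingGroup. The data structures and the "group" field below are purely illustrative.

```python
def schedule_pd(decode_pods, prefill_pods, score):
    """Pick the best decode pod, then a compatible prefill pod in the same group.

    Illustrative only: "group" stands for the ServingGroup a pod belongs to.
    """
    decode = max(decode_pods, key=score)
    prefill = max((p for p in prefill_pods if p["group"] == decode["group"]), key=score)
    return prefill, decode

decode_pods  = [{"name": "d0", "group": "g1", "load": 1}, {"name": "d1", "group": "g2", "load": 3}]
prefill_pods = [{"name": "p0", "group": "g1", "load": 2}, {"name": "p1", "group": "g2", "load": 0}]
pair = schedule_pd(decode_pods, prefill_pods, score=lambda p: -p["load"])
print(pair)  # prefill p0 and decode d0, both in ServingGroup g1
```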
Inference Pod Architecture
Inference workloads are organized into ServingGroups, each containing multiple replicas. Replicas assume specialized roles to separate the two fundamentally different phases of LLM inference.
Role-Based Disaggregation
ModelServing
└── ServingGroup (instance 1 … N)
├── Prefill Role → optimize for compute throughput
│ ├── Entry Pod (ingress)
│ └── Worker Pods (parallel prompt processing)
└── Decode Role → optimize for low latency
├── Entry Pod (ingress)
└── Worker Pods (incremental token generation)
| Role | Workload Profile | Scaling Strategy |
|---|---|---|
| Prefill | Compute-heavy prompt initialization and context encoding | Scale for throughput — batch multiple prompts per GPU |
| Decode | Memory-bound, latency-sensitive token-by-token generation | Scale for latency — minimize queue depth per pod |
Independent replica counts per role (e.g., 2 prefill : 4 decode) let you match hardware to workload characteristics without over-provisioning either stage.
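Gang-aware placement (mentioned under ModelServing) means a ServingGroup should only take traffic once every role has its full replica set. A toy readiness check using the 2 prefill : 4 decode example above, illustrative only:

```python
def group_ready(group: dict, desired: dict) -> bool:
    """A ServingGroup serves traffic only when every role is fully placed."""
    return all(len(group.get(role, [])) >= count for role, count in desired.items())

# Example topology from above: 2 prefill replicas, 4 decode replicas.
desired = {"prefill": 2, "decode": 4}
group = {"prefill": ["pf-0", "pf-1"], "decode": ["dc-0", "dc-1", "dc-2"]}
print(group_ready(group, desired))  # False: one decode replica is still pending
```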
Pod Components
Each replica deployment may include:
| Component | Purpose |
|---|---|
| Entry Pod | Provides ingress endpoints for role-specific requests. |
| Worker Pod(s) | Execute model inference computations (tensor-parallel across workers for large models). |
| Downloader | Universal model/artifact fetcher — supports HuggingFace, ModelScope, S3/OBS, and PVC sources with concurrent downloads and file-based locking. |
| Runtime Agent | Sidecar that proxies and standardizes engine metrics (vLLM / SGLang), exposes LoRA adapter lifecycle APIs (download / load / unload), handles KV cache events for PD disaggregation, and provides a model download endpoint. |
| LLM Engine | The actual inference backend — vLLM, SGLang, or other supported engines. |
KV Cache Transfer
In Prefill-Decode Disaggregation mode, KV cache state must flow from prefill pods to decode pods. Kthena supports three connector backends:
| Connector | Trade-off |
|---|---|
| LMCache | In-memory, lowest latency, same-node or RDMA |
| MoonCake | Distributed, cross-node, fault-tolerant |
| NIXL | Lightweight, NCCL-based, GPU-direct |
The choice is transparent to the application — the Runtime Agent and KV Cache connector handle transfer automatically.