Architecture Overview

Kthena follows a two-plane architecture that cleanly separates what you declare from how requests flow. The Control Plane reconciles your CRDs into a running system; the Data Plane routes every inference request through a model-aware pipeline to the right pod at the right time.

[Architecture diagram: the Control Plane's controllers (Model Booster, Model Route, Model Serving, Model Server, Autoscaler) reconcile the CRDs, while the Data Plane's Kthena Router (Auth, Rate Limiting, Fairness Scheduling, Scheduler with filter/score plugins, Load Balancing, Proxy) dispatches user requests, including prefill/decode requests in PD Disaggregation mode, to ServingGroup replicas. Each group's Prefill and Decode roles run entry and worker pods with an init-container Downloader, a Runtime Agent sidecar, and an LLM engine such as vLLM or SGLang.]

Control Plane — Declarative Model Lifecycle

The Control Plane is Kthena's model-aware brain. You describe intent through CRDs; controllers continuously reconcile that intent into live resources with safety guardrails, rollout logic, and status visibility.
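
The controllers follow the standard Kubernetes reconcile pattern: observe the declared spec, compare it with live state, converge one step, repeat. The sketch below is not Kthena's controller code; it is a dependency-free Go illustration of that loop, with placeholder desiredState/observedState types standing in for a CRD spec and the cluster objects it produces.

```go
package main

import (
	"fmt"
	"time"
)

// desiredState is a stand-in for the spec you declare in a CRD; observedState
// is what actually exists in the cluster. Names and fields are illustrative,
// not Kthena's API.
type desiredState struct{ replicas int }
type observedState struct{ replicas int }

// reconcile drives observed toward desired one step at a time and reports
// whether another pass is needed, which is the loop every controller runs.
func reconcile(desired desiredState, observed *observedState) (requeue bool) {
	switch {
	case observed.replicas < desired.replicas:
		observed.replicas++ // e.g. create one more replica
	case observed.replicas > desired.replicas:
		observed.replicas-- // e.g. delete an excess replica
	default:
		return false // converged: status can be reported as Ready
	}
	return true
}

func main() {
	desired := desiredState{replicas: 3}
	observed := &observedState{replicas: 0}
	for reconcile(desired, observed) {
		fmt.Println("observed replicas:", observed.replicas)
		time.Sleep(10 * time.Millisecond) // real controllers requeue via a work queue
	}
}
```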

Custom Resource Definitions (CRDs)

Kthena extends Kubernetes with a layered CRD model. A single high-level resource cascades into fine-grained primitives — so you can operate at whichever abstraction level fits your workflow.

ModelBooster  (one-stop deployment API)
├── ModelRoute – routing rules, canary weights, rate limits
├── ModelServer – service exposure, traffic policy, endpoint discovery
├── ModelServing – replica topology: ServingGroups × Roles (Prefill / Decode)
├── AutoScalingPolicy – metric triggers, scaling behaviors, panic mode
└── AutoScalingPolicyBinding – attach policies to workloads, cost-aware optimization
| CRD | Purpose |
| --- | --- |
| ModelBooster | High-level, opinionated API. Captures model specs and cascades create/update/delete of all downstream resources — one resource to deploy, one resource to manage. |
| ModelRoute | Declares routing intent: match by modelName, loraAdapters, path, or headers. Supports weighted traffic splits, canary/A-B strategies, and token-based rateLimits. |
| ModelServer | Defines service exposure and traffic policy for inference pods. Discovers backends via workloadSelector (including role-aware Prefill/Decode groupings) and applies retries, timeouts, and connection settings. |
| ModelServing | Manages inference workloads as ServingGroups with role-based replicas. Supports entry/worker pod templates, recoveryPolicy, topology- and gang-aware scheduling, rolling upgrades, and a dedicated scheduler. |
| AutoScalingPolicy | Specifies metric endpoints and triggers (CPU, memory, custom), target thresholds, and scaling behaviors — including panic windows for traffic spikes and stabilization windows to prevent flapping. |
| AutoScalingPolicyBinding | Binds policies to ModelServing targets in two modes: scalingConfiguration for metric-driven replica management, and optimizerConfiguration for cost-aware distribution across heterogeneous instance types (e.g., H100 + A100). |
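
To make the ModelRoute row above concrete, here is a hypothetical sketch of how a weighted canary split resolves to a single backend per request. The backend names and the 90/10 weights are invented for illustration; this is not the router's actual implementation.

```go
package main

import (
	"fmt"
	"math/rand"
)

// backendWeight pairs a backend name with its traffic weight, mirroring the
// weighted-split idea in a ModelRoute rule. Names are illustrative only.
type backendWeight struct {
	name   string
	weight int
}

// pickBackend selects one backend with probability proportional to its weight,
// which is the usual way a weighted canary split is resolved per request.
func pickBackend(rng *rand.Rand, backends []backendWeight) string {
	total := 0
	for _, b := range backends {
		total += b.weight
	}
	n := rng.Intn(total)
	for _, b := range backends {
		if n < b.weight {
			return b.name
		}
		n -= b.weight
	}
	return backends[len(backends)-1].name
}

func main() {
	// Hypothetical 90/10 canary between a stable and a canary ModelServer.
	backends := []backendWeight{{"llama3-stable", 90}, {"llama3-canary", 10}}
	rng := rand.New(rand.NewSource(1))
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickBackend(rng, backends)]++
	}
	fmt.Println(counts) // roughly 9000 stable / 1000 canary
}
```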

Controllers

Three controllers reconcile the CRDs above into runtime resources:

| Controller | Responsibility |
| --- | --- |
| Model Booster Controller | Reconciles ModelBooster → downstream primitives. Propagates updates and orchestrates cascaded lifecycle operations to keep every derived resource consistent. |
| Model Serving Controller | Manages ServingGroups and role-based replicas. Handles topology- and gang-aware placement, fault recovery, rolling upgrades, and entry/worker pod template reconciliation per role. |
| Autoscaler Controller | Evaluates runtime metrics against AutoScalingPolicy targets, computes desired replica counts, and — via AutoScalingPolicyBinding — adjusts instance mix across heterogeneous hardware to meet SLOs and cost goals. Supports both stable and panic scaling modes. |
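
The Autoscaler Controller's exact algorithm isn't detailed here, but metric-driven replica management generally reduces to a proportional rule: scale the current replica count by the ratio of observed metric to target, within bounds and a tolerance band. A minimal sketch under those assumptions (the metric values, tolerance, and bounds below are made up) follows.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas computes a replica count from the ratio of observed metric
// value to target, the same proportional rule the Kubernetes HPA uses.
// tolerance suppresses tiny corrections; min/max bound the result.
func desiredReplicas(current int, observed, target, tolerance float64, min, max int) int {
	ratio := observed / target
	if math.Abs(ratio-1.0) <= tolerance {
		return current // within tolerance: no change, avoids flapping
	}
	desired := int(math.Ceil(float64(current) * ratio))
	if desired < min {
		desired = min
	}
	if desired > max {
		desired = max
	}
	return desired
}

func main() {
	// Hypothetical numbers: 4 replicas, 85 in-flight requests per replica observed,
	// target of 50 per replica, 10% tolerance, bounded to [2, 16].
	fmt.Println(desiredReplicas(4, 85, 50, 0.1, 2, 16)) // 7
}
```

Stabilization and panic windows would sit on top of a rule like this, suppressing or widening corrections depending on how fast the metric is moving.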

Data Plane — Request-Level Intelligent Routing

The Data Plane is Kthena's runtime path. Every inference request flows through the Kthena Router, which applies security, fairness, and model-aware scheduling before dispatching to the optimal inference pod.

Request Pipeline

[Request pipeline diagram: user requests enter the Kthena Router and pass through Auth, Rate Limiting, Fairness Scheduling, the Scheduler (filter plugins: Least Requests, LoRA Affinity; score plugins: Least Requests, Least Latency (TTFT & TPOT), KV Cache Aware, Prefix Cache, GPU Cache), Load Balancing, and the Proxy.]

Each request traverses six stages in order:

| Stage | What Happens |
| --- | --- |
| 1. Authentication & Authorization | Validates identity and permissions before any work is done. |
| 2. Rate Limiting | Enforces per-model and per-tenant throughput limits — token-based, not just request-based — to prevent monopolization. |
| 3. Fairness Scheduling | Per-model fair queuing ensures one model's traffic spike cannot starve others sharing the same fleet. |
| 4. Scheduling | The core intelligence layer. A pluggable scheduler runs a filter → score → select pipeline to produce a set of candidate pods (details below). |
| 5. Load Balancing | Routes to the candidate backend instances, with retry policy support. |
| 6. Proxy | Dispatches the request to the selected inference pod and streams the response back. |
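
Stage 2's limits are token-based rather than request-based: the budget is consumed in proportion to prompt and completion size, not call count. The sketch below is a generic per-model token-bucket illustration of that idea, not Kthena's limiter implementation; the model name, burst size, and refill rate are assumptions.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

// tokenBucket enforces a tokens-per-second budget for one model, so a single
// model (or tenant) spending many LLM tokens cannot monopolize the fleet.
type tokenBucket struct {
	mu       sync.Mutex
	capacity float64 // burst size in LLM tokens
	rate     float64 // refill rate in LLM tokens per second
	tokens   float64
	last     time.Time
}

func newTokenBucket(capacity, rate float64) *tokenBucket {
	return &tokenBucket{capacity: capacity, rate: rate, tokens: capacity, last: time.Now()}
}

// allow charges the request's estimated token cost against the bucket and
// reports whether the request should be admitted or rejected.
func (b *tokenBucket) allow(cost float64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens = math.Min(b.capacity, b.tokens+now.Sub(b.last).Seconds()*b.rate)
	b.last = now
	if cost > b.tokens {
		return false
	}
	b.tokens -= cost
	return true
}

func main() {
	// Hypothetical per-model budget: 20k-token bursts refilled at 5k tokens/s.
	limits := map[string]*tokenBucket{"llama3-70b": newTokenBucket(20000, 5000)}
	fmt.Println(limits["llama3-70b"].allow(1200))  // true
	fmt.Println(limits["llama3-70b"].allow(50000)) // false: exceeds remaining budget
}
```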

Pluggable Scheduler

The scheduler at Stage 4 is the heart of Kthena's intelligent routing. It mirrors the Kubernetes scheduler's filter/score pattern but operates at the request level, running in microseconds rather than milliseconds.

Filter Plugins (eliminate ineligible pods):

| Plugin | Logic |
| --- | --- |
| Least Requests | Drops pods that exceed a configurable active-request threshold. |
| LoRA Affinity | Excludes pods that don't have the required LoRA adapter loaded. |

Score Plugins (rank remaining pods, weighted):

| Plugin | Optimizes For |
| --- | --- |
| Least Requests | Prefers pods with the fewest active in-flight requests to minimize queuing delay. |
| Least Latency | Minimizes TTFT (Time-to-First-Token) and TPOT (Time-Per-Output-Token) based on recent observations. |
| KV Cache Aware | Favors pods with the greatest expected KV-cache reuse for the request, maximizing cache utilization and reducing recomputation overhead. |
| Prefix Cache | Matches the request's prompt prefix against cached prefixes across pods to maximize cache hits. |
| GPU Cache | Considers GPU memory utilization to avoid request preemption. |
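
To make the filter → score → select flow concrete, here is a minimal sketch of the pattern. It is not Kthena's plugin interface: the pod fields, plugin weights, and thresholds are assumptions, chosen only to show filter plugins pruning candidates and weighted score plugins ranking the survivors.

```go
package main

import (
	"fmt"
	"sort"
)

// podStats is a simplified view of the per-pod signals the plugins read.
// Field names are illustrative, not Kthena's data model.
type podStats struct {
	name           string
	activeReqs     int
	loadedLoRAs    map[string]bool
	prefixCacheHit float64 // 0..1 fraction of the prompt prefix already cached
	gpuCacheFree   float64 // 0..1 free KV-cache / GPU memory fraction
}

type request struct {
	lora          string
	maxActiveReqs int
}

// filter drops pods that cannot serve the request at all.
func filter(req request, pods []podStats) []podStats {
	var out []podStats
	for _, p := range pods {
		if p.activeReqs > req.maxActiveReqs { // Least Requests filter
			continue
		}
		if req.lora != "" && !p.loadedLoRAs[req.lora] { // LoRA Affinity filter
			continue
		}
		out = append(out, p)
	}
	return out
}

// score ranks the survivors with a weighted sum of per-plugin scores.
func score(pods []podStats) []podStats {
	weights := map[string]float64{"leastRequests": 1, "prefixCache": 2, "gpuCache": 1}
	sort.SliceStable(pods, func(i, j int) bool {
		return total(pods[i], weights) > total(pods[j], weights)
	})
	return pods
}

func total(p podStats, w map[string]float64) float64 {
	leastReq := 1.0 / float64(1+p.activeReqs) // fewer in-flight requests scores higher
	return w["leastRequests"]*leastReq + w["prefixCache"]*p.prefixCacheHit + w["gpuCache"]*p.gpuCacheFree
}

func main() {
	pods := []podStats{
		{name: "decode-0", activeReqs: 3, loadedLoRAs: map[string]bool{"sql-lora": true}, prefixCacheHit: 0.8, gpuCacheFree: 0.4},
		{name: "decode-1", activeReqs: 1, loadedLoRAs: map[string]bool{}, prefixCacheHit: 0.1, gpuCacheFree: 0.9},
		{name: "decode-2", activeReqs: 6, loadedLoRAs: map[string]bool{"sql-lora": true}, prefixCacheHit: 0.2, gpuCacheFree: 0.7},
	}
	req := request{lora: "sql-lora", maxActiveReqs: 8}
	candidates := score(filter(req, pods))
	fmt.Println("selected:", candidates[0].name) // decode-0
}
```

The weighted-sum ranking is what lets operators bias selection toward cache locality (Prefix Cache, KV Cache Aware) or latency (Least Requests, Least Latency) without changing plugin code.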

PD-Aware Scheduling: When the request targets a disaggregated PD group, the scheduler first scores decode pods, then pairs each with a compatible prefill pod in the same group — ensuring KV-cache locality and minimal data transfer.
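
A minimal sketch of that decode-first pairing, using invented group and pod names rather than the scheduler's real data structures, is shown below.

```go
package main

import "fmt"

// pdGroup is a simplified view of one disaggregated ServingGroup: its decode
// pods carry scores from the score plugins, and its prefill pods are candidates
// for the paired prefill step. All names are illustrative.
type pdGroup struct {
	name    string
	decode  map[string]float64 // decode pod -> score
	prefill []string
}

// pickPDPair chooses the best-scoring decode pod, then a prefill pod from the
// same group, so the KV cache produced by prefill stays local to the group.
func pickPDPair(groups []pdGroup) (prefill, decode string) {
	best := -1.0
	var bestGroup *pdGroup
	for i := range groups {
		for pod, s := range groups[i].decode {
			if s > best {
				best, decode, bestGroup = s, pod, &groups[i]
			}
		}
	}
	if bestGroup != nil && len(bestGroup.prefill) > 0 {
		prefill = bestGroup.prefill[0] // e.g. least-loaded prefill pod in that group
	}
	return prefill, decode
}

func main() {
	groups := []pdGroup{
		{name: "group-a", decode: map[string]float64{"a-decode-0": 0.7}, prefill: []string{"a-prefill-0"}},
		{name: "group-b", decode: map[string]float64{"b-decode-0": 0.9}, prefill: []string{"b-prefill-0"}},
	}
	p, d := pickPDPair(groups)
	fmt.Println(p, d) // b-prefill-0 b-decode-0
}
```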


Inference Pod Architecture

Inference workloads are organized into ServingGroups, each containing multiple replicas. Replicas assume specialized roles to separate the two fundamentally different workloads of LLM inference.

Role-Based Disaggregation

ModelServing
└── ServingGroup (instance 1 … N)
    ├── Prefill Role → optimize for compute throughput
    │   ├── Entry Pod (ingress)
    │   └── Worker Pods (parallel prompt processing)
    └── Decode Role → optimize for low latency
        ├── Entry Pod (ingress)
        └── Worker Pods (incremental token generation)
| Role | Workload Profile | Scaling Strategy |
| --- | --- | --- |
| Prefill | Compute-heavy prompt initialization and context encoding | Scale for throughput — batch multiple prompts per GPU |
| Decode | Memory-bound, latency-sensitive token-by-token generation | Scale for latency — minimize queue depth per pod |

Independent replica counts per role (e.g., 2 prefill : 4 decode) let you match hardware to workload characteristics without over-provisioning either stage.

Pod Components

Each replica deployment may include:

| Component | Purpose |
| --- | --- |
| Entry Pod | Provides ingress endpoints for role-specific requests. |
| Worker Pod(s) | Execute model inference computations (tensor-parallel across workers for large models). |
| Downloader | Universal model/artifact fetcher — supports HuggingFace, ModelScope, S3/OBS, and PVC sources with concurrent downloads and file-based locking. |
| Runtime Agent | Sidecar that proxies and standardizes engine metrics (vLLM / SGLang), exposes LoRA adapter lifecycle APIs (download / load / unload), handles KV cache events for PD disaggregation, and provides a model download endpoint. |
| LLM Engine | The actual inference backend — vLLM, SGLang, or other supported engines. |

KV Cache Transfer

In Prefill-Decode Disaggregation mode, KV cache state must flow from prefill pods to decode pods. Kthena supports three connector backends:

| Connector | Trade-off |
| --- | --- |
| LMCache | In-memory, lowest latency, same-node or RDMA |
| MoonCake | Distributed, cross-node, fault-tolerant |
| NIXL | Lightweight, NCCL-based, GPU-direct |

The choice is transparent to the application — the Runtime Agent and KV Cache connector handle transfer automatically.