Architecture Overview
Kthena follows a two-plane architecture that cleanly separates what you declare from how requests flow. The Control Plane reconciles your CRDs into a running system; the Data Plane routes every inference request through a model-aware pipeline to the right pod at the right time.
Control Plane — Declarative Model Lifecycle
The Control Plane is Kthena's model-aware brain. You describe intent through CRDs; controllers continuously reconcile that intent into live resources with safety guardrails, rollout logic, and status visibility.
Custom Resource Definitions (CRDs)
Kthena extends Kubernetes with a layered CRD model. A single high-level resource cascades into fine-grained primitives — so you can operate at whichever abstraction level fits your workflow.
ModelBooster (one-stop deployment API)
├── ModelRoute – routing rules, canary weights, rate limits
├── ModelServer – service exposure, traffic policy, endpoint discovery
├── ModelServing – replica topology: ServingGroups × Roles (Prefill / Decode)
├── AutoScalingPolicy – metric triggers, scaling behaviors, panic mode
└── AutoScalingPolicyBinding – attach policies to workloads, cost-aware optimization
| CRD | Purpose |
|---|---|
| ModelBooster | High-level, opinionated API. Captures model specs and cascades create/update/delete of all downstream resources — one resource to deploy, one resource to manage. |
| ModelRoute | Declares routing intent: match by modelName, loraAdapters, path, or headers. Supports weighted traffic splits, canary/A-B strategies, and token-based rateLimits. |
| ModelServer | Defines service exposure and traffic policy for inference pods. Discovers backends via workloadSelector (including role-aware Prefill/Decode groupings) and applies retries, timeouts, and connection settings. |
| ModelServing | Manages inference workloads as ServingGroups with role-based replicas. Supports entry/worker pod templates, recoveryPolicy, topology- and gang-aware scheduling, rolling upgrades, and a dedicated scheduler. |
| AutoScalingPolicy | Specifies metric endpoints and triggers (CPU, memory, custom), target thresholds, and scaling behaviors — including panic windows for traffic spikes and stabilization windows to prevent flapping. |
| AutoScalingPolicyBinding | Binds policies to ModelServing targets in two modes: scalingConfiguration for metric-driven replica management, and optimizerConfiguration for cost-aware distribution across heterogeneous instance types (e.g., H100 + A100). |
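As a quick illustration of the weighted traffic split that ModelRoute declares, the sketch below picks a backend in proportion to its canary weight. This is a generic Python sketch, not Kthena code; the deployment names and the 90/10 split are made up.

```python
# Minimal sketch of a weighted traffic split, the mechanism behind
# ModelRoute's canary weights. Names and weights are illustrative only.
import random

def pick_target(targets: list[tuple[str, int]]) -> str:
    """Pick a backend model/version according to its canary weight."""
    names = [name for name, _ in targets]
    weights = [weight for _, weight in targets]
    return random.choices(names, weights=weights, k=1)[0]

# Example: send ~90% of traffic to the stable deployment, ~10% to the canary.
split = [("llama3-stable", 90), ("llama3-canary", 10)]
print(pick_target(split))
```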
Controllers
Three controllers reconcile the CRDs above into runtime resources:
| Controller | Responsibility |
|---|---|
| Model Booster Controller | Reconciles ModelBooster → downstream primitives. Propagates updates and orchestrates cascaded lifecycle operations to keep every derived resource consistent. |
| Model Serving Controller | Manages ServingGroups and role-based replicas. Handles topology- and gang-aware placement, fault recovery, rolling upgrades, and entry/worker pod template reconciliation per role. |
| Autoscaler Controller | Evaluates runtime metrics against AutoScalingPolicy targets, computes desired replica counts, and — via AutoScalingPolicyBinding — adjusts instance mix across heterogeneous hardware to meet SLOs and cost goals. Supports both stable and panic scaling modes. |
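The table leaves the replica math implicit. A common pattern, and the one the Kubernetes HPA uses, is a proportional rule, with a panic path that reacts to short-window spikes. The sketch below illustrates that pattern only and is not Kthena's actual algorithm; stabilization windows are omitted for brevity.

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     panic_metric: float | None = None,
                     panic_threshold: float = 2.0) -> int:
    """Proportional scaling rule with an optional panic mode (illustrative).

    Mirrors the HPA-style formula desired = ceil(current * metric / target);
    Kthena's policy fields (panic and stabilization windows) govern when
    and how such a rule is applied.
    """
    desired = math.ceil(current * metric / target)

    # Panic mode: if the short-window metric spikes well past the target,
    # scale on the spike instead of the averaged value, and never scale down.
    if panic_metric is not None and panic_metric / target >= panic_threshold:
        desired = max(desired, math.ceil(current * panic_metric / target), current)

    return max(desired, 1)

# Example: 4 replicas at 150% of the target metric -> 6 replicas.
print(desired_replicas(current=4, metric=1.5, target=1.0))
```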
Data Plane — Request-Level Intelligent Routing
The Data Plane is Kthena's runtime path. Every inference request flows through the Kthena Router, which applies security, fairness, and model-aware scheduling before dispatching to the optimal inference pod.
Request Pipeline
Each request traverses six stages in order:
| Stage | What Happens |
|---|---|
| 1. Authentication & Authorization | Validates identity and permissions before any work is done. |
| 2. Rate Limiting | Enforces per-model and per-tenant throughput limits — token-based, not just request-based — to prevent monopolization. |
| 3. Fairness Scheduling | Per-model fair queuing ensures one model's traffic spike cannot starve others sharing the same fleet. |
| 4. Scheduling | The core intelligence layer. A pluggable scheduler runs a filter → score → select pipeline to pick a set of candidate pods (details below). |
| 5. Load Balancing | Performs final routing across the candidate backend instances, applying the configured retry policy on failure. |
| 6. Proxy | Dispatches the request to the selected inference pod and streams the response back. |
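Stage 2's token-based limits differ from plain request counting because a single request can consume thousands of tokens. A minimal token-bucket sketch keyed by (model, tenant) shows the idea; the class and parameters are invented for illustration and are not Kthena's API (in practice the limits come from ModelRoute's rateLimits).

```python
import time
from collections import defaultdict

class TokenBudget:
    """Token-bucket limiter keyed by (model, tenant), counting LLM tokens.

    Illustrative sketch only; not part of Kthena's configuration surface.
    """
    def __init__(self, tokens_per_second: float, burst: float):
        self.rate = tokens_per_second
        self.burst = burst
        self.level = defaultdict(lambda: burst)  # remaining token budget per key
        self.stamp = {}                          # last refill timestamp per key

    def allow(self, model: str, tenant: str, tokens: int) -> bool:
        key = (model, tenant)
        now = time.monotonic()
        last = self.stamp.get(key, now)
        # Refill the bucket in proportion to elapsed time, capped at burst.
        self.level[key] = min(self.burst, self.level[key] + (now - last) * self.rate)
        self.stamp[key] = now
        if self.level[key] >= tokens:
            self.level[key] -= tokens
            return True
        return False

limiter = TokenBudget(tokens_per_second=1000, burst=4000)
print(limiter.allow("llama3", "team-a", tokens=512))  # True while the budget lasts
```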
Pluggable Scheduler
The scheduler at Stage 4 is the heart of Kthena's intelligent routing. It mirrors the Kubernetes scheduler's filter/score pattern but operates at the request level, running in microseconds rather than milliseconds.
Filter Plugins (eliminate ineligible pods):
| Plugin | Logic |
|---|---|
| Least Requests | Drops pods that exceed a configurable active-request threshold. |
| LoRA Affinity | Excludes pods that don't have the required LoRA adapter loaded. |
Score Plugins (rank remaining pods, weighted):
| Plugin | Optimizes For |
|---|---|
| Least Requests | Prefers pods with the fewest active in-flight requests to minimize queuing delay. |
| Least Latency | Minimizes TTFT (Time-to-First-Token) and TPOT (Time-Per-Output-Token) based on recent observations. |
| KV Cache Aware | Favors pods with the highest expected KV-cache hit rate, maximizing cache utilization and reducing recomputation overhead. |
| Prefix Cache | Matches the request's prompt prefix against cached prefixes across pods to maximize cache hits. |
| GPU Cache | Considers GPU memory utilization to avoid request preemption. |
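To make the filter → score → select flow concrete, here is a minimal Python sketch using two of the plugins above (Least Requests and Prefix Cache). The Pod fields, thresholds, and weights are invented for illustration; Kthena's actual plugin interfaces and configuration are internal to the Router.

```python
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    active_requests: int
    loras: set = field(default_factory=set)
    prefix_hit: float = 0.0  # fraction of the prompt prefix already cached

# --- Filter plugins: eliminate ineligible pods -------------------------------
def filter_least_requests(pods, max_active=8):
    return [p for p in pods if p.active_requests <= max_active]

def filter_lora_affinity(pods, adapter):
    return [p for p in pods if adapter is None or adapter in p.loras]

# --- Score plugins: rank the survivors (higher is better), weighted ----------
def score_least_requests(p):
    return 1.0 / (1.0 + p.active_requests)

def score_prefix_cache(p):
    return p.prefix_hit

def select(pods, adapter=None, weights=(0.5, 0.5)):
    candidates = filter_lora_affinity(filter_least_requests(pods), adapter)
    if not candidates:
        return None
    return max(candidates, key=lambda p: weights[0] * score_least_requests(p)
                                         + weights[1] * score_prefix_cache(p))

pods = [Pod("a", 2, {"sql-lora"}, 0.8), Pod("b", 0, set(), 0.1)]
print(select(pods, adapter="sql-lora").name)  # -> "a": pod b lacks the adapter and is filtered out
```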
PD-Aware Scheduling: When the request targets a disaggregated PD group, the scheduler first scores decode pods, then pairs each with a compatible prefill pod in the same group — ensuring KV-cache locality and minimal data transfer.
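In sketch form, PD-aware scheduling picks the latency-critical decode pod first and then restricts the prefill choice to the same ServingGroup. The data structures and the "group" field below are purely illustrative.

```python
def schedule_pd(decode_pods, prefill_pods, score):
    """Pick the best decode pod, then a compatible prefill pod in the same group.

    Illustrative only: "group" stands for the ServingGroup a pod belongs to.
    """
    decode = max(decode_pods, key=score)
    prefill = max((p for p in prefill_pods if p["group"] == decode["group"]), key=score)
    return prefill, decode

decode_pods  = [{"name": "d0", "group": "g1", "load": 1}, {"name": "d1", "group": "g2", "load": 3}]
prefill_pods = [{"name": "p0", "group": "g1", "load": 2}, {"name": "p1", "group": "g2", "load": 0}]
pair = schedule_pd(decode_pods, prefill_pods, score=lambda p: -p["load"])
print(pair)  # prefill p0 and decode d0, both in ServingGroup g1
```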
Inference Pod Architecture
Inference workloads are organized into ServingGroups, each containing multiple replicas. Replicas assume specialized roles to separate the two fundamentally different phases of LLM inference.
Role-Based Disaggregation
ModelServing
└── ServingGroup (instance 1 … N)
├── Prefill Role → optimize for compute throughput
│ ├── Entry Pod (ingress)
│ └── Worker Pods (parallel prompt processing)
└── Decode Role → optimize for low latency
├── Entry Pod (ingress)
└── Worker Pods (incremental token generation)
| Role | Workload Profile | Scaling Strategy |
|---|---|---|
| Prefill | Compute-heavy prompt initialization and context encoding | Scale for throughput — batch multiple prompts per GPU |
| Decode | Memory-bound, latency-sensitive token-by-token generation | Scale for latency — minimize queue depth per pod |
Independent replica counts per role (e.g., 2 prefill : 4 decode) let you match hardware to workload characteristics without over-provisioning either stage.
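Gang-aware placement (mentioned under ModelServing) means a ServingGroup should only take traffic once every role has its full replica set. A toy readiness check using the 2 prefill : 4 decode example above, illustrative only:

```python
def group_ready(group: dict, desired: dict) -> bool:
    """A ServingGroup serves traffic only when every role is fully placed."""
    return all(len(group.get(role, [])) >= count for role, count in desired.items())

# Example topology from above: 2 prefill replicas, 4 decode replicas.
desired = {"prefill": 2, "decode": 4}
group = {"prefill": ["pf-0", "pf-1"], "decode": ["dc-0", "dc-1", "dc-2"]}
print(group_ready(group, desired))  # False: one decode replica is still pending
```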
Pod Components
Each replica deployment may include:
| Component | Purpose |
|---|---|
| Entry Pod | Provides ingress endpoints for role-specific requests. |
| Worker Pod(s) | Execute model inference computations (tensor-parallel across workers for large models). |
| Downloader | Universal model/artifact fetcher — supports HuggingFace, ModelScope, S3/OBS, and PVC sources with concurrent downloads and file-based locking. |
| Runtime Agent | Sidecar that proxies and standardizes engine metrics (vLLM / SGLang), exposes LoRA adapter lifecycle APIs (download / load / unload), handles KV cache events for PD disaggregation, and provides a model download endpoint. |
| LLM Engine | The actual inference backend — vLLM, SGLang, or other supported engines. |
KV Cache Transfer
In Prefill-Decode Disaggregation mode, KV cache state must flow from prefill pods to decode pods. Kthena supports three connector backends:
| Connector | Trade-off |
|---|---|
| LMCache | In-memory, lowest latency, same-node or RDMA |
| MoonCake | Distributed, cross-node, fault-tolerant |
| NIXL | Lightweight, NCCL-based, GPU-direct |
The choice is transparent to the application — the Runtime Agent and KV Cache connector handle transfer automatically.