A Deep Dive into the Kthena Router
1. Introduction
As Large Language Models (LLMs) become increasingly central to modern applications, the infrastructure supporting them must evolve to meet demanding performance, scalability, and cost requirements. Deploying LLMs in production presents unique challenges: models are resource-intensive, inference workloads vary significantly, and users expect low latency with high throughput. Traditional load balancers and API gateways, while excellent for conventional web services, lack the awareness needed to intelligently route AI inference traffic.
Kthena Router addresses these challenges head-on. It is a Kubernetes-native, standalone inference router purpose-built for LLM serving workloads. Unlike generic proxies or load balancers, Kthena Router is model-aware, making intelligent routing decisions based on real-time metrics from inference engines. This enables sophisticated traffic management strategies that significantly improve throughput, reduce latency, and lower operational costs.
The router seamlessly integrates with existing API gateway infrastructure while providing advanced capabilities specifically designed for AI workloads:
- Model-Aware Routing: Leverages real-time metrics from inference engines (vLLM, SGLang, TGI) to make intelligent routing decisions
- LoRA-Aware Load Balancing: Routes requests to pods that have already loaded the desired LoRA adapter, reducing adapter swap latency from hundreds of milliseconds to near zero
- Advanced Scheduling Algorithms: Includes Prefix Cache Aware, KV Cache Aware, and Fairness Scheduling, among others
- Prefill-Decode Disaggregation: Native support for xPyD (x-prefill/y-decode) deployment patterns
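
To make the LoRA-aware idea concrete, here is a minimal Python sketch of adapter-affinity selection. It is an illustration only, not Kthena Router's actual implementation: the `Pod` type, the `pick_pod_for_adapter` function, and the pod and adapter names are all hypothetical. The assumption is that the router knows which adapters each pod has loaded and how many requests each pod has queued.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical view of one inference pod as seen by a router."""
    name: str
    loaded_adapters: set = field(default_factory=set)
    queue_length: int = 0


def pick_pod_for_adapter(pods, adapter):
    """Prefer pods that already hold the adapter; break ties by shortest queue.

    If no pod has the adapter loaded, fall back to the pod with the
    shortest queue, which will then have to load the adapter.
    """
    if not pods:
        raise ValueError("no pods available")
    warm = [p for p in pods if adapter in p.loaded_adapters]
    candidates = warm or pods
    return min(candidates, key=lambda p: p.queue_length)


# Illustrative data, not real measurements.
pods = [
    Pod("pod-a", {"sql-lora"}, queue_length=4),
    Pod("pod-b", {"chat-lora"}, queue_length=1),
    Pod("pod-c", {"sql-lora"}, queue_length=2),
]
chosen = pick_pod_for_adapter(pods, "sql-lora")
print(chosen.name)
```

A plain least-connections balancer would send this request to `pod-b` because its queue is shortest, forcing an adapter load. The affinity rule restricts the choice to pods that are already warm for `sql-lora` and only then considers queue length.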
Kthena Router is deployed as a standalone binary with minimal dependencies, which keeps it lightweight and straightforward to deploy. It continuously monitors inference engine metrics to obtain real-time information about model status, including currently loaded LoRA adapters, KV cache utilization, request queue lengths, and latency metrics: time to first token (TTFT) and time per output token (TPOT). Traditional load balancers do not see these engine-level signals, so they cannot base routing decisions on them; Kthena Router can.
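
One way to picture how such metrics could feed a routing decision is a weighted load score per pod. The Python sketch below is hypothetical and does not reflect Kthena Router's real scoring logic: the `PodMetrics` fields, the weights, and the normalization caps (`max_queue`, `max_ttft_ms`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PodMetrics:
    """Hypothetical snapshot of engine metrics for one pod."""
    name: str
    kv_cache_utilization: float  # fraction in [0.0, 1.0]
    queue_length: int            # requests waiting
    ttft_ms: float               # recent time to first token, in milliseconds


def score(m, max_queue=32, max_ttft_ms=2000.0, weights=(0.5, 0.3, 0.2)):
    """Return a load score in [0, 1]; lower means a better routing target.

    Queue length and TTFT are normalized against assumed caps and clamped
    to 1.0 so that one extreme value cannot dominate the score.
    """
    w_kv, w_queue, w_ttft = weights
    queue_load = min(m.queue_length / max_queue, 1.0)
    ttft_load = min(m.ttft_ms / max_ttft_ms, 1.0)
    return w_kv * m.kv_cache_utilization + w_queue * queue_load + w_ttft * ttft_load


def pick_least_loaded(metrics):
    """Pick the pod with the lowest combined load score."""
    return min(metrics, key=score)


# Illustrative data, not real measurements.
snapshot = [
    PodMetrics("pod-a", kv_cache_utilization=0.9, queue_length=2, ttft_ms=300.0),
    PodMetrics("pod-b", kv_cache_utilization=0.4, queue_length=10, ttft_ms=800.0),
    PodMetrics("pod-c", kv_cache_utilization=0.2, queue_length=30, ttft_ms=1900.0),
]
print(pick_least_loaded(snapshot).name)
```

In this snapshot no single metric identifies the best target: `pod-a` has the shortest queue but a nearly full KV cache, and `pod-c` has the emptiest cache but a long queue. Combining the signals selects `pod-b`, which is the kind of trade-off a connection-count balancer cannot make.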


