A Deep Dive into the Kthena Router

· 12 min read

1. Introduction

As Large Language Models (LLMs) become increasingly central to modern applications, the infrastructure supporting them must evolve to meet demanding performance, scalability, and cost requirements. Deploying LLMs in production presents unique challenges: models are resource-intensive, inference workloads vary significantly, and users expect low latency with high throughput. Traditional load balancers and API gateways, while excellent for conventional web services, lack the awareness needed to intelligently route AI inference traffic.

Kthena Router addresses these challenges head-on. It is a Kubernetes-native, standalone inference router purpose-built for LLM serving workloads. Unlike generic proxies or load balancers, Kthena Router is model-aware, making intelligent routing decisions based on real-time metrics from inference engines. This enables sophisticated traffic management strategies that significantly improve throughput, reduce latency, and lower operational costs.

The router seamlessly integrates with existing API gateway infrastructure while providing advanced capabilities specifically designed for AI workloads:

  • Model-Aware Routing: Leverages real-time metrics from inference engines (vLLM, SGLang, TGI) to make intelligent routing decisions
  • LoRA-Aware Load Balancing: Intelligently routes requests to pods that have already loaded the desired LoRA adapter, reducing adapter swap latency from hundreds of milliseconds to near zero
  • Advanced Scheduling Algorithms: Includes Prefix Cache Aware, KV Cache Aware, and Fairness scheduling, among others
  • Prefill-Decode Disaggregation: Native support for xPyD (x-prefill/y-decode) deployment patterns

Kthena Router is deployed as a standalone binary with minimal dependencies, ensuring lightweight operation and straightforward deployment. It continuously monitors inference engine metrics to obtain real-time information about model status, including currently loaded LoRA adapters, KV cache utilization, request queue lengths, and latency metrics (TTFT/TPOT). This real-time awareness enables the router to make optimal routing decisions that traditional load balancers simply cannot achieve.
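To make this concrete, here is a minimal Go sketch of the kind of metric-driven scoring such a router can perform. The `PodMetrics` fields, weights, and function names are illustrative assumptions for this post, not Kthena Router's actual implementation.

```go
// A minimal sketch (not the actual Kthena API) of how a model-aware router
// might score candidate pods from scraped engine metrics.
package main

import (
	"fmt"
	"sort"
)

// PodMetrics holds the kind of real-time signals a router can scrape from
// inference engines (hypothetical field names).
type PodMetrics struct {
	Name        string
	KVCacheUtil float64         // 0.0-1.0, fraction of KV cache in use
	QueueLength int             // pending requests
	LoadedLoRAs map[string]bool // adapters already resident on the pod
}

// score prefers pods that already hold the requested LoRA adapter and that
// have spare KV cache and short queues. Higher is better.
func score(p PodMetrics, lora string) float64 {
	s := 0.0
	if p.LoadedLoRAs[lora] {
		s += 100 // avoid an adapter swap entirely
	}
	s += (1.0 - p.KVCacheUtil) * 50
	s -= float64(p.QueueLength) * 10
	return s
}

// pick returns the highest-scoring pod for the requested adapter.
func pick(pods []PodMetrics, lora string) PodMetrics {
	sort.Slice(pods, func(i, j int) bool {
		return score(pods[i], lora) > score(pods[j], lora)
	})
	return pods[0]
}

func main() {
	pods := []PodMetrics{
		{Name: "vllm-0", KVCacheUtil: 0.9, QueueLength: 4, LoadedLoRAs: map[string]bool{"sql-lora": true}},
		{Name: "vllm-1", KVCacheUtil: 0.2, QueueLength: 1, LoadedLoRAs: map[string]bool{}},
	}
	fmt.Println("routing to:", pick(pods, "sql-lora").Name)
}
```

In this toy example the busier pod still wins because it already has the adapter resident, which is exactly the trade-off a metrics-blind load balancer cannot make.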

A Deep Dive into Kthena's ModelServing

· 8 min read

Introduction

As large models continue to grow exponentially in parameter size, the resource limits of a single virtual or physical machine can no longer meet their demands. To address this challenge, the industry has introduced innovative strategies such as PD-disaggregation (prefill-decode disaggregation) deployment and hybrid deployment of large and small models. These approaches have transformed inference execution: instead of a single Pod handling an entire inference task, multiple Pods now often collaborate to complete a single inference request. This multi-Pod collaboration has become a key trend in large model inference deployment.

In practice, inference models may still run within a single Pod (as in traditional single-node scenarios), across a group of identical Pods (for larger models), or among Pods with specialized roles (as in PD-disaggregation deployments). This flexible deployment not only improves resource utilization but also enables more efficient large model inference.

ModelServing is a specialized component of Kthena designed to manage and orchestrate the lifecycle of inference model workloads. Thanks to its three-tier architecture, it can conveniently represent and manage multiple deployment patterns, such as PD-disaggregation, tensor parallelism, pipeline parallelism, and native single-Pod model deployment.
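As a rough illustration of what a three-tier representation can look like, the Go sketch below models a PD-disaggregated deployment as groups of prefill and decode roles. The type names and fields are hypothetical and do not reflect Kthena's actual CRD schema.

```go
// A minimal sketch, assuming a three-tier shape of roughly
// ModelServing -> ServingGroup -> Pod roles; names are illustrative only.
package main

import "fmt"

// Role describes a set of Pods with the same responsibility, e.g. "prefill"
// or "decode" in a PD-disaggregated deployment.
type Role struct {
	Name     string
	Replicas int
}

// ServingGroup is one replica of the whole model: all roles that must
// collaborate to serve a single inference request.
type ServingGroup struct {
	Roles []Role
}

// ModelServing is the top-level object that scales ServingGroups as a unit.
type ModelServing struct {
	Model  string
	Groups int
	Spec   ServingGroup
}

func main() {
	// An xPyD-style layout: 1 prefill pod and 2 decode pods per group.
	ms := ModelServing{
		Model:  "example-7b-model",
		Groups: 2,
		Spec: ServingGroup{Roles: []Role{
			{Name: "prefill", Replicas: 1},
			{Name: "decode", Replicas: 2},
		}},
	}
	fmt.Printf("%s: %d group(s), roles %+v\n", ms.Model, ms.Groups, ms.Spec.Roles)
}
```

The same three-tier shape also covers the simpler cases: a single role with one replica is a traditional single-Pod deployment, and a single role with many identical replicas covers tensor- or pipeline-parallel groups.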

Kthena Router ScorePlugin Architecture and Benchmark Analysis

· 7 min read
Kuba

Abstract

This paper analyzes the system design and implementation of the ScorePlugin module in Kthena Router, which leverages a configurable, pluggable architecture to enable multi-dimensional scoring and intelligent routing of inference requests. We provide a detailed examination of the six currently implemented ScorePlugins, and construct a standardized benchmarking environment based on the DeepSeek-R1-Distill-Qwen-7B model to evaluate the performance of different scheduling strategies under both long and short system prompt scenarios.
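Before turning to the benchmarks, the following Go sketch illustrates the general shape of a pluggable, weighted scoring pipeline like the one analyzed here. The interface, plugin names, and weights are simplified assumptions rather than Kthena Router's actual code.

```go
// A minimal sketch of a configurable, pluggable scoring pipeline in the
// spirit of a ScorePlugin design; types and plugin names are illustrative.
package main

import "fmt"

// Pod is a simplified view of a candidate inference pod.
type Pod struct {
	Name        string
	QueueLength int
	PrefixHits  int // cached prefix blocks matching the request
}

// ScorePlugin scores a candidate pod for a given request; higher is better.
type ScorePlugin interface {
	Name() string
	Score(p Pod) float64
}

type leastRequest struct{}

func (leastRequest) Name() string        { return "least-request" }
func (leastRequest) Score(p Pod) float64 { return -float64(p.QueueLength) }

type prefixCacheAware struct{}

func (prefixCacheAware) Name() string        { return "prefix-cache-aware" }
func (prefixCacheAware) Score(p Pod) float64 { return float64(p.PrefixHits) }

// combine applies configured per-plugin weights and sums the results.
func combine(p Pod, plugins []ScorePlugin, weights map[string]float64) float64 {
	total := 0.0
	for _, pl := range plugins {
		total += weights[pl.Name()] * pl.Score(p)
	}
	return total
}

func main() {
	plugins := []ScorePlugin{leastRequest{}, prefixCacheAware{}}
	weights := map[string]float64{"least-request": 1.0, "prefix-cache-aware": 2.0}
	pods := []Pod{
		{Name: "pod-a", QueueLength: 3, PrefixHits: 8},
		{Name: "pod-b", QueueLength: 1, PrefixHits: 0},
	}
	for _, p := range pods {
		fmt.Printf("%s score=%.1f\n", p.Name, combine(p, plugins, weights))
	}
}
```

Because plugins and weights are configuration rather than code, combinations such as cache-aware plus least-request scheduling can be composed and compared without changing the router itself, which is what the benchmark below evaluates.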

Experimental results demonstrate that in long system prompt scenarios, the KVCacheAware Plugin + Least Request Plugin combination achieves 2.73× higher throughput and reduces TTFT latency by 73.5%, significantly optimizing overall inference service performance and validating the core value of cache-aware scheduling for large-scale model inference.