Kthena Router Supports Gateway API and Inference Extension

· 12 min read

Introduction

As Kubernetes becomes the de facto standard for deploying AI/ML workloads, the need for standardized, interoperable traffic management APIs has become increasingly important. The Kubernetes Gateway API represents a significant evolution from the traditional Ingress API, providing a more expressive, role-oriented, and extensible model for managing north-south traffic in Kubernetes clusters.

Building on top of Gateway API, the Gateway API Inference Extension introduces specialized resources and capabilities designed specifically for AI/ML inference workloads. This extension standardizes how inference services are exposed and routed through gateway implementations, enabling seamless integration across different gateway providers.

Kthena Router now supports both Gateway API and Gateway API Inference Extension, providing users with flexible routing options while maintaining compatibility with industry standards. This blog post explores why these APIs matter, how to enable them, and demonstrates practical usage examples.
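As a quick illustration of what routing to an inference backend looks like under these APIs, here is a minimal Go sketch that builds an HTTPRoute whose backendRef points at an InferencePool (the backend type introduced by the Inference Extension) rather than a plain Service. It uses the upstream gateway-api Go types; the gateway name kthena-gateway and pool name qwen-pool are placeholder values for illustration, not configuration taken from Kthena's documentation.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
	"sigs.k8s.io/yaml"
)

// ptr is a small helper for the optional (pointer) Gateway API fields below.
func ptr[T any](v T) *T { return &v }

func main() {
	// An HTTPRoute whose backendRef targets an InferencePool instead of a
	// plain Service. "kthena-gateway" and "qwen-pool" are placeholders.
	route := gatewayv1.HTTPRoute{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "gateway.networking.k8s.io/v1",
			Kind:       "HTTPRoute",
		},
		ObjectMeta: metav1.ObjectMeta{Name: "llm-route", Namespace: "default"},
		Spec: gatewayv1.HTTPRouteSpec{
			CommonRouteSpec: gatewayv1.CommonRouteSpec{
				ParentRefs: []gatewayv1.ParentReference{
					{Name: gatewayv1.ObjectName("kthena-gateway")},
				},
			},
			Rules: []gatewayv1.HTTPRouteRule{{
				BackendRefs: []gatewayv1.HTTPBackendRef{{
					BackendRef: gatewayv1.BackendRef{
						BackendObjectReference: gatewayv1.BackendObjectReference{
							// Group/Kind defined by the Gateway API Inference Extension.
							Group: ptr(gatewayv1.Group("inference.networking.x-k8s.io")),
							Kind:  ptr(gatewayv1.Kind("InferencePool")),
							Name:  gatewayv1.ObjectName("qwen-pool"),
						},
					},
				}},
			}},
		},
	}

	out, _ := yaml.Marshal(route)
	fmt.Println(string(out))
}
```

The point of the sketch is simply that the Inference Extension slots into the existing Gateway API model: the route stays a standard HTTPRoute, and only the backend reference changes from a Service to an InferencePool.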

A Deep Dive into the Kthena Router

· 12 min read

1. Introduction

As Large Language Models (LLMs) become increasingly central to modern applications, the infrastructure supporting them must evolve to meet demanding performance, scalability, and cost requirements. Deploying LLMs in production presents unique challenges: models are resource-intensive, inference workloads vary significantly, and users expect low latency with high throughput. Traditional load balancers and API gateways, while excellent for conventional web services, lack the awareness needed to intelligently route AI inference traffic.

Kthena Router addresses these challenges head-on. It is a Kubernetes-native, standalone inference router purpose-built for LLM serving workloads. Unlike generic proxies or load balancers, Kthena Router is model-aware, making intelligent routing decisions based on real-time metrics from inference engines. This enables sophisticated traffic management strategies that significantly improve throughput, reduce latency, and lower operational costs.

The router seamlessly integrates with existing API gateway infrastructure while providing advanced capabilities specifically designed for AI workloads:

  • Model-Aware Routing: Leverages real-time metrics from inference engines (vLLM, SGLang, TGI) to make intelligent routing decisions
  • LoRA-Aware Load Balancing: Intelligently routes requests to pods that have already loaded the desired LoRA adapter, reducing adapter swap latency from hundreds of milliseconds to near-zero
  • Advanced Scheduling Algorithms: Includes Prefix Cache Aware, KV Cache Aware, and Fairness scheduling, among others
  • Prefill-Decode Disaggregation: Native support for xPyD (x-prefill/y-decode) deployment patterns

Kthena Router is deployed as a standalone binary with minimal dependencies, ensuring lightweight operation and straightforward deployment. It continuously monitors inference engine metrics to obtain real-time information about model status, including currently loaded LoRA adapters, KV cache utilization, request queue lengths, and latency metrics (TTFT/TPOT). This real-time awareness enables the router to make optimal routing decisions that traditional load balancers simply cannot achieve.
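To make the idea concrete, the following is a small, self-contained Go sketch of the kind of metric-driven selection described above: prefer pods that already hold the requested LoRA adapter, then pods with spare KV cache and short queues. The struct fields, weights, and scoring formula are illustrative assumptions, not Kthena Router's actual implementation.

```go
package main

import "fmt"

// PodMetrics captures the kind of real-time signals a model-aware router can
// scrape from an inference engine. The field set and weights below are
// illustrative assumptions, not Kthena Router's actual data model.
type PodMetrics struct {
	Name         string
	LoadedLoRAs  map[string]bool // adapters currently resident on the pod
	KVCacheUsage float64         // fraction of KV cache in use, 0.0 - 1.0
	QueueLength  int             // requests waiting on the engine
}

// score prefers pods that already hold the requested LoRA adapter, then pods
// with spare KV cache and short queues. Higher is better.
func score(p PodMetrics, lora string) float64 {
	s := 0.0
	if lora != "" && p.LoadedLoRAs[lora] {
		s += 100 // avoid an adapter swap entirely
	}
	s += (1.0 - p.KVCacheUsage) * 50 // reward spare KV cache capacity
	s -= float64(p.QueueLength) * 10 // penalize queued requests
	return s
}

// pick returns the highest-scoring candidate pod for the request.
func pick(pods []PodMetrics, lora string) PodMetrics {
	best := pods[0]
	for _, p := range pods[1:] {
		if score(p, lora) > score(best, lora) {
			best = p
		}
	}
	return best
}

func main() {
	pods := []PodMetrics{
		{Name: "pod-a", LoadedLoRAs: map[string]bool{"sql-adapter": true}, KVCacheUsage: 0.7, QueueLength: 3},
		{Name: "pod-b", LoadedLoRAs: map[string]bool{}, KVCacheUsage: 0.2, QueueLength: 1},
	}
	fmt.Println("chosen:", pick(pods, "sql-adapter").Name) // pod-a: adapter affinity wins
}
```

A plain load balancer sees only connection counts; the difference here is that the routing decision is driven by engine-level signals such as adapter residency and KV cache pressure.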

A Deep Dive into Kthena's ModelServing

· 8 min read

Introduction

As large models continue to grow exponentially in parameter count, a single virtual or physical machine can no longer meet their resource demands. To address this challenge, the industry has introduced innovative strategies such as PD-disaggregation deployment and hybrid deployment of large and small models. These approaches have transformed inference execution: instead of a single Pod handling an entire inference task, multiple Pods now often collaborate to complete a single prediction. This multi-Pod collaboration has become a key trend in large model inference deployment.

In practice, inference models may still run within a single Pod (as in traditional single-node scenarios), across a group of identical Pods (for larger models), or among Pods with specialized roles (as in PD-disaggregation deployments). This flexible deployment not only improves resource utilization but also enables more efficient large model inference.

ModelServing is a specialized component of Kthena designed to manage and orchestrate the lifecycle of inference model workloads. Thanks to its three-tier architecture, it can conveniently represent and manage multiple deployment patterns, such as PD-disaggregation, tensor parallelism, pipeline parallelism, and native model deployment.
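To give a rough sense of what "multiple Pods collaborating on one model" can look like as a declarative spec, here is a hypothetical Go sketch. The type and field names are invented for illustration and do not reflect the actual ModelServing CRD; they only show how one resource can describe a single-role deployment or a prefill/decode split.

```go
package main

import "fmt"

// Role describes one group of Pods with a shared purpose. The names and
// fields here are hypothetical, not Kthena's real API.
type Role struct {
	Name     string // e.g. "prefill" or "decode"
	Replicas int    // how many groups of this role
	PodsPer  int    // Pods per group, e.g. for tensor/pipeline parallelism
}

// ModelServingSketch is an invented stand-in for a multi-role serving spec.
type ModelServingSketch struct {
	Model string
	Roles []Role
}

// totalPods sums the Pods implied by the spec across all roles.
func totalPods(ms ModelServingSketch) int {
	n := 0
	for _, r := range ms.Roles {
		n += r.Replicas * r.PodsPer
	}
	return n
}

func main() {
	// A hypothetical prefill/decode (xPyD-style) layout: one prefill group
	// and two decode groups, each spanning two Pods.
	ms := ModelServingSketch{
		Model: "example-llm",
		Roles: []Role{
			{Name: "prefill", Replicas: 1, PodsPer: 2},
			{Name: "decode", Replicas: 2, PodsPer: 2},
		},
	}
	fmt.Println("total pods:", totalPods(ms))
}
```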

Kthena Router ScorePlugin Architecture and Benchmark Analysis

· 7 min read
Kuba
Member

Abstract

This paper analyzes the system design and implementation of the ScorePlugin module in Kthena Router, which leverages a configurable, pluggable architecture to enable multi-dimensional scoring and intelligent routing of inference requests. We provide a detailed examination of the six currently implemented ScorePlugins, and construct a standardized benchmarking environment based on the DeepSeek-R1-Distill-Qwen-7B model to evaluate the performance of different scheduling strategies under both long and short system prompt scenarios.

Experimental results demonstrate that in long system prompt scenarios, the KVCacheAware Plugin + Least Request Plugin combination achieves 2.73× higher throughput and reduces TTFT latency by 73.5%, significantly optimizing overall inference service performance and validating the core value of cache-aware scheduling for large-scale model inference.
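For readers curious what a pluggable scoring pipeline looks like in code, below is a hedged Go sketch of a plugin interface plus a weighted combination of a KV-cache-aware scorer and a least-request scorer, mirroring the combination evaluated in the benchmark. The interface, plugin names, and weights are hypothetical and are not Kthena Router's actual ScorePlugin API.

```go
package main

import "fmt"

// Endpoint stands in for a candidate inference pod.
type Endpoint struct {
	Name         string
	KVCacheUsage float64 // fraction of KV cache in use
	InFlight     int     // requests currently being served
}

// ScorePlugin scores one endpoint for one request; higher is better.
// This interface is a hypothetical sketch, not Kthena's real plugin API.
type ScorePlugin interface {
	Name() string
	Score(e Endpoint) float64
}

type kvCacheAware struct{}

func (kvCacheAware) Name() string             { return "kv-cache-aware" }
func (kvCacheAware) Score(e Endpoint) float64 { return 1.0 - e.KVCacheUsage }

type leastRequest struct{}

func (leastRequest) Name() string             { return "least-request" }
func (leastRequest) Score(e Endpoint) float64 { return 1.0 / float64(1+e.InFlight) }

// pick combines weighted plugin scores and returns the best endpoint.
func pick(eps []Endpoint, plugins map[ScorePlugin]float64) Endpoint {
	best, bestScore := eps[0], -1.0
	for _, e := range eps {
		total := 0.0
		for p, w := range plugins {
			total += w * p.Score(e)
		}
		if total > bestScore {
			best, bestScore = e, total
		}
	}
	return best
}

func main() {
	eps := []Endpoint{
		{Name: "pod-a", KVCacheUsage: 0.9, InFlight: 8},
		{Name: "pod-b", KVCacheUsage: 0.3, InFlight: 2},
	}
	plugins := map[ScorePlugin]float64{kvCacheAware{}: 0.6, leastRequest{}: 0.4}
	fmt.Println("chosen:", pick(eps, plugins).Name) // pod-b: more cache headroom, fewer in-flight requests
}
```

Because each scorer is an independent plugin behind one interface, strategies can be swapped or re-weighted per deployment without touching the routing core, which is the property the benchmark above exercises.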

Volcano Community Launches Kthena Sub-project: Redefining Intelligent LLM Inference

· 6 min read

Today, we’re excited to announce to global developers and MLOps engineers the arrival of a new sub-project in the Volcano community: Kthena.

Kthena is a cloud-native, high-performance LLM inference routing, orchestration, and scheduling system designed specifically for Kubernetes. It aims to solve the core challenges of deploying and serving LLMs at scale in production environments. Through its unique features such as KV Cache-aware scheduling and Prefill/Decode separation routing, Kthena significantly improves GPU resource utilization, reduces inference latency, and provides enterprises with unprecedented flexibility and control.

As a sub-project of Volcano, Kthena is dedicated to helping Volcano expand its boundaries beyond AI training, creating a complete integrated solution for both training and inference.

Project Address