Version: Next

Router Routing

This page describes the routing features of Kthena Router, using worked examples and configurations.

Overview

Kthena Router provides sophisticated traffic routing capabilities that enable intelligent forwarding of inference requests to appropriate backend models. The routing system is built around two core Custom Resources (CRs):

ModelServer: Defines backend inference service instances with their associated pods, models, and traffic policies
ModelRoute: Defines routing rules based on request characteristics such as model name, LoRA adapters, HTTP headers, and weight distribution

For detailed definitions of the ModelServer and ModelRoute CRs, refer to the ModelServer and ModelRoute Reference pages.

The router supports various routing strategies, from simple model-based forwarding to complex weighted distribution and header-based routing. This flexibility allows for advanced deployment patterns including canary releases, A/B testing, and load balancing across heterogeneous model deployments.

Preparation

Before diving into the routing configurations, let's set up the environment and understand the prerequisites. All the configuration examples in this document can be found in the examples/kthena-router directory of the Kthena repository.

Environment Setup

To simplify deployment and reduce the requirements for demonstration environments, we use a mock LLM server instead of deploying real models. This mock server implements the vLLM standard interface and returns mock data, making it perfect for testing and learning purposes.

Prerequisites

Kubernetes cluster with Kthena installed
Access to the Kthena examples repository
Basic understanding of router CRDs (ModelServer and ModelRoute)

Getting Started

Deploy the mock LLM inference engines if you do not currently have a real GPU/NPU environment: mock deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B and mock deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.
Deploy the ModelServers (deepseek-r1-1-5b, deepseek-r1-7b, and so on) that act as backends for the different routing strategies.
All routing examples in this guide use these mock services, so you can experiment with different routing strategies without the overhead of real model deployment.

Routing Scenarios

1. Simple Model-Based Routing

Scenario: Direct all requests for a specific model to a single backend service.

Traffic Processing: When a request comes in for model "deepseek-simple", the router matches it against the modelName field of the ModelRoute and forwards all traffic to the deepseek-r1-1-5b ModelServer. This is the most straightforward routing pattern.

apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelRoute
metadata:
  name: deepseek-simple
  namespace: default
spec:
  modelName: "deepseek-simple"
  rules:
  - name: "default"
    targetModels:
    - modelServerName: "deepseek-r1-1-5b"

Flow Description:

Request arrives for model name "deepseek-simple"
Router matches the modelName field in the ModelRoute
100% of traffic is directed to deepseek-r1-1-5b
The ModelServer serves requests using vLLM inference engine with 10s timeout

Try it out:

export MODEL="deepseek-simple"

curl http://$ROUTER_IP/v1/completions \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"$MODEL\",
        \"prompt\": \"San Francisco is a\",
        \"temperature\": 0
    }"
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"This is simulated message from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B!"}],"created":1756366365,"id":"cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7","model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","object":"text_completion","system_fingerprint":"fp_44709d6fcb","usage":{"completion_tokens":218,"prompt_tokens":1,"time":0.0,"total_tokens":219}}
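
To confirm which backend answered without reading the whole JSON body, the model field of the response can be extracted. The helper below is a minimal sketch: backend_model is a hypothetical name, not part of Kthena, and it assumes the response is a JSON object with a string-valued "model" field, as in the mock output above. It relies only on grep, head, and cut, so it does not need jq.

```shell
# Hypothetical helper (not part of Kthena): read a completion response on
# stdin and print the value of its first "model" field.
backend_model() {
    grep -o '"model":[[:space:]]*"[^"]*"' | head -n 1 | cut -d'"' -f4
}
```

Piping the curl command above into backend_model should print deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B when this route is active.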

2. LoRA-Aware Routing

Scenario: Route requests requiring specific LoRA adapters to specialized ModelServers optimized for LoRA workloads.

Traffic Processing: When a request specifies LoRA adapters (lora-A or lora-B), the router routes it to ModelServers configured to handle these specific adapters.

apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelRoute
metadata:
  name: deepseek-lora
  namespace: default
spec:
  loraAdapters:
  - "lora-A"
  - "lora-B"
  rules:
  - name: "lora-route"
    targetModels:
    - modelServerName: "deepseek-r1-1-5b"

Flow Description:

Request arrives with LoRA adapter requirement (lora-A or lora-B)
Router matches the LoRA adapter against the supported list
Routes to deepseek-r1-1-5b ModelServer configured for LoRA workloads
ModelServer efficiently handles LoRA adapter loading and inference
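
The matching step in this flow can be pictured as a membership test: the model name in the request is compared against the entries of loraAdapters. The function below is only an illustration of that idea under the assumption of plain string equality; lora_route_matches is a hypothetical name and this is not the router's actual code.

```shell
# Hypothetical sketch (not Kthena code): succeed if the requested adapter
# name ($1) equals one of the adapters listed in the remaining arguments.
lora_route_matches() {
    requested=$1
    shift
    for adapter in "$@"; do
        if [ "$requested" = "$adapter" ]; then
            return 0
        fi
    done
    return 1
}

# With the ModelRoute above, lora-A matches and an unlisted name does not.
if lora_route_matches "lora-A" "lora-A" "lora-B"; then
    echo "lora-A is handled by this route"
fi
```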

Try it out:

export MODEL="lora-A"

curl http://$ROUTER_IP/v1/completions \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"$MODEL\",
        \"prompt\": \"San Francisco is a\",
        \"temperature\": 0
    }"
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"This is simulated message from lora-A!"}],"created":1756366636,"id":"cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7","model":"lora-A","object":"text_completion","system_fingerprint":"fp_44709d6fcb","usage":{"completion_tokens":120,"prompt_tokens":1,"time":0.0,"total_tokens":121}}

3. Weight-Based Traffic Distribution

Scenario: Gradually roll out new model versions by splitting traffic between different versions using weighted distribution.

Traffic Processing: The router uses weighted round-robin to distribute requests. For every 100 requests, approximately 70 will go to version 1 and 30 to version 2. This allows safe validation of new model versions with controlled risk.

apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelRoute
metadata:
  name: deepseek-subset
  namespace: default
spec:
  modelName: "deepseek-subset"
  rules:
  - name: "deepseek-r1-route"
    targetModels:
    - modelServerName: "deepseek-r1-1-5b-v1"
      weight: 70
    - modelServerName: "deepseek-r1-1-5b-v2"
      weight: 30

Flow Description:

Request arrives for model "deepseek-subset"
Router applies weighted distribution algorithm
70% of requests → deepseek-r1-1-5b-v1 (stable version)
30% of requests → deepseek-r1-1-5b-v2 (new version being tested)
This enables controlled testing of new model versions
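
To see how weights of 70 and 30 translate into a traffic split, the sketch below maps a number between 0 and 99 onto a backend by walking the cumulative weights. This is cumulative-weight selection, chosen here because it is easy to follow; it is not necessarily the algorithm Kthena implements, and pick_backend and its name=weight argument format are hypothetical.

```shell
# Hypothetical sketch (not Kthena code): pick a backend for a given roll.
# $1 is an integer in [0, sum of weights); the rest are name=weight pairs.
pick_backend() {
    roll=$1
    shift
    cumulative=0
    for entry in "$@"; do
        name=${entry%%=*}      # text before the first "="
        weight=${entry##*=}    # text after the last "="
        cumulative=$((cumulative + weight))
        if [ "$roll" -lt "$cumulative" ]; then
            echo "$name"
            return 0
        fi
    done
    return 1                   # roll was outside the total weight
}

pick_backend 42 deepseek-r1-1-5b-v1=70 deepseek-r1-1-5b-v2=30
```

Rolls 0 through 69 select deepseek-r1-1-5b-v1 and rolls 70 through 99 select deepseek-r1-1-5b-v2, which gives the 70/30 split when rolls are spread evenly.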

NOTE: To test this scenario, you need to deploy the canary version of the ModelServer and of the mock deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.

Try it out:

export MODEL="deepseek-subset"

for i in $(seq 1 100);
do
    curl http://$ROUTER_IP/v1/completions \
        -H "Content-Type: application/json" \
        -d "{
            \"model\": \"$MODEL\",
            \"prompt\": \"San Francisco is a\",
            \"temperature\": 0
        }";
done
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"This is simulated message from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v2!"}],"created":1756371124,"id":"cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7","model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v2","object":"text_completion","system_fingerprint":"fp_44709d6fcb","usage":{"completion_tokens":313,"prompt_tokens":1,"time":0.0,"total_tokens":314}}
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"This is simulated message from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v2!"}],"created":1756371124,"id":"cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7","model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v2","object":"text_completion","system_fingerprint":"fp_44709d6fcb","usage":{"completion_tokens":313,"prompt_tokens":1,"time":0.0,"total_tokens":314}}
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"This is simulated message from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v1!"}],"created":1756371124,"id":"cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7","model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v1","object":"text_completion","system_fingerprint":"fp_44709d6fcb","usage":{"completion_tokens":47,"prompt_tokens":1,"time":0.0,"total_tokens":48}}
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"This is simulated message from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v1!"}],"created":1756371124,"id":"cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7","model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v1","object":"text_completion","system_fingerprint":"fp_44709d6fcb","usage":{"completion_tokens":118,"prompt_tokens":1,"time":0.0,"total_tokens":119}}
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"This is simulated message from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v2!"}],"created":1756371124,"id":"cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7","model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v2","object":"text_completion","system_fingerprint":"fp_44709d6fcb","usage":{"completion_tokens":285,"prompt_tokens":1,"time":0.0,"total_tokens":286}}
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"This is simulated message from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v2!"}],"created":1756371124,"id":"cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7","model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v2","object":"text_completion","system_fingerprint":"fp_44709d6fcb","usage":{"completion_tokens":409,"prompt_tokens":1,"time":0.0,"total_tokens":410}}
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"This is simulated message from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v1!"}],"created":1756371124,"id":"cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7","model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-v1","object":"text_completion","system_fingerprint":"fp_44709d6fcb","usage":{"completion_tokens":343,"prompt_tokens":1,"time":0.0,"total_tokens":344}}
...
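
Reading 100 JSON responses by eye makes the split hard to judge, so it can help to count how many responses came from each backend. The function below is a rough sketch for that: tally_backends is a hypothetical helper, not part of Kthena, and it assumes each response carries exactly one "model" field whose value contains no spaces, as in the mock output above.

```shell
# Hypothetical helper (not part of Kthena): read completion responses on
# stdin and print "<model> <count> <percent>%" for each distinct model.
tally_backends() {
    grep -o '"model":[[:space:]]*"[^"]*"' \
        | cut -d'"' -f4 \
        | sort \
        | uniq -c \
        | awk '{ name[NR] = $2; count[NR] = $1; total += $1 }
               END {
                   for (i = 1; i <= NR; i++)
                       printf "%s %d %.0f%%\n", name[i], count[i], 100 * count[i] / total
               }'
}
```

Adding -s to the curl call in the loop above and piping the whole loop into tally_backends (done | tally_backends) should report a split near the configured 70/30, with some variation from run to run.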

4. Header-Based Multi-Model Routing

Scenario: Route traffic to different model sizes based on user tier, enabling premium users to access more powerful models.

Traffic Processing: The router evaluates incoming requests in the order rules are defined. Premium users (identified by user-type: premium header) are routed to the 7B model, while regular users fall back to the 1.5B model.

apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelRoute
metadata:
  name: deepseek-multi-models
  namespace: default
spec:
  modelName: "deepseek-multi-models"
  rules:
  - name: "premium"
    modelMatch:
      headers:
        user-type:
          exact: premium
    targetModels:
    - modelServerName: "deepseek-r1-7b"
  - name: "default"
    targetModels:
    - modelServerName: "deepseek-r1-1-5b"

Flow Description:

Request arrives for model "deepseek-multi-models" with headers
Router first checks if user-type: premium header exists with exact match
If premium header found → Routes to deepseek-r1-7b (7B model using SGLang)
If no premium header → Falls back to deepseek-r1-1-5b (1.5B model using vLLM)
Premium users get access to the more powerful 7B model for better performance
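
The ordered evaluation described in this flow behaves like a first-match lookup: the premium rule is tried first, and the rule without a modelMatch catches everything else. The function below mirrors the two rules of this ModelRoute as a small illustration; route_for_user_type is a hypothetical name, and the sketch assumes the exact match is a plain string comparison on the header value.

```shell
# Hypothetical sketch (not Kthena code): print the ModelServer a request
# would reach, given the value of its user-type header ($1, empty if absent).
route_for_user_type() {
    case "$1" in
        premium) echo "deepseek-r1-7b" ;;     # rule "premium"
        *)       echo "deepseek-r1-1-5b" ;;   # rule "default"
    esac
}

route_for_user_type "premium"
route_for_user_type ""
```

Under these assumptions, a request sent without the user-type header would be served by deepseek-r1-1-5b rather than deepseek-r1-7b.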

Try it out:

export MODEL="deepseek-multi-models"

curl http://$ROUTER_IP/v1/completions \
    -H "Content-Type: application/json" \
    -H "user-type: premium" \
    -d "{
        \"model\": \"$MODEL\",
        \"prompt\": \"San Francisco is a\",
        \"temperature\": 0
    }"
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"This is simulated message from deepseek-ai/DeepSeek-R1-Distill-Qwen-7B!"}],"created":1756367891,"id":"cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7","model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B","object":"text_completion","system_fingerprint":"fp_44709d6fcb","usage":{"completion_tokens":71,"prompt_tokens":1,"time":0.0,"total_tokens":72}}

This comprehensive routing system enables flexible, scalable, and maintainable model serving infrastructure that can adapt to various deployment patterns and user requirements.
