
Quick Start

Get up and running with Kthena in minutes! This guide will walk you through deploying your first AI model. We'll install a model from Hugging Face and perform inference using a simple curl command.

There are two ways to quickly start deploying LLMs with Kthena:

  1. ModelBooster
  2. ModelServing

Prerequisites

  • Kthena installed on your Kubernetes cluster (see Installation)
  • Access to a Kubernetes cluster with kubectl configured
  • Pods in the cluster can access the internet (to download the model from Hugging Face)
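
As a quick sanity check, the commands below confirm that kubectl can reach the cluster and that the Kthena CRDs are installed. The grep pattern is based on the serving.volcano.sh API group used by the manifests in this guide; adjust it if your installation differs.

# Verify cluster access
kubectl cluster-info

# Verify the Kthena CRDs (e.g. ModelBooster, ModelServing) are installed
kubectl get crd | grep serving.volcano.sh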

ModelBooster

ModelBooster is a Kthena Custom Resource Definition (CRD) that provides a simple way to deploy LLMs: a single manifest deploys a complete model.

Step 1: Create a ModelBooster Resource

Create the example model in your namespace (replace <your-namespace> with your actual namespace):

kubectl apply -n <your-namespace> -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/model-booster/Qwen2.5-0.5B-Instruct.yaml

Content of the ModelBooster manifest:

apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelBooster
metadata:
  name: demo
spec:
  backends:
    - name: "backend1"
      type: "vLLM"
      modelURI: "hf://Qwen/Qwen2.5-0.5B-Instruct"
      cacheURI: "hostpath:///tmp/cache"
      minReplicas: 1
      maxReplicas: 1
      env:
        - name: "HF_ENDPOINT" # Optional: use a Hugging Face mirror if you have network issues
          value: "https://hf-mirror.com/"
      workers:
        - type: "server"
          image: "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest" # This model runs on CPU; for details see https://docs.vllm.ai/en/stable/getting_started/installation/cpu.html#pre-built-images
          replicas: 1
          pods: 1
          config:
            served-model-name: "Qwen2.5-0.5B-Instruct"
            max-model-len: 32768
            max-num-batched-tokens: 65536
            block-size: 128
            enable-prefix-caching: ""
          resources:
            limits:
              cpu: "4"
              memory: "8Gi"
            requests:
              cpu: "2"
              memory: "4Gi"
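
Applying the manifest causes Kthena to create worker Pods for the model. If you want to watch them start up, listing Pods in the namespace is a safe default (the exact Pod names and labels depend on your Kthena version):

kubectl get pods -n <your-namespace> -w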

Step 2: Wait for Model to be Ready

Wait for the model's Active condition to become True. You can check the status using:

kubectl get modelBooster demo -o jsonpath='{.status.conditions}'

The conditions should look like this when the model is ready:

[
  {
    "lastTransitionTime": "2025-09-05T02:14:16Z",
    "message": "Model initialized",
    "reason": "ModelCreating",
    "status": "True",
    "type": "Initialized"
  },
  {
    "lastTransitionTime": "2025-09-05T02:18:46Z",
    "message": "Model is ready",
    "reason": "ModelAvailable",
    "status": "True",
    "type": "Active"
  }
]
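
If you prefer to block until the model is ready rather than polling, kubectl wait also works against custom resources that publish standard conditions. The condition name below matches the Active type shown above; the timeout is an arbitrary example.

kubectl wait modelbooster/demo -n <your-namespace> \
  --for=condition=Active --timeout=15m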

Step 3: Perform Inference

You can now perform inference using the model. Here's an example of how to send a request:

curl -X POST http://<model-route-ip>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "demo",
    "messages": [
      {
        "role": "user",
        "content": "Where is the capital of China?"
      }
    ],
    "stream": false
  }'

Use the following command to get the <model-route-ip>:

kubectl get svc networking-kthena-router -o jsonpath='{.spec.clusterIP}' -n <your-namespace>

This IP is only reachable from inside the cluster. To send requests from outside the cluster, use the EXTERNAL-IP of the networking-kthena-router Service after one has been assigned (for example, by exposing the Service as a LoadBalancer).
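
For quick local testing without exposing the Service, kubectl port-forward also works. The Service port 80 below is an assumption; check the actual port with kubectl get svc networking-kthena-router -n <your-namespace> first.

# Forward local port 8080 to the router Service (port 80 assumed)
kubectl port-forward svc/networking-kthena-router 8080:80 -n <your-namespace>

# In another terminal, send the request to localhost instead of <model-route-ip>
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "demo", "messages": [{"role": "user", "content": "Hello"}], "stream": false}'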

ModelServing

In addition to the single-manifest ModelBooster workflow, you can configure your own LLM deployment in more detail through ModelServing.

The Model Serving controller is a Kthena component that provides a flexible, customizable way to deploy LLMs through the ModelServing CRD. ModelServing deploys large language models (LLMs) based on roles, with support for gang scheduling and network topology-aware scheduling. It also provides fundamental features such as scaling and rolling updates.

Here is an example of deploying the Qwen3-8B model with prefill/decode (PD) disaggregation on GPUs using ModelServing.

Step 1: Create a ModelServing Resource

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/model-serving/gpu-pd-disaggregation.yaml
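
If you want to review the manifest before applying it (for example, to adjust GPU resource requests for your cluster), you can download it first and apply the local copy instead:

curl -sL https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/model-serving/gpu-pd-disaggregation.yaml -o gpu-pd-disaggregation.yaml

kubectl apply -f gpu-pd-disaggregation.yaml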

Step 2: Wait for ModelServing to be Ready

Once all of the deployed Pods have started running, you can check them with:

kubectl get po

NAMESPACE   NAME                      READY   STATUS    RESTARTS   AGE
default     PD-sample-0-decode-0-0    1/1     Running   0          2m
default     PD-sample-0-prefill-0-0   1/1     Running   0          2m

You can also check the ModelServing status conditions:

kubectl get modelserving sample -o jsonpath='{.status.conditions}' | jq '.'

[
  {
    "lastTransitionTime": "2025-09-29T08:11:16Z",
    "message": "Some groups is progressing: [0]",
    "reason": "GroupProgressing",
    "status": "False",
    "type": "Progressing"
  },
  {
    "lastTransitionTime": "2025-09-29T08:11:21Z",
    "message": "All Serving groups are ready",
    "reason": "AllGroupsReady",
    "status": "True",
    "type": "Available"
  }
]
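
As with ModelBooster, you can block until the ModelServing is ready instead of polling. The condition name below matches the Available type shown above; the timeout is an arbitrary example, since model download and startup can take a while on GPU nodes.

kubectl wait modelserving/sample --for=condition=Available --timeout=30m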

Step 3: Perform Inference

Before you can perform inference, you need to create a ModelRoute and a ModelServer. You can refer to modelRouter Configuration and modelServer Configuration.
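
The request below assumes the ROUTER_IP environment variable points at the kthena router Service. One way to set it is from the ClusterIP, the same way as in the ModelBooster example above:

export ROUTER_IP=$(kubectl get svc networking-kthena-router -o jsonpath='{.spec.clusterIP}' -n <your-namespace>)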

Then you can use the following command to send a request:

export MODEL="models/Qwen3-8B"

curl http://$ROUTER_IP/v1/completions -H "Content-Type: application/json" -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"San Francisco is a\",
  \"temperature\": 0
}"
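
The router exposes an OpenAI-compatible completions API backed by vLLM, so you can also stream tokens by setting stream to true; this is standard OpenAI-style request behavior rather than a Kthena-specific flag.

curl http://$ROUTER_IP/v1/completions -H "Content-Type: application/json" -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"San Francisco is a\",
  \"temperature\": 0,
  \"stream\": true
}"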