Quick Start
Get up and running with Kthena in minutes! This guide will walk you through deploying your first AI model. We'll install a model from Hugging Face and perform inference using a simple curl command.
Prerequisites
- Kthena installed on your Kubernetes cluster (see Installation)
- Access to a Kubernetes cluster with `kubectl` configured
- Pods in the Kubernetes cluster can access the internet
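You can sanity-check these prerequisites before starting. The commands below are a minimal sketch; the `grep` pattern assumes Kthena registers its CRDs under a `volcano.sh` API group, as the manifest in Step 1 suggests:

```bash
# Confirm kubectl can reach the cluster
kubectl cluster-info

# Confirm the Kthena CRDs are installed (API group assumed from the manifest below)
kubectl get crd | grep volcano.sh
```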
Step 1: Create a Model Resource
Create the example model in your namespace (replace <your-namespace> with your actual namespace):
```bash
kubectl apply -n <your-namespace> -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/model-booster/Qwen2.5-0.5B-Instruct.yaml
```
Content of the Model:
```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelBooster
metadata:
  name: demo
spec:
  backends:
    - name: "backend1"
      type: "vLLM"
      modelURI: "hf://Qwen/Qwen2.5-0.5B-Instruct"
      cacheURI: "hostpath:///tmp/cache"
      minReplicas: 1
      maxReplicas: 1
      env:
        - name: "HF_ENDPOINT" # Optional: Use a Hugging Face mirror if you have network issues
          value: "https://hf-mirror.com/"
      workers:
        - type: "server"
          image: "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest" # This model will run on CPU, for more details visit https://docs.vllm.ai/en/stable/getting_started/installation/cpu.html#pre-built-images
          replicas: 1
          pods: 1
          config:
            served-model-name: "Qwen2.5-0.5B-Instruct"
            max-model-len: 32768
            max-num-batched-tokens: 65536
            block-size: 128
            enable-prefix-caching: ""
          resources:
            limits:
              cpu: "4"
              memory: "8Gi"
            requests:
              cpu: "2"
              memory: "4Gi"
```
Step 2: Wait for Model to be Ready
Wait for the model's Active condition to become true. You can check the status using:
```bash
kubectl get model demo -n <your-namespace> -o jsonpath='{.status.conditions}'
```
The conditions should look like this when the model is ready:
```json
[
  {
    "lastTransitionTime": "2025-09-05T02:14:16Z",
    "message": "Model initialized",
    "reason": "ModelCreating",
    "status": "True",
    "type": "Initialized"
  },
  {
    "lastTransitionTime": "2025-09-05T02:18:46Z",
    "message": "Model is ready",
    "reason": "ModelAvailable",
    "status": "True",
    "type": "Active"
  }
]
```
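Instead of polling the conditions manually, you can block until the Active condition is true. A sketch using `kubectl wait`, assuming it resolves the model resource by the same short name used above:

```bash
# Block until the Active condition becomes true (give up after 10 minutes)
kubectl wait model/demo -n <your-namespace> --for=condition=Active --timeout=10m
```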
Step 3: Perform Inference
You can now perform inference using the model. Here's an example of how to send a request:
```bash
curl -X POST http://<model-route-ip>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "demo",
    "messages": [
      {
        "role": "user",
        "content": "Where is the capital of China?"
      }
    ],
    "stream": false
  }'
```
Use the following command to get the <model-route-ip>:
```bash
kubectl get svc networking-kthena-router -o jsonpath='{.spec.clusterIP}' -n <your-namespace>
```
This IP can only be used inside the cluster. If you want to chat with the model from outside the cluster, use the EXTERNAL-IP of the networking-kthena-router service once you have exposed it externally and an external IP has been assigned.
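If you only want to try the endpoint from your workstation without exposing the router, a port-forward is a simple alternative. This is a sketch that assumes the networking-kthena-router service listens on port 80; adjust the target port to whatever `kubectl get svc` reports:

```bash
# Forward local port 8080 to the router service
kubectl port-forward svc/networking-kthena-router 8080:80 -n <your-namespace>

# In another terminal, send the same chat completion request to localhost
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "demo", "messages": [{"role": "user", "content": "Where is the capital of China?"}], "stream": false}'
```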