Version: 0.1.0

Quick Start

Get up and running with Kthena in minutes! This guide walks you through deploying your first AI model: we'll deploy a model from Hugging Face and perform inference with a simple curl command.

Prerequisites

  • Kthena installed on your Kubernetes cluster (see Installation)
  • Access to a Kubernetes cluster with kubectl configured
  • Pods in the cluster can access the internet (required to pull the model from Hugging Face)

Step 1: Create a Model Resource

Create the example model in your namespace (replace <your-namespace> with your actual namespace):

kubectl apply -n <your-namespace> -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/model-booster/Qwen2.5-0.5B-Instruct.yaml

Content of the ModelBooster manifest:

apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelBooster
metadata:
  name: demo
spec:
  backends:
    - name: "backend1"
      type: "vLLM"
      modelURI: "hf://Qwen/Qwen2.5-0.5B-Instruct"
      cacheURI: "hostpath:///tmp/cache"
      minReplicas: 1
      maxReplicas: 1
      env:
        - name: "HF_ENDPOINT" # Optional: use a Hugging Face mirror if you have network issues
          value: "https://hf-mirror.com/"
      workers:
        - type: "server"
          image: "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest" # This model runs on CPU; for details see https://docs.vllm.ai/en/stable/getting_started/installation/cpu.html#pre-built-images
          replicas: 1
          pods: 1
          config:
            served-model-name: "Qwen2.5-0.5B-Instruct"
            max-model-len: 32768
            max-num-batched-tokens: 65536
            block-size: 128
            enable-prefix-caching: ""
          resources:
            limits:
              cpu: "4"
              memory: "8Gi"
            requests:
              cpu: "2"
              memory: "4Gi"
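
For orientation, the keys under config correspond to vLLM server arguments. Assuming a one-key-per-flag mapping (an assumption for illustration, not something this guide states), the backend above is roughly equivalent to running vLLM standalone like this:

# Illustrative only: approximate standalone-vLLM equivalent of the config block,
# assuming each config key maps to the vLLM server flag of the same name
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --served-model-name Qwen2.5-0.5B-Instruct \
  --max-model-len 32768 \
  --max-num-batched-tokens 65536 \
  --block-size 128 \
  --enable-prefix-caching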

Step 2: Wait for Model to be Ready

Wait for the model's Active condition to become True. You can check the status using:

kubectl get model demo -o jsonpath='{.status.conditions}'
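
Alternatively, you can block until the condition is met with kubectl wait, which works against any resource that exposes status conditions (the model shorthand below matches the command above):

# Block until the Active condition is True, or time out after 10 minutes
kubectl wait model demo --for=condition=Active --timeout=10m -n <your-namespace>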

And the status section should look like this when the model is ready:

[
  {
    "lastTransitionTime": "2025-09-05T02:14:16Z",
    "message": "Model initialized",
    "reason": "ModelCreating",
    "status": "True",
    "type": "Initialized"
  },
  {
    "lastTransitionTime": "2025-09-05T02:18:46Z",
    "message": "Model is ready",
    "reason": "ModelAvailable",
    "status": "True",
    "type": "Active"
  }
]

Step 3: Perform Inference

You can now perform inference using the model. Here's an example of how to send a request:

curl -X POST http://<model-route-ip>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "demo",
    "messages": [
      {
        "role": "user",
        "content": "Where is the capital of China?"
      }
    ],
    "stream": false
  }'
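
Since the vLLM backend serves an OpenAI-compatible API, a successful reply should resemble the following (the field values here are illustrative, not actual output):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1757040000,
  "model": "demo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of China is Beijing."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 16, "completion_tokens": 9, "total_tokens": 25 }
}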

Use the following command to get the <model-route-ip>:

kubectl get svc networking-kthena-router -o jsonpath='{.spec.clusterIP}' -n <your-namespace>
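
For a quick test from your workstation without exposing the Service, port-forwarding is an option. This sketch assumes the router listens on port 80, consistent with the plain http:// URL used above:

# Forward local port 8080 to the router Service (port 80 assumed)
kubectl port-forward svc/networking-kthena-router 8080:80 -n <your-namespace>
# Then send requests to http://localhost:8080/v1/chat/completions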

This ClusterIP is only reachable from inside the cluster. To chat from outside the cluster, expose the networking-kthena-router Service (for example, as a LoadBalancer) and use its EXTERNAL-IP.
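
If your cluster can provision load balancers, one way to do this (a sketch, not an official Kthena step) is to patch the Service type and wait for an external IP to be assigned:

# Switch the Service to type LoadBalancer (requires a cloud or on-prem LB provider)
kubectl patch svc networking-kthena-router -n <your-namespace> -p '{"spec":{"type":"LoadBalancer"}}'
# Watch for the EXTERNAL-IP column to be populated
kubectl get svc networking-kthena-router -n <your-namespace> -w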