Version: 0.1.0

Quick Start

Get up and running with Kthena in minutes! This guide walks you through deploying your first AI model: we'll deploy a model from Hugging Face and perform inference with a simple curl command.

Prerequisites

  • Kthena installed on your Kubernetes cluster (see Installation)
  • Access to a Kubernetes cluster with kubectl configured
  • Pods in the cluster can access the internet (required to pull the model from Hugging Face)

Step 1: Create a Model Resource

Create the example model in your namespace (replace <your-namespace> with your actual namespace):

kubectl apply -n <your-namespace> -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/model-booster/Qwen2.5-0.5B-Instruct.yaml

Content of the ModelBooster manifest:

apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelBooster
metadata:
  name: demo
spec:
  backends:
    - name: "backend1"
      type: "vLLM"
      modelURI: "hf://Qwen/Qwen2.5-0.5B-Instruct"
      cacheURI: "hostpath:///tmp/cache"
      minReplicas: 1
      maxReplicas: 1
      env:
        - name: "HF_ENDPOINT" # Optional: use a Hugging Face mirror if you have network issues
          value: "https://hf-mirror.com/"
      workers:
        - type: "server"
          image: "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest" # This model runs on CPU; for details see https://docs.vllm.ai/en/stable/getting_started/installation/cpu.html#pre-built-images
          replicas: 1
          pods: 1
          config:
            served-model-name: "Qwen2.5-0.5B-Instruct"
            max-model-len: 32768
            max-num-batched-tokens: 65536
            block-size: 128
            enable-prefix-caching: ""
          resources:
            limits:
              cpu: "4"
              memory: "8Gi"
            requests:
              cpu: "2"
              memory: "4Gi"
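
For orientation, the keys under config correspond to vLLM server arguments. Assuming a one-key-per-flag mapping (an assumption for illustration, not something this guide states), the backend above is roughly equivalent to running vLLM standalone like this:

# Illustrative only: approximate standalone-vLLM equivalent of the config block,
# assuming each config key maps to the vLLM server flag of the same name
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --served-model-name Qwen2.5-0.5B-Instruct \
  --max-model-len 32768 \
  --max-num-batched-tokens 65536 \
  --block-size 128 \
  --enable-prefix-caching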

Step 2: Wait for Model to be Ready

Wait for the model's Active condition to become True. You can check the status using:

kubectl get model demo -o jsonpath='{.status.conditions}'
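
Alternatively, you can block until the condition is met with kubectl wait, which works against any resource that exposes status conditions (the model shorthand below matches the command above):

# Block until the Active condition is True, or time out after 10 minutes
kubectl wait model demo --for=condition=Active --timeout=10m -n <your-namespace>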

And the status section should look like this when the model is ready:

[
  {
    "lastTransitionTime": "2025-09-05T02:14:16Z",
    "message": "Model initialized",
    "reason": "ModelCreating",
    "status": "True",
    "type": "Initialized"
  },
  {
    "lastTransitionTime": "2025-09-05T02:18:46Z",
    "message": "Model is ready",
    "reason": "ModelAvailable",
    "status": "True",
    "type": "Active"
  }
]

Step 3: Perform Inference

You can now perform inference using the model. Here's an example of how to send a request:

curl -X POST http://<model-route-ip>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "demo",
    "messages": [
      {
        "role": "user",
        "content": "Where is the capital of China?"
      }
    ],
    "stream": false
  }'
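
Since the vLLM backend serves an OpenAI-compatible API, a successful reply should resemble the following (the field values here are illustrative, not actual output):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1757040000,
  "model": "demo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of China is Beijing."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 16, "completion_tokens": 9, "total_tokens": 25 }
}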

Use the following command to get the <model-route-ip>:

kubectl get svc networking-kthena-router -o jsonpath='{.spec.clusterIP}' -n <your-namespace>
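
For a quick test from your workstation without exposing the Service, port-forwarding is an option. This sketch assumes the router listens on port 80, consistent with the plain http:// URL used above:

# Forward local port 8080 to the router Service (port 80 assumed)
kubectl port-forward svc/networking-kthena-router 8080:80 -n <your-namespace>
# Then send requests to http://localhost:8080/v1/chat/completions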

This ClusterIP is only reachable from inside the cluster. To chat from outside the cluster, expose the networking-kthena-router Service (for example, as a LoadBalancer) and use its EXTERNAL-IP.
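
If your cluster can provision load balancers, one way to do this (a sketch, not an official Kthena step) is to patch the Service type and wait for an external IP to be assigned:

# Switch the Service to type LoadBalancer (requires a cloud or on-prem LB provider)
kubectl patch svc networking-kthena-router -n <your-namespace> -p '{"spec":{"type":"LoadBalancer"}}'
# Watch for the EXTERNAL-IP column to be populated
kubectl get svc networking-kthena-router -n <your-namespace> -w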