
SGLang Prefill-Decode Disaggregation (GPU)

This page describes how to deploy prefill-decode disaggregated inference using SGLang with Kthena on a GPU cluster.

Prerequisites

  • Kubernetes cluster with Kthena installed
  • NVIDIA GPU-enabled nodes with the appropriate device plugin configured
  • Access to the SGLang container image (docker.io/lmsysorg/sglang:latest)
  • A Hugging Face token with access to the target model (e.g. Qwen/Qwen3-0.6B)
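Before deploying, it can help to confirm that the device plugin is actually advertising GPUs to the scheduler. A quick check (this assumes the standard NVIDIA device plugin, which exposes the nvidia.com/gpu resource):

kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

Nodes that report a non-empty GPU count can schedule the prefill and decode pods below.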

Deployment

When using the ModelServing approach, you create the following resources manually, in this order:

  1. ModelServing — Manages Pods for prefill and decode roles
  2. ModelServer — Manages the networking layer and PD-group routing
  3. ModelRoute — Provides request routing to the ModelServer

1. ModelServing Configuration

The ModelServing resource defines two roles: prefill and decode. Each role runs a separate SGLang server with the corresponding --disaggregation-mode flag, using Mooncake as the KV-transfer backend.

note

Replace <YOUR_HF_TOKEN> with your actual Hugging Face token before applying.
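For anything beyond a quick test, prefer a Kubernetes Secret over a literal token in the manifest. A minimal sketch (the hf-token Secret name and token key here are illustrative, not part of Kthena):

# Create the Secret once:
kubectl create secret generic hf-token --from-literal=token=<YOUR_HF_TOKEN>

# Then replace the literal HF_TOKEN value in the manifest below with:
#   - name: HF_TOKEN
#     valueFrom:
#       secretKeyRef:
#         name: hf-token
#         key: token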

kubectl apply -f - <<'EOF'
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelServing
metadata:
  name: sglang-qwen-06b
  namespace: default
spec:
  schedulerName: volcano
  replicas: 1
  template:
    roles:
      - name: prefill
        replicas: 1
        entryTemplate:
          spec:
            containers:
              - name: sglang-prefill
                image: docker.io/lmsysorg/sglang:latest
                imagePullPolicy: IfNotPresent
                command: ["python3", "-m", "sglang.launch_server"]
                args:
                  - "--model-path"
                  - "Qwen/Qwen3-0.6B"
                  - "--host"
                  - "$(POD_IP)"
                  - "--port"
                  - "30000"
                  - "--disaggregation-mode"
                  - "prefill"
                  - "--disaggregation-transfer-backend"
                  - "mooncake"
                  - "--enable-metrics"
                env:
                  - name: POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: HF_TOKEN
                    value: <YOUR_HF_TOKEN>
                ports:
                  - containerPort: 30000
                    name: http
                resources:
                  limits:
                    nvidia.com/gpu: 1
                    cpu: "4"
                    memory: 12Gi
                  requests:
                    nvidia.com/gpu: 1
                    cpu: "2"
                    memory: 12Gi
                volumeMounts:
                  - name: shm
                    mountPath: /dev/shm
                  - name: hf-cache
                    mountPath: /root/.cache/huggingface
                readinessProbe:
                  httpGet:
                    path: /health
                    port: 30000
                  initialDelaySeconds: 120
                  periodSeconds: 15
                  timeoutSeconds: 10
                  failureThreshold: 5
            volumes:
              - name: shm
                emptyDir:
                  medium: Memory
                  sizeLimit: 8Gi
              - name: hf-cache
                hostPath:
                  path: /root/.cache/huggingface
                  type: DirectoryOrCreate
        workerReplicas: 0
      - name: decode
        replicas: 1
        entryTemplate:
          spec:
            containers:
              - name: sglang-decode
                image: docker.io/lmsysorg/sglang:latest
                imagePullPolicy: IfNotPresent
                command: ["python3", "-m", "sglang.launch_server"]
                args:
                  - "--model-path"
                  - "Qwen/Qwen3-0.6B"
                  - "--host"
                  - "$(POD_IP)"
                  - "--port"
                  - "30000"
                  - "--disaggregation-mode"
                  - "decode"
                  - "--disaggregation-transfer-backend"
                  - "mooncake"
                  - "--enable-metrics"
                env:
                  - name: POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: HF_TOKEN
                    value: <YOUR_HF_TOKEN>
                ports:
                  - containerPort: 30000
                    name: http
                resources:
                  limits:
                    nvidia.com/gpu: 1
                    cpu: "4"
                    memory: 12Gi
                  requests:
                    nvidia.com/gpu: 1
                    cpu: "2"
                    memory: 12Gi
                volumeMounts:
                  - name: shm
                    mountPath: /dev/shm
                  - name: hf-cache
                    mountPath: /root/.cache/huggingface
                readinessProbe:
                  httpGet:
                    path: /health
                    port: 30000
                  initialDelaySeconds: 120
                  periodSeconds: 15
                  timeoutSeconds: 10
                  failureThreshold: 5
            volumes:
              - name: shm
                emptyDir:
                  medium: Memory
                  sizeLimit: 8Gi
              - name: hf-cache
                hostPath:
                  path: /root/.cache/huggingface
                  type: DirectoryOrCreate
        workerReplicas: 0
EOF

2. ModelServer Configuration

The ModelServer resource configures the networking layer. The pdGroup field tells Kthena how to identify prefill and decode pods so that it can perform PD-aware routing.

kubectl apply -f - <<'EOF'
apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelServer
metadata:
  name: sglang-qwen-06b
  namespace: default
spec:
  workloadSelector:
    matchLabels:
      modelserving.volcano.sh/name: sglang-qwen-06b
  pdGroup:
    groupKey: "modelserving.volcano.sh/group-name"
    prefillLabels:
      modelserving.volcano.sh/role: prefill
    decodeLabels:
      modelserving.volcano.sh/role: decode
  workloadPort:
    port: 30000
  model: "Qwen/Qwen3-0.6B"
  inferenceEngine: "SGLang"
  trafficPolicy:
    timeout: 10s
EOF

3. ModelRoute Configuration

The ModelRoute resource routes incoming requests by model name to the ModelServer created above.

kubectl apply -f - <<'EOF'
apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelRoute
metadata:
  name: sglang-qwen-06b
  namespace: default
spec:
  modelName: "sglang-qwen-06b"
  rules:
    - name: "default"
      targetModels:
        - modelServerName: "sglang-qwen-06b"
EOF
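At this point all three resources should exist. One way to list them in a single call (this assumes the CRD plural resource names follow the usual lowercased-kind convention):

kubectl get modelservings,modelservers,modelroutes -n default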

Verification

Check Pod Status

After applying the resources, verify that both prefill and decode pods are running:

kubectl get pod -o wide -l modelserving.volcano.sh/name=sglang-qwen-06b

Expected output:

NAME                            READY   STATUS    RESTARTS   AGE   IP         NODE     NOMINATED NODE   READINESS GATES
sglang-qwen-06b-0-decode-0-0    1/1     Running   0          5m    <pod-ip>   <node>   <none>           <none>
sglang-qwen-06b-0-prefill-0-0   1/1     Running   0          5m    <pod-ip>   <node>   <none>           <none>
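If a pod stays unready well past the readiness probe's 120-second initial delay, the container logs usually show why (model download, CUDA initialization, or the Mooncake transfer engine starting up):

kubectl logs sglang-qwen-06b-0-prefill-0-0 -f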

Send a Test Request

Once the pods are ready, send a chat completion request through the Kthena router:

curl -v http://<ROUTER_IP>:80/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"sglang-qwen-06b","messages":[{"role":"user","content":"Hello"}]}'

Expected response:

{
  "id": "2db2ac09a603498283f61badd8ff0ba3",
  "object": "chat.completion",
  "created": 1776149893,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, the user just said \"Hello\". I need to respond appropriately. Let me think. They might be testing if I can handle a greeting. Since I'm a language model, I should acknowledge their message. I can say \"Hello! How can I assist you today?\" That way, I'm polite and open to helping. I should make sure to keep the tone friendly and helpful. Let me check for any possible errors or issues. Everything looks good. Alright, time to send a friendly response.\n</think>\n\nHello! How can I assist you today?",
        "reasoning_content": null,
        "tool_calls": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 151645
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 125,
    "completion_tokens": 116,
    "prompt_tokens_details": null,
    "reasoning_tokens": 0
  },
  "metadata": {
    "weight_version": "default"
  }
}
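Cleanup

To tear the deployment down, delete the resources in reverse order of creation (this assumes the kubectl resource names match the lowercased kinds):

kubectl delete modelroute sglang-qwen-06b -n default
kubectl delete modelserver sglang-qwen-06b -n default
kubectl delete modelserving sglang-qwen-06b -n default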