Runtime
Kthena Runtime is a lightweight sidecar service designed to standardize Prometheus metrics from inference engines, provides LoRA adapter download/load/unload capabilities, and supports model downloading.
Overview
- Metrics standardization: fetch native metrics from the engine's /metrics endpoint, rename them to unified Kthena metrics according to rules.
- LoRA lifecycle management: simple HTTP APIs to download+load and unload LoRA adapters for dynamic enable/disable.
- Model downloading: supports downloading models from S3/OBS/PVC/HuggingFace to a local path.
Notes:
- If you want to download from S3/OBS, you first need to upload the model to the bucket.
Installation
-
Runtime does not support separate installation. it will be automatically deployed alongside the inference container as a sidecar when you are using
ModelBoosterto deploy llm. -
When deploying via the ModelBooster CR (one-stop deployment), no additional configuration is needed; ModelServing will automatically enable the runtime feature.
-
For standalone deployment using ModelServing YAML, you can add the following configuration to start Runtime as sidecar container:
- name: runtime
ports:
- containerPort: 8900
image: kthena/runtime:latest
args:
- --port
- "8900"
- --engine
- vllm
- --engine-base-url
- http://localhost:8000
- --engine-metrics-path
- /metrics
- --pod
- $(POD_NAME).$(NAMESPACE)
- --model
- test-model
env:
- name: ENDPOINT
value: https://obs.test.com
- name: RUNTIME_PORT
value: "8900"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: VLLM_USE_V1
value: "1"
envFrom:
- secretRef:
name: "test-secret"
readinessProbe:
httpGet:
path: /health
port: 8900
initialDelaySeconds: 5
periodSeconds: 10
resources: { }
Startup arguments:
-E, --engine(required): engine name, supportsvllm,sglang-H, --host(default0.0.0.0): listen address for Runtime-P, --port(default9000): listen port for Runtime-B, --engine-base-url(defaulthttp://localhost:8000): engine base URL-M, --engine-metrics-path(default/metrics): engine metrics path-I, --pod(required): current instance/Pod identifier, used for events and Redis keys-N, --model(required): model name
In the ModelBooster YAML, you can control Runtime startup values via spec.backends.env:
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelBooster
metadata:
annotations:
api.kubernetes.io/name: example
name: qwen25
spec:
name: qwen25-coder-32b
owner: example
backends:
- name: "qwen25-coder-32b-server"
type: "vLLM" # --engine
modelURI: s3://kthena/Qwen/Qwen2.5-Coder-32B-Instruct
cacheURI: hostpath:///cache/
envFrom:
- secretRef:
name: your-secrets
env:
- name: "RUNTIME_PORT" # default 8100
value: "8200"
- name: "RUNTIME_URL" # default http://localhost:8000/metrics
value: "http://localhost:8100"
- name: "RUNTIME_METRICS_PATH" # default /metrics
value: "/metrics"
minReplicas: 1
maxReplicas: 1
workers:
- type: server
image: openeuler/vllm-ascend:latest
replicase: 1
pods: 1
resources:
limits:
cpu: "8"
memory: 96Gi
huawei.com/ascend-1980: "2"
requests:
cpu: "1"
memory: 96Gi
huawei.com/ascend-1980: "2"
Metric Standardization
Runtime renames key metrics from different engines to unified names prefixed with kthena:* for consistent observability (Prometheus/Grafana):
kthena:generation_tokens_totalkthena:num_requests_waitingkthena:time_to_first_token_secondskthena:time_per_output_token_secondskthena:e2e_request_latency_seconds
Notes:
- When
engine=vllmorengine=sglang, key metrics from vLLM/SGLang are renamed to the standard names above. - Only metrics covered by built-in mappings are standardized, and the original metrics are preserved. You can obtain all raw engine metrics plus the standardized metrics.
Dynamic Lora configuration
You can use ModelBooster YAML to configure LoRA adapters for automatic download and loading during the model startup. If you only change loraAdapters in ModelBooster YAML, Runtime will dynamically download and load/unload the adapters without restarting the Pod.
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelBooster
metadata:
annotations:
api.kubernetes.io/name: example
name: deepseek-r1-distill-llama-8b
spec:
name: deepseek-r1-distill-llama-8b
owner: example
backends:
- name: "deepseek-r1-distill-llama-8b-vllm"
type: "vLLM"
modelURI: "s3://model-bucket/deepseek-r1-distill-llama-8b"
cacheURI: hostpath:///cache/
envFrom:
- secretRef:
name: your-secrets # AccessKey/SecretKey for S3/OBS or HF_AUTH_TOKEN for HuggingFace
env:
- name: "ENDPOINT"
value: "https://obs.test.com"
- name: "VLLM_ALLOW_RUNTIME_LORA_UPDATING"
value: "True" # Enable dynamic LoRA load/unload
minReplicas: 1
maxReplicas: 1
workers:
- type: server
image: openeuler/vllm-ascend:latest
replicase: 1
pods: 1
loraAdapters:
- name: lora-sql
artifactURL: s3://aios_models/deepseek-ai/DeepSeek-V3-W8A8/vllm-ascend-lora
Notes:
- To enable dynamic LoRA configuration, ensure that the environment variable
VLLM_ALLOW_RUNTIME_LORA_UPDATINGis set toTrue. loraAdapters.artifactURLsupports the same sources and formats as modelURI in the ModelBooster CR, including:- Hugging Face:
<namespace>/<repo_name>, e.g.,microsoft/phi-2 - S3:
s3://bucket/path - OBS:
obs://bucket/path - PVC:
pvc://path
- Hugging Face:
- You can configure the following environment variables for Runtime to access private models or object storage services:
- Hugging Face:
HF_AUTH_TOKEN(optional): token for accessing private modelsHF_ENDPOINT(optional): custom HF API endpointHF_REVISION(optional): model branch/revision (e.g.,main)
- S3/OBS:
ACCESS_KEY,SECRET_KEY: access credentials (recommended to store in a Secret and load viaenvFrom.secretRef.name)ENDPOINT: object storage service endpoint (e.g.,https://s3.us-east-1.amazonaws.comorhttps://obs.test.com)
- Hugging Face: