LeaderWorkerSet (LWS) Integration
Kthena ModelServing provides native support for the LeaderWorkerSet (LWS) API. This capability allows users to define inference workloads using the standard LWS Custom Resource Definition (CRD) while leveraging Kthena's powerful orchestration, routing, and autoscaling features underneath.
This guide explains how to use LeaderWorkerSet resources with Kthena.
Overview
LeaderWorkerSet (LWS) is a widely adopted API for describing multi-host inference workloads (e.g., LLM inference). Kthena integrates LWS support by directly watching and handling LeaderWorkerSet Custom Resources.
Key Features:
- LWS API Compatibility: Users can submit standard LeaderWorkerSet CRs directly.
- Zero Extra Infrastructure: No need to deploy the native LWS Controller; Kthena's ModelServing Controller handles the logic.
- Kthena Powered: Automatically inherits Kthena's capabilities like ModelRoute and Autoscaling.
- Seamless Migration: Ideal for users already using LWS who want to migrate to Kthena without rewriting their manifests.
How It Works
When Kthena's LWS integration is enabled:
- Direct Processing: The ModelServing Controller listens for LeaderWorkerSet resources.
- One-Way Conversion: It automatically converts the LWS specification into Kthena's internal ModelServing resources.
- Status Sync: The status of the underlying pods is aggregated and written back to the LeaderWorkerSet status, allowing you to use standard kubectl get lws commands to monitor progress.
Note: This is a one-way synchronization. Changes should be made to the LeaderWorkerSet resource, which will propagate to the underlying Kthena resources.
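For example, to scale out you patch the LeaderWorkerSet itself rather than any generated Kthena resource, and the change propagates downward. A minimal sketch, using a placeholder resource name:
# Scale to 2 serving groups by modifying the LeaderWorkerSet, not the generated resources
kubectl patch lws <your-lws-name> --type merge -p '{"spec":{"replicas":2}}'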
Configuration Mapping
Spec Mapping (LeaderWorkerSet -> ModelServing)
| LeaderWorkerSet Field | ModelServing Internal Semantics | Description |
|---|---|---|
| metadata.name | metadata.name | Mapped to the ModelServing identifier. |
| spec.replicas | spec.replicas | Defines the number of independent serving groups. |
| spec.leaderWorkerTemplate.leaderTemplate | spec.template.roles.EntryTemplate | Parsed as the Leader role definition. If nil, workerTemplate is used as the Entry Pod. |
| spec.leaderWorkerTemplate.workerTemplate | spec.template.roles.WorkerTemplate | Parsed as the Worker role definition. |
| spec.leaderWorkerTemplate.size | Worker Role Replicas | Used to calculate the replica count for the Worker role: Replicas = Size - 1 (assuming 1 Leader). |
| spec.startupPolicy | Startup Policy | Maps the startup order policy (e.g., LeaderFirst). |
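As a worked example of the size arithmetic, the following hypothetical spec fragment produces 2 serving groups, each with 1 Leader Pod and 3 Worker Pods (Worker role replicas = 4 - 1 = 3):
spec:
  replicas: 2                  # 2 independent serving groups
  leaderWorkerTemplate:
    size: 4                    # group size: 1 Leader + 3 Workers
    leaderTemplate:
      # ... Leader Pod template, parsed as the Entry (Leader) role
    workerTemplate:
      # ... Worker Pod template, parsed as the Worker role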
Status Mapping (ModelServing -> LeaderWorkerSet)
| ModelServing Internal Status | LeaderWorkerSet Status | Description |
|---|---|---|
| ServingGroup Ready Count | status.readyReplicas | Number of ready groups. |
| ServingGroup Total Count | status.replicas | Number of currently existing groups. |
| Conditions | status.conditions | Aggregated health status (Available, Progressing). |
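For orientation, a synchronized status block might look roughly like the snippet below; the exact condition set and messages depend on the controller version, so treat this as illustrative only:
status:
  replicas: 2          # currently existing serving groups
  readyReplicas: 1     # serving groups whose Pods are all ready
  conditions:
  - type: Progressing
    status: "True"
  - type: Available
    status: "False"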
Deployment Example
Below is an example of deploying an inference workload using LeaderWorkerSet.
Prerequisites
Ensure the LeaderWorkerSet CRD is installed in your cluster. You do not need to install the LWS controller or operator.
# Example: Install LWS CRD only
kubectl apply -f https://github.com/kubernetes-sigs/lws/releases/download/v0.3.0/crd.yaml
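You can confirm that the CRD is registered before applying any workloads (the resource name below is the standard LWS CRD name):
# Verify the LeaderWorkerSet CRD exists
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io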
LWS Configuration
This example defines a deployment with 1 replica group. Each group consists of 1 Leader and 1 Worker (Size = 2).
lws-inference-example.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: llama-multinode
namespace: default
spec:
# Number of independent model replicas (Serving Groups)
replicas: 1
leaderWorkerTemplate:
# Total size of the group (1 Leader + 1 Worker)
size: 2
# Leader Pod Configuration
leaderTemplate:
metadata:
labels:
role: leader
model: llama-405b
spec:
containers:
- name: leader
image: vllm/vllm-openai:latest
env:
- name: HUGGING_FACE_HUB_TOKEN
value: $HUGGING_FACE_HUB_TOKEN
command:
- sh
- -c
- "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=2;
python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
resources:
limits:
nvidia.com/gpu: "8"
memory: 1124Gi
ephemeral-storage: 800Gi
requests:
ephemeral-storage: 800Gi
cpu: 125
ports:
- containerPort: 8080
name: http
readinessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 15
periodSeconds: 10
volumeMounts:
- mountPath: /dev/shm
name: dshm
volumes:
- name: dshm
emptyDir:
medium: Memory
sizeLimit: 15Gi
# Worker Pod Configuration
workerTemplate:
metadata:
labels:
role: worker
model: llama-405b
spec:
containers:
- name: worker
image: vllm/vllm-openai:latest
command:
- sh
- -c
- "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(ENTRY_ADDRESS)"
resources:
limits:
nvidia.com/gpu: "8"
memory: 1124Gi
ephemeral-storage: 800Gi
requests:
ephemeral-storage: 800Gi
cpu: 125
env:
- name: HUGGING_FACE_HUB_TOKEN
value: $HUGGING_FACE_HUB_TOKEN
volumeMounts:
- mountPath: /dev/shm
name: dshm
volumes:
- name: dshm
emptyDir:
medium: Memory
sizeLimit: 15Gi
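Apply the manifest to the cluster:
kubectl apply -f lws-inference-example.yaml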
Verifying Deployment
After applying the YAML, you can check the status using standard kubectl commands:
# Check the LeaderWorkerSet status
kubectl get lws llama-multinode
# Check the underlying Pods created by Kthena
kubectl get pods -l leaderworkerset.x-k8s.io/name=llama-multinode
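You can also inspect the ModelServing resource that Kthena generates from the LWS spec. Per the spec mapping above it shares the LeaderWorkerSet's name; the kubectl resource name shown here is an assumption and may differ depending on how the Kthena CRDs register their names:
# Inspect the generated ModelServing resource (resource name is an assumption)
kubectl get modelserving llama-multinode -o yaml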
Status & Troubleshooting
The LeaderWorkerSet status is automatically updated by Kthena:
- ReadyReplicas: Indicates how many serving groups are fully ready.
- Conditions: Provides details on the health and state of the deployment.
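Both fields can be queried directly with kubectl, for example:
# Number of ready serving groups
kubectl get lws llama-multinode -o jsonpath='{.status.readyReplicas}'
# Condition types and their statuses
kubectl get lws llama-multinode -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'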
If the LeaderWorkerSet is not progressing:
- Check if the CRD is installed correctly.
- Inspect the Kthena Controller logs for any validation errors regarding the LWS spec.
- Verify that the resource requests (GPUs, CPU) can be satisfied by the cluster.
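The commands below cover these checks. The controller namespace and deployment name are examples only and depend on how Kthena was installed in your cluster:
# 1. Confirm the LWS CRD is installed
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
# 2. Look for validation errors in the Kthena controller logs (namespace/name are assumptions)
kubectl logs -n kthena-system deployment/kthena-controller-manager | grep -i leaderworkerset
# 3. Check whether Pods are stuck Pending on GPU/CPU/memory requests
kubectl describe pods -l leaderworkerset.x-k8s.io/name=llama-multinode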