API Reference
Packages
workload.serving.volcano.sh/v1alpha1
Resource Types
- AutoscalingPolicy
- AutoscalingPolicyBinding
- AutoscalingPolicyBindingList
- AutoscalingPolicyList
- ModelBooster
- ModelBoosterList
- ModelServing
- ModelServingList
AutoscalingPolicy
AutoscalingPolicy defines the autoscaling policy configuration for model serving workloads. It specifies scaling rules, metrics, and behavior for automatic replica adjustment.
Appears in:
- AutoscalingPolicyList

| Field | Description | Default | Validation |
|---|---|---|---|
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | AutoscalingPolicy | | |
| spec AutoscalingPolicySpec | | | |
| status AutoscalingPolicyStatus | | | |
AutoscalingPolicyBehavior
AutoscalingPolicyBehavior defines the scaling behavior configuration for both scale up and scale down operations.
Appears in:
- AutoscalingPolicySpec

| Field | Description | Default | Validation |
|---|---|---|---|
| scaleUp AutoscalingPolicyScaleUpPolicy | ScaleUp defines the policy configuration for scaling up (increasing replicas). | | |
| scaleDown AutoscalingPolicyStablePolicy | ScaleDown defines the policy configuration for scaling down (decreasing replicas). | | |
AutoscalingPolicyBinding
AutoscalingPolicyBinding binds AutoscalingPolicy rules to specific ModelServing deployments. It enables either traditional metric-based scaling or multi-target optimization across heterogeneous hardware deployments.
Appears in:
- AutoscalingPolicyBindingList

| Field | Description | Default | Validation |
|---|---|---|---|
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | AutoscalingPolicyBinding | | |
| spec AutoscalingPolicyBindingSpec | | | |
| status AutoscalingPolicyBindingStatus | | | |
AutoscalingPolicyBindingList
AutoscalingPolicyBindingList contains a list of AutoscalingPolicyBinding objects.
| Field | Description | Default | Validation |
|---|---|---|---|
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | AutoscalingPolicyBindingList | | |
| items AutoscalingPolicyBinding array | | | |
AutoscalingPolicyBindingSpec
AutoscalingPolicyBindingSpec defines the desired state of AutoscalingPolicyBinding.
Appears in:
- AutoscalingPolicyBinding

| Field | Description | Default | Validation |
|---|---|---|---|
| policyRef LocalObjectReference | PolicyRef references the AutoscalingPolicy that defines the scaling rules and metrics. | | |
| heterogeneousTarget HeterogeneousTarget | HeterogeneousTarget enables optimization-based scaling across multiple ModelServing deployments with different hardware capabilities. This approach dynamically adjusts replica distribution across heterogeneous resources (e.g., H100/A100 GPUs) based on overall computing requirements. | | |
| homogeneousTarget HomogeneousTarget | HomogeneousTarget enables traditional metric-based scaling for a single ModelServing deployment. This approach adjusts replica count based on monitoring metrics and their target values. | | |
AutoscalingPolicyBindingStatus
AutoscalingPolicyBindingStatus defines the observed state of AutoscalingPolicyBinding.
Appears in:
- AutoscalingPolicyBinding
AutoscalingPolicyList
AutoscalingPolicyList contains a list of AutoscalingPolicy objects.
| Field | Description | Default | Validation |
|---|---|---|---|
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | AutoscalingPolicyList | | |
| items AutoscalingPolicy array | | | |
AutoscalingPolicyMetric
AutoscalingPolicyMetric defines a metric and its target value for scaling decisions.
Appears in:
- AutoscalingPolicySpec

| Field | Description | Default | Validation |
|---|---|---|---|
| metricName string | MetricName defines the name of the metric to monitor for scaling decisions. | | |
| targetValue Quantity | TargetValue defines the target value for the metric that triggers scaling operations. | | |
AutoscalingPolicyPanicPolicy
AutoscalingPolicyPanicPolicy defines the emergency scaling policy for handling sudden traffic surges.
Appears in:
- AutoscalingPolicyScaleUpPolicy

| Field | Description | Default | Validation |
|---|---|---|---|
| percent integer | Percent defines the maximum percentage of current instances to scale up during panic mode. | 1000 | Maximum: 1000 Minimum: 0 |
| panicThresholdPercent integer | PanicThresholdPercent defines the metric threshold percentage that triggers panic mode. When metrics exceed this percentage of target values, panic mode is activated. | 200 | Maximum: 1000 Minimum: 110 |
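
The panic threshold described above can be sketched as a simple comparison. This is a minimal illustration, not the controller's implementation: the function name `panic_triggered` is hypothetical, and the strict `>` comparison is an assumption based on the wording "exceed this percentage of target values".

```python
def panic_triggered(metric_value: float, target_value: float,
                    panic_threshold_percent: int = 200) -> bool:
    """Return True when the metric exceeds the panic threshold.

    Hypothetical helper. The default (200) and the allowed range
    (110 to 1000) come from the table above; the exact comparison
    used by the controller is an assumption.
    """
    if not 110 <= panic_threshold_percent <= 1000:
        raise ValueError("panicThresholdPercent must be between 110 and 1000")
    return metric_value > target_value * panic_threshold_percent / 100


# A metric at 250% of its target crosses the default 200% threshold.
print(panic_triggered(250.0, 100.0))
# A metric at 150% of its target does not.
print(panic_triggered(150.0, 100.0))
```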
AutoscalingPolicyScaleUpPolicy
AutoscalingPolicyScaleUpPolicy defines the scaling up policy configuration.
Appears in:
- AutoscalingPolicyBehavior

| Field | Description | Default | Validation |
|---|---|---|---|
| stablePolicy AutoscalingPolicyStablePolicy | StablePolicy defines the stable scaling policy that uses average metric values over time windows. This policy smooths out short-term fluctuations and avoids unnecessarily frequent scaling operations. | | |
| panicPolicy AutoscalingPolicyPanicPolicy | PanicPolicy defines the emergency scaling policy for handling sudden traffic spikes. This policy activates during rapid load surges to prevent service degradation or timeouts. | | |
AutoscalingPolicySpec
AutoscalingPolicySpec defines the desired state of AutoscalingPolicy.
Appears in:
- AutoscalingPolicy
- ModelBoosterSpec

| Field | Description | Default | Validation |
|---|---|---|---|
| tolerancePercent integer | TolerancePercent defines the percentage of deviation tolerated before scaling actions are triggered. current_replicas is the current number of instances, and target_replicas is the expected number of instances calculated from monitoring metrics. Scaling operations are performed only when abs(current_replicas - target_replicas) >= current_replicas * TolerancePercent / 100. | 10 | Maximum: 100 Minimum: 0 |
| metrics AutoscalingPolicyMetric array | Metrics defines the list of metrics used to evaluate scaling decisions. | | MinItems: 1 |
| behavior AutoscalingPolicyBehavior | Behavior defines the scaling behavior configuration for both scale up and scale down operations. | | |
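
The tolerance condition for `tolerancePercent` can be checked with a few lines of arithmetic. The sketch below transcribes the inequality from the field description; the function name `should_scale` is hypothetical and not part of the API.

```python
def should_scale(current_replicas: int, target_replicas: int,
                 tolerance_percent: int = 10) -> bool:
    """Apply the documented tolerance check.

    Scaling happens only when
    |current_replicas - target_replicas| >= current_replicas * tolerance_percent / 100.
    """
    deviation = abs(current_replicas - target_replicas)
    return deviation >= current_replicas * tolerance_percent / 100


# 10 -> 11 replicas: deviation 1 reaches the threshold 10 * 10 / 100 = 1.
print(should_scale(10, 11))
# 20 -> 21 replicas: deviation 1 is below the threshold 20 * 10 / 100 = 2.
print(should_scale(20, 21))
```

With the default of 10, larger deployments therefore need a proportionally larger gap between current and target replicas before any scaling takes place.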
AutoscalingPolicyStablePolicy
AutoscalingPolicyStablePolicy defines the stable scaling policy for both scale up and scale down operations.
Appears in:
- AutoscalingPolicyBehavior
- AutoscalingPolicyScaleUpPolicy

| Field | Description | Default | Validation |
|---|---|---|---|
| instances integer | Instances defines the maximum absolute number of instances to scale per period. | 1 | Minimum: 0 |
| percent integer | Percent defines the maximum percentage of current instances to scale per period. | 100 | Maximum: 1000 Minimum: 0 |
| selectPolicy SelectPolicyType | SelectPolicy determines the selection strategy for scaling operations. 'Or' means scaling is performed if either the Percent or Instances requirement is met. 'And' means scaling is performed only if both Percent and Instances requirements are met. | Or | Enum: [Or And] |
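
One possible reading of how `instances`, `percent`, and `selectPolicy` combine is sketched below. It assumes that a "requirement is met" when a scaling step stays within the corresponding limit, so `Or` yields the larger of the two limits and `And` the smaller. This interpretation, the function name `max_step`, and the rounding down of the percentage limit are all assumptions and are not confirmed by this reference.

```python
def max_step(current_replicas: int, instances: int = 1, percent: int = 100,
             select_policy: str = "Or") -> int:
    """Largest replica change allowed in one period (assumed interpretation)."""
    # Percentage limit expressed in replicas; rounding down is an assumption.
    percent_limit = current_replicas * percent // 100
    if select_policy == "Or":
        # A step is allowed if it satisfies either limit.
        return max(instances, percent_limit)
    if select_policy == "And":
        # A step is allowed only if it satisfies both limits.
        return min(instances, percent_limit)
    raise ValueError("selectPolicy must be 'Or' or 'And'")


# With 10 replicas, instances=4 and percent=50 (i.e. 5 replicas):
print(max_step(10, instances=4, percent=50, select_policy="Or"))
print(max_step(10, instances=4, percent=50, select_policy="And"))
```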
AutoscalingPolicyStatus
AutoscalingPolicyStatus defines the observed state of AutoscalingPolicy.
Appears in:
- AutoscalingPolicy
GangPolicy
GangPolicy defines the gang scheduling configuration.
Appears in:
- ServingGroup

| Field | Description | Default | Validation |
|---|---|---|---|
| minRoleReplicas object (keys:string, values:integer) | MinRoleReplicas defines the minimum number of replicas required for each role in gang scheduling. This map lets users specify a different minimum replica requirement for each role. Key: role name. Value: minimum number of replicas required for that role. Note: in practice, when determining the minTaskMember for a podGroup, the smaller of `MinRoleReplicas[role.Name]` and `role.Replicas` is used. For example, if you set `gangPolicy: {minRoleReplicas: {Prefill: 2, Decode: 2}}` and define the roles as `roles: [{name: P, replicas: 1, workerReplicas: 2}, {name: D, replicas: 3, workerReplicas: 1}]`, the resulting podGroup will have `minTaskMember: {P-0: 3, D-0: 4, D-1: 4}`, where P-0 is 1 entry pod + 2 worker pods, and D-0 and D-1 are each 1 entry pod + 3 worker pods. The replicas of P is min(minRoleReplicas['P'], role.Replicas) = min(2, 1) = 1. The replicas of D is min(minRoleReplicas['D'], role.Replicas) = min(2, 3) = 2. | | |
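
The minTaskMember calculation described for `minRoleReplicas` can be sketched as follows. This is an illustration under stated assumptions rather than the controller's code: the function name `min_task_members` is hypothetical, the map is assumed to be keyed by the role's own name and to contain every role, each replica is counted as 1 entry pod plus `workerReplicas` worker pods, and the input values are illustrative.

```python
from typing import Dict, List


def min_task_members(min_role_replicas: Dict[str, int],
                     roles: List[dict]) -> Dict[str, int]:
    """Compute per-replica pod counts for a podGroup (assumed logic)."""
    members: Dict[str, int] = {}
    for role in roles:
        name = role["name"]
        # Documented rule: take the smaller of the configured minimum
        # and the role's actual replica count.
        effective = min(min_role_replicas[name], role["replicas"])
        # One entry pod plus the worker pods of each replica.
        pods_per_replica = 1 + role["workerReplicas"]
        for index in range(effective):
            members[f"{name}-{index}"] = pods_per_replica
    return members


# Illustrative input: role names double as map keys (an assumption).
roles = [
    {"name": "P", "replicas": 1, "workerReplicas": 2},
    {"name": "D", "replicas": 3, "workerReplicas": 3},
]
print(min_task_members({"P": 2, "D": 2}, roles))
# {'P-0': 3, 'D-0': 4, 'D-1': 4}
```

Role P contributes one entry because min(2, 1) = 1, while role D contributes two because min(2, 3) = 2, even though it runs three replicas.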
HeterogeneousTarget
HeterogeneousTarget defines the configuration for optimization-based autoscaling across multiple deployments.
Appears in:
- AutoscalingPolicyBindingSpec

| Field | Description | Default | Validation |
|---|---|---|---|
| params HeterogeneousTargetParam array | Params defines the configuration parameters for multiple ModelServing groups to be optimized. | | MinItems: 1 |
| costExpansionRatePercent integer | CostExpansionRatePercent defines the percentage rate at which the cost expands during optimization calculations. | 200 | Minimum: 0 |
HeterogeneousTargetParam
HeterogeneousTargetParam defines the configuration parameters for a specific deployment type in heterogeneous scaling.
Appears in:
- HeterogeneousTarget

| Field | Description | Default | Validation |
|---|---|---|---|
| target Target | Target defines the scaling instance configuration for this deployment type. | | |
| cost integer | Cost defines the relative cost factor used in optimization calculations. This factor balances performance requirements against deployment costs. | | Minimum: 0 |
| minReplicas integer | MinReplicas defines the minimum number of replicas to maintain for this deployment type. | | Maximum: 1e+06 Minimum: 0 |
| maxReplicas integer | MaxReplicas defines the maximum number of replicas allowed for this deployment type. | | Maximum: 1e+06 Minimum: 1 |
HomogeneousTarget
HomogeneousTarget defines the configuration for traditional metric-based autoscaling of a single deployment.
Appears in:
- AutoscalingPolicyBindingSpec

| Field | Description | Default | Validation |
|---|---|---|---|
| target Target | Target defines the object to be monitored and scaled. | | |
| minReplicas integer | MinReplicas defines the minimum number of replicas to maintain. | | Maximum: 1e+06 Minimum: 0 |
| maxReplicas integer | MaxReplicas defines the maximum number of replicas allowed. | | Maximum: 1e+06 Minimum: 1 |
Metadata
Metadata is a simplified version of ObjectMeta in Kubernetes.
Appears in:
- PodTemplateSpec

| Field | Description | Default | Validation |
|---|---|---|---|
| labels object (keys:string, values:string) | Map of string keys and values that can be used to organize and categorize (scope and select) objects. May match selectors of replication controllers and services. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels | | |
| annotations object (keys:string, values:string) | Annotations is an unstructured key value map stored with a resource that may be set by external tools to store and retrieve arbitrary metadata. They are not queryable and should be preserved when modifying objects. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations | | |
MetricEndpoint
MetricEndpoint defines the endpoint configuration for scraping metrics from pods.
Appears in:
- Target

| Field | Description | Default | Validation |
|---|---|---|---|
| uri string | Uri defines the HTTP path where metrics are exposed (e.g., "/metrics"). | /metrics | |
| port integer | Port defines the network port where metrics are exposed by the pods. | 8100 | |
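
To show how the two MetricEndpoint fields fit together, the sketch below composes a scrape URL from a pod address and the documented defaults. The function name `metrics_url`, the `pod_ip` parameter, and the plain `http` scheme are assumptions for illustration; the reference only states that `uri` is an HTTP path and `port` is the port exposed by the pods.

```python
def metrics_url(pod_ip: str, port: int = 8100, uri: str = "/metrics") -> str:
    """Build a metrics URL from MetricEndpoint-style settings (hypothetical helper)."""
    # Defaults mirror the table above: port 8100 and path /metrics.
    return f"http://{pod_ip}:{port}{uri}"


print(metrics_url("10.0.0.12"))
# http://10.0.0.12:8100/metrics
```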
ModelBackend
ModelBackend defines the configuration for a model backend.
Appears in:
- ModelBoosterSpec

| Field | Description | Default | Validation |
|---|---|---|---|
| name string | Name is the name of the backend. It must not duplicate the name of another ModelBackend in the same ModelBooster CR. Note: updating the name deletes the old modelInfer and creates a new one. | | Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` |
| type ModelBackendType | Type is the type of the backend. | | Enum: [vLLM vLLMDisaggregated] |
| modelURI string | ModelURI is the URI from which the model is downloaded. Supported schemes: hf://, s3://, pvc://. | | Pattern: `^(hf://\|s3://\|pvc://).+` |
| cacheURI string | CacheURI is the URI where the downloaded model is stored. Supported schemes: hostpath://, pvc://. | | Pattern: `^(hostpath://\|pvc://).+` |
| envFrom EnvFromSource array | List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. Cannot be updated. | | |
| env EnvVar array | List of environment variables to set in the container. Supported names: "ENDPOINT": must be specified when the model is downloaded from S3; "RUNTIME_URL": default is http://localhost:8000; "RUNTIME_PORT": default is 8100; "RUNTIME_METRICS_PATH": default is /metrics; "HF_ENDPOINT": the URL of Hugging Face, default is https://huggingface.co/. Cannot be updated. | | |
| minReplicas integer | MinReplicas is the minimum number of replicas for the backend. | | Maximum: 1e+06 Minimum: 0 |
| maxReplicas integer | MaxReplicas is the maximum number of replicas for the backend. | | Maximum: 1e+06 Minimum: 1 |
| workers ModelWorker array | Workers is the list of workers associated with this backend. | | MaxItems: 1000 MinItems: 1 |
| schedulerName string | SchedulerName defines the name of the scheduler used by ModelServing for this backend. | | |
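
The validation patterns for `name`, `modelURI`, and `cacheURI` can be tried out locally before applying a manifest. The sketch below copies the three patterns from the table and checks them with Python's `re` module, which only approximates the API server's validation. The function name `validate_backend` and the sample values are hypothetical.

```python
import re
from typing import List

# Patterns copied from the ModelBackend validation column.
NAME_PATTERN = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")
MODEL_URI_PATTERN = re.compile(r"^(hf://|s3://|pvc://).+")
CACHE_URI_PATTERN = re.compile(r"^(hostpath://|pvc://).+")


def validate_backend(name: str, model_uri: str, cache_uri: str) -> List[str]:
    """Return the names of the fields that fail their pattern."""
    errors: List[str] = []
    # fullmatch avoids Python's "$ matches before a trailing newline" quirk.
    if not NAME_PATTERN.fullmatch(name):
        errors.append("name")
    if not MODEL_URI_PATTERN.match(model_uri):
        errors.append("modelURI")
    if not CACHE_URI_PATTERN.match(cache_uri):
        errors.append("cacheURI")
    return errors


# Illustrative values only.
print(validate_backend("backend-1", "hf://org/model", "pvc://model-cache"))
print(validate_backend("Bad_Name", "http://example.com/model", "hostpath:///data"))
```

The first call returns an empty list; the second reports `name` (uppercase letter and underscore) and `modelURI` (unsupported scheme).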
ModelBackendType
Underlying type: string
ModelBackendType defines the type of model backend.
Validation:
- Enum: [vLLM vLLMDisaggregated]
Appears in:
- ModelBackend

| Field | Description |
|---|---|
| vLLM | ModelBackendTypeVLLM represents a vLLM backend. |
| vLLMDisaggregated | ModelBackendTypeVLLMDisaggregated represents a disaggregated vLLM backend. |
| SGLang | ModelBackendTypeSGLang represents an SGLang backend. |
| MindIE | ModelBackendTypeMindIE represents a MindIE backend. |
| MindIEDisaggregated | ModelBackendTypeMindIEDisaggregated represents a disaggregated MindIE backend. |
ModelBooster
ModelBooster is the Schema for the models API.
Appears in:
- ModelBoosterList

| Field | Description | Default | Validation |
|---|---|---|---|
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelBooster | | |
| spec ModelBoosterSpec | | | |
| status ModelStatus | | | |
ModelBoosterList
ModelBoosterList contains a list of ModelBooster.
| Field | Description | Default | Validation |
|---|---|---|---|
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelBoosterList | | |
| items ModelBooster array | | | |
ModelBoosterSpec
ModelBoosterSpec defines the desired state of ModelBooster.
Appears in:
- ModelBooster

| Field | Description | Default | Validation |
|---|---|---|---|
| name string | Name is the name of the model. The ModelBooster CR name is restricted by Kubernetes (for example, it cannot contain uppercase letters), so this field is used to specify the ModelBooster name. | | MaxLength: 64 Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` |
| owner string | Owner is the owner of the model. | | |
| backend ModelBackend | Backend is the model backend associated with this model. ModelBackend is the minimum unit of inference instance. It can be vLLM or vLLMDisaggregated. | | |
| autoscalingPolicy AutoscalingPolicySpec | AutoscalingPolicy references the autoscaling policy to be used for this model. | | |
| modelMatch ModelMatch | ModelMatch defines the predicate used to match LLM inference requests to the given TargetModels. Multiple match conditions are ANDed together, i.e. the match evaluates to true only if all conditions are satisfied. | | |
ModelServing
ModelServing is the Schema for the LLM Serving API.
Appears in:
- ModelServingList

| Field | Description | Default | Validation |
|---|---|---|---|
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelServing | | |
| spec ModelServingSpec | | | |
| status ModelServingStatus | | | |
ModelServingList
ModelServingList contains a list of ModelServing.
| Field | Description | Default | Validation |
|---|---|---|---|
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelServingList | | |
| items ModelServing array | | | |
ModelServingSpec
ModelServingSpec defines the specification of the ModelServing resource.
Appears in:
- ModelServing

| Field | Description | Default | Validation |
|---|---|---|---|
| replicas integer | Number of ServingGroups, that is, the number of instances that run serving tasks. Defaults to 1. | 1 | |
| schedulerName string | SchedulerName defines the name of the scheduler used by ModelServing. | volcano | |
| template ServingGroup | Template defines the template for ServingGroup. | | |
| rolloutStrategy RolloutStrategy | RolloutStrategy defines the strategy that will be applied to update replicas. | | |
| recoveryPolicy RecoveryPolicy | RecoveryPolicy defines the recovery policy for rebuilding a failed Pod. | RoleRecreate | Enum: [ServingGroupRecreate RoleRecreate None] |
ModelServingStatus
ModelServingStatus defines the observed state of ModelServing.
Appears in:
- ModelServing

| Field | Description | Default | Validation |
|---|---|---|---|
| observedGeneration integer | ObservedGeneration is the most recent generation observed for ModelServing. It corresponds to the ModelServing's generation, which is updated on mutation by the API Server. | | |
| replicas integer | Replicas tracks the total number of ServingGroups that have been created (updated or not, ready or not). | | |
| currentReplicas integer | CurrentReplicas is the number of ServingGroups created by the ModelServing controller from the ModelServing version. | | |
| updatedReplicas integer | UpdatedReplicas tracks the number of ServingGroups that have been updated (ready or not). | | |
| availableReplicas integer | AvailableReplicas tracks the number of ServingGroups that are in the ready state (updated or not). | | |
ModelStatus
ModelStatus defines the observed state of ModelBooster.
Appears in:
- ModelBooster

| Field | Description | Default | Validation |
|---|---|---|---|
| observedGeneration integer | ObservedGeneration tracks the observed generation. | | |
ModelWorker
ModelWorker defines the model worker configuration.
Appears in:
- ModelBackend

| Field | Description | Default | Validation |
|---|---|---|---|
| type ModelWorkerType | Type is the type of the model worker. | server | Enum: [server prefill decode controller coordinator] |
| image string | Image is the container image for the worker. | | |
| replicas integer | Replicas is the number of replicas for the worker. | | Maximum: 1e+06 Minimum: 0 |
| pods integer | Pods is the number of pods for the worker. | | Maximum: 1e+06 Minimum: 0 |
| resources ResourceRequirements | Resources specifies the resource requirements for the worker. | | |
| affinity Affinity | Affinity specifies the affinity rules for scheduling the worker pods. | | |
| config JSON | Config contains worker-specific configuration in JSON format. For the vLLM options, see https://docs.vllm.ai/en/stable/configuration/engine_args.html | | |
ModelWorkerType
Underlying type: string
ModelWorkerType defines the type of model worker.
Validation:
- Enum: [server prefill decode controller coordinator]
Appears in:
- ModelWorker

| Field | Description |
|---|---|
| server | ModelWorkerTypeServer represents a server worker. |
| prefill | ModelWorkerTypePrefill represents a prefill worker. |
| decode | ModelWorkerTypeDecode represents a decode worker. |
| controller | ModelWorkerTypeController represents a controller worker. |
| coordinator | ModelWorkerTypeCoordinator represents a coordinator worker. |
NetworkTopology
NetworkTopology defines the network topology affinity scheduling policy for the roles and the group. It takes effect only when the scheduler supports the network topology feature.
Appears in:
- ServingGroup

| Field | Description | Default | Validation |
|---|---|---|---|
| groupPolicy NetworkTopologySpec | GroupPolicy defines the network topology scheduling requirement of all the instances within the ServingGroup. | | |
| rolePolicy NetworkTopologySpec | RolePolicy defines the fine-grained network topology scheduling requirement for instances of a role. | | |
PodTemplateSpec
PodTemplateSpec describes the data a pod should have when created from a template.
Appears in:
- Role

| Field | Description | Default | Validation |
|---|---|---|---|
| metadata Metadata | Refer to Kubernetes API documentation for fields of metadata. | | |
| spec PodSpec | Specification of the desired behavior of the pod. | | |
RecoveryPolicy
Underlying type: string
Appears in:
- ModelServingSpec

| Field | Description |
|---|---|
| ServingGroupRecreate | ServingGroupRecreate recreates all the pods in the ServingGroup if (1) any individual pod in the group is recreated, or (2) any container or init container in a pod is restarted. This ensures that all pods and containers in the group are started at the same time. |
| RoleRecreate | RoleRecreate recreates all pods in one Role if (1) any individual pod in the group is recreated, or (2) any container or init container in a pod is restarted. |
| None | NoneRestartPolicy follows the same behavior as the default pod or deployment. |
Role
Role defines the specific pod instance role that performs the inference task.
Appears in:
- ServingGroup

| Field | Description | Default | Validation |
|---|---|---|---|
| name string | The name of a role. Name must be unique within a ServingGroup. | | MaxLength: 12 Pattern: `^[a-zA-Z0-9]([-a-zA-Z0-9]*[a-zA-Z0-9])?$` |
| replicas integer | The number of replicas of a role. For example, in Disaggregated Prefilling, setting the replica count for both the P and D roles to 1 results in a 1P1D deployment configuration. The same approach can be used to configure an xPyD deployment. Defaults to 1. | 1 | |
| entryTemplate PodTemplateSpec | EntryTemplate defines the template for the entry pod of a role. Required: currently, a role must have exactly one entry pod. | | |
| workerReplicas integer | WorkerReplicas defines the number of worker pods of a role. Required: the number of worker-pod replicas must be set. | | |
| workerTemplate PodTemplateSpec | WorkerTemplate defines the template for the worker pod of a role. | | |
RollingUpdateConfiguration
RollingUpdateConfiguration defines the parameters to be used for RollingUpdateStrategyType.
Appears in:
- RolloutStrategy

| Field | Description | Default | Validation |
|---|---|---|---|
| maxUnavailable IntOrString | The maximum number of replicas that can be unavailable during the update. Value can be an absolute number (ex: 5) or a percentage of total replicas at the start of the update (ex: 10%). Absolute number is calculated from percentage by rounding down. This cannot be 0 if MaxSurge is 0. By default, a fixed value of 1 is used. | 1 | XIntOrString: {} |
| maxSurge IntOrString | The maximum number of replicas that can be scheduled above the original number of replicas. Value can be an absolute number (ex: 5) or a percentage of total replicas at the start of the update (ex: 10%). Absolute number is calculated from percentage by rounding up. By default, a value of 0 is used. | 0 | XIntOrString: {} |
| partition integer | Partition indicates the ordinal at which the ModelServing should be partitioned for updates. During a rolling update, all ServingGroups from ordinal Replicas-1 to Partition are updated. All ServingGroups from ordinal Partition-1 to 0 remain untouched. The default value is 0. | | |
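
The rounding rules and the partition semantics above can be worked through with a short sketch. It is an illustration only: the helper names `resolve` and `updated_ordinals` are hypothetical, and percentage values are assumed to be strings that end in `%`.

```python
from typing import List, Union


def resolve(value: Union[int, str], replicas: int, round_up: bool) -> int:
    """Turn an IntOrString value into an absolute replica count."""
    if isinstance(value, int):
        return value
    if not value.endswith("%"):
        raise ValueError("string values are assumed to be percentages")
    scaled = replicas * int(value[:-1])
    # maxSurge rounds up, maxUnavailable rounds down (integer arithmetic).
    return -(-scaled // 100) if round_up else scaled // 100


def updated_ordinals(replicas: int, partition: int = 0) -> List[int]:
    """Ordinals updated during a rollout: Replicas-1 down to Partition."""
    return list(range(replicas - 1, partition - 1, -1))


# 25% of 10 replicas is 2.5: maxUnavailable rounds down, maxSurge rounds up.
print(resolve("25%", 10, round_up=False))
print(resolve("25%", 10, round_up=True))
# With 5 replicas and partition 2, ordinals 4, 3, 2 are updated; 1 and 0 stay.
print(updated_ordinals(5, partition=2))
```

Setting `partition` to a value above 0 therefore holds back the lowest ordinals, which can be used for a staged rollout.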
RolloutStrategy
RolloutStrategy defines the strategy that the ModelServing controller will use to perform replica updates.
Appears in:
- ModelServingSpec

| Field | Description | Default | Validation |
|---|---|---|---|
| type RolloutStrategyType | Type defines the rollout strategy. It can only be "ServingGroupRollingUpdate" for now. | ServingGroupRollingUpdate | Enum: [ServingGroupRollingUpdate] |
| rollingUpdateConfiguration RollingUpdateConfiguration | RollingUpdateConfiguration defines the parameters to be used when type is ServingGroupRollingUpdate. Optional. | | |
RolloutStrategyType
Underlying type: string
Appears in:
- RolloutStrategy

| Field | Description |
|---|---|
| ServingGroupRollingUpdate | ServingGroupRollingUpdate indicates that ServingGroup replicas will be updated one by one. |
SelectPolicyType
Underlying type: string
SelectPolicyType defines the selection strategy type for scaling operations.
Validation:
- Enum: [Or And]
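Appears in:
- AutoscalingPolicyStablePolicy

| Field | Description |
|---|---|
| Or | Scaling is performed if either the Percent or Instances requirement is met. |
| And | Scaling is performed only if both Percent and Instances requirements are met. |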
ServingGroup
ServingGroup is the smallest unit that completes the inference task.
Appears in:
- ModelServingSpec

| Field | Description | Default | Validation |
|---|---|---|---|
| restartGracePeriodSeconds integer | RestartGracePeriodSeconds defines the grace period for the controller to rebuild the ServingGroup when an error occurs. Defaults to 0 (the ServingGroup is rebuilt immediately after an error). | 0 | |
| gangPolicy GangPolicy | GangPolicy defines the gang scheduler config. | | |
| networkTopology NetworkTopology | NetworkTopology defines the network topology affinity scheduling policy for the roles of the ServingGroup. It works only when the scheduler supports network topology-aware scheduling. | | |
| roles Role array | | | MaxItems: 4 MinItems: 1 |
SubTarget
Appears in:
- Target

| Field | Description | Default | Validation |
|---|---|---|---|
| kind string | | | |
| name string | | | |
Target
Target defines a ModelServing deployment that can be monitored and scaled.
Appears in:
- HeterogeneousTargetParam
- HomogeneousTarget

| Field | Description | Default | Validation |
|---|---|---|---|
| targetRef ObjectReference | TargetRef references the target object to be monitored and scaled. Default target GVK is ModelServing. Currently supported kinds: ModelServing. | | |
| subTargets SubTarget | SubTargets defines the sub-target object to be monitored and scaled. Currently supported kinds: Role when TargetRef kind is ModelServing. | | |
| metricEndpoint MetricEndpoint | MetricEndpoint defines the configuration for scraping metrics from the target pods. | | |