Version: v0.2.0

API Reference

Packages

workload.serving.volcano.sh/v1alpha1

Resource Types

AutoscalingPolicy

AutoscalingPolicy defines the autoscaling policy configuration for model serving workloads. It specifies scaling rules, metrics, and behavior for automatic replica adjustment.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `AutoscalingPolicy` | | |
| `spec` _AutoscalingPolicySpec_ | | | |
| `status` _AutoscalingPolicyStatus_ | | | |

AutoscalingPolicyBehavior

AutoscalingPolicyBehavior defines the scaling behavior configuration for both scale up and scale down operations.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `scaleUp` _AutoscalingPolicyScaleUpPolicy_ | ScaleUp defines the policy configuration for scaling up (increasing replicas). | | |
| `scaleDown` _AutoscalingPolicyStablePolicy_ | ScaleDown defines the policy configuration for scaling down (decreasing replicas). | | |

AutoscalingPolicyBinding

AutoscalingPolicyBinding binds AutoscalingPolicy rules to specific ModelServing deployments. It enables either traditional metric-based scaling or multi-target optimization across heterogeneous hardware deployments.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `AutoscalingPolicyBinding` | | |
| `spec` _AutoscalingPolicyBindingSpec_ | | | |
| `status` _AutoscalingPolicyBindingStatus_ | | | |

AutoscalingPolicyBindingList

AutoscalingPolicyBindingList contains a list of AutoscalingPolicyBinding objects.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `AutoscalingPolicyBindingList` | | |
| `items` _AutoscalingPolicyBinding array_ | | | |

AutoscalingPolicyBindingSpec

AutoscalingPolicyBindingSpec defines the desired state of AutoscalingPolicyBinding.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `policyRef` _LocalObjectReference_ | PolicyRef references the AutoscalingPolicy that defines the scaling rules and metrics. | | |
| `heterogeneousTarget` _HeterogeneousTarget_ | HeterogeneousTarget enables optimization-based scaling across multiple ModelServing deployments with different hardware capabilities. This approach dynamically adjusts the replica distribution across heterogeneous resources (e.g., H100/A100 GPUs) based on overall computing requirements. | | |
| `homogeneousTarget` _HomogeneousTarget_ | HomogeneousTarget enables traditional metric-based scaling for a single ModelServing deployment. This approach adjusts the replica count based on monitoring metrics and their target values. | | |
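A minimal sketch of the homogeneous case, assuming the field names exactly as listed above; every object name here is hypothetical:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: AutoscalingPolicyBinding
metadata:
  name: demo-binding            # illustrative name
spec:
  policyRef:
    name: demo-policy           # an existing AutoscalingPolicy
  homogeneousTarget:
    target:
      targetRef:
        kind: ModelServing      # currently the only supported target kind
        name: demo-serving      # illustrative name
    minReplicas: 1
    maxReplicas: 8
```

Exactly one of `homogeneousTarget` or `heterogeneousTarget` would normally be set, since they select mutually exclusive scaling modes.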

AutoscalingPolicyBindingStatus

AutoscalingPolicyBindingStatus defines the observed state of AutoscalingPolicyBinding.

Appears in:

AutoscalingPolicyList

AutoscalingPolicyList contains a list of AutoscalingPolicy objects.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `AutoscalingPolicyList` | | |
| `items` _AutoscalingPolicy array_ | | | |

AutoscalingPolicyMetric

AutoscalingPolicyMetric defines a metric and its target value for scaling decisions.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `metricName` _string_ | MetricName defines the name of the metric to monitor for scaling decisions. | | |
| `targetValue` _Quantity_ | TargetValue defines the target value for the metric that triggers scaling operations. | | |

AutoscalingPolicyPanicPolicy

AutoscalingPolicyPanicPolicy defines the emergency scaling policy for handling sudden traffic surges.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `percent` _integer_ | Percent defines the maximum percentage of current instances to scale up during panic mode. | 1000 | Maximum: 1000, Minimum: 0 |
| `panicThresholdPercent` _integer_ | PanicThresholdPercent defines the metric threshold percentage that triggers panic mode. When metrics exceed this percentage of their target values, panic mode is activated. | 200 | Maximum: 1000, Minimum: 110 |

AutoscalingPolicyScaleUpPolicy

AutoscalingPolicyScaleUpPolicy defines the scaling up policy configuration.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `stablePolicy` _AutoscalingPolicyStablePolicy_ | StablePolicy defines the stable scaling policy that uses average metric values over time windows. This policy smooths out short-term fluctuations and avoids unnecessary frequent scaling operations. | | |
| `panicPolicy` _AutoscalingPolicyPanicPolicy_ | PanicPolicy defines the emergency scaling policy for handling sudden traffic spikes. This policy activates during rapid load surges to prevent service degradation or timeouts. | | |

AutoscalingPolicySpec

AutoscalingPolicySpec defines the desired state of AutoscalingPolicy.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `tolerancePercent` _integer_ | TolerancePercent defines the percentage of deviation tolerated before scaling actions are triggered. current_replicas is the current number of instances, while target_replicas is the expected number of instances calculated from monitoring metrics. Scaling operations are performed only when \|current_replicas - target_replicas\| >= current_replicas * TolerancePercent / 100. | 10 | Maximum: 100, Minimum: 0 |
| `metrics` _AutoscalingPolicyMetric array_ | Metrics defines the list of metrics used to evaluate scaling decisions. | | MinItems: 1 |
| `behavior` _AutoscalingPolicyBehavior_ | Behavior defines the scaling behavior configuration for both scale-up and scale-down operations. | | |
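Putting these fields together: with `tolerancePercent: 10` and 10 current replicas, scaling only fires once the computed target differs by at least 10 * 10 / 100 = 1 replica. A minimal policy sketch; the metric name and all numeric values are illustrative assumptions, not taken from this reference:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: AutoscalingPolicy
metadata:
  name: demo-policy                 # illustrative name
spec:
  tolerancePercent: 10
  metrics:
    - metricName: num_requests_running   # hypothetical metric name
      targetValue: "8"
  behavior:
    scaleUp:
      stablePolicy:
        instances: 2
        percent: 100
        selectPolicy: Or            # act if either limit allows it
      panicPolicy:
        panicThresholdPercent: 200  # panic once metrics exceed 2x target
        percent: 1000
    scaleDown:                      # scaleDown is a stable policy directly
      instances: 1
      percent: 100
      selectPolicy: And
```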

AutoscalingPolicyStablePolicy

AutoscalingPolicyStablePolicy defines the stable scaling policy for both scale up and scale down operations.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `instances` _integer_ | Instances defines the maximum absolute number of instances to scale per period. | 1 | Minimum: 0 |
| `percent` _integer_ | Percent defines the maximum percentage of current instances to scale per period. | 100 | Maximum: 1000, Minimum: 0 |
| `selectPolicy` _SelectPolicyType_ | SelectPolicy determines the selection strategy for scaling operations. 'Or' means scaling is performed if either the Percent or the Instances requirement is met. 'And' means scaling is performed only if both requirements are met. | Or | Enum: [Or And] |

AutoscalingPolicyStatus

AutoscalingPolicyStatus defines the observed state of AutoscalingPolicy.

Appears in:

GangPolicy

GangPolicy defines the gang scheduling configuration.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `minRoleReplicas` _object (keys: string, values: integer)_ | MinRoleReplicas defines the minimum number of replicas required for each role in gang scheduling. The map key is a role name and the value is the minimum number of replicas required for that role, which lets users specify different minimum replica requirements for different roles. Note: in practice, when determining the minTaskMember for a podGroup, the controller takes the minimum of MinRoleReplicas[role.Name] and role.Replicas. | | |

For example, if you set:

```yaml
gangPolicy:
  minRoleReplicas:
    P: 2
    D: 2
```

and define the roles as:

```yaml
roles:
  - name: P
    replicas: 1
    workerReplicas: 2
  - name: D
    replicas: 3
    workerReplicas: 3
```

the resulting podGroup will have:

```yaml
minTaskMember:
  P-0: 3  # 1 entry pod + 2 worker pods
  D-0: 4  # 1 entry pod + 3 worker pods
  D-1: 4  # 1 entry pod + 3 worker pods
```

The number of gang-required replicas of P is min(minRoleReplicas['P'], role.Replicas) = min(2, 1) = 1, and that of D is min(minRoleReplicas['D'], role.Replicas) = min(2, 3) = 2.

HeterogeneousTarget

HeterogeneousTarget defines the configuration for optimization-based autoscaling across multiple deployments.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `params` _HeterogeneousTargetParam array_ | Params defines the configuration parameters for the multiple ModelServing groups to be optimized. | | MinItems: 1 |
| `costExpansionRatePercent` _integer_ | CostExpansionRatePercent defines the percentage rate at which the cost expands during optimization calculations. | 200 | Minimum: 0 |

HeterogeneousTargetParam

HeterogeneousTargetParam defines the configuration parameters for a specific deployment type in heterogeneous scaling.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `target` _Target_ | Target defines the scaling instance configuration for this deployment type. | | |
| `cost` _integer_ | Cost defines the relative cost factor used in optimization calculations. This factor balances performance requirements against deployment costs. | | Minimum: 0 |
| `minReplicas` _integer_ | MinReplicas defines the minimum number of replicas to maintain for this deployment type. | | Maximum: 1e+06, Minimum: 0 |
| `maxReplicas` _integer_ | MaxReplicas defines the maximum number of replicas allowed for this deployment type. | | Maximum: 1e+06, Minimum: 1 |
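A sketch of a `heterogeneousTarget` fragment inside an AutoscalingPolicyBinding spec, treating `cost` as a unitless relative factor as described above; the deployment names and cost values are illustrative assumptions:

```yaml
spec:
  policyRef:
    name: demo-policy              # illustrative name
  heterogeneousTarget:
    costExpansionRatePercent: 200  # the default
    params:
      - target:
          targetRef:
            kind: ModelServing
            name: serving-h100     # e.g. an H100-backed deployment
        cost: 4                    # relatively expensive capacity
        minReplicas: 0
        maxReplicas: 4
      - target:
          targetRef:
            kind: ModelServing
            name: serving-a100     # e.g. an A100-backed deployment
        cost: 1                    # cheaper capacity, preferred when sufficient
        minReplicas: 1
        maxReplicas: 16
```

The optimizer then distributes the overall required compute across the two deployments, weighing each group's `cost` against the metric targets of the referenced policy.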

HomogeneousTarget

HomogeneousTarget defines the configuration for traditional metric-based autoscaling of a single deployment.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `target` _Target_ | Target defines the object to be monitored and scaled. | | |
| `minReplicas` _integer_ | MinReplicas defines the minimum number of replicas to maintain. | | Maximum: 1e+06, Minimum: 0 |
| `maxReplicas` _integer_ | MaxReplicas defines the maximum number of replicas allowed. | | Maximum: 1e+06, Minimum: 1 |

Metadata

Metadata is a simplified version of ObjectMeta in Kubernetes.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `labels` _object (keys: string, values: string)_ | Map of string keys and values that can be used to organize and categorize (scope and select) objects. May match selectors of replication controllers and services. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels | | |
| `annotations` _object (keys: string, values: string)_ | Annotations is an unstructured key-value map stored with a resource that may be set by external tools to store and retrieve arbitrary metadata. They are not queryable and should be preserved when modifying objects. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations | | |

MetricEndpoint

MetricEndpoint defines the endpoint configuration for scraping metrics from pods.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `uri` _string_ | Uri defines the HTTP path where metrics are exposed (e.g., "/metrics"). | /metrics | |
| `port` _integer_ | Port defines the network port where metrics are exposed by the pods. | 8100 | |

ModelBackend

ModelBackend defines the configuration for a model backend.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name is the name of the backend. It must not duplicate the name of any other ModelBackend in the same ModelBooster CR. Note: updating the name causes the old modelInfer to be deleted and a new modelInfer to be created. | | Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` |
| `type` _ModelBackendType_ | Type is the type of the backend. | | Enum: [vLLM vLLMDisaggregated] |
| `modelURI` _string_ | ModelURI is the URI from which the model is downloaded. Supported schemes: hf://, s3://, pvc://. | | Pattern: `^(hf://\|s3://\|pvc://).+` |
| `cacheURI` _string_ | CacheURI is the URI where the downloaded model is stored. Supported schemes: hostpath://, pvc://. | | Pattern: `^(hostpath://\|pvc://).+` |
| `envFrom` _EnvFromSource array_ | List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source takes precedence. Values defined by an Env with a duplicate key take precedence. Cannot be updated. | | |
| `env` _EnvVar array_ | List of environment variables to set in the container. Supported names: "ENDPOINT": required when downloading the model from s3. "RUNTIME_URL": defaults to http://localhost:8000. "RUNTIME_PORT": defaults to 8100. "RUNTIME_METRICS_PATH": defaults to /metrics. "HF_ENDPOINT": the Hugging Face URL; defaults to https://huggingface.co/. Cannot be updated. | | |
| `minReplicas` _integer_ | MinReplicas is the minimum number of replicas for the backend. | | Maximum: 1e+06, Minimum: 0 |
| `maxReplicas` _integer_ | MaxReplicas is the maximum number of replicas for the backend. | | Maximum: 1e+06, Minimum: 1 |
| `workers` _ModelWorker array_ | Workers is the list of workers associated with this backend. | | MaxItems: 1000, MinItems: 1 |
| `schedulerName` _string_ | SchedulerName defines the name of the scheduler used by ModelServing for this backend. | | |
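A hedged sketch of a `backend` block as it might appear in a ModelBoosterSpec; the model URI, image, and PVC names are hypothetical placeholders:

```yaml
backend:
  name: demo-vllm                       # illustrative name
  type: vLLM
  modelURI: hf://org/model              # hf://, s3://, or pvc://
  cacheURI: pvc://model-cache           # hostpath:// or pvc://
  env:
    - name: HF_ENDPOINT                 # one of the supported names above
      value: https://huggingface.co/
  minReplicas: 1
  maxReplicas: 4
  workers:
    - type: server
      image: example.com/vllm:latest    # illustrative image
      replicas: 1
```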

ModelBackendType

Underlying type: string

ModelBackendType defines the type of model backend.

Validation:

  • Enum: [vLLM vLLMDisaggregated]

Appears in:

| Field | Description |
| --- | --- |
| `vLLM` | ModelBackendTypeVLLM represents a vLLM backend. |
| `vLLMDisaggregated` | ModelBackendTypeVLLMDisaggregated represents a disaggregated vLLM backend. |
| `SGLang` | ModelBackendTypeSGLang represents an SGLang backend. |
| `MindIE` | ModelBackendTypeMindIE represents a MindIE backend. |
| `MindIEDisaggregated` | ModelBackendTypeMindIEDisaggregated represents a disaggregated MindIE backend. |

ModelBooster

ModelBooster is the Schema for the models API.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `ModelBooster` | | |
| `spec` _ModelBoosterSpec_ | | | |
| `status` _ModelStatus_ | | | |

ModelBoosterList

ModelBoosterList contains a list of ModelBooster.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `ModelBoosterList` | | |
| `items` _ModelBooster array_ | | | |

ModelBoosterSpec

ModelBoosterSpec defines the desired state of ModelBooster.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name is the name of the model. The ModelBooster CR name is restricted by Kubernetes (for example, it can't contain uppercase letters), so this field is used to specify the ModelBooster name. | | MaxLength: 64, Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` |
| `owner` _string_ | Owner is the owner of the model. | | |
| `backend` _ModelBackend_ | Backend is the model backend associated with this model. ModelBackend is the minimum unit of inference instance; it can be vLLM or vLLMDisaggregated. | | |
| `autoscalingPolicy` _AutoscalingPolicySpec_ | AutoscalingPolicy references the autoscaling policy to be used for this model. | | |
| `modelMatch` _ModelMatch_ | ModelMatch defines the predicate used to match LLM inference requests to a given TargetModels. Multiple match conditions are ANDed together, i.e. the match evaluates to true only if all conditions are satisfied. | | |

ModelServing

ModelServing is the Schema for the LLM Serving API.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `ModelServing` | | |
| `spec` _ModelServingSpec_ | | | |
| `status` _ModelServingStatus_ | | | |

ModelServingList

ModelServingList contains a list of ModelServing objects.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `ModelServingList` | | |
| `items` _ModelServing array_ | | | |

ModelServingSpec

ModelServingSpec defines the specification of the ModelServing resource.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `replicas` _integer_ | Number of ServingGroups, i.e. the number of instances that run serving tasks. Defaults to 1. | 1 | |
| `schedulerName` _string_ | SchedulerName defines the name of the scheduler used by ModelServing. | volcano | |
| `template` _ServingGroup_ | Template defines the template for the ServingGroup. | | |
| `rolloutStrategy` _RolloutStrategy_ | RolloutStrategy defines the strategy that will be applied to update replicas. | | |
| `recoveryPolicy` _RecoveryPolicy_ | RecoveryPolicy defines the recovery policy for failed Pods to be rebuilt. | RoleRecreate | Enum: [ServingGroupRecreate RoleRecreate None] |
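A minimal ModelServing sketch using the defaults listed above; all names are illustrative, and the role pod templates are omitted for brevity:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelServing
metadata:
  name: demo-serving              # illustrative name
spec:
  replicas: 2                     # two ServingGroups
  schedulerName: volcano          # the default
  recoveryPolicy: RoleRecreate    # the default
  rolloutStrategy:
    type: ServingGroupRollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 1
      maxSurge: 0
  template:
    roles:
      - name: server              # illustrative single role
        replicas: 1
        workerReplicas: 1
        # entryTemplate / workerTemplate omitted for brevity
```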

ModelServingStatus

ModelServingStatus defines the observed state of ModelServing.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `observedGeneration` _integer_ | observedGeneration is the most recent generation observed for this ModelServing. It corresponds to the ModelServing's generation, which is updated on mutation by the API Server. | | |
| `replicas` _integer_ | Replicas tracks the total number of ServingGroups that have been created (updated or not, ready or not). | | |
| `currentReplicas` _integer_ | CurrentReplicas is the number of ServingGroups created by the ModelServing controller from the current ModelServing version. | | |
| `updatedReplicas` _integer_ | UpdatedReplicas tracks the number of ServingGroups that have been updated (ready or not). | | |
| `availableReplicas` _integer_ | AvailableReplicas tracks the number of ServingGroups that are in the ready state (updated or not). | | |

ModelStatus

ModelStatus defines the observed state of ModelBooster.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `observedGeneration` _integer_ | ObservedGeneration keeps track of the most recently observed generation. | | |

ModelWorker

ModelWorker defines the model worker configuration.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `type` _ModelWorkerType_ | Type is the type of the model worker. | server | Enum: [server prefill decode controller coordinator] |
| `image` _string_ | Image is the container image for the worker. | | |
| `replicas` _integer_ | Replicas is the number of replicas for the worker. | | Maximum: 1e+06, Minimum: 0 |
| `pods` _integer_ | Pods is the number of pods for the worker. | | Maximum: 1e+06, Minimum: 0 |
| `resources` _ResourceRequirements_ | Resources specifies the resource requirements for the worker. | | |
| `affinity` _Affinity_ | Affinity specifies the affinity rules for scheduling the worker pods. | | |
| `config` _JSON_ | Config contains worker-specific configuration in JSON format. vLLM configuration options are listed at https://docs.vllm.ai/en/stable/configuration/engine_args.html | | |
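Since `config` is free-form JSON, the accepted keys depend entirely on the backend engine. A hypothetical vLLM-style worker fragment; the image and the engine-argument names are assumptions drawn from vLLM's engine-args documentation, not from this reference:

```yaml
workers:
  - type: server
    image: example.com/vllm:latest     # illustrative image
    replicas: 1
    resources:
      limits:
        nvidia.com/gpu: "1"
    config:
      tensor-parallel-size: 1          # assumed vLLM engine arguments
      max-model-len: 8192
```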

ModelWorkerType

Underlying type: string

ModelWorkerType defines the type of model worker.

Validation:

  • Enum: [server prefill decode controller coordinator]

Appears in:

| Field | Description |
| --- | --- |
| `server` | ModelWorkerTypeServer represents a server worker. |
| `prefill` | ModelWorkerTypePrefill represents a prefill worker. |
| `decode` | ModelWorkerTypeDecode represents a decode worker. |
| `controller` | ModelWorkerTypeController represents a controller worker. |
| `coordinator` | ModelWorkerTypeCoordinator represents a coordinator worker. |

NetworkTopology

NetworkTopology defines the network topology affinity scheduling policy for the roles and the group. It works only when the scheduler supports the network topology feature.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `groupPolicy` _NetworkTopologySpec_ | GroupPolicy defines the network topology scheduling requirement for all instances within the ServingGroup. | | |
| `rolePolicy` _NetworkTopologySpec_ | RolePolicy defines the fine-grained network topology scheduling requirement for the instances of a role. | | |

PodTemplateSpec

PodTemplateSpec describes the data a pod should have when created from a template

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `metadata` _Metadata_ | Refer to the Kubernetes API documentation for the fields of metadata. | | |
| `spec` _PodSpec_ | Specification of the desired behavior of the pod. | | |

RecoveryPolicy

Underlying type: string

Appears in:

| Field | Description |
| --- | --- |
| `ServingGroupRecreate` | ServingGroupRecreate will recreate all pods in the ServingGroup if: 1. any individual pod in the group is recreated; 2. any container/init-container in a pod is restarted. This ensures that all pods/containers in the group start at the same time. |
| `RoleRecreate` | RoleRecreate will recreate all pods in one Role if: 1. any individual pod in the group is recreated; 2. any container/init-container in a pod is restarted. |
| `None` | None follows the same behavior as a default pod or deployment. |

Role

Role defines the specific pod instance role that performs the inference task.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | The name of a role. The name must be unique within a ServingGroup. | | MaxLength: 12, Pattern: `^[a-zA-Z0-9]([-a-zA-Z0-9]*[a-zA-Z0-9])?$` |
| `replicas` _integer_ | The number of replicas of a certain role. For example, in Disaggregated Prefilling, setting the replica count of both the P and D roles to 1 results in a 1P1D deployment configuration. The same approach can be applied to configure an xPyD deployment scenario. Defaults to 1. | 1 | |
| `entryTemplate` _PodTemplateSpec_ | EntryTemplate defines the template for the entry pod of a role. Required: currently, a role must have exactly one entry pod. | | |
| `workerReplicas` _integer_ | WorkerReplicas defines the number of worker pods of a role. Required: the number of worker-pod replicas must be set. | | |
| `workerTemplate` _PodTemplateSpec_ | WorkerTemplate defines the template for the worker pods of a role. | | |
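A sketch of a single role with its entry and worker templates; the role name, container names, and image are illustrative assumptions:

```yaml
roles:
  - name: P                    # e.g. a prefill role; a D role would be added for xPyD
    replicas: 1
    workerReplicas: 1
    entryTemplate:
      spec:
        containers:
          - name: entry
            image: example.com/vllm:latest   # illustrative image
    workerTemplate:
      spec:
        containers:
          - name: worker
            image: example.com/vllm:latest
```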

RollingUpdateConfiguration

RollingUpdateConfiguration defines the parameters to be used for RollingUpdateStrategyType.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `maxUnavailable` _IntOrString_ | The maximum number of replicas that can be unavailable during the update. The value can be an absolute number (e.g., 5) or a percentage of the total replicas at the start of the update (e.g., 10%). The absolute number is calculated from the percentage by rounding down. This cannot be 0 if MaxSurge is 0. By default, a fixed value of 1 is used. | 1 | XIntOrString: {} |
| `maxSurge` _IntOrString_ | The maximum number of replicas that can be scheduled above the original number of replicas. The value can be an absolute number (e.g., 5) or a percentage of the total replicas at the start of the update (e.g., 10%). The absolute number is calculated from the percentage by rounding up. By default, a value of 0 is used. | 0 | XIntOrString: {} |
| `partition` _integer_ | Partition indicates the ordinal at which the ModelServing should be partitioned for updates. During a rolling update, all ServingGroups from ordinal Replicas-1 down to Partition are updated, while all ServingGroups from ordinal Partition-1 down to 0 remain untouched. The default value is 0. | | |
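As a worked sketch of the partition semantics: with `replicas: 4` and the fragment below, a rolling update touches ServingGroups at ordinals 3 and 2 and leaves ordinals 1 and 0 on the old version. The field values are illustrative:

```yaml
rolloutStrategy:
  type: ServingGroupRollingUpdate
  rollingUpdateConfiguration:
    maxUnavailable: 1
    maxSurge: 0
    partition: 2   # update ordinals 3..2; ordinals 1..0 remain untouched
```

Lowering `partition` step by step rolls the update out to the remaining groups, which is useful for canary-style rollouts.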

RolloutStrategy

RolloutStrategy defines the strategy that the ModelServing controller will use to perform replica updates.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `type` _RolloutStrategyType_ | Type defines the rollout strategy; it can only be "ServingGroupRollingUpdate" for now. | ServingGroupRollingUpdate | Enum: [ServingGroupRollingUpdate] |
| `rollingUpdateConfiguration` _RollingUpdateConfiguration_ | RollingUpdateConfiguration defines the parameters to be used when the type is RollingUpdateStrategyType. Optional. | | |

RolloutStrategyType

Underlying type: string

Appears in:

| Field | Description |
| --- | --- |
| `ServingGroupRollingUpdate` | ServingGroupRollingUpdate indicates that ServingGroup replicas will be updated one by one. |

SelectPolicyType

Underlying type: string

SelectPolicyType defines the selection strategy type for scaling operations.

Validation:

  • Enum: [Or And]

Appears in:

| Field | Description |
| --- | --- |
| `Or` | |
| `And` | |

ServingGroup

ServingGroup is the smallest unit that completes an inference task.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `restartGracePeriodSeconds` _integer_ | RestartGracePeriodSeconds defines the grace period for the controller to rebuild the ServingGroup when an error occurs. Defaults to 0 (the ServingGroup is rebuilt immediately after an error). | 0 | |
| `gangPolicy` _GangPolicy_ | GangPolicy defines the gang scheduling configuration. | | |
| `networkTopology` _NetworkTopology_ | NetworkTopology defines the network topology affinity scheduling policy for the roles of the ServingGroup. It works only when the scheduler supports network topology-aware scheduling. | | |
| `roles` _Role array_ | | | MaxItems: 4, MinItems: 1 |

SubTarget

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `kind` _string_ | | | |
| `name` _string_ | | | |

Target

Target defines a ModelServing deployment that can be monitored and scaled.

Appears in:

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `targetRef` _ObjectReference_ | TargetRef references the target object to be monitored and scaled. The default target GVK is ModelServing. Currently supported kinds: ModelServing. | | |
| `subTargets` _SubTarget_ | SubTargets defines the sub-target objects to be monitored and scaled. Currently supported kinds: Role, when the TargetRef kind is ModelServing. | | |
| `metricEndpoint` _MetricEndpoint_ | MetricEndpoint defines the configuration for scraping metrics from the target pods. | | |
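A sketch of a target block; it assumes `subTargets` is a list (the field name is plural), and all object and role names are illustrative:

```yaml
target:
  targetRef:
    kind: ModelServing        # currently the only supported kind
    name: demo-serving        # illustrative name
  subTargets:
    - kind: Role              # supported when targetRef is a ModelServing
      name: P                 # illustrative role name
  metricEndpoint:
    uri: /metrics             # the default
    port: 8100                # the default
```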