Version: v0.1.0

API Reference

Packages

workload.serving.volcano.sh/v1alpha1

Resource Types

  • AutoscalingPolicy
  • AutoscalingPolicyBinding
  • AutoscalingPolicyBindingList
  • AutoscalingPolicyList
  • ModelBooster
  • ModelBoosterList
  • ModelServing
  • ModelServingList

AutoscalingPolicy

AutoscalingPolicy is the Schema for the autoscalingpolicies API.

Appears in:

  • AutoscalingPolicyList

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | AutoscalingPolicy | | |
| spec AutoscalingPolicySpec | | | |
| status AutoscalingPolicyStatus | | | |

AutoscalingPolicyBehavior

AutoscalingPolicyBehavior defines the scaling behaviors for up and down actions.

Appears in:

  • AutoscalingPolicySpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| scaleUp AutoscalingPolicyScaleUpPolicy | ScaleUp defines the policy for scaling up (increasing replicas). | | |
| scaleDown AutoscalingPolicyStablePolicy | ScaleDown defines the policy for scaling down (decreasing replicas). | | |

AutoscalingPolicyBinding

AutoscalingPolicyBinding binds AutoscalingPolicy rules to specific ModelServing deployments, enabling either traditional metric-based scaling or multi-target optimization across heterogeneous hardware deployments.

Appears in:

  • AutoscalingPolicyBindingList

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | AutoscalingPolicyBinding | | |
| spec AutoscalingPolicyBindingSpec | | | |
| status AutoscalingPolicyBindingStatus | | | |

AutoscalingPolicyBindingList

AutoscalingPolicyBindingList contains a list of AutoscalingPolicyBinding objects.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | AutoscalingPolicyBindingList | | |
| items AutoscalingPolicyBinding array | | | |

AutoscalingPolicyBindingSpec

AutoscalingPolicyBindingSpec defines the desired state of AutoscalingPolicyBinding.

Appears in:

  • AutoscalingPolicyBinding

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| policyRef LocalObjectReference | PolicyRef references the AutoscalingPolicy that defines the scaling rules and metrics. | | |
| optimizerConfiguration OptimizerConfiguration | OptimizerConfiguration enables multi-target optimization that dynamically allocates replicas across heterogeneous ModelServing deployments based on overall compute requirements. This is ideal for mixed hardware environments (e.g., H100/A100 clusters) where you want to optimize resource utilization by adjusting deployment ratios between different hardware types using mathematical optimization methods (e.g., integer programming). | | |
| scalingConfiguration ScalingConfiguration | ScalingConfiguration defines traditional autoscaling behavior that adjusts replica counts based on monitoring metrics and target values for a single ModelServing deployment. | | |
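A binding chooses one of the two modes above. The sketch below shows the traditional, single-target form; the object names and replica bounds are illustrative assumptions, not values defined by this API:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: AutoscalingPolicyBinding
metadata:
  name: llama-binding          # hypothetical name
spec:
  policyRef:
    name: llama-policy         # an existing AutoscalingPolicy (hypothetical)
  scalingConfiguration:        # metric-based scaling of a single ModelServing
    target:
      targetRef:
        name: llama-serving    # the ModelServing to monitor and scale (hypothetical)
    minReplicas: 1
    maxReplicas: 8
```

The multi-target form using optimizerConfiguration is sketched under OptimizerParam below.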

AutoscalingPolicyBindingStatus

AutoscalingPolicyBindingStatus defines the observed state of AutoscalingPolicyBinding.

Appears in:

  • AutoscalingPolicyBinding

AutoscalingPolicyList

AutoscalingPolicyList contains a list of AutoscalingPolicy.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | AutoscalingPolicyList | | |
| items AutoscalingPolicy array | | | |

AutoscalingPolicyMetric

AutoscalingPolicyMetric defines a metric and its target value for scaling.

Appears in:

  • AutoscalingPolicySpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| metricName string | MetricName is the name of the metric to monitor. | | |
| targetValue Quantity | TargetValue is the target value for the metric to trigger scaling. | | |

AutoscalingPolicyPanicPolicy

AutoscalingPolicyPanicPolicy defines the policy for panic scaling up.

Appears in:

  • AutoscalingPolicyScaleUpPolicy

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| percent integer | Percent is the maximum percentage of instances to scale up. | 1000 | Maximum: 1000, Minimum: 0 |
| panicThresholdPercent integer | PanicThresholdPercent is the threshold percent at which panic mode is entered. | 200 | Maximum: 1000, Minimum: 110 |
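A minimal sketch of these two fields. The comments reflect one reading of the field descriptions (panic when the metric-derived desired replica count reaches the threshold percentage of the current count), which is an assumption rather than verified controller behavior:

```yaml
panicPolicy:
  percent: 1000               # while in panic mode, scale up by at most 1000%
  panicThresholdPercent: 200  # assumed: panic when desired replicas >= 200% of current
```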

AutoscalingPolicyScaleUpPolicy

Appears in:

  • AutoscalingPolicyBehavior

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| stablePolicy AutoscalingPolicyStablePolicy | The stable policy usually makes decisions based on the average value of metrics calculated over the past few minutes and introduces a cool-down period/delay for scaling down. This mechanism is relatively stable, as it smooths out short-term small fluctuations and avoids overly frequent and unnecessary Pod scaling. | | |
| panicPolicy AutoscalingPolicyPanicPolicy | When the load surges sharply within a short period (for example, a sudden traffic peak or a rush of sudden computing tasks), using the average value over a long time window to calculate the required number of replicas causes significant lag. If the system needs to scale out quickly to cope with such peaks, the ordinary scaling logic may fail to respond in time, resulting in delayed Pod startup, slower service response times or timeouts, and possibly even service paralysis or data backlogs (for workloads such as message queues). PanicPolicy defines the fast-reacting scale-up behavior used in such situations. | | |

AutoscalingPolicySpec

AutoscalingPolicySpec defines the desired state of AutoscalingPolicy.

Appears in:

  • AutoscalingPolicy
  • ModelBackend
  • ModelBoosterSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| tolerancePercent integer | TolerancePercent is the percentage of deviation tolerated before scaling actions are triggered. If the current number of instances is current_replicas and the expected number of instances inferred from monitoring metrics is target_replicas, the scaling operation is only actually performed when \|current_replicas - target_replicas\| >= current_replicas * TolerancePercent / 100. | 10 | Maximum: 100, Minimum: 0 |
| metrics AutoscalingPolicyMetric array | Metrics is the list of metrics used to evaluate scaling decisions. | | MinItems: 1 |
| behavior AutoscalingPolicyBehavior | Behavior defines the scaling behavior for both scale up and scale down. | | |
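Putting the pieces together, a complete AutoscalingPolicy might look like the sketch below. The metric name and all numeric values are illustrative assumptions; with tolerancePercent: 10 and 10 current replicas, scaling only triggers once the desired count differs from the current count by at least 10 * 10 / 100 = 1 replica:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: AutoscalingPolicy
metadata:
  name: llama-policy                          # hypothetical name
spec:
  tolerancePercent: 10
  metrics:
    - metricName: vllm:num_requests_waiting   # assumed metric name, not prescribed here
      targetValue: "5"
  behavior:
    scaleUp:
      stablePolicy:
        instances: 4            # step bounded by instances and/or percent
        percent: 100
        selectPolicy: Or
      panicPolicy:
        percent: 1000
        panicThresholdPercent: 200
    scaleDown:                  # scaleDown is itself a stable policy
      instances: 2
      percent: 50
      selectPolicy: And
```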

AutoscalingPolicyStablePolicy

AutoscalingPolicyStablePolicy defines the policy for stable scaling up or scaling down.

Appears in:

  • AutoscalingPolicyBehavior
  • AutoscalingPolicyScaleUpPolicy

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| instances integer | Instances is the maximum number of instances to scale by. | 1 | Minimum: 0 |
| percent integer | Percent is the maximum percentage of instances to scale by. | 100 | Maximum: 1000, Minimum: 0 |
| selectPolicy SelectPolicyType | SelectPolicy determines the selection strategy for scaling (e.g., Or, And). 'Or' means the scaling operation is performed as long as either the Percent requirement or the Instances requirement is met. 'And' means the scaling operation is performed only when both the Percent requirement and the Instances requirement are met. | Or | Enum: [Or And] |

AutoscalingPolicyStatus

AutoscalingPolicyStatus defines the observed state of AutoscalingPolicy.

Appears in:

  • AutoscalingPolicy

GangPolicy

GangPolicy defines the gang scheduling configuration.

Appears in:

  • ServingGroup

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| minRoleReplicas object (keys:string, values:integer) | MinRoleReplicas defines the minimum number of replicas required for each role in gang scheduling. This map allows users to specify different minimum replica requirements for different roles. Key: role name. Value: minimum number of replicas required for that role. | | |
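A minimal sketch; the role names are illustrative and would have to match roles defined in the ServingGroup:

```yaml
gangPolicy:
  minRoleReplicas:   # role name -> minimum replicas required for gang scheduling
    prefill: 1       # hypothetical role names
    decode: 2
```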

LoraAdapter

LoraAdapter defines a LoRA (Low-Rank Adaptation) adapter configuration.

Appears in:

  • ModelBackend

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| name string | Name is the name of the LoRA adapter. | | Pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$ |
| artifactURL string | ArtifactURL is the URL where the LoRA adapter artifact is stored. | | Pattern: ^(hf://\|s3://\|pvc://).+ |
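As it would appear in a ModelBackend's loraAdapters list; the adapter name and artifact location are made-up examples:

```yaml
loraAdapters:
  - name: sql-adapter                       # hypothetical adapter name
    artifactURL: hf://example-org/sql-lora  # hf://, s3://, or pvc:// per the pattern
```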

Metadata

Metadata is a simplified version of ObjectMeta in Kubernetes.

Appears in:

  • PodTemplateSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| labels object (keys:string, values:string) | Map of string keys and values that can be used to organize and categorize (scope and select) objects. May match selectors of replication controllers and services. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels | | |
| annotations object (keys:string, values:string) | Annotations is an unstructured key value map stored with a resource that may be set by external tools to store and retrieve arbitrary metadata. They are not queryable and should be preserved when modifying objects. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations | | |

MetricEndpoint

MetricEndpoint defines the endpoint configuration for scraping metrics from pods.

Appears in:

  • Target

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| uri string | URI is the path where metrics are exposed (e.g., "/metrics"). | /metrics | |
| port integer | Port is the network port where metrics are exposed by the pods. | 8100 | |

ModelBackend

ModelBackend defines the configuration for a model backend.

Appears in:

  • ModelBoosterSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| name string | Name is the name of the backend. It must not duplicate the name of any other ModelBackend in the same ModelBooster CR. Note: updating the name causes the old modelInfer to be deleted and a new one to be created. | | Pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$ |
| type ModelBackendType | Type is the type of the backend. | | Enum: [vLLM vLLMDisaggregated SGLang MindIE MindIEDisaggregated] |
| modelURI string | ModelURI is the URI from which the model is downloaded. Supported schemes: hf://, s3://, pvc://. | | Pattern: ^(hf://\|s3://\|pvc://).+ |
| cacheURI string | CacheURI is the URI where the downloaded model is stored. Supported schemes: hostpath://, pvc://. | | Pattern: ^(hostpath://\|pvc://).+ |
| envFrom EnvFromSource array | List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source takes precedence. Values defined by an Env with a duplicate key take precedence. Cannot be updated. | | |
| env EnvVar array | List of environment variables to set in the container. Supported names: "ENDPOINT" (must be specified when downloading the model from S3); "RUNTIME_URL" (default http://localhost:8000); "RUNTIME_PORT" (default 8100); "RUNTIME_METRICS_PATH" (default /metrics); "HF_ENDPOINT" (the Hugging Face URL, default https://huggingface.co/). Cannot be updated. | | |
| minReplicas integer | MinReplicas is the minimum number of replicas for the backend. | | Maximum: 1e+06, Minimum: 0 |
| maxReplicas integer | MaxReplicas is the maximum number of replicas for the backend. | | Maximum: 1e+06, Minimum: 1 |
| scalingCost integer | ScalingCost is the cost associated with running this backend. | | Minimum: 0 |
| routeWeight integer | RouteWeight specifies the percentage of traffic that should be sent to the target backend. It is used to create the model route. | 100 | Maximum: 100, Minimum: 0 |
| workers ModelWorker array | Workers is the list of workers associated with this backend. | | MaxItems: 1000, MinItems: 1 |
| loraAdapters LoraAdapter array | LoraAdapters is a list of LoRA adapters. | | |
| autoscalingPolicy AutoscalingPolicySpec | AutoscalingPolicy specifies the autoscaling policy for this backend. | | |
| schedulerName string | SchedulerName defines the name of the scheduler used by ModelServing for this backend. | | |

ModelBackendStatus

ModelBackendStatus defines the status of a model backend.

Appears in:

  • ModelStatus

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| name string | Name is the name of the backend. | | |
| replicas integer | Replicas is the number of replicas currently running for the backend. | | |

ModelBackendType

Underlying type: string

ModelBackendType defines the type of model backend.

Validation:

  • Enum: [vLLM vLLMDisaggregated SGLang MindIE MindIEDisaggregated]

Appears in:

  • ModelBackend

| Field | Description |
| --- | --- |
| vLLM | ModelBackendTypeVLLM represents a vLLM backend. |
| vLLMDisaggregated | ModelBackendTypeVLLMDisaggregated represents a disaggregated vLLM backend. |
| SGLang | ModelBackendTypeSGLang represents an SGLang backend. |
| MindIE | ModelBackendTypeMindIE represents a MindIE backend. |
| MindIEDisaggregated | ModelBackendTypeMindIEDisaggregated represents a disaggregated MindIE backend. |

ModelBooster

ModelBooster is the Schema for the models API.

Appears in:

  • ModelBoosterList

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelBooster | | |
| spec ModelBoosterSpec | | | |
| status ModelStatus | | | |

ModelBoosterList

ModelBoosterList contains a list of ModelBooster.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelBoosterList | | |
| items ModelBooster array | | | |

ModelBoosterSpec

ModelBoosterSpec defines the desired state of ModelBooster.

Appears in:

  • ModelBooster

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| name string | Name is the name of the model. The ModelBooster CR name is restricted by Kubernetes (for example, it can't contain uppercase letters), so this field is used to specify the model name. | | MaxLength: 64, Pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$ |
| owner string | Owner is the owner of the model. | | |
| backends ModelBackend array | Backends is the list of model backends associated with this model. A ModelBooster CR has at least one ModelBackend. ModelBackend is the minimum unit of inference instance; it can be vLLM, SGLang, MindIE, or another type. | | MinItems: 1 |
| autoscalingPolicy AutoscalingPolicySpec | AutoscalingPolicy references the autoscaling policy to be used for this model. | | |
| costExpansionRatePercent integer | CostExpansionRatePercent is the percentage rate at which the cost expands. | | Maximum: 1000, Minimum: 0 |
| modelMatch ModelMatch | ModelMatch defines the predicate used to match LLM inference requests to a given TargetModels. Multiple match conditions are ANDed together, i.e. the match evaluates to true only if all conditions are satisfied. | | |
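A minimal end-to-end sketch of a ModelBooster with a single vLLM backend. The CR name, owner, model, image, and resource values are all illustrative assumptions:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelBooster
metadata:
  name: qwen2-7b                              # hypothetical CR name (lowercase only)
spec:
  name: qwen2-7b                              # model name
  owner: ml-platform                          # hypothetical owner
  backends:
    - name: vllm-backend                      # unique within this CR
      type: vLLM
      modelURI: hf://Qwen/Qwen2-7B-Instruct   # hf://, s3://, or pvc://
      cacheURI: pvc://model-cache             # hostpath:// or pvc://
      minReplicas: 1
      maxReplicas: 4
      workers:
        - type: server
          image: vllm/vllm-openai:latest      # illustrative image
          replicas: 1
          resources:
            limits:
              nvidia.com/gpu: "1"
```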

ModelServing

ModelServing is the Schema for the LLM Serving API.

Appears in:

  • ModelServingList

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelServing | | |
| spec ModelServingSpec | | | |
| status ModelServingStatus | | | |

ModelServingList

ModelServingList contains a list of ModelServing.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | workload.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelServingList | | |
| items ModelServing array | | | |

ModelServingSpec

ModelServingSpec defines the specification of the ModelServing resource.

Appears in:

  • ModelServing

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| replicas integer | Number of ServingGroups; that is, the number of instances that run serving tasks. Defaults to 1. | 1 | |
| schedulerName string | SchedulerName defines the name of the scheduler used by ModelServing. | volcano | |
| template ServingGroup | Template defines the template for the ServingGroup. | | |
| rolloutStrategy RolloutStrategy | RolloutStrategy defines the strategy that will be applied to update replicas. | | |
| recoveryPolicy RecoveryPolicy | RecoveryPolicy defines the recovery policy for the failed Pod to be rebuilt. | RoleRecreate | Enum: [ServingGroupRecreate RoleRecreate None] |
| topologySpreadConstraints TopologySpreadConstraint array | | | |
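A sketch of a ModelServing with a 1P1D-style ServingGroup. Names, images, and the choice of roles are illustrative assumptions; the worker-pod templates are omitted for brevity by setting workerReplicas to 0:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelServing
metadata:
  name: llama-serving                # hypothetical name
spec:
  replicas: 2                        # two ServingGroups
  schedulerName: volcano
  recoveryPolicy: RoleRecreate
  template:                          # ServingGroup template
    roles:
      - name: prefill                # 1P...
        replicas: 1
        workerReplicas: 0            # no worker pods in this sketch
        entryTemplate:
          spec:
            containers:
              - name: runtime
                image: vllm/vllm-openai:latest   # illustrative image
      - name: decode                 # ...1D
        replicas: 1
        workerReplicas: 0
        entryTemplate:
          spec:
            containers:
              - name: runtime
                image: vllm/vllm-openai:latest
```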

ModelServingStatus

ModelServingStatus defines the observed state of ModelServing.

Appears in:

  • ModelServing

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| observedGeneration integer | observedGeneration is the most recent generation observed for the ModelServing. It corresponds to the ModelServing's generation, which is updated on mutation by the API Server. | | |
| replicas integer | Replicas tracks the total number of ServingGroups that have been created (updated or not, ready or not). | | |
| currentReplicas integer | CurrentReplicas is the number of ServingGroups created by the ModelServing controller from the current ModelServing version. | | |
| updatedReplicas integer | UpdatedReplicas tracks the number of ServingGroups that have been updated (ready or not). | | |
| availableReplicas integer | AvailableReplicas tracks the number of ServingGroups that are in a ready state (updated or not). | | |

ModelStatus

ModelStatus defines the observed state of ModelBooster.

Appears in:

  • ModelBooster

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| backendStatuses ModelBackendStatus array | BackendStatuses contains the status of each backend. | | |
| observedGeneration integer | ObservedGeneration keeps track of the most recently observed generation. | | |

ModelWorker

ModelWorker defines the model worker configuration.

Appears in:

  • ModelBackend

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| type ModelWorkerType | Type is the type of the model worker. | server | Enum: [server prefill decode controller coordinator] |
| image string | Image is the container image for the worker. | | |
| replicas integer | Replicas is the number of replicas for the worker. | | Maximum: 1e+06, Minimum: 0 |
| pods integer | Pods is the number of pods for the worker. | | Maximum: 1e+06, Minimum: 0 |
| resources ResourceRequirements | Resources specifies the resource requirements for the worker. | | |
| affinity Affinity | Affinity specifies the affinity rules for scheduling the worker pods. | | |
| config JSON | Config contains worker-specific configuration in JSON format. The vLLM configuration options are documented at https://docs.vllm.ai/en/stable/configuration/engine_args.html | | |

ModelWorkerType

Underlying type: string

ModelWorkerType defines the type of model worker.

Validation:

  • Enum: [server prefill decode controller coordinator]

Appears in:

  • ModelWorker

| Field | Description |
| --- | --- |
| server | ModelWorkerTypeServer represents a server worker. |
| prefill | ModelWorkerTypePrefill represents a prefill worker. |
| decode | ModelWorkerTypeDecode represents a decode worker. |
| controller | ModelWorkerTypeController represents a controller worker. |
| coordinator | ModelWorkerTypeCoordinator represents a coordinator worker. |

OptimizerConfiguration

OptimizerConfiguration defines parameters for multi-target optimization across multiple ModelServing deployments with different hardware characteristics.

Appears in:

  • AutoscalingPolicyBindingSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| params OptimizerParam array | Params contains the optimization parameters for each ModelServing group. Each entry defines a different deployment type (e.g., different hardware) to optimize. | | MinItems: 1 |
| costExpansionRatePercent integer | CostExpansionRatePercent defines the acceptable cost expansion percentage when optimizing across multiple deployment types. A higher value allows more flexibility in resource allocation but may increase overall costs. | 200 | Minimum: 0 |

OptimizerParam

OptimizerParam defines optimization parameters for a specific ModelServing deployment type.

Appears in:

  • OptimizerConfiguration

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| target Target | Target specifies the ModelServing deployment and its monitoring configuration. | | |
| cost integer | Cost represents the relative cost factor for this deployment type. Used in optimization calculations to balance performance vs. cost. | | Minimum: 0 |
| minReplicas integer | MinReplicas is the minimum number of replicas to maintain for this deployment type. | | Maximum: 1e+06, Minimum: 0 |
| maxReplicas integer | MaxReplicas is the maximum number of replicas allowed for this deployment type. | | Maximum: 1e+06, Minimum: 1 |
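For the multi-target mode, a binding's optimizerConfiguration lists one OptimizerParam per deployment type. In this sketch the two ModelServing names and the cost factors are illustrative assumptions for an H100/A100 mixed fleet:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: AutoscalingPolicyBinding
metadata:
  name: mixed-fleet-binding      # hypothetical name
spec:
  policyRef:
    name: llama-policy           # hypothetical AutoscalingPolicy
  optimizerConfiguration:
    costExpansionRatePercent: 200
    params:
      - target:
          targetRef:
            name: llama-h100     # ModelServing on H100 nodes (illustrative)
        cost: 5                  # relative cost factor
        minReplicas: 1
        maxReplicas: 8
      - target:
          targetRef:
            name: llama-a100     # ModelServing on A100 nodes (illustrative)
        cost: 3
        minReplicas: 0
        maxReplicas: 16
```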

PodTemplateSpec

PodTemplateSpec describes the data a pod should have when created from a template.

Appears in:

  • Role

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| metadata Metadata | Refer to the Kubernetes API documentation for the fields of metadata. | | |
| spec PodSpec | Specification of the desired behavior of the pod. | | |

RecoveryPolicy

Underlying type: string

Appears in:

  • ModelServingSpec

| Field | Description |
| --- | --- |
| ServingGroupRecreate | ServingGroupRecreate recreates all the pods in the ServingGroup if (1) any individual pod in the group is recreated, or (2) any container/init-container in a pod is restarted. This ensures that all pods/containers in the group are started at the same time. |
| RoleRecreate | RoleRecreate recreates all pods in one Role if (1) any individual pod in the group is recreated, or (2) any container/init-container in a pod is restarted. |
| None | NoneRestartPolicy follows the same behavior as the default pod or deployment. |

Role

Role defines the specific pod instance role that performs the inference task.

Appears in:

  • ServingGroup

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| name string | The name of a role. The name must be unique within a ServingGroup. | | MaxLength: 12, Pattern: ^[a-zA-Z0-9]([-a-zA-Z0-9]*[a-zA-Z0-9])?$ |
| replicas integer | The number of replicas of a certain role. For example, in Disaggregated Prefilling, setting the replica count of both the P and D roles to 1 results in a 1P1D deployment configuration. This approach can similarly be applied to configure an xPyD deployment scenario. Defaults to 1. | 1 | |
| entryTemplate PodTemplateSpec | EntryTemplate defines the template for the entry pod of a role. Required: currently, a role must have exactly one entry pod. | | |
| workerReplicas integer | WorkerReplicas defines the number of worker pods of a role. Required: the number of worker-pod replicas must be set. | | |
| workerTemplate PodTemplateSpec | WorkerTemplate defines the template for the worker pods of a role. | | |

RollingUpdateConfiguration

RollingUpdateConfiguration defines the parameters to be used for RollingUpdateStrategyType.

Appears in:

  • RolloutStrategy

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| maxUnavailable IntOrString | The maximum number of replicas that can be unavailable during the update. The value can be an absolute number (e.g., 5) or a percentage of total replicas at the start of the update (e.g., 10%). An absolute number is calculated from a percentage by rounding down. This cannot be 0 if MaxSurge is 0. By default, a fixed value of 1 is used. | 1 | XIntOrString: {} |
| maxSurge IntOrString | The maximum number of replicas that can be scheduled above the original number of replicas. The value can be an absolute number (e.g., 5) or a percentage of total replicas at the start of the update (e.g., 10%). An absolute number is calculated from a percentage by rounding up. By default, a value of 0 is used. | 0 | XIntOrString: {} |
| partition integer | Partition indicates the ordinal at which the ModelServing should be partitioned for updates. During a rolling update, all ServingGroups from ordinal Replicas-1 down to Partition are updated, while all ServingGroups from ordinal Partition-1 down to 0 remain untouched. The default value is 0. | | |
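A sketch of how these parameters sit under a ModelServing's rolloutStrategy; the numbers are illustrative. With replicas: 5 and partition: 2, ServingGroups 4, 3, and 2 are updated while 1 and 0 stay on the old revision:

```yaml
rolloutStrategy:
  type: ServingGroupRollingUpdate
  rollingUpdateConfiguration:
    maxUnavailable: 1    # absolute number, or a percentage such as "10%" (rounded down)
    maxSurge: 0          # absolute number, or a percentage (rounded up)
    partition: 2         # ordinals >= 2 are updated; ordinals 1..0 are untouched
```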

RolloutStrategy

RolloutStrategy defines the strategy that the ModelServing controller will use to perform replica updates.

Appears in:

  • ModelServingSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| type RolloutStrategyType | Type defines the rollout strategy; it can only be "ServingGroupRollingUpdate" for now. | ServingGroupRollingUpdate | Enum: [ServingGroupRollingUpdate] |
| rollingUpdateConfiguration RollingUpdateConfiguration | RollingUpdateConfiguration defines the parameters to be used when type is RollingUpdateStrategyType. Optional. | | |

RolloutStrategyType

Underlying type: string

Appears in:

  • RolloutStrategy

| Field | Description |
| --- | --- |
| ServingGroupRollingUpdate | ServingGroupRollingUpdate indicates that ServingGroup replicas will be updated one by one. |

ScalingConfiguration

ScalingConfiguration defines the scaling parameters for a single target deployment.

Appears in:

  • AutoscalingPolicyBindingSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| target Target | Target specifies the ModelServing deployment to monitor and scale. | | |
| minReplicas integer | MinReplicas is the minimum number of replicas to maintain. | | Maximum: 1e+06, Minimum: 0 |
| maxReplicas integer | MaxReplicas is the maximum number of replicas allowed. | | Maximum: 1e+06, Minimum: 1 |

SelectPolicyType

Underlying type: string

SelectPolicyType defines the type of select policy.

Validation:

  • Enum: [Or And]

Appears in:

  • AutoscalingPolicyStablePolicy

| Field | Description |
| --- | --- |
| Or | |
| And | |

ServingGroup

ServingGroup is the smallest unit that completes the inference task.

Appears in:

  • ModelServingSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| restartGracePeriodSeconds integer | RestartGracePeriodSeconds defines the grace time for the controller to rebuild the ServingGroup when an error occurs. Defaults to 0 (the ServingGroup is rebuilt immediately after an error). | 0 | |
| gangPolicy GangPolicy | GangPolicy defines the gang scheduler config. | | |
| networkTopology NetworkTopologySpec | NetworkTopology defines the network topology affinity scheduling policy for the roles of the group; it works only when the scheduler supports the network topology feature. | | |
| roles Role array | | | MaxItems: 4, MinItems: 1 |

Target

Target defines a ModelServing deployment that can be monitored and scaled.

Appears in:

  • OptimizerParam
  • ScalingConfiguration

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| targetRef ObjectReference | TargetRef references the ModelServing object to monitor and scale. | | |
| additionalMatchLabels object (keys:string, values:string) | AdditionalMatchLabels provides additional label selectors to refine which pods within the ModelServing deployment should be monitored. | | |
| metricEndpoint MetricEndpoint | MetricEndpoint configures how to scrape metrics from the target pods. If not specified, defaults to port 8100 and path "/metrics". | | |
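A fully specified target might look like this; the ModelServing name and the label used to narrow the pod set are illustrative assumptions:

```yaml
target:
  targetRef:
    name: llama-serving    # ModelServing to monitor (hypothetical)
  additionalMatchLabels:
    role: decode           # scrape only a subset of the deployment's pods (illustrative)
  metricEndpoint:
    uri: /metrics          # defaults shown explicitly
    port: 8100
```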

TopologySpreadConstraint

TopologySpreadConstraint defines the topology spread constraint.

Appears in:

  • ModelServingSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| maxSkew integer | MaxSkew describes the degree to which ServingGroups may be unevenly distributed. | | |
| topologyKey string | TopologyKey is the key of node labels. Nodes that have a label with this key and identical values are considered to be in the same topology. | | |
| whenUnsatisfiable string | WhenUnsatisfiable indicates how to deal with a ServingGroup if it doesn't satisfy the spread constraint. | | |
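A sketch under ModelServing's topologySpreadConstraints. Since whenUnsatisfiable is a plain string here, the DoNotSchedule value below is an assumption borrowed from the core/v1 topology spread constraint rather than an enum documented above:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # standard node label for zones
    whenUnsatisfiable: DoNotSchedule           # assumed to mirror core/v1 semantics
```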