Version: 0.1.0

API Reference

Packages

workload.serving.volcano.sh/v1alpha1

Resource Types

  • AutoscalingPolicy
  • AutoscalingPolicyBinding
  • AutoscalingPolicyBindingList
  • AutoscalingPolicyList
  • ModelBooster
  • ModelBoosterList
  • ModelServing
  • ModelServingList

AutoscalingPolicy

AutoscalingPolicy is the Schema for the autoscalingpolicies API.

Appears in:

  • AutoscalingPolicyList

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `AutoscalingPolicy` | | |
| `spec` _AutoscalingPolicySpec_ | | | |
| `status` _AutoscalingPolicyStatus_ | | | |

AutoscalingPolicyBehavior

AutoscalingPolicyBehavior defines the scaling behaviors for up and down actions.

Appears in:

  • AutoscalingPolicySpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `scaleUp` _AutoscalingPolicyScaleUpPolicy_ | ScaleUp defines the policy for scaling up (increasing replicas). | | |
| `scaleDown` _AutoscalingPolicyStablePolicy_ | ScaleDown defines the policy for scaling down (decreasing replicas). | | |

AutoscalingPolicyBinding

AutoscalingPolicyBinding is the Schema for the autoscalingpolicybindings API.

Appears in:

  • AutoscalingPolicyBindingList

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `AutoscalingPolicyBinding` | | |
| `spec` _AutoscalingPolicyBindingSpec_ | | | |
| `status` _AutoscalingPolicyBindingStatus_ | | | |

AutoscalingPolicyBindingList

AutoscalingPolicyBindingList contains a list of AutoscalingPolicyBinding.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `AutoscalingPolicyBindingList` | | |
| `items` _AutoscalingPolicyBinding array_ | | | |

AutoscalingPolicyBindingSpec

AutoscalingPolicyBindingSpec defines the desired state of AutoscalingPolicyBinding.

Appears in:

  • AutoscalingPolicyBinding

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `optimizerConfiguration` _OptimizerConfiguration_ | Dynamically adjusts replicas across different ModelServing objects based on overall computing-power requirements; referred to as "optimize" behavior in the code. For example, given two kinds of ModelServing objects backed by heterogeneous hardware with different computing capabilities (e.g., H100/A100), the "optimize" behavior dynamically adjusts the H100/A100 deployment ratio based on real-time computing-power demand, uses integer programming and similar methods to meet computing requirements precisely, and maximizes hardware utilization efficiency. | | |
| `scalingConfiguration` _ScalingConfiguration_ | Adjusts the number of related instances based on specified monitoring metrics and their target values. | | |
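The two configuration styles above can be combined in one binding. The following is a hypothetical manifest sketch built only from the fields documented on this page; the name, match labels, metric endpoint, cost, and replica bounds are all illustrative values, not project defaults:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: AutoscalingPolicyBinding
metadata:
  name: example-binding            # illustrative name
spec:
  # "optimize" behavior: balance replicas across heterogeneous backends
  optimizerConfiguration:
    costExpansionRatePercent: 200
    params:
      - cost: 10                   # illustrative relative cost of this pool
        minReplicas: 1
        maxReplicas: 8
        target:
          additionalMatchLabels:
            backend: h100          # illustrative label
          metricEndpoint:
            uri: /metrics
            port: 8100
  # metric-driven scaling of a single set of instances
  scalingConfiguration:
    minReplicas: 1
    maxReplicas: 16
    target:
      additionalMatchLabels:
        backend: a100              # illustrative label
      metricEndpoint:
        uri: /metrics
        port: 8100
```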

AutoscalingPolicyBindingStatus

AutoscalingPolicyBindingStatus defines the status of an autoscaling policy binding.

Appears in:

  • AutoscalingPolicyBinding

AutoscalingPolicyList

AutoscalingPolicyList contains a list of AutoscalingPolicy.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `AutoscalingPolicyList` | | |
| `items` _AutoscalingPolicy array_ | | | |

AutoscalingPolicyMetric

AutoscalingPolicyMetric defines a metric and its target value for scaling.

Appears in:

  • AutoscalingPolicySpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `metricName` _string_ | MetricName is the name of the metric to monitor. | | |
| `targetValue` _Quantity_ | TargetValue is the target value for the metric to trigger scaling. | | |

AutoscalingPolicyPanicPolicy

AutoscalingPolicyPanicPolicy defines the policy for panic scaling up.

Appears in:

  • AutoscalingPolicyScaleUpPolicy

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `percent` _integer_ | Percent is the maximum percentage of instances to scale up. | 1000 | Minimum: 0; Maximum: 1000 |
| `panicThresholdPercent` _integer_ | PanicThresholdPercent is the threshold percent to enter panic mode. | 200 | Minimum: 110; Maximum: 1000 |

AutoscalingPolicyScaleUpPolicy

Appears in:

  • AutoscalingPolicyBehavior

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `stablePolicy` _AutoscalingPolicyStablePolicy_ | The stable policy makes decisions based on the average metric value over the past few minutes and introduces a scale-down cool-down period/delay. This mechanism is relatively stable: it smooths out short-term fluctuations and avoids overly frequent, unnecessary Pod scaling. | | |
| `panicPolicy` _AutoscalingPolicyPanicPolicy_ | When load surges sharply within a short period (for example, a sudden traffic peak or a burst of computing tasks), calculating the required replica count from a long-window average lags significantly. If the system must scale out quickly to cope with such peaks, ordinary scaling logic may fail to respond in time, resulting in delayed Pod startup, slower service responses or timeouts, and even service paralysis or data backlogs (for workloads such as message queues); the panic policy handles these surges. | | |

AutoscalingPolicySpec

AutoscalingPolicySpec defines the desired state of AutoscalingPolicy.

Appears in:

  • AutoscalingPolicy
  • ModelBackend
  • ModelBoosterSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `tolerancePercent` _integer_ | TolerancePercent is the percentage of deviation tolerated before scaling actions are triggered. With the current number of instances being current_replicas and the expected number inferred from monitoring metrics being target_replicas, the scaling operation is only actually performed when \|current_replicas - target_replicas\| >= current_replicas * TolerancePercent. | 10 | Minimum: 0; Maximum: 100 |
| `metrics` _AutoscalingPolicyMetric array_ | Metrics is the list of metrics used to evaluate scaling decisions. | | MinItems: 1 |
| `behavior` _AutoscalingPolicyBehavior_ | Behavior defines the scaling behavior for both scale up and scale down. | | |
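Putting the spec fields together, a hypothetical AutoscalingPolicy could look like the sketch below. The metric name and target value are illustrative assumptions, not defaults shipped with the project:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: AutoscalingPolicy
metadata:
  name: example-policy
spec:
  tolerancePercent: 10                 # skip scaling on small deviations
  metrics:
    - metricName: num_requests_waiting # illustrative metric name
      targetValue: "5"
  behavior:
    scaleUp:
      stablePolicy:
        instances: 4                   # add at most 4 instances per step
        percent: 100                   # or at most 100% of current replicas
        selectPolicy: Or               # either limit being met permits the step
      panicPolicy:
        panicThresholdPercent: 200     # enter panic mode at 2x the estimate
        percent: 1000
    scaleDown:
      instances: 1
      percent: 100
      selectPolicy: And                # both limits must be met
```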

AutoscalingPolicyStablePolicy

AutoscalingPolicyStablePolicy defines the policy for stable scaling up or scaling down.

Appears in:

  • AutoscalingPolicyBehavior
  • AutoscalingPolicyScaleUpPolicy

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `instances` _integer_ | Instances is the maximum number of instances to scale. | 1 | Minimum: 0 |
| `percent` _integer_ | Percent is the maximum percentage of instances to scale. | 100 | Minimum: 0; Maximum: 1000 |
| `selectPolicy` _SelectPolicyType_ | SelectPolicy determines the selection strategy for scaling (e.g., Or, And). 'Or' means the scaling operation is performed as long as either the Percent requirement or the Instances requirement is met; 'And' means it is performed only when both requirements are met. | Or | Enum: [Or And] |

AutoscalingPolicyStatus

AutoscalingPolicyStatus defines the observed state of AutoscalingPolicy.

Appears in:

  • AutoscalingPolicy

GangPolicy

GangPolicy defines the gang scheduling configuration.

Appears in:

  • ServingGroup

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `minRoleReplicas` _object (keys: string, values: integer)_ | MinRoleReplicas defines the minimum number of replicas required for each role in gang scheduling. This map allows users to specify different minimum replica requirements for different roles. Key: role name; value: minimum number of replicas required for that role. | | |
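As a sketch, a gangPolicy fragment inside a ServingGroup template could look like this; the role names are illustrative and would have to match the roles defined in the group:

```yaml
gangPolicy:
  minRoleReplicas:
    prefill: 1   # role name -> minimum replicas required for gang scheduling
    decode: 2
```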

LoraAdapter

LoraAdapter defines a LoRA (Low-Rank Adaptation) adapter configuration.

Appears in:

  • ModelBackend

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name is the name of the LoRA adapter. | | Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` |
| `artifactURL` _string_ | ArtifactURL is the URL where the LoRA adapter artifact is stored. | | Pattern: `^(hf://\|s3://\|pvc://).+` |
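For illustration, a loraAdapters entry on a ModelBackend might look like the following; the adapter name and URL are hypothetical, but the URL must use one of the hf://, s3://, or pvc:// schemes required by the pattern:

```yaml
loraAdapters:
  - name: example-lora                    # lowercase alphanumerics and dashes
    artifactURL: hf://example-org/example-lora   # hf://, s3://, or pvc://
```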

Metadata

Metadata is a simplified version of ObjectMeta in Kubernetes.

Appears in:

  • PodTemplateSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `labels` _object (keys: string, values: string)_ | Map of string keys and values that can be used to organize and categorize (scope and select) objects. May match selectors of replication controllers and services. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels | | |
| `annotations` _object (keys: string, values: string)_ | Annotations is an unstructured key value map stored with a resource that may be set by external tools to store and retrieve arbitrary metadata. They are not queryable and should be preserved when modifying objects. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations | | |

MetricEndpoint

Appears in:

  • Target

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `uri` _string_ | The metric URI, e.g. /metrics/metrics | | |
| `port` _integer_ | The port of the pods exposing metric endpoints. | 8100 | |

ModelBackend

ModelBackend defines the configuration for a model backend.

Appears in:

  • ModelBoosterSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name is the name of the backend. It must not duplicate any other ModelBackend name in the same ModelBooster CR. Note: updating the name causes the old modelInfer to be deleted and a new modelInfer to be created. | | Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` |
| `type` _ModelBackendType_ | Type is the type of the backend. | | Enum: [vLLM vLLMDisaggregated SGLang MindIE MindIEDisaggregated] |
| `modelURI` _string_ | ModelURI is the URI from which the model is downloaded. Supports hf://, s3://, pvc://. | | Pattern: `^(hf://\|s3://\|pvc://).+` |
| `cacheURI` _string_ | CacheURI is the URI where the downloaded model is stored. Supports hostpath://, pvc://. | | Pattern: `^(hostpath://\|pvc://).+` |
| `minReplicas` _integer_ | MinReplicas is the minimum number of replicas for the backend. | | Minimum: 0; Maximum: 1e+06 |
| `maxReplicas` _integer_ | MaxReplicas is the maximum number of replicas for the backend. | | Minimum: 1; Maximum: 1e+06 |
| `scalingCost` _integer_ | ScalingCost is the cost associated with running this backend. | | Minimum: 0 |
| `routeWeight` _integer_ | RouteWeight specifies the percentage of traffic that should be sent to the target backend. It is used to create the model route. | 100 | Minimum: 0; Maximum: 100 |
| `workers` _ModelWorker array_ | Workers is the list of workers associated with this backend. | | MinItems: 1; MaxItems: 1000 |
| `loraAdapters` _LoraAdapter array_ | LoraAdapters is a list of LoRA adapters. | | |
| `autoscalingPolicy` _AutoscalingPolicySpec_ | AutoscalingPolicy references the autoscaling policy for this backend. | | |
| `schedulerName` _string_ | SchedulerName defines the name of the scheduler used by ModelServing for this backend. | | |
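A hedged sketch of a single backend entry using only the fields listed above; the image, URIs, and engine arguments are illustrative assumptions rather than project defaults:

```yaml
backends:
  - name: vllm-h100                            # unique within the ModelBooster CR
    type: vLLM
    modelURI: hf://example-org/example-model   # hf://, s3://, or pvc://
    cacheURI: pvc://model-cache                # hostpath:// or pvc://
    minReplicas: 1
    maxReplicas: 8
    routeWeight: 100                           # % of traffic routed to this backend
    workers:
      - type: server
        image: example.io/vllm-server:latest   # illustrative image
        replicas: 1
        pods: 1
        config:                                # worker-specific JSON config
          tensor_parallel_size: 2              # illustrative engine argument
```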

ModelBackendStatus

ModelBackendStatus defines the status of a model backend.

Appears in:

  • ModelStatus

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name is the name of the backend. | | |
| `replicas` _integer_ | Replicas is the number of replicas currently running for the backend. | | |

ModelBackendType

Underlying type: string

ModelBackendType defines the type of model backend.

Validation:

  • Enum: [vLLM vLLMDisaggregated SGLang MindIE MindIEDisaggregated]

Appears in:

  • ModelBackend

| Field | Description |
| --- | --- |
| `vLLM` | ModelBackendTypeVLLM represents a vLLM backend. |
| `vLLMDisaggregated` | ModelBackendTypeVLLMDisaggregated represents a disaggregated vLLM backend. |
| `SGLang` | ModelBackendTypeSGLang represents an SGLang backend. |
| `MindIE` | ModelBackendTypeMindIE represents a MindIE backend. |
| `MindIEDisaggregated` | ModelBackendTypeMindIEDisaggregated represents a disaggregated MindIE backend. |

ModelBooster

ModelBooster is the Schema for the models API.

Appears in:

  • ModelBoosterList

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `ModelBooster` | | |
| `spec` _ModelBoosterSpec_ | | | |
| `status` _ModelStatus_ | | | |

ModelBoosterList

ModelBoosterList contains a list of ModelBooster.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `ModelBoosterList` | | |
| `items` _ModelBooster array_ | | | |

ModelBoosterSpec

ModelBoosterSpec defines the desired state of ModelBooster.

Appears in:

  • ModelBooster

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | Name is the name of the model. ModelBooster CR names are restricted by Kubernetes (for example, they can't contain uppercase letters), so this field is used to specify the model name. | | MaxLength: 64; Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` |
| `owner` _string_ | Owner is the owner of the model. | | |
| `backends` _ModelBackend array_ | Backends is the list of model backends associated with this model. A ModelBooster CR has at least one ModelBackend. ModelBackend is the minimum unit of an inference instance; it can be vLLM, SGLang, MindIE, or another type. | | MinItems: 1 |
| `autoscalingPolicy` _AutoscalingPolicySpec_ | AutoscalingPolicy references the autoscaling policy to be used for this model. | | |
| `costExpansionRatePercent` _integer_ | CostExpansionRatePercent is the percentage rate at which the cost expands. | | Minimum: 0; Maximum: 1000 |
| `modelMatch` _ModelMatch_ | ModelMatch defines the predicate used to match LLM inference requests to a given TargetModels. Multiple match conditions are ANDed together, i.e. the match evaluates to true only if all conditions are satisfied. | | |
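A minimal ModelBooster manifest sketch under the constraints above (lowercase name, at least one backend); every concrete value is illustrative:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelBooster
metadata:
  name: example-model              # CR name is Kubernetes-restricted
spec:
  name: example-model              # model name specified via this field
  owner: ml-platform               # illustrative owner
  backends:                        # at least one ModelBackend is required
    - name: vllm-backend
      type: vLLM
      modelURI: hf://example-org/example-model   # illustrative URI
      workers:
        - type: server
          image: example.io/engine:latest        # illustrative image
```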

ModelServing

ModelServing is the Schema for the LLM Serving API.

Appears in:

  • ModelServingList

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `ModelServing` | | |
| `spec` _ModelServingSpec_ | | | |
| `status` _ModelServingStatus_ | | | |

ModelServingList

ModelServingList contains a list of ModelServing.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `workload.serving.volcano.sh/v1alpha1` | | |
| `kind` _string_ | `ModelServingList` | | |
| `items` _ModelServing array_ | | | |

ModelServingSpec

ModelServingSpec defines the specification of the ModelServing resource.

Appears in:

  • ModelServing

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `replicas` _integer_ | Number of ServingGroups, that is, the number of instances that run serving tasks. Defaults to 1. | 1 | |
| `schedulerName` _string_ | SchedulerName defines the name of the scheduler used by ModelServing. | | |
| `template` _ServingGroup_ | Template defines the template for the ServingGroup. | | |
| `rolloutStrategy` _RolloutStrategy_ | RolloutStrategy defines the strategy that will be applied to update replicas. | | |
| `recoveryPolicy` _RecoveryPolicy_ | RecoveryPolicy defines the recovery policy for a failed Pod to be rebuilt. | RoleRecreate | Enum: [ServingGroupRecreate RoleRecreate None] |
| `topologySpreadConstraints` _TopologySpreadConstraint array_ | | | |
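As a sketch, a ModelServing with two roles (a disaggregated prefill/decode layout) might look like the manifest below. Pod-level containers are omitted because PodTemplateSpec is documented here only through its metadata, and all names and labels are illustrative:

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelServing
metadata:
  name: example-serving
spec:
  replicas: 2                      # number of ServingGroups
  recoveryPolicy: RoleRecreate
  template:                        # the ServingGroup template
    gangPolicy:
      minRoleReplicas:
        prefill: 1
        decode: 1
    roles:
      - name: prefill              # unique within the ServingGroup, max 12 chars
        replicas: 1
        workerReplicas: 1
        entryTemplate:
          metadata:
            labels:
              role: prefill
        workerTemplate:
          metadata:
            labels:
              role: prefill-worker
      - name: decode
        replicas: 1
        workerReplicas: 1
        entryTemplate:
          metadata:
            labels:
              role: decode
        workerTemplate:
          metadata:
            labels:
              role: decode-worker
```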

ModelServingStatus

ModelServingStatus defines the observed state of ModelServing

Appears in:

  • ModelServing

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `observedGeneration` _integer_ | ObservedGeneration is the most recent generation observed for this ModelServing. It corresponds to the ModelServing's generation, which is updated on mutation by the API server. | | |
| `replicas` _integer_ | Replicas tracks the total number of ServingGroups that have been created (updated or not, ready or not). | | |
| `currentReplicas` _integer_ | CurrentReplicas is the number of ServingGroups created by the ModelServing controller from the current ModelServing version. | | |
| `updatedReplicas` _integer_ | UpdatedReplicas tracks the number of ServingGroups that have been updated (ready or not). | | |
| `availableReplicas` _integer_ | AvailableReplicas tracks the number of ServingGroups that are in ready state (updated or not). | | |

ModelStatus

ModelStatus defines the observed state of ModelBooster.

Appears in:

  • ModelBooster

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `backendStatuses` _ModelBackendStatus array_ | BackendStatuses contains the status of each backend. | | |
| `observedGeneration` _integer_ | ObservedGeneration keeps track of the most recently observed generation. | | |

ModelWorker

ModelWorker defines the model worker configuration.

Appears in:

  • ModelBackend

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `type` _ModelWorkerType_ | Type is the type of the model worker. | server | Enum: [server prefill decode controller coordinator] |
| `image` _string_ | Image is the container image for the worker. | | |
| `replicas` _integer_ | Replicas is the number of replicas for the worker. | | Minimum: 0; Maximum: 1e+06 |
| `pods` _integer_ | Pods is the number of pods for the worker. | | Minimum: 0; Maximum: 1e+06 |
| `config` _JSON_ | Config contains worker-specific configuration in JSON format. The vLLM options are documented at https://docs.vllm.ai/en/stable/configuration/engine_args.html | | |

ModelWorkerType

Underlying type: string

ModelWorkerType defines the type of model worker.

Validation:

  • Enum: [server prefill decode controller coordinator]

Appears in:

  • ModelWorker

| Field | Description |
| --- | --- |
| `server` | ModelWorkerTypeServer represents a server worker. |
| `prefill` | ModelWorkerTypePrefill represents a prefill worker. |
| `decode` | ModelWorkerTypeDecode represents a decode worker. |
| `controller` | ModelWorkerTypeController represents a controller worker. |
| `coordinator` | ModelWorkerTypeCoordinator represents a coordinator worker. |

OptimizerConfiguration

Appears in:

  • AutoscalingPolicyBindingSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `params` _OptimizerParam array_ | Parameters of the multiple ModelServing groups to be optimized. | | MinItems: 1 |
| `costExpansionRatePercent` _integer_ | CostExpansionRatePercent is the percentage rate at which the cost expands. | 200 | Minimum: 0 |

OptimizerParam

Appears in:

  • OptimizerConfiguration

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `target` _Target_ | The scaling instance configuration. | | |
| `cost` _integer_ | Cost is the cost associated with running this backend. | | Minimum: 0 |
| `minReplicas` _integer_ | MinReplicas is the minimum number of replicas for the backend. | | Minimum: 0; Maximum: 1e+06 |
| `maxReplicas` _integer_ | MaxReplicas is the maximum number of replicas for the backend. | | Minimum: 1; Maximum: 1e+06 |

PodTemplateSpec

PodTemplateSpec describes the data a pod should have when created from a template

Appears in:

  • Role

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `metadata` _Metadata_ | Refer to the Kubernetes API documentation for the fields of metadata. | | |

RecoveryPolicy

Underlying type: string

Appears in:

  • ModelServingSpec

| Field | Description |
| --- | --- |
| `ServingGroupRecreate` | ServingGroupRecreate recreates all the pods in the ServingGroup if (1) any individual pod in the group is recreated, or (2) any container/init-container in a pod is restarted. This ensures all pods/containers in the group are started at the same time. |
| `RoleRecreate` | RoleRecreate recreates all pods in one Role if (1) any individual pod in the group is recreated, or (2) any container/init-container in a pod is restarted. |
| `None` | NoneRestartPolicy follows the same behavior as a default pod or deployment. |

Role

Role defines the specific pod instance role that performs the inference task.

Appears in:

  • ServingGroup

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `name` _string_ | The name of a role. Name must be unique within a ServingGroup. | | MaxLength: 12; Pattern: `^[a-zA-Z0-9]([-a-zA-Z0-9]*[a-zA-Z0-9])?$` |
| `replicas` _integer_ | The number of replicas of a role. For example, in Disaggregated Prefilling, setting the replica count for both the P and D roles to 1 results in a 1P1D deployment configuration; the same approach can be used to configure an xPyD deployment scenario. Defaults to 1. | 1 | |
| `entryTemplate` _PodTemplateSpec_ | EntryTemplate defines the template for the entry pod of a role. Required: currently, a role must have exactly one entry pod. | | |
| `workerReplicas` _integer_ | WorkerReplicas defines the number of worker pods for a role. Required: the number of worker-pod replicas must be set. | | |
| `workerTemplate` _PodTemplateSpec_ | WorkerTemplate defines the template for the worker pods of a role. | | |

RollingUpdateConfiguration

RollingUpdateConfiguration defines the parameters to be used for RollingUpdateStrategyType.

Appears in:

  • RolloutStrategy

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `maxUnavailable` _IntOrString_ | The maximum number of replicas that can be unavailable during the update. The value can be an absolute number (e.g. 5) or a percentage of total replicas at the start of the update (e.g. 10%). The absolute number is calculated from the percentage by rounding down. This cannot be 0 if MaxSurge is 0. Defaults to 1. | 1 | XIntOrString: {} |
| `maxSurge` _IntOrString_ | The maximum number of replicas that can be scheduled above the original number of replicas. The value can be an absolute number (e.g. 5) or a percentage of total replicas at the start of the update (e.g. 10%). The absolute number is calculated from the percentage by rounding up. Defaults to 0. | 0 | XIntOrString: {} |
| `partition` _integer_ | Partition indicates the ordinal at which the ModelServing should be partitioned for updates. During a rolling update, all ServingGroups from ordinal Replicas-1 down to Partition are updated, while all ServingGroups from ordinal Partition-1 down to 0 remain untouched. The default value is 0. | | |
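To make the partition arithmetic concrete: with replicas set to 5 and partition set to 2, ServingGroups at ordinals 4, 3, and 2 are updated, while ordinals 1 and 0 keep the old revision. A sketch of the corresponding fragment (all values illustrative):

```yaml
rolloutStrategy:
  type: ServingGroupRollingUpdate
  rollingUpdateConfiguration:
    maxUnavailable: 1      # absolute number, or a percentage such as "10%"
    maxSurge: 0
    partition: 2           # with 5 replicas: ordinals 4..2 update, 1..0 stay
```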

RolloutStrategy

RolloutStrategy defines the strategy that the ModelServing controller will use to perform replica updates.

Appears in:

  • ModelServingSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `type` _RolloutStrategyType_ | Type defines the rollout strategy; it can only be "ServingGroupRollingUpdate" for now. | ServingGroupRollingUpdate | Enum: [ServingGroupRollingUpdate] |
| `rollingUpdateConfiguration` _RollingUpdateConfiguration_ | RollingUpdateConfiguration defines the parameters to be used when type is RollingUpdateStrategyType. Optional. | | |

RolloutStrategyType

Underlying type: string

Appears in:

  • RolloutStrategy

| Field | Description |
| --- | --- |
| `ServingGroupRollingUpdate` | ServingGroupRollingUpdate indicates that ServingGroup replicas will be updated one by one. |

ScalingConfiguration

Appears in:

  • AutoscalingPolicyBindingSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `target` _Target_ | Target represents the objects to be monitored and scaled. | | |
| `minReplicas` _integer_ | MinReplicas is the minimum number of replicas. | | Minimum: 0; Maximum: 1e+06 |
| `maxReplicas` _integer_ | MaxReplicas is the maximum number of replicas. | | Minimum: 1; Maximum: 1e+06 |

SelectPolicyType

Underlying type: string

SelectPolicyType defines the type of select policy.

Validation:

  • Enum: [Or And]

Appears in:

  • AutoscalingPolicyStablePolicy

| Field | Description |
| --- | --- |
| `Or` | |
| `And` | |

ServingGroup

ServingGroup is the smallest unit to complete the inference task

Appears in:

  • ModelServingSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `restartGracePeriodSeconds` _integer_ | RestartGracePeriodSeconds defines the grace time for the controller to rebuild the ServingGroup when an error occurs. Defaults to 0 (the ServingGroup is rebuilt immediately after an error). | 0 | |
| `gangPolicy` _GangPolicy_ | GangPolicy defines the gang scheduler config. | | |
| `networkTopology` _NetworkTopologySpec_ | NetworkTopology defines the network topology affinity scheduling policy for the roles of the group; it works only when the scheduler supports the network topology feature. Optional. | | |
| `roles` _Role array_ | | | MinItems: 1; MaxItems: 4 |

Target

Appears in:

  • OptimizerParam
  • ScalingConfiguration

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `additionalMatchLabels` _object (keys: string, values: string)_ | AdditionalMatchLabels is the additional labels used to match the target object. | | |
| `metricEndpoint` _MetricEndpoint_ | MetricEndpoint is the metric source. | | |

TopologySpreadConstraint

TopologySpreadConstraint defines the topology spread constraint.

Appears in:

  • ModelServingSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `maxSkew` _integer_ | MaxSkew describes the degree to which ServingGroups may be unevenly distributed. | | |
| `topologyKey` _string_ | TopologyKey is the key of node labels. Nodes that have a label with this key and identical values are considered to be in the same topology. | | |
| `whenUnsatisfiable` _string_ | WhenUnsatisfiable indicates how to deal with a ServingGroup if it doesn't satisfy the spread constraint. | | |
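For illustration, a topologySpreadConstraints entry might look like the fragment below. The allowed whenUnsatisfiable values are not enumerated in this reference; DoNotSchedule is an assumption borrowed from the core/v1 pod API:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # standard node label
    whenUnsatisfiable: DoNotSchedule           # assumed core/v1-style value
```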