API Reference
Packages
workload.serving.volcano.sh/v1alpha1
Resource Types
- AutoscalingPolicy
- AutoscalingPolicyBinding
- AutoscalingPolicyBindingList
- AutoscalingPolicyList
- ModelBooster
- ModelBoosterList
- ModelServing
- ModelServingList
AutoscalingPolicy
AutoscalingPolicy is the Schema for the autoscalingpolicies API.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
apiVersion string | workload.serving.volcano.sh/v1alpha1 | ||
kind string | AutoscalingPolicy | ||
spec AutoscalingPolicySpec | |||
status AutoscalingPolicyStatus |
AutoscalingPolicyBehavior
AutoscalingPolicyBehavior defines the scaling behaviors for up and down actions.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
scaleUp AutoscalingPolicyScaleUpPolicy | ScaleUp defines the policy for scaling up (increasing replicas). | ||
scaleDown AutoscalingPolicyStablePolicy | ScaleDown defines the policy for scaling down (decreasing replicas). |
AutoscalingPolicyBinding
AutoscalingPolicyBinding is the Schema for the autoscalingpolicybindings API.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
apiVersion string | workload.serving.volcano.sh/v1alpha1 | ||
kind string | AutoscalingPolicyBinding | ||
spec AutoscalingPolicyBindingSpec | |||
status AutoscalingPolicyBindingStatus |
AutoscalingPolicyBindingList
AutoscalingPolicyBindingList contains a list of AutoscalingPolicyBinding.
| Field | Description | Default | Validation |
|---|---|---|---|
apiVersion string | workload.serving.volcano.sh/v1alpha1 | ||
kind string | AutoscalingPolicyBindingList | ||
items AutoscalingPolicyBinding array |
AutoscalingPolicyBindingSpec
AutoscalingPolicyBindingSpec defines the desired state of AutoscalingPolicyBinding.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
optimizerConfiguration OptimizerConfiguration | OptimizerConfiguration dynamically adjusts replicas across different ModelServing objects based on overall computing power requirements; the code refers to this as the "optimize" behavior. For example, given two types of ModelServing objects backed by heterogeneous hardware with different computing capabilities (e.g., H100/A100), the "optimize" behavior aims to: (1) dynamically adjust the deployment ratio of H100/A100 instances based on real-time computing power demands; (2) use integer programming and similar methods to precisely meet computing requirements; (3) maximize hardware utilization efficiency. | ||
scalingConfiguration ScalingConfiguration | Adjust the number of related instances based on specified monitoring metrics and their target values. |
AutoscalingPolicyBindingStatus
AutoscalingPolicyBindingStatus defines the status of an autoscaling policy binding.
Appears in:
AutoscalingPolicyList
AutoscalingPolicyList contains a list of AutoscalingPolicy.
| Field | Description | Default | Validation |
|---|---|---|---|
apiVersion string | workload.serving.volcano.sh/v1alpha1 | ||
kind string | AutoscalingPolicyList | ||
items AutoscalingPolicy array |
AutoscalingPolicyMetric
AutoscalingPolicyMetric defines a metric and its target value for scaling.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
metricName string | MetricName is the name of the metric to monitor. | ||
targetValue Quantity | TargetValue is the target value for the metric to trigger scaling. |
AutoscalingPolicyPanicPolicy
AutoscalingPolicyPanicPolicy defines the policy for panic scaling up.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
percent integer | Percent is the maximum percentage of instances to scale up. | 1000 | Maximum: 1000 Minimum: 0 |
panicThresholdPercent integer | PanicThresholdPercent is the threshold percent to enter panic mode. | 200 | Maximum: 1000 Minimum: 110 |
AutoscalingPolicyScaleUpPolicy
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
stablePolicy AutoscalingPolicyStablePolicy | Stable policy usually makes decisions based on the average value of metrics calculated over the past few minutes and introduces a scaling-down cool-down period/delay. This mechanism is relatively stable, as it can smooth out short-term small fluctuations and avoid overly frequent and unnecessary Pod scaling. | ||
panicPolicy AutoscalingPolicyPanicPolicy | When the load surges sharply within a short period (for example, encountering a sudden traffic peak or a rush of sudden computing tasks), using the average value over a long time window to calculate the required number of replicas will cause significant lag. If the system needs to scale out quickly to cope with such peaks, the ordinary scaling logic may fail to respond in time, resulting in delayed Pod startup, slower service response time or timeouts, and may even lead to service paralysis or data backlogs (for workloads such as message queues). |
AutoscalingPolicySpec
AutoscalingPolicySpec defines the desired state of AutoscalingPolicy.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
tolerancePercent integer | TolerancePercent is the percentage of deviation tolerated before scaling actions are triggered. Let current_replicas be the current number of instances and target_replicas the expected number of instances inferred from monitoring metrics. Scaling is performed only when \|current_replicas - target_replicas\| >= current_replicas * TolerancePercent. | 10 | Maximum: 100 Minimum: 0 |
metrics AutoscalingPolicyMetric array | Metrics is the list of metrics used to evaluate scaling decisions. | | MinItems: 1 |
behavior AutoscalingPolicyBehavior | Behavior defines the scaling behavior for both scale up and scale down. |
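
The tolerance check described for tolerancePercent can be sketched as follows. This is a minimal illustration, not the controller's implementation: the function name `should_scale` is hypothetical, and it assumes TolerancePercent is applied as a percentage (10 means 10% of current_replicas).

```python
def should_scale(current_replicas: int, target_replicas: int,
                 tolerance_percent: int = 10) -> bool:
    """Return True when the deviation reaches the tolerance band."""
    if not 0 <= tolerance_percent <= 100:
        raise ValueError("tolerance_percent must be within [0, 100]")
    deviation = abs(current_replicas - target_replicas)
    # Assumption: TolerancePercent is a percentage of current_replicas.
    # Both sides are multiplied by 100 to stay in integer arithmetic.
    return deviation * 100 >= current_replicas * tolerance_percent


if __name__ == "__main__":
    print(should_scale(10, 11))  # deviation 1 vs. band 1
    print(should_scale(20, 21))  # deviation 1 vs. band 2
```

With the default of 10, a workload at 20 replicas is only rescaled once the inferred target differs by at least 2 replicas.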
AutoscalingPolicyStablePolicy
AutoscalingPolicyStablePolicy defines the policy for stable scaling up or scaling down.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
instances integer | Instances is the maximum number of instances to scale. | 1 | Minimum: 0 |
percent integer | Percent is the maximum percentage of instances to scale. | 100 | Maximum: 1000 Minimum: 0 |
selectPolicy SelectPolicyType | SelectPolicy determines the selection strategy for scaling (Or, And). 'Or' means the scaling operation is performed as long as either the Percent requirement or the Instances requirement is met. 'And' means the scaling operation is performed only when both the Percent requirement and the Instances requirement are met. | Or | Enum: [Or And] |
AutoscalingPolicyStatus
AutoscalingPolicyStatus defines the observed state of AutoscalingPolicy.
Appears in:
GangPolicy
GangPolicy defines the gang scheduling configuration.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
minRoleReplicas object (keys:string, values:integer) | MinRoleReplicas defines the minimum number of replicas required for each role in gang scheduling. This map allows users to specify different minimum replica requirements for different roles. Key: role name Value: minimum number of replicas required for that role |
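
To illustrate the shape of the minRoleReplicas map, here is a small sketch of a readiness check against it. The function name `gang_satisfied` and the role names are hypothetical, and how the scheduler actually enforces the minimums is not described in this reference.

```python
def gang_satisfied(min_role_replicas: dict[str, int],
                   available: dict[str, int]) -> bool:
    """True when every role listed in min_role_replicas has enough replicas.

    Roles missing from `available` are treated as having zero replicas.
    """
    return all(
        available.get(role, 0) >= minimum
        for role, minimum in min_role_replicas.items()
    )


if __name__ == "__main__":
    # Hypothetical roles: require at least 1 prefill and 2 decode replicas.
    minimums = {"prefill": 1, "decode": 2}
    print(gang_satisfied(minimums, {"prefill": 1, "decode": 2}))
    print(gang_satisfied(minimums, {"prefill": 1, "decode": 1}))
```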
LoraAdapter
LoraAdapter defines a LoRA (Low-Rank Adaptation) adapter configuration.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
name string | Name is the name of the LoRA adapter. | | Pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$ |
artifactURL string | ArtifactURL is the URL where the LoRA adapter artifact is stored. | | Pattern: ^(hf://\|s3://\|pvc://).+ |
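
The two validation patterns above can be checked client-side before submitting a resource. The sketch below uses the patterns exactly as listed; the helper name `validate_lora_adapter` and the sample values are hypothetical.

```python
import re

# Patterns copied from the Validation column of LoraAdapter.
NAME_PATTERN = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")
ARTIFACT_URL_PATTERN = re.compile(r"^(hf://|s3://|pvc://).+")


def validate_lora_adapter(adapter: dict) -> list[str]:
    """Return a list of validation errors (empty when the adapter is valid)."""
    errors = []
    if not NAME_PATTERN.match(adapter.get("name", "")):
        errors.append("name must match " + NAME_PATTERN.pattern)
    if not ARTIFACT_URL_PATTERN.match(adapter.get("artifactURL", "")):
        errors.append("artifactURL must match " + ARTIFACT_URL_PATTERN.pattern)
    return errors


if __name__ == "__main__":
    # Hypothetical adapter; the repository path is made up for illustration.
    print(validate_lora_adapter(
        {"name": "sql-adapter", "artifactURL": "hf://example-org/sql-lora"}
    ))
```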
Metadata
Metadata is a simplified version of ObjectMeta in Kubernetes.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
labels object (keys:string, values:string) | Map of string keys and values that can be used to organize and categorize (scope and select) objects. May match selectors of replication controllers and services. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels | ||
annotations object (keys:string, values:string) | Annotations is an unstructured key value map stored with a resource that may be set by external tools to store and retrieve arbitrary metadata. They are not queryable and should be preserved when modifying objects. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations |
MetricEndpoint
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
uri string | The metric uri, e.g. /metrics | /metrics | |
port integer | The port of pods exposing metric endpoints | 8100 |
ModelBackend
ModelBackend defines the configuration for a model backend.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
name string | Name is the name of the backend. It must be unique among the ModelBackend names in the same ModelBooster CR. Note: updating the name causes the old modelInfer to be deleted and a new modelInfer to be created. | | Pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$ |
type ModelBackendType | Type is the type of the backend. | | Enum: [vLLM vLLMDisaggregated SGLang MindIE MindIEDisaggregated] |
modelURI string | ModelURI is the URI from which the model is downloaded. Supported schemes: hf://, s3://, pvc://. | | Pattern: ^(hf://\|s3://\|pvc://).+ |
cacheURI string | CacheURI is the URI where the downloaded model is stored. Supported schemes: hostpath://, pvc://. | | Pattern: ^(hostpath://\|pvc://).+ |
minReplicas integer | MinReplicas is the minimum number of replicas for the backend. | | Maximum: 1e+06 Minimum: 0 |
maxReplicas integer | MaxReplicas is the maximum number of replicas for the backend. | | Maximum: 1e+06 Minimum: 1 |
scalingCost integer | ScalingCost is the cost associated with running this backend. | | Minimum: 0 |
routeWeight integer | RouteWeight specifies the percentage of traffic that should be sent to the target backend. It is used to create the model route. | 100 | Maximum: 100 Minimum: 0 |
workers ModelWorker array | Workers is the list of workers associated with this backend. | | MaxItems: 1000 MinItems: 1 |
loraAdapters LoraAdapter array | LoraAdapters is the list of LoRA adapters. | ||
autoscalingPolicy AutoscalingPolicySpec | AutoscalingPolicyRef references the autoscaling policy for this backend. | ||
schedulerName string | SchedulerName defines the name of the scheduler used by ModelServing for this backend. |
ModelBackendStatus
ModelBackendStatus defines the status of a model backend.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
name string | Name is the name of the backend. | ||
replicas integer | Replicas is the number of replicas currently running for the backend. |
ModelBackendType
Underlying type: string
ModelBackendType defines the type of model backend.
Validation:
- Enum: [vLLM vLLMDisaggregated SGLang MindIE MindIEDisaggregated]
Appears in:
| Field | Description |
|---|---|
vLLM | ModelBackendTypeVLLM represents a vLLM backend. |
vLLMDisaggregated | ModelBackendTypeVLLMDisaggregated represents a disaggregated vLLM backend. |
SGLang | ModelBackendTypeSGLang represents an SGLang backend. |
MindIE | ModelBackendTypeMindIE represents a MindIE backend. |
MindIEDisaggregated | ModelBackendTypeMindIEDisaggregated represents a disaggregated MindIE backend. |
ModelBooster
ModelBooster is the Schema for the models API.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
apiVersion string | workload.serving.volcano.sh/v1alpha1 | ||
kind string | ModelBooster | ||
spec ModelBoosterSpec | |||
status ModelStatus |
ModelBoosterList
ModelBoosterList contains a list of ModelBooster.
| Field | Description | Default | Validation |
|---|---|---|---|
apiVersion string | workload.serving.volcano.sh/v1alpha1 | ||
kind string | ModelBoosterList | ||
items ModelBooster array |
ModelBoosterSpec
ModelBoosterSpec defines the desired state of ModelBooster.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
name string | Name is the name of the model. The ModelBooster CR name is restricted by Kubernetes naming rules (for example, it cannot contain uppercase letters), so this field is used to specify the ModelBooster name. | | MaxLength: 64 Pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$ |
owner string | Owner is the owner of the model. | ||
backends ModelBackend array | Backends is the list of model backends associated with this model. A ModelBooster CR must have at least one ModelBackend. A ModelBackend is the minimum unit of an inference instance. It can be vLLM, SGLang, MindIE or other types. | | MinItems: 1 |
autoscalingPolicy AutoscalingPolicySpec | AutoscalingPolicy references the autoscaling policy to be used for this model. | ||
costExpansionRatePercent integer | CostExpansionRatePercent is the percentage rate at which the cost expands. | | Maximum: 1000 Minimum: 0 |
modelMatch ModelMatch | ModelMatch defines the predicate used to match LLM inference requests to a given TargetModels. Multiple match conditions are ANDed together, i.e. the match will evaluate to true only if all conditions are satisfied. |
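
To show how the ModelBooster, ModelBackend and ModelWorker fields nest, here is a sketch that builds a manifest as a Python dictionary and serializes it to JSON. Every concrete value (resource name, model URI, cache path, image) is hypothetical, and the set of fields shown is an assumption about a minimal object, not a confirmed list of required fields.

```python
import json

# All names, URIs and the image below are hypothetical placeholders.
model_booster = {
    "apiVersion": "workload.serving.volcano.sh/v1alpha1",
    "kind": "ModelBooster",
    "metadata": {"name": "demo-model"},
    "spec": {
        "name": "demo-model",
        "backends": [  # MinItems: 1
            {
                "name": "demo-vllm",
                "type": "vLLM",  # one of the ModelBackendType enum values
                "modelURI": "hf://example-org/example-model",
                "cacheURI": "hostpath:///var/cache/models",
                "minReplicas": 1,
                "maxReplicas": 4,
                "workers": [  # MinItems: 1
                    {
                        "type": "server",
                        "image": "example.com/vllm:latest",
                        "replicas": 1,
                        "pods": 1,
                    }
                ],
            }
        ],
    },
}

manifest_json = json.dumps(model_booster, indent=2)
print(manifest_json)
```

Because kubectl accepts JSON manifests, the printed output could be saved to a file and applied like any other manifest, provided the CRDs are installed in the cluster.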
ModelServing
ModelServing is the Schema for the LLM Serving API
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
apiVersion string | workload.serving.volcano.sh/v1alpha1 | ||
kind string | ModelServing | ||
spec ModelServingSpec | |||
status ModelServingStatus |
ModelServingList
ModelServingList contains a list of ModelServing
| Field | Description | Default | Validation |
|---|---|---|---|
apiVersion string | workload.serving.volcano.sh/v1alpha1 | ||
kind string | ModelServingList | ||
items ModelServing array |
ModelServingSpec
ModelServingSpec defines the specification of the ModelServing resource.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
replicas integer | Number of ServingGroups, that is, the number of instances that run serving tasks. Defaults to 1. | 1 | |
schedulerName string | SchedulerName defines the name of the scheduler used by ModelServing | ||
template ServingGroup | Template defines the template for ServingGroup | ||
rolloutStrategy RolloutStrategy | RolloutStrategy defines the strategy that will be applied to update replicas | ||
recoveryPolicy RecoveryPolicy | RecoveryPolicy defines the recovery policy for the failed Pod to be rebuilt | RoleRecreate | Enum: [ServingGroupRecreate RoleRecreate None] |
topologySpreadConstraints TopologySpreadConstraint array |
ModelServingStatus
ModelServingStatus defines the observed state of ModelServing
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
observedGeneration integer | observedGeneration is the most recent generation observed for ModelServing. It corresponds to the ModelServing's generation, which is updated on mutation by the API Server. | ||
replicas integer | Replicas tracks the total number of ServingGroups that have been created (updated or not, ready or not). | ||
currentReplicas integer | CurrentReplicas is the number of ServingGroups created by the ModelServing controller from the ModelServing version. | ||
updatedReplicas integer | UpdatedReplicas tracks the number of ServingGroups that have been updated (ready or not). | ||
availableReplicas integer | AvailableReplicas tracks the number of ServingGroups that are in the ready state (updated or not). |
ModelStatus
ModelStatus defines the observed state of ModelBooster.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
backendStatuses ModelBackendStatus array | BackendStatuses contains the status of each backend. | ||
observedGeneration integer | ObservedGeneration keeps track of the generation. |
ModelWorker
ModelWorker defines the model worker configuration.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
type ModelWorkerType | Type is the type of the model worker. | server | Enum: [server prefill decode controller coordinator] |
image string | Image is the container image for the worker. | ||
replicas integer | Replicas is the number of replicas for the worker. | | Maximum: 1e+06 Minimum: 0 |
pods integer | Pods is the number of pods for the worker. | | Maximum: 1e+06 Minimum: 0 |
config JSON | Config contains worker-specific configuration in JSON format. You can find vLLM config here https://docs.vllm.ai/en/stable/configuration/engine_args.html |
ModelWorkerType
Underlying type: string
ModelWorkerType defines the type of model worker.
Validation:
- Enum: [server prefill decode controller coordinator]
Appears in:
| Field | Description |
|---|---|
server | ModelWorkerTypeServer represents a server worker. |
prefill | ModelWorkerTypePrefill represents a prefill worker. |
decode | ModelWorkerTypeDecode represents a decode worker. |
controller | ModelWorkerTypeController represents a controller worker. |
coordinator | ModelWorkerTypeCoordinator represents a coordinator worker. |
OptimizerConfiguration
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
params OptimizerParam array | Parameters of multiple Model Serving Groups to be optimized. | | MinItems: 1 |
costExpansionRatePercent integer | CostExpansionRatePercent is the percentage rate at which the cost expands. | 200 | Minimum: 0 |
OptimizerParam
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
target Target | The scaling instance configuration | ||
cost integer | Cost is the cost associated with running this backend. | | Minimum: 0 |
minReplicas integer | MinReplicas is the minimum number of replicas for the backend. | | Maximum: 1e+06 Minimum: 0 |
maxReplicas integer | MaxReplicas is the maximum number of replicas for the backend. | | Maximum: 1e+06 Minimum: 1 |
PodTemplateSpec
PodTemplateSpec describes the data a pod should have when created from a template
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
metadata Metadata | Refer to Kubernetes API documentation for fields of metadata. |
RecoveryPolicy
Underlying type: string
Appears in:
| Field | Description |
|---|---|
ServingGroupRecreate | ServingGroupRecreate will recreate all the pods in the ServingGroup if 1. Any individual pod in the group is recreated; 2. Any container or init-container in a pod is restarted. This ensures that all pods and containers in the group are started at the same time. |
RoleRecreate | RoleRecreate will recreate all pods in one Role if 1. Any individual pod in the group is recreated; 2. Any container or init-container in a pod is restarted. |
None | NoneRestartPolicy will follow the same behavior as the default pod or deployment. |
Role
Role defines the specific pod instance role that performs the inference task.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
name string | The name of a role. The name must be unique within a ServingGroup. | | MaxLength: 12 Pattern: ^[a-zA-Z0-9]([-a-zA-Z0-9]*[a-zA-Z0-9])?$ |
replicas integer | The number of replicas of the role. For example, in disaggregated prefilling, setting the replica count for both the P and D roles to 1 results in a 1P1D deployment configuration. The same approach can be used to configure an xPyD deployment. Defaults to 1. | 1 | |
entryTemplate PodTemplateSpec | EntryTemplate defines the template for the entry pod of a role. Required: Currently, a role must have only one entry-pod. | ||
workerReplicas integer | WorkerReplicas defines the number for the worker pod of a role. Required: Need to set the number of worker-pod replicas. | ||
workerTemplate PodTemplateSpec | WorkerTemplate defines the template for the worker pod of a role. |
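
The relationship between replicas, the entry pod and workerReplicas can be sketched as a pod count per ServingGroup. This assumes that each role replica consists of exactly one entry pod plus workerReplicas worker pods, which is an inference from the field descriptions rather than a documented formula. The function name and the role names are hypothetical; the name checks use the MaxLength and Pattern listed for Role, and the 1 to 4 role limit comes from the roles field of ServingGroup.

```python
import re

# Pattern and length limit copied from the Role.name validation.
ROLE_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9]([-a-zA-Z0-9]*[a-zA-Z0-9])?$")
ROLE_NAME_MAX_LENGTH = 12


def pods_per_serving_group(roles: list[dict]) -> int:
    """Count the pods one ServingGroup would contain for the given roles."""
    if not 1 <= len(roles) <= 4:
        raise ValueError("a ServingGroup takes between 1 and 4 roles")
    total = 0
    for role in roles:
        name = role["name"]
        if len(name) > ROLE_NAME_MAX_LENGTH or not ROLE_NAME_PATTERN.match(name):
            raise ValueError(f"invalid role name: {name!r}")
        replicas = role.get("replicas", 1)  # documented default is 1
        # Assumption: one entry pod plus workerReplicas worker pods per replica.
        total += replicas * (1 + role["workerReplicas"])
    return total


if __name__ == "__main__":
    # 1P1D with no extra worker pods: one entry pod per role.
    one_p_one_d = [
        {"name": "prefill", "replicas": 1, "workerReplicas": 0},
        {"name": "decode", "replicas": 1, "workerReplicas": 0},
    ]
    print(pods_per_serving_group(one_p_one_d))
```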
RollingUpdateConfiguration
RollingUpdateConfiguration defines the parameters to be used for RollingUpdateStrategyType.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
maxUnavailable IntOrString | The maximum number of replicas that can be unavailable during the update. Value can be an absolute number (ex: 5) or a percentage of total replicas at the start of update (ex: 10%). Absolute number is calculated from percentage by rounding down. This can not be 0 if MaxSurge is 0. By default, a fixed value of 1 is used. | 1 | XIntOrString: {} |
maxSurge IntOrString | The maximum number of replicas that can be scheduled above the original number of replicas. Value can be an absolute number (ex: 5) or a percentage of total replicas at the start of the update (ex: 10%). Absolute number is calculated from percentage by rounding up. By default, a value of 0 is used. | 0 | XIntOrString: {} |
partition integer | Partition indicates the ordinal at which the ModelServing should be partitioned for updates. During a rolling update, all ServingGroups from ordinal Replicas-1 to Partition are updated. All ServingGroups from ordinal Partition-1 to 0 remain untouched. The default value is 0. |
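
The rounding rules for maxUnavailable and maxSurge and the ordinal range selected by partition can be sketched as below. Both helper names are hypothetical, and the sketch only mirrors the field descriptions above; it is not the controller's code.

```python
def resolve_int_or_percent(value, total: int, round_up: bool) -> int:
    """Resolve an IntOrString value such as 5 or "10%" against `total`.

    Percentages round down for maxUnavailable (round_up=False) and
    round up for maxSurge (round_up=True), as described in the table.
    """
    if isinstance(value, int):
        return value
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * total
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    raise ValueError(f"unsupported IntOrString value: {value!r}")


def ordinals_to_update(replicas: int, partition: int = 0) -> list[int]:
    """ServingGroup ordinals updated by a rollout: Replicas-1 down to Partition."""
    return list(range(replicas - 1, partition - 1, -1))


if __name__ == "__main__":
    print(resolve_int_or_percent("10%", 25, round_up=False))  # maxUnavailable
    print(resolve_int_or_percent("10%", 25, round_up=True))   # maxSurge
    print(ordinals_to_update(replicas=5, partition=2))
```

With 5 replicas and a partition of 2, ordinals 4, 3 and 2 are updated while ordinals 1 and 0 remain untouched.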
RolloutStrategy
RolloutStrategy defines the strategy that the ModelServing controller will use to perform replica updates.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
type RolloutStrategyType | Type defines the rollout strategy, it can only be “ServingGroupRollingUpdate” for now. | ServingGroupRollingUpdate | Enum: [ServingGroupRollingUpdate] |
rollingUpdateConfiguration RollingUpdateConfiguration | RollingUpdateConfiguration defines the parameters to be used when type is RollingUpdateStrategyType. Optional. |
RolloutStrategyType
Underlying type: string
Appears in:
| Field | Description |
|---|---|
ServingGroupRollingUpdate | ServingGroupRollingUpdate indicates that ServingGroup replicas will be updated one by one. |
ScalingConfiguration
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
target Target | Target represents the objects to be monitored and scaled. | ||
minReplicas integer | MinReplicas is the minimum number of replicas. | | Maximum: 1e+06 Minimum: 0 |
maxReplicas integer | MaxReplicas is the maximum number of replicas. | | Maximum: 1e+06 Minimum: 1 |
SelectPolicyType
Underlying type: string
SelectPolicyType defines the type of select policy.
Validation:
- Enum: [Or And]
Appears in:
| Field | Description |
|---|---|
Or | |
And |
ServingGroup
ServingGroup is the smallest unit to complete the inference task
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
restartGracePeriodSeconds integer | RestartGracePeriodSeconds defines the grace period for the controller to rebuild the ServingGroup when an error occurs. Defaults to 0 (the ServingGroup is rebuilt immediately after an error). | 0 | |
gangPolicy GangPolicy | GangPolicy defines the gang scheduler config. | ||
networkTopology NetworkTopologySpec | NetworkTopology defines the network topology affinity scheduling policy for the roles of the group. It takes effect only when the scheduler supports the network topology feature. | ||
roles Role array | | | MaxItems: 4 MinItems: 1 |
Target
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
additionalMatchLabels object (keys:string, values:string) | AdditionalMatchLabels is the additional labels to match the target object. | ||
metricEndpoint MetricEndpoint | MetricEndpoint is the metric source. |
TopologySpreadConstraint
TopologySpreadConstraint defines the topology spread constraint.
Appears in:
| Field | Description | Default | Validation |
|---|---|---|---|
maxSkew integer | MaxSkew describes the degree to which ServingGroup may be unevenly distributed. | ||
topologyKey string | TopologyKey is the key of node labels. Nodes that have a label with this key and identical values are considered to be in the same topology. | ||
whenUnsatisfiable string | WhenUnsatisfiable indicates how to deal with a ServingGroup if it doesn't satisfy the spread constraint. |