Version: Next

API Reference

Packages

networking.serving.volcano.sh/v1alpha1

networking.serving.volcano.sh/v1alpha1

Resource Types

ModelRoute
ModelRouteList
ModelServer
ModelServerList

BodyMatch

BodyMatch defines the predicate used to match request body content

Appears in:

ModelMatch

Field	Description	Default	Validation
`model` string	Model is the name of the model or lora adapter to match. If this field is not specified, any model or lora adapter will be matched.

GlobalRateLimit

GlobalRateLimit contains configuration for global rate limiting

Appears in:

RateLimit

Field	Description	Default	Validation
`redis` RedisConfig	Redis contains configuration for Redis-based global rate limiting.

InferenceEngine

Underlying type: string

InferenceEngine defines the inference framework used by the modelServer to serve LLM requests.

Validation:

Enum: [vLLM SGLang]

Appears in:

ModelServerSpec

Field	Description
`vLLM`	https://github.com/vllm-project/vllm
`SGLang`	https://github.com/sgl-project/sglang

KVConnectorSpec

KVConnectorSpec defines KV connector configuration for PD disaggregated routing

Appears in:

ModelServerSpec

Field	Description	Default	Validation
`type` KVConnectorType	Type specifies the connector type. If you do not know which type to use, please use "http" as default.	http	Enum: [http lmcache nixl mooncake]

KVConnectorType

Underlying type: string

Appears in:

KVConnectorSpec

Field	Description
`http`
`nixl`
`lmcache`
`mooncake`

ModelMatch

ModelMatch defines the predicate used to match LLM inference requests to the given TargetModels. Multiple match conditions are ANDed together, i.e. the match evaluates to true only if all conditions are satisfied.

Appears in:

ModelBoosterSpec
Rule

Field	Description	Default	Validation
`headers` object (keys:string, values:StringMatch)	Headers to match, each by prefix, exact, or regex. If unset, any header will be matched.
`uri` StringMatch	URI to match by prefix, exact, or regex. If this field is not specified, a default prefix match on the "/" path is provided.
`body` BodyMatch	Body contains conditions to match request body content
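
As a rough illustration, the fragment below sketches a `modelMatch` block that sets all three conditions. Only the field names come from the table above; the header name, path prefix, and model name are hypothetical values chosen for the example.

```yaml
# Fragment of a Rule. All conditions are ANDed, so a request must
# satisfy the header, URI, and body conditions to match.
modelMatch:
  headers:
    x-user-tier:            # hypothetical header name
      exact: premium        # StringMatch: only one of exact/prefix/regex may be set
  uri:
    prefix: /v1/chat        # hypothetical path prefix
  body:
    model: example-lora     # hypothetical model or LoRA adapter name
```

Leaving out `headers`, `uri`, or `body` relaxes the corresponding condition as described in the table.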

ModelRoute

ModelRoute is the Schema for the modelroutes API.

Appears in:

ModelRouteList

Field	Description	Default	Validation
`apiVersion` string	`networking.serving.volcano.sh/v1alpha1`
`kind` string	`ModelRoute`
`spec` ModelRouteSpec
`status` ModelRouteStatus

ModelRouteList

ModelRouteList contains a list of ModelRoute.

Field	Description	Default	Validation
`apiVersion` string	`networking.serving.volcano.sh/v1alpha1`
`kind` string	`ModelRouteList`
`items` ModelRoute array

ModelRouteSpec

ModelRouteSpec defines the desired state of ModelRoute.

Appears in:

ModelRoute

Field	Description	Validation
`modelName` string	The `model` in the LLM request; it can be a base model name, a LoRA adapter name, or even a virtual model name. This field is used to match scenarios other than model adapter names. It may be empty, but it and `loraAdapters` can't both be empty.
`loraAdapters` string array	The `model` in the LLM request can also be a LoRA adapter name; this is the list of LoRA adapter names to match.	MaxItems: 10
`rules` Rule array	An ordered list of route rules for LLM traffic. The first rule matching an incoming request will be used. If no rule is matched, an HTTP 404 status code MUST be returned.	MaxItems: 16
`rateLimit` RateLimit	Rate limit for the LLM request based on prompt tokens or output tokens. There is no limitation if this field is not set.
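
A minimal sketch of a complete ModelRoute, assuming the usual Kubernetes `metadata` block, may help tie the fields together. The resource name, namespace, model name, and ModelServer name are all hypothetical.

```yaml
apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelRoute
metadata:
  name: example-route            # hypothetical name
  namespace: default             # hypothetical namespace
spec:
  modelName: example-model       # hypothetical; compared with `model` in the request
  rules:
    - name: default
      # modelMatch is omitted here; an empty modelMatch matches all requests.
      targetModels:
        - modelServerName: example-server   # hypothetical ModelServer in the same namespace
```

Because rules are evaluated in order and the first match wins, more specific rules would normally be listed before a catch-all rule like this one.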

ModelRouteStatus

ModelRouteStatus defines the observed state of ModelRoute.

Appears in:

ModelRoute

ModelServer

ModelServer is the Schema for the modelservers API.

Appears in:

ModelServerList

Field	Description	Default	Validation
`apiVersion` string	`networking.serving.volcano.sh/v1alpha1`
`kind` string	`ModelServer`
`spec` ModelServerSpec
`status` ModelServerStatus

ModelServerList

ModelServerList contains a list of ModelServer.

Field	Description	Default	Validation
`apiVersion` string	`networking.serving.volcano.sh/v1alpha1`
`kind` string	`ModelServerList`
`items` ModelServer array

ModelServerSpec

ModelServerSpec defines the desired state of ModelServer.

Appears in:

ModelServer

Field	Description	Validation
`model` string	The real model that the modelServers are running. If the `model` in LLM inference request is different from this field, it should be overwritten by this field. Otherwise, the `model` in LLM inference request will not be mutated.	MaxLength: 256
`inferenceEngine` InferenceEngine	The inference engine used to serve the model.	Enum: [vLLM SGLang] Required: {}
`workloadSelector` WorkloadSelector	WorkloadSelector is used to match the model serving instances. Currently, they must be pods within the same namespace as modelServer object.	Required: {}
`workloadPort` WorkloadPort	WorkloadPort defines the port and protocol configuration for the model server.
`trafficPolicy` TrafficPolicy	Traffic Policy for accessing the model server instance.
`kvConnector` KVConnectorSpec	KVConnector specifies the KV connector configuration for PD disaggregated routing
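
The following is a hedged sketch of a ModelServer manifest using the fields above, assuming the usual Kubernetes `metadata` block. The names, labels, model identifier, port number, and retry count are hypothetical example values.

```yaml
apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelServer
metadata:
  name: example-server             # hypothetical name
  namespace: default               # hypothetical namespace
spec:
  model: example-org/example-model # hypothetical; the model the pods actually serve
  inferenceEngine: vLLM            # required; one of vLLM, SGLang
  workloadSelector:                # required; selects pods in this namespace
    matchLabels:
      app: example-vllm            # hypothetical pod label
  workloadPort:
    port: 8000                     # hypothetical port, must be in 1..65535
    protocol: http                 # http (default) or https
  trafficPolicy:
    retry:
      attempts: 3                  # hypothetical retry count
```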

ModelServerStatus

ModelServerStatus defines the observed state of ModelServer.

Appears in:

ModelServer

PDGroup

PDGroup specifies the group key of PD instances, as well as the labels used to match the model serving instances for prefill and decode.

Appears in:

WorkloadSelector

Field	Description	Default	Validation
`groupKey` string	GroupKey is the key to distinguish different PD groups. Only PD instances with the same group key and value can be paired.
`prefillLabels` object (keys:string, values:string)	The labels to match the model serving instances for prefill.
`decodeLabels` object (keys:string, values:string)	The labels to match the model serving instances for decode.
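
To make the PD-related fields concrete, here is a sketch of a ModelServer `spec` fragment that pairs `pdGroup` with `kvConnector`, since both are described in terms of PD disaggregation. The label keys and values are hypothetical; the fragment assumes pods carry these labels.

```yaml
# Fragment of a ModelServer spec for PD disaggregation.
spec:
  workloadSelector:
    matchLabels:
      app: example-llm                  # hypothetical base label on every serving pod
    pdGroup:
      groupKey: example.io/pd-group     # hypothetical label key; pods sharing its value can be paired
      prefillLabels:
        example.io/role: prefill        # hypothetical label on prefill pods
      decodeLabels:
        example.io/role: decode         # hypothetical label on decode pods
  kvConnector:
    type: http                          # default; other values: lmcache, nixl, mooncake
```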

RateLimit

Appears in:

ModelRouteSpec

Field	Description	Default	Validation
`inputTokensPerUnit` integer	InputTokensPerUnit is the maximum number of input tokens allowed per unit of time. If this field is not set, there is no limit on input tokens.		Minimum: 1
`outputTokensPerUnit` integer	OutputTokensPerUnit is the maximum number of output tokens allowed per unit of time. If this field is not set, there is no limit on output tokens.		Minimum: 1
`unit` RateLimitUnit	Unit is the time unit for the rate limit.	second	Enum: [second minute hour day month]
`global` GlobalRateLimit	Global contains configuration for global rate limiting using distributed storage. If this field is set, global rate limiting will be used; otherwise, local rate limiting will be used.
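
As an illustration of how these fields combine, the fragment below sketches a `rateLimit` block on a ModelRoute `spec`. The token counts and the Redis address are hypothetical example values.

```yaml
# Fragment of a ModelRoute spec.
rateLimit:
  inputTokensPerUnit: 10000     # hypothetical limit; minimum is 1
  outputTokensPerUnit: 2000     # hypothetical limit; minimum is 1
  unit: minute                  # second (default), minute, hour, day, or month
  global:                       # presence of this block switches to global rate limiting
    redis:
      address: "redis.example.svc:6379"   # hypothetical "host:port" address
```

Dropping the `global` block keeps the same limits but applies them with local rate limiting instead.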

RateLimitUnit

Underlying type: string

Validation:

Enum: [second minute hour day month]

Appears in:

RateLimit

Field	Description
`second`
`minute`
`hour`
`day`
`month`

RedisConfig

RedisConfig contains Redis connection configuration

Appears in:

GlobalRateLimit

Field	Description	Default	Validation
`address` string	Address is the Redis server address in the format "host:port".		Required: {}

Retry

Appears in:

TrafficPolicy

Field	Description	Default	Validation
`attempts` integer	The maximum number of times an individual inference request to a model server should be retried. If the maximum number of retries has been reached without a successful response, the request will be considered failed.

Rule

Appears in:

ModelRouteSpec

Field	Description	Validation
`name` string	Name is the name of the rule.
`modelMatch` ModelMatch	Match conditions to be satisfied for the rule to be activated. Empty `modelMatch` means matching all requests.
`targetModels` TargetModel array		MaxItems: 16

StringMatch

StringMatch defines the matching conditions for string fields. Only one of the fields may be set.

Appears in:

ModelMatch

Field	Description	Default	Validation
`exact` string
`prefix` string
`regex` string

TargetModel

TargetModel defines a target model for LLM inference traffic.

Appears in:

Rule

Field	Description	Default	Validation
`modelServerName` string	ModelServerName is used to specify the correlated modelServer within the same namespace.
`weight` integer	Weight specifies the percentage of traffic that should be sent to the target model. The value must be in the range [0, 100].	100	Maximum: 100 Minimum: 0
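
A weighted split between two targets could be sketched as follows. The ModelServer names are hypothetical, and the weights are chosen to sum to 100 only for readability; this reference does not state whether that is required.

```yaml
# Fragment of a Rule: most traffic goes to the stable server,
# the remainder to a canary.
targetModels:
  - modelServerName: example-server-stable   # hypothetical ModelServer name
    weight: 90
  - modelServerName: example-server-canary   # hypothetical ModelServer name
    weight: 10
```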

TrafficPolicy

Appears in:

ModelServerSpec

Field	Description	Default	Validation
`retry` Retry	The retry policy for the inference request.

WorkloadPort

WorkloadPort defines the port and protocol configuration for the model server.

Appears in:

ModelServerSpec

Field	Description	Default	Validation
`port` integer	The port of the model server. The number must be between 1 and 65535.		Maximum: 65535 Minimum: 1 Required: {}
`protocol` string	The protocol of the model server. Supported values are "http" and "https".	http	Enum: [http https]

WorkloadSelector

WorkloadSelector is used to match the model serving instances. Currently, they must be pods within the same namespace as modelServer object.

Appears in:

ModelServerSpec

Field	Description	Default	Validation
`matchLabels` object (keys:string, values:string)	The base labels to match the model serving instances. All serving instances must match these labels.
`pdGroup` PDGroup	PDGroup is used to further match different roles of the model serving instances, mainly used in cases like PD disaggregation.
