Version: Next

API Reference

Packages

networking.serving.volcano.sh/v1alpha1

Resource Types

BodyMatch

BodyMatch defines the predicate used to match request body content

Appears in:

  • ModelMatch

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| model string | Model is the name of the model or lora adapter to match. If this field is not specified, any model or lora adapter will be matched. | | |

GlobalRateLimit

GlobalRateLimit contains configuration for global rate limiting

Appears in:

  • RateLimit

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| redis RedisConfig | Redis contains configuration for Redis-based global rate limiting. | | |

InferenceEngine

Underlying type: string

InferenceEngine defines the inference framework used by the modelServer to serve LLM requests.

Validation:

  • Enum: [vLLM SGLang]

Appears in:

  • ModelServerSpec

| Field | Description |
| --- | --- |
| vLLM | https://github.com/vllm-project/vllm |
| SGLang | https://github.com/sgl-project/sglang |

KVConnectorSpec

KVConnectorSpec defines KV connector configuration for PD disaggregated routing

Appears in:

  • ModelServerSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| type KVConnectorType | Type specifies the connector type. If you do not know which type to use, please use "http" as default. | http | Enum: [http lmcache nixl mooncake] |

KVConnectorType

Underlying type: string

Appears in:

  • KVConnectorSpec

| Field | Description |
| --- | --- |
| http | |
| nixl | |
| lmcache | |
| mooncake | |

ModelMatch

ModelMatch defines the predicate used to match LLM inference requests to the given TargetModels. Multiple match conditions are ANDed together, i.e. the match evaluates to true only if all conditions are satisfied.

Appears in:

  • Rule

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| headers object (keys:string, values:StringMatch) | Header to match: prefix, exact, regex. If unset, any header will be matched. | | |
| uri StringMatch | URI to match: prefix, exact, regex. If this field is not specified, a default prefix match on the "/" path is provided. | | |
| body BodyMatch | Body contains conditions to match request body content. | | |
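
The AND semantics described above can be sketched in Python. This is only an illustration, not the router's implementation: the function names are hypothetical, header names are compared case-sensitively, regexes use Python's `re` syntax with an unanchored search, and an empty matcher is treated as matching anything. All of these are assumptions that the reference does not state.

```python
import re


def _string_match(matcher: dict, value: str) -> bool:
    """Evaluate a StringMatch-like dict (exact, prefix or regex)."""
    if "exact" in matcher:
        return value == matcher["exact"]
    if "prefix" in matcher:
        return value.startswith(matcher["prefix"])
    if "regex" in matcher:
        # Assumption: unanchored search with Python regex syntax.
        return re.search(matcher["regex"], value) is not None
    return True  # Assumption: an empty matcher matches anything.


def model_match(match: dict, headers: dict, uri: str, body: dict) -> bool:
    """Return True only if every configured condition is satisfied."""
    # headers: each listed header must be present and match its StringMatch.
    for name, matcher in match.get("headers", {}).items():
        if name not in headers or not _string_match(matcher, headers[name]):
            return False
    # uri: falls back to a prefix match on "/" when unset.
    if not _string_match(match.get("uri", {"prefix": "/"}), uri):
        return False
    # body.model: any model is accepted when the field is unset.
    wanted = match.get("body", {}).get("model")
    if wanted is not None and body.get("model") != wanted:
        return False
    return True
```

With a match that sets all three fields, a request is accepted only when its header, URI, and body model all agree; an empty match accepts every request whose path starts with "/".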

ModelRoute

ModelRoute is the Schema for the modelroutes API.

Appears in:

  • ModelRouteList

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | networking.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelRoute | | |
| spec ModelRouteSpec | | | |
| status ModelRouteStatus | | | |

ModelRouteList

ModelRouteList contains a list of ModelRoute.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | networking.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelRouteList | | |
| items ModelRoute array | | | |

ModelRouteSpec

ModelRouteSpec defines the desired state of ModelRoute.

Appears in:

  • ModelRoute

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| modelName string | The model named in the LLM request. It can be a base model name, a lora adapter name, or even a virtual model name. This field is used to match scenarios other than model adapter name. It may be empty, but it and loraAdapters cannot both be empty. | | |
| loraAdapters string array | The model in the LLM request can also be a lora adapter name; this is the list of lora adapter names to match. | | MaxItems: 10 |
| rules Rule array | An ordered list of route rules for LLM traffic. The first rule matching an incoming request will be used. If no rule is matched, an HTTP 404 status code MUST be returned. | | MaxItems: 16 |
| rateLimit RateLimit | Rate limit for the LLM request based on prompt tokens or output tokens. There is no limitation if this field is not set. | | |
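
As a rough illustration of the constraints and the first-match rule ordering described for ModelRouteSpec, the sketch below builds a manifest as a Python dict and checks it. The resource names, header name, and helper functions are hypothetical, and the "cannot both be empty" check takes the adapter list to be the loraAdapters field, which is an assumption.

```python
# A hypothetical ModelRoute manifest expressed as a Python dict.
model_route = {
    "apiVersion": "networking.serving.volcano.sh/v1alpha1",
    "kind": "ModelRoute",
    "metadata": {"name": "example-route"},  # hypothetical name
    "spec": {
        "modelName": "example-model",  # hypothetical model name
        "rules": [
            {
                "name": "premium",
                "modelMatch": {"headers": {"x-tier": {"exact": "premium"}}},
                "targetModels": [{"modelServerName": "server-a"}],
            },
            {
                # No modelMatch: this rule matches every request.
                "name": "default",
                "targetModels": [{"modelServerName": "server-b"}],
            },
        ],
    },
}


def validate_spec(spec: dict) -> list:
    """Collect violations of the documented ModelRouteSpec constraints."""
    errors = []
    if not spec.get("modelName") and not spec.get("loraAdapters"):
        errors.append("modelName and loraAdapters cannot both be empty")
    if len(spec.get("loraAdapters", [])) > 10:
        errors.append("loraAdapters: at most 10 items")
    if len(spec.get("rules", [])) > 16:
        errors.append("rules: at most 16 items")
    return errors


def select_rule(rules: list, matches) -> tuple:
    """Return (status, rule) for the first rule whose modelMatch is satisfied.

    `matches` is a caller-supplied predicate that evaluates a modelMatch
    against the current request.
    """
    for rule in rules:
        if not rule.get("modelMatch") or matches(rule["modelMatch"]):
            return 200, rule
    return 404, None  # no rule matched
```

Because rules are evaluated in order, a catch-all rule without modelMatch belongs at the end of the list; placed first, it would shadow every rule after it.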

ModelRouteStatus

ModelRouteStatus defines the observed state of ModelRoute.

Appears in:

  • ModelRoute

ModelServer

ModelServer is the Schema for the modelservers API.

Appears in:

  • ModelServerList

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | networking.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelServer | | |
| spec ModelServerSpec | | | |
| status ModelServerStatus | | | |

ModelServerList

ModelServerList contains a list of ModelServer.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| apiVersion string | networking.serving.volcano.sh/v1alpha1 | | |
| kind string | ModelServerList | | |
| items ModelServer array | | | |

ModelServerSpec

ModelServerSpec defines the desired state of ModelServer.

Appears in:

  • ModelServer

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| model string | The real model that the modelServers are running. If the model in LLM inference request is different from this field, it should be overwritten by this field. Otherwise, the model in LLM inference request will not be mutated. | | MaxLength: 256 |
| inferenceEngine InferenceEngine | The inference engine used to serve the model. | | Enum: [vLLM SGLang]; Required: {} |
| workloadSelector WorkloadSelector | WorkloadSelector is used to match the model serving instances. Currently, they must be pods within the same namespace as modelServer object. | | Required: {} |
| workloadPort WorkloadPort | WorkloadPort defines the port and protocol configuration for the model server. | | |
| trafficPolicy TrafficPolicy | Traffic Policy for accessing the model server instance. | | |
| kvConnector KVConnectorSpec | KVConnector specifies the KV connector configuration for PD disaggregated routing. | | |
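
To make the model-overwrite behaviour of the `model` field concrete, here is a small Python sketch alongside a hypothetical ModelServer manifest. The metadata, label values, port number, and the `rewrite_model` helper are illustrative assumptions; only the field names and allowed values come from the table above.

```python
from typing import Optional

# A hypothetical ModelServer manifest expressed as a Python dict.
model_server = {
    "apiVersion": "networking.serving.volcano.sh/v1alpha1",
    "kind": "ModelServer",
    "metadata": {"name": "example-server"},  # hypothetical name
    "spec": {
        "model": "example-base-model",  # hypothetical, at most 256 characters
        "inferenceEngine": "vLLM",  # required: vLLM or SGLang
        "workloadSelector": {"matchLabels": {"app": "example-llm"}},  # required
        "workloadPort": {"port": 8000, "protocol": "http"},
        "trafficPolicy": {"retry": {"attempts": 2}},
        "kvConnector": {"type": "http"},
    },
}


def rewrite_model(body: dict, server_model: Optional[str]) -> dict:
    """Overwrite the request's model when spec.model is set and differs.

    Returns the original body object untouched when no rewrite is needed.
    """
    if server_model and body.get("model") != server_model:
        return {**body, "model": server_model}
    return body
```

For example, a request for a virtual model name routed to this server would be forwarded with `model` set to `example-base-model`, while a request that already names that model passes through unchanged.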

ModelServerStatus

ModelServerStatus defines the observed state of ModelServer.

Appears in:

  • ModelServer

PDGroup

PDGroup specifies the group key of PD instances, as well as the labels used to match the model serving instances for prefill and decode.

Appears in:

  • WorkloadSelector

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| groupKey string | GroupKey is the key to distinguish different PD groups. Only PD instances with the same group key and value could be paired. | | |
| prefillLabels object (keys:string, values:string) | The labels to match the model serving instances for prefill. | | |
| decodeLabels object (keys:string, values:string) | The labels to match the model serving instances for decode. | | |

RateLimit

Appears in:

  • ModelRouteSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| inputTokensPerUnit integer | InputTokensPerUnit is the maximum number of input tokens allowed per unit of time. If this field is not set, there is no limit on input tokens. | | Minimum: 1 |
| outputTokensPerUnit integer | OutputTokensPerUnit is the maximum number of output tokens allowed per unit of time. If this field is not set, there is no limit on output tokens. | | Minimum: 1 |
| unit RateLimitUnit | Unit is the time unit for the rate limit. | second | Enum: [second minute hour day month] |
| global GlobalRateLimit | Global contains configuration for global rate limiting using distributed storage. If this field is set, global rate limiting will be used; otherwise, local rate limiting will be used. | | |
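
The reference does not say which algorithm enforces these limits. As one possible reading, the sketch below implements a local fixed-window limiter over token counts. The class name, the fixed-window choice, and the 30-day length of "month" are all assumptions made for illustration; a global limiter would keep the same counters in Redis instead of in process memory.

```python
import time

# Seconds per RateLimitUnit. Assumption: a month is treated as 30 days.
UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "month": 30 * 86400,
}


class FixedWindowTokenLimiter:
    """Allow at most `tokens_per_unit` tokens in each window of one unit."""

    def __init__(self, tokens_per_unit: int, unit: str = "second", clock=time.monotonic):
        if tokens_per_unit < 1:
            raise ValueError("tokens_per_unit must be at least 1")
        self.limit = tokens_per_unit
        self.window = UNIT_SECONDS[unit]
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def allow(self, tokens: int) -> bool:
        """Consume `tokens` if they fit in the current window."""
        now = self.clock()
        if now - self.window_start >= self.window:
            # A new window begins: reset the counter.
            self.window_start = now
            self.used = 0
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True
```

One limiter instance would be created per configured field, for example one for `inputTokensPerUnit` and another for `outputTokensPerUnit`, both sharing the same `unit`.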

RateLimitUnit

Underlying type: string

Validation:

  • Enum: [second minute hour day month]

Appears in:

  • RateLimit

| Field | Description |
| --- | --- |
| second | |
| minute | |
| hour | |
| day | |
| month | |

RedisConfig

RedisConfig contains Redis connection configuration

Appears in:

  • GlobalRateLimit

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| address string | Address is the Redis server address in the format "host:port". | | Required: {} |

Retry

Appears in:

  • TrafficPolicy

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| attempts integer | The maximum number of times an individual inference request to a model server should be retried. If the maximum number of retries has been reached without a successful response, the request will be considered failed. | | |
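
A minimal Python sketch of the retry behaviour follows. It assumes that `attempts` counts retries after the initial try, so a request is sent at most `1 + attempts` times; the reference does not spell this out, and the function name and the choice to retry on any exception are hypothetical.

```python
def send_with_retry(send, attempts: int):
    """Call `send()` and retry up to `attempts` more times on failure.

    `send` is any zero-argument callable that raises on failure.
    """
    last_error = None
    for _ in range(1 + max(attempts, 0)):
        try:
            return send()
        except Exception as exc:  # Assumption: any exception is retryable.
            last_error = exc
    # Retries are exhausted: the request is considered failed.
    raise RuntimeError("request failed after retries") from last_error
```

A real router would likely restrict retries to specific failures (connection errors, selected status codes) and might pick a different backend on each attempt; neither detail is covered by this field.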

Rule

Appears in:

  • ModelRouteSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| name string | Name is the name of the rule. | | |
| modelMatch ModelMatch | Match conditions to be satisfied for the rule to be activated. Empty modelMatch means matching all requests. | | |
| targetModels TargetModel array | | | MaxItems: 16 |

StringMatch

StringMatch defines the matching conditions for string fields. Only one of the fields may be set.

Appears in:

  • ModelMatch

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| exact string | | | |
| prefix string | | | |
| regex string | | | |
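
The three match modes can be illustrated with a short Python sketch. It assumes that `exact` compares the whole string, `prefix` is a starts-with test, and `regex` follows Python's `re` syntax with an unanchored search; the actual regex dialect and anchoring are not stated in this reference. The function name is hypothetical, and the sketch rejects a matcher that does not set exactly one field.

```python
import re


def string_match(match: dict, value: str) -> bool:
    """Evaluate a StringMatch-like dict against `value`."""
    set_fields = [k for k in ("exact", "prefix", "regex") if match.get(k) is not None]
    # Assumption: exactly one field is set, since only one may be set.
    if len(set_fields) != 1:
        raise ValueError("exactly one of exact, prefix, regex must be set")
    kind = set_fields[0]
    if kind == "exact":
        return value == match["exact"]
    if kind == "prefix":
        return value.startswith(match["prefix"])
    # Assumption: unanchored search using Python regex syntax.
    return re.search(match["regex"], value) is not None
```

For instance, `{"prefix": "/v1/"}` accepts `/v1/chat/completions`, while `{"exact": "/v1"}` does not.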

TargetModel

TargetModel defines a target model for LLM inference traffic.

Appears in:

  • Rule

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| modelServerName string | ModelServerName is used to specify the correlated modelServer within the same namespace. | | |
| weight integer | Weight is used to specify the percentage of traffic that should be sent to the target model. The value should be in the range of [0, 100]. | 100 | Maximum: 100; Minimum: 0 |
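
Weighted traffic splitting across targetModels might look like the following Python sketch. It treats weights as relative shares (they are normalized, so they need not sum to 100), which is an assumption; the reference only states the [0, 100] range and the default of 100. The function name and server names are hypothetical.

```python
import random


def pick_target(targets: list, rng=random) -> dict:
    """Pick one TargetModel-like dict at random, in proportion to its weight."""
    weights = [t.get("weight", 100) for t in targets]  # default weight is 100
    if any(w < 0 or w > 100 for w in weights):
        raise ValueError("weight must be in the range [0, 100]")
    if sum(weights) == 0:
        raise ValueError("no target has a positive weight")
    return rng.choices(targets, weights=weights, k=1)[0]
```

A canary rollout could, for example, give the stable server a weight of 90 and the new server a weight of 10, so roughly one request in ten reaches the new server.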

TrafficPolicy

Appears in:

  • ModelServerSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| retry Retry | The retry policy for the inference request. | | |

WorkloadPort

WorkloadPort defines the port and protocol configuration for the model server.

Appears in:

  • ModelServerSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| port integer | The port of the model server. The number must be between 1 and 65535. | | Maximum: 65535; Minimum: 1; Required: {} |
| protocol string | The protocol of the model server. Supported values are "http" and "https". | http | Enum: [http https] |
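
As a small illustration of how these two fields combine, the Python sketch below validates a WorkloadPort-like dict and builds a base URL for a serving pod. The helper name and the example IP address are hypothetical, and the sketch assumes an IPv4 address (an IPv6 address would need brackets).

```python
def workload_base_url(pod_ip: str, workload_port: dict) -> str:
    """Build the base URL of a model serving pod from its WorkloadPort."""
    port = workload_port["port"]  # required field
    protocol = workload_port.get("protocol", "http")  # defaults to http
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    if protocol not in ("http", "https"):
        raise ValueError('protocol must be "http" or "https"')
    return f"{protocol}://{pod_ip}:{port}"
```

Calling it with `{"port": 8000}` and a pod IP of `10.0.0.5` yields an `http` URL, since the protocol falls back to its default.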

WorkloadSelector

WorkloadSelector is used to match the model serving instances. Currently, they must be pods within the same namespace as modelServer object.

Appears in:

  • ModelServerSpec

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| matchLabels object (keys:string, values:string) | The base labels to match the model serving instances. All serving instances must match these labels. | | |
| pdGroup PDGroup | PDGroup is used to further match different roles of the model serving instances, mainly used in case like PD disaggregation. | | |
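
The interplay of matchLabels and pdGroup can be sketched in Python as follows. Pods are modelled as a plain mapping from pod name to labels, and the label keys and values in the usage note are hypothetical. The sketch assumes a pod must carry the groupKey label to take part in PD pairing and that a pod is either prefill or decode, not both.

```python
from collections import defaultdict


def _has_labels(pod_labels: dict, wanted: dict) -> bool:
    """True if the pod carries every wanted label with the same value."""
    return all(pod_labels.get(k) == v for k, v in wanted.items())


def group_pd_pods(selector: dict, pods: dict) -> dict:
    """Group pod names by PD group value and role.

    Returns {group value: {"prefill": [...], "decode": [...]}}.
    """
    pd = selector["pdGroup"]
    groups = defaultdict(lambda: {"prefill": [], "decode": []})
    for name, labels in pods.items():
        # Every serving instance must match the base labels.
        if not _has_labels(labels, selector.get("matchLabels", {})):
            continue
        # Only pods with the same group key and value can be paired.
        group_value = labels.get(pd["groupKey"])
        if group_value is None:
            continue
        if _has_labels(labels, pd.get("prefillLabels", {})):
            groups[group_value]["prefill"].append(name)
        elif _has_labels(labels, pd.get("decodeLabels", {})):
            groups[group_value]["decode"].append(name)
    return dict(groups)
```

With a selector whose groupKey is `pd-group`, a prefill pod labelled `pd-group: g1` would only be paired with decode pods that also carry `pd-group: g1`.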