Gateway Inference Extension Support
Overview
Gateway Inference Extension provides a standardized way to expose AI/ML inference services through Kubernetes Gateway API. This guide demonstrates how to integrate Kthena-deployed models with the upstream Gateway API Inference Extension, enabling intelligent routing and traffic management for inference workloads.
The Gateway API Inference Extension extends the standard Kubernetes Gateway API with inference-specific resources:
- InferencePool: Manages collections of model server endpoints with automatic discovery and health monitoring
- InferenceObjective: Defines priority and capacity policies for inference requests (a minimal example follows this list)
- Gateway Integration: Seamless integration with popular gateway implementations including Kthena Router (native support), Envoy Gateway, Istio and Kgateway
- Model-Aware Routing: Advanced routing capabilities based on model names, adapters, and request characteristics
- OpenAI API Compatibility: Full support for OpenAI-compatible endpoints (/v1/chat/completions, /v1/completions)
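The InferencePool used in this guide is shown later; for InferenceObjective, a minimal sketch looks roughly like the following. The API group, version, and field names depend on the Inference Extension release you install, so verify against the CRDs in your cluster rather than treating this as a drop-in manifest:
# Illustrative only: an InferenceObjective that assigns a priority to traffic served by the
# kthena-demo InferencePool created later in this guide. Check the installed CRD schema
# before applying; the group/version shown here may differ in your release.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: kthena-demo-objective
spec:
  priority: 1        # relative priority; higher values are favored under load
  poolRef:
    name: kthena-demo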
Prerequisites
- Kubernetes cluster with Kthena installed (see Installation)
- Gateway API installed (see Gateway API)
- Basic understanding of Gateway API and Gateway Inference Extension
Getting Started
Deploy Sample Model Server
First, deploy a model that will serve as the backend for the Gateway Inference Extension. Follow the Quick Start guide to deploy a model in the default namespace and ensure it's in Active state.
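If you followed the Quick Start, the model is represented by a ModelBooster resource (the Cleanup section below assumes one named demo); a quick status check might look like:
# Check that the deployed model has reached the Active state
# (resource kind and name assumed from the Quick Start and the Cleanup section)
kubectl get modelbooster demo -n <your-namespace>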
After deployment, identify the labels of your model pods as these will be used to associate the InferencePool with your model instances:
# Get the model pods and their labels
kubectl get pods -n <your-namespace> -l workload.serving.volcano.sh/managed-by=workload.serving.volcano.sh --show-labels
# Example output shows labels like:
# modelserving.volcano.sh/name=demo-backend1
# modelserving.volcano.sh/group-name=demo-backend1-0
# modelserving.volcano.sh/role=leader
# workload.serving.volcano.sh/model-name=demo
# workload.serving.volcano.sh/backend-name=backend1
# workload.serving.volcano.sh/managed-by=workload.serving.volcano.sh
Install the Inference Extension CRDs
Install the Gateway API Inference Extension CRDs in your cluster:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml
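Before continuing, you can confirm the CRDs were registered:
# List the Inference Extension CRDs (covers both the inference.networking.k8s.io and inference.networking.x-k8s.io groups)
kubectl get crd | grep inference.networking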
Deploy the InferencePool and Endpoint Picker Extension
Choose one of the following options based on your gateway implementation:
- Kthena Router
- Istio
Kthena Router natively supports Gateway Inference Extension and does not require the Endpoint Picker Extension. You can directly create an InferencePool resource that selects your Kthena model endpoints:
cat <<EOF | kubectl apply -f -
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: kthena-demo
spec:
  targetPorts:
  - number: 8000  # Adjust based on your model server port
  selector:
    matchLabels:
      workload.serving.volcano.sh/model-name: demo
EOF
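You can then confirm the pool was created:
# The InferencePool should exist; its status reflects endpoint discovery once pods match the selector
kubectl get inferencepool kthena-demo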
Install an InferencePool that selects Kthena model endpoints with the appropriate labels. The Helm install command installs the Endpoint Picker and InferencePool, along with provider-specific resources.
For Istio deployment:
export GATEWAY_PROVIDER=istio
export IGW_CHART_VERSION=v1.0.1-rc.1
# Install InferencePool and Endpoint Picker pointing to your Kthena model pods
helm install kthena-demo \
--set inferencePool.modelServers.matchLabels."workload\.serving\.volcano\.sh/model-name"=demo \
--set provider.name=$GATEWAY_PROVIDER \
--version $IGW_CHART_VERSION \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
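Once the chart is installed, a quick check confirms the release and, assuming the chart names the InferencePool after the Helm release, the pool itself:
# Inspect the Helm release
helm status kthena-demo
# The InferencePool name is assumed to match the release name
kubectl get inferencepool kthena-demo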
Deploy an Inference Gateway
- Kthena Router
- Istio
Kthena Router natively supports Gateway API and Gateway Inference Extension. You don't need to deploy additional gateway components, but you need to enable the Gateway API and Gateway Inference Extension flags in your Kthena Router deployment.
- Enable Gateway API and Gateway Inference Extension in Kthena Router:
You need to add the required flags to your Kthena Router deployment. You can do this by patching the deployment:
kubectl patch deployment kthena-router -n kthena-system --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/args/-",
    "value": "--enable-gateway-api=true"
  },
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/args/-",
    "value": "--enable-gateway-api-inference-extension=true"
  }
]'
Alternatively, you can edit the deployment directly:
kubectl edit deployment kthena-router -n kthena-system
Then add the following flags to the args section of the kthena-router container:
args:
- --port=8080
- --enable-webhook=true
# ... other existing args ...
- --enable-gateway-api=true
- --enable-gateway-api-inference-extension=true
Wait for the deployment to roll out:
kubectl rollout status deployment/kthena-router -n kthena-system
- Deploy the Gateway:
Create a Gateway resource that uses the kthena-router GatewayClass:
cat <<EOF | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: kthena-router
  listeners:
  - name: http
    port: 8080
    protocol: HTTP
EOF
- Deploy the HTTPRoute:
Create and apply the HTTPRoute configuration that connects the gateway to your InferencePool:
cat <<EOF | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: kthena-demo-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
    namespace: kthena-system
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: kthena-demo
    matches:
    - path:
        type: PathPrefix
        value: /
    timeouts:
      request: 300s
EOF
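Optionally, confirm that both flags are now present on the router's container arguments:
# Print the kthena-router container args; both --enable-gateway-api and
# --enable-gateway-api-inference-extension should appear
kubectl get deployment kthena-router -n kthena-system -o jsonpath='{.spec.template.spec.containers[0].args}'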
Deploy the Istio-based inference gateway and routing configuration:
- Install Istio (if not already installed):
TAG=1.27.1
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.27.1 TARGET_ARCH=x86_64 sh -
cd istio-$TAG/bin
./istioctl install --set values.pilot.env.ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true
- Deploy the Gateway:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/istio/gateway.yaml
- Deploy the HTTPRoute:
Create and apply the HTTPRoute configuration that connects the gateway to your InferencePool:
cat <<EOF | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: kthena-demo-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: kthena-demo
    matches:
    - path:
        type: PathPrefix
        value: /
    timeouts:
      request: 300s
EOF
Verify Gateway Installation
Confirm that the Gateway was assigned an IP address and reports a Programmed=True status:
kubectl get gateway inference-gateway
# Expected output:
# NAME CLASS ADDRESS PROGRAMMED AGE
# inference-gateway istio <GATEWAY_IP> True 30s
Verify that all components are properly configured:
# Check Gateway status
kubectl get gateway inference-gateway -o yaml
# Check HTTPRoute status - should show Accepted=True and ResolvedRefs=True
kubectl get httproute kthena-demo-route -o yaml
# Check InferencePool status
kubectl get inferencepool kthena-demo -o yaml
Try it out
Wait until the gateway is ready and test inference through the gateway:
- Kthena Router
- Istio
# Get the kthena-router IP or hostname
ROUTER_IP=$(kubectl get service kthena-router -n kthena-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# If LoadBalancer is not available, use NodePort or port-forward
# kubectl port-forward -n kthena-system service/kthena-router 80:80
# Test the default port
curl http://${ROUTER_IP}:80/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-0.5B-Instruct",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
# Get the gateway IP address
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80
# Test completions endpoint
curl -i ${IP}:${PORT}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen2.5-0.5B-Instruct",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
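The overview also lists /v1/chat/completions; a similar request (shown here against the Istio gateway address, substitute ${ROUTER_IP}:80 when using the Kthena Router) exercises that endpoint:
# Test the chat completions endpoint with the same model
curl -i ${IP}:${PORT}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Write as if you were a critic: San Francisco"}],
    "max_tokens": 100,
    "temperature": 0
  }'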
Cleanup
To clean up all resources created in this guide:
- Uninstall the InferencePool and model resources:
kubectl delete inferencepool kthena-demo --ignore-not-found
kubectl delete modelbooster demo -n <your-namespace> --ignore-not-found
- Remove Gateway API Inference Extension CRDs:
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml --ignore-not-found
- Clean up Gateway resources:
kubectl delete gateway inference-gateway --ignore-not-found
kubectl delete httproute kthena-demo-route --ignore-not-found
- Remove Istio (if you want to clean up everything):
istioctl uninstall -y --purge
kubectl delete ns istio-system
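Finally, you can confirm that no gateway resources remain:
# No Gateway or HTTPRoute objects should be left across namespaces
kubectl get gateway,httproute -A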