Kubernetes Native
Declarative CRDs for end-to-end AI inference lifecycle management. Gang scheduling, network-topology-aware placement, and Volcano integration bring enterprise-grade orchestration to your existing K8s infrastructure.
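As an illustration of the declarative style, a minimal sketch of what a custom resource for an inference deployment might look like. The `kind`, group, and field names here are hypothetical placeholders, not the project's actual CRD schema; the Volcano `schedulerName` and gang-scheduling annotation reflect Volcano's general integration pattern:

```yaml
# Hypothetical example resource; field names are illustrative only.
apiVersion: inference.example.io/v1alpha1
kind: InferenceService          # placeholder kind, not an actual CRD
metadata:
  name: llama-70b
  annotations:
    # Volcano-style gang scheduling: all replicas start together or not at all.
    scheduling.volcano.sh/group-name: llama-70b-gang
spec:
  model: meta-llama/Llama-3-70B
  replicas: 4
  schedulerName: volcano        # delegate placement to the Volcano scheduler
  placement:
    networkTopologyAware: true  # prefer replicas on the same switch/rack
```

Applying a resource like this would let the controller reconcile the full lifecycle (rollout, scaling, teardown) instead of requiring imperative scripts.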
Intelligent Routing
Request-level scheduling with pluggable scoring plugins: least-latency, KV-cache-aware, prefix-cache-match, and LoRA-affinity routing. Per-model fair queuing and token-based rate limiting keep throughput high for every model.
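To make the plugin model concrete, here is a minimal sketch of weighted score-based routing in Python. The `Endpoint` fields, plugin signatures, and weights are assumptions for illustration, not the project's actual interfaces:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

@dataclass
class Endpoint:
    name: str
    avg_latency_ms: float              # recent moving-average latency
    cached_prefixes: Set[str] = field(default_factory=set)

# A scoring plugin maps (endpoint, request prompt) -> score in [0, 1].
ScorePlugin = Callable[[Endpoint, str], float]

def least_latency(ep: Endpoint, prompt: str) -> float:
    # Lower latency scores higher; normalize against a 1 s budget.
    return max(0.0, 1.0 - ep.avg_latency_ms / 1000.0)

def prefix_cache_match(ep: Endpoint, prompt: str) -> float:
    # Reward endpoints that already hold a matching prompt prefix.
    return 1.0 if any(prompt.startswith(p) for p in ep.cached_prefixes) else 0.0

def route(endpoints: List[Endpoint], prompt: str,
          plugins: List[ScorePlugin], weights: List[float]) -> Endpoint:
    # Weighted sum of plugin scores; the highest total wins.
    def total(ep: Endpoint) -> float:
        return sum(w * p(ep, prompt) for p, w in zip(plugins, weights))
    return max(endpoints, key=total)
```

With equal weights, an endpoint holding a cached prefix can win over a faster cold endpoint, which is exactly the trade-off prefix-cache-aware routing is meant to capture.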
Hierarchical Prefill/Decode (PD) Disaggregation Orchestration
Separate prefill and decode phases into independently scalable serving groups. Prefill nodes maximize compute throughput while decode nodes optimize for low latency, enabling fine-grained GPU utilization and flexible scaling ratios.
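A back-of-the-envelope sketch of how such a scaling ratio could be derived: balance the per-request load of the compute-bound prefill phase against the latency-bound decode phase. The function name and the throughput figures below are hypothetical, not measured values or a real API:

```python
def pd_replica_split(total_gpus: int,
                     prompt_tokens: int, output_tokens: int,
                     prefill_tok_per_s: float, decode_tok_per_s: float):
    """Split a GPU pool between prefill and decode serving groups.

    Balances per-request GPU-seconds spent in each phase; all inputs
    are illustrative workload estimates, not live metrics.
    """
    prefill_load = prompt_tokens / prefill_tok_per_s   # GPU-s per request, prefill
    decode_load = output_tokens / decode_tok_per_s     # GPU-s per request, decode
    frac = prefill_load / (prefill_load + decode_load)
    prefill = max(1, round(total_gpus * frac))
    prefill = min(prefill, total_gpus - 1)             # keep at least one decode GPU
    return prefill, total_gpus - prefill
```

Because prefill processes whole prompts in parallel while decode emits one token per step, decode throughput per GPU is far lower, so a long-output workload naturally tilts the split toward decode replicas.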