📄️ Model Booster
The rules for generated resource names
📄️ Router Routing
This page describes the router routing features and capabilities in Kthena, based on real-world examples and configurations.
📄️ Multi-Node Inference
This page describes the multi-node inference capabilities in Kthena, based on real-world examples and configurations.
📄️ Inference Router Customization
Overview
📄️ Autoscaler Features
This page describes the autoscaling features and capabilities in Kthena.
📄️ Router Rate Limiting
Unlike traditional microservices that use request count or connection-based rate limiting, AI inference scenarios require token-based rate limiting. This is because AI requests can vary dramatically in computational cost - a single request with 10,000 tokens consumes far more GPU resources than 100 requests with 10 tokens each. Token-based limits ensure fair resource allocation based on actual computational consumption rather than simple request counts.
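The idea above can be sketched as a token bucket that debits by the number of LLM tokens a request consumes rather than counting requests. This is a minimal illustrative sketch, not Kthena's actual implementation; the class and parameter names are hypothetical.

```python
import time


class TokenBudgetLimiter:
    """Hypothetical sketch: rate limit by LLM token consumption, not request count.

    Refills a budget of `tokens_per_second` up to a `burst` ceiling; a request
    is admitted only if its token cost fits in the remaining budget.
    """

    def __init__(self, tokens_per_second: float, burst: float):
        self.rate = tokens_per_second
        self.capacity = burst
        self.available = burst
        self.last = time.monotonic()

    def allow(self, requested_tokens: int) -> bool:
        # Refill the budget based on elapsed time, capped at the burst ceiling.
        now = time.monotonic()
        self.available = min(self.capacity, self.available + (now - self.last) * self.rate)
        self.last = now
        # Debit the request's token cost if the budget covers it.
        if requested_tokens <= self.available:
            self.available -= requested_tokens
            return True
        return False


# Under this scheme, one 10,000-token request can be rejected while many
# small requests within the same budget are admitted.
limiter = TokenBudgetLimiter(tokens_per_second=100, burst=1000)
```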
📄️ Runtime
Kthena Runtime is a lightweight sidecar service designed to standardize Prometheus metrics from inference engines, provide LoRA adapter download/load/unload capabilities, and support model downloading.
🗃️ Prefill Decode Disaggregation
1 item