📄️ Model Booster
The rules for generated resource names
📄️ Router Routing
This page describes the router routing features and capabilities in Kthena, based on real-world examples and configurations.
📄️ Multi-Node Inference
This page describes the multi-node inference capabilities in Kthena, based on real-world examples and configurations.
📄️ Inference Router Customization
Overview
📄️ Autoscaler Features
This page describes the autoscaling features and capabilities in Kthena.
📄️ Router Rate Limiting
Unlike traditional microservices that use request count or connection-based rate limiting, AI inference scenarios require token-based rate limiting. This is because AI requests can vary dramatically in computational cost - a single request with 10,000 tokens consumes far more GPU resources than 100 requests with 10 tokens each. Token-based limits ensure fair resource allocation based on actual computational consumption rather than simple request counts.
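The idea above can be sketched as a token bucket that debits by the number of LLM tokens a request consumes rather than counting requests. This is a minimal illustrative sketch, not Kthena's actual implementation; the class and parameter names are hypothetical.

```python
import time


class TokenBudgetLimiter:
    """Hypothetical sketch: rate limit by LLM token consumption, not request count.

    Refills a budget of `tokens_per_second` up to a `burst` ceiling; a request
    is admitted only if its token cost fits in the remaining budget.
    """

    def __init__(self, tokens_per_second: float, burst: float):
        self.rate = tokens_per_second
        self.capacity = burst
        self.available = burst
        self.last = time.monotonic()

    def allow(self, requested_tokens: int) -> bool:
        # Refill the budget based on elapsed time, capped at the burst ceiling.
        now = time.monotonic()
        self.available = min(self.capacity, self.available + (now - self.last) * self.rate)
        self.last = now
        # Debit the request's token cost if the budget covers it.
        if requested_tokens <= self.available:
            self.available -= requested_tokens
            return True
        return False


# Under this scheme, one 10,000-token request can be rejected while many
# small requests within the same budget are admitted.
limiter = TokenBudgetLimiter(tokens_per_second=100, burst=1000)
```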
📄️ Runtime
Kthena Runtime is a lightweight sidecar service designed to standardize Prometheus metrics from inference engines, provide LoRA adapter download/load/unload capabilities, and support model downloading.
🗃️ Prefill Decode Disaggregation
1 item