📄️ Model Deployment
ModelBooster vs ModelServing Deployment Approaches
📄️ Multi-Node Inference
This page describes the multi-node inference capabilities in Kthena, based on real-world examples and configurations.
📄️ Binpack scale down
Binpack scale-down maximizes available node capacity by consolidating remaining replicas onto as few nodes as possible, helping the cluster prepare for upcoming resource-intensive tasks.
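The core idea can be sketched in a few lines of Go. This is a minimal illustration of the binpack policy under assumed data structures, not Kthena's actual implementation: when scaling down, evict replicas from the least-packed nodes first, so partially used nodes drain out and whole nodes become free.

```go
package main

import (
	"fmt"
	"sort"
)

// node pairs a node name with the number of replica pods it hosts
// (hypothetical structure for illustration).
type node struct {
	name string
	pods int
}

// pickVictims returns the pods to evict when scaling down by `count`
// replicas. Draining the emptiest nodes first is the binpack policy:
// the replicas that remain stay packed on as few nodes as possible.
func pickVictims(nodes []node, count int) []string {
	sort.Slice(nodes, func(i, j int) bool {
		return nodes[i].pods < nodes[j].pods // emptiest node first
	})
	var victims []string
	for _, n := range nodes {
		for p := 0; p < n.pods && len(victims) < count; p++ {
			victims = append(victims, fmt.Sprintf("%s/pod-%d", n.name, p))
		}
	}
	return victims
}

func main() {
	cluster := []node{{"node-a", 4}, {"node-b", 1}, {"node-c", 2}}
	// Scale down by 3: node-b (1 pod) and node-c (2 pods) are drained,
	// leaving node-a fully packed and two nodes completely free.
	fmt.Println(pickVictims(cluster, 3))
}
```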
📄️ Network Topology
In distributed AI inference, communication latency between nodes directly affects inference efficiency. A topology-aware scheduler can place frequently communicating tasks on nodes that are close in network distance, significantly reducing communication overhead. Because bandwidth varies across network links, topology-aware scheduling also avoids congestion and keeps high-bandwidth links fully utilized, improving overall data transmission efficiency.
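As a rough illustration of the scoring this implies, the Go sketch below ranks candidate nodes by their total network distance to already-placed peer replicas. The zone/switch topology model and hop costs are assumptions for the example, not Kthena's scheduler code.

```go
package main

import "fmt"

// topology captures a node's assumed position in the network:
// its zone and its top-of-rack switch.
type topology struct {
	zone string
	rack string
}

// distance returns a rough hop cost between two nodes: 0 for the same
// rack switch, 1 within a zone, 2 across zones. Lower cost implies
// faster, higher-bandwidth links between the nodes.
func distance(a, b topology) int {
	switch {
	case a.rack == b.rack:
		return 0
	case a.zone == b.zone:
		return 1
	default:
		return 2
	}
}

// score sums distances from a candidate node to every peer replica;
// a topology-aware scheduler prefers the lowest total.
func score(candidate topology, peers []topology) int {
	total := 0
	for _, p := range peers {
		total += distance(candidate, p)
	}
	return total
}

func main() {
	peers := []topology{{"zone-a", "tor-1"}, {"zone-a", "tor-1"}}
	for _, c := range []topology{
		{"zone-a", "tor-1"}, // same switch as both peers: best
		{"zone-a", "tor-2"}, // same zone, different switch
		{"zone-b", "tor-9"}, // different zone: worst
	} {
		fmt.Printf("%+v -> score %d\n", c, score(c, peers))
	}
}
```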
📄️ Autoscaler
Overview
🗃️ Router
5 items
📄️ Runtime
Kthena Runtime is a lightweight sidecar service designed to standardize Prometheus metrics from inference engines, provide LoRA adapter download/load/unload capabilities, and support model downloading.
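A minimal sketch of the metrics-standardization idea: a sidecar scrapes the co-located engine's Prometheus endpoint and rewrites engine-specific metric names to a common scheme, so dashboards and autoscalers stay engine-agnostic. The engine port, endpoint, and metric mapping below are assumptions for illustration, not the Kthena Runtime implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

// rename maps engine-specific metric names to a standard one
// (hypothetical mapping for illustration).
var rename = map[string]string{
	"vllm:num_requests_waiting": "inference_requests_waiting",
}

// metricsHandler proxies the engine's /metrics output, renaming
// known metric families line by line.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	// Scrape the co-located engine (assumed to listen on :8000).
	resp, err := http.Get("http://127.0.0.1:8000/metrics")
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		for from, to := range rename {
			line = strings.ReplaceAll(line, from, to)
		}
		fmt.Fprintln(w, line)
	}
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	// Prometheus scrapes the standardized endpoint on the sidecar.
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```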
🗃️ Prefill Decode Disaggregation
1 item