🗃️ Model Deployment
1 item
📄️ Multi-Node Inference
This page describes Kthena's multi-node inference capabilities through real-world examples and configurations.
📄️ Network Topology
In distributed AI inference, communication latency between nodes directly affects inference efficiency. A topology-aware scheduler can place frequently communicating tasks on nodes that are close in network distance, significantly reducing communication overhead. Because bandwidth varies across network links, topology-aware scheduling can also avoid congestion and keep high-bandwidth links fully utilized, improving overall data transmission efficiency. A hedged scheduling sketch follows below.
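As a rough illustration of topology-aware placement, the sketch below co-locates the workers of one inference task within a single network zone using standard Kubernetes pod affinity. This is not Kthena's scheduling API: `topology.kubernetes.io/zone` is a well-known Kubernetes node label, but the task label key and image are hypothetical placeholders.

```yaml
# Sketch: keep the workers of one inference task in the same network zone,
# so their traffic stays on short, high-bandwidth links.
# Standard Kubernetes pod affinity; label key and image are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: decode-worker-0
  labels:
    inference.example.com/task: llama-70b-session-a   # hypothetical task label
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              inference.example.com/task: llama-70b-session-a
          topologyKey: topology.kubernetes.io/zone   # keep task members network-close
  containers:
    - name: worker
      image: example.com/inference-worker:latest      # hypothetical image
```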
📄️ Autoscaler
Automatically scales inference workloads to match demand.
🗃️ Router
5 items
🗃️ Observability
1 item
📄️ Runtime
Kthena Runtime is a lightweight sidecar service that standardizes Prometheus metrics from inference engines, provides LoRA adapter download/load/unload capabilities, and supports model downloading. A hedged sidecar sketch follows below.
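To make the sidecar pattern concrete, here is a minimal sketch of how such a runtime container might sit alongside an inference engine in one pod, re-exposing the engine's native metrics in a normalized Prometheus format. The image names, ports, and flag are assumptions for illustration, not the actual Kthena Runtime interface.

```yaml
# Sketch: an inference engine plus a runtime sidecar in one pod.
# The sidecar reads the engine's native metrics endpoint and serves a
# standardized Prometheus endpoint. Images, ports, and args are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: engine-with-runtime
spec:
  containers:
    - name: engine
      image: vllm/vllm-openai:latest                 # the inference engine
      ports:
        - containerPort: 8000                        # engine API + native metrics
    - name: runtime                                  # lightweight sidecar
      image: example.com/kthena-runtime:latest       # hypothetical image
      args:
        - --engine-metrics=http://localhost:8000/metrics   # hypothetical flag
      ports:
        - containerPort: 9090                        # normalized Prometheus metrics
```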
📄️ Binpack Scale Down
Binpack scale-down maximizes available node capacity, helping the cluster prepare efficiently for upcoming resource-intensive tasks; a hedged policy sketch follows below.
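One way to read this policy: when scaling in, remove replicas from the least-packed nodes first, so the remaining load consolidates and whole nodes become free for large upcoming workloads. The sketch below expresses that preference as a hypothetical policy resource; the kind, apiVersion, and field names are illustrative assumptions, not Kthena's actual schema.

```yaml
# Sketch: a hypothetical scale-down policy that evicts replicas from the
# emptiest nodes first, consolidating load so whole nodes free up.
# All field names below are illustrative only.
apiVersion: example.com/v1alpha1
kind: AutoscalingPolicy
metadata:
  name: binpack-scale-down
spec:
  scaleDown:
    strategy: Binpack           # pick victims from the least-utilized nodes first
    nodeUtilizationMetric: gpu  # rank nodes by GPU utilization
    drainThreshold: 0.3         # only drain nodes running below 30% utilization
```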
🗃️ Prefill Decode Disaggregation
1 item