📄️ Model Deployment
ModelBooster vs ModelServing Deployment Approaches
📄️ Multi-Node Inference
This page describes the multi-node inference capabilities in Kthena, based on real-world examples and configurations.
📄️ Binpack scale down
Binpack scale-down maximizes available node capacity by consolidating remaining replicas onto as few nodes as possible, helping the cluster prepare for upcoming resource-intensive tasks.
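The core idea can be sketched in a few lines of Go. This is a minimal illustration of the binpack policy under assumed data structures, not Kthena's actual implementation: when scaling down, evict replicas from the least-packed nodes first, so partially used nodes drain out and whole nodes become free.

```go
package main

import (
	"fmt"
	"sort"
)

// node pairs a node name with the number of replica pods it hosts
// (hypothetical structure for illustration).
type node struct {
	name string
	pods int
}

// pickVictims returns the pods to evict when scaling down by `count`
// replicas. Draining the emptiest nodes first is the binpack policy:
// the replicas that remain stay packed on as few nodes as possible.
func pickVictims(nodes []node, count int) []string {
	sort.Slice(nodes, func(i, j int) bool {
		return nodes[i].pods < nodes[j].pods // emptiest node first
	})
	var victims []string
	for _, n := range nodes {
		for p := 0; p < n.pods && len(victims) < count; p++ {
			victims = append(victims, fmt.Sprintf("%s/pod-%d", n.name, p))
		}
	}
	return victims
}

func main() {
	cluster := []node{{"node-a", 4}, {"node-b", 1}, {"node-c", 2}}
	// Scale down by 3: node-b (1 pod) and node-c (2 pods) are drained,
	// leaving node-a fully packed and two nodes completely free.
	fmt.Println(pickVictims(cluster, 3))
}
```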
📄️ Network Topology
In distributed AI inference, communication latency between nodes directly affects inference efficiency. A topology-aware scheduler can place frequently communicating tasks on nodes that are close in network distance, significantly reducing communication overhead. Because bandwidth varies across network links, topology-aware scheduling also avoids congestion and keeps high-bandwidth links fully utilized, improving overall data transmission efficiency.
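As a rough illustration of the scoring this implies, the Go sketch below ranks candidate nodes by their total network distance to already-placed peer replicas. The zone/switch topology model and hop costs are assumptions for the example, not Kthena's scheduler code.

```go
package main

import "fmt"

// topology captures a node's assumed position in the network:
// its zone and its top-of-rack switch.
type topology struct {
	zone string
	rack string
}

// distance returns a rough hop cost between two nodes: 0 for the same
// rack switch, 1 within a zone, 2 across zones. Lower cost implies
// faster, higher-bandwidth links between the nodes.
func distance(a, b topology) int {
	switch {
	case a.rack == b.rack:
		return 0
	case a.zone == b.zone:
		return 1
	default:
		return 2
	}
}

// score sums distances from a candidate node to every peer replica;
// a topology-aware scheduler prefers the lowest total.
func score(candidate topology, peers []topology) int {
	total := 0
	for _, p := range peers {
		total += distance(candidate, p)
	}
	return total
}

func main() {
	peers := []topology{{"zone-a", "tor-1"}, {"zone-a", "tor-1"}}
	for _, c := range []topology{
		{"zone-a", "tor-1"}, // same switch as both peers: best
		{"zone-a", "tor-2"}, // same zone, different switch
		{"zone-b", "tor-9"}, // different zone: worst
	} {
		fmt.Printf("%+v -> score %d\n", c, score(c, peers))
	}
}
```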
📄️ Autoscaler
Overview
🗃️ Router
5 items
📄️ Runtime
Kthena Runtime is a lightweight sidecar service designed to standardize Prometheus metrics from inference engines, provide LoRA adapter download/load/unload capabilities, and support model downloading.
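A minimal sketch of the metrics-standardization idea: a sidecar scrapes the co-located engine's Prometheus endpoint and rewrites engine-specific metric names to a common scheme, so dashboards and autoscalers stay engine-agnostic. The engine port, endpoint, and metric mapping below are assumptions for illustration, not the Kthena Runtime implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

// rename maps engine-specific metric names to a standard one
// (hypothetical mapping for illustration).
var rename = map[string]string{
	"vllm:num_requests_waiting": "inference_requests_waiting",
}

// metricsHandler proxies the engine's /metrics output, renaming
// known metric families line by line.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	// Scrape the co-located engine (assumed to listen on :8000).
	resp, err := http.Get("http://127.0.0.1:8000/metrics")
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		for from, to := range rename {
			line = strings.ReplaceAll(line, from, to)
		}
		fmt.Fprintln(w, line)
	}
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	// Prometheus scrapes the standardized endpoint on the sidecar.
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```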
🗃️ Prefill Decode Disaggregation
1 item