
Autoscaler

As inference traffic changes in real time, the hardware resources required to serve it fluctuate as well. Kthena Autoscaler is an optional component of the Kthena system that runs in Kubernetes environments and dynamically adjusts the number of deployed serving instances based on real-time load, maintaining healthy business metrics (such as SLO indicators) while optimizing computational resource consumption.

Feature Description

Kthena Autoscaler periodically collects runtime metrics from the Pods of managed serving instances. Based on user-specified monitoring metrics and their target values, it calculates the required number of serving instances and performs scaling operations according to configured scaling policies.

(Figure: the AutoScaler fetches metrics from the Pods of each inference instance (e.g. the Prefill and Decode Pods of Instance-0, Instance-1, and Instance-2), updates the replica count in the Model Serving CR, and the Model Serving Controller watches the CR and reconciles, scaling instances up or down.)
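
In other words, the loop periodically fetches metrics, computes a desired replica count, and writes it back to the Model Serving CR for the Model Serving Controller to reconcile. The sketch below is only an outline of that loop under these assumptions; the callables passed in (fetch_metrics, compute_desired_replicas, update_replicas) are placeholders, not Kthena's actual API.

```python
import time

def autoscaler_loop(fetch_metrics, compute_desired_replicas, update_replicas,
                    interval_seconds=15):
    """Illustrative outline of the autoscaling loop; the three callables stand
    in for Kthena's internal steps and are not its real API."""
    while True:
        metrics = fetch_metrics()                    # 1. fetch metrics from instance Pods
        desired = compute_desired_replicas(metrics)  # 2. compute the desired replica count
        update_replicas(desired)                     # 3. update replicas in the Model Serving CR;
                                                     #    the controller watches it and reconciles
        time.sleep(interval_seconds)
```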

Kthena Autoscaler provides two scaling granularities: Homogeneous Instances Autoscale and Heterogeneous Instances Autoscale.

Homogeneous Instances Autoscale

Homogeneous Instances Autoscale scales a single type of inference instance. Its behavior is similar to the Knative Pod Autoscaler (KPA), supporting both Stable and Panic modes, and it scales instances of the same type (engines deployed via the same Deployment or ModelServing CR) based on business metrics.
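
The replica count is derived from the ratio of an observed metric to its per-instance target, with a short panic window overriding the stable window during sudden bursts. The sketch below illustrates this KPA-style calculation; the function name, parameters, and threshold value are illustrative assumptions, not Kthena's actual API.

```python
import math

def desired_replicas(stable_window_avg, panic_window_avg, target_per_replica,
                     current_replicas, panic_threshold=2.0):
    """Hypothetical KPA-style calculation; names and signature are assumptions.

    stable_window_avg / panic_window_avg: average of the scaling metric
    (e.g. in-flight requests) over a long and a short window.
    target_per_replica: user-specified target value per serving instance.
    """
    stable_desired = math.ceil(stable_window_avg / target_per_replica)
    panic_desired = math.ceil(panic_window_avg / target_per_replica)

    # Panic mode: when short-window load far exceeds current capacity,
    # scale on the panic window and never scale down.
    if current_replicas > 0 and panic_desired >= panic_threshold * current_replicas:
        return max(panic_desired, current_replicas)

    # Stable mode: follow the long-window average.
    return stable_desired
```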

Heterogeneous Instances Autoscale

For the same model, serving instances can be deployed in multiple configurations, including:

  • Heterogeneous resource types (GPU/NPU)
  • Different inference engines (vLLM/SGLang)
  • Various runtime parameters (e.g., tensor-parallel/data-parallel (TP/DP) configurations)

While these differently deployed serving instances provide identical inference services and expose consistent business functionality externally, they differ in hardware resource requirements and performance capabilities. The figure below illustrates a sample deployment scenario.

(Figure: a model served by two heterogeneous instance groups, Group1 (low computing power) and Group2 (strong computing power). As incoming requests rise, the replica mix is re-optimized, e.g. from Group1: 2 / Group2: 2 to Group1: 2 / Group2: 3; as requests fall, it is reduced, e.g. to Group1: 1 / Group2: 0.)
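
One way to picture such a setup is as a list of instance groups that serve the same model but carry different costs and replica bounds. The structure below is only an illustrative sketch; the field names (cost, minReplicas, maxReplicas) mirror the parameters discussed on this page, not the exact Kthena CR schema.

```python
# Illustrative description of two heterogeneous instance groups serving the
# same model; the field names are assumptions, not the exact Kthena CR schema.
instance_groups = [
    {"name": "group1-low-power",    "cost": 1, "minReplicas": 0, "maxReplicas": 4},
    {"name": "group2-strong-power", "cost": 4, "minReplicas": 0, "maxReplicas": 3},
]
```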

Heterogeneous Instances Autoscale functionality consists of two main components: instance prediction and instance scheduling.

Instance Prediction follows the same logic as Homogeneous Instances Autoscale, dynamically calculating the total desired number of instances for the serving instance group in each scheduling cycle based on runtime metrics.

Instance Scheduling dynamically adjusts the proportion of functionally identical but differently deployed serving instances to maximize hardware resource utilization. From a modeling perspective, this scheduling phase can be viewed as an integer programming problem in which transitions between scheduling cycles carry significant cost because of model cold start overhead.

Heterogeneous Instances Autoscale employs a greedy algorithm with a doubling strategy. It first calculates the available capacity as the sum of differences between maxReplicas and minReplicas for each instance type, representing the manageable instance capacity. During each scaling operation, it selects a portion of this capacity to scale up.

The capacity of each instance type is divided into chunks whose sizes follow successive powers of costExpansionRate; these chunks serve as scaling batches. Batches from different serving instance types are then mixed and sorted in ascending order of cost, and the sorted batches are expanded back into individual instances, yielding a scaling sequence seq whose length equals the total capacity.

$$seq = \mathrm{sorted}\left(\bigcup_{i=1}^{N}\left(\left\{P^k \cdot c_i \mid k \in \{0,1,\ldots,M_i\}\right\} \cup \left\{\Bigl(C_i - \sum_{k=0}^{M_i} P^k\Bigr) \cdot c_i\right\}\right)\right)$$

Where:

  • $N$: Number of serving instance types
  • $P$: costExpansionRate for serving instances
  • $c_i$: Cost of the $i$-th type of inference instance
  • $M_i$: Number of explicit power terms for the $i$-th type of inference instance
  • $C_i$: Capacity of the $i$-th type of inference instance (maxReplicas - minReplicas)

The costExpansionRate can be interpreted as the "cost expansion ratio for the next batch" within each inference instance type.
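
For example, with costExpansionRate $P = 2$, an instance type with capacity $C_i = 10$ and unit cost $c_i = 1$ is split into batches of size 1, 2, and 4 (costs 1, 2, and 4) plus a remainder batch of size 3 (cost 3).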

When the prediction requires K instances, the first K instances in the seq order are retained. This strategy helps reuse previously launched instances during scaling operations, thereby reducing cold start overhead.
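
A minimal sketch of this procedure is shown below, assuming each instance group is described by the hypothetical cost, minReplicas, and maxReplicas fields from the earlier example; it is an illustrative reconstruction of the batching and selection logic, not Kthena's actual implementation.

```python
def build_scaling_sequence(groups, cost_expansion_rate):
    """Build the cost-ordered scaling sequence `seq` described above.

    groups: list of dicts with "name", "cost", "minReplicas", "maxReplicas"
    (hypothetical structure, see the example above).
    """
    batches = []  # (batch_cost, group_name, batch_size)
    for g in groups:
        capacity = g["maxReplicas"] - g["minReplicas"]
        size, covered = 1, 0
        # Full batches of size P^0, P^1, ... while they still fit the capacity.
        while covered + size <= capacity:
            batches.append((size * g["cost"], g["name"], size))
            covered += size
            size *= cost_expansion_rate
        # Remainder batch of size C_i - sum(P^k).
        if capacity > covered:
            batches.append(((capacity - covered) * g["cost"], g["name"], capacity - covered))

    # Mix batches from all groups and sort them by cost in ascending order.
    batches.sort(key=lambda b: b[0])

    # Expand the sorted batches back into individual instances.
    seq = []
    for _, name, size in batches:
        seq.extend([name] * size)
    return seq


def select_instances(seq, k):
    """Retain the first K instances of the cost-ordered sequence."""
    return seq[:k]
```

With the two example groups above and costExpansionRate = 2, the sequence starts with the cheaper group's batches and only brings in the expensive group as more capacity is requested. Because the prefix of seq is stable across scheduling cycles, small changes in K largely reuse instances that are already running, which is what reduces cold start overhead.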

Constraints

When using Kthena Autoscaler, note the following constraints:

  • An inference instance can only be managed by either Homogeneous Autoscale or Heterogeneous Autoscale, but not both simultaneously. In practice, this means you should not configure the same Deployment or Kthena CR with multiple autoscaling policies.

  • Before using Heterogeneous Autoscale, you must configure the cost for each inference instance type. It is recommended to set the cost parameter to represent the computational power of each instance. Operators can use scripts provided in the Kthena project to obtain these values.