Version: 0.1.0

Autoscaler

As inference requests change in real time, the hardware resources required to serve them also fluctuate dynamically. Kthena Autoscaler, an optional component of the Kthena project that runs in a Kubernetes environment, dynamically adjusts the number of deployed serving instances based on their real-time load. It keeps business metrics (such as SLO indicators) healthy while reducing the consumption of computational resources.

Feature Description

Kthena Autoscaler periodically collects runtime metrics from the Pods of managed serving instances. Based on one or more monitoring metrics specified by the user and their corresponding target values, it estimates the required number of serving instances. Finally, it performs scaling operations according to the configured scaling policies.
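
As a rough sketch of this estimation step, the snippet below derives a desired replica count from each metric's ratio of observed value to target and keeps the largest requirement; the type, field, and metric names are illustrative assumptions, not the actual Kthena API.

```go
package main

import (
	"fmt"
	"math"
)

// MetricTarget pairs an observed metric value (averaged across the Pods of a
// serving instance) with its user-configured target value.
// These names are illustrative, not the actual Kthena CRD fields.
type MetricTarget struct {
	Name    string
	Current float64
	Target  float64
}

// estimateReplicas returns the desired instance count: for each metric the
// ratio current/target is scaled by the current replica count, and the
// largest requirement across all metrics wins.
func estimateReplicas(currentReplicas int, metrics []MetricTarget) int {
	desired := 0
	for _, m := range metrics {
		need := int(math.Ceil(float64(currentReplicas) * m.Current / m.Target))
		if need > desired {
			desired = need
		}
	}
	return desired
}

func main() {
	metrics := []MetricTarget{
		{Name: "num_requests_waiting", Current: 12, Target: 5},
		{Name: "gpu_cache_usage_perc", Current: 0.6, Target: 0.8},
	}
	fmt.Println(estimateReplicas(4, metrics)) // -> 10
}
```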

Figure: the AutoScaler (1) fetches metrics from the Prefill and Decode Pods of the inference instances (Instance-0, Instance-1) and (2) updates the replicas in the Model Serving CR; the Model Serving Controller (3) watches the CR and (4) reconciles it, scaling up a new inference instance (Instance-2).

Kthena Autoscaler provides two granularities of scaling methods: Homogeneous Instances Autoscale and Heterogeneous Instances Autoscale.

Homogeneous Instances Autoscale

Homogeneous Instances Autoscale scales a single type of inference instance. Its behavior is similar to KPA (Knative Pod Autoscaler), supporting both Stable and Panic modes, and it scales a single type of deployment (engines deployed via the same Deployment or Model Infer CR) based on business metrics.
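
The sketch below illustrates KPA-style Stable/Panic mode selection: a short panic window reacts to bursts, and while in panic mode the autoscaler never scales down. The threshold value and function signature are assumptions for illustration, not Kthena's actual configuration.

```go
package main

import "fmt"

// desiredReplicas sketches Stable/Panic mode selection. stableDesired and
// panicDesired are the replica estimates computed over the long (stable) and
// short (panic) metric windows; exiting panic mode after a quiet period is
// omitted here for brevity.
func desiredReplicas(stableDesired, panicDesired, current int, panicThreshold float64, inPanic bool) (int, bool) {
	// Enter panic mode when the short-window estimate far exceeds what the
	// current replicas can handle.
	if current > 0 && float64(panicDesired)/float64(current) >= panicThreshold {
		inPanic = true
	}
	if inPanic {
		// While panicking, only allow scale-up; otherwise hold the current count.
		if panicDesired > current {
			return panicDesired, true
		}
		return current, true
	}
	// In stable mode, follow the long-window estimate (up or down).
	return stableDesired, false
}

func main() {
	// A burst pushes the panic-window estimate to 2.5x the current capacity.
	got, panicking := desiredReplicas(3, 5, 2, 2.0, false)
	fmt.Println(got, panicking) // 5 true
}
```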

Heterogeneous Instances Autoscale

For the same model, serving instances can be deployed in multiple different ways: on heterogeneous resource types (GPU/NPU), with different inference engines (vLLM/SGLang), or even with different runtime parameters (e.g., TP/DP configuration). While these differently deployed serving instances all provide normal inference service and expose consistent business functionality externally, they differ in the hardware resources they require and the business capability they provide. The figure below shows an example of the runtime behavior.

Figure: heterogeneous inference instances of the same model are split into Group1 (low computing power) and Group2 (strong computing power). Starting from Group1: 2 / Group2: 2, the optimizer raises the replicas to Group1: 2 / Group2: 3 as incoming requests rise, and reduces them to Group1: 1 / Group2: 0 as incoming requests drop.

The functionality of Heterogeneous Instances Autoscale can be divided into two parts: predicting instance count and scheduling instances.

Predicting instances follows the same logic as Homogeneous Instances Autoscale: dynamically calculating the total desired number of instances for the group of serving instances in each scheduling cycle based on runtime metrics.

The scheduling phase dynamically adjusts the proportion of these functionally identical but differently deployed serving instances, with the optimization goal of maximizing hardware resource utilization. From a modeling perspective, it can be viewed as an integer programming problem, with the additional consideration that state transitions between scheduling cycles are expensive due to the overhead of model cold starts.
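
As a rough illustration, and not necessarily the exact objective Kthena uses, the scheduling decision can be modeled as choosing replica counts $r_i$ per instance type that cover the predicted total $K$ at minimum overall cost:

$$
\begin{aligned}
\min_{r_1,\ldots,r_N} \quad & \sum_{i=1}^{N} c_i \cdot r_i \\
\text{s.t.} \quad & \sum_{i=1}^{N} r_i = K, \quad \text{minReplicas}_i \le r_i \le \text{maxReplicas}_i, \quad r_i \in \mathbb{Z}
\end{aligned}
$$

The greedy sequence described next approximates such a solution while favoring the reuse of already-running instances.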

Heterogeneous Instances Autoscale therefore adopts a greedy algorithm with a doubling strategy. It first treats the sum, over all instance types, of the difference between the maximum instance count maxReplicas and the minimum instance count minReplicas as the available capacity, meaning there are capacity manageable instances in total; in each cycle it selects a portion of them to scale up.

Following the doubling idea, the capacity of each instance type is divided into chunks whose sizes are successive powers of costExpansionRate; these chunks serve as batches during scaling. At the batch level, the batches of all serving instance types are mixed and sorted by batch cost in ascending order, and each sorted batch is then expanded back into its constituent instances. The result is a sequence seq of length capacity that defines the scaling order.

$$seq = \mathrm{sorted}\left(\bigcup_{i=1}^{N} \left\{ P^{k} \cdot c_i \mid k \in \{0, 1, \ldots, M_i\} \right\} \cup \left\{ \Big(C_i - \sum_{k=0}^{M_i} P^{k}\Big) \cdot c_i \right\} \right)$$

  • $N$: Number of types of serving instances

  • $P$: costExpansionRate for serving instances

  • $c_i$: Cost of the i-th type of inference instance

  • $M_i$: Number of explicit power terms of the i-th type of inference instance

  • $C_i$: Capacity of the i-th type of inference instance (maxReplicas - minReplicas)

Thus, costExpansionRate can be considered as the "cost expansion ratio for the next batch" within each type of inference instance.

When the prediction phase requires K instances, the first K instances in the seq order are retained. This strategy ensures, to a certain extent, that previously launched instances are reused across scaling operations, thereby reducing cold start overhead.
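
A minimal Go sketch of this sequence construction is shown below; the struct fields and function names are illustrative assumptions rather than Kthena's actual data structures, but the chunking, sorting, and expansion follow the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// InstanceType describes one way of deploying the model.
// Field names are illustrative, not the actual Kthena CR schema.
type InstanceType struct {
	Name              string
	Cost              float64 // per-instance cost c_i (e.g. relative compute power)
	MinReplicas       int
	MaxReplicas       int
	CostExpansionRate int // P in the formula, assumed > 1
}

// batch is one chunk of instances of a single type.
type batch struct {
	typeName string
	size     int     // instances in this chunk: P^k or the remainder
	cost     float64 // chunk cost = size * per-instance cost
}

// buildScalingSequence returns instance type names in the order they should
// be scaled up; the first K entries of the result are the K instances kept
// above minReplicas when the prediction phase asks for K instances.
func buildScalingSequence(types []InstanceType) []string {
	var batches []batch
	for _, t := range types {
		capacity := t.MaxReplicas - t.MinReplicas // C_i
		remaining := capacity
		chunk := 1 // P^0
		for remaining >= chunk {
			batches = append(batches, batch{t.Name, chunk, float64(chunk) * t.Cost})
			remaining -= chunk
			chunk *= t.CostExpansionRate // next power of P
		}
		if remaining > 0 { // remainder chunk: C_i - sum of the power terms
			batches = append(batches, batch{t.Name, remaining, float64(remaining) * t.Cost})
		}
	}
	// Sort batches by chunk cost, cheapest first.
	sort.Slice(batches, func(i, j int) bool { return batches[i].cost < batches[j].cost })
	// Expand each batch back into individual instances.
	var seq []string
	for _, b := range batches {
		for n := 0; n < b.size; n++ {
			seq = append(seq, b.typeName)
		}
	}
	return seq
}

func main() {
	types := []InstanceType{
		{Name: "gpu-vllm", Cost: 4, MinReplicas: 1, MaxReplicas: 6, CostExpansionRate: 2},
		{Name: "npu-sglang", Cost: 1, MinReplicas: 0, MaxReplicas: 5, CostExpansionRate: 2},
	}
	seq := buildScalingSequence(types)
	fmt.Println(seq) // take the first K entries as the scaling order
}
```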

Constraints

When using Kthena Autoscaler, note the following:

  • The same type of inference instance cannot have both Homogeneous Autoscale and Heterogeneous Autoscale enabled simultaneously. In other words, from an operational perspective, do not bind the same Deployment or Kthena CR to different Binding CRs at the same time.

  • Before using Heterogeneous Autoscale, you need to configure the cost for each type of inference instance. It is recommended to set the cost parameter to the computational power of each instance. Operators can run the script in the Kthena project to obtain this value.