
Network Topology

In distributed AI inference, communication latency between nodes directly affects inference efficiency. With awareness of the network topology, the scheduler can place frequently communicating tasks on nodes that are close in network distance, significantly reducing communication overhead. Because bandwidth varies across network links, topology-aware scheduling can also avoid congestion and make full use of high-bandwidth links, improving overall data transmission efficiency.

Overview

Kthena leverages Volcano to achieve network-topology-aware scheduling. To abstract away the differences between data center network types, Volcano defines a new CRD, HyperNode, to represent the network topology and provide a standardized API. A HyperNode represents a network topology performance domain, typically mapped to a switch or a Top-of-Rack (ToR) switch. Multiple HyperNodes are connected hierarchically to form a tree structure. For example, the following diagram shows a network topology composed of multiple HyperNodes:

[Figure: a network topology composed of multiple HyperNodes]

In this structure, the communication efficiency between nodes depends on the HyperNode hierarchy span between them. For example:

  • node0 and node1 belong to s0, achieving the highest communication efficiency.
  • node1 and node2 need to cross two layers of HyperNodes (s0→s2→s1), resulting in lower communication efficiency.

In the structure above, Tier represents the level of a HyperNode in the hierarchy. The lower the tier, the higher the communication efficiency between nodes within that HyperNode. For more details about the HyperNode CRD, please refer to the Volcano Network Topology Aware documentation.
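
For illustration, the leaf domain s0 and its tier-2 parent s2 from the diagram above could be declared roughly as follows (a minimal sketch; the node names node0 and node1 come from the diagram, and the selector choices are assumptions for this example):

apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: s0
spec:
  tier: 1                  # leaf tier: node0 and node1 communicate most efficiently
  members:
    - type: Node
      selector:
        exactMatch:
          name: "node0"
    - type: Node
      selector:
        exactMatch:
          name: "node1"
---
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: s2
spec:
  tier: 2                  # aggregates the tier-1 HyperNodes s0 and s1
  members:
    - type: HyperNode
      selector:
        exactMatch:
          name: "s0"
    - type: HyperNode
      selector:
        exactMatch:
          name: "s1"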

A Volcano PodGroup can set the job's topology constraints through its networkTopology field, which supports the following configuration:

  • mode: Supports hard and soft modes.
    • hard: Hard constraint, tasks within the job must be deployed within the same HyperNode.
    • soft: Soft constraint, tasks are deployed within the same HyperNode as much as possible.
  • highestTierAllowed: Used with hard mode, indicating the highest tier of HyperNode allowed for job deployment. This field is not required when mode is soft.

For example, the following configuration means the job can only be deployed within HyperNodes of tier 1 or lower, such as s0 and s1. Otherwise, the job will remain in the Pending state:

spec:
  networkTopology:
    mode: hard
    highestTierAllowed: 1
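
For context, a complete PodGroup carrying this constraint might look like the following (a minimal sketch; the name, minMember, and queue values are placeholders for this example):

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: example-podgroup     # placeholder name
spec:
  minMember: 4               # placeholder: minimum number of pods required for gang scheduling
  queue: default
  networkTopology:
    mode: hard               # all pods must land within one HyperNode subtree
    highestTierAllowed: 1    # that subtree may sit at tier 1 at most (e.g. s0 or s1)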

Kthena's model serving provides fields to configure network topology constraints for ServingGroups and Roles. For example, the following configuration means the ServingGroup can be deployed within HyperNodes of tier 2 or lower, such as s2, while each Role can be deployed within HyperNodes of tier 1 or lower, such as s0 and s1.

spec:
  replicas: 1 # servingGroup replicas
  template:
    networkTopology:
      rolePolicy:
        mode: hard
        highestTierAllowed: 1
      groupPolicy:
        mode: hard
        highestTierAllowed: 2

Prerequisites

  • A running Kubernetes cluster with Kthena installed.
  • Install Volcano. If you want to experiment with role-based network-topology-aware scheduling, the Volcano components must run the latest image (see the example install command after this list).
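
One common way to install Volcano is from the upstream installer manifest (a sketch; the development manifest on the master branch is used here as an assumption, pick a released version if you prefer):

# Install Volcano from the upstream development manifest
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml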

Get Started

  1. Create HyperNode resources

We need to create HyperNode resources to represent the network topology of the local cluster. The cluster used in this demonstration is a three-node Kubernetes cluster created by kind, with the three nodes named kthena-control-plane, kthena-worker, and kthena-worker2.
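
Such a cluster can be created with a kind configuration like the one below (a sketch; the cluster name kthena is what produces the node names above):

# kind-cluster.yaml: one control-plane node and two workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: kthena
nodes:
  - role: control-plane
  - role: worker
  - role: worker

Create the cluster with kind create cluster --config kind-cluster.yaml.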

The HyperNode resources used here are as follows:

apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: s0
spec:
  tier: 1 # s0 is at tier 1
  members:
    - type: Node
      selector:
        labelMatch:
          matchLabels:
            kubernetes.io/hostname: kthena-worker
---
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: s2
spec:
  tier: 2
  members:
    - type: HyperNode
      selector:
        exactMatch:
          name: "s0"

This creates a network topology structure of s2 → s0 → kthena-worker.
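
Save the two manifests to a file and apply them, then list the HyperNodes to confirm they were created (the file name hypernodes.yaml is just an example):

# Apply the HyperNode manifests and verify
kubectl apply -f hypernodes.yaml
kubectl get hypernode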

  2. Create a ModelServing instance:
kubectl apply -f examples/model-serving/network-topology.yaml

# View results
kubectl get podgroup

NAME       STATUS    MINMEMBER   RUNNINGS   AGE
sample-0   Running   4                      2s

kubectl get pod -owide

NAME                   READY   STATUS    RESTARTS   AGE     IP            NODE
sample-0-decode-0-0    1/1     Running   0          2d23h   10.244.1.40   kthena-worker
sample-0-decode-0-1    1/1     Running   0          2d23h   10.244.1.41   kthena-worker
sample-0-prefill-0-0   1/1     Running   0          2d23h   10.244.1.42   kthena-worker
sample-0-prefill-0-1   1/1     Running   0          2d23h   10.244.1.43   kthena-worker

As can be seen, Kthena creates a PodGroup, and Volcano then schedules all pods onto the compliant node kthena-worker according to the network topology policy configured in the PodGroup.

If the deployed ModelServing instance has no networkTopology policy, the pods are scheduled onto either kthena-worker or kthena-worker2 at random.

# After deleting the Network Topology configuration
kubectl apply -f examples/model-serving/network-topology.yaml

# View results
kubectl get pod -owide

NAME                   READY   STATUS    RESTARTS   AGE     IP            NODE
sample-0-decode-0-0    1/1     Running   0          2d23h   10.244.1.44   kthena-worker2
sample-0-decode-0-1    1/1     Running   0          2d23h   10.244.1.45   kthena-worker2
sample-0-prefill-0-0   1/1     Running   0          2d23h   10.244.1.46   kthena-worker2
sample-0-prefill-0-1   1/1     Running   0          2d23h   10.244.1.47   kthena-worker

NOTE: When using Network Topology Aware Scheduling with your own configuration, ensure that resources are specified in role.entryTemplate and role.workerTemplate, as in the sketch below.
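
The following illustrates that requirement (a hedged sketch; the role name, container name, image, and resource values are placeholders, and the surrounding nesting is assumed from the snippets above and the example manifest):

roles:
  - name: prefill                    # placeholder role name
    entryTemplate:                   # pod template for the role's entry pod
      spec:
        containers:
          - name: server             # placeholder container name
            image: your-inference-image
            resources:               # resources are required for topology-aware scheduling
              requests:
                cpu: "4"
                memory: 16Gi
    workerTemplate:                  # pod template for the role's worker pods
      spec:
        containers:
          - name: worker             # placeholder container name
            image: your-inference-image
            resources:
              requests:
                cpu: "4"
                memory: 16Gi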

Clean up

kubectl delete modelserving sample

kubectl delete hypernode s0 s2