Inside Kubernetes [Part 3]: Filtering, Scoring & The Scheduler’s Masterplan

Filtering & Scoring Uncovered – Eliminating unfit Nodes and ranking the best ones to ensure Pods land where they run best!

Scheduler Intro Image

A quick look at the Scheduler

As the name suggests, the Kube Scheduler is responsible for placing Pods (running applications) onto the right Nodes in the Data Plane (Worker Nodes).

Its primary job is to detect new Pod creation requests and assign them to the best-suited Node based on resource availability and constraints.

The Scheduler selects a Node through a two-step process: Filtering (eliminating unsuitable Nodes) and Scoring (ranking the remaining Nodes to pick the best one).

How Does Scheduling Work?

Running a Pod: You use kubectl to send a request to create a Pod.

API Server Processing:

  • kubectl sends the request to the Kube API Server.

  • The API Server authenticates and authorizes the request.

  • Admission Controllers validate and modify the request if needed.

Storing the Desired State: The Pod definition is stored in etcd as the desired state.

Scheduling the Pod: The Scheduler selects the best Node based on resource availability, affinity rules, and other constraints.

Assigning the Pod: The API Server updates the Pod’s assigned Node. The kubelet on that Node then takes over, pulling images and starting the container.
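
Putting the flow together, here is a minimal Pod manifest (the name and image are just placeholders for illustration). After kubectl apply, the Scheduler binds the Pod to a Node, and the chosen Node appears in spec.nodeName (or the NODE column of kubectl get pod -o wide).

apiVersion: v1
kind: Pod
metadata:
  name: scheduling-demo    # placeholder name
spec:
  containers:
  - name: web
    image: nginx:1.25      # example image; any image works

kubectl apply -f scheduling-demo.yaml
kubectl get pod scheduling-demo -o wide   # the NODE column shows where the Scheduler placed the Pod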

Scheduler's Filtering of Nodes

The Kubernetes Scheduler filters out unsuitable Nodes before ranking the remaining ones and selecting the best. Filtering ensures that Pods are placed only on Nodes that can meet their resource requirements and constraints.

Steps involved in Filtering

  • The Scheduler retrieves the list of available Nodes.

  • The PodFitsResources filter checks whether a candidate Node has enough available resources to meet a Pod's specific resource requests.

  • After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.
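
If a Pod stays in Pending, the result of filtering can usually be read from the Pod's events (a quick sketch; the Pod name is hypothetical):

kubectl get pod my-app-pod        # STATUS stays Pending when no Node passes filtering
kubectl describe pod my-app-pod   # the Events section typically shows a FailedScheduling entry explaining which filters rejected the Nodes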

Filtering Criteria

1. Resource Requests & Limits

A Pod's requirements are defined using Requests and Limits for CPU, Memory, and Ephemeral Storage.

  • Requests specify the bare minimum resources a Pod needs to run.

  • Limits define the maximum resources a Pod can consume.

For a Node to be considered, it must have enough free resources to meet the Pod's Requests—if not, it's filtered out.

spec:
  containers:
  - name: my-app
    image: my-app:v1
    resources:
      requests: # bare minimum the Pod needs
        cpu: "500m"     # 0.5 vCPU
        memory: "256Mi" # 256 MiB RAM
      limits: # maximum the Pod can consume
        cpu: "1"        # 1 vCPU
        memory: "512Mi" # 512 MiB RAM

2. Taints & Tolerations

Taints restrict Pods from running on certain Nodes unless the Pod has a matching Toleration. Only Pods with the right Toleration can be scheduled on a Tainted Node.

Taints are applied to Nodes as a repulsive mechanism.

Tolerations are applied to Pods, allowing them to override that repulsion.

If a Node has a taint and a Pod lacks the corresponding toleration, the Node is filtered out.

For example, if we want to taint a specific Node with environment=production and give it the NoSchedule effect (so that Pods without a matching Toleration are not scheduled on it), we can define it as follows:

kubectl taint nodes node-1 environment=production:NoSchedule


And if we still want to run a Pod on node-1, the Pod needs a matching Toleration:

spec:
  tolerations:
  - key: "environment"
    operator: "Equal"
    value: "production"
    effect: "NoSchedule"


Now the Pod can be scheduled on node-1, because its Toleration matches the key, value, and effect of the Taint applied to the Node.
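
To check or undo the Taint later, the usual kubectl commands apply (same Node name as above):

kubectl describe node node-1 | grep -i taints                    # list the Taints currently on the Node
kubectl taint nodes node-1 environment=production:NoSchedule-    # the trailing "-" removes the Taint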

3. Node Affinity and Anti-Affinity

Node Affinity lets a Pod specify which Nodes it prefers, or requires, to be scheduled on, based on the Node's labels.

Required Rules: a hard constraint; any Node that doesn't match the criteria is filtered out.

Required Node Affinity (requiredDuringSchedulingIgnoredDuringExecution) uses nodeSelectorTerms with label match expressions.

# Ensuring the Pod Runs on SSD Nodes
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
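
For this rule to match anything, at least one Node must actually carry the disktype=ssd label, which is added with kubectl (the Node name is illustrative):

kubectl label nodes node-1 disktype=ssd   # node-1 now satisfies the required affinity above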

Preferred Rules: a soft constraint; a matching Node gets a higher ranking, but matching is not mandatory.

Preferred Node Affinity (preferredDuringSchedulingIgnoredDuringExecution) tries to schedule the Pod on preferred Nodes but falls back to others if necessary; it uses a weight-based preference.

# Prefer SSD Nodes but Allow Others
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd

IgnoredDuringExecution vs RequiredDuringExecution

  • IgnoredDuringExecution (Default) → Only applies at scheduling time; if a Node later becomes unsuitable, the Pod remains running.

  • RequiredDuringExecution (Not implemented yet) → Would evict Pods if the Node no longer meets the affinity conditions.

Pod Anti-Affinity prevents multiple Pods from running on the same Node or within the same topology domain.

Required Pod Anti-Affinity: (requiredDuringSchedulingIgnoredDuringExecution)

If the condition cannot be met, the Pod will not be scheduled; topologyKey defines the scope of the rule.

# Prevent Multiple Database Pods on the Same Node
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: database
        topologyKey: "kubernetes.io/hostname"

Preferred Pod Anti-Affinity: (preferredDuringSchedulingIgnoredDuringExecution)

It tries to distribute Pods across Nodes but allows a fallback, and it uses a weight-based preference (the higher the weight, the higher the priority).

# Prefer Spreading Web Apps Across Nodes, but Allow Overlap if Necessary
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 5
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: "kubernetes.io/hostname"

Topology Keys in Affinity & Anti-Affinity

Topology keys in K8s define the scope of scheduling constraints based on Node labels.

They help spread Pods efficiently to improve fault tolerance.

Topology Key | Scope | Use Case
kubernetes.io/hostname | Individual Nodes | Spread replicas across different Nodes
topology.kubernetes.io/zone | Availability Zones | Distribute workloads across multiple zones
topology.kubernetes.io/region | Geographic Regions | Run workloads in different regions
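
For example, swapping topologyKey from the hostname key to the zone key turns the earlier anti-affinity rule from "one per Node" into "spread across Availability Zones" (a sketch reusing the app: web label from the previous example):

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 5
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: "topology.kubernetes.io/zone"  # spread across zones instead of individual Nodes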

Scheduler's Scoring & Ranking of Nodes

Once the K8s Scheduler has filtered out the unsuitable Nodes, it must rank the remaining suitable Nodes to determine the best placement for the Pod; this ranking is handled by the scoring plugins.

The Scheduler assigns each Node a score in the range 0–100, evaluated against multiple criteria, and the Node with the highest score is chosen for Pod placement.

Each scoring plugin evaluates Nodes independently, assigning a score between 0 and 100.

Each plugin's contribution to a Node's score is:

Plugin Contribution = Plugin Score × Plugin Weight

Points to Remember about Scoring

Plugin Score depends on the plugin's own criterion, while Plugin Weight is configured per plugin.

Every Node is evaluated by multiple scoring plugins, so the Final Score is the sum of all the weighted plugin scores:

Final Score = ∑ (Plugin Score × Plugin Weight)
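
A small, purely illustrative calculation (the plugin scores and weights below are made-up numbers, not Kubernetes defaults):

Node A: (NodeResourcesFit 80 × weight 1) + (ImageLocality 40 × weight 1) = 120
Node B: (NodeResourcesFit 60 × weight 1) + (ImageLocality 100 × weight 1) = 160

Node B has the higher final score, so the Pod is placed there.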

Major Default Scoring Plugins

NodeResourcesFit:

Prefers Nodes with the most available CPU and Memory (under the default LeastAllocated scoring strategy).

The more free CPU and Memory a Node has, the higher its score.

InterPodAffinity:

Prefers Nodes where related Pods (those matching the Pod's inter-Pod affinity rules) are already running.

The score is higher when such Pods are already present on the Node.

ImageLocality:

Prefers Nodes that already have the container image, reducing pull time.

The score is higher when the required image is already present on the Node.

Balanced Resource Allocation (NodeResourcesBalancedAllocation):

Favors Nodes where CPU and Memory utilization would remain evenly balanced after placing the Pod.

This prevents one resource on a Node from being exhausted while the other remains underutilized.

Does K8s allow custom scoring, and if so, can it run in parallel with the default K8s Scheduler?

Yes, K8s allows custom scoring, either through scheduler configuration or through a custom Scheduler.

Custom schedulers can run in parallel with the default Kubernetes scheduler.

We can either extend the default Scheduler (via KubeSchedulerConfiguration) or create our own custom Scheduler with custom scoring plugins written in Go (the language Kubernetes itself is written in).

Example: Extending the Default Scheduler to Prioritize CPU Availability in Scoring

apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated
            resources:
              - name: cpu
                weight: 2 # CPU availability gets twice the importance of memory
              - name: memory
                weight: 1
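
To run a fully separate scheduler in parallel instead, a Pod opts into it by name via spec.schedulerName; Pods that omit this field keep using the default scheduler. A minimal sketch (the scheduler name below is hypothetical and must match whatever name the custom scheduler registers with):

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-app         # placeholder name
spec:
  schedulerName: my-custom-scheduler # hypothetical custom scheduler name
  containers:
  - name: my-app
    image: my-app:v1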

Conclusion:

The Kubernetes Scheduler is responsible for placing Pods on the best-suited Nodes based on resource availability, constraints, and policies. It follows a two-step process:

  • Filtering – eliminates unsuitable Nodes.
  • Scoring – ranks the remaining Nodes and selects the best one.

Beyond that, the Scheduler supports Node Affinity, Taints/Tolerations, and Topology Constraints; allows custom scoring and custom schedulers running in parallel with the default; and ensures efficient resource allocation and workload distribution.


EzyInfra.dev is a DevOps and infrastructure consulting company helping clients set up cloud infrastructure (AWS, GCP), optimize cloud costs, and manage Kubernetes-based infrastructure. If you have any requirements or want a free consultation for your infrastructure or architecture, feel free to schedule a call here.


K8s Got You Stuck? We’ve got you covered!

We design, deploy, and optimize K8s so you don’t have to. Let’s talk!