kubernetes is admitting that a pod is not always the unit of work

For a long time, the Kubernetes scheduler had a pleasantly simple job.

Look at a Pod. Look at the available nodes. Find somewhere the Pod can run.

That model is still useful. It is also starting to show its age.

Kubernetes v1.36 keeps pushing workload-aware scheduling forward with a revised Workload API, a new runtime PodGroup API, and an atomic PodGroup scheduling cycle. The short version is that Kubernetes is getting better at treating a group of related Pods as one scheduling decision.

This sounds like scheduler plumbing because it is scheduler plumbing.

But the reason it matters is much less abstract: sometimes running three quarters of a workload is not graceful degradation. It is just an expensive way to do nothing.

pods were a good unit of abstraction

The Pod has been one of Kubernetes’ best ideas.

It gives us a small unit that can be scheduled, restarted, observed, scaled, and reasoned about. For web applications and a lot of background work, Pod-by-Pod scheduling is exactly right. If ten replicas are desired and only eight fit immediately, those eight can start serving traffic while the cluster figures out the rest.

Distributed training jobs, HPC workloads, and some batch systems are different.

Imagine a training job with 64 workers. Each worker needs an accelerator. The workers also need to communicate with each other. If Kubernetes schedules 47 of them and leaves the others pending, the running Pods are not necessarily useful. They may sit there holding expensive resources while the job waits.

The cluster looks busy.

The cloud bill looks busy.

The actual work does not.

That is the gap workload-aware scheduling is trying to close.
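
To put a rough number on that gap, here is a back-of-the-envelope sketch. It assumes one accelerator per worker, as in the example above. The wait time and hourly price are made-up inputs for illustration, not measurements.

```python
def idle_accelerator_hours(scheduled_workers: int, wait_hours: float) -> float:
    """Accelerator-hours held by running workers while the job cannot progress.

    Assumes one accelerator per worker.
    """
    return scheduled_workers * wait_hours


required_workers = 64
scheduled_workers = 47   # the partial placement from the example above
wait_hours = 2.0         # assumed: how long the missing workers stay pending
hourly_rate = 3.0        # assumed: price per accelerator-hour

idle_hours = idle_accelerator_hours(scheduled_workers, wait_hours)
print(idle_hours)                # 94.0 accelerator-hours with no progress
print(idle_hours * hourly_rate)  # 282.0 in the assumed currency
```

Plug in your own wait times and prices. The shape of the result is the point: the cost grows with every worker that did get scheduled.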

gang scheduling is a very honest name

The basic idea is gang scheduling: related Pods should be scheduled together.

Kubernetes v1.35 introduced the first tranche of workload-aware scheduling improvements, including the foundational Workload API and basic gang scheduling support. In v1.36, the architecture gets clearer.

The Workload API acts as the static template: what is this multi-Pod application supposed to look like?

The new PodGroup API represents runtime state: which Pods belong together for scheduling?

The scheduler then gets a PodGroup scheduling cycle that evaluates the group atomically. Either every Pod in the group is bound, or none of them is.

That last part is the important one.

Previously, it was easy to end up with fragmented scheduling: some workers running, others pending, and scarce hardware stranded in the middle. Atomic scheduling makes the scheduler acknowledge the real application contract.
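
The difference is easy to see in a toy model. The sketch below is not how the Kubernetes scheduler is implemented. It is a minimal simulation, with hypothetical function names, that assumes each Pod needs exactly one free device and that any node with a free device is acceptable.

```python
from typing import Dict, List, Optional


def place_pod_by_pod(free: Dict[str, int], pods: int) -> List[str]:
    """Bind Pods one at a time; stop when capacity runs out, keeping partial results."""
    remaining = dict(free)  # work on a copy so the caller's view is untouched
    bound: List[str] = []
    for _ in range(pods):
        node = next((n for n, count in remaining.items() if count > 0), None)
        if node is None:
            break  # the rest of the group stays pending
        remaining[node] -= 1
        bound.append(node)
    return bound


def place_gang(free: Dict[str, int], pods: int) -> Optional[List[str]]:
    """Try the whole group on a copy; commit only if every Pod fits."""
    trial = place_pod_by_pod(free, pods)
    return trial if len(trial) == pods else None


free_devices = {"node-a": 2, "node-b": 1}
print(place_pod_by_pod(free_devices, 4))  # ['node-a', 'node-a', 'node-b']
print(place_gang(free_devices, 4))        # None
print(place_gang(free_devices, 3))        # ['node-a', 'node-a', 'node-b']
```

In the pod-by-pod case, three devices are taken and the job still cannot start. In the gang case, nothing is taken until the whole group fits, so those devices stay available for work that can actually run.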

The application does not need a Pod.

It needs the workload.

cloud abstractions keep meeting physics

I wrote earlier this week about Kubernetes Dynamic Resource Allocation and the way AI hardware is becoming an API contract. DRA makes specialized devices easier for the scheduler and platform to reason about.

Workload-aware scheduling is the next layer of the same problem.

A distributed job does not only need accelerators. It may need enough accelerators at the same time, in a useful topology, with networking that does not turn synchronization into a tax.

This is where clean cloud abstractions run into physical constraints.

You cannot autoscale your way out of a workload that needs 64 devices together when the cluster has 47 available. You cannot call partial scheduling a success if every allocated device is waiting for the missing workers. You cannot hide topology forever when communication costs dominate the job.

The scheduler needs to understand more of the intent.

That does not mean every user should become a scheduler expert. It means the platform needs a better place to express the dangerous constraints.

the job controller integration is the practical bit

The upstream architecture is interesting, but the Job controller integration is what makes this feel real.

With the WorkloadWithJob feature gate enabled, Kubernetes v1.36 can create and manage the Workload and PodGroup objects for qualifying Jobs. The Job controller sets the scheduling group on the Pods and lets the scheduler treat them as a single gang.

There are deliberately narrow conditions for this first iteration:

  • parallelism must be greater than one
  • the completion mode must be Indexed
  • completions must equal parallelism
  • the Pod template must not already set a scheduling group

That is a reasonable starting point. Static, fully parallel Jobs are the easiest workload shape to reason about. The scheduler knows how many Pods need to arrive together, and each Pod has a stable identity.
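
The four conditions listed above are simple enough to write down as a predicate. This is a sketch over a plain dictionary shaped like a Job spec, not the controller's actual code. The parallelism, completions, and completionMode keys are regular Job spec fields. The schedulingGroup key is a hypothetical stand-in for however the Pod template expresses its scheduling group, so check the v1.36 API reference for the real field name.

```python
from typing import Any, Dict

# Hypothetical key name: a stand-in for the Pod template's scheduling group field.
SCHEDULING_GROUP_KEY = "schedulingGroup"


def qualifies_for_gang(job_spec: Dict[str, Any]) -> bool:
    """Return True if a Job spec meets the four conditions described in the post."""
    parallelism = job_spec.get("parallelism")
    completions = job_spec.get("completions")
    pod_spec = job_spec.get("template", {}).get("spec", {})
    return (
        isinstance(parallelism, int)
        and parallelism > 1                             # more than one Pod
        and job_spec.get("completionMode") == "Indexed" # stable Pod identity
        and completions == parallelism                  # static, fully parallel
        and SCHEDULING_GROUP_KEY not in pod_spec        # no group set already
    )


training_job = {
    "parallelism": 64,
    "completions": 64,
    "completionMode": "Indexed",
    "template": {"spec": {"containers": []}},
}
print(qualifies_for_gang(training_job))  # True
```

A work-queue style Job with completions greater than parallelism fails the check, which matches the intent: the scheduler cannot treat Pods as one gang if they are not all expected to run at the same time.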

Elastic jobs and more complex controller shapes can come later.

This is also all alpha in v1.36. The relevant feature gates are not something to casually flip on in production because a blog post made them sound neat. But alpha features are useful signals. They show which problems are important enough that Kubernetes is making room for them in the control plane.

partial capacity is not always better than zero

Most platform instincts are built around graceful degradation.

If you cannot have ten replicas, run eight. If the preferred zone is full, schedule elsewhere. If the ideal node type is unavailable, use a slightly more expensive one. Keep moving.

Those instincts are good for services. They can be wrong for coordinated workloads.

For a tightly coupled job, partial capacity can be worse than zero:

  • it consumes expensive devices without producing useful progress
  • it blocks other jobs that could use those devices immediately
  • it makes utilization dashboards look healthier than the business outcome really is
  • it creates operational confusion because Pods are technically running

This is the kind of infrastructure failure that hides inside a green dashboard.

The platform reports allocation.

The user experiences waiting.

The finance team experiences a bill.

Gang scheduling makes that mismatch harder to ignore.

this changes capacity planning too

Once the scheduler can reason about whole workloads, capacity planning becomes more honest.

It is not enough to say that the cluster has 300 accelerators free. The useful questions become:

  • Can it place the next 64-worker training job?
  • Which workloads are blocked by fragmented capacity?
  • How much hardware is allocated to incomplete groups?
  • Should a smaller job run first while the large job waits?
  • When is preemption worth the disruption?
  • Which topology constraints are driving pending time?

Those questions are harder than counting free devices.

They are also the questions platform teams actually need answered when expensive compute is involved.
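
Here is a small sketch of why the aggregate number misleads. It assumes, purely for illustration, that each worker needs a fixed number of devices on a single node, and that the free-device counts per node are known. The function names and the numbers are hypothetical.

```python
from typing import Dict


def placeable_workers(free_per_node: Dict[str, int], devices_per_worker: int) -> int:
    """Count workers that fit when a worker's devices must all sit on one node."""
    return sum(free // devices_per_worker for free in free_per_node.values())


def can_place_job(free_per_node: Dict[str, int], workers: int, devices_per_worker: int) -> bool:
    """True only if the whole job fits at once, not just part of it."""
    return placeable_workers(free_per_node, devices_per_worker) >= workers


# 60 nodes with 5 free devices each: 300 free accelerators in aggregate.
free = {f"node-{i}": 5 for i in range(60)}

print(sum(free.values()))          # 300
print(can_place_job(free, 64, 1))  # True: 300 single-device slots
print(can_place_job(free, 64, 4))  # False: only 60 four-device slots
print(placeable_workers(free, 8))  # 0: no node has 8 devices free
```

The same 300 free devices support a 64-worker job, fall four workers short, or place nothing at all, depending on the shape of each worker. A real answer would also need topology and networking constraints, which this sketch ignores.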

This is why I think scheduling work deserves more attention than it usually gets. The scheduler is not merely packing containers onto machines. It is deciding whether scarce capacity turns into useful work or expensive waiting.

what i would do now

If you run batch, AI, or HPC workloads on Kubernetes, I would not start by enabling every alpha feature.

I would start by measuring how often partial scheduling hurts today.

Look for jobs with some Pods running and others pending. Measure how long expensive devices sit allocated before the full job starts making progress. Check whether fragmentation is preventing useful work even when aggregate free capacity looks reasonable. Ask whether your existing batch system or custom scheduler is carrying logic the native platform may eventually absorb.
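
A first pass at that measurement does not need any new API. The sketch below assumes you have already exported Pod records from wherever you keep them. The PodRecord shape and the job names are hypothetical, and the phase values mirror the Running and Pending Pod phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class PodRecord(NamedTuple):
    job: str      # owning job name (hypothetical export format)
    phase: str    # "Running" or "Pending"
    devices: int  # accelerators requested by this Pod


def stranded_devices(pods: Iterable[PodRecord]) -> Dict[str, int]:
    """Devices held by running Pods of jobs that still have pending Pods."""
    running: Dict[str, int] = defaultdict(int)
    has_pending = set()
    for pod in pods:
        if pod.phase == "Running":
            running[pod.job] += pod.devices
        elif pod.phase == "Pending":
            has_pending.add(pod.job)
    return {job: held for job, held in running.items()
            if job in has_pending and held > 0}


snapshot = [
    PodRecord("train-a", "Running", 1),
    PodRecord("train-a", "Running", 1),
    PodRecord("train-a", "Running", 1),
    PodRecord("train-a", "Pending", 1),
    PodRecord("train-b", "Running", 1),
    PodRecord("train-b", "Running", 1),
]
print(stranded_devices(snapshot))  # {'train-a': 3}
```

A mix of running and pending Pods is only a candidate, not proof of a stalled gang, since some jobs make progress with partial capacity. Treat the output as a list of jobs worth a closer look, and sample it over time to see how long the devices stay stranded.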

Then experiment in a test cluster.

The useful question is not “can we adopt the new API immediately?”

It is “which hidden scheduling assumptions are already costing us money?”

That answer will tell you whether workload-aware scheduling is a future curiosity or a roadmap item.

the punchline

Kubernetes spent years teaching us to think in Pods.

That was a good abstraction. It still is.

But abstractions have boundaries, and distributed workloads are making this one visible. A Pod can be schedulable while the application is not. A cluster can look utilized while useful work is stalled. A partial allocation can be more expensive than waiting.

The scheduler is learning to represent that reality.

First it learned to place Pods.

Now it is learning that sometimes the Pod was never the real unit of work.

This post is licensed under CC BY 4.0 by the author.