During a recent customer engagement, a discussion about Kubernetes NodeSelectors came up. There was some confusion about whether and how to use them for a multi-tenant cluster deployment. In the end, we decided to leverage the Kubernetes PodNodeSelector admission controller. Since not all implementation details were clear to me, I did some experiments and want to share my findings with you.
Using NodeSelectors is a common practice in Kubernetes to influence scheduling decisions, which determine on which node (or group of nodes) a pod should run. NodeSelectors are key-value pairs that are matched against node labels. Common use cases include:
- Dedicate nodes to certain teams or customers (multi-tenancy)
- Distinguish between different types of nodes (“expensive” nodes with specialized hardware, e.g. GPUs and FPGAs, or resources, ephemeral “spot” instances)
- Define topologies for rack/zone/region awareness and high-availability
You can read more about NodeSelectors and other options in Assigning Pods to Nodes.
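To make the matching rule concrete, here is a minimal Python sketch of how a NodeSelector narrows down candidate nodes: a node qualifies only if it carries every key-value pair of the selector as a label. This is an illustration of the rule, not scheduler code; the function names, node names and labels are hypothetical.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(node_labels: Labels, node_selector: Labels) -> bool:
    """Return True if the node carries every key-value pair of the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def candidate_nodes(nodes: Dict[str, Labels], node_selector: Labels) -> List[str]:
    """Names of all nodes whose labels satisfy the selector, sorted for stable output."""
    return sorted(name for name, labels in nodes.items() if matches(labels, node_selector))


# Hypothetical nodes and labels, for illustration only.
nodes = {
    "vm-prod-1": {"env": "production", "gpu": "true"},
    "vm-dev-1": {"env": "development"},
}

print(candidate_nodes(nodes, {"env": "production"}))  # ['vm-prod-1']
print(candidate_nodes(nodes, {}))  # an empty selector matches every node
```

Note that a pod without any NodeSelector is a candidate for every node, which is exactly the gap the admission controller discussed in this post is meant to close.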
Given these use cases are often requirements in enterprises, this topic comes up in almost every architecture workshop or design review I conduct with customers deploying Kubernetes on VMware vSphere. VMware vSphere VMs can be of any size and may have different quality of service (QoS) policies, for example resource guarantees or limits (cpu, memory, disk, network) applied.
Just like with Kubernetes resource management, this is a service level agreement (SLA), a contract between the infrastructure provider and the consumer. Thus, NodeSelectors play a critical role in ensuring that, for example, production workloads (Kubernetes deployments, or pods to be precise) run on VMs with QoS == “production”. Otherwise we would risk breaking the contract. As a best practice, NodeSelectors are typically set during CI/CD pipeline stages without human intervention.
You might be wondering why I am not discussing related concepts, like node (anti-) affinity or taints and tolerations. Please see the section “Roadmap and alternative Approaches” at the end of this post.
Quoting from the Kubernetes documentation:
An admission controller is a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object, but after the request is authenticated and authorized. The controllers […] are compiled into the kube-apiserver binary, and may only be configured by the cluster administrator. […] If any of the controllers in either phase reject the request, the entire request is rejected immediately and an error is returned to the end-user.
There are many admission controllers in the Kubernetes core, for example PodPreset. A complete list is provided here.
The Use Case for PodNodeSelector
The PodNodeSelector admission controller has been in the source tree for a while, based on an upstream contribution from Red Hat (OpenShift’s project node selectors). Quoting again from the Kubernetes documentation:
This admission controller defaults and limits what node selectors may be used within a namespace by reading a namespace annotation and a global configuration.
OK, so why would I use that one instead of simply specifying NodeSelectors in the pod manifest? Well, using this admission controller has some advantages:
- As the name implies, it enforces node selectors at admission time before creating the pod object
- This puts less pressure on the master components, for example the scheduler; if there is no matching namespace or a conflict, the pod object will not be created
- You can define a default selector for namespaces that have no label selector specified
- Ultimately, it gives the cluster administrator more control over the node selection process (hard enforcement)
The last one is very important if you have large distributed teams or different departments within the organization. I have seen customer environments where there was no communication between infrastructure operations, the Kubernetes platform engineers and the development teams. The lack of alignment led to performance issues and, sometimes even worse, production outages. Simply put, every team had a different understanding and definition of “production”.
Translated into an enterprise deployment, the following (highly abstracted) picture tries to make this clear. Here, a kubectl user is only allowed to deploy nginx pods into the namespace “development”, enforced by Kubernetes role-based access controls (RBAC). However, if she doesn’t specify a NodeSelector, this pod could end up being scheduled on the production VMs. The PodNodeSelector automatically injects a NodeSelector, based on a default policy.
On the other side, the CI/CD pipeline has a workflow defined to deploy mission-critical nginx proxies into the namespace “production”. Those should always land on a production VM, because these have a higher QoS policy applied by the VMware vSphere cluster administrator (remember the “contract” we spoke about earlier?).
I am working on a follow-up article describing distributed resource management and QoS in the enterprise. A fascinating topic if you ask me :)
But using admission controllers, in their current implementation, can also have downsides, most notably:
- They need to be compiled into the Kubernetes API server
- A change in configuration requires a restart of the API server
- If you use a managed Kubernetes environment, like Google Kubernetes Engine (GKE), configuration options might be limited
This is why Dynamic Admission Control has been added to Kubernetes recently.
Back to our admission controller. Now that we understand the pros and cons, let’s deploy it and see it in action.
Setting Things up
My demo environment is based on a three-node kubeadm deployment, running in VMware Fusion:
- Kubernetes version: 1.9.2
- 1x “master” VM: Just system pods allowed, i.e. core Kubernetes services
- 1x “production” VM: Only run pods from namespace “production”
- 1x “development” VM: Run any pod from any namespaces, except for “production”
- PodNodeSelector admission controller enabled in the API server
Terms used in the Examples
To better understand the examples, let me first explain some of the terminology used:
- clusterDefaultNodeSelector (global) - Unless a whitelist has been specified, add this NodeSelector to all pods (applies to all namespaces without a matching annotation)
- Whitelist (per namespace) - Only allow this NodeSelector (requires a matching namespace annotation)
- annotation - Key/value metadata attached to namespaces; the differences between annotations and labels are described here
Preparing the API Server
Assuming your cluster has been deployed with kubeadm, the kube-apiserver.yaml file is the configuration manifest for the Kubernetes API server. It is located in /etc/kubernetes/manifests. Add the PodNodeSelector admission controller to the list of admission controllers enabled in this manifest.
Strictly speaking, the plugin order matters. We will ignore this in our demo example. You can read more about it in Using Admission Controllers.
Also adjust the volumes sections of the manifest. They relate to the default selectors and whitelist, which we will create in a second.
Create the file
This file defines the default NodeSelector for the cluster, as well as a whitelist for each namespace. Note the following:
- Every pod created in the “production” namespace will be injected with (i.e. get assigned) the NodeSelector defined for that namespace
- Every pod in the “development” namespace will inherit the clusterDefaultNodeSelector, in this case “env=development”
- Effectively, you do not have to mention namespaces with an empty whitelist in this file
- The “noannotation” namespace is there to demonstrate what happens when there is no matching annotation in the namespace properties but a whitelist has been specified
For whitelists to work, every namespace has to have a matching annotation (this is not a label!). “noannotation” shows what happens when there is no (matching) annotation.
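The interplay between cluster default, namespace annotation and whitelist can be sketched in a few lines of Python. This is a simplified model that follows the steps listed in the Kubernetes documentation for this plugin as I understand them; it is not the plugin's source code. The function and class names are my own, and the whitelist values for “production” and “noannotation” are hypothetical. Only the “env=development” default is taken from this post.

```python
from typing import Dict

Labels = Dict[str, str]

# Annotation key read by the PodNodeSelector plugin, per the Kubernetes documentation.
ANNOTATION_KEY = "scheduler.alpha.kubernetes.io/node-selector"


class Rejected(Exception):
    """Raised when this sketch would refuse to admit a pod."""


def parse_selector(text: str) -> Labels:
    """Turn 'key=value,key2=value2' into a dict; an empty string yields {}."""
    pairs = (item.split("=", 1) for item in text.split(",") if item.strip())
    return {key.strip(): value.strip() for key, value in pairs}


def conflicts(a: Labels, b: Labels) -> bool:
    """Two selectors conflict if they share a key but disagree on its value."""
    return any(key in b and b[key] != value for key, value in a.items())


def admit(pod_selector: Labels, namespace: str,
          annotations: Dict[str, str], config: Dict[str, str]) -> Labels:
    """Return the node selector the pod ends up with, or raise Rejected."""
    # Steps 1 and 2: the namespace annotation wins, otherwise use the cluster default.
    if ANNOTATION_KEY in annotations:
        namespace_selector = parse_selector(annotations[ANNOTATION_KEY])
    else:
        namespace_selector = parse_selector(config.get("clusterDefaultNodeSelector", ""))

    # Step 3: the pod must not contradict the namespace selector.
    if conflicts(namespace_selector, pod_selector):
        raise Rejected("pod node selector conflicts with the namespace node selector")
    merged = {**namespace_selector, **pod_selector}

    # Step 4: if the namespace has a whitelist, every resulting label must be on it.
    whitelist = parse_selector(config.get(namespace, ""))
    if whitelist and any(whitelist.get(k) != v for k, v in merged.items()):
        raise Rejected("pod node selector is not covered by the namespace whitelist")
    return merged


config = {
    "clusterDefaultNodeSelector": "env=development",  # value used in this post
    "production": "env=production",                   # hypothetical whitelist
    "development": "",                                # empty whitelist
    "noannotation": "env=production",                 # hypothetical whitelist
}
prod_ns = {ANNOTATION_KEY: "env=production"}

print(admit({}, "production", prod_ns, config))  # {'env': 'production'}
print(admit({}, "development", {}, config))      # {'env': 'development'}

for pod_selector, ns, ann in [({"env": "development"}, "production", prod_ns),
                              ({}, "noannotation", {})]:
    try:
        admit(pod_selector, ns, ann, config)
    except Rejected as err:
        print(ns, "rejected:", err)
```

In this model the “noannotation” case fails because the pod inherits the cluster default, which is then not covered by the namespace whitelist. A pod in “production” that asks for a different environment fails one step earlier, at the conflict check.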
Prepare the Nodes (kubelets)
Apply labels to the nodes, matching the whitelist
Now, the cluster looks like this:
Prepare the Namespaces
This is what the “production” namespace looks like now:
If you also wonder about the alpha in the annotations spec, there is an issue discussing it.
Ready to go!
Deploy some Pods
What Kubernetes created for us:
First, let’s examine each pod in “production” and “development” (the pod in “default” is equal to the “development” case):
But where’s the nginx pod from the “noannotation” deployment? Let’s ask Kubernetes:
As you can see, no pod is being created due to the admission controller. We intentionally triggered this since we did not specify a correct namespace annotation matching the whitelist (as we did with the “production” namespace).
Deploy a Pod with a conflicting NodeSelector
Now let’s try to create a pod in the “production” namespace, but specify a conflicting NodeSelector in the pod manifest. This could happen for various reasons: a typo, a greedy user, not cleaning up pod manifests after changing labels, etc.
This is nice. The admission controller checks for conflicts. And since the “nginx2” pod is supposed to be deployed in the “production” namespace, we only allow label selectors listed in the whitelist section of the admission controller configuration file.
To keep things simple and understandable, I usually follow these best practices when using this admission controller:
- Come up with a good node labelling scheme (sounds easy, but can be challenging at times)
  - The --node-labels argument can be very useful (warning: this is an alpha feature)
- Use a clusterDefaultNodeSelector defaulting to nodes != “production”, which has the following effect:
  - Namespaces with no annotations and no whitelists specified will inherit this setting
  - NodeSelectors are allowed in the pod manifest
- For mission-critical namespaces, specify a whitelist (with a matching annotation)
  - Pods deployed therein will have NodeSelectors strictly checked and enforced, that is no pod manifest conflicts/drift allowed
- Use role-based access controls to secure your namespaces
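Some of these practices can be checked automatically, for example in a CI/CD stage before the configuration is rolled out. The following Python sketch is one possible approach, not an existing tool: `lint_config`, its parameters and the example values are hypothetical, and selectors are compared as plain strings to keep things short.

```python
from typing import Dict, List


def lint_config(config: Dict[str, str], production_selector: str,
                critical_namespaces: List[str]) -> List[str]:
    """Return human-readable findings; an empty list means all checks passed."""
    findings = []

    # The cluster default should exist and must not point at production nodes.
    default = config.get("clusterDefaultNodeSelector", "")
    if not default:
        findings.append("no clusterDefaultNodeSelector set")
    elif default == production_selector:
        findings.append("clusterDefaultNodeSelector points at production nodes")

    # Mission-critical namespaces should have a non-empty whitelist.
    for namespace in critical_namespaces:
        if not config.get(namespace):
            findings.append(f"namespace {namespace!r} has no whitelist")
    return findings


# Hypothetical configuration values, for illustration only.
good = {"clusterDefaultNodeSelector": "env=development", "production": "env=production"}
bad = {"clusterDefaultNodeSelector": "env=production"}

print(lint_config(good, "env=production", ["production"]))  # []
for finding in lint_config(bad, "env=production", ["production"]):
    print("finding:", finding)
```

A real check would also have to confirm that each whitelisted namespace carries a matching annotation, which requires querying the cluster and is left out here.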
Roadmap and alternative Approaches
You might be wondering why I am not touching on node (anti-) affinity, external admission webhooks or taints and tolerations. The latter would be appropriate as well, but one would have to write a custom admission controller.
Node (anti-) affinity (currently in beta) is supposed to become the successor of NodeSelectors, with richer query logic. When it graduates from beta, NodeSelectors will eventually be deprecated.
As for external admission webhooks, which are part of dynamic admission control, it’s still early days and for some simple use cases they might be overkill. But this is just my opinion at the time of writing this article. Kubernetes is moving fast and the amazing community is quickly resolving gaps and issues.
So why would you use the PodNodeSelector admission controller today? Here’s my point of view:
- So far, NodeSelectors and as such the corresponding admission controller have not been marked deprecated
- The implementation and configuration are straightforward, robust and easy to understand
- At the time of writing, there is no admission controller for node (anti-) affinity, so you lose the advantages of an admission controller described in this article
- I am not the only one interested in fixing this: https://github.com/kubernetes/kubernetes/issues/58198
- Of course, you could write your own admission controller/webhook
Thanks for taking the time to read this post. I hope you found it interesting and it at least answered some questions. As always, feel free to share it on social media and reach out to me on Twitter for any feedback, comments or corrections.