Kubernetes: Setting Resource Constraints on Pods

I'm going to write about my experiments with tweaking resource constraints for an application deployed in Kubernetes. This post will help you find the right values for requests and limits for your pods in a large system. It assumes that you know what Kubernetes is, and how pods and deployments work.

Why resource constraints?

When you have multiple services deployed in a k8s cluster (microservices or not), you will need to set the CPU and Memory resource values. You may have started off without specifying these, or given some random values like 500m and 500MiB, as these are optional fields. But eventually, before going to production, you need to find the right values. Let's see why, and how.

What are resource constraints?

Requests: This is the minimum amount of resources the pod is guaranteed to get. It is reserved on the node when the pod is scheduled.

Limits: These are the maximum resource values the pod is permitted to use. If the pod's CPU usage exceeds its limit, it is throttled, whereas exceeding the memory limit gets the container terminated (OOMKilled).
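As a rough illustration of where these two values live, here is a minimal sketch of a shell function that prints a Pod manifest with both fields set. The pod name `demo-pod`, the container name `app`, the image `nginx:1.25` and the specific values are placeholders I made up, not taken from a real deployment. Note that manifests spell mebibytes as `Mi`.

```shell
# Print a single-container Pod manifest with explicit requests and limits.
# Arguments: name image cpu_request mem_request cpu_limit mem_limit
render_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  containers:
    - name: app
      image: $2
      resources:
        requests:
          cpu: $3
          memory: $4
        limits:
          cpu: $5
          memory: $6
EOF
}

# Hypothetical values: request half a core and 500Mi, allow bursting to 1 core and 1Gi.
render_pod demo-pod nginx:1.25 500m 500Mi 1000m 1Gi
```

The output can be piped into `kubectl apply -f -`, or into `kubectl apply --dry-run=client -f -` to validate it without creating anything.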

Let's take a simple example. Say you have one node with 8 cores and 10 pods running, each requesting 500m (millicores). All the pods will get scheduled (requests = 0.5 × 10 = 5 cores; you have 8 cores available in total). However, if the limits are much higher, say 1000m, then the total limits become 10 cores (1 × 10), but only 8 cores are available. In this case, it is clear that not all pods can use the maximum of 1000m at the same time. This is called overcommitting. If you describe a node, you will see this:

> kubectl describe nodes
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests           Limits
  --------                       --------           ------
  cpu                            7499m (95%)        27800m (355%)
  memory                         12101121920 (91%)  39547844864 (298%)
  ephemeral-storage              0 (0%)             0 (0%)
  hugepages-1Gi                  0 (0%)             0 (0%)
  hugepages-2Mi                  0 (0%)             0 (0%)
  attachable-volumes-azure-disk  0                  0

The problem with overcommitting is that, under load, your pod may crash or stop responding because it does not have enough resources.
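The arithmetic from the 10-pod example can be sketched in a few lines of shell. The function name `pct_of_capacity` is made up for illustration, and all values are in millicores.

```shell
# Percentage of node capacity taken by N identical pods.
# Uses integer arithmetic, so the result is rounded down.
# Arguments: pod_count per_pod_millicores node_capacity_millicores
pct_of_capacity() {
  echo $(( $1 * $2 * 100 / $3 ))
}

node_capacity=8000   # 8 cores
pods=10

echo "requests: $(pct_of_capacity "$pods" 500 "$node_capacity")% of the node"
echo "limits:   $(pct_of_capacity "$pods" 1000 "$node_capacity")% of the node"
```

With the numbers from the example this reports requests at 62% (rounded down from 62.5%) and limits at 125%. Anything above 100% in the limits column means the node is overcommitted.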


The concept of Quality of Service (QoS) also comes from this. If you set requests equal to limits (for both CPU and memory), the pod gets the Guaranteed QoS class, because the above scenario cannot happen: once the pod is scheduled, it is guaranteed to get those resources, and it cannot use anything beyond them.

If requests are lower than limits (or no limits are set), the pod gets the Burstable QoS class: when there is a need, the pod can use more than it requested. By setting lower requests, you can fit more pods into the node, thus reducing the number of nodes needed, and hence cost. This is generally helpful if you know that not all pods will hit their peaks at the same time. It might be OK as long as you are only overcommitted to a small extent, say 110%-120%, but not the crazy values I have above like 355%.

There is a third type of QoS which is called BestEffort, that is when you have neither requests nor limits set. Don't use this except during experimentation.

If you describe a pod with kubectl describe po <podname>, you will see the QoS class in the output.
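The three classes can be summarised as a small decision function. This is a simplified sketch for a single-container pod: it compares values as plain strings (so `500m` and `0.5` would not be treated as equal), and the function name `qos_class` is made up for illustration. Kubernetes itself evaluates every container in the pod.

```shell
# Classify a single-container pod following the QoS rules described above.
# Arguments: cpu_request cpu_limit mem_request mem_limit ("" means not set)
qos_class() {
  cpu_req=$1; cpu_lim=$2; mem_req=$3; mem_lim=$4

  # When only a limit is given, Kubernetes uses it as the request too.
  if [ -z "$cpu_req" ]; then cpu_req=$cpu_lim; fi
  if [ -z "$mem_req" ]; then mem_req=$mem_lim; fi

  if [ -z "$cpu_req$mem_req" ]; then
    # No requests and no limits at all.
    echo BestEffort
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
       [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    # Both resources are limited and requests equal limits.
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 500m 500m 256Mi 256Mi   # Guaranteed
qos_class 250m 500m 256Mi 512Mi   # Burstable
qos_class "" "" "" ""             # BestEffort
```

On a live cluster, `kubectl get pod <podname> -o jsonpath='{.status.qosClass}'` prints the class that Kubernetes actually assigned.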

Note that there are lots of system pods which also need to be factored into your calculation. The simplistic calculation above with 10 user pods is not sufficient.

You can read more about requests and limits in this article.

Now that we have seen what resource constraints are and why we need to specify them, let's take a look at how we can find the ideal values for our services.

Finding the 'right' resource constraints

The manual way

When I had fewer application services, I used to find the resource constraints of each service manually by performing a load test on each of them. The process goes something like this:

  1. Stress-test one pod to find the point of breaking: Do not set any requests or limits on the pods. Load test one particular pod/service (1 replica), and increase requests/sec gradually. Find the point at which the response time starts growing exponentially, or the point at which it exceeds your SLA. You will typically see that after a particular load, the response time jumps (or the pod starts crashing).

Note: If you find that a single replica is not processing as many requests/sec as you would like it to, you need to tune that service first. Bottlenecks in individual services should be resolved before this test is performed, to get the right numbers.

  2. Find the resource values: Now that we know the breaking point (something like: up to 250 requests/sec, the response time of each request stays under 2 seconds), we can proceed to find the CPU and memory at which this condition can be met.

    1. Start with smaller values like 0.5 CPU, 250MiB etc. for limits (requests can be the same) and find the point beyond which increasing resources has no effect on overall throughput. This is done just to optimize the CPU and memory for different services (we could start with higher numbers and it would work too).
    2. Run a similar load test now, but at constant load, say 250 requests/sec (or a little less, say 200), continuously for some time, say a minute or so. Keep the CPU and memory at low values as per your service needs. You should see the response time increasing in the same way after a point, but now the reason would be not enough CPU.
    3. Repeat the above but with a higher CPU value, such that you get the original throughput, say 200 or 250 requests/sec.
  3. Add HPA: Now you have found out the capacity / limit of a single pod. Adding a Horizontal Pod Autoscaler (HPA) is easy. If you are expecting a maximum of 1000 requests/sec, set minimum pods to 1 or 2 and maximum to 5 in HPA, and you will be able to handle load as efficiently as possible. You can test again with HPA enabled and handle 1000 requests/sec easily.

  4. Verify: Finally, do a load test which includes a ramp up, plateau, and a ramp down, to see if HPA works properly.
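The replica arithmetic behind the HPA step is a ceiling division of the expected peak load by the measured capacity of one pod. The sketch below uses the example figures from the steps above (250 requests/sec per pod, 1000 requests/sec peak). The function name `replicas_needed` is made up for illustration.

```shell
# Smallest number of pods that covers the peak load (ceiling division).
# Arguments: peak_requests_per_sec per_pod_requests_per_sec
replicas_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

replicas_needed 1000 250   # 4 pods at the measured breaking point
replicas_needed 1000 200   # 5 pods if each pod is kept at a safer 200 requests/sec
```

Sizing each pod a little below its breaking point (200 rather than 250 requests/sec) gives the maximum of 5 used in the HPA step.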

The above manual way is adapted from this blog post.

Make sure one service handles only a similar kind of load. It does not really have to be a microservice, but it should not be a monolith either. You should be able to do simple math like: if one pod can handle 100 operations/sec, 10 pods can handle 1000 operations/sec. Having operations with vastly different load behaviour in the same service makes it tricky.

Vertical Pod Autoscaler (VPA)

While the above manual way works for a small number of services, it is cumbersome as the system grows. Also, it might not be possible to load test a service in isolation. We need a more automated way to do these tests, and to do them regularly (as you add new features, the behaviour of your services will change, so these tests might need to be repeated every few months or so).

I recently came across the AutoScaler project in Kubernetes. This is a set of Kubernetes objects that can find the ideal CPU and memory values for each pod, based on history. It is called Vertical Pod AutoScaler (VPA). It has a Recommender mode, which recommends pod resource values, and that is what we want here. We will not use VPA directly; instead we will use the beautiful Goldilocks utility.

High level logic:

  • Run load tests and tweak the number of replicas manually such that there are no failures. Now you have the minimum number of replicas needed to handle the load.
  • Look at the request values from VPA and tweak the number of replicas to optimize the size of each pod.
  • Update the pod values for the new number of replicas.


  1. Install VPA and Goldilocks

    helm repo add fairwinds-stable https://charts.fairwinds.com/stable
    helm repo update
    helm install vpa fairwinds-stable/vpa --namespace vpa --create-namespace --set recommender.extraArgs.prometheus-address=http://kube-prometheus-prometheus.monitoring.svc.cluster.local:9090 --set recommender.extraArgs.storage=prometheus
    helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace

(Pick the address of prometheus based on your installation, generally it will be http://<prometheus-service-name>.<prometheus-namespace>.svc.cluster.local:<port>)

Note that without Prometheus, the VPA will not work properly - it will check the pods only once on startup - More details here

  2. Mark the namespace you want VPA to monitor

    kubectl label ns loadtest-env-01 goldilocks.fairwinds.com/enabled=true

  3. Visualize the recommendations after the load test here:

    kubectl -n goldilocks port-forward svc/goldilocks-dashboard 8080:80

Running the load tests:

  • Set some request values (say 50m and 100MiB) and no limits for all pods. The request that VPA recommends is the most important value; you can tweak the limit as needed for your QoS. Also, by setting no limit, we are testing the limits of the pod.
  • Disable HPA: VPA and HPA should not be used together on the same CPU or memory metrics.
  • Run your load test by manually keeping the number of replicas for all services to a number where there are no failures. For example, for one service, I had to keep 10 replicas for the load test to succeed, anything fewer would cause application failures as some pod(s) will not be able to handle the load.
  • Once you have a working load test, note the replica counts and the VPA recommended request values. If the requests are small enough you may stop here. But if you get large request values, then I suggest increasing the replicas a little more to see if the request values come down. The reason is that I prefer horizontal scaling over vertical, even for pods (see the section on horizontal scaling below).
    • In my example above, say that for 10 replicas I got a CPU requirement of 0.5 cores. Assume this is for a load test that generates 1000 req/sec. I would rerun the test with 15 replicas, so that the average load on a pod reduces from 100 req/sec to 67 req/sec. If this reduces my CPU requirement from 0.5 cores, I'll go with 15 replicas and not 10. Or repeat the test until you get the pod sizes that you want.
  • Tweaking number of replicas for this test is based on knowledge of your application and some assumptions about the most used services. The important point here is to be able to finish your load test without any failures by making sure there are enough pods available.
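The effect of adding replicas on the average load per pod, using the numbers from the example above, can be sketched as follows. The function name `load_per_pod` is made up for illustration, and the even split across pods is an assumption (real load balancing is rarely perfectly even).

```shell
# Average load per pod, rounded to the nearest whole request/sec.
# Arguments: total_requests_per_sec replicas
load_per_pod() {
  echo $(( ($1 + $2 / 2) / $2 ))
}

total=1000
for replicas in 10 15; do
  echo "$replicas replicas -> $(load_per_pod "$total" "$replicas") req/sec per pod"
done
```

To change the replica count between runs, `kubectl scale deployment <name> --replicas=15` is enough as long as HPA is disabled.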

Some points to note:

  • When running multiple iterations of tests, I used to redeploy the app into a different namespace each time to make sure the metrics were not skewed by previous runs. I'm not sure if there is a better way.
  • I always keep minimum number of replicas to at least 2-3, even on a light service.
  • Once you have found good values for requests and limits, turn off VPA, re-run your load test with HPA enabled, and ensure that there are no failures and your SLAs are met.
  • Ensure your load test has a ramp-up, plateau, and a ramp-down. Make sure your HPA works.
  • Finally, install this recommender on your final environments and keep monitoring for any drastic changes.

Prefer horizontal scaling

I prefer horizontal scaling of pods over vertical scaling for multiple reasons:

  • Vertical scaling restarts the pod
  • Better chance of getting scheduled - Smaller pods fit into the available space in VMs more easily.
  • Fewer pods need to run when usage is low

If the application works the same way with 'one pod with 500m CPU' or 'two pods with 250m CPU', it is better to choose the latter. The pods can now be distributed better across nodes, and they can be scaled down and their resources released when not in use. Obviously, you cannot take this idea to the extreme and give very few resources to the pod. My rule of thumb is:

  • I generally start with 50m/100MiB for requests (for Node/.NET 5 services based on Alpine). Even if VPA recommends lower values, I don't go below this.
  • If VPA recommends small request values, say 75m, or 175MiB, I generally keep limits to be 2-3x of the recommended request.
  • If the service is IO heavy, load tests don't affect it much in terms of CPU/memory.
  • If the service has any CPU work, I tend to lean towards more 'Guaranteed' QoS.
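The first two rules of thumb can be sketched as a small helper. This is only an illustration of the arithmetic: the function name `size_from_recommendation` is made up, the inputs are plain integers (millicores or Mi), and the multiplier is whatever you pick in the 2-3x range.

```shell
# Apply the rule of thumb: never go below a floor, and set the limit to a
# multiple of the request.
# Arguments: recommended floor multiplier (plain integers, e.g. millicores or Mi)
# Prints: "<request> <limit>"
size_from_recommendation() {
  request=$1
  if [ "$request" -lt "$2" ]; then request=$2; fi
  echo "$request $(( request * $3 ))"
}

size_from_recommendation 75 50 2     # CPU in millicores: prints "75 150"
size_from_recommendation 30 50 2     # below the floor:   prints "50 100"
size_from_recommendation 175 100 3   # memory in Mi:      prints "175 525"
```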

Relation between CPU and memory

One very important aspect is to find the relation between CPU and Memory in these load tests. As long as they are linear (mostly the case), it is easy to pin the constraint values by measuring one of them (typically CPU) and changing the other appropriately. If for some reason they are skewed, then you need to profile them separately, and this would become complex.

Special cases

In the application I'm working on, there are special "engine" services, which are pure C# algorithms doing complex work. Each request could run for minutes and is heavily multi-threaded. Such special cases might need different strategies.

Hope this long post is useful for you in some way, do let me know your thoughts in the comments section, and all feedback is welcome!

Cover photo by Alec Favale on Unsplash