Enhance Kubernetes Cluster Performance and Optimise Cost with Karpenter – Part 2 

TL;DR

This blog explains how migrating from Cluster Autoscaler to Karpenter on Amazon EKS improves scaling speed, resource utilisation, and cost efficiency. It also introduces EKS Auto Mode, which builds on Karpenter to reduce operational overhead through automated scaling, patching, and upgrades—making EKS clusters more efficient and easier to manage.

Introduction 

In the previous blog, “Enhance Kubernetes cluster performance and optimise costs with Karpenter”, we covered what Karpenter is, its components and advantages, how it works, and how to deploy it in an AWS EKS cluster. 

In this blog, let’s dive deeper into the following items: 

  1. How to migrate from Cluster Autoscaler to Karpenter 
  2. What is EKS Auto Mode (Advanced Karpenter Mode)? 
  3. How to migrate from Karpenter to EKS Auto Mode and reap the full benefits of cloud-managed services 

To explain these points, let’s use a hypothetical example. Imagine two groups, Dogs and Cats, at war with each other. Each group has a leader who must securely and efficiently distribute war plans to their generals stationed across the globe. They decide to use a containerised application deployed in EKS for communication. 

They share the same Amazon EKS cluster named “PetsCluster” for cost efficiency. Within this cluster, each group has its own Kubernetes namespace to host its applications and keep its data isolated: 

  • fordogs – to deploy apps for dogs 
  • forcats – to deploy apps for cats 

Even though both groups are focused on defeating each other, they still value AWS’s principle of frugality. Instead of duplicating infrastructure, they share a common namespace named platform, which hosts logging, monitoring, and other shared services. This keeps the two groups’ workloads logically separated (forcats and fordogs) while the shared tooling lives in one place. The namespaces might look like the sketch below. 
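A minimal sketch of those namespaces as plain Kubernetes manifests (the team labels are an assumption added purely to make ownership visible):

---
apiVersion: v1
kind: Namespace
metadata:
  name: fordogs
  labels:
    team: dogs
---
apiVersion: v1
kind: Namespace
metadata:
  name: forcats
  labels:
    team: cats
---
apiVersion: v1
kind: Namespace
metadata:
  name: platform
  labels:
    team: shared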

Full codebase: https://github.com/cevoaustralia/aws-eks-karpenter-and-automode  

Cluster Autoscaler and its limitations 

Currently, “PetsCluster” uses Cluster Autoscaler, which relies on AWS Auto Scaling Groups (ASGs). While it handles basic scaling, it has some key limitations: 

  • One size must fit all – A Node Group can only use one instance type and size (e.g., t3.large). Workloads that need other instance types cannot be accommodated without creating or modifying a separate Node Group. 
  • Overprovisioning – Cluster Autoscaler scales at the node group level, so GPU workloads force scaling of GPU node groups, and non-GPU pods can end up on GPU instances. Even small pods trigger provisioning of the full, often large, node group instance type, resulting in overprovisioning and higher infrastructure costs. 
  • Underutilisation – If a tiny app runs on a t3.8xlarge, the node stays active until the app stops, leading to wasted capacity. There is no optimisation or consolidation. 
  • Operational overhead – New instance types require manual configuration changes, preventing users from quickly leveraging AWS innovations. 
  • Delayed provisioning – Cluster Autoscaler provisions via ASGs, which can be slow, leading to a poor user experience under volatile load. 

EKS Cluster Config: (PetsCluster) 

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name                             = var.cluster_name # PetsCluster
  cluster_version                          = "1.32"
  cluster_endpoint_public_access           = true
  enable_cluster_creator_admin_permissions = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  enable_irsa = true
}

EKS Managed Node Group Configs: (platform, forcats and fordogs) 

eks_managed_node_groups = {
  platform = {
    min_size       = 2
    max_size       = 2
    desired_size   = 2
    instance_types = ["t3.medium"]

    labels = {
      "nodegroup/type" = "platform"
    }

    taints = {
      dedicated = {
        key    = "dedicated"
        value  = "platform"
        effect = "NO_SCHEDULE"
      }
    }
  }

  forcats = {
    min_size       = 1
    max_size       = 3
    desired_size   = 1
    instance_types = ["t3.small"]

    labels = {
      "nodegroup/type" = "forcats"
    }

    taints = {
      dedicated = {
        key    = "dedicated"
        value  = "forcats"
        effect = "NO_SCHEDULE"
      }
    }
  }

  fordogs = {
    min_size       = 1
    max_size       = 3
    desired_size   = 1
    instance_types = ["t3.small"]

    labels = {
      "nodegroup/type" = "fordogs"
    }

    taints = {
      dedicated = {
        key    = "dedicated"
        value  = "fordogs"
        effect = "NO_SCHEDULE"
      }
    }
  }
}

EKS Cluster Auto Scaling Config: 

resource "helm_release" "cluster_autoscaler" {
  name       = "cluster-autoscaler"
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"
  namespace  = "kube-system"
  version    = "9.29.0"
  timeout    = 600

  set {
    name  = "autoDiscovery.clusterName"
    value = var.cluster_name
  }

  set {
    name  = "awsRegion"
    value = var.region
  }

  set {
    name  = "rbac.serviceAccount.create"
    value = "true"
  }

  set {
    name  = "rbac.serviceAccount.name"
    value = "cluster-autoscaler"
  }

  set {
    name  = "rbac.serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = module.cluster_autoscaler_irsa.iam_role_arn
  }

  set {
    name  = "image.tag"
    value = "v1.30.0"
  }

  set {
    name  = "nodeSelector.nodegroup\\/type"
    value = "platform"
  }

  set {
    name  = "tolerations[0].key"
    value = "dedicated"
  }

  set {
    name  = "tolerations[0].value"
    value = "platform"
  }

  set {
    name  = "tolerations[0].operator"
    value = "Equal"
  }

  set {
    name  = "tolerations[0].effect"
    value = "NoSchedule"
  }

  depends_on = [module.eks]
}

Scenario 1: Generals from Cats & Dogs demand frugality & agility 

[Architecture diagram 1: Karpenter on Amazon EKS]

Modern apps come in different shapes and sizes. Auto scaling should provision the right instance types dynamically, rather than enforcing “one size fits all.” 

Also, if 9–5, Mon–Fri counts as business hours, that is only 40 of 168 hours, or about 23.8% of the week. The remaining 76.2% is non-business time when non-production apps don’t need full capacity. By configuring appropriate disruption budgets, scale-down policies, or by shutting down non-production workloads outside business hours using AWS Instance Scheduler, customers can save up to ~75% in infrastructure costs. 

Karpenter enables this through consolidation: it moves workloads off underutilised or empty nodes onto other nodes in the cluster to keep utilisation high (around 80%), then terminates the idle nodes, saving cost. A sketch of a NodePool that combines consolidation with business-hours disruption budgets follows. 
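As an illustration of the disruption-budget idea (the values and schedule below are assumptions, and the cron schedule is evaluated in the Karpenter controller’s time zone, typically UTC), a NodePool can block voluntary disruption during business hours and allow it the rest of the time:

# Illustrative NodePool: no voluntary disruption (consolidation, drift, etc.)
# during 09:00-17:00 Mon-Fri, up to 10% of nodes at any other time.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: forcats
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: forcats
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      # Business hours: allow zero nodes to be voluntarily disrupted
      - schedule: "0 9 * * mon-fri"
        duration: 8h
        nodes: "0"
      # All other times: allow up to 10% of nodes to be disrupted at once
      - nodes: "10%"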

 

Note: A dedicated platform Node Group must host the Karpenter controller (in our setup, the two-node platform node group). Deploying Karpenter on Karpenter-managed nodes creates a chicken-and-egg problem. 

 

The reason: the Karpenter controller is responsible for dynamically provisioning and deprovisioning worker nodes based on pending pods and their scheduling requirements. Because of this role, it cannot depend on nodes it provisions itself. If the controller pod were scheduled onto a Karpenter-managed node and that node were later consolidated or terminated, nothing would be left running to provision a replacement, creating a circular dependency. 

Let’s deploy Karpenter to “PetsCluster.” 

 

Karpenter Setup Config: 

resource "helm_release" "karpenter" {
  name             = "karpenter"
  namespace        = "karpenter"
  create_namespace = true

  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "1.6.1"
  timeout    = 600

  # Core settings
  set {
    name  = "settings.clusterName"
    value = var.cluster_name
  }

  set {
    name  = "settings.clusterEndpoint"
    value = module.eks.cluster_endpoint
  }

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = module.karpenter_irsa.iam_role_arn
  }

  # Controller configuration
  set {
    name  = "controller.resources.requests.cpu"
    value = "1"
  }

  set {
    name  = "controller.resources.requests.memory"
    value = "1Gi"
  }

  set {
    name  = "controller.resources.limits.cpu"
    value = "1"
  }

  set {
    name  = "controller.resources.limits.memory"
    value = "1Gi"
  }

  # Node selector for platform nodes
  set {
    name  = "nodeSelector.nodegroup\\/type"
    value = "platform"
  }

  # Tolerations for platform nodes
  set {
    name  = "tolerations[0].key"
    value = "dedicated"
  }

  set {
    name  = "tolerations[0].value"
    value = "platform"
  }

  set {
    name  = "tolerations[0].operator"
    value = "Equal"
  }

  set {
    name  = "tolerations[0].effect"
    value = "NoSchedule"
  }

  depends_on = [module.eks, module.karpenter_irsa]
}

Karpenter NodePool Config: 

---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: platform
spec:
  template:
    metadata:
      labels:
        "nodegroup/type": "platform"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: platform
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["t", "m"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["2", "4", "8"]
      taints:
        - key: "dedicated"
          value: "platform"
          effect: "NoSchedule"
  limits:
    cpu: 100
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: forcats
spec:
  template:
    metadata:
      labels:
        "nodegroup/type": "forcats"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: forcats
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["4", "8", "16"]
      taints:
        - key: "dedicated"
          value: "forcats"
          effect: "NoSchedule"
  limits:
    cpu: 500
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: fordogs
spec:
  template:
    metadata:
      labels:
        "nodegroup/type": "fordogs"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: fordogs
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["4", "8", "16"]
      taints:
        - key: "dedicated"
          value: "fordogs"
          effect: "NoSchedule"
  limits:
    cpu: 500
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
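To put the NodePools to work, a workload simply opts in with a matching nodeSelector and toleration. The Deployment below is a hypothetical forcats app (the name, image, and resource figures are made up for illustration); Karpenter sees its pending pods and launches a node that satisfies the forcats NodePool requirements:

# Hypothetical forcats workload targeting the forcats NodePool.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cats-warplan-api            # illustrative name
  namespace: forcats
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cats-warplan-api
  template:
    metadata:
      labels:
        app: cats-warplan-api
    spec:
      nodeSelector:
        "nodegroup/type": "forcats"   # matches the label set by the NodePool
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "forcats"
          effect: "NoSchedule"        # tolerates the NodePool's taint
      containers:
        - name: api
          image: public.ecr.aws/nginx/nginx:latest   # placeholder image
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"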

Karpenter NodeClass Config: 

---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: platform
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - alias: al2@latest
  role: ${instance_profile_name}
  subnetSelectorTerms:
    - tags:
        "kubernetes.io/role/internal-elb": "1"
  securityGroupSelectorTerms:
    - name: "${cluster_name}-node-*"
  tags:
    "karpenter.sh/discovery": "${cluster_name}"
    "nodegroup/type": "platform"
    "Name": "${cluster_name}-platform-karpenter"
  userData: |
    #!/bin/bash
    /etc/eks/bootstrap.sh ${cluster_name}

---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: forcats
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - alias: al2@latest
  role: ${instance_profile_name}
  subnetSelectorTerms:
    - tags:
        "kubernetes.io/role/internal-elb": "1"
  securityGroupSelectorTerms:
    - name: "${cluster_name}-node-*"
  tags:
    "karpenter.sh/discovery": "${cluster_name}"
    "nodegroup/type": "forcats"
    "Name": "${cluster_name}-forcats-karpenter"
  userData: |
    #!/bin/bash
    /etc/eks/bootstrap.sh ${cluster_name}

---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: fordogs
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - alias: al2@latest
  role: ${instance_profile_name}
  subnetSelectorTerms:
    - tags:
        "kubernetes.io/role/internal-elb": "1"
  securityGroupSelectorTerms:
    - name: "${cluster_name}-node-*"
  tags:
    "karpenter.sh/discovery": "${cluster_name}"
    "nodegroup/type": "fordogs"
    "Name": "${cluster_name}-fordogs-karpenter"
  userData: |
    #!/bin/bash
    /etc/eks/bootstrap.sh ${cluster_name}

Scenario 2: Generals demand “All hands on deck” using EKS Auto Mode 

[Architecture diagram 2: EKS Auto Mode on Amazon EKS]

Karpenter improves frugality and agility, but operational overhead remains: security patching, upgrades, and day-to-day maintenance are still needed to keep a strong security posture. This is where EKS Auto Mode comes in. It uses Karpenter behind the scenes, but adds further automation: 

  • Streamlined cluster management – Production-ready EKS clusters with minimal overhead, like Elastic Beanstalk. 
  • Application availability – Dynamically adds/removes nodes per workload demand. 
  • Efficiency – Consolidates workloads, removes idle nodes, and reduces cost. 
  • Security – Nodes are recycled every 21 days (can be reduced), aligning with security best practices. 
  • Automated upgrades – Keeps clusters/nodes up to date while respecting Pod Disruption Budgets (PDBs) and NodePool Disruption Budgets (NDBs). 
  • Managed components – Includes DNS, Pod networking, GPU plug-ins, health checks, and EBS CSI out-of-the-box. 
  • Customisable NodePools – Still allows defining custom storage, compute, or networking requirements (a sketch of a custom NodePool follows this list). 
  • Simple enablement – EKS Auto Mode is turned on by setting `compute_config.enabled = true` in the EKS Terraform module (shown below). 
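As a sketch of that last point (based on our understanding of the current Auto Mode conventions; verify the label keys and built-in NodeClass name against the AWS documentation), a custom NodePool under Auto Mode looks very similar to the Karpenter NodePools above, except it references the built-in eks.amazonaws.com NodeClass and uses eks.amazonaws.com/* label keys in its requirements:

# Sketch: custom NodePool in EKS Auto Mode referencing the built-in NodeClass.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: forcats
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com   # Auto Mode's built-in NodeClass API group
        kind: NodeClass
        name: default
      requirements:
        - key: "eks.amazonaws.com/instance-category"
          operator: In
          values: ["c", "m", "r"]
        - key: "eks.amazonaws.com/instance-cpu"
          operator: In
          values: ["4", "8", "16"]
      taints:
        - key: "dedicated"
          value: "forcats"
          effect: "NoSchedule"
  limits:
    cpu: 500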

 

EKS Cluster Config with Auto Mode: 

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name                             = var.cluster_name
  cluster_version                          = "1.32"
  cluster_endpoint_public_access           = true
  enable_cluster_creator_admin_permissions = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  enable_irsa = true

  compute_config = {
    enabled = true
  }
}

EKS with Karpenter vs EKS Auto Mode 

Both managed EKS with Karpenter and Amazon EKS Auto Mode are powerful options for managing Kubernetes clusters. 

[Comparison table: EKS Auto Mode vs Karpenter on Amazon EKS]

Choose managed EKS with Karpenter if you need: 

  • Control over the data plane 
  • Custom AMIs 
  • Specific agents or software that run as DaemonSets 
  • Custom networking 
  • Granular control over patching and upgrades 

Choose EKS Auto Mode if you: 

  • Want to reduce operational overhead for upgrades and patching 
  • Don’t require granular control over the AMI, custom networking, upgrades, or patching 

Conclusion  

In conclusion, both the Dogs’ and Cats’ generals now have an EKS cluster with Auto Mode enabled, which automatically scales, patches, and secures their workloads without manual intervention. 

Migrating from Cluster Autoscaler to Karpenter was the tectonic shift that optimised cluster efficiency. Karpenter, originally built by AWS, is now an open-source project maintained by the community. 

As the proverb goes: “Trust, but verify.” While Karpenter is powerful, it shouldn’t be treated as a black box. The last thing anyone wants is a production outage because Karpenter decided to consolidate or terminate nodes during business hours. 
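A couple of lightweight guardrails help here. A PodDisruptionBudget caps how many replicas can be evicted at once while Karpenter consolidates nodes, and pods that must never be voluntarily disrupted can carry the karpenter.sh/do-not-disrupt annotation. The manifest below is a sketch using the hypothetical forcats app from earlier:

# Sketch: keep at least one replica of the (hypothetical) forcats app running
# while Karpenter drains and consolidates nodes.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cats-warplan-api
  namespace: forcats
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: cats-warplan-api
# For pods that must never be voluntarily disrupted, add this annotation to the
# pod template instead:  karpenter.sh/do-not-disrupt: "true"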

So, in the next blog, we’ll explore observability for Karpenter: forwarding Karpenter controller logs to Grafana and building dashboards to monitor its actions. 
