EKS Networking Deep Dive Series Part 1 — IP Machine

How pods get addresses, how ipamd manages them, and the ENI limits that break clusters at the worst possible moment

Series: Inside EKS Networking · Post 1 of 7 · ~30 min read


The moment a pod gets scheduled on EKS, a race begins.

Your application is waiting. Kubernetes is waiting. And somewhere on the EC2 control plane, a daemon is making API calls that will determine whether your pod starts in 300 milliseconds or 30 seconds.

Most engineers never look at what happens in that gap. They see Pending flip to Running and move on. This post doesn’t move on.

We’re going into the gap — the ENI allocation loop, the ipamd warm pool, the exact routing a packet takes through a pod’s network namespace, and the hard limits baked into every EC2 instance type that can stop scheduling cold while a node still has available CPU and memory.

By the end you’ll be able to read ipamd logs and know precisely what state your cluster’s IP allocation is in. You’ll know which failure mode you’re looking at before you start guessing.

Section 1 — The overlay model and why AWS abandoned it

When AWS built the VPC CNI, they made a choice most CNI plugins don’t make: every pod gets a real IP address from your VPC subnet.

That sounds like a small implementation detail. It isn’t. It changes what your security groups can see, what your load balancers can route to, what your Flow Logs record, and how quickly your subnets run out of space.

To understand why that choice matters — and what it costs — you need to understand what the alternative looks like.

Most CNI plugins implement pod networking with an overlay. Pods get IPs from a private CIDR that exists only inside the cluster — something like 10.244.0.0/16. The VPC has never heard of it. When a pod on Node A sends traffic to a pod on Node B, the packet gets wrapped in a VXLAN frame with node IPs on the outside. The VPC routes the outer packet. The destination node unwraps it. The inner pod packet is delivered.

Your VPC sees two EC2 instances talking to each other. Everything inside — which pod talked to which, which service made that outbound connection — is invisible to it.

Overlay model — what the VPC sees

───────────────────────────────────────────────────────────

Node A (10.0.1.10) Node B (10.0.1.20)

┌───────────────────────┐ ┌───────────────────────┐

│ Pod A 10.244.0.5 │ │ Pod B 10.244.1.7 │

│ Pod C 10.244.0.6 │ │ Pod D 10.244.1.8 │

└───────────────────────┘ └───────────────────────┘

Packet: Pod A → Pod B

┌──────────────────────────────────────────────────────┐

│ outer src: 10.0.1.10 outer dst: 10.0.1.20 │

│ ┌────────────────────────────────────────────────┐ │

│ │ inner src: 10.244.0.5 dst: 10.244.1.7 │ │

│ └────────────────────────────────────────────────┘ │

└──────────────────────────────────────────────────────┘

VPC sees: 10.0.1.10 → 10.0.1.20

VPC cannot see: 10.244.0.5 → 10.244.1.7

This creates three operational problems that compound with cluster size.

Security groups operate on the node, not the pod. Every pod on a node shares that node’s security group. You cannot write a rule allowing only your payments service to reach your database — you have to allow the entire node. All 29 pods get the same network access.

Flow Logs record node IPs. When an incident fires and you need to know which pod made an unexpected outbound connection, Flow Logs give you a node IP and a timestamp. You correlate that against Kubernetes events to find the pod, cross-reference deployment history to find the service. It works slowly, under pressure, when you have the least patience for it.

AWS services can’t route to pod IPs. ALBs and NLBs work with VPC IPs. Getting traffic from a load balancer into a pod through an overlay requires a NodePort hop through kube-proxy’s iptables chains. Extra latency, extra failure surface, source IP lost at the first NAT.

The VPC CNI solves all three by giving every pod a real IP from your VPC subnet. No encapsulation. Pod A at 10.0.1.11 and Pod B at 10.0.2.16 communicate directly through the VPC routing table.

VPC-native model — what EKS VPC CNI gives you

───────────────────────────────────────────────────────────

Node A (10.0.1.10) Node B (10.0.2.15)

┌───────────────────────┐ ┌───────────────────────┐

│ Pod A 10.0.1.11 │ │ Pod D 10.0.2.16 │

│ Pod B 10.0.1.12 │ │ Pod E 10.0.2.17 │

└───────────────────────┘ └───────────────────────┘

Packet: Pod A → Pod D

┌──────────────────────────────────────────────────────┐

│ src: 10.0.1.11 dst: 10.0.2.16 │

│ No encapsulation. No outer header. Native MTU. │

└──────────────────────────────────────────────────────┘

VPC sees: the actual pods

Flow Logs: record real pod IPs

ALB IP mode: routes directly to pod IPs

Security groups: per-pod isolation via SG for Pods

The upside is real. So is the cost.

Every pod consumes a real VPC IP address. A 200-node cluster running 30 pods per node holds 6,000 pod IPs in your subnets — before accounting for node IPs, ENI primary IPs, and warm pool reserves. In a /24 with 254 usable addresses, this arithmetic becomes a crisis fast.

That’s the trade AWS made. VPC visibility in exchange for IP pressure. Everything that follows in this post — ipamd, the warm pool, ENI limits, prefix delegation — is the engineering that manages that pressur

Section 2 — ipamd: the daemon that runs before your pod does

Before your pod exists, ipamd is already working.

ipamd is a long-running daemon inside the aws-node DaemonSet running on every EKS node. Its entire job is to make sure you never wait for an EC2 API call when a pod needs an IP. It runs those calls in advance, keeps the results in a pool, and hands them off the moment kubelet asks.

This matters because EC2 API calls are slow. Attaching a new ENI takes 3 to 10 seconds. Assigning secondary IPs takes another 1 to 3 seconds. If every pod start waited for these calls, a burst of 20 pods during a scale-out event would take minutes. ipamd front-loads the slow work so your pod start is fast.

Two binaries, one DaemonSet

Inside aws-node, two separate binaries handle two separate concerns. Most engineers conflate them. That conflation makes debugging harder.

aws-node DaemonSet pod — per node, hostNetwork: true

───────────────────────────────────────────────────────────────────────

┌──────────────────────────────────┐ ┌──────────────────────────────┐

│ ipamd │ │ CNI plugin │

│ │ │ /opt/cni/bin/aws-cni │

│ • Manages ENI pool │ │ • Called by kubelet on │

│ • Maintains warm IP pool │ │ every pod ADD / DEL │

│ • Calls EC2 API (async) │ │ • Requests IP from ipamd │

│ • Serves IPs over gRPC │ │ • Creates veth pair │

│ • Recovers state on restart │ │ • Configures network ns │

│ │ │ • Sets up routes + ARP │

│ Does NOT touch network stack │ │ Does NOT call EC2 API │

└──────────────┬───────────────────┘ └──────────────┬───────────────┘

└──────────── gRPC (unix socket) ───────┘

ipamd manages the pool. The CNI plugin consumes from it. Their boundary is the gRPC socket — and understanding where one’s responsibility ends and the other’s begins is the difference between a 10-minute debug session and a 3-hour one.

When a pod starts, kubelet invokes the CNI plugin binary directly. The CNI plugin calls ipamd over the socket, gets an IP in roughly a millisecond, then — entirely on its own — creates the veth pair, configures the pod’s network namespace, sets up routing rules, and returns success to kubelet. ipamd does not touch the Linux network stack. It does not create veth pairs. It does not set up routes. Once it hands off the IP, its job for that pod is done.

This matters when you’re diagnosing a pod stuck in ContainerCreating. If ipamd handed off an IP successfully but the pod still hasn’t come up, the problem is in the CNI plugin’s network namespace setup — not in ipamd. Looking at ipamd logs won’t help you. Looking at CNI plugin logs will.

The warm pool

ipamd’s central mechanism is the warm pool — a buffer of pre-allocated IP addresses sitting ready before any pod requests them.

ipamd startup and steady-state loop

───────────────────────────────────────────────────────────────────────

Node boots → ipamd starts

├── DescribeNetworkInterfaces ← find ENIs already on this node

├── Query kubelet local API ← find pods already running

├── Calculate free IPs ← total_allocated – in_use

Pool below target?

│ │

YES NO

│ └── serve requests from pool (~1ms)

Need new ENI?

│ │

YES NO

│ │

▼ ▼

AttachNetworkInterface AssignPrivateIpAddresses

(3–10s) to existing ENI (1–3s)

Wait for ENI in kernel → configure routing → add IPs to pool

Pod scheduled → kubelet calls CNI plugin → CNI calls ipamd: GetIPForPod

├── ipamd returns IP from pool ← ~1ms

├── CNI plugin creates veth pair ← CNI’s job, not ipamd’s

├── CNI plugin configures pod namespace ← CNI’s job, not ipamd’s

└── kubelet gets success, starts containers

ipamd detects pool dropped below target

└── asynchronously refills ← next pod equally fast

The startup sequence is worth reading carefully because it describes exactly what happens when ipamd crashes on a node with running pods.

On restart, ipamd calls DescribeNetworkInterfaces to recover all ENIs and their secondary IPs from EC2’s state — this data lives in EC2, not in ipamd’s local memory. It then queries kubelet to find which pods are running and which IPs they hold. It reconstructs the warm pool from the difference. Running pods are completely unaffected — their network namespaces live in the kernel, and ipamd doesn’t own them. The only consequence of an ipamd crash is that new pods cannot start until it recovers, which typically takes 10 to 30 seconds.

Warm pool configuration

Three environment variables control the pool target, evaluated in priority order:

WARM_IP_TARGET Keep exactly N free IPs in the pool at all times.

Best for workloads with predictable burst size.

Example: pods burst in groups of 10 → set to 15.

MINIMUM_IP_TARGET Always have at least N total IPs allocated,

used + free combined. Sets a floor for idle nodes.

Use alongside WARM_IP_TARGET.

WARM_ENI_TARGET Keep N fully-allocated spare ENIs beyond pod usage.

Default value: 1. Coarsest control. Used when

neither of the above is set.

The default WARM_ENI_TARGET=1 is where most teams get hurt without realizing it.

An m5.large holds 10 secondary IPs per ENI. With WARM_ENI_TARGET=1, every idle m5.large permanently reserves 10 VPC IPs that no pod is using. In a 100-node cluster that’s 1,000 IPs consumed by warm pools before a single application pod is scheduled. In a /24 subnet with 254 usable addresses, warm pools alone consume nearly 40% of the space.

Set WARM_IP_TARGET and MINIMUM_IP_TARGET explicitly based on your actual burst pattern. Reserve WARM_ENI_TARGET for situations where you genuinely need ENI-level pre-allocation — high-throughput nodes where a full ENI’s worth of IPs is regularly consumed in seconds.

Section 3 — Inside the pod’s network namespace

Run ip route inside any EKS pod:

default via 169.254.1.1 dev eth0

169.254.1.1 dev eth0 scope link

There is no host at 169.254.1.1. It’s a link-local address in the IANA-reserved range. Nothing on your network lives there. The pod has a default gateway that doesn’t exist, and it’s communicating normally.

This is the thing that confuses almost every engineer the first time they look at EKS pod networking from the inside. Understanding it requires three Linux primitives: network namespaces, veth pairs, and proxy ARP.

Network namespaces

A Linux network namespace is an isolated copy of the network stack — its own interfaces, its own routing table, its own ARP table, its own iptables rules. A process inside a namespace sees only what’s in that namespace. It cannot see the host’s eth0. It cannot see other pods.

Every EKS pod runs in its own network namespace. The host node has its own. For a pod to communicate with anything outside — another pod, a service, the internet — traffic must cross a namespace boundary.

veth pairs

A veth pair is a virtual Ethernet cable with two ends. Anything injected into one end immediately appears on the other. The CNI plugin creates one veth pair per pod at startup:

  • One end, named eth0, lives inside the pod’s namespace. This is the interface the pod sees.
  • The other end, named eniXXXXXX, lives in the host namespace. This is how the host handles the pod’s traffic.

Host network namespace

───────────────────────────────────────────────────────────────────────

eth0 (primary ENI)

primary: 10.0.1.10

secondary: 10.0.1.11, 10.0.1.12, 10.0.1.13

eth1 (secondary ENI)

primary: 10.0.1.20

secondary: 10.0.1.21, 10.0.1.22

eniXXXXXX ← veth host-end for Pod A (paired with Pod A’s eth0)

eniYYYYYY ← veth host-end for Pod B (paired with Pod B’s eth0)

Policy routing rules:

from 10.0.1.11 lookup table 101 ← Pod A exits via eth0

from 10.0.1.12 lookup table 101 ← Pod B exits via eth0

from 10.0.1.21 lookup table 102 ← Pod C exits via eth1

───────────────────────────────────────────────────────────────────────

Pod A network namespace

eth0 ← veth pod-end, paired with eniXXXXXX in host

IP: 10.0.1.11/32

Routes:

default via 169.254.1.1 dev eth0

169.254.1.1 dev eth0 scope link

The 169.254.1.1 trick

The pod needs a default gateway. Without one it can only talk to itself. The obvious candidates — the node’s primary IP, another pod’s IP — create ordering dependencies and portability problems. The CNI plugin uses 169.254.1.1, a link-local address guaranteed not to conflict with any real infrastructure.

When the pod wants to send a packet anywhere, it ARP-resolves its gateway first: “who has 169.254.1.1?” The CNI plugin has already installed a proxy ARP entry on the host side of the veth pair — eniXXXXXX answers that ARP request with its own MAC address. The pod receives the reply, caches it, and believes it has a working gateway. Every outbound packet goes to that MAC, crosses the veth into the host namespace, and from there the host’s routing table and policy rules take over.

The gateway is a controlled fiction. Real routing happens entirely in the host namespace.

One detail that matters: it’s the CNI plugin that sets up this ARP proxy — not ipamd. ipamd’s job ends at handing off the IP. Everything that touches the Linux network stack, including 169.254.1.1, is the CNI plugin’s responsibility.

Per-ENI routing tables

Each secondary ENI gets its own routing table in the host namespace. Each pod IP gets a policy routing rule that maps it to the correct table.

When a packet from Pod A (10.0.1.11) arrives in the host namespace via the veth, the kernel checks ip rule list. It finds from 10.0.1.11 lookup table 101. Table 101 has a default route via eth0. The packet exits through eth0 — the ENI that Pod A’s IP belongs to.

Without these per-source rules, a pod whose IP came from eth1 might send traffic through eth0. The VPC sees a packet sourced from an IP that doesn’t belong to that ENI and drops it silently. No error. The sender times out. This is one of the harder-to-diagnose failure modes in EKS networking — we’ll see it in detail in Section 5.

Why pod IPs are /32

Pod IPs are assigned as /32 host routes, not as members of a subnet. This means the pod has no direct L2 connectivity to any other address. Every packet, regardless of destination, goes through the default gateway, crosses the veth, and enters the host routing stack.

The consequence is simplicity at scale. The host doesn’t need a bridge or an L2 domain shared across pods. Each pod has exactly one route, one policy rule, one veth pair. The model scales linearly. Adding the 50th pod to a node is identical to adding the 5th.

Section 4 – The ENI limit nobody warns you about

Every EC2 instance type has hard limits on how many ENIs it can attach and how many secondary private IPs each ENI can hold. These limits are not configurable. They are physical constraints baked into the instance type by AWS. And they directly determine the maximum number of pods a node can run.

The formula:

max_pods = (max_ENIs × (IPs_per_ENI – 1)) + 2

The -1 per ENI: each ENI’s primary IP belongs to the node, not pods

The +2: kube-proxy and aws-node use host networking,

they don’t consume secondary IPs

Applied to common instance types:

Instance type ENIs IPs/ENI Max pods

─────────────────────────────────────────────

t3.small 3 4 11

t3.medium 3 6 17

m5.large 3 10 29

m5.xlarge 4 15 58

m5.2xlarge 4 15 58

m5.4xlarge 8 30 234

c5.xlarge 4 15 58

c5.4xlarge 8 30 234

c5.18xlarge 15 50 737

Twenty-nine pods on an m5.large. The node has 2 vCPUs and 8 GiB of memory that may be nowhere near exhausted, but at pod 30 Kubernetes refuses to schedule. Too many pods in the event log. The pod sits Pending. Engineers start looking at application logs. There is nothing in the application logs. The problem is in the network.

The m5.xlarge trap

The m5.xlarge and m5.2xlarge share the same 58-pod ceiling despite the 2xlarge having twice the CPU and memory. If your workload is pod-dense — many small containers, sidecars, short-lived batch jobs — you can exhaust the m5.xlarge’s pod capacity with CPU utilization below 20%. You pay for compute that sits idle because you ran out of IP slots.

When choosing instance types, pod density and compute requirements are separate constraints. Check both independently before committing.

Prefix delegation

Prefix delegation changes the unit of IP assignment from individual addresses to /28 blocks — 16 contiguous IPs per block. One ENI slot that previously gave one pod IP now gives sixteen.

Secondary IP mode — m5.large

─────────────────────────────────────────────────────────────────

eth0: 1 primary + 9 secondary IPs = 9 pod slots

eth1: 1 primary + 9 secondary IPs = 9 pod slots

eth2: 1 primary + 9 secondary IPs = 9 pod slots

─────────────

27 + 2 = 29 max pods

Prefix delegation mode — same m5.large

─────────────────────────────────────────────────────────────────

eth0: 1 primary + 9 prefixes × 16 = 144 pod slots

eth1: 1 primary + 9 prefixes × 16 = 144 pod slots

eth2: 1 primary + 9 prefixes × 16 = 144 pod slots

─────────────

432 + 2 = 434 theoretical

434 is raw IP-slot math. It is not your –max-pods value.

On smaller instances, CPU and memory are exhausted long before IP capacity. AWS provides a max-pods calculator that accounts for both constraints. For most workloads on instances up to m5.2xlarge, the recommended value lands around 110 regardless of IP slot math. On larger instances the IP ceiling becomes the binding constraint and the calculator reflects that. Use the calculator. Do not use 434.

The subnet requirement

Prefix delegation requires 16 contiguous addresses aligned to a /28 boundary in the subnet. If your subnet is fragmented — individual IPs scattered across the address space from years of other workloads — EC2 may fail to assign a /28 even when total free addresses are sufficient. This failure is invisible in the console until allocation fails and pods don’t come up.

Prefix delegation works reliably on subnets that are either large enough that fragmentation doesn’t matter (/22 or bigger) or dedicated to EKS nodes and not shared with other workloads. Dedicated node subnets are the right architecture for any cluster beyond small scale. They make the IP math transparent and eliminate the fragmentation problem entirely.

Migration: new node group, not in-place drain

Prefix delegation only takes effect on new nodes. The operationally safe migration path is to provision a new node group with prefix delegation configured from the start, migrate workloads across, and delete the old group.

Enabling the flag and draining existing nodes in place risks nodes coming up in a mixed state — some ENIs holding individual IPs from the old configuration, some holding prefixes from the new one. The resulting inconsistent capacity reporting is painful to diagnose under pressure.

The secondary CIDR escape

If your VPC is running low on address space and you can’t get more IPv4 range from your network team quickly, there is one more option worth knowing.

AWS supports associating a secondary CIDR from the 100.64.0.0/10 range with your VPC. This comes from RFC 6598 — originally defined as carrier-grade NAT space. It is fully routable within AWS but does not appear on the public internet. Your pods get IPs from this range, freeing your primary VPC CIDR for everything else.

A 100.64.0.0/10 block gives you over 4 million addresses. It won’t solve a fundamentally broken subnet architecture, but as runway while you plan a proper fix it’s the fastest path available without a VPC redesign.

Section 5: How this breaks in production

The three failure patterns below account for the majority of EKS IP allocation incidents. They share a common feature: they present as application problems until you know where to look.

Pattern 1 — The pod start latency cliff

What you see: Pod start times are consistently under 2 seconds. Then, during a scale-out event, they spike to 15–30 seconds with no change to the application or deployment configuration.

What’s actually happening: The warm pool is sized correctly for normal operations but too small for the burst rate of a large scale-out. When the autoscaler provisions 30 new nodes simultaneously, each node’s ipamd starts filling its warm pool at the same time. The combined EC2 API call volume across the fleet triggers throttling. AssignPrivateIpAddresses calls that normally take 1 second start taking 10–20. The pool can’t refill fast enough. Pods that arrive after the pool empties wait for synchronous EC2 API calls before they get an IP.

The key signal: latency spikes that correlate with node provisioning events, not deployment events. ipamd logs show RequestLimitExceeded on EC2 API calls. The warm pool metric drops to zero and stays there.

bash

# Confirm throttling in ipamd logs

kubectl logs -n kube-system $AWSNODE -c aws-node \

| grep -i ‘throttl\|RequestLimitExceeded\|rate exceed’

Fix: Increase WARM_IP_TARGET so each node pre-allocates more IPs during bootstrap before competing with the rest of the fleet. Consider prefix delegation — each EC2 API call assigns 16 IPs instead of 1, cutting total call volume by 16x during scale-out.


Pattern 2 — Subnet exhaustion from warm pool leak

What you see: New nodes fail to provision. Pods on existing nodes fail to start. ipamd logs show InsufficientFreeAddressesInSubnet. The subnet in the AWS console shows near-full address usage, but the number of running pods doesn’t explain it.

What’s actually happening: ipamd is holding more IPs than pods are using. This happens in two ways. First, WARM_ENI_TARGET=1 on a large fleet permanently reserves a full ENI’s worth of IPs per node — on 100 m5.large nodes that’s 1,000 IPs sitting in warm pools. Second, after a cluster scales out and back in, the remaining nodes may still hold warm pool IPs sized for the peak fleet rather than the current one. ipamd is conservative about releasing IPs to avoid churning EC2 API calls.

bash

# Calculate warm pool size across the cluster

# Secondary IPs allocated by EC2:

aws ec2 describe-network-interfaces \

–filters “Name=attachment.instance-id,Values=$INSTANCE_ID” \

–query ‘NetworkInterfaces[*].PrivateIpAddresses[?Primary==`false`]’ \

–output json | jq ‘[.[][][].PrivateIpAddress] | length’

# Pod IPs in use on the same node:

kubectl get pods -A \

–field-selector spec.nodeName=$NODE \

-o jsonpath='{.items[*].status.podIP}’ \

| tr ‘ ‘ ‘\n’ | grep -v ‘^$’ | wc -l

# The gap between these two numbers is your warm pool.

# If it’s consistently larger than your WARM_IP_TARGET

# multiplied by node count, ipamd is holding excess IPs.

Fix: Replace WARM_ENI_TARGET with explicit WARM_IP_TARGET and MINIMUM_IP_TARGET values sized to your actual burst pattern. This gives ipamd clear boundaries for when to release IPs rather than holding a full ENI in reserve indefinitely.


Pattern 3 — The asymmetric routing silent drop

What you see: Intermittent connection failures between pods on different nodes. Not all connections fail — just some, with no apparent pattern by service, time of day, or node. tcpdump on the sending pod shows packets leaving. The receiving pod never sees them. No error is returned to the application. Connections time out.

What’s actually happening: A pod whose IP was assigned from eth1’s secondary pool is sending traffic through eth0. The VPC sees a packet arriving via Node A’s eth0 ENI with a source IP that belongs to Node A’s eth1 ENI. It interprets this as spoofed source traffic and drops it silently.

This happens when the policy routing rules in the host namespace are incomplete. Each pod IP needs a rule mapping it to the correct ENI’s routing table. If ipamd crashed mid-allocation and the CNI plugin was invoked before recovery completed, the rule may be missing. The pod’s traffic falls through to the default routing table, exits via the wrong ENI, and the VPC drops the return path.

bash

# SSH to the affected node

aws ssm start-session –target $INSTANCE_ID

# Check policy routing rules

ip rule list

# Every pod IP should have a corresponding entry:

# 1024: from 10.0.1.21 lookup 102

# If a pod IP is missing its rule, that pod has asymmetric routing.

# Verify each ENI’s routing table has a correct default route

ip route show table 101 # should exit via eth0

ip route show table 102 # should exit via eth1

# If a rule is missing, the fix is to reschedule the pod.

# Deleting the pod forces the CNI plugin to re-run ADD,

# which reinstalls the correct rule.

kubectl delete pod

The long-term fix is alerting on ipamd restarts and verifying routing rule consistency after any ipamd recovery. A missing policy rule is the silent result of a race condition between ipamd recovery and pod scheduling — it won’t appear in application logs, and it won’t appear in Kubernetes events. It only shows up when you look directly at the host network namespace.

Section 6: Build: observe, understand, break, fix

The commands below follow a deliberate sequence. Each step reveals something that makes the next interpretable. Don’t skip ahead.

Prerequisites: EKS cluster with at least two m5.large nodes in private subnets. kubectl and AWS CLI configured.


Step 1 — Observe what ipamd is managing

bash

# Get a node and its EC2 instance ID

NODE=$(kubectl get nodes -o name | head -1 | cut -d/ -f2)

INSTANCE_ID=$(kubectl get node $NODE \

-o jsonpath='{.spec.providerID}’ | cut -d/ -f5)

# List all ENIs and their secondary IPs from EC2

aws ec2 describe-network-interfaces \

–filters “Name=attachment.instance-id,Values=$INSTANCE_ID” \

–query ‘NetworkInterfaces[*].{

ENI:NetworkInterfaceId,

Device:Attachment.DeviceIndex,

Primary:PrivateIpAddress,

Secondaries:PrivateIpAddresses[?Primary==`false`].PrivateIpAddress

}’ \

–output json

# Count pod IPs currently in use on this node

kubectl get pods -A \

–field-selector spec.nodeName=$NODE \

-o jsonpath='{.items[*].status.podIP}’ \

| tr ‘ ‘ ‘\n’ | grep -v ‘^$’ | wc -l

The difference between the secondary IP count from EC2 and the pod IP count from Kubernetes is ipamd’s warm pool. IPs allocated, paid for, sitting ready, assigned to no pod. On a freshly provisioned node with no workload this gap will equal roughly one ENI’s worth of IPs — the default WARM_ENI_TARGET=1 at work.


Step 2 — Trace the routing inside the node

bash

# SSH to the node via SSM

aws ssm start-session –target $INSTANCE_ID

# Show policy routing rules

ip rule list

# 0: from all lookup local

# 512: from all to 10.0.1.11 lookup main

# 1024: from 10.0.1.11 lookup 101 ← Pod A, exits via eth0

# 1024: from 10.0.1.21 lookup 102 ← Pod C, exits via eth1

# 32766: from all lookup main

# Verify each ENI’s routing table

ip route show table 101

# default via 10.0.1.1 dev eth0

ip route show table 102

# default via 10.0.1.1 dev eth1

# Enter a pod’s network namespace

POD_NS=$(ls /var/run/netns/ | grep cni | head -1)

sudo ip netns exec $POD_NS ip route

# default via 169.254.1.1 dev eth0

# 169.254.1.1 dev eth0 scope link

# Confirm the ARP proxy entry

sudo ip netns exec $POD_NS arp -n

# 169.254.1.1 ether a2:f0:b1:c3:d4:e5 C eth0

# That MAC belongs to the host-side veth — eniXXXXXX in the host namespace.

# The pod believes it has a gateway. It has a veth.

# Confirm from the host side

ip link show | grep -B1 a2:f0:b1:c3:d4:e5

# This shows the eniXXXXXX interface — the host-end of Pod A’s veth pair


Step 3 — Watch ipamd respond to pod scheduling

Open two terminals.

Terminal 1 — stream ipamd logs filtered to allocation events:

bash

AWSNODE=$(kubectl get pod -n kube-system \

–field-selector spec.nodeName=$NODE \

-l k8s-app=aws-node -o name | head -1)

kubectl logs -n kube-system $AWSNODE -c aws-node -f \

| grep -iE ‘warm|pool|assign|attach|eni|throttl’

Terminal 2 — create pods to trigger allocation:

bash

kubectl create deployment flood \

–image=public.ecr.aws/amazonlinux/amazonlinux:2 \

–replicas=20 — sleep 3600

Watch Terminal 1. If the warm pool absorbs the burst you’ll see a quiet refill — ipamd detects the pool dropped below target and makes EC2 API calls asynchronously. If you scale fast enough to outrun the pool you’ll see the difference: pod starts slow while ipamd waits for synchronous EC2 API calls to return before it can hand off IPs.

This is the warm pool’s value made visible. Fast pod starts mean the pool was ready. Slow pod starts mean the pool was empty and ipamd is blocking on EC2.


Step 4 — Hit the max-pods limit deliberately

bash

# Scale until pods go Pending

kubectl scale deployment flood –replicas=40

# Watch for Pending

kubectl get pods -w | grep Pending

# Describe a Pending pod

kubectl describe pod | grep -A5 Events

# Warning FailedScheduling 0/2 nodes available: 2 Too many pods.

# Confirm the node has compute headroom

kubectl top node $NODE

# NAME CPU% MEMORY%

# ip-10-0-1-10… 13% 36%

# Confirm the pod ceiling

kubectl get node $NODE \

-o jsonpath='{.status.allocatable.pods}’

# 29

13% CPU. 36% memory. 0 pods schedulable. The node is not full by any compute measure. It is full by IP measure. This is the scenario that shows up in production as a stalled rollout while engineers search application logs for a problem that isn’t there.


Step 5 — Enable prefix delegation on a new node group

bash

# Enable on the aws-node DaemonSet

kubectl set env daemonset aws-node \

-n kube-system \

ENABLE_PREFIX_DELEGATION=true \

WARM_PREFIX_TARGET=1

# This takes effect on new nodes only.

# Provision a new node group. In your launch template userdata:

#

# /etc/eks/bootstrap.sh \

# –use-max-pods false \

# –kubelet-extra-args ‘–max-pods=110’

#

# Use the EKS max-pods calculator for the correct value.

# Do not use 434.

# Migrate workloads to the new node group.

# Drain and delete the old node group.

# Do not drain and recycle existing nodes in place.

Verify on the new node:

bash

NEW_NODE=$(kubectl get nodes \

–sort-by=.metadata.creationTimestamp \

-o name | tail -1 | cut -d/ -f2)

AWSNODE_NEW=$(kubectl get pod -n kube-system \

–field-selector spec.nodeName=$NEW_NODE \

-l k8s-app=aws-node -o name | head -1)

kubectl logs -n kube-system $AWSNODE_NEW -c aws-node \

| grep -i prefix

# Prefix delegation enabled

# Assigned prefix 10.0.1.16/28 to eni-0abc…

# warm prefix target: 1, current: 1, pool is warm

kubectl get node $NEW_NODE \

-o jsonpath='{.status.allocatable.pods}’

# 110


Step 6 — Verify warm pool sizing

Before moving on, check whether your current warm pool configuration is consuming more IPs than it should. This takes 2 minutes and most teams have never done it.

bash

# For each node in your cluster, compare allocated vs in-use

for NODE in $(kubectl get nodes -o name | cut -d/ -f2); do

INSTANCE_ID=$(kubectl get node $NODE \

-o jsonpath='{.spec.providerID}’ | cut -d/ -f5)

ALLOCATED=$(aws ec2 describe-network-interfaces \

–filters “Name=attachment.instance-id,Values=$INSTANCE_ID” \

–query ‘NetworkInterfaces[*].PrivateIpAddresses[?Primary==`false`]’ \

–output json | jq ‘[.[][][].PrivateIpAddress] | length’)

IN_USE=$(kubectl get pods -A \

–field-selector spec.nodeName=$NODE \

-o jsonpath='{.items[*].status.podIP}’ \

| tr ‘ ‘ ‘\n’ | grep -v ‘^$’ | wc -l)

echo “$NODE allocated=$ALLOCATED in_use=$IN_USE warm_pool=$((ALLOCATED – IN_USE))”

done

If the warm pool column is consistently at 9 or 10 per node, you’re running the default WARM_ENI_TARGET=1. Multiply that by your node count to see how many VPC IPs you’re holding in reserve across the fleet. On a 100-node cluster of m5.large nodes that number is typically between 800 and 1,000.

Whether that’s acceptable depends on your subnet sizing. If you’re running in a /24 it’s a problem. If you’re running in a /20 it’s noise. The point is to know the number rather than discover it during a subnet exhaustion incident.

What you now know — and what comes next

You have a complete mental model of how EKS pods get their IP addresses.

The VPC CNI gives every pod a real VPC IP rather than running an overlay. That decision unlocks native AWS integrations — Flow Logs that record actual pod IPs, load balancers that route directly to pods, security group enforcement at pod granularity via SG for Pods. It also creates IP pressure that has to be managed deliberately: warm pool sizing, subnet planning, ENI limits, prefix delegation.

ipamd front-loads the slow EC2 API work so pod starts are fast. The CNI plugin handles the Linux side — veth pairs, network namespaces, the 169.254.1.1 ARP proxy, per-ENI routing rules. These are separate binaries with a clean boundary between them, and knowing where that boundary is cuts debugging time significantly.

The ENI limits are real and they bite teams that don’t plan for pod density independently from compute requirements. Prefix delegation solves the capacity problem but requires the right subnet setup and a clean migration path. The warm pool default burns more IPs than most teams realize until they’re diagnosing subnet exhaustion under pressure.


That’s the IP machine. Pods have addresses. The node knows how to route their traffic.

The next question is what happens to a packet after it leaves the pod’s network namespace.

In Post 2 we follow a single packet across the full journey — from the moment it leaves Pod A’s eth0, through the veth pair, through the host routing table, out the ENI, across the VPC routing table, into Node B’s ENI, through its host routing stack, and into Pod B’s network namespace. We run tcpdump on both nodes simultaneously so you can see the packet arrive without a single byte of encapsulation.

Then we look at what happens when the destination isn’t a pod IP but a Kubernetes Service — and why that path looks completely different. Services have no process listening on their ClusterIP. The IP is virtual, maintained by thousands of iptables rules that rewrite the destination before the packet ever reaches the network. At 10,000 services those rules become a performance problem. At 100,000 they become a crisis.

Post 2 covers all of it: cross-node packet routing, Services and kube-proxy’s iptables chains, and the ALB’s two fundamentally different modes for getting traffic into your cluster — one of which requires the VPC CNI and one of which doesn’t.


Next: Post 2 — The Packet Journey

Leave a Reply

Your email address will not be published. Required fields are marked *