[processor/k8sattributes] feat: add reverse mapping between podUID and podIdentifier to delete stale entries by wehzzz · Pull Request #48398 · open-telemetry/opentelemetry-collector-contrib

wehzzz · 2026-05-15T09:45:33Z

Description

We observed unbounded growth in podTableSize when running the k8sattributes processor in an OTel eBPF profiler-based collector configured to associate pods by container.id. This was observed on both v0.150.0 and v0.151.0.

The issue happens on clusters with container restarts. A pod can stay alive while one of its containers restarts and receives a new container.id. The cache then adds the new container.id association, but the previous container.id association is not queued for deletion because there is no pod delete event. Since deleteLoop only processes entries that have already been added to the delete queue, the stale association remains in the cache.

This PR fixes the missing cleanup path by keeping a reverse index from podUID to known PodIdentifiers. With that index:

On pod update, identifiers that disappeared from the current pod state are scheduled for deletion after the existing grace period.
If an identifier becomes current again before the grace period expires, the pending deletion is cancelled.
On pod delete, all identifiers known for that pod UID are cleaned up, including historical container.id associations that are no longer present in the final pod object.
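
The three behaviours above can be sketched with a small reverse index. This is a minimal illustration, not the code in this PR: `reverseIndex`, `update`, and `remove` are hypothetical names, and `PodIdentifier` is reduced to a plain string for brevity.

```go
package main

import "sort"

// PodIdentifier is a simplified stand-in for the processor's identifier type
// (assumption: the real type in internal/kube is richer than a string).
type PodIdentifier string

// reverseIndex maps a pod UID to every identifier registered for that pod,
// including identifiers that are no longer part of the current pod state.
type reverseIndex map[string]map[PodIdentifier]struct{}

// update records the current identifiers for uid and returns the identifiers
// that were known before but are missing from the current pod state.
// Stale identifiers stay in the index so that a later pod delete can still
// find them; the caller decides when to schedule or cancel their deletion.
func (r reverseIndex) update(uid string, current []PodIdentifier) []PodIdentifier {
	known, ok := r[uid]
	if !ok {
		known = make(map[PodIdentifier]struct{})
		r[uid] = known
	}
	cur := make(map[PodIdentifier]struct{}, len(current))
	for _, id := range current {
		cur[id] = struct{}{}
		known[id] = struct{}{}
	}
	var stale []PodIdentifier
	for id := range known {
		if _, ok := cur[id]; !ok {
			stale = append(stale, id)
		}
	}
	// Sort for deterministic output; map iteration order is random in Go.
	sort.Slice(stale, func(i, j int) bool { return stale[i] < stale[j] })
	return stale
}

// remove drops uid from the index and returns every identifier known for it,
// including historical ones that the final pod object no longer carries.
func (r reverseIndex) remove(uid string) []PodIdentifier {
	known := r[uid]
	delete(r, uid)
	ids := make([]PodIdentifier, 0, len(known))
	for id := range known {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}
```

In this sketch, a container restart shows up as an `update` call whose result contains the previous `container.id`, and a pod delete calls `remove` to obtain both the current and the historical identifiers.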

Before fix

| Signal | 01/05 | 04/05 |
| --- | --- | --- |
| Pods on node | 50 | 49 |
| Containers on node | 161 | 160 |
| `otelcol_otelsvc_k8s_pod_added` | 147 | 441 |
| `otelcol_otelsvc_k8s_pod_deleted` | N/A | 294 |
| `otelcol_otelsvc_k8s_pod_updated` | 13,053 | 182,379 |
| `otelcol_otelsvc_k8s_pod_table_size` | 1,050 | 7,051 |

After fix

| Signal | ~30 min uptime | ~79 min uptime | ~6h25 uptime |
| --- | --- | --- | --- |
| Pods on node | 40 | 40 | 40 |
| Containers on node | 135 | 135 | 135 |
| Sum of `restartCount` across currently running containers | 376 | 416 | 653 |
| `otelcol_otelsvc_k8s_pod_added` | 120 | 144 | 165 |
| `otelcol_otelsvc_k8s_pod_deleted` | N/A | 24 | 45 |
| `otelcol_otelsvc_k8s_pod_updated` | 801 | 2,706 | 11,496 |
| `otelcol_otelsvc_k8s_pod_table_size` | 432 | 432 | 432 |

Testing

Added unit tests to cover various deletion scenarios.
Tested the patched version against real-world workloads.

github-actions · 2026-05-15T09:45:48Z

Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib.

Important reminders:

Read our Contributing Guidelines.
Sign the CLA if you haven't already.
First-time contributors should have at most one PR not marked as draft until their first PR is merged.
If your change isn't one of our priority components, reviews may take more time.
Give reviewers at least a few days before pinging them for feedback.
If you need help or struggle to move your PR forward, raise the topic on #otel-collector-dev or a Collector SIG meeting.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8887d3b3bb

chatgpt-codex-connector · 2026-05-15T09:52:27Z

+	podTableSize := len(c.Pods)
+	c.m.Unlock()
+
+	c.appendDeleteRequests(staleIdentifiers, true)


Cancel stale pod identifier deletes when restored

When an update temporarily drops an identifier, this queues a delayed delete, but a later update that restores the same identifier for the same pod UID never removes that pending request. deleteLoopProcessing only checks that the cached pod still has the same UID before deleting, so after the grace period it will remove an active association (for example a container.id or pod IP that disappeared in one informer update and reappeared before cleanup), breaking lookups until another update repopulates it.

wehzzz · 2026-05-18

This should be addressed by the latest commits. I’m happy to adjust the implementation if you see a better approach.

I chose to track pending stale deletes in a map and store pointers to the queued delete requests, so a restored identifier can cancel its pending deletion with an average-case O(1) map lookup, instead of scanning the delete queue on every pod update.

The trade-off is a small amount of additional memory for stale identifiers while their delete requests are pending. That is bounded by the number of stale identifiers observed during the delete grace period. Once deleteLoopProcessing reaches those entries, cancelled requests are skipped and removed from the pending map.

ChrsMark · 2026-05-18T09:55:05Z

Thanks @wehzzz for the PR. However, I'm not sure what this PR is trying to fix and why

opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube/client.go

Line 275 in a6a6d73

go c.deleteLoop(time.Second*30, defaultPodDeleteGracePeriod)

is not enough. Could you provide more details here?

Also, which version are you hitting this issue on? BTW, #47357 recently tried to fix something similar, so could you make sure you are running a version that includes that fix?

wehzzz · 2026-05-18T10:16:45Z

Thanks for the question, @ChrsMark. The issue this PR is trying to fix is slightly different from #47357.

deleteLoop only processes entries that have already been added to the delete queue. In the case I am addressing, from my understanding, a pod can stay alive while one of its containers restarts and gets a new container.id. The k8sattributes cache then adds the new container.id association, but the previous container.id association is not queued for deletion because there is no pod delete event. Later, the pod delete object only contains the current container status, so historical container IDs can also be missed.

So the problem is not that the 30s loop/grace period is too slow; it is the unbounded growth explained in the PR description. The missing part is detecting identifiers that disappeared from the pod state and explicitly enqueueing them for deletion.
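
The failure mode can be reproduced in a toy model. The sketch below is not the processor's code: `naiveCache`, `pod`, and `leakedEntries` are hypothetical names, each pod has a single container, and deletion is immediate instead of going through a delete queue and grace period.

```go
package main

import "fmt"

// pod is a simplified stand-in for a Kubernetes pod with one container.
type pod struct {
	uid         string
	containerID string
}

// naiveCache associates container IDs with pod UIDs and has no reverse
// index from pod UID back to the identifiers it has registered.
type naiveCache struct {
	byContainerID map[string]string
}

// onUpdate registers the container ID found in the current pod state.
// Nothing removes the association for a container ID that was replaced.
func (c *naiveCache) onUpdate(p pod) {
	c.byContainerID[p.containerID] = p.uid
}

// onDelete removes only the identifier present in the final pod object.
func (c *naiveCache) onDelete(p pod) {
	delete(c.byContainerID, p.containerID)
}

// leakedEntries simulates a pod whose container restarts the given number
// of times before the pod is deleted, and returns how many associations
// remain in the cache afterwards.
func leakedEntries(restarts int) int {
	c := &naiveCache{byContainerID: make(map[string]string)}
	p := pod{uid: "uid-1", containerID: "cid-0"}
	c.onUpdate(p)
	for i := 1; i <= restarts; i++ {
		p.containerID = fmt.Sprintf("cid-%d", i) // restart assigns a new container.id
		c.onUpdate(p)
	}
	c.onDelete(p)
	return len(c.byContainerID)
}
```

In this model every restart leaves one association behind, so the number of leaked entries equals the number of restarts, even after the pod itself has been deleted.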

This PR adds a reverse index from podUID to known PodIdentifiers so that:

on pod update, identifiers that disappeared from the current pod state are scheduled for deletion after the existing grace period.
if the identifier becomes current again before the grace period expires, the pending deletion is cancelled.
on pod delete, all identifiers known for that pod UID are cleaned up, including historical container IDs that are no longer present in the final pod object.

Regarding #47357: I believe that fix addresses map bucket retention after entries have been deleted from the cache. This PR addresses a case where stale container.id entries were never queued for deletion in the first place, so #47357 does not fully cover it.

We observed this behavior on both 0.150.0 and 0.151.0.

ChrsMark

Thanks for the additional details. Could you update the PR description so that it clearly reflects the intention of this change?

I have left a question to clarify. Other code-owners of the component should also verify this.

I also suggest adding a changelog entry for this change as a perf enhancement, even if it's not strictly speaking "user facing".

ChrsMark · 2026-05-18T13:53:55Z

-
-	identifiers := c.getIdentifiersFromAssoc(newPod)
+	if newPod.PodUID != "" {
+		staleIdentifiers = c.getStalePodIdentifiersForDeletionLocked(newPod.PodUID, identifiers)


Could we just remove the stale identifiers directly instead of maintaining a second delete queue for this? From my perspective, only the active identifiers should be maintained for any given Pod at any given time. Not sure if I'm missing anything here. @dmitryax please chime in.

feat: add reverse mapping between podUID and podIdentifier to delete …

8887d3b

…stale entries

github-actions Bot added the first-time contributor PRs made by new contributors label May 15, 2026

wehzzz changed the title ~~feat: add reverse mapping between podUID and podIdentifier to delete stale entries~~ [processor/k8sattributes] feat: add reverse mapping between podUID and podIdentifier to delete stale entries May 15, 2026

wehzzz marked this pull request as ready for review May 15, 2026 09:50

wehzzz requested review from a team, ChrsMark, TylerHelmuth, dmitryax and fatsheep9146 as code owners May 15, 2026 09:50

github-actions Bot assigned ArthurSens May 15, 2026

github-actions Bot added the processor/k8sattributes k8s Attributes processor label May 15, 2026

github-actions Bot requested a review from odubajDT May 15, 2026 09:50

chatgpt-codex-connector Bot reviewed May 15, 2026

wehzzz added 2 commits May 18, 2026 09:26

fix: apply codex review

f961c3c

refactor: regroup both deletion path under buildDeletionRequestsForPo…

b85821d

…dLocked

ChrsMark reviewed May 18, 2026

wehzzz wants to merge 3 commits into open-telemetry:main from wehzzz:fix-k8sattributes-stale-container-ids


3 participants
