Skip to content

Add RegisterTargetsInterleaved for fair target allocation across target groups#4604

Open
yykkibbb wants to merge 1 commit into
kubernetes-sigs:mainfrom
yykkibbb:feature/fair-target-allocation-4025
Open

Add RegisterTargetsInterleaved for fair target allocation across target groups#4604
yykkibbb wants to merge 1 commit into
kubernetes-sigs:mainfrom
yykkibbb:feature/fair-target-allocation-4025

Conversation

@yykkibbb
Copy link
Copy Markdown

@yykkibbb yykkibbb commented Feb 25, 2026

Summary

This PR adds a new RegisterTargetsInterleaved method to TargetsManager that registers targets to multiple target groups in an interleaved manner, ensuring fair distribution when AWS quota limits are reached.

Problem (Issue #4025):
When replacing nodes in a cluster with an NLB serving multiple ports, the controller registers all targets to TG1 first, then TG2, etc. If the quota is exceeded midway, later target groups receive zero targets, causing service outages for those ports.

Example with 250 nodes, 4 ports, quota=500:

  • Current behavior: TG1=250, TG2=250, TG3=0, TG4=0 (ports 3,4 have no traffic)
  • With interleaved: TG1=125, TG2=125, TG3=125, TG4=125 (all ports work)

Changes

  • Added TargetGroupTargets struct to represent a TGB and its targets
  • Added RegisterTargetsInterleaved method that:
    • Registers targets in chunks across all target groups before moving to the next chunk
    • Continues with other target groups even if one fails (fair distribution)
    • Falls back to standard registration for single target group
  • Added comprehensive tests including:
    • 4-port NLB scenario (exact issue scenario)
    • Partial failure handling
    • Different target counts per TG

Note

This PR adds the foundational API. The method is not yet called from the reconciler. Integration with resource_manager.go to actually use this method for multi-TGB scenarios would be the next step. I'd appreciate feedback on the preferred approach for that integration.

Test plan

  • Unit tests for RegisterTargetsInterleaved pass
  • All existing tests pass

Related to: #4025

@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 25, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Welcome @yykkibbb!

It looks like this is your first PR to kubernetes-sigs/aws-load-balancer-controller 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/aws-load-balancer-controller has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 25, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @yykkibbb. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 25, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yykkibbb
Once this PR has been reviewed and has the lgtm label, please assign oliviassss for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…et groups

This commit adds a new RegisterTargetsInterleaved method to TargetsManager
that registers targets to multiple target groups in an interleaved manner.

When AWS quota limits are reached, the current sequential registration
causes some target groups to be starved of targets while others are full.
For example, with 250 nodes, 4 ports, and a quota of 500:
- Current: TG1=250, TG2=250, TG3=0, TG4=0 (ports 3,4 have no traffic)
- With interleaved: TG1=125, TG2=125, TG3=125, TG4=125 (all ports work)

The new method registers targets in chunks across all target groups before
moving to the next chunk, ensuring fair distribution even when quotas are
exceeded.

Related to: kubernetes-sigs#4025
@yykkibbb yykkibbb force-pushed the feature/fair-target-allocation-4025 branch from 0c8d9ca to 6f4eaf2 Compare February 25, 2026 14:51
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Feb 25, 2026
@zac-nixon
Copy link
Copy Markdown
Collaborator

Thanks for opening this. The algorithm makes sense to me. I think there is some more discussion needed here, if we want to accept this.

1/ TargetGroupBindings only know of their TargetGroup.
2/ How do I handle addition of another target group? Do I un-register some targets? That is not clear to me.

@yykkibbb
Copy link
Copy Markdown
Author

yykkibbb commented Mar 9, 2026

Hi @zac-nixon, thank you for taking the time to review and for the insightful questions. They helped me reconsider some aspects of the design that I hadn't fully thought through.

After going through your feedback and revisiting the codebase (especially the existing maxTargetsPerTargetGroup mechanism in resource_manager.go and the IndexKeyServiceRefName index), I was thinking about a different approach that might fit the architecture better: sibling-aware self-limiting within the existing single-TGB reconciler.

The rough idea would be:

  • During reconcileWithInstanceTargetType, count sibling TGBs for the same service using the existing IndexKeyServiceRefName index
  • Compute a fair-share limit per TG based on a configured total quota and the sibling count
  • Each TGB independently limits its own registration — no cross-TGB coordination needed

This would build on the existing maxTargetsPerTargetGroup mechanism rather than introducing a new registration path, and it stays within the current architecture where each TGB reconciles independently.

For the "new TG added" scenario: when numSiblings increases, each TGB's fair share would automatically decrease on the next reconcile cycle. Existing targets wouldn't be un-registered, but new registrations would respect the reduced limit, and natural node turnover (e.g., ASG replacements) would gradually ebalance the distribution over time.

Of course, this is just an initial thought — I'd love to hear if this direction makes sense or if there are concerns I'm still missing. Happy to revise the PR or take a completely different approach if needed.

Thanks again for the guidance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants